Searching for memory-lighter architectures for OCR-augmented image captioning
Published in: | Journal of intelligent & fuzzy systems 2022-01, Vol.42 (5), p.4399-4410 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | Current state-of-the-art image captioning systems that can read text in an image and integrate it into the generated description require substantial processing power and memory, which limits the sustainability and usability of the models (they need expensive and very specialized hardware). The present work introduces two alternative versions (L-M4C and L-CNMT) of top architectures on the TextCaps challenge, adapted to achieve near-state-of-the-art performance while using less memory than the original architectures. This is mainly achieved by using distilled or smaller pre-trained models in the text-embedding and OCR-embedding modules. On the one hand, a distilled version of BERT was used to reduce the size of the text-embedding module (the distilled model has 59% fewer parameters). On the other hand, the FastText pre-trained vectors used by the OCR context processor in both architectures were replaced with Global Vectors (GloVe), which can reduce the memory used by the OCR-embedding module by up to 94%. Two of the three models presented in this work surpassed the challenge baseline (M4C-Captioner) on the evaluation and test sets. Our best lighter architecture reached a CIDEr score of 88.24 on the test set, which is 7.25 points above the baseline model. |
---|---|
ISSN: | 1064-1246 1875-8967 |
DOI: | 10.3233/JIFS-219230 |
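
The memory argument in the summary can be illustrated with a rough back-of-the-envelope sketch. It assumes a word-embedding module is dominated by a dense float32 lookup table, so its size scales with vocabulary size times vector dimension. The vocabulary sizes and dimensions below are hypothetical values chosen only for illustration; they are not taken from the article, and the helper names `embedding_table_bytes` and `relative_reduction` are likewise made up for this example.

```python
def embedding_table_bytes(vocab_size: int, dim: int, bytes_per_value: int = 4) -> int:
    """Approximate size of a dense lookup table: one dim-sized vector per word."""
    return vocab_size * dim * bytes_per_value


def relative_reduction(original: float, reduced: float) -> float:
    """Fraction saved when going from original to reduced (0.59 means 59% fewer)."""
    return (original - reduced) / original


# Hypothetical table shapes (assumptions, not figures from the article).
large_table = embedding_table_bytes(vocab_size=2_000_000, dim=300)
small_table = embedding_table_bytes(vocab_size=400_000, dim=100)
saving = relative_reduction(large_table, small_table)

# Baseline CIDEr implied by the summary: best score minus the reported margin.
baseline_cider = round(88.24 - 7.25, 2)

print(f"large table: {large_table / 1e9:.2f} GB")
print(f"small table: {small_table / 1e9:.2f} GB")
print(f"relative saving: {saving:.1%}")
print(f"implied baseline CIDEr: {baseline_cider}")
```

The last line only restates the arithmetic in the summary: a score of 88.24 that is 7.25 points above the baseline implies a baseline test-set CIDEr of 80.99.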