Large-Scale Bidirectional Training for Zero-Shot Image Captioning
Main Authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online Access: | Order full text |
Summary: | When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale (BITTERS), an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark that comprises high-quality datasets and an extensive set of metrics to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning. |
DOI: | 10.48550/arxiv.2211.06774 |
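
To make the zero-shot setting described in the summary concrete: at inference time, a model pretrained bidirectionally on image-text pairs is asked to caption an image without any captioning-specific finetuning. The sketch below illustrates that setting only; BITTERS itself is not available through this record, so it substitutes a publicly released captioning model (`Salesforce/blip-image-captioning-base`) and a placeholder image URL, both assumptions for illustration.

```python
# Minimal sketch of zero-shot caption generation, assuming a publicly
# available image-to-text model as a stand-in for BITTERS (which this
# record does not provide). The image URL below is a placeholder.
from transformers import pipeline

# Load a vision-language model pretrained on large-scale image-text pairs.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# Caption an image the model was never finetuned on: this is the
# "zero-shot" inference setting the abstract refers to.
result = captioner("https://example.com/photo.jpg")
print(result[0]["generated_text"])
```
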