XMeCap: Meme Caption Generation with Sub-Image Adaptability
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Humor, deeply rooted in societal meanings and cultural details, poses a
unique challenge for machines. While advances have been made in natural
language processing, real-world humor often thrives in a multi-modal context,
encapsulated distinctively by memes. This paper places particular emphasis on
the impact of multiple images on meme captioning. We then introduce the
XMeCap framework, a novel approach that adopts supervised fine-tuning
and reinforcement learning based on an innovative reward model, which factors
in both global and local similarities between visuals and text. Our results,
benchmarked against contemporary models, show a marked improvement in
caption generation for both single-image and multi-image memes, as well as
across different meme categories. XMeCap achieves an average evaluation score
of 75.85 for single-image memes and 66.32 for multi-image memes, outperforming
the best baseline by 3.71% and 4.82%, respectively. This research not only
establishes a new frontier in meme-related studies but also underscores the
potential of machines in understanding and generating humor in a multi-modal
setting. |
---|---|
DOI: | 10.48550/arxiv.2407.17152 |