Differently processed modality and appropriate model selection lead to richer representation of the multimodal input

Bibliographic Details
Published in: International Journal of Information Technology (Singapore. Online), 2024-10, Vol. 16 (7), p. 4505-4516
Authors: Panda, Saroj Kumar; Diwan, Tausif; Kakde, Omprakash G.
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Abstract: We aim to effectively solve and improve upon the Meta Meme Challenge, the binary classification task of hateful memes detection on a multimodal dataset launched by Meta. The problem is challenging in terms of processing the individual modalities and their impact on the final classification of hateful memes. We focus on feature-level fusion methodologies for hateful memes detection, in comparison with decision-level fusion, since feature-level fusion generates richer feature representations for further processing. Appropriate model selection in multimodal data processing plays an important role in the downstream tasks. Moreover, the inherent negativity associated with the visual modality may not be detected completely through visual processing models alone, necessitating visual data processed differently through other techniques. Specifically, we propose two feature-level fusion-based methodologies for this classification problem, employing VisualBERT for the effective representation of the textual and visual modalities. Additionally, we employ image captioning to generate textual captions from the visual modality of the multimodal input, which are further fused with the actual text associated with the input through Tensor Fusion Networks. Our proposed model considerably outperforms the state of the art on the accuracy and AUROC performance metrics.
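
As a rough illustration of the feature-level fusion idea described in the abstract, the following is a minimal sketch of a tensor-fusion-style combination of two text embeddings, the meme's own text and a caption generated from its image, in the spirit of Tensor Fusion Networks. It is written in PyTorch; the class name, dimensions, and classifier head are hypothetical and not taken from the paper, and the VisualBERT and image-captioning components are assumed to produce the embeddings upstream.

# Minimal, illustrative sketch of tensor-fusion-style feature-level fusion,
# assuming precomputed embeddings for (a) the meme's actual text and
# (b) a caption generated from the meme image. Dimensions, names, and the
# classifier head are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class TensorFusionClassifier(nn.Module):
    def __init__(self, text_dim: int = 128, caption_dim: int = 128, num_classes: int = 2):
        super().__init__()
        # Fused dimension: outer product of (text_dim + 1) x (caption_dim + 1);
        # the appended constant 1 preserves each unimodal embedding in the product.
        fused_dim = (text_dim + 1) * (caption_dim + 1)
        self.classifier = nn.Sequential(
            nn.Linear(fused_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_emb: torch.Tensor, caption_emb: torch.Tensor) -> torch.Tensor:
        ones = torch.ones(text_emb.size(0), 1, device=text_emb.device)
        t = torch.cat([text_emb, ones], dim=1)      # (B, text_dim + 1)
        c = torch.cat([caption_emb, ones], dim=1)   # (B, caption_dim + 1)
        # Batched outer product captures unimodal and bimodal interaction terms.
        fused = torch.bmm(t.unsqueeze(2), c.unsqueeze(1)).flatten(1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = TensorFusionClassifier()
    text_emb = torch.randn(4, 128)     # stand-in for meme-text embeddings
    caption_emb = torch.randn(4, 128)  # stand-in for generated-caption embeddings
    logits = model(text_emb, caption_emb)
    print(logits.shape)  # torch.Size([4, 2])

The appended constant 1 on each embedding makes the outer product retain the unimodal features alongside the bimodal interaction terms, which is what gives feature-level fusion the richer joint representation referred to in the abstract.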
ISSN: 2511-2104
2511-2112
DOI: 10.1007/s41870-024-02113-4