Multimodal Food Image Classification with Large Language Models


Bibliographic Details
Published in: Electronics (Basel) 2024-11, Vol. 13 (22), p. 4552
Main authors: Kim, Jun-Hwa; Kim, Nam-Ho; Jo, Donghyeok; Won, Chee Sun
Format: Article
Language: English
Online access: Full text
Description
Summary: In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual features extracted from images using an LLM into a multimodal learning framework. Specifically, semantic textual descriptions generated by the LLM are encoded and combined with image features obtained from a transformer-based architecture to improve food image classification. Our approach employs a cross-attention mechanism to effectively fuse visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
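The abstract describes fusing vision-transformer image features with encoded LLM text descriptions through cross-attention. The paper's exact architecture, dimensions, and class count are not given in the record, so the following is only a minimal PyTorch sketch of that general idea: image patch features serve as queries attending over text-token embeddings, with the fused representation pooled for classification. All names, the feature dimension (256), token counts, and the 101-class output are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse visual and textual features via cross-attention.

    Image patch embeddings act as queries; encoded LLM description tokens act
    as keys/values, letting visual tokens attend to semantic textual cues.
    """

    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 101):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, N_img, dim) patch features from a vision transformer
        # txt_feats: (B, N_txt, dim) encoded tokens of the LLM's text description
        attended, _ = self.cross_attn(query=img_feats, key=txt_feats, value=txt_feats)
        fused = self.norm(img_feats + attended)  # residual connection + layer norm
        pooled = fused.mean(dim=1)               # mean-pool over visual tokens
        return self.classifier(pooled)           # class logits


# Random tensors stand in for real encoder outputs (shapes are assumptions).
model = CrossAttentionFusion(dim=256, num_heads=4, num_classes=101)
img = torch.randn(2, 196, 256)  # e.g. 14x14 ViT patch grid
txt = torch.randn(2, 32, 256)   # e.g. 32 description tokens
logits = model(img, txt)        # shape: (2, 101)
```

Using the image features as queries (rather than the text) keeps the output aligned with the visual token grid, so the textual modality refines rather than replaces the visual representation.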
ISSN: 2079-9292
DOI: 10.3390/electronics13224552