Abstract
In this study, we leverage advancements in large language models (LLMs) for fine-grained food image classification. We achieve this by integrating textual descriptions generated from images by an LLM into a multimodal learning framework. Specifically, the semantic textual descriptions produced by the LLM are encoded and combined with image features obtained from a transformer-based architecture. Our approach employs a cross-attention mechanism to fuse the visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what can be achieved with visual features alone.
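The cross-attention fusion described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes image features act as queries and encoded LLM text features act as keys and values, with randomly initialized projection matrices standing in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(img_feats, txt_feats, d_k=64, seed=0):
    """Fuse visual and textual features via cross-attention.

    img_feats: (n_img_tokens, d) image features (queries)
    txt_feats: (n_txt_tokens, d) encoded LLM description features (keys/values)
    Returns: (n_img_tokens, d_k) fused features.
    Projection matrices are random here; in practice they are learned.
    """
    rng = np.random.default_rng(seed)
    d = img_feats.shape[1]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)

    Q = img_feats @ W_q          # queries from the visual modality
    K = txt_feats @ W_k          # keys from the textual modality
    V = txt_feats @ W_v          # values from the textual modality

    # Scaled dot-product attention: each image token attends to text tokens.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return attn @ V
```

The fused output can then be concatenated with, or added to, the original visual features before the classification head, which is a common design choice for this kind of multimodal fusion.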
| Original language | English |
|---|---|
| Article number | 4552 |
| Journal | Electronics (Switzerland) |
| Volume | 13 |
| Issue number | 22 |
| DOIs | |
| State | Published - Nov 2024 |
Keywords
- deep learning
- fine-grained visual classification
- food image classification
- large language model
- multimodal image feature