Multimodal Food Image Classification with Large Language Models

Jun Hwa Kim, Nam Ho Kim, Donghyeok Jo, Chee Sun Won

Research output: Contribution to journal › Article › peer-review


Abstract

In this study, we leverage recent advances in large language models (LLMs) for fine-grained food image classification. Specifically, semantic textual descriptions of each image are generated by an LLM, encoded, and combined with image features obtained from a transformer-based backbone within a multimodal learning framework. Our approach employs a cross-attention mechanism to fuse the visual and textual modalities, enhancing the model's ability to extract discriminative features beyond what visual features alone can provide.
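
As a rough illustration of this kind of fusion, the PyTorch sketch below shows image tokens attending to encoded text tokens via cross-attention. It is a minimal, hypothetical example under stated assumptions, not the paper's implementation: the module name CrossAttentionFusion, the feature dimensions, the residual fusion and mean pooling, and the 101-class output head (as in Food-101) are all illustrative choices not given in the abstract.

```python
# Hypothetical sketch of cross-attention fusion of visual and textual
# features for food classification. Architecture details here are
# assumptions for illustration, not the authors' published model.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_classes: int = 101):
        super().__init__()
        # Image tokens act as queries; text tokens act as keys/values,
        # so visual features are enriched with LLM-derived semantics.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N_img, dim), e.g. patch tokens from a vision transformer
        # txt_tokens: (B, N_txt, dim), encoded LLM descriptions of the image
        attended, _ = self.cross_attn(query=img_tokens, key=txt_tokens, value=txt_tokens)
        fused = self.norm(img_tokens + attended)  # residual fusion of the two modalities
        pooled = fused.mean(dim=1)                # simple mean pooling over tokens
        return self.classifier(pooled)            # food-class logits

# Toy usage with random stand-in features:
model = CrossAttentionFusion()
img = torch.randn(2, 197, 768)   # e.g., ViT-B/16 patch tokens
txt = torch.randn(2, 32, 768)    # e.g., projected text-encoder outputs
logits = model(img, txt)         # shape (2, 101)
```

Using the image tokens as queries (rather than the text) keeps the classifier grounded in the visual evidence while letting the textual descriptions steer which visual features are emphasized; this is one common design choice for such fusion, not necessarily the one taken in the paper.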

Original language: English
Article number: 4552
Journal: Electronics (Switzerland)
Volume: 13
Issue number: 22
DOIs
State: Published - Nov 2024

Keywords

  • deep learning
  • fine-grained visual classification
  • food image classification
  • large language model
  • multimodal image feature

