TY - GEN
T1 - CIC
T2 - 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
AU - Yun, Youngsik
AU - Kim, Jihie
N1 - Publisher Copyright:
© 2024 International Joint Conferences on Artificial Intelligence. All rights reserved.
PY - 2024
Y1 - 2024
N2 - Image Captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, which has improved greatly.However, current methods lack the generation of detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups.In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions and describes cultural elements extracted from cultural visual elements in images representing cultures.Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements from Visual Question Answering (VQA) using generated questions, and (3) generates culturally-aware captions using LLMs with the prompts.Our human evaluation conducted on 45 participants from 4 different cultural groups with a high understanding of the corresponding culture shows that our proposed framework generates more culturally descriptive captions when compared to the image captioning baseline based on VLPs.Resources can be found at https://shane3606.github.io/cic.
AB - Image Captioning generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, which has improved greatly.However, current methods lack the generation of detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups.In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions and describes cultural elements extracted from cultural visual elements in images representing cultures.Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements from Visual Question Answering (VQA) using generated questions, and (3) generates culturally-aware captions using LLMs with the prompts.Our human evaluation conducted on 45 participants from 4 different cultural groups with a high understanding of the corresponding culture shows that our proposed framework generates more culturally descriptive captions when compared to the image captioning baseline based on VLPs.Resources can be found at https://shane3606.github.io/cic.
UR - http://www.scopus.com/inward/record.url?scp=85204308434&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85204308434
T3 - IJCAI International Joint Conference on Artificial Intelligence
SP - 1625
EP - 1633
BT - Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
A2 - Larson, Kate
PB - International Joint Conferences on Artificial Intelligence
Y2 - 3 August 2024 through 9 August 2024
ER -