TY - JOUR
T1 - SCoFT
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Liu, Zhixuan
AU - Schaldenbrand, Peter
AU - Okogwu, Beverley Claire
AU - Peng, Wenxuan
AU - Yun, Youngsik
AU - Hundt, Andrew
AU - Kim, Jihie
AU - Oh, Jean
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT, pronounced /sô ft/) method that leverages the model's known biases to self-improve. SCoFT is designed to prevent overfitting on small datasets, encode only high-level information from the data, and shift the generated distribution away from misrepresentations encoded in a pretrained model. A user study with 51 participants from 5 countries, grouped by self-selected national cultural affiliation, shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes than the Stable Diffusion baseline, and that our SCoFT technique improves these results further. Resources and code are at https://ariannaliu.github.io/SCoFT.
KW - Computer Vision for Social Good
KW - Image Synthesis
UR - http://www.scopus.com/inward/record.url?scp=85203188037&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.01029
DO - 10.1109/CVPR52733.2024.01029
M3 - Conference article
AN - SCOPUS:85203188037
SN - 1063-6919
SP - 10822
EP - 10832
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 16 June 2024 through 22 June 2024
ER -