TY - GEN
T1 - Advanced Facial Analysis in Multi-Modal Data with Cascaded Cross-Attention based Transformer
AU - Kim, Jun Hwa
AU - Kim, Namho
AU - Hong, Minsoo
AU - Won, Chee Sun
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Facial expressions are among the most crucial elements for understanding humans at a deep psychological level. The analysis of human behavior can be informed by facial expressions, making it essential to employ indicators such as expression (EXPR), valence-arousal (VA), and action units (AU). In this paper, we introduce the method we proposed for the Challenge of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) at CVPR 2024. Our method utilizes the multi-modal Aff-Wild2 dataset, which is split into visual and audio modalities. For the visual data, we extract features using a SimMIM model pre-trained on a diverse set of facial expression data. For the audio data, we extract features using the Wav2Vec model. To fuse the extracted visual and audio features, we propose a cascaded cross-attention mechanism within a transformer. Our approach achieved average F1 scores of 0.4652 and 0.3005 on the AU and EXPR tracks, respectively, and an average Concordance Correlation Coefficient (CCC) of 0.5077 on the VA track, outperforming the baseline on all tracks of the ABAW6 competition. Our approach placed 5th, 6th, and 7th on the AU, EXPR, and VA tracks, respectively. The code used in the 6th ABAW competition is available at https://github.com/namho-96/ABAW2024.
AB - Facial expressions are among the most crucial elements for understanding humans at a deep psychological level. The analysis of human behavior can be informed by facial expressions, making it essential to employ indicators such as expression (EXPR), valence-arousal (VA), and action units (AU). In this paper, we introduce the method we proposed for the Challenge of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW) at CVPR 2024. Our method utilizes the multi-modal Aff-Wild2 dataset, which is split into visual and audio modalities. For the visual data, we extract features using a SimMIM model pre-trained on a diverse set of facial expression data. For the audio data, we extract features using the Wav2Vec model. To fuse the extracted visual and audio features, we propose a cascaded cross-attention mechanism within a transformer. Our approach achieved average F1 scores of 0.4652 and 0.3005 on the AU and EXPR tracks, respectively, and an average Concordance Correlation Coefficient (CCC) of 0.5077 on the VA track, outperforming the baseline on all tracks of the ABAW6 competition. Our approach placed 5th, 6th, and 7th on the AU, EXPR, and VA tracks, respectively. The code used in the 6th ABAW competition is available at https://github.com/namho-96/ABAW2024.
KW - ABAW
KW - Cross-attention
KW - Facial Analysis
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85206483361&partnerID=8YFLogxK
U2 - 10.1109/CVPRW63382.2024.00784
DO - 10.1109/CVPRW63382.2024.00784
M3 - Conference contribution
AN - SCOPUS:85206483361
T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
SP - 7870
EP - 7877
BT - Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
PB - IEEE Computer Society
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024
Y2 - 16 June 2024 through 22 June 2024
ER -