TY - JOUR
T1 - SimFLE
T2 - Simple Facial Landmark Encoding for Self-Supervised Facial Expression Recognition in the Wild
AU - Moon, Jiyong
AU - Jang, Hyeryung
AU - Park, Seongsik
N1 - Publisher Copyright:
© 2010-2012 IEEE.
PY - 2024
Y1 - 2024
AB - Facial expression recognition in the wild (FER-W) entails classifying facial emotions in natural environments. The major challenges in FER-W stem from the complexity and ambiguity of facial images, making it difficult to curate a large-scale labeled dataset for training. Additionally, the subtle differences in emotions often reside in the fine-grained details of local facial landmarks, demanding innovative solutions to capture these crucial features efficiently. To address these issues, we employ two distinct self-supervised methods. First, we adopt a contrastive learning method to capture generalized global representations, enabling the model to understand the semantic context of facial expressions without relying on labeled data. Simultaneously, we leverage masked image modeling to focus on embedding fine-grained, local facial landmark information at the patch level. We introduce a novel module called FaceMAE, which aims to reconstruct the masked facial patches. The semantic masking scheme is designed to preserve highly activated features, allowing the encoding of crucial details of unmasked facial landmarks and their relationships within the broader facial context at the patch level. Finally, it guides the backbone network to calibrate the learned global features to be attentive to facial landmarks. Our proposed method, called Simple Facial Landmark Encoding (SimFLE), significantly outperforms a supervised baseline and other self-supervised methods in terms of facial landmark localization and overall performance, as demonstrated through extensive experiments across several FER-W benchmarks.
KW - Contrastive learning
KW - facial expression recognition
KW - masked image modeling
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85205673149&partnerID=8YFLogxK
U2 - 10.1109/TAFFC.2024.3470980
DO - 10.1109/TAFFC.2024.3470980
M3 - Article
AN - SCOPUS:85205673149
SN - 1949-3045
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -