Abstract

Facial expressions in videos inherently mirror the dynamic nature of real-world facial events. Consequently, facial expression recognition (FER) should employ a dynamic graph-based representation to effectively capture the relational structure of facial expressions rather than relying on conventional grid or sequence methods. However, existing graph-based approaches have their limitations. Frame-level graph methods provide a coarse representation of the facial graph across time and space, while landmark-based graph methods need to introduce additional facial landmarks, resulting in a static graph structure. To address these challenges, we propose spatial-temporal relation-aware dynamic graph convolutional networks (ST-RDGCN). This fine-grained relation modeling approach enables the dynamic modeling of evolving facial expressions in videos through dynamic space-time graphs, eliminating the need for facial landmarks. ST-RDGCN encompasses three graph construction paradigms: dynamic independent space graph, dynamic joint space-time graph, and dynamic cross space-time graph. Furthermore, we propose a relation-aware space-time graph convolution (RSTG-Conv) operator to learn informative spatiotemporal correlations in dynamic space-time graphs. In extensive experimental evaluations, our ST-RDGCN demonstrates state-of-the-art performance on the five popular video-based FER datasets, achieving overall accuracy scores of 99.69%, 91.67%, 56.51%, 69.37%, and 49.03% on the CK+, Oulu-CASIA, AFEW, DFEW, and FERV39 k datasets, respectively. In particular, our ST-RDGCN outperforms the current best method by 3.6% in UAR on the most challenging FERV39 k dataset. Furthermore, our analysis reveals that the dynamic cross space-time graph scheme is the most effective among the three dynamic graph construction schemes.

Original languageEnglish
Pages (from-to)1-17
Number of pages17
JournalIEEE Transactions on Affective Computing
DOIs
Publication statusE-pub ahead of print - 2025

Fingerprint

Dive into the research topics of 'Modeling Fine-Grained Relations in Dynamic Space-Time Graphs for Video-Based Facial Expression Recognition'. Together they form a unique fingerprint.

Cite this