
Modeling fine-grained relations in dynamic space-time graphs for video-based facial expression recognition

  • Changqin Huang
  • Fan Jiang
  • Zhongmei Han
  • Xiaodi Huang
  • Shijin Wang
  • Yanlai Zhu
  • Yunliang Jiang
  • Bin Hu
  • Zhejiang Normal University
  • Zhejiang University
  • IFLYTEK Co., Ltd.
  • Hangzhou Hikvision Digital Technology Co.,Ltd.
  • Beijing Institute of Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Facial expressions in videos inherently mirror the dynamic nature of real-world facial events. Consequently, facial expression recognition (FER) should employ a dynamic graph-based representation to effectively capture the relational structure of facial expressions rather than relying on conventional grid or sequence methods. However, existing graph-based approaches have limitations. Frame-level graph methods provide only a coarse representation of the facial graph across time and space, while landmark-based graph methods need to introduce additional facial landmarks, resulting in a static graph structure. To address these challenges, we propose spatial-temporal relation-aware dynamic graph convolutional networks (ST-RDGCN). This fine-grained relation modeling approach enables the dynamic modeling of evolving facial expressions in videos through dynamic space-time graphs, eliminating the need for facial landmarks. ST-RDGCN encompasses three graph construction paradigms: dynamic independent space graph, dynamic joint space-time graph, and dynamic cross space-time graph. Furthermore, we propose a relation-aware space-time graph convolution (RSTG-Conv) operator to learn informative spatiotemporal correlations in dynamic space-time graphs. In extensive experimental evaluations, our ST-RDGCN demonstrates state-of-the-art performance on five popular video-based FER datasets, achieving overall accuracy scores of 99.69%, 91.67%, 56.51%, 69.37%, and 49.03% on the CK+, Oulu-CASIA, AFEW, DFEW, and FERV39k datasets, respectively. In particular, our ST-RDGCN outperforms the current best method by 3.6% in UAR on the most challenging FERV39k dataset. Furthermore, our analysis reveals that the dynamic cross space-time graph scheme is the most effective among the three dynamic graph construction schemes.
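
The abstract reports results in two metrics, overall accuracy and UAR (unweighted average recall), without defining them. The sketch below assumes the usual definitions: overall accuracy is the fraction of all samples classified correctly, and UAR is the mean of the per-class recalls, so every class counts equally regardless of its size. The function names and the toy labels are hypothetical and purely illustrative; none of this is taken from the paper or its evaluation code.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class contributes equally."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # samples per true class
    hits = defaultdict(int)    # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Made-up, imbalanced toy labels (not data from the paper).
y_true = ["happy"] * 8 + ["fear"] * 2
y_pred = ["happy"] * 8 + ["happy", "fear"]

acc = overall_accuracy(y_true, y_pred)
uar = unweighted_average_recall(y_true, y_pred)
print(f"overall accuracy = {acc:.2f}, UAR = {uar:.2f}")
```

On imbalanced data the two metrics can diverge considerably: in the toy example the majority class dominates overall accuracy, while the poorly recognized minority class pulls UAR down. This is one reason both figures are commonly reported for datasets with uneven expression classes.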
Original language: English
Pages (from-to): 1675-1692
Number of pages: 18
Journal: IEEE Transactions on Affective Computing
Volume: 16
Issue number: 3
Publication status: Published - Jul 2025
