Text-centered cross-sample fusion network for multimodal sentiment analysis

Qionghao Huang, Jili Chen, Changqin Huang, Xiaodi Huang, Yi Wang

Research output: Contribution to journal › Article › peer-review

Abstract

Cross-modal attention mechanisms (CMA) have driven significant advances in multimodal sentiment analysis. However, because of the inherent limitations of CMA, the modality-specific information needed to distinguish similar samples is often overlooked. To address this issue, we propose a Text-centered Cross-sample Fusion Network (TeCaFN), which employs cross-sample fusion to capture modality-specific information during modal fusion. Specifically, we develop a cross-sample fusion method that merges modalities drawn from distinct samples and preserves fine-grained modality-specific information through adversarial training combined with a pairwise prediction task. Furthermore, a two-stage text-centered contrastive learning mechanism is developed to enhance the stability of cross-sample fusion learning. TeCaFN achieves state-of-the-art results on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets, and ablation studies demonstrate the contributions of contrastive learning and adversarial training to its performance. The code implementation of this paper is available at https://github.com/TheShy-Dream/MSA-TeCaFN.
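
To make the cross-sample fusion idea in the abstract concrete, the following is a minimal sketch of pairing text features with audio/visual features from a different sample and training a pairwise prediction head to tell matched from mismatched pairs. All module names, feature dimensions, and the simple concatenation fusion are illustrative assumptions, not the authors' implementation; the actual model is in the linked repository.

```python
# Minimal sketch of cross-sample fusion with a pairwise prediction task.
# Names, dimensions, and the concat fusion are assumptions for illustration,
# not the TeCaFN implementation from the paper.
import torch
import torch.nn as nn


class CrossSampleFusion(nn.Module):
    """Fuse text features with audio/visual features and score whether
    the pair comes from the same sample."""

    def __init__(self, d_text=768, d_av=128, d_hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_hidden)
        self.av_proj = nn.Linear(d_av, d_hidden)
        # Pairwise prediction head: same-sample pair or not?
        self.pair_head = nn.Linear(2 * d_hidden, 1)

    def forward(self, text_feat, av_feat):
        t = self.text_proj(text_feat)       # (B, d_hidden)
        av = self.av_proj(av_feat)          # (B, d_hidden)
        fused = torch.cat([t, av], dim=-1)  # simple concatenation fusion
        return self.pair_head(fused).squeeze(-1)


def pairwise_prediction_loss(model, text_feat, av_feat):
    """Positive pairs keep the original alignment; negatives shuffle the
    audio/visual features across the batch (cross-sample pairing)."""
    bce = nn.BCEWithLogitsLoss()
    pos_logits = model(text_feat, av_feat)
    perm = torch.randperm(av_feat.size(0))
    neg_logits = model(text_feat, av_feat[perm])
    return bce(pos_logits, torch.ones_like(pos_logits)) + \
           bce(neg_logits, torch.zeros_like(neg_logits))


if __name__ == "__main__":
    # Random tensors stand in for pooled unimodal encoder outputs.
    model = CrossSampleFusion()
    text = torch.randn(8, 768)   # e.g., pooled text features
    av = torch.randn(8, 128)     # e.g., pooled audio/visual features
    loss = pairwise_prediction_loss(model, text, av)
    loss.backward()
    print(f"pairwise prediction loss: {loss.item():.4f}")
```

In this sketch, shuffling the audio/visual features within the batch is what produces cross-sample pairs; the pairwise objective then pressures the fused representation to retain modality-specific cues rather than collapsing onto the text modality.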
Original language: English
Article number: 228
Pages (from-to): 1-19
Number of pages: 19
Journal: Multimedia Systems
Volume: 30
Issue number: 4
DOIs
Publication status: Published - 30 Jul 2024
