TY - JOUR
T1 - TeFNA: Text-centered Fusion Network with crossmodal attention for multimodal sentiment analysis
T2 - Knowledge-Based Systems
AU - Huang, Changqin
AU - Zhang, Junling
AU - Wu, Xuemei
AU - Wang, Yi
AU - Li, Ming
AU - Huang, Xiaodi
N1 - Funding Information:
This work was supported by the National Key Research and Development Program of China (No. 2022ZD0117104) and the National Natural Science Foundation of China (No. 62037001), in part by the Key Research and Development Program of Zhejiang Province (No. 2022C03106) and the Zhejiang Provincial Natural Science Foundation, China (No. LY22F020004), and in part by the Open Research Fund of College of Teacher Education, Zhejiang Normal University, China (No. jykf22004).
Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2023/6/7
Y1 - 2023/6/7
N2 - Multimodal sentiment analysis (MSA), which goes beyond the analysis of texts to include other modalities such as audio and visual data, has attracted significant attention. Effective fusion of sentiment information across multiple modalities is key to improving the performance of MSA. However, aligning multiple modalities during fusion poses challenges such as preserving modality-specific information. This paper proposes a Text-centered Fusion Network with crossmodal Attention (TeFNA), a multimodal fusion network that uses crossmodal attention to model unaligned multimodal timing information. In particular, TeFNA employs a Text-Centered Aligned fusion method (TCA) that takes the text modality as the primary modality to improve the representation of fusion features. In addition, TeFNA maximizes the mutual information between modality pairs to retain task-related emotional information, thereby ensuring that the key information of each modality is preserved from input to fusion. The results of comprehensive experiments on the multimodal datasets CMU-MOSI and CMU-MOSEI show that our proposed model outperforms existing methods on most metrics.
AB - Multimodal sentiment analysis (MSA), which goes beyond the analysis of texts to include other modalities such as audio and visual data, has attracted significant attention. Effective fusion of sentiment information across multiple modalities is key to improving the performance of MSA. However, aligning multiple modalities during fusion poses challenges such as preserving modality-specific information. This paper proposes a Text-centered Fusion Network with crossmodal Attention (TeFNA), a multimodal fusion network that uses crossmodal attention to model unaligned multimodal timing information. In particular, TeFNA employs a Text-Centered Aligned fusion method (TCA) that takes the text modality as the primary modality to improve the representation of fusion features. In addition, TeFNA maximizes the mutual information between modality pairs to retain task-related emotional information, thereby ensuring that the key information of each modality is preserved from input to fusion. The results of comprehensive experiments on the multimodal datasets CMU-MOSI and CMU-MOSEI show that our proposed model outperforms existing methods on most metrics.
KW - Text-centered
KW - Fusion network
KW - Crossmodal attention
KW - Multimodal sentiment analysis
UR - http://www.scopus.com/inward/record.url?scp=85151803876&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85151803876&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2023.110502
DO - 10.1016/j.knosys.2023.110502
M3 - Article
SN - 1872-7409
VL - 269
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 110502
ER -