TY - JOUR
T1 - MMTF-DES: A fusion of multimodal transformer models for desire, emotion, and sentiment analysis of social media data
AU - Aziz, Abdul
AU - Chowdhury, Nihad Karim
AU - Kabir, Ashad
AU - Chy, Abu Nowshed
AU - Siddique, Md. Jawad
PY - 2025/3/28
Y1 - 2025/3/28
N2 - Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing various aspects of decision-making, communication, and social interactions. Their analysis, particularly in the context of multimodal data (such as images and texts) from social media, provides profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies overlooked image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we propose a unified multimodal framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of our method lies in the encoder module, which is built on two state-of-the-art multimodal vision-language models (VLMs). To effectively extract visual and contextualized embedding features from social media image-text pairs, we jointly fine-tune two pre-trained multimodal VLMs: the Vision-and-Language Transformer (ViLT) and the Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we apply an early fusion strategy to these embedding features to obtain combined, diverse feature representations. Moreover, we leverage a multi-sample dropout mechanism to enhance generalization and expedite training. To evaluate our approach, we use the multimodal MSED dataset for the human desire understanding task. Our experimental evaluation demonstrates that our method excels at capturing both visual and contextual information, yielding superior performance compared to other state-of-the-art techniques. Specifically, our method outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.
AB - Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing various aspects of decision-making, communication, and social interactions. Their analysis, particularly in the context of multimodal data (such as images and texts) from social media, provides profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies overlooked image-text pairwise feature representation, which is crucial for the task of human desire understanding. In this research, we propose a unified multimodal framework with image-text pair settings to identify human desire, sentiment, and emotion. The core of our method lies in the encoder module, which is built on two state-of-the-art multimodal vision-language models (VLMs). To effectively extract visual and contextualized embedding features from social media image-text pairs, we jointly fine-tune two pre-trained multimodal VLMs: the Vision-and-Language Transformer (ViLT) and the Vision-and-Augmented-Language Transformer (VAuLT). Subsequently, we apply an early fusion strategy to these embedding features to obtain combined, diverse feature representations. Moreover, we leverage a multi-sample dropout mechanism to enhance generalization and expedite training. To evaluate our approach, we use the multimodal MSED dataset for the human desire understanding task. Our experimental evaluation demonstrates that our method excels at capturing both visual and contextual information, yielding superior performance compared to other state-of-the-art techniques. Specifically, our method outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.
UR - http://www.scopus.com/inward/record.url?scp=85215404068&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215404068&partnerID=8YFLogxK
U2 - 10.1016/j.neucom.2025.129376
DO - 10.1016/j.neucom.2025.129376
M3 - Article
SN - 0925-2312
VL - 623
JO - Neurocomputing
JF - Neurocomputing
M1 - 129376
ER -
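
Note: the abstract describes early fusion of pooled ViLT and VAuLT embeddings followed by a multi-sample dropout classification head. The PyTorch code below is a minimal illustrative sketch of that fusion head only, not the authors' released implementation; the hidden sizes (768), number of dropout samples (5), dropout rate (0.3), and class count (3) are assumptions, and the encoder outputs are stubbed with random tensors.

import torch
import torch.nn as nn

class FusionMultiSampleDropoutHead(nn.Module):
    """Early fusion of two VLM pooled embeddings with multi-sample dropout (illustrative sketch)."""
    def __init__(self, vilt_dim=768, vault_dim=768, num_classes=3,
                 num_dropout_samples=5, p=0.3):
        super().__init__()
        # Several independent dropout masks share a single linear classifier
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_dropout_samples)])
        self.classifier = nn.Linear(vilt_dim + vault_dim, num_classes)

    def forward(self, vilt_pooled, vault_pooled):
        # Early fusion: concatenate the pooled image-text embeddings from both encoders
        fused = torch.cat([vilt_pooled, vault_pooled], dim=-1)
        # Multi-sample dropout: average the logits obtained under each dropout mask
        logits = torch.stack([self.classifier(drop(fused)) for drop in self.dropouts])
        return logits.mean(dim=0)

# Usage example with random stand-ins for the encoders' pooled outputs (batch of 4 image-text pairs)
head = FusionMultiSampleDropoutHead()
print(head(torch.randn(4, 768), torch.randn(4, 768)).shape)  # torch.Size([4, 3])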