MMTF-DES: A fusion of multimodal transformer models for desire, emotion, and sentiment analysis of social media data

Abdul Aziz, Nihad Karim Chowdhury, Ashad Kabir, Abu Nowshed Chy, Md. Jawad Siddique

Research output: Contribution to journal › Article › peer-review

Abstract

Desires, emotions, and sentiments are pivotal in understanding and predicting human behavior, influencing decision-making, communication, and social interactions. Analyzing them in multimodal social media data (such as image-text pairs) yields profound insights into cultural diversity, psychological well-being, and consumer behavior. Prior studies have overlooked image-text pairwise feature representations, which are crucial for the task of human desire understanding. In this research, we propose a unified multimodal framework that operates on image-text pairs to identify human desire, sentiment, and emotion. The core of our method is an encoder module built from two state-of-the-art multimodal vision-language models (VLMs). To extract visual and contextualized embedding features from social media image-text pairs, we jointly fine-tune two pre-trained VLMs: the Vision-and-Language Transformer (ViLT) and the Vision-and-Augmented-Language Transformer (VAuLT). We then apply an early fusion strategy to these embedding features to obtain a combined, diverse feature representation. Moreover, we leverage a multi-sample dropout mechanism to improve the generalization ability of our method and to expedite training. We evaluate our approach on the multimodal MSED dataset for the human desire understanding task. Our experimental evaluation demonstrates that the method captures both visual and contextual information effectively, yielding superior performance to other state-of-the-art techniques: it outperforms existing approaches by 3% for sentiment analysis, 2.2% for emotion analysis, and approximately 1% for desire analysis.
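To make the encoder-fusion design concrete, the PyTorch sketch below illustrates the two ideas the abstract names: early fusion of the two encoders' pooled embeddings by concatenation, followed by a multi-sample dropout classification head. This is a minimal illustration, not the authors' implementation: the 768-dimensional hidden size, four dropout samples, dropout rate of 0.2, three-class output, and the random tensors standing in for the fine-tuned ViLT and VAuLT pooled embeddings are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiSampleDropoutHead(nn.Module):
    """Classification head with multi-sample dropout: the same fused
    feature is passed through several independent dropout masks and a
    shared linear classifier, and the resulting logits are averaged."""

    def __init__(self, in_dim: int, num_classes: int,
                 num_samples: int = 4, p: float = 0.2):
        super().__init__()
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(num_samples)])
        self.classifier = nn.Linear(in_dim, num_classes)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        logits = [self.classifier(drop(fused)) for drop in self.dropouts]
        return torch.stack(logits).mean(dim=0)  # average over dropout samples


class EarlyFusionClassifier(nn.Module):
    """Early fusion of the pooled embeddings from two image-text encoders
    (ViLT and VAuLT in the paper): concatenate, then classify."""

    def __init__(self, vilt_dim: int = 768, vault_dim: int = 768,
                 num_classes: int = 3):
        super().__init__()
        self.head = MultiSampleDropoutHead(vilt_dim + vault_dim, num_classes)

    def forward(self, vilt_pooled: torch.Tensor,
                vault_pooled: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([vilt_pooled, vault_pooled], dim=-1)  # early fusion
        return self.head(fused)


# Toy usage: random tensors stand in for the pooled outputs of the two
# jointly fine-tuned encoders, for a batch of 8 image-text pairs.
model = EarlyFusionClassifier(num_classes=3)  # e.g. sentiment: pos/neu/neg
vilt_pooled = torch.randn(8, 768)
vault_pooled = torch.randn(8, 768)
print(model(vilt_pooled, vault_pooled).shape)  # torch.Size([8, 3])
```

Averaging the logits from several dropout masks, rather than using a single dropout pass, is what gives multi-sample dropout its regularizing and training-acceleration effect; in this sketch the two encoder backbones are omitted, since only their pooled outputs enter the fusion step.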
Original language: English
Article number: 129376
Journal: Neurocomputing
Volume: 623
DOIs
Publication status: Published - 28 Mar 2025
