Multimodal Emotion Recognition Using Data Augmentation and Fusion

Nusrat Jahan Shoumy

Research output: Thesis, Doctoral Thesis


Automatic human emotion recognition has received increasing attention from computer vision researchers, and several solutions have been proposed. Most early work focused on a single modality, typically facial expression or speech. More recent efforts have focused on multimodal fusion, because human emotion is expressed through three modalities: facial expressions, speech and physiological signals. These efforts have primarily sought effective multimodal fusion approaches, including audio-visual fusion, and individual modalities are often combined through simple fusion at the feature level, the decision level, or both. Despite some improvement in accuracy, most early approaches relied on handcrafted features and traditional fusion and classification techniques. Using well-established feature extraction techniques to extract effective features automatically from multimodal information, and using deep learning classifiers for fusion and classification, are new directions that researchers are actively pursuing, but several challenges remain in realizing a multimodal, end-to-end deep learning system.
Another challenge faced by researchers is the lack of emotion data available for classification. Data augmentation, which expands the available dataset, has become a popular and effective way to increase training data in classification problems, and it was therefore implemented in the current study.
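As a rough illustration of how augmentation expands a dataset, the sketch below applies three of the simpler techniques this work names (adding noise, controlling volume and flipping) using NumPy. It is a minimal sketch and not the thesis implementation: the function names, the signal-to-noise ratio and the gain values are assumptions chosen for illustration only.

```python
import numpy as np


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at a target signal-to-noise ratio (dB)."""
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


def scale_volume(signal, gain):
    """Change loudness by multiplying every sample by a constant gain."""
    return np.asarray(signal, dtype=float) * gain


def flip_horizontal(frame):
    """Mirror a video frame left to right (frame shape: H x W or H x W x C)."""
    return np.asarray(frame)[:, ::-1]


def augment_1d(signal, rng, snr_db=20.0, gains=(0.8, 1.2)):
    """Return the original 1-D signal plus its augmented variants.

    snr_db and gains are illustrative defaults, not values from the thesis.
    """
    variants = [np.asarray(signal, dtype=float)]
    variants.append(add_noise(signal, snr_db, rng))
    variants.extend(scale_volume(signal, g) for g in gains)
    return variants


rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))  # stand-in for a speech clip
variants = augment_1d(wave, rng)
print(len(variants))  # one original plus three augmented copies
```

Each augmented copy keeps the emotion label of the original sample, so one labelled recording yields several training examples.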
This work investigates the use of augmentation, deep learning classification techniques and various fusion models to address the problem of machine understanding of human affective behaviour and to improve the accuracy of both unimodal and multimodal emotion recognition. The aim is to explore how best to configure the emotion classification model, through augmentation, deep learning classification networks and fusion, so that it captures individually and jointly the key features contributing to human emotions from the three modalities (speech, face and physiological signals) and thus accurately classifies the expressed human emotion. This work studies well-established augmentation techniques: flipping, cropping and rotation (for video augmentation), pitch shifting and volume control (for speech augmentation), and adding noise (for speech and physiological signals). Several classifiers were also studied for this task, including convolutional neural network (CNN), naïve Bayes (NB), k-nearest neighbor (KNN), linear discriminant analysis (LDA) and decision trees (DT). The proposed fusion techniques (weight-based fusion) were also compared with early fusion and late fusion models to capture the latent correlation and complementary information between the modalities, and thereby improve overall emotion recognition accuracy.
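The difference between the fusion models compared here can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the abstract does not describe how the fusion weights are chosen, so deriving them from per-modality validation accuracy is a hypothetical choice, and all probabilities and accuracies below are made-up illustrative numbers.

```python
import numpy as np


def early_fusion(features):
    """Feature-level fusion: concatenate per-modality feature vectors
    so that a single classifier sees all modalities at once."""
    return np.concatenate(features, axis=-1)


def late_fusion(probs):
    """Decision-level fusion: average the class probabilities that the
    per-modality classifiers produced independently."""
    return np.mean(np.stack(probs), axis=0)


def weighted_fusion(probs, weights):
    """Weight-based decision fusion: a weighted average of the
    per-modality class probabilities. Weights are normalised to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    stacked = np.stack(probs)  # shape: (modalities, classes)
    return np.tensordot(w, stacked, axes=1)


# Hypothetical two-class outputs from three unimodal classifiers.
speech = np.array([0.6, 0.4])
face = np.array([0.2, 0.8])
physio = np.array([0.5, 0.5])

# Hypothetical validation accuracies used as fusion weights (an assumption).
weights = [0.5, 0.9, 0.6]

fused = weighted_fusion([speech, face, physio], weights)
print(fused)                  # weighted class probabilities
print(int(np.argmax(fused)))  # index of the predicted class
```

With these numbers the face classifier receives the largest weight, so the fused prediction leans further toward its preferred class than the plain average produced by `late_fusion` does.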
Simulation results and experiments conducted on the proposed systems demonstrated that the augmentation and classification approaches increased performance by up to 1% for audio data, 30% for video data and 8% for physiological data. The proposed multimodal system also achieved an increase in performance of up to 19% compared with unimodal data.
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Charles Sturt University
Supervisors
  • Zia, Tanveer, Principal Supervisor
  • Ang, Li Minn, Co-Supervisor, External person
  • Seng, Jasmine, Co-Supervisor
Place of Publication: Australia
Publication status: Published - 2022

