TY - JOUR
T1 - Speech emotion recognition using Convolutional Neural Network and long-short term memory
AU - Dangol, Ranjana
AU - Alsadoon, Abeer
AU - Prasad, P. W. C.
AU - Seher, Indra
AU - Alsadoon, Omar Hisham
PY - 2020/11
Y1 - 2020/11
N2 - Human-robot interaction involves human intentions and emotions. Since the emergence of positive psychology, psychological research has concentrated heavily on the factors involved in generating human emotion. Speech emotion recognition (SER) is a challenging task due to the complexity of emotions. Human emotion recognition is gaining importance because good emotional health can lead to good social and mental health. Although there are different approaches to SER, the most advanced models combine a Convolutional Neural Network (CNN) with a Long Short-term Memory (LSTM) network. However, these models suffer from a lack of parallelization across sequences and from long computation times. Meanwhile, attention mechanisms perform considerably better at learning significant feature representations for specific tasks. Based on this technique, we propose an emotion recognition system with a relation-aware self-attention mechanism that memorizes the discriminative features for SER, using spectrograms as input. A CNN with relation-aware self-attention is modelled to analyse 3D log-Mel spectrograms and extract high-level features. Different layers, such as 3D convolutional layers, 3D max-pooling layers, and LSTM networks, are used in the model. Here, the attention layer is used to attend to distinct parts of emotion and to assemble discriminative utterance-level representations for SER. Finally, the utterance-level representations are passed to a fully connected layer with 64 output units to obtain higher-level representations. The relation-aware attention-based 3D CNN and LSTM model achieved a better result, 80.80% on average, in speech emotion recognition. The proposed model focuses on enhancing the attention mechanism to gain the additional benefits of sequence-to-sequence parallelization while improving recognition accuracy.
AB - Human-robot interaction involves human intentions and emotions. Since the emergence of positive psychology, psychological research has concentrated heavily on the factors involved in generating human emotion. Speech emotion recognition (SER) is a challenging task due to the complexity of emotions. Human emotion recognition is gaining importance because good emotional health can lead to good social and mental health. Although there are different approaches to SER, the most advanced models combine a Convolutional Neural Network (CNN) with a Long Short-term Memory (LSTM) network. However, these models suffer from a lack of parallelization across sequences and from long computation times. Meanwhile, attention mechanisms perform considerably better at learning significant feature representations for specific tasks. Based on this technique, we propose an emotion recognition system with a relation-aware self-attention mechanism that memorizes the discriminative features for SER, using spectrograms as input. A CNN with relation-aware self-attention is modelled to analyse 3D log-Mel spectrograms and extract high-level features. Different layers, such as 3D convolutional layers, 3D max-pooling layers, and LSTM networks, are used in the model. Here, the attention layer is used to attend to distinct parts of emotion and to assemble discriminative utterance-level representations for SER. Finally, the utterance-level representations are passed to a fully connected layer with 64 output units to obtain higher-level representations. The relation-aware attention-based 3D CNN and LSTM model achieved a better result, 80.80% on average, in speech emotion recognition. The proposed model focuses on enhancing the attention mechanism to gain the additional benefits of sequence-to-sequence parallelization while improving recognition accuracy.
KW - Convolutional Neural Network (CNN)
KW - Deep Learning
KW - Hierarchical Spectral Clustering (HSC)
KW - Long Short-term Memory (LSTM)
KW - Relation-aware self-attention mechanism
KW - Speech Emotion Recognition (SER)
UR - http://www.scopus.com/inward/record.url?scp=85089995170&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089995170&partnerID=8YFLogxK
U2 - 10.1007/s11042-020-09693-w
DO - 10.1007/s11042-020-09693-w
M3 - Article
AN - SCOPUS:85089995170
SN - 1380-7501
VL - 79
SP - 32917
EP - 32934
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
ER -