Audio-visual speech processing for human computer interaction

S.W. Chin, K.P. Seng, L.-M. Ang

Research output: Book chapter/Published conference paper › Chapter (peer-reviewed)



This chapter presents an audio-visual speech recognition (AVSR) system for human-computer interaction (HCI) built around three modules: (i) voice activity detection (VAD) based on a radial basis function neural network (RBF-NN), (ii) watershed lips detection with H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The importance of AVSR as the pipeline for HCI and the background of each module are discussed first, followed by the design details of the overall proposed AVSR system. Compared with the conventional lips detection approach, which requires prior skin/non-skin detection and face localization, the proposed watershed lips detection, aided by H∞ lips tracking, detects the lips directly, which potentially saves time and makes that preliminary step unnecessary. In addition, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy measures, which yields higher recognition performance through the audio modality. Lastly, the developed AVSR system, which integrates the audio and visual information as well as the temporal synchrony of the audio-visual data stream, is shown to improve significantly on unimodal speech recognition and on the decision and feature integration approaches. © Springer-Verlag Berlin Heidelberg 2012.
Original language: English
Title of host publication: Advances in robotics and virtual reality
Editors: Tauseef Gulrez, Aboul Ella Hassanien
Place of Publication: Berlin, Heidelberg
Number of pages: 31
ISBN (Electronic): 9783642233630
ISBN (Print): 9783642233623
Publication status: Published - 2012

Publication series

Name: Intelligent Systems Reference Library
ISSN (Print): 1868-4394


  • Cite this

    Chin, S. W., Seng, K. P., & Ang, L-M. (2012). Audio-visual speech processing for human computer interaction. In T. Gulrez, & A. E. Hassanien (Eds.), Advances in robotics and virtual reality (1 ed., Vol. 26, pp. 135-165). (Intelligent Systems Reference Library; Vol. 26). Springer.