This chapter presents an audio-visual speech recognition (AVSR) system for Human Computer Interaction (HCI) that focuses on three modules: (i) radial basis function neural network (RBF-NN) voice activity detection (VAD), (ii) watershed lips detection and H∞ lips tracking, and (iii) multi-stream audio-visual back-end processing. The importance of AVSR as the pipeline for HCI and the background studies of the respective modules are discussed first, followed by the design details of the overall proposed AVSR system. Compared to the conventional lips detection approach, which requires prior skin/non-skin detection and face localization, the proposed watershed lips detection, aided by H∞ lips tracking, provides a potentially time-saving direct lips detection technique that makes those preliminary steps unnecessary. On the audio side, the proposed RBF-NN VAD offers better noise compensation and more precise speech localization than the conventional zero-crossing rate and short-term signal energy, which yields higher recognition performance through the audio modality. Lastly, the developed AVSR system, which integrates the audio and visual information as well as the temporally synchronous audio-visual data stream, is shown to achieve a significant improvement over unimodal speech recognition and over the decision and feature integration approaches. © Springer-Verlag Berlin Heidelberg 2012.
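For orientation, the conventional VAD baseline named in the abstract (zero-crossing rate plus short-term signal energy) can be sketched as below. This is only an illustrative sketch of that baseline, not the chapter's RBF-NN VAD; the function names, frame sizes, thresholds, and toy signal are all assumptions chosen for the example.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def simple_vad(samples, frame_len=200, hop=100,
               energy_thresh=0.01, zcr_thresh=0.25):
    """Flag a frame as speech when its energy is high and its ZCR is low.

    Both thresholds are hypothetical values picked for the toy signal below.
    """
    return [short_term_energy(f) > energy_thresh
            and zero_crossing_rate(f) < zcr_thresh
            for f in frame_signal(samples, frame_len, hop)]

# Toy input (assumed): 400 samples of silence, then a 200 Hz tone sampled at 8 kHz.
rate = 8000
silence = [0.0] * 400
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(400)]
decisions = simple_vad(silence + tone)
print(decisions)
```

Fixed thresholds of this kind are sensitive to background noise level, which is the limitation the abstract says the RBF-NN VAD is meant to address.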
|Title of host publication||Advances in robotics and virtual reality|
|Editors||Tauseef Gulrez, Aboul Ella Hassanien|
|Place of Publication||Berlin, Heidelberg|
|Number of pages||31|
|Publication status||Published - 2012|
|Name||Intelligent Systems Reference Library|
Chin, S. W., Seng, K. P., & Ang, L-M. (2012). Audio-visual speech processing for human computer interaction. In T. Gulrez, & A. E. Hassanien (Eds.), Advances in robotics and virtual reality (1 ed., Vol. 26, pp. 135-165). (Intelligent Systems Reference Library; Vol. 26). Springer. https://doi.org/10.1007/978-3-642-23363-0_6