Motivated by the early goal of building humanoid robots that take care of people in daily life, robotics research has produced several systems over recent decades. One of the challenges facing humanoid robots is achieving audio-visual speech communication with people, a capability that falls under human-robot interaction (HRI). In this paper, we propose a novel multimodal speech recognition system that can be used independently or combined with any humanoid robot. The system is multimodal in that it integrates an audio speech module, a visual speech module, face and mouth detection, and user identification in a single framework that runs in real time. In this framework, we use the Self-Organizing Map (SOM) for feature extraction and both the k-Nearest Neighbor classifier and the Hidden Markov Model for feature recognition. Experiments are conducted on a novel Arabic database, developed by the author, that comprises 36 isolated words and 13 casual phrases collected from 50 Arabic subjects. The experimental results show how the acoustic cue and the visual cue enhance each other to yield an effective audio-visual speech recognition (AVSR) system. The proposed AVSR system is simple, promising, and comparable in effectiveness to other reported systems.
"Multimodal Arabic Speech Recognition for Human-Robot Interaction Applications,"
Applied Mathematics & Information Sciences: Vol. 09, Article 15.
Available at: https://dc.naturalspublishing.com/amis/vol09/iss6/15