Audio visual speech recognition using feed forward neural network architecture

R Shashidhar, S Patilkulkarni… - 2020 IEEE International Conference for Innovation in Technology …, 2020 - ieeexplore.ieee.org
Human lip-readers are increasingly presented as valuable in the gathering of scientific evidence. However, like all human beings, they are inconsistent in analyzing lip movement. Here, an intelligent system is designed to predict the output of lip reading automatically. The proposed audio-visual speech recognition (AVSR) system uses a local proprietary dataset to detect the English word spoken by the speaker in a video, using a feed-forward neural network (FFNN) and a long short-term memory (LSTM) network. The audio features selected are Mel-frequency cepstral coefficients (MFCC), mel, contrast, tonnetz, and chroma. For the visual model, the features are the differences in the locations of various points around the lips between the current frame and the previous frame. These features are extracted for each video in the dataset. A deep neural network with a feed-forward architecture is trained on the extracted audio features, and an LSTM recurrent neural network is trained on the extracted visual features. The audio-based and visual-based models reach accuracies of 91.42% and 80%, respectively. Finally, the audio and video models are integrated using a feed-forward neural network, so that the final model can make a more appropriate decision when predicting the spoken word. The accuracy of the integrated model is 92.38%.
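The visual feature described in the abstract, the change in lip-point locations between the previous and the current frame, can be illustrated with a minimal sketch. This is not the authors' code: the function name `lip_motion_features`, the (x, y) tuple representation, and the toy coordinates are assumptions made for illustration, and the abstract does not say which landmark detector, how many points, or what normalization the paper uses.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed (x, y) pixel location of one lip landmark


def lip_motion_features(frames: Sequence[Sequence[Point]]) -> List[List[float]]:
    """Return one feature vector per consecutive frame pair.

    Each vector holds (dx, dy) for every lip landmark: the location in the
    current frame minus the location in the previous frame.
    """
    if len(frames) < 2:
        return []  # no previous frame to compare against
    n_points = len(frames[0])
    features: List[List[float]] = []
    for prev, curr in zip(frames, frames[1:]):
        if len(prev) != n_points or len(curr) != n_points:
            raise ValueError("every frame must have the same number of landmarks")
        vec: List[float] = []
        for (px, py), (cx, cy) in zip(prev, curr):
            vec.extend([cx - px, cy - py])
        features.append(vec)
    return features


# Toy input: 3 frames with 2 hypothetical landmarks each.
frames = [
    [(10.0, 20.0), (30.0, 20.0)],
    [(10.0, 22.0), (30.0, 23.0)],
    [(11.0, 22.0), (29.0, 23.0)],
]
features = lip_motion_features(frames)
print(features)
```

A video with T frames yields T - 1 such vectors, which form the kind of sequence a recurrent model such as an LSTM consumes one time step at a time.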