In recent years, human lip-readers have increasingly been presented as valuable in the assembly of scientific evidence. However, like all human observers, they suffer from variability when interpreting lip movements. Here, an intelligent system is designed to predict the spoken word from lip reading automatically. The proposed audio-visual speech recognition (AVSR) system uses a local proprietary dataset to detect the English word spoken by the speaker in a video, using a feed-forward neural network (FFNN) and a long short-term memory (LSTM) network. The selected audio features are Mel-frequency cepstral coefficients (MFCC), mel spectrogram, spectral contrast, tonnetz, and chroma. For the visual model, the displacement of landmark points around the speaker's lips between the current frame and the previous frame is used as the feature. These features are extracted from every video in the dataset. A deep neural network with a feed-forward architecture is trained on the extracted audio features, and an LSTM recurrent neural network is developed on the extracted visual features. The audio-only and visual-only models reach accuracies of 91.42% and 80%, respectively. Finally, the audio and visual models are integrated using a feed-forward neural network, which allows the combined model to make more reliable decisions when predicting the spoken word. The integrated model achieves an accuracy of 92.38%.
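
The five audio feature families named above (MFCC, mel spectrogram, spectral contrast, tonnetz, chroma) correspond to a widely used librosa extraction recipe; the sketch below shows one plausible version of it, assuming per-file averaging over time to obtain a fixed-length vector. The function name, MFCC count, and averaging choice are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the audio feature extraction, assuming the common
# librosa recipe for these five feature families. Frame-wise features are
# averaged over time (an assumption) to yield one fixed-length vector.
import numpy as np
import librosa

def extract_audio_features(wav_path):
    """Concatenate per-file means of MFCC, mel spectrogram,
    spectral contrast, tonnetz, and chroma features."""
    y, sr = librosa.load(wav_path)
    stft = np.abs(librosa.stft(y))
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1)
    tonnetz = np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1)
    return np.concatenate([mfcc, mel, contrast, tonnetz, chroma])
```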
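
The visual feature, the frame-to-frame displacement of points around the lips, could be computed as sketched below. The paper does not name its landmark detector, so dlib's 68-point predictor (where indices 48-67 cover the mouth) is assumed here purely for illustration.

```python
# Hedged sketch of the visual feature: displacements of lip landmark
# coordinates between consecutive frames. The dlib 68-point predictor
# is an assumption; the paper does not specify its landmark detector.
import numpy as np
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def lip_points(frame):
    """Return the 20 mouth landmarks (indices 48-67) as a (20, 2) array, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([(shape.part(i).x, shape.part(i).y)
                     for i in range(48, 68)], dtype=np.float32)

def lip_delta_sequence(video_path):
    """Return flattened lip-point displacements between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    prev, deltas = None, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        pts = lip_points(frame)
        if pts is not None and prev is not None:
            deltas.append((pts - prev).flatten())  # 40-dim feature per frame
        if pts is not None:
            prev = pts
    cap.release()
    return np.stack(deltas) if deltas else np.empty((0, 40))
```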
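
Finally, the three networks named in the abstract (a feed-forward audio classifier, an LSTM visual classifier, and a feed-forward fusion network over their outputs) might look roughly as follows in Keras. All layer sizes, the sequence length, and the vocabulary size are assumptions for illustration; the abstract gives none of these.

```python
# Illustrative Keras sketches of the three models; hyperparameters are
# assumptions, not values reported in the paper.
from tensorflow.keras import layers, models

NUM_WORDS = 10              # assumed vocabulary size
SEQ_LEN, DELTA_DIM = 25, 40 # assumed frames per clip and lip-delta size

audio_model = models.Sequential([
    layers.Input(shape=(193,)),                 # concatenated audio features
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_WORDS, activation="softmax"),
])

visual_model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, DELTA_DIM)),   # per-frame lip displacements
    layers.LSTM(128),
    layers.Dense(NUM_WORDS, activation="softmax"),
])

fusion_model = models.Sequential([
    layers.Input(shape=(2 * NUM_WORDS,)),       # stacked audio+visual softmax outputs
    layers.Dense(32, activation="relu"),
    layers.Dense(NUM_WORDS, activation="softmax"),
])
```

Feeding the fusion network the two class-probability vectors, rather than raw features, is one common late-fusion design consistent with the abstract's description of integrating the trained audio and visual models.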