We showcase a system for real-time video text recognition. The system follows the standard text-spotting workflow, which consists of text detection followed by word recognition, and we apply deep neural networks in both stages. In the text localization stage, a Maximally Stable Extremal Regions (MSER) detector with a high recall rate first captures rough text candidates; a Convolutional Neural Network (CNN) verifier then eliminates false alarms. For word recognition, we developed a skeleton-based method that segments each text region from its background, and a CNN-based word recognizer then reads the segmented text. Our current implementation recognizes scene text in real time on a standard laptop with a webcam. The word recognizer achieves results competitive with state-of-the-art methods while using only synthetic training data.
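
The two-stage workflow described above can be illustrated with a minimal control-flow sketch. It is not the implementation used in the system: the function `spot_text`, the `SpottedWord` type, the four callables, and the `min_score` threshold of 0.5 are all hypothetical names and values chosen for illustration. The callables stand in for the MSER candidate detector, the CNN verifier, the skeleton-based segmenter, and the CNN word recognizer.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

# Hypothetical box format: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class SpottedWord:
    box: Box
    text: str


def spot_text(
    frame: Any,
    propose: Callable[[Any], Iterable[Box]],   # stand-in for the MSER detector
    verify: Callable[[Any, Box], float],       # stand-in for the CNN verifier
    segment: Callable[[Any, Box], Any],        # stand-in for skeleton-based segmentation
    recognize: Callable[[Any], str],           # stand-in for the CNN word recognizer
    min_score: float = 0.5,                    # assumed threshold, not from the paper
) -> List[SpottedWord]:
    """Run detection then recognition on a single video frame."""
    words: List[SpottedWord] = []
    for box in propose(frame):
        # High-recall proposals contain many false alarms; drop low-scoring ones.
        if verify(frame, box) < min_score:
            continue
        # Separate the text from its background before recognition.
        text_region = segment(frame, box)
        words.append(SpottedWord(box=box, text=recognize(text_region)))
    return words
```

In a live setting, a loop of this kind would be called once per webcam frame, so the cost of the verifier and recognizer per candidate determines whether the pipeline keeps up with the frame rate.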