Text-based visual question answering (TextVQA) is an important task in visual text understanding: a model must understand the text produced by a text recognition module and give correct answers to specific questions. Recent TextVQA work has tried to combine text recognition with multi-modal learning. However, because the text recognition output is not effectively preprocessed, existing approaches lose much of the contextual information, which leads to unsatisfactory performance. In this work, we propose a Multi-Modal Learning framework with Text Merging (MML&TM for short) for TextVQA. We develop a text merging (TM) algorithm that merges the word-level text obtained from the text recognition module into line-level and paragraph-level texts, enhancing the semantic context that is crucial to visual text understanding. The TM module can be easily incorporated into the multi-modal learning framework to generate more comprehensive answers for TextVQA. We evaluate our method on the public ST-VQA dataset. Experimental results show that the TM algorithm obtains complete semantic information, which in turn helps MML&TM generate better answers for TextVQA.
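
To make the idea of merging word-level text into line-level and paragraph-level text concrete, the following is a minimal sketch, not the TM algorithm itself, whose details are not given here. It assumes each recognized word comes with an axis-aligned bounding box and uses a simple geometric heuristic. The names `Box`, `merge_words_into_lines`, `merge_lines_into_paragraphs`, and the thresholds `y_tol` and `gap_tol` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """A recognized text span with an axis-aligned bounding box (assumed format)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def y_center(self) -> float:
        return (self.y0 + self.y1) / 2


def _union(boxes: List[Box]) -> Box:
    """Join the texts in the given order and take the enclosing box."""
    return Box(
        " ".join(b.text for b in boxes),
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


def merge_words_into_lines(words: List[Box], y_tol: float = 0.5) -> List[Box]:
    """Group words whose vertical centers are within y_tol * height of a line's first word."""
    groups: List[List[Box]] = []
    for w in sorted(words, key=lambda b: (b.y_center, b.x0)):
        for g in groups:
            ref = g[0]
            if abs(w.y_center - ref.y_center) <= y_tol * max(w.height, ref.height):
                g.append(w)
                break
        else:
            groups.append([w])  # no existing line matched: start a new one
    lines = [_union(sorted(g, key=lambda b: b.x0)) for g in groups]
    return sorted(lines, key=lambda b: b.y0)


def merge_lines_into_paragraphs(lines: List[Box], gap_tol: float = 1.0) -> List[Box]:
    """Merge consecutive lines that are vertically close and overlap horizontally."""
    paragraphs: List[List[Box]] = []
    for line in sorted(lines, key=lambda b: b.y0):
        if paragraphs:
            prev = paragraphs[-1][-1]
            close = line.y0 - prev.y1 <= gap_tol * max(line.height, prev.height)
            overlap = min(line.x1, prev.x1) > max(line.x0, prev.x0)
            if close and overlap:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return [_union(p) for p in paragraphs]


# Made-up word boxes standing in for text recognition output.
words = [
    Box("ONLY", 55, 11, 100, 31),
    Box("EXIT", 10, 10, 50, 30),
    Box("ENTRY", 35, 36, 90, 56),
    Box("NO", 10, 35, 30, 55),
    Box("CAFE", 300, 200, 360, 220),
]
lines = merge_words_into_lines(words)
paragraphs = merge_lines_into_paragraphs(lines)
print([l.text for l in lines])       # ['EXIT ONLY', 'NO ENTRY', 'CAFE']
print([p.text for p in paragraphs])  # ['EXIT ONLY NO ENTRY', 'CAFE']
```

In this sketch the isolated word "CAFE" stays separate, while the two stacked lines are merged into one paragraph whose text carries more context than any single word. A heuristic this simple only compares each line with the one directly above it, so it would not handle multi-column or rotated text.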