Multilingual aligned corpora from movie subtitles

M Mangeot, E Giguet - 2005 - hal.science
This paper describes a methodology for building aligned multilingual corpora from movie subtitles found on the Web. The subtitles come in specific formats and encodings. In a first step, we convert them to our XML-based multilingual subtitle format. In a second step, we align the subtitle sentences using the times at which they are displayed on the screen. We implemented the tool Jimaku to perform these steps semi-automatically. The last step consists in aligning the sentences at the sub-sentence level and indexing the corpus for contextual lookup. For this step, we use the WIMS platform, the result of previous research on text collection management.
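
The time-based alignment step can be illustrated with a minimal Python sketch. It is not the Jimaku implementation: the abstract does not name the input formats, so the SubRip-style (SRT) layout assumed here, the function names (parse_srt, overlap, align), and the "largest time overlap" pairing rule are all hypothetical choices made for illustration.

```python
import re
from typing import List, Tuple

# A cue is (start in seconds, end in seconds, text). Hypothetical structure,
# not the XML format described in the paper.
Cue = Tuple[float, float, str]

TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")


def to_seconds(stamp: str) -> float:
    """Convert 'HH:MM:SS,mmm' (three-digit milliseconds assumed) to seconds."""
    h, m, s, ms = TIME.match(stamp.strip()).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(text: str) -> List[Cue]:
    """Parse SRT-style blocks: index line, timing line, then text lines."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or "-->" not in lines[1]:
            continue  # skip malformed blocks
        start, end = lines[1].split("-->")
        cues.append((to_seconds(start), to_seconds(end), " ".join(lines[2:])))
    return cues


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def align(src: List[Cue], tgt: List[Cue]) -> List[Tuple[str, str]]:
    """Pair each source cue with the target cue that overlaps it most in time."""
    pairs = []
    for cue in src:
        best = max(tgt, key=lambda t: overlap(cue, t), default=None)
        if best is not None and overlap(cue, best) > 0:
            pairs.append((cue[2], best[2]))
    return pairs
```

Calling align(parse_srt(english_text), parse_srt(french_text)) returns a list of (source, target) text pairs. A source cue with no overlapping target cue is dropped, which is one reason a semi-automatic workflow with manual correction remains useful when subtitle files are cut or timed differently.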