Automatic language identification in texts: A survey

T Jauhiainen, M Lui, M Zampieri, T Baldwin… - Journal of Artificial …, 2019 - jair.org
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …

Language identification: The long and the short of the matter

T Baldwin, M Lui - … technologies: The 2010 annual conference of …, 2010 - aclanthology.org
Language identification is the task of identifying the language a given document is
written in. This paper describes a detailed examination of what models perform best under …

Cross-domain feature selection for language identification

M Lui, T Baldwin - … of 5th international joint conference on natural …, 2011 - aclanthology.org
We show that transductive (cross-domain) learning is an important consideration in building
a general-purpose language identification system, and develop a feature selection method …

Automatic detection and language identification of multilingual documents

M Lui, JH Lau, T Baldwin - Transactions of the Association for …, 2014 - direct.mit.edu
Language identification is the task of automatically detecting the language(s)
present in a document based on the content of the document. In this work, we address the …

Accurate language identification of twitter messages

M Lui, T Baldwin - Proceedings of the 5th workshop on language …, 2014 - aclanthology.org
We present an evaluation of “off-the-shelf” language identification systems as applied to
microblog messages from Twitter. A key challenge is the lack of an adequate corpus of …

Reconsidering Language Identification for Written Language Resources.

B Hughes, T Baldwin, S Bird, J Nicholson… - …, 2006 - minerva-access.unimelb.edu.au
The task of identifying the language in which a given document (ranging from a sentence to
thousands of pages) is written has been relatively well studied over several decades …

Language identification from text using n-gram based cumulative frequency addition

B Ahmed, SH Cha, C Tappert - Proceedings of Student/Faculty …, 2004 - researchgate.net
This paper describes the preliminary results of an efficient language classifier using an
ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler …

Language identification from small text samples

KN Murthy, GB Kumar - Journal of Quantitative Linguistics, 2006 - Taylor & Francis
There is an increasing need to deal with multi-lingual documents today. If we could segment
multi-lingual documents language-wise, it would be very useful both for exploration of …

Factors that affect the accuracy of text-based language identification

GR Botha, E Barnard - Computer Speech & Language, 2012 - Elsevier
The classification accuracy of text-based language identification depends on several factors,
including the size of the text fragment to be identified, the amount of training data available …

Automatic identification of close languages - case study: Malay and Indonesian

B Ranaivo-Malançon - ECTI Transactions on Computer and …, 2006 - eprints.usm.my
Identifying the language of an unknown text is not a new problem but what is new is the task
of identifying close languages. Malay and Indonesian as many other languages are very …