Automatic language identification in texts: A survey

T Jauhiainen, M Lui, M Zampieri, T Baldwin… - Journal of Artificial …, 2019 - jair.org
Language identification ("LI") is the problem of determining the natural language that a
document or part thereof is written in. Automatic LI has been extensively researched for over …

On achieving and evaluating language-independence in NLP

EM Bender - Linguistic Issues in Language Technology, 2011 - journals.colorado.edu
Linguistic Issues in Language Technology (LiLT). Submitted, October 2011. On Achieving
and Evaluating …

Language identification: The long and the short of the matter

T Baldwin, M Lui - … technologies: The 2010 annual conference of …, 2010 - aclanthology.org
Language identification is the task of identifying the language a given document is
written in. This paper describes a detailed examination of what models perform best under …

Labeling the languages of words in mixed-language documents using weakly supervised methods

B King, S Abney - Proceedings of the 2013 Conference of the …, 2013 - aclanthology.org
In this paper we consider the problem of labeling the languages of words in mixed-language
documents. This problem is approached in a weakly supervised fashion, as a sequence …

Cross-domain feature selection for language identification

M Lui, T Baldwin - … of 5th international joint conference on natural …, 2011 - aclanthology.org
We show that transductive (cross-domain) learning is an important consideration in building
a general-purpose language identification system, and develop a feature selection method …

Estimating code-switching on Twitter with a novel generalized word-level language detection technique

S Rijhwani, R Sequiera, M Choudhury… - Proceedings of the …, 2017 - aclanthology.org
Word-level language detection is necessary for analyzing code-switched text, where
multiple languages could be mixed within a sentence. Existing models are restricted to code …

Language identification for creating language-specific Twitter collections

S Bergsma, P McNamee, M Bagdouri… - Proceedings of the …, 2012 - aclanthology.org
Social media services such as Twitter offer an immense volume of real-world linguistic data.
We explore the use of Twitter to obtain authentic user-generated text in low-resource …

TweetLID: a benchmark for tweet language identification

A Zubiaga, IS Vicente, P Gamallo, JR Pichel… - Language Resources …, 2016 - Springer
Language identification, as the task of determining the language a given text is
written in, has progressed substantially in recent decades. However, three main issues …

Selecting and weighting n-grams to identify 1100 languages

RD Brown - Text, Speech, and Dialogue: 16th International …, 2013 - Springer
This paper presents a language identification algorithm using cosine similarity against a
filtered and weighted subset of the most frequent n-grams in training data with optional inter …

Experiments with Cross-lingual Systems for Synthesis of Code-Mixed Text.

S Sitaram, SK Rallabandi, S Rijhwani, AW Black - SSW, 2016 - researchgate.net
Most Text to Speech (TTS) systems today assume that the input is in a single
language written in its native script, which is the language that the TTS database is recorded …