Identifying the coding system and language of on-line documents on the internet

GI Kikui - COLING 1996 Volume 2: The 16th International …, 1996 - aclanthology.org
This paper proposes a new algorithm that simultaneously identifies the coding system and
language of a code string fetched from the Internet, especially World-Wide Web. The …

Language identification from text using n-gram based cumulative frequency addition

B Ahmed, SH Cha, C Tappert - Proceedings of Student/Faculty …, 2004 - researchgate.net
This paper describes the preliminary results of an efficient language classifier using an
ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler …

Language identification from small text samples

KN Murthy, GB Kumar - Journal of Quantitative Linguistics, 2006 - Taylor & Francis
There is an increasing need to deal with multi-lingual documents today. If we could segment
multi-lingual documents language-wise, it would be very useful both for exploration of …

Automatic identification of close languages - case study: Malay and Indonesian

B Ranaivo-Malançon - ECTI Transactions on Computer and …, 2006 - eprints.usm.my
Identifying the language of an unknown text is not a new problem but what is new is the task
of identifying close languages. Malay and Indonesian as many other languages are very …

Hypertextsorten: Definition, Struktur, Klassifikation

G Rehm - 2005 - jlupub.ub.uni-giessen.de
Search engines on the WWW index and search documents at high speed.
Despite the quantitatively impressive results, the quality of the …

Study of some distance measures for language and encoding identification

AK Singh - Proceedings of the Workshop on Linguistic Distances, 2006 - aclanthology.org
To determine how close two language models (e.g., n-grams models) are, we can use
several distance measures. If we can represent the models as distributions, then the …

Identification of languages and encodings in a multilingual document

AK Singh, J Gorla - Cahiers du Cental, 2007 - Citeseer
Text on the Web is available in numerous languages and encodings, often not according to
any standards. The number of multilingual documents on the Web is also increasing. The …

Language set identification in noisy synthetic multilingual documents

T Jauhiainen, K Lindén, H Jauhiainen - … 2015, Cairo, Egypt, April 14-20 …, 2015 - Springer
In this paper, we reconsider the problem of language identification of multilingual
documents. Automated language identification algorithms have been improving steadily …

Categorization according to language: A step toward combining linguistic knowledge and statistic learning

E Giguet - International Workshop of Parsing Technologies (IWPT' …, 1995 - hal.science
In this article, we address the problem of categorization according to language by presenting
a method based on natural properties of language which allow us to categorize any kind of …

Daniel at the FinSBD-2 task: Extracting Lists and Sentences from PDF Documents: a model-driven end-to-end approach to PDF document analysis

E Giguet, G Lejeune - Second Workshop on Financial Technology and …, 2021 - hal.science
In this paper, we present the method we have designed and implemented for identifying lists
and sentences in PDF documents while participating to FinSBD-2 Financial Document …