Variable bit quantisation for LSH

S Moran, V Lavrenko, M Osborne - … of the 51st Annual Meeting of …, 2013 - aclanthology.org
We introduce a scheme for optimally allocating a variable number of bits per LSH
hyperplane. Previous approaches assign a constant number of bits per hyperplane. This …

Two-stage hashing for fast document retrieval

H Li, W Liu, H Ji - Proceedings of the 52nd Annual Meeting of the …, 2014 - aclanthology.org
This work fulfills sublinear time Nearest Neighbor Search (NNS) in massive-scale document
collections. The primary contribution is to propose a two-stage unsupervised hashing …

An empirical study on crosslingual transfer in probabilistic topic models

S Hao, MJ Paul - Computational Linguistics, 2020 - direct.mit.edu
Probabilistic topic modeling is a common first step in crosslingual tasks to enable knowledge
transfer and extract multilingual features. Although many multilingual topic models have …

Large-scale semantic exploration of scientific literature using topic-based hashing algorithms

C Badenes-Olmedo, JL Redondo-Garcia… - Semantic …, 2020 - content.iospress.com
Searching for similar documents and exploring major themes covered across groups of
documents are common activities when browsing collections of scientific papers. This …

Efficient nearest-neighbor search in the probability simplex

K Krstovski, DA Smith, HM Wallach… - Proceedings of the 2013 …, 2013 - dl.acm.org
Document similarity tasks arise in many areas of information retrieval and natural language
processing. A fundamental question when comparing documents is which representation to …

Online polylingual topic models for fast document translation detection

K Krstovski, DA Smith - Proceedings of the Eighth Workshop on …, 2013 - aclanthology.org
Many tasks in NLP and IR require efficient document similarity computations. Beyond their
common application to exploratory data analysis, latent variable topic models have been …

Bootstrapping translation detection and sentence extraction from comparable corpora

K Krstovski, DA Smith - Proceedings of the 2016 Conference of …, 2016 - aclanthology.org
Most work on extracting parallel text from comparable corpora depends on linguistic
resources such as seed parallel documents or translation dictionaries. This paper presents a …

Using term position similarity and language modeling for bilingual document alignment

TC Le, HT Vu, J Oberlander, O Bojar - Proceedings of the First …, 2016 - aclanthology.org
The WMT Bilingual Document Alignment Task requires systems to assign source
pages to their “translations”, in a big space of possible pairs. We present four methods: The …

Mining relational structure from millions of books: position paper

DA Smith, R Manmatha, J Allan - Proceedings of the 4th ACM workshop …, 2011 - dl.acm.org
Existing large-scale scanned book collections have many shortcomings for data-driven
research, from OCR of variable quality to the lack of accurate descriptive and structural …

Finding translations in scanned book collections

IZ Yalniz, R Manmatha - Proceedings of the 35th international ACM …, 2012 - dl.acm.org
This paper describes an approach for identifying translations of books in large scanned
book collections with OCR errors. The method is based on the idea that although individual …