MIRACL: A multilingual retrieval dataset covering 18 diverse languages

X Zhang, N Thakur, O Ogundepo… - Transactions of the …, 2023 - direct.mit.edu
MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively
encompass over three billion native speakers around the world. This resource is designed to …
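
As a quick orientation, the sketch below loads one MIRACL language split with the Hugging Face datasets library; the hub id "miracl/miracl", the "sw" config, and the field names are assumptions to verify against the dataset card, not details from the snippet above.

    from datasets import load_dataset

    # Hypothetical usage: the hub id, config name, and field names are
    # assumptions; check the dataset card for the real schema.
    swahili = load_dataset("miracl/miracl", "sw", split="dev")
    for example in swahili.select(range(3)):
        print(example["query"])                   # assumed field: topic text
        print(len(example["positive_passages"]))  # assumed field: judged passages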

WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia

SJ Semnani, VZ Yao, HC Zhang, MS Lam - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents the first few-shot LLM-based chatbot that almost never hallucinates and
has high conversationality and low latency. WikiChat is grounded on the English Wikipedia …
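
WikiChat's actual pipeline chains several LLM stages (retrieval, claim verification, refinement); the toy sketch below only illustrates the underlying filter-then-answer idea with a crude lexical support test, and every detail of it is invented for illustration.

    # Toy grounding filter: keep only draft sentences with lexical
    # support in retrieved passages. Invented for illustration; the
    # real system verifies claims with LLM stages, not word overlap.
    def supported(sentence: str, passages: list[str], threshold: float = 0.5) -> bool:
        words = {w.lower() for w in sentence.split() if len(w) > 3}
        if not words:
            return False
        best = max(sum(w in p.lower() for w in words) / len(words) for p in passages)
        return best >= threshold

    passages = ["Mount Everest is Earth's highest mountain above sea level."]
    draft = [
        "Mount Everest is the highest mountain above sea level.",
        "It was first climbed in 1852 by a Roman legion.",  # unsupported claim
    ]
    print([s for s in draft if supported(s, passages)])  # only the first survives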

Making a MIRACL: Multilingual information retrieval across a continuum of languages

X Zhang, N Thakur, O Ogundepo, E Kamalloo… - arXiv preprint arXiv …, 2022 - arxiv.org
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a
multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc …
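
A plausible starting point for the Cup's ad hoc task is a BM25 baseline; the sketch below uses Pyserini, with the prebuilt-index id "miracl-v1.0-sw" being an assumption to check against Pyserini's index catalog.

    from pyserini.search.lucene import LuceneSearcher

    # Assumed prebuilt-index id; Pyserini downloads prebuilt indexes on first use.
    searcher = LuceneSearcher.from_prebuilt_index("miracl-v1.0-sw")
    hits = searcher.search("historia ya Afrika Mashariki", k=10)  # Swahili query
    for rank, hit in enumerate(hits, start=1):
        print(rank, hit.docid, round(hit.score, 3))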

Cross-language information retrieval

P Galuščáková, DW Oard, S Nair - arXiv preprint arXiv:2111.05988, 2021 - arxiv.org
Two key assumptions shape the usual view of ranked retrieval: (1) that the searcher can
choose words for their query that might appear in the documents that they wish to see, and …
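
Assumption (1) breaks down in CLIR, where query and documents are in different languages; the classic remedy is query translation, illustrated below with a made-up two-entry dictionary.

    # Made-up bilingual dictionary; real systems use translation tables
    # learned from parallel text, often with translation probabilities.
    translation = {"history": ["histoire"], "house": ["maison"]}

    def translate_query(query: str) -> list[str]:
        terms = []
        for word in query.lower().split():
            terms.extend(translation.get(word, [word]))  # keep unknown terms as-is
        return terms

    print(translate_query("history house"))  # ['histoire', 'maison']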

Improving cross-lingual information retrieval on low-resource languages via optimal transport distillation

Z Huang, P Yu, J Allan - Proceedings of the Sixteenth ACM International …, 2023 - dl.acm.org
Benefiting from transformer-based pre-trained language models, neural ranking models
have made significant progress. More recently, the advent of multilingual pre-trained …
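
The snippet does not spell out the optimal-transport machinery, so here is a generic Sinkhorn sketch that computes a soft alignment (transport plan) between two sets of token embeddings; it illustrates the OT primitive only, not the paper's actual distillation objective.

    import numpy as np

    def sinkhorn(cost: np.ndarray, reg: float = 0.1, iters: int = 200) -> np.ndarray:
        """Entropy-regularized OT plan between uniform token masses."""
        n, m = cost.shape
        a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
        K = np.exp(-cost / reg)          # Gibbs kernel
        u = np.ones(n)
        for _ in range(iters):           # alternating marginal-scaling updates
            v = b / (K.T @ u)
            u = a / (K @ v)
        return u[:, None] * K * v[None, :]

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))    # e.g., 4 teacher-token embeddings
    student = rng.normal(size=(5, 8))    # e.g., 5 student-token embeddings
    cost = 1.0 - (teacher @ student.T) / (
        np.linalg.norm(teacher, axis=1, keepdims=True) * np.linalg.norm(student, axis=1)
    )
    plan = sinkhorn(cost)
    print(plan.shape, plan.sum())        # (4, 5), total mass ~1.0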

C3: Continued pretraining with contrastive weak supervision for cross language ad-hoc retrieval

E Yang, S Nair, R Chandradevan… - Proceedings of the 45th …, 2022 - dl.acm.org
Pretrained language models have improved effectiveness on numerous tasks, including ad-
hoc retrieval. Recent work has shown that continuing to pretrain a language model with …
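
A common form of such contrastive weak supervision is an in-batch InfoNCE objective over aligned cross-language passage pairs; the sketch below assumes already-pooled embeddings and is a generic illustration rather than the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(en: torch.Tensor, xx: torch.Tensor, temp: float = 0.05):
        """In-batch InfoNCE: row i of each matrix is a weakly aligned pair."""
        en = F.normalize(en, dim=-1)
        xx = F.normalize(xx, dim=-1)
        logits = en @ xx.T / temp                # batch x batch similarities
        labels = torch.arange(len(en))           # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

    print(float(contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))))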

HC4: A new suite of test collections for ad hoc CLIR

D Lawrie, J Mayfield, DW Oard, E Yang - European Conference on …, 2022 - Springer
HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval
(CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in …
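
If HC4 is available through the ir_datasets package (as many recent CLIR collections are), topics and judgments could be read as below; the dataset id "hc4/zh/test" and the query field names are assumptions to confirm against the ir_datasets catalog.

    import ir_datasets

    dataset = ir_datasets.load("hc4/zh/test")   # assumed id: Chinese docs, test topics
    query = next(dataset.queries_iter())
    print(query.query_id, query.title)          # "title" is an assumed field name
    qrel = next(dataset.qrels_iter())
    print(qrel.query_id, qrel.doc_id, qrel.relevance)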

Semantic matching based legal information retrieval system for COVID-19 pandemic

J Zhu, J Wu, X Luo, J Liu - Artificial intelligence and law, 2024 - Springer
The COVID-19 pandemic has recently been severe across the entire world. Preventing and
controlling crimes associated with COVID-19 is critical for controlling the pandemic …
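
At its core, semantic matching of this kind ranks candidates by embedding similarity rather than term overlap; the sketch below shows the generic bi-encoder pattern with sentence-transformers, using an assumed model name and made-up legal snippets, not the paper's actual system.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed model
    query = "penalties for selling counterfeit masks"
    docs = ["Article on producing and selling fake or substandard goods ...",
            "Article on crimes endangering public security ..."]
    scores = util.cos_sim(model.encode(query), model.encode(docs))
    print(scores)  # higher score = closer semantic match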

BLADE: combining vocabulary pruning and intermediate pretraining for scaleable neural CLIR

S Nair, E Yang, D Lawrie, J Mayfield… - Proceedings of the 46th …, 2023 - dl.acm.org
Learning sparse representations using pretrained language models enhances monolingual
ranking effectiveness. Such representations are sparse vectors in the …
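
Such vocabulary-space sparse vectors are typically produced SPLADE-style: project each token through the masked-LM head, apply a log-saturated ReLU, and max-pool over positions. The sketch below shows that generic recipe with a stand-in multilingual encoder; BLADE's vocabulary pruning and intermediate bilingual pretraining are not shown.

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    name = "bert-base-multilingual-cased"        # stand-in encoder, not BLADE's
    tok = AutoTokenizer.from_pretrained(name)
    mlm = AutoModelForMaskedLM.from_pretrained(name)

    batch = tok("global health policy", return_tensors="pt")
    with torch.no_grad():
        logits = mlm(**batch).logits             # [1, seq_len, vocab_size]
    weights = torch.log1p(torch.relu(logits)).max(dim=1).values  # [1, vocab_size]
    top = weights[0].topk(5)                     # strongest vocabulary dimensions
    print(tok.convert_ids_to_tokens(top.indices.tolist()))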

An experimental study on pretraining transformers from scratch for IR

C Lassance, H Déjean, S Clinchant - European Conference on …, 2023 - Springer
Finetuning Pretrained Language Models (PLMs) for IR has been the de facto standard
practice since their breakthrough effectiveness a few years ago. But is this approach …
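
For reference, one masked-language-modeling pretraining step from randomly initialized weights looks roughly like the sketch below; the paper's actual recipes and ablations are far more involved, and the small configuration here is arbitrary.

    from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                              DataCollatorForLanguageModeling)

    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")  # reuse only the tokenizer
    model = BertForMaskedLM(BertConfig(vocab_size=tok.vocab_size,
                                       num_hidden_layers=4))      # fresh, untrained weights
    collate = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)

    text = " ".join(["passages from the target corpus"] * 8)      # stand-in training text
    batch = collate([tok(text, truncation=True)])
    loss = model(**batch).loss    # MLM loss on randomly masked tokens
    loss.backward()               # an optimizer step would follow in a real loop
    print(float(loss))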