Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

H Song, R Dabre, C Chu, S Kurohashi… - ACM Transactions on …, 2023 - dl.acm.org
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation
(NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair …

Subword segmental machine translation: Unifying segmentation and target sentence generation

F Meyer, J Buys - arXiv preprint arXiv:2305.07005, 2023 - arxiv.org
Subword segmenters like BPE operate as a preprocessing step in neural machine
translation and other (conditional) language models. They are applied to datasets before …

Subword segmental language modelling for nguni languages

F Meyer, J Buys - arXiv preprint arXiv:2210.06525, 2022 - arxiv.org
Subwords have become the standard units of text in NLP, enabling efficient
open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is …
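Several of the entries above (Meyer & Buys 2022/2023, Song et al.) take byte-pair encoding as the baseline segmenter. As a point of reference, the merge-learning step of BPE can be sketched as follows; this is a minimal illustration over a toy word-frequency table (the `</w>` end-of-word marker and the example corpus are illustrative conventions, not taken from any of the cited papers):

```python
import re
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus.

    `corpus` maps space-separated symbol sequences (one entry per word
    type) to their frequencies, e.g. {"l o w </w>": 5}.
    Returns the list of learned merges, most frequent first.
    """
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Greedily merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Lookaround guards keep the match aligned to symbol boundaries.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        corpus = {pattern.sub("".join(best), word): freq
                  for word, freq in corpus.items()}
    return merges

toy_corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
              "n e w e s t </w>": 6, "w i d e s t </w>": 3}
print(learn_bpe_merges(toy_corpus, 2))
```

The frequency-based, semantics-free nature of this greedy loop is precisely what the neural segmenters listed here (SelfSeg, BERTSeg, subword segmental models) aim to improve on.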

Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer

HW Li, YJ Lin, YT Li, C Lin, HY Kao - Proceedings of the 2023 …, 2023 - aclanthology.org
Unsupervised Chinese word segmentation (UCWS) has made progress by incorporating
linguistic knowledge from pre-trained language models using parameter-free probing …

A Benchmark for Morphological Segmentation in Uyghur and Kazakh

G Abudouwaili, S Ruzmamat, K Abiderexiti, B Wu… - Applied Sciences, 2024 - mdpi.com
Morphological segmentation and stemming are foundational tasks in natural language
processing. They have become effective ways to alleviate data sparsity in agglutinative …

Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

CM Downey, S Drizin, L Haroutunian… - arXiv preprint arXiv …, 2021 - arxiv.org
We show that unsupervised sequence-segmentation performance can be transferred to
extremely low-resource languages by pre-training a Masked Segmental Language Model …

Unsupervised Word Segmentation Based on Word Influence

R Yan, H Zhang, W Silamu… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
The word segmentation task is a cornerstone of text processing. There are 7111 languages
worldwide, most of which are low-resource languages. This paper attempts to solve the …

BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation

H Song, R Dabre, Z Mao, C Chu… - Proceedings of the 2nd …, 2022 - aclanthology.org
Existing subword segmenters are either 1) frequency-based without semantic information
or 2) neural-based but trained on parallel corpora. To address this, we present BERTSeg, an …

Building and Evaluating Open-Vocabulary Language Models

SJ Mielke - 2023 - jscholarship.library.jhu.edu
Language models have always been a fundamental NLP tool and application. This
thesis focuses on open-vocabulary language models, i.e., models that can deal with novel …