Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

[HTML][HTML] Unsupervised multi-granular Chinese word segmentation and term discovery via graph partition

Z Yuan, Y Liu, Q Yin, B Li, X Feng, G Zhang… - Journal of Biomedical …, 2020 - Elsevier
Objective This study aims at realizing unsupervised term discovery in Chinese electronic
health records (EHRs) by using the word segmentation technique. The existing supervised …

Learning to discover, ground and use words with segmental neural language models

K Kawakami, C Dyer, P Blunsom - arXiv preprint arXiv:1811.09353, 2018 - arxiv.org
We propose a segmental neural language model that combines the generalization power of
neural networks with the ability to discover word-like units that are latent in unsegmented …

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

H Song, R Dabre, C Chu, S Kurohashi… - ACM Transactions on …, 2023 - dl.acm.org
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation
(NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair …

A masked segmental language model for unsupervised natural language segmentation

CM Downey, F Xia, GA Levow… - arXiv preprint arXiv …, 2021 - arxiv.org
Segmentation remains an important preprocessing step both in languages where" words" or
other important syntactic/semantic units (like morphemes) are not clearly delineated by white …

Unsupervised word segmentation with bi-directional neural language model

L Wang, X Zheng - ACM Transactions on Asian and Low-Resource …, 2022 - dl.acm.org
We propose an unsupervised word segmentation model, in which for each unlabelled
sentence sample, the learning objective is to maximize the generation probability of the …

Subword segmental machine translation: Unifying segmentation and target sentence generation

F Meyer, J Buys - arXiv preprint arXiv:2305.07005, 2023 - arxiv.org
Subword segmenters like BPE operate as a preprocessing step in neural machine
translation and other (conditional) language models. They are applied to datasets before …

Subword segmental language modelling for nguni languages

F Meyer, J Buys - arXiv preprint arXiv:2210.06525, 2022 - arxiv.org
Subwords have become the standard units of text in NLP, enabling efficient open-
vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is …

Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer

HW Li, YJ Lin, YT Li, C Lin, HY Kao - Proceedings of the 2023 …, 2023 - aclanthology.org
Unsupervised Chinese word segmentation (UCWS) has made progress by incorporating
linguistic knowledge from pre-trained language models using parameter-free probing …

Learning context using segment-level LSTM for neural sequence labeling

Y Shin, S Lee - IEEE/ACM Transactions on Audio, Speech, and …, 2019 - ieeexplore.ieee.org
This article introduces an approach that learns segment-level context for sequence labeling
in natural language processing (NLP). Previous approaches limit their basic unit to a word …