Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

H Song, R Dabre, C Chu, S Kurohashi… - ACM Transactions on …, 2023 - dl.acm.org
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation
(NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair …

Subword segmental machine translation: Unifying segmentation and target sentence generation

F Meyer, J Buys - arXiv preprint arXiv:2305.07005, 2023 - arxiv.org
Subword segmenters like BPE operate as a preprocessing step in neural machine
translation and other (conditional) language models. They are applied to datasets before …

Subword segmental language modelling for nguni languages

F Meyer, J Buys - arXiv preprint arXiv:2210.06525, 2022 - arxiv.org
Subwords have become the standard units of text in NLP, enabling efficient
open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is …
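Several of the entries above (Meyer & Buys 2022/2023, Song et al.) take byte-pair encoding as the baseline segmenter. As a point of reference, the merge-learning step of BPE can be sketched as follows; this is a minimal illustration over a toy word-frequency table (the `</w>` end-of-word marker and the example corpus are illustrative conventions, not taken from any of the cited papers):

```python
import re
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merge rules from a toy corpus.

    `corpus` maps space-separated symbol sequences (one entry per word
    type) to their frequencies, e.g. {"l o w </w>": 5}.
    Returns the list of learned merges, most frequent first.
    """
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Greedily merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Lookaround guards keep the match aligned to symbol boundaries.
        pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        corpus = {pattern.sub("".join(best), word): freq
                  for word, freq in corpus.items()}
    return merges

toy_corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
              "n e w e s t </w>": 6, "w i d e s t </w>": 3}
print(learn_bpe_merges(toy_corpus, 2))
```

The frequency-based, semantics-free nature of this greedy loop is precisely what the neural segmenters listed here (SelfSeg, BERTSeg, subword segmental models) aim to improve on.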

Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer

HW Li, YJ Lin, YT Li, C Lin, HY Kao - Proceedings of the 2023 …, 2023 - aclanthology.org
Unsupervised Chinese word segmentation (UCWS) has made progress by incorporating
linguistic knowledge from pre-trained language models using parameter-free probing …

A Benchmark for Morphological Segmentation in Uyghur and Kazakh

G Abudouwaili, S Ruzmamat, K Abiderexiti, B Wu… - Applied Sciences, 2024 - mdpi.com
Morphological segmentation and stemming are foundational tasks in natural language
processing. They have become effective ways to alleviate data sparsity in agglutinative …

Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

CM Downey, S Drizin, L Haroutunian… - arXiv preprint arXiv …, 2021 - arxiv.org
We show that unsupervised sequence-segmentation performance can be transferred to
extremely low-resource languages by pre-training a Masked Segmental Language Model …

Unsupervised Word Segmentation Based on Word Influence

R Yan, H Zhang, W Silamu… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
The word segmentation task is a cornerstone of text processing. There are 7111 languages
worldwide, most of which are low-resource languages. This paper attempts to solve the …

BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation

H Song, R Dabre, Z Mao, C Chu… - Proceedings of the 2nd …, 2022 - aclanthology.org
Existing subword segmenters are either 1) frequency-based without semantic information
or 2) neural-based but trained on parallel corpora. To address this, we present BERTSeg, an …

Building and Evaluating Open-Vocabulary Language Models

SJ Mielke - 2023 - jscholarship.library.jhu.edu
Language models have always been a fundamental NLP tool and application. This
thesis focuses on open-vocabulary language models, i.e., models that can deal with novel …