Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Impact of tokenization on language models: An analysis for Turkish

C Toraman, EH Yilmaz, F Şahinuç… - ACM Transactions on …, 2023 - dl.acm.org
Tokenization is an important text preprocessing step to prepare input tokens for deep
language models. WordPiece and BPE are de facto methods employed by important …

The SIGMORPHON 2022 shared task on morpheme segmentation

K Batsuren, G Bella, A Arora, V Martinović… - arXiv preprint arXiv …, 2022 - arxiv.org
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to
decompose a word into a sequence of morphemes and covered most types of morphology …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

MT Alrefaie, NE Morsy, N Samir - arXiv preprint arXiv:2403.11130, 2024 - arxiv.org
This paper presents a comprehensive examination of the impact of tokenization strategies
and vocabulary sizes on the performance of Arabic language models in downstream natural …

Extending the subwording model of multilingual pretrained models for new languages

K Imamura, E Sumita - arXiv preprint arXiv:2211.15965, 2022 - arxiv.org
Multilingual pretrained models are effective for machine translation and cross-lingual
processing because they contain multiple languages in one model. However, they are …

Simultaneous Domain Adaptation of Tokenization and Machine Translation

T Enomoto, T Hirasawa, H Kim, T Oka… - Proceedings of the …, 2023 - aclanthology.org
Domain adaptation through fine-tuning is a well-established strategy for tailoring a
neural network model trained on a general domain to a specific target domain. During the …

Elementwise Language Representation

D Kim, J Kim - arXiv preprint arXiv:2302.13475, 2023 - arxiv.org
We propose a new technique for computational language representation called
elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal …

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10808, 2023 - arxiv.org
This paper proposes a method to optimize tokenization for the performance improvement of
already trained downstream models. Our method generates tokenization results attaining …