What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language …
Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important …
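As a concrete illustration of this kind of subword preprocessing, the minimal sketch below runs a pretrained WordPiece tokenizer through the Hugging Face transformers library; the checkpoint name and the example word are illustrative choices only and are not taken from the work quoted above.

```python
# Minimal sketch: subword preprocessing with a pretrained WordPiece tokenizer.
# "bert-base-uncased" and the example word are illustrative choices only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A rare or unseen word is split into known subword pieces; in WordPiece,
# "##" marks a piece that continues the previous one.
word = "untokenizable"
print(tokenizer.tokenize(word))       # list of subword strings
print(tokenizer(word)["input_ids"])   # ids passed to the model (with [CLS]/[SEP])
```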
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology …
S. Yehezkel and Y. Pinter, arXiv preprint arXiv:2210.07095, 2022 (arxiv.org).
Most current popular subword tokenizers are trained based on word frequency statistics over a corpus, without considering information about co-occurrence or context. Nevertheless, the …
This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural …
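To make the vocabulary-size variable concrete, here is a small sketch that trains byte-pair-encoding tokenizers of two different sizes with the Hugging Face tokenizers library and compares how finely they segment the same sentence; the corpus path, the sizes, and the test sentence are hypothetical and do not reproduce the study's setup.

```python
# Sketch: how vocabulary size changes segmentation granularity.
# "arabic_corpus.txt", the two sizes, and the test sentence are hypothetical.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

def train_bpe(vocab_size, files):
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    tok.train(files=files, trainer=BpeTrainer(vocab_size=vocab_size,
                                              special_tokens=["[UNK]"]))
    return tok

sentence = "اللغة العربية جميلة"  # any held-out text works here
for size in (1_000, 32_000):
    tok = train_bpe(size, ["arabic_corpus.txt"])
    tokens = tok.encode(sentence).tokens
    # Smaller vocabularies generally produce longer token sequences.
    print(size, len(tokens), tokens)
```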
K. Imamura and E. Sumita, arXiv preprint arXiv:2211.15965, 2022 (arxiv.org).
Multilingual pretrained models are effective for machine translation and cross-lingual processing because they contain multiple languages in one model. However, they are …
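One common way to expose a multilingual pretrained model to a previously unseen language is to add new subword tokens and grow the embedding matrix accordingly. The sketch below shows that generic recipe with the Hugging Face transformers API; the checkpoint and the token strings are placeholders, and this is not the specific procedure proposed in the paper.

```python
# Sketch: extending a pretrained multilingual model's vocabulary for a new language.
# The checkpoint and the new-language subwords are placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/mbart-large-50"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

new_subwords = ["placeholder_subword_a", "placeholder_subword_b"]  # hypothetical
num_added = tokenizer.add_tokens(new_subwords)

# Grow the embedding matrix so the new ids get (randomly initialized) vectors;
# these are then learned during continued pretraining or fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```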
T. Enomoto, T. Hirasawa, H. Kim, T. Oka, et al., Proceedings of the …, 2023 (aclanthology.org).
Domain adaptation through fine-tuning is a well-established strategy for tailoring a neural network model trained on a general domain to a specific target domain. During the …
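For readers who want that baseline strategy in code, the following sketch continues masked-language-model training of a general-domain checkpoint on an in-domain text file with the Hugging Face Trainer; the file name, checkpoint, and hyperparameters are placeholders, and it is not the adaptation procedure studied in the paper.

```python
# Sketch: domain adaptation by continued (masked-LM) fine-tuning on in-domain text.
# "in_domain.txt", the checkpoint, and all hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# One target-domain sentence per line in a plain-text file.
train = load_dataset("text", data_files={"train": "in_domain.txt"})["train"]
train = train.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                  batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain_adapted", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # afterwards the model is better matched to the target domain
```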
D. Kim and J. Kim, arXiv preprint arXiv:2302.13475, 2023 (arxiv.org).
We propose a new technique for computational language representation called elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal …
T. Hiraoka and T. Iwakura, arXiv preprint arXiv:2304.10808, 2023 (arxiv.org).
This paper proposes a method that optimizes tokenization to improve the performance of already-trained downstream models. Our method generates tokenization results attaining …