Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Impact of tokenization on language models: An analysis for Turkish

C Toraman, EH Yilmaz, F Şahinuç… - ACM Transactions on …, 2023 - dl.acm.org
Tokenization is an important text preprocessing step to prepare input tokens for deep
language models. WordPiece and BPE are de facto methods employed by important …

The SIGMORPHON 2022 shared task on morpheme segmentation

K Batsuren, G Bella, A Arora, V Martinović… - arXiv preprint arXiv …, 2022 - arxiv.org
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to
decompose a word into a sequence of morphemes and covered most types of morphology …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

MT Alrefaie, NE Morsy, N Samir - arXiv preprint arXiv:2403.11130, 2024 - arxiv.org
This paper presents a comprehensive examination of the impact of tokenization strategies
and vocabulary sizes on the performance of Arabic language models in downstream natural …

Extending the subwording model of multilingual pretrained models for new languages

K Imamura, E Sumita - arXiv preprint arXiv:2211.15965, 2022 - arxiv.org
Multilingual pretrained models are effective for machine translation and cross-lingual
processing because they contain multiple languages in one model. However, they are …

Simultaneous Domain Adaptation of Tokenization and Machine Translation

T Enomoto, T Hirasawa, H Kim, T Oka… - Proceedings of the …, 2023 - aclanthology.org
Domain adaptation through fine-tuning is a well-established strategy for tailoring a
neural network model trained on a general domain to a specific target domain. During the …

Elementwise Language Representation

D Kim, J Kim - arXiv preprint arXiv:2302.13475, 2023 - arxiv.org
We propose a new technique for computational language representation called
elementwise embedding, in which a material (semantic unit) is abstracted into a horizontal …

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10808, 2023 - arxiv.org
This paper proposes a method to optimize tokenization for the performance improvement of
already trained downstream models. Our method generates tokenization results attaining …