Linguistically inspired roadmap for building biologically reliable protein language models

MH Vu, R Akbar, PA Robert, B Swiatczak… - Nature Machine …, 2023 - nature.com
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …

Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …
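The frequency-driven training this entry refers to can be illustrated with a minimal BPE-style sketch (not the method proposed in the paper): count adjacent symbol pairs weighted by word frequency and repeatedly merge the most frequent pair. The toy corpus and the number of merges below are illustrative assumptions.

```python
# Minimal frequency-based subword-merging sketch (BPE-style); toy data only.
from collections import Counter

def most_frequent_pair(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words with their frequencies, split into characters (assumption).
word_freqs = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(4):  # the number of merges is an arbitrary choice here
    pair = most_frequent_pair(word_freqs)
    if pair is None:
        break
    word_freqs = merge_pair(word_freqs, pair)
print(word_freqs)  # merged symbol sequences after four merges
```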

Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length

M Fan, C Hu, S Zhou - arXiv preprint arXiv:2308.05585, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping
the impact of large language models (LLMs), contributing significantly to controlling output …

Where are we Still Split on Tokenization?

R van der Goot - Findings of the Association for Computational …, 2024 - aclanthology.org
Abstract Many Natural Language Processing (NLP) tasks are labeled on the token level;
for these tasks, the first step is to identify the tokens (tokenization). Because this step is often …

Simultaneous Domain Adaptation of Tokenization and Machine Translation

T Enomoto, T Hirasawa, H Kim, T Oka… - Proceedings of the …, 2023 - aclanthology.org
Abstract Domain adaptation through fine-tuning is a well-established strategy to tailor a
neural network model trained on a general-domain for a specific target-domain. During the …

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10808, 2023 - arxiv.org
This paper proposes a method to optimize tokenization for the performance improvement of
already trained downstream models. Our method generates tokenization results attaining …

Tokenization Preference for Human and Machine Learning Model: An Annotation Study

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10813, 2023 - arxiv.org
Is preferred tokenization for humans also preferred for machine-learning (ML) models? This
study examines the relations between preferred tokenization for humans (appropriateness …

Composing Word Embeddings for Compound Words Using Linguistic Knowledge

K Komiya, S Kono, T Seito, T Hirabayashi - ACM Transactions on Asian …, 2023 - dl.acm.org
In recent years, the use of distributed representations has been a fundamental technology
for natural language processing. However, Japanese has many compound words, and …
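
A generic illustration of the underlying idea, not the method of Komiya et al.: a common baseline for a compound word that lacks its own embedding is to compose a vector from its constituents, here by element-wise averaging. The tiny embedding table and the example words below are made-up assumptions.

```python
# Composing a compound-word vector from constituent embeddings (toy data).
embeddings = {
    "electric": [0.2, 0.1, 0.7, 0.0],   # toy 4-dimensional vectors
    "car":      [0.9, 0.3, 0.1, 0.4],
}

def compose(constituents, table):
    """Element-wise mean of the constituent vectors."""
    vectors = [table[w] for w in constituents]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(compose(["electric", "car"], embeddings))  # [0.55, 0.2, 0.4, 0.2]
```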