Linguistically inspired roadmap for building biologically reliable protein language models

MH Vu, R Akbar, PA Robert, B Swiatczak… - Nature Machine …, 2023 - nature.com
Deep neural-network-based language models (LMs) are increasingly applied to large-scale
protein sequence data to predict protein function. However, being largely black-box models …

Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Incorporating context into subword vocabularies

S Yehezkel, Y Pinter - arXiv preprint arXiv:2210.07095, 2022 - arxiv.org
Most current popular subword tokenizers are trained based on word frequency statistics over
a corpus, without considering information about co-occurrence or context. Nevertheless, the …
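The frequency-driven training this entry refers to can be illustrated with a minimal BPE-style sketch (not the method proposed in the paper): count adjacent symbol pairs weighted by word frequency and repeatedly merge the most frequent pair. The toy corpus and the number of merges below are illustrative assumptions.

```python
# Minimal frequency-based subword-merging sketch (BPE-style); toy data only.
from collections import Counter

def most_frequent_pair(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words with their frequencies, split into characters (assumption).
word_freqs = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(4):  # the number of merges is an arbitrary choice here
    pair = most_frequent_pair(word_freqs)
    if pair is None:
        break
    word_freqs = merge_pair(word_freqs, pair)
print(word_freqs)  # merged symbol sequences after four merges
```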

Proximal Policy Optimization Actual Combat: Manipulating Output Tokenizer Length

M Fan, C Hu, S Zhou - arXiv preprint arXiv:2308.05585, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) plays a pivotal role in shaping
the impact of large language models (LLMs), contributing significantly to controlling output …

Where are we Still Split on Tokenization?

R van der Goot - Findings of the Association for Computational …, 2024 - aclanthology.org
Abstract Many Natural Language Processing (NLP) tasks are labeled on the token level;
for these tasks, the first step is to identify the tokens (tokenization). Because this step is often …

Simultaneous Domain Adaptation of Tokenization and Machine Translation

T Enomoto, T Hirasawa, H Kim, T Oka… - Proceedings of the …, 2023 - aclanthology.org
Abstract Domain adaptation through fine-tuning is a well-established strategy to tailor a
neural network model trained on a general-domain for a specific target-domain. During the …

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10808, 2023 - arxiv.org
This paper proposes a method to optimize tokenization for the performance improvement of
already trained downstream models. Our method generates tokenization results attaining …

Tokenization Preference for Human and Machine Learning Model: An Annotation Study

T Hiraoka, T Iwakura - arXiv preprint arXiv:2304.10813, 2023 - arxiv.org
Is preferred tokenization for humans also preferred for machine-learning (ML) models? This
study examines the relations between preferred tokenization for humans (appropriateness …

Composing Word Embeddings for Compound Words Using Linguistic Knowledge

K Komiya, S Kono, T Seito, T Hirabayashi - ACM Transactions on Asian …, 2023 - dl.acm.org
In recent years, the use of distributed representations has been a fundamental technology
for natural language processing. However, Japanese has many compound words, and …
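
A generic illustration of the underlying idea, not the method of Komiya et al.: a common baseline for a compound word that lacks its own embedding is to compose a vector from its constituents, here by element-wise averaging. The tiny embedding table and the example words below are made-up assumptions.

```python
# Composing a compound-word vector from constituent embeddings (toy data).
embeddings = {
    "electric": [0.2, 0.1, 0.7, 0.0],   # toy 4-dimensional vectors
    "car":      [0.9, 0.3, 0.1, 0.4],
}

def compose(constituents, table):
    """Element-wise mean of the constituent vectors."""
    vectors = [table[w] for w in constituents]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(compose(["electric", "car"], embeddings))  # [0.55, 0.2, 0.4, 0.2]
```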