An overview on language models: Recent developments and outlook

C Wei, YC Wang, B Wang, CCJ Kuo - arXiv preprint arXiv:2303.05759, 2023 - arxiv.org
Language modeling studies the probability distributions over strings of texts. It is one of the
most fundamental tasks in natural language processing (NLP). It has been widely used in …

Perceiver IO: A general architecture for structured inputs & outputs

A Jaegle, S Borgeaud, JB Alayrac, C Doersch… - arXiv preprint arXiv …, 2021 - arxiv.org
A central goal of machine learning is the development of systems that can solve many
problems in as many data domains as possible. Current architectures, however, cannot be …

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

JH Clark, D Garrette, I Turc, J Wieting - Transactions of the Association …, 2022 - direct.mit.edu
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet
nearly all commonly used models still require an explicit tokenization step. While recent …

Charformer: Fast character transformers via gradient-based subword tokenization

Y Tay, VQ Tran, S Ruder, J Gupta, HW Chung… - arXiv preprint arXiv …, 2021 - arxiv.org
State-of-the-art models in natural language processing rely on separate rigid subword
tokenization algorithms, which limit their generalization ability and adaptation to new …

Multimodal large language models: A survey

J Wu, W Gan, Z Chen, S Wan… - 2023 IEEE International …, 2023 - ieeexplore.ieee.org
The exploration of multimodal language models integrates multiple data types, such as
images, text, audio, and other heterogeneous data. While the latest large language …

Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP

SJ Mielke, Z Alyafeai, E Salesky, C Raffel… - arXiv preprint arXiv …, 2021 - arxiv.org
What are the units of text that we want to model? From bytes to multi-word expressions, text
can be analyzed and generated at many granularities. Until recently, most natural language …

Impact of tokenization on language models: An analysis for Turkish

C Toraman, EH Yilmaz, F Şahinuç… - ACM Transactions on …, 2023 - dl.acm.org
Tokenization is an important text preprocessing step to prepare input tokens for deep
language models. WordPiece and BPE are de facto methods employed by important …

Strong Prediction: Language model surprisal explains multiple N400 effects

JA Michaelov, MD Bardolph, CK Van Petten… - Neurobiology of …, 2024 - direct.mit.edu
Theoretical accounts of the N400 are divided as to whether the amplitude of the N400
response to a stimulus reflects the extent to which the stimulus was predicted, the extent to …

Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models

S Shaikh, SM Daudpota, AS Imran, Z Kastrati - Applied Sciences, 2021 - mdpi.com
Data imbalance is a frequently occurring problem in classification tasks, where the number of
samples in one category exceeds the number in others. Quite often, the minority class data is …

Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold

S Ruder, I Vulić, A Søgaard - arXiv preprint arXiv:2206.09755, 2022 - arxiv.org
The prototypical NLP experiment trains a standard architecture on labeled English data and
optimizes for accuracy, without accounting for other dimensions such as fairness …