Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

A Thawani, S Ghanekar, X Zhu, J Pujara - arXiv preprint arXiv:2310.11628, 2023 - arxiv.org
Language models typically tokenize text into subwords, using a deterministic, hand-engineered
heuristic of combining characters into longer surface-level strings such as 'ing' or …

Neural machine translation of electrical engineering based on vector fusion

H Chen, Y Chen, J Zhang - Applied Sciences, 2023 - mdpi.com
The development of neural machine translation has achieved a good translation effect on
large-scale general corpora, but there are still many problems in the translation of low …

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

K Slagle - arXiv preprint arXiv:2404.14408, 2024 - arxiv.org
Tokenization is widely used in large language models because it significantly improves
performance. However, tokenization imposes several disadvantages, such as performance …

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

L Huang, Y Feng - arXiv preprint arXiv:2405.19290, 2024 - arxiv.org
Subword tokenization is a common method for vocabulary building in Neural Machine
Translation (NMT) models. However, increasingly complex tasks have revealed its …

Manipulating Data Representations for Neural Machine Translation

C Amrhein - 2023 - zora.uzh.ch
In natural language processing, much current research focuses on training larger and larger
models on more and more data. In this thesis, we argue that how data is represented can …

Privacy in Federated Learning

M Zhang - 2024 - search.proquest.com
The rise of Artificial Intelligence technology has raised concerns about the potential
compromise of privacy due to the handling of personal data. Private AI prevents cybercrimes …

Subword embedding from bytes against embedding-based attacks

M Zhang, J Xu - openreview.net
NLP models have grown as a powerful technology and impact our social life like never
before, along with rising concerns in practical applications including privacy invasion and …