LBPE: Long-token-first Tokenization to Improve Large Language Models
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich in semantic information, occur less frequently in tokenized datasets than short tokens, which can result in an imbalanced learning problem across different tokens. To address this, we propose LBPE, which prioritizes long tokens during the encoding process. LBPE generates tokens according to the reverse rank of token length rather than their rank in the vocabulary, granting longer tokens higher priority during encoding. Consequently, LBPE smooths the frequency differences between short and long tokens and thus mitigates the learning imbalance. Extensive experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE, confirming its effectiveness.
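To make the described encoding order concrete, the sketch below illustrates a long-token-first variant of BPE encoding on a toy merge table: at each step it applies the merge whose result is the longest token, falling back to the usual BPE rank only to break ties. This is a minimal illustration assuming a standard (left, right) -> rank merge dictionary; the function name lbpe_encode and the toy merges table are invented for this example and are not the authors' implementation.

```python
# Minimal sketch of long-token-first BPE encoding (illustrative only; the
# merge-table format and names here are assumptions, not the paper's code).


def lbpe_encode(word, merges):
    """Encode `word` with a BPE merge table, but at each step apply the merge
    that yields the LONGEST token instead of the lowest-rank merge."""
    symbols = list(word)  # start from individual characters (or bytes)
    while len(symbols) > 1:
        # all adjacent pairs that correspond to a learned merge
        candidates = [
            (i, symbols[i] + symbols[i + 1], merges[(symbols[i], symbols[i + 1])])
            for i in range(len(symbols) - 1)
            if (symbols[i], symbols[i + 1]) in merges
        ]
        if not candidates:
            break
        # long-token-first: longest merged token wins; ties fall back to BPE rank
        i, merged, _ = max(candidates, key=lambda c: (len(c[1]), -c[2]))
        symbols[i:i + 2] = [merged]
    return symbols


# Toy merge table: (left, right) -> rank (lower rank = learned earlier by BPE)
merges = {("t", "h"): 0, ("h", "e"): 1, ("th", "e"): 2}
print(lbpe_encode("the", merges))  # ['the']
```

Standard BPE applies merges strictly in rank order; reordering by token length means that whenever a longer vocabulary token can be formed, it is emitted, which raises the frequency of long tokens in the tokenized corpus, as the abstract describes.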
DOI: 10.48550/arxiv.2411.05504