Different Tokenization Schemes Lead to Comparable Performance in Spanish Number Agreement
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Keywords: | |
Summary: | The relationship between language model tokenization and performance is an
open area of research. Here, we investigate how different tokenization schemes
impact number agreement in Spanish plurals. We find that
morphologically-aligned tokenization performs similarly to other tokenization
schemes, even when induced artificially for words that would not be tokenized
that way during training. We then present exploratory analyses demonstrating
that language model embeddings for different plural tokenizations have similar
distributions along the embedding space axis that maximally distinguishes
singular and plural nouns. Our results suggest that morphologically-aligned
tokenization is a viable tokenization approach, and existing models already
generalize some morphological patterns to new items. However, our results
indicate that morphological tokenization is not strictly required for
performance. |
---|---|
DOI: | 10.48550/arxiv.2403.13754 |