Tokenizer Choice For LLM Training: Negligible or Crucial?
The recent success of Large Language Models (LLMs) has been predominantly driven by curating the training dataset composition, scaling of model architectures and dataset sizes and advancements in pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored ...
Saved in:
Main Authors: | Ali, Mehdi; Fromm, Michael; Thellmann, Klaudia; Rutmann, Richard; Lübbering, Max; Leveling, Johannes; Klug, Katrin; Ebert, Jan; Doll, Niclas; Buschhoff, Jasper Schulze; Jain, Charvi; Weber, Alexander Arno; Jurkschat, Lena; Abdelwahab, Hammam; John, Chelsea; Suarez, Pedro Ortiz; Ostendorff, Malte; Weinbach, Samuel; Sifa, Rafet; Kesselheim, Stefan; Flores-Herr, Nicolas |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning |
Online Access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Ali, Mehdi; Fromm, Michael; Thellmann, Klaudia; Rutmann, Richard; Lübbering, Max; Leveling, Johannes; Klug, Katrin; Ebert, Jan; Doll, Niclas; Buschhoff, Jasper Schulze; Jain, Charvi; Weber, Alexander Arno; Jurkschat, Lena; Abdelwahab, Hammam; John, Chelsea; Suarez, Pedro Ortiz; Ostendorff, Malte; Weinbach, Samuel; Sifa, Rafet; Kesselheim, Stefan; Flores-Herr, Nicolas |
description | The recent success of Large Language Models (LLMs) has been driven predominantly by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving the influence of the tokenizer as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of downstream performance, rendering them a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary roughly three times larger than an English-only tokenizer. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary. |
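
The abstract refers to the tokenizer evaluation metrics *fertility* (commonly defined as the average number of subword tokens produced per word) and *parity* (how evenly a tokenizer splits parallel text across languages). The sketch below is a minimal illustration of these definitions, not code from the paper; it assumes the Hugging Face `transformers` library, a GPT-2 tokenizer as an arbitrary English-centric example, and hypothetical parallel sentences.

```python
# Illustrative sketch of tokenizer "fertility" and "parity" metrics.
# Assumes the Hugging Face `transformers` library and a GPT-2 tokenizer
# purely as an example; the paper's exact setup is not reproduced here.
from transformers import AutoTokenizer


def fertility(tokenizer, texts):
    """Average number of subword tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words


def parity(tokenizer, texts_lang_a, texts_lang_b):
    """Ratio of token counts on parallel sentences in two languages.

    A value near 1.0 indicates the tokenizer treats both languages
    roughly equally efficiently; values far above 1.0 indicate language A
    is split into disproportionately many tokens.
    """
    tokens_a = sum(len(tokenizer.tokenize(t)) for t in texts_lang_a)
    tokens_b = sum(len(tokenizer.tokenize(t)) for t in texts_lang_b)
    return tokens_a / tokens_b


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")  # example English-centric tokenizer
    english = ["The cat sat on the mat."]
    german = ["Die Katze saß auf der Matte."]    # hypothetical parallel sentence
    print(f"fertility (en): {fertility(tok, english):.2f}")
    print(f"parity (de/en): {parity(tok, german, english):.2f}")
```

A parity well above 1.0 for the non-English side, as in this kind of check, is the sort of signal the abstract describes for English-centric tokenizers applied to multilingual data.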
doi_str_mv | 10.48550/arxiv.2310.08754 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2310.08754 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2310_08754 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Tokenizer Choice For LLM Training: Negligible or Crucial? |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-17T16%3A38%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Tokenizer%20Choice%20For%20LLM%20Training:%20Negligible%20or%20Crucial?&rft.au=Ali,%20Mehdi&rft.date=2023-10-12&rft_id=info:doi/10.48550/arxiv.2310.08754&rft_dat=%3Carxiv_GOX%3E2310_08754%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |