Tokenizer Choice For LLM Training: Negligible or Crucial?

Bibliographic Details
Main authors: Ali, Mehdi; Fromm, Michael; Thellmann, Klaudia; Rutmann, Richard; Lübbering, Max; Leveling, Johannes; Klug, Katrin; Ebert, Jan; Doll, Niclas; Buschhoff, Jasper Schulze; Jain, Charvi; Weber, Alexander Arno; Jurkschat, Lena; Abdelwahab, Hammam; John, Chelsea; Suarez, Pedro Ortiz; Ostendorff, Malte; Weinbach, Samuel; Sifa, Rafet; Kesselheim, Stefan; Flores-Herr, Nicolas
Format: Article
Language: English
Subjects: Computer Science - Learning
Online access: Order full text
creator Ali, Mehdi
Fromm, Michael
Thellmann, Klaudia
Rutmann, Richard
Lübbering, Max
Leveling, Johannes
Klug, Katrin
Ebert, Jan
Doll, Niclas
Buschhoff, Jasper Schulze
Jain, Charvi
Weber, Alexander Arno
Jurkschat, Lena
Abdelwahab, Hammam
John, Chelsea
Suarez, Pedro Ortiz
Ostendorff, Malte
Weinbach, Samuel
Sifa, Rafet
Kesselheim, Stefan
Flores-Herr, Nicolas
description The recent success of Large Language Models (LLMs) has been driven predominantly by curating the training dataset composition, scaling model architectures and dataset sizes, and advancing pretraining objectives, leaving tokenizer influence as a blind spot. Shedding light on this underexplored area, we conduct a comprehensive study of the influence of tokenizer choice on LLM downstream performance by training 24 mono- and multilingual LLMs at a 2.6B parameter scale, ablating different tokenizer algorithms and parameterizations. Our studies highlight that the tokenizer choice can significantly impact the model's downstream performance and training costs. In particular, we find that the common tokenizer evaluation metrics fertility and parity are not always predictive of downstream performance, rendering these metrics a questionable proxy for it. Furthermore, we show that multilingual tokenizers trained on the five most frequent European languages require a vocabulary size increase by a factor of three compared to English. While English-centric tokenizers have been applied to the training of multilingual LLMs in the past, we find that this approach results in severe downstream performance degradation and additional training costs of up to 68%, due to an inefficient tokenization vocabulary.
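
The two tokenizer evaluation metrics named in the abstract, fertility (average number of tokens produced per word) and parity (how evenly a tokenizer splits parallel text in two languages), can be illustrated with a minimal sketch. The helper names, the whitespace stand-in tokenizer, and the example sentences below are hypothetical and only show roughly how such metrics are typically computed; this is not the paper's implementation.

# Minimal sketch of the two tokenizer metrics; `tokenize` stands in for
# any tokenizer's encode function (e.g. a trained BPE or unigram model).
from typing import Callable, List, Tuple

def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

def parity(tokenize: Callable[[str], List[str]],
           parallel_pairs: List[Tuple[str, str]]) -> float:
    """Ratio of token counts over parallel sentence pairs; values near 1.0
    indicate the tokenizer treats both languages roughly equally."""
    tokens_a = sum(len(tokenize(a)) for a, _ in parallel_pairs)
    tokens_b = sum(len(tokenize(b)) for _, b in parallel_pairs)
    return tokens_a / tokens_b

if __name__ == "__main__":
    # Purely illustrative: a whitespace "tokenizer" and made-up sentences.
    whitespace_tokenize = lambda s: s.split()
    print(fertility(whitespace_tokenize, ["tokenizer choice matters"]))
    print(parity(whitespace_tokenize, [("the cat sat", "die Katze sass")]))

With a real subword tokenizer, fertility rises above 1.0 whenever words are split into multiple pieces, and parity drifts away from 1.0 for languages the vocabulary covers poorly, which is why the abstract cautions that these scores do not always predict downstream performance.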
doi_str_mv 10.48550/arxiv.2310.08754
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2310.08754
language eng
recordid cdi_arxiv_primary_2310_08754
source arXiv.org
subjects Computer Science - Learning
title Tokenizer Choice For LLM Training: Negligible or Crucial?
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-17T16%3A38%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Tokenizer%20Choice%20For%20LLM%20Training:%20Negligible%20or%20Crucial?&rft.au=Ali,%20Mehdi&rft.date=2023-10-12&rft_id=info:doi/10.48550/arxiv.2310.08754&rft_dat=%3Carxiv_GOX%3E2310_08754%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true