Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models, which makes it impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that make knowledge distillation to students with a reduced vocabulary possible. Evaluation of the distilled models on a number of common Russian benchmarks, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrated that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$-compressed student that keeps the full-sized vocabulary and reduces only the number of Transformer layers. We make our code and distilled models available.
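
For context, the "KL-based knowledge distillation" mentioned in the abstract refers to training the student to match the teacher's temperature-softened output distribution. The sketch below is a minimal, generic PyTorch illustration of that baseline objective (the function name and temperature value are illustrative assumptions, not taken from the paper); it presumes the teacher and student share one vocabulary, which is exactly the assumption that breaks when the student's vocabulary is reduced and that the paper's alignment techniques address.

import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Generic Hinton-style distillation loss: both output distributions are
    # softened by a temperature and the student minimizes KL(teacher || student).
    # Assumes the two logit tensors have identical shape, i.e. a shared vocabulary.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

With a reduced student vocabulary, student_logits and teacher_logits no longer share the same last dimension, and the two models may even split the same text into different token sequences, so this loss cannot be applied directly; the teacher and student outputs first have to be aligned.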

Bibliographic Details
Main Authors: Kolesnikova, Alina; Kuratov, Yuri; Konovalov, Vasily; Burtsev, Mikhail
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online Access: Order full text
creator Kolesnikova, Alina; Kuratov, Yuri; Konovalov, Vasily; Burtsev, Mikhail
description Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models, which makes it impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that make knowledge distillation to students with a reduced vocabulary possible. Evaluation of the distilled models on a number of common Russian benchmarks, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrated that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$-compressed student that keeps the full-sized vocabulary and reduces only the number of Transformer layers. We make our code and distilled models available.
doi_str_mv 10.48550/arxiv.2205.02340
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2205.02340
language eng
recordid cdi_arxiv_primary_2205_02340
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T13%3A02%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge%20Distillation%20of%20Russian%20Language%20Models%20with%20Reduction%20of%20Vocabulary&rft.au=Kolesnikova,%20Alina&rft.date=2022-05-04&rft_id=info:doi/10.48550/arxiv.2205.02340&rft_dat=%3Carxiv_GOX%3E2205_02340%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true