Knowledge Distillation of Russian Language Models with Reduction of Vocabulary

Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models, which makes it impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that make knowledge distillation to students with a reduced vocabulary possible. Evaluation of the distilled models on a number of common Russian benchmarks, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrated that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$-compressed student that keeps the full-sized vocabulary and reduces only the number of Transformer layers. We make our code and distilled models available.
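
For context, the "KL-based knowledge distillation" mentioned in the abstract refers to training the student to match the teacher's temperature-softened output distribution. The sketch below is a minimal, generic PyTorch illustration of that baseline objective (the function name and temperature value are illustrative assumptions, not taken from the paper); it presumes the teacher and student share one vocabulary, which is exactly the assumption that breaks when the student's vocabulary is reduced and that the paper's alignment techniques address.

import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Generic Hinton-style distillation loss: both output distributions are
    # softened by a temperature and the student minimizes KL(teacher || student).
    # Assumes the two logit tensors have identical shape, i.e. a shared vocabulary.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # "batchmean" matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

With a reduced student vocabulary, student_logits and teacher_logits no longer share the same last dimension, and the two models may even split the same text into different token sequences, so this loss cannot be applied directly; the teacher and student outputs first have to be aligned.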

Bibliographic Details
Main Authors: Kolesnikova, Alina; Kuratov, Yuri; Konovalov, Vasily; Burtsev, Mikhail
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online Access: Order full text
creator Kolesnikova, Alina; Kuratov, Yuri; Konovalov, Vasily; Burtsev, Mikhail
description Today, transformer language models serve as a core component for the majority of natural language processing tasks. Industrial application of such models requires minimizing computation time and memory footprint. Knowledge distillation is one approach to this goal. Existing methods in this field mainly focus on reducing the number of layers or the dimension of embeddings/hidden representations. An alternative is to reduce the number of tokens in the vocabulary, and therefore the embedding matrix of the student model. The main problem with vocabulary minimization is the mismatch between the input sequences and output class distributions of the teacher and student models, which makes it impossible to directly apply KL-based knowledge distillation. We propose two simple yet effective alignment techniques that make knowledge distillation to students with a reduced vocabulary possible. Evaluation of the distilled models on a number of common Russian benchmarks, such as Russian SuperGLUE, SberQuAD, RuSentiment, ParaPhraser, and Collection-3, demonstrated that our techniques achieve compression from $17\times$ to $49\times$ while maintaining the quality of a $1.7\times$-compressed student that keeps the full-sized vocabulary and reduces only the number of Transformer layers. We make our code and distilled models available.
doi_str_mv 10.48550/arxiv.2205.02340
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2205.02340
language eng
recordid cdi_arxiv_primary_2205_02340
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title Knowledge Distillation of Russian Language Models with Reduction of Vocabulary
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T13%3A02%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge%20Distillation%20of%20Russian%20Language%20Models%20with%20Reduction%20of%20Vocabulary&rft.au=Kolesnikova,%20Alina&rft.date=2022-05-04&rft_id=info:doi/10.48550/arxiv.2205.02340&rft_dat=%3Carxiv_GOX%3E2205_02340%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true