SinKD: Sinkhorn Distance Minimization for Knowledge Distillation

Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse KL (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this article, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorate logits-based KD for diverse natural language processing (NLP) tasks. We propose Sinkhorn KD (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between the distributions of teacher and student models. Moreover, thanks to the properties of the Sinkhorn metric, we dispense with sample-wise KD, which restricts the perception of divergence to each individual teacher-student sample pair. Instead, we propose a batch-wise reformulation that captures the geometric intricacies of distributions across samples in high-dimensional space. A comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art (SOTA) methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures. Codes and models are available at https://github.com/2018cx/SinKD.
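
The abstract turns on two technical ingredients: the entropy-regularized Sinkhorn distance as the divergence measure, and a batch-wise (rather than sample-wise) comparison of teacher and student output distributions. The following PyTorch sketch illustrates how such a loss can be computed. It is a minimal illustration written from the standard Sinkhorn iteration, not the authors' released implementation (available at https://github.com/2018cx/SinKD); the L1 cost, the temperature, the regularization weight epsilon, and the iteration count are assumed values chosen for the example.

    import torch
    import torch.nn.functional as F

    def sinkhorn_kd_loss(student_logits, teacher_logits, temperature=2.0,
                         epsilon=0.1, n_iters=20):
        """Entropy-regularized OT (Sinkhorn) distance between batches of
        teacher and student class distributions. Hyperparameter defaults
        are illustrative, not taken from the paper."""
        # Softened probabilities, shape (batch, num_classes).
        p_s = F.softmax(student_logits / temperature, dim=-1)
        p_t = F.softmax(teacher_logits / temperature, dim=-1)

        # Batch-wise cost: pairwise L1 distance between student sample i
        # and teacher sample j, shape (batch, batch).
        cost = torch.cdist(p_s, p_t, p=1)

        # Sinkhorn iterations on the Gibbs kernel with uniform marginals.
        batch = cost.size(0)
        kernel = torch.exp(-cost / epsilon)
        a = torch.full((batch,), 1.0 / batch, device=cost.device)
        b = torch.full((batch,), 1.0 / batch, device=cost.device)
        u = torch.ones_like(a)
        for _ in range(n_iters):
            v = b / (kernel.t() @ u + 1e-9)
            u = a / (kernel @ v + 1e-9)

        # Transport plan and the resulting Sinkhorn distance.
        plan = u.unsqueeze(1) * kernel * v.unsqueeze(0)
        return torch.sum(plan * cost)

    # Example: distilling over a batch of 8 samples with a 32k vocabulary.
    student = torch.randn(8, 32000, requires_grad=True)
    teacher = torch.randn(8, 32000)
    loss = sinkhorn_kd_loss(student, teacher)
    loss.backward()  # gradients flow back to the student logits

Because the fixed-point updates of u and v are differentiable, the loss can stand in for a KL term in a standard logits-based distillation objective; with uniform marginals over the batch, the converged plan approximates the optimal coupling of the entropy-regularized transport problem.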

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-12, p. 1-15
Main authors: Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang
Format: Article
Language: English
Online access: Order full text
DOI: 10.1109/TNNLS.2024.3501335
ISSN: 2162-237X
Source: IEEE Electronic Library (IEL)
Subjects:
Adaptation models
Artificial Intelligence
Bidirectional control
Computation and Language
Computer Science
Computer Vision and Pattern Recognition
Costs
Encoding
Knowledge distillation (KD)
Machine Learning
Minimization
Robustness
Sinkhorn distance
Statistics
Sun
Temperature measurement
Training
Transformers
Wasserstein distance