SinKD: Sinkhorn Distance Minimization for Knowledge Distillation

Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures, including the Kullback-Leibler (KL), reverse KL (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when little distribution overlap exists between the teacher and the student. In this article, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorate logits-based KD for diverse natural language processing (NLP) tasks. We propose Sinkhorn KD (SinKD), which exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between the distributions of teacher and student models. Moreover, thanks to the properties of the Sinkhorn metric, we dispense with sample-wise KD, which restricts the perception of divergence to each individual teacher-student sample pair. Instead, we propose a batch-wise reformulation that captures the geometric intricacies of distributions across samples in high-dimensional space. A comprehensive evaluation on GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art (SOTA) methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures. Codes and models are available at https://github.com/2018cx/SinKD.
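
The abstract turns on two technical ingredients: the entropy-regularized Sinkhorn distance as the divergence measure, and a batch-wise (rather than sample-wise) comparison of teacher and student output distributions. The following PyTorch sketch illustrates how such a loss can be computed. It is a minimal illustration written from the standard Sinkhorn iteration, not the authors' released implementation (available at https://github.com/2018cx/SinKD); the L1 cost, the temperature, the regularization weight epsilon, and the iteration count are assumed values chosen for the example.

    import torch
    import torch.nn.functional as F

    def sinkhorn_kd_loss(student_logits, teacher_logits, temperature=2.0,
                         epsilon=0.1, n_iters=20):
        """Entropy-regularized OT (Sinkhorn) distance between batches of
        teacher and student class distributions. Hyperparameter defaults
        are illustrative, not taken from the paper."""
        # Softened probabilities, shape (batch, num_classes).
        p_s = F.softmax(student_logits / temperature, dim=-1)
        p_t = F.softmax(teacher_logits / temperature, dim=-1)

        # Batch-wise cost: pairwise L1 distance between student sample i
        # and teacher sample j, shape (batch, batch).
        cost = torch.cdist(p_s, p_t, p=1)

        # Sinkhorn iterations on the Gibbs kernel with uniform marginals.
        batch = cost.size(0)
        kernel = torch.exp(-cost / epsilon)
        a = torch.full((batch,), 1.0 / batch, device=cost.device)
        b = torch.full((batch,), 1.0 / batch, device=cost.device)
        u = torch.ones_like(a)
        for _ in range(n_iters):
            v = b / (kernel.t() @ u + 1e-9)
            u = a / (kernel @ v + 1e-9)

        # Transport plan and the resulting Sinkhorn distance.
        plan = u.unsqueeze(1) * kernel * v.unsqueeze(0)
        return torch.sum(plan * cost)

    # Example: distilling over a batch of 8 samples with a 32k vocabulary.
    student = torch.randn(8, 32000, requires_grad=True)
    teacher = torch.randn(8, 32000)
    loss = sinkhorn_kd_loss(student, teacher)
    loss.backward()  # gradients flow back to the student logits

Because the fixed-point updates of u and v are differentiable, the loss can stand in for a KL term in a standard logits-based distillation objective; with uniform marginals over the batch, the converged plan approximates the optimal coupling of the entropy-regularized transport problem.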

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2024-12, p. 1-15
Main authors: Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang
Format: Article
Language: English
Online access: Order full text
DOI: 10.1109/TNNLS.2024.3501335
ISSN: 2162-237X
Source: IEEE Electronic Library (IEL)
Subjects:
Adaptation models
Artificial Intelligence
Bidirectional control
Computation and Language
Computer Science
Computer Vision and Pattern Recognition
Costs
Encoding
Knowledge distillation (KD)
Machine Learning
Minimization
Robustness
Sinkhorn distance
Statistics
Sun
Temperature measurement
Training
Transformers
Wasserstein distance