SinKD: Sinkhorn Distance Minimization for Knowledge Distillation
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse KL (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when a distribution overlap exists between the teacher and the student. In this article, we show that the aforementioned KL, RKL, and JS divergences, respectively, suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorates logits-based KD for diverse natural language processing (NLP) tasks. We propose the Sinkhorn KD (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between distributions of teacher and student models. Besides, thanks to the properties of the Sinkhorn metric, we get rid of sample-wise KD that restricts the perception of divergences inside each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. A comprehensive evaluation of GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art (SOTA) methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures. Codes and models are available at https://github.com/2018cx/SinKD.
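The abstract contrasts KL, RKL, and JS divergences with the Sinkhorn distance as a training signal for logits-based KD. As a rough, sample-wise illustration of that signal (the paper itself goes further with a batch-wise reformulation; the authors' code is at https://github.com/2018cx/SinKD), the PyTorch sketch below computes an entropy-regularised Sinkhorn distance between softened teacher and student distributions via Sinkhorn-Knopp iterations. The 0/1 ground cost, temperature, regularisation strength `eps`, and iteration count are illustrative assumptions, not values from the paper.

```python
# Minimal sketch, assuming a PyTorch setup; NOT the authors' released implementation.
import torch
import torch.nn.functional as F


def sinkhorn_distance(p, q, cost, eps=0.1, n_iters=50):
    """Entropy-regularised OT (Sinkhorn) distance between batches of
    probability vectors p and q, each of shape (batch, classes)."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel from the ground cost
    u = torch.ones_like(p)
    v = torch.ones_like(q)
    for _ in range(n_iters):                   # Sinkhorn-Knopp scaling iterations
        u = p / (v @ K.T).clamp_min(1e-9)      # match the row marginal p
        v = q / (u @ K).clamp_min(1e-9)        # match the column marginal q
    # <P, C> per sample, where P = diag(u) K diag(v) is the transport plan
    return torch.einsum("bi,ij,bj->b", u, K * cost, v)


def sinkhorn_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Sinkhorn-distance KD loss between softened teacher/student outputs."""
    p = F.softmax(student_logits / temperature, dim=-1)
    q = F.softmax(teacher_logits / temperature, dim=-1)
    n_cls = p.size(-1)
    # Illustrative 0/1 cost: moving mass to a different class costs 1.
    cost = 1.0 - torch.eye(n_cls, device=p.device)
    return sinkhorn_distance(p, q, cost).mean()


# Toy usage on a small classification head (e.g. a GLUE-style 3-way task).
student_logits = torch.randn(16, 3, requires_grad=True)
teacher_logits = torch.randn(16, 3)
loss = sinkhorn_kd_loss(student_logits, teacher_logits)
loss.backward()
```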
Saved in:
Published in: | IEEE Transactions on Neural Networks and Learning Systems, 2024-12, p. 1-15 |
---|---|
Main authors: | Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang |
Format: | Article |
Language: | eng |
Subjects: | Adaptation models; Artificial Intelligence; Bidirectional control; Computation and Language; Computer Science; Computer Vision and Pattern Recognition; Costs; Encoding; Knowledge distillation (KD); Machine Learning; Minimization; Robustness; Sinkhorn distance; Statistics; Sun; Temperature measurement; Training; Transformers; Wasserstein distance |
Online access: | Order full text |
container_end_page | 15 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE Transactions on Neural Networks and Learning Systems |
container_volume | |
creator | Cui, Xiao; Qin, Yulei; Gao, Yuting; Zhang, Enwei; Xu, Zihan; Wu, Tong; Li, Ke; Sun, Xing; Zhou, Wengang; Li, Houqiang |
description | Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs). Existing KD methods investigate various divergence measures including the Kullback-Leibler (KL), reverse KL (RKL), and Jensen-Shannon (JS) divergences. However, due to limitations inherent in their assumptions and definitions, these measures fail to deliver effective supervision when a distribution overlap exists between the teacher and the student. In this article, we show that the aforementioned KL, RKL, and JS divergences, respectively, suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation, which deteriorates logits-based KD for diverse natural language processing (NLP) tasks. We propose the Sinkhorn KD (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between distributions of teacher and student models. Besides, thanks to the properties of the Sinkhorn metric, we get rid of sample-wise KD that restricts the perception of divergences inside each teacher-student sample pair. Instead, we propose a batch-wise reformulation to capture the geometric intricacies of distributions across samples in the high-dimensional space. A comprehensive evaluation of GLUE and SuperGLUE, in terms of comparability, validity, and generalizability, highlights our superiority over state-of-the-art (SOTA) methods on all kinds of LLMs with encoder-only, encoder-decoder, and decoder-only architectures. Codes and models are available at https://github.com/2018cx/SinKD. |
doi_str_mv | 10.1109/TNNLS.2024.3501335 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2162-237X |
ispartof | IEEE Transactions on Neural Networks and Learning Systems, 2024-12, p.1-15 |
issn | 2162-237X |
language | eng |
recordid | cdi_ieee_primary_10777837 |
source | IEEE Electronic Library (IEL) |
subjects | Adaptation models; Artificial Intelligence; Bidirectional control; Computation and Language; Computer Science; Computer Vision and Pattern Recognition; Costs; Encoding; Knowledge distillation (KD); Machine Learning; Minimization; Robustness; Sinkhorn distance; Statistics; Sun; Temperature measurement; Training; Transformers; Wasserstein distance |
title | SinKD: Sinkhorn Distance Minimization for Knowledge Distillation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-16T16%3A51%3A36IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=SinKD:%20Sinkhorn%20Distance%20Minimization%20for%20Knowledge%20Distillation&rft.jtitle=IEEE%20transaction%20on%20neural%20networks%20and%20learning%20systems&rft.au=Cui,%20Xiao&rft.date=2024-12-03&rft.spage=1&rft.epage=15&rft.pages=1-15&rft.issn=2162-237X&rft.coden=ITNNAL&rft_id=info:doi/10.1109/TNNLS.2024.3501335&rft_dat=%3Chal_RIE%3Eoai_HAL_hal_04803835v1%3C/hal_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10777837&rfr_iscdi=true |