PDD: Pruning Neural Networks During Knowledge Distillation

Although deep neural networks have advanced considerably, their large computational requirements limit deployment on end devices. To this end, a variety of model compression and acceleration techniques have been developed. Among these, knowledge distillation has emerged as a popular approach that trains a small student model to mimic the performance of a larger teacher model. However, the student architectures used in existing knowledge distillation are not optimal and often contain redundancy, which calls into question the implicit assumption that the student model is already compact. This study investigates that assumption and empirically demonstrates that student models can contain redundancy, which can be removed through pruning without significant performance degradation. We therefore propose a novel pruning method that eliminates redundancy in student models. Instead of applying traditional post-training pruning, we prune during knowledge distillation (PDD) so that no important information is lost in the transfer from the teacher model to the student model. This is achieved by designing a differentiable mask for each convolutional layer, which dynamically adjusts the channels to be pruned based on the loss. Experimental results show that with ResNet20 as the student and ResNet56 as the teacher, a 39.53% FLOPs reduction was achieved by removing 32.77% of the parameters, while top-1 accuracy on CIFAR10 increased by 0.17%. With VGG11 as the student and VGG16 as the teacher, a 74.96% FLOPs reduction was achieved by removing 76.43% of the parameters, with only a 1.34% loss in top-1 accuracy on CIFAR10. Our code is available at https://github.com/YihangZhou0424/PDD-Pruning-during-distillation.

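The key mechanism in the abstract, a differentiable per-channel mask learned jointly with the distillation loss, can be illustrated with a minimal PyTorch-style sketch. This is an illustrative sketch, not the authors' implementation: the sigmoid gate parameterization, the L1-style sparsity penalty, the loss weights (T, alpha, sparsity_weight), and the helper names MaskedConv2d, distillation_step, and channels_to_keep are assumptions introduced here; the actual PDD code is in the repository linked above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    """Convolution whose output channels are scaled by a differentiable mask (illustrative)."""
    def __init__(self, in_ch, out_ch, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, **kwargs)
        # One logit per output channel; sigmoid(logit) acts as a soft keep-gate.
        self.mask_logits = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        gate = torch.sigmoid(self.mask_logits)        # shape: (out_ch,)
        return self.conv(x) * gate.view(1, -1, 1, 1)  # gate each channel map

    def channels_to_keep(self, threshold=0.5):
        # Channels whose gate stays above the threshold survive pruning.
        return (torch.sigmoid(self.mask_logits) > threshold).nonzero().flatten()

def distillation_step(student, teacher, x, y, optimizer,
                      T=4.0, alpha=0.9, sparsity_weight=1e-4):
    """One optimization step: KD loss + cross-entropy + mask sparsity penalty (assumed weighting)."""
    with torch.no_grad():
        t_logits = teacher(x)                          # teacher is frozen
    s_logits = student(x)

    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    ce = F.cross_entropy(s_logits, y)

    # Push gates toward zero so redundant channels can be removed after training.
    sparsity = sum(torch.sigmoid(m.mask_logits).sum()
                   for m in student.modules() if isinstance(m, MaskedConv2d))

    loss = alpha * kd + (1.0 - alpha) * ce + sparsity_weight * sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

After distillation, each MaskedConv2d would keep only the channels returned by channels_to_keep(), which is how FLOPs and parameter reductions of the kind reported above would be realized; the exact mask design, scheduling, and thresholding used by PDD are given in the authors' repository.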

Bibliographic Details
Published in: Cognitive Computation, 2024-11, Vol. 16 (6), pp. 3457-3467
Authors: Dan, Xi; Yang, Wenjie; Zhang, Fuyan; Zhou, Yihang; Yu, Zhuojun; Qiu, Zhen; Zhao, Boyuan; Dong, Zeyu; Huang, Libo; Yang, Chuanguang
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s12559-024-10350-9
Publisher: Springer US, New York
ISSN: 1866-9956
EISSN: 1866-9964
Source: SpringerLink Journals
Subjects: Accuracy; Artificial Intelligence; Artificial neural networks; Computation by Abstract Devices; Computational Biology/Bioinformatics; Computer Science; Efficiency; Knowledge; Methods; Neural networks; Parameters; Performance degradation; Pruning; Redundancy; Teachers