Knowledge distillation-based compression method for pre-trained language model, and platform

A knowledge distillation-based compression method for a pre-trained language model, and a platform. In the method, a universal feature transfer knowledge distillation strategy is first designed, and in a process of distilling knowledge from a teacher model to a student model, feature maps of each layer of the student model are approximated to features of the teacher model, with emphasis on the feature expression capacity in intermediate layers of the teacher model for small samples, and these features are used to guide the student model.
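
As a rough illustration only: the layer-wise feature transfer described above could be sketched as below, assuming PyTorch, a hand-chosen student-to-teacher layer mapping, and learned linear projections to bridge the student's smaller hidden size. The names (feature_transfer_loss, layer_map, projections) and the MSE objective are illustrative assumptions, not the implementation claimed in the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_transfer_loss(student_hiddens, teacher_hiddens, projections, layer_map):
    # Sum MSE between matched (student layer, teacher layer) feature maps,
    # projecting the student features to the teacher's hidden size first.
    loss = 0.0
    for s_idx, t_idx in layer_map:
        projected = projections[str(s_idx)](student_hiddens[s_idx])
        loss = loss + F.mse_loss(projected, teacher_hiddens[t_idx])
    return loss

# Toy usage: a 3-layer student (hidden size 384) matched to a 12-layer teacher (hidden size 768).
layer_map = [(1, 4), (2, 8), (3, 12)]
projections = nn.ModuleDict({str(s): nn.Linear(384, 768) for s, _ in layer_map})
student_hiddens = {s: torch.randn(2, 16, 384) for s, _ in layer_map}   # (batch, seq, hidden)
teacher_hiddens = {t: torch.randn(2, 16, 768) for _, t in layer_map}
print(feature_transfer_loss(student_hiddens, teacher_hiddens, projections, layer_map))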

Detailed description

Saved in:
Bibliographic details
Main authors: Hongsheng Wang, Haijun Shan, Fei Yang
Format: Patent
Language: eng
Subjects:
Online access: Order full text
creator Hongsheng Wang
Haijun Shan
Fei Yang
description A knowledge distillation-based compression method for a pre-trained language model, and a platform. In the method, a universal feature transfer knowledge distillation strategy is first designed, and in a process of distilling knowledge from a teacher model to a student model, feature maps of each layer of the student model are approximated to features of the teacher model, with emphasis on the feature expression capacity in intermediate layers of the teacher model for small samples, and these features are used to guide the student model; then, the ability of the self-attention distribution of the teacher model to detect semantics and syntax between words is used to construct a knowledge distillation method based on self-attention crossover; and finally, in order to improve the learning quality of early-period training and the generalization ability of late-period training in the learning model, a Bernoulli probability distribution-based linear transfer strategy is designed to gradually complete knowledge transfer of the feature map and self-attention distribution from the teacher to the student. By means of the present method, automatic compression is performed on a pre-trained multi-task-oriented language model, improving language model compression efficiency.
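
As a further rough sketch of the other two ideas in the description, matching the teacher's self-attention distributions and gating each layer's transfer with a Bernoulli draw whose probability grows linearly over training, the following assumes softmax-normalized attention maps of shape (batch, heads, seq, seq); the KL-divergence objective, the schedule endpoints, and all function names are illustrative assumptions rather than the patented method.

import torch
import torch.nn.functional as F

def attention_distill_loss(student_attn, teacher_attn):
    # KL divergence pulling the student's self-attention distribution toward
    # the teacher's; both maps are assumed softmax-normalized over the keys.
    return F.kl_div(torch.log(student_attn + 1e-12), teacher_attn,
                    reduction="batchmean")

def transfer_probability(step, total_steps, p_start=0.1, p_end=1.0):
    # Linearly increasing probability of transferring knowledge at this step,
    # so little is transferred early in training and nearly all of it late.
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def gated_attention_loss(student_attns, teacher_attns, layer_map, step, total_steps):
    # For each matched layer pair, include the attention loss only when a
    # Bernoulli(p) draw succeeds, with p following the linear schedule above.
    p = transfer_probability(step, total_steps)
    loss = 0.0
    for s_idx, t_idx in layer_map:
        if torch.rand(1).item() < p:
            loss = loss + attention_distill_loss(student_attns[s_idx], teacher_attns[t_idx])
    return loss

# Toy usage with random attention maps of shape (batch, heads, seq, seq).
student_attns = {1: torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)}
teacher_attns = {4: torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)}
print(gated_attention_loss(student_attns, teacher_attns, [(1, 4)], step=800, total_steps=1000))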
format Patent
date 2023-01-18
link https://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20230118&DB=EPODOC&CC=GB&NR=2608919A (View record in European Patent Office)
fulltext fulltext_linktorsrc
language eng
recordid cdi_epo_espacenet_GB2608919A
source esp@cenet
subjects CALCULATING
COMPUTING
COUNTING
ELECTRIC DIGITAL DATA PROCESSING
PHYSICS
title Knowledge distillation-based compression method for pre-trained language model, and platform
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T08%3A32%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=Hongsheng%20Wang&rft.date=2023-01-18&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EGB2608919A%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true