Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

This paper presents a study on improving human action recognition through knowledge distillation and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models...

Detailed description

Saved in:
Bibliographic details
Main authors: Ahmadabadi, Hamid; Manzari, Omid Nejati; Ayatollahi, Ahmad
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Ahmadabadi, Hamid
Manzari, Omid Nejati
Ayatollahi, Ahmad
description This paper presents a study on improving human action recognition through knowledge distillation and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method employs a vision Transformer network as the student model, while a convolutional network serves as the teacher model. The teacher model extracts local image features, whereas the student model captures global features through an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images. Advanced ViT variants, namely PVT, ConViT, MViT, Swin Transformer, and Twins, are also discussed, highlighting their contributions to computer vision tasks. ConvNeXt, known for its efficiency and effectiveness in computer vision, is adopted as the teacher model. The paper reports human action recognition results on the Stanford 40 dataset, comparing the accuracy and mAP of student models trained with and without knowledge distillation. The findings show that the proposed approach significantly improves accuracy and mAP compared to training the networks under regular settings. These findings emphasize the potential of combining local and global features in action recognition tasks.
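
The distillation setup described above, a frozen convolutional teacher supervising a Transformer student through softened class probabilities, corresponds to the standard response-based scheme. Below is a minimal PyTorch sketch of that scheme; the convnext_tiny/vit_b_16 pairing, the temperature T, and the weighting alpha are illustrative assumptions, not the configuration reported in the paper.

# Sketch of response-based knowledge distillation as described in the
# abstract: a CNN teacher (here ConvNeXt-Tiny) guides a ViT student.
# Model sizes, T, and alpha are assumptions for illustration only.
import torch
import torch.nn.functional as F
from torchvision.models import convnext_tiny, vit_b_16

NUM_CLASSES = 40  # Stanford 40 action classes

teacher = convnext_tiny(num_classes=NUM_CLASSES)  # CNN teacher (local features)
student = vit_b_16(num_classes=NUM_CLASSES)       # ViT student (global features)
teacher.eval()  # the teacher stays frozen during distillation

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft loss: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# One training step (optimizer setup omitted for brevity):
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
with torch.no_grad():
    t_logits = teacher(images)  # teacher predictions, no gradients
s_logits = student(images)
loss = distillation_loss(s_logits, t_logits, labels)
loss.backward()  # gradients flow only into the student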
doi_str_mv 10.48550/arxiv.2311.01283
format Article
identifier DOI: 10.48550/arxiv.2311.01283
language eng
recordid cdi_arxiv_primary_2311_01283
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition