Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition

This paper presents a study on improving human action recognition through knowledge distillation and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models...

Detailed description

Saved in:
Bibliographic details
Main authors: Ahmadabadi, Hamid; Manzari, Omid Nejati; Ayatollahi, Ahmad
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Ahmadabadi, Hamid
Manzari, Omid Nejati
Ayatollahi, Ahmad
description This paper presents a study on improving human action recognition through knowledge distillation and the combination of CNN and ViT models. The research aims to enhance the performance and efficiency of smaller student models by transferring knowledge from larger teacher models. The proposed method employs a vision Transformer network as the student model, while a convolutional network serves as the teacher model. The teacher model extracts local image features, whereas the student model captures global features through an attention mechanism. The Vision Transformer (ViT) architecture is introduced as a robust framework for capturing global dependencies in images. Advanced ViT variants, namely PVT, ConViT, MViT, Swin Transformer, and Twins, are also discussed, highlighting their contributions to computer vision tasks. ConvNeXt, known for its efficiency and effectiveness in computer vision, is adopted as the teacher model. The paper reports human action recognition results on the Stanford 40 dataset, comparing the accuracy and mAP of student models trained with and without knowledge distillation. The findings show that the proposed approach significantly improves accuracy and mAP compared to training the networks under regular settings. These findings emphasize the potential of combining local and global features in action recognition tasks.
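
The distillation setup described above, a frozen convolutional teacher supervising a Transformer student through softened class probabilities, corresponds to the standard response-based scheme. Below is a minimal PyTorch sketch of that scheme; the convnext_tiny/vit_b_16 pairing, the temperature T, and the weighting alpha are illustrative assumptions, not the configuration reported in the paper.

# Sketch of response-based knowledge distillation as described in the
# abstract: a CNN teacher (here ConvNeXt-Tiny) guides a ViT student.
# Model sizes, T, and alpha are assumptions for illustration only.
import torch
import torch.nn.functional as F
from torchvision.models import convnext_tiny, vit_b_16

NUM_CLASSES = 40  # Stanford 40 action classes

teacher = convnext_tiny(num_classes=NUM_CLASSES)  # CNN teacher (local features)
student = vit_b_16(num_classes=NUM_CLASSES)       # ViT student (global features)
teacher.eval()  # the teacher stays frozen during distillation

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft loss: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    # Hard loss: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# One training step (optimizer setup omitted for brevity):
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
with torch.no_grad():
    t_logits = teacher(images)  # teacher predictions, no gradients
s_logits = student(images)
loss = distillation_loss(s_logits, t_logits, labels)
loss.backward()  # gradients flow only into the student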
doi_str_mv 10.48550/arxiv.2311.01283
format Article
identifier DOI: 10.48550/arxiv.2311.01283
language eng
recordid cdi_arxiv_primary_2311_01283
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Distilling Knowledge from CNN-Transformer Models for Enhanced Human Action Recognition