Hierarchical Multi-Attention Transfer for Knowledge Distillation

Bibliographic Details
Published in:	ACM Transactions on Multimedia Computing, Communications, and Applications 2023-09, Vol. 20 (2), p. 1-20, Article 51
Main authors:	Gou, Jianping; Sun, Liyuan; Yu, Baosheng; Wan, Shaohua; Tao, Dacheng
Format:	Article
Language:	English
Subjects:	Artificial Intelligence; Computing Methodologies; Deep Learning
Online access:	Full text

Description
Abstract:	Knowledge distillation (KD) is a powerful and widely applicable technique for compressing deep learning models. The main idea of knowledge distillation is to transfer knowledge from a large teacher model to a small student model, and the attention mechanism has been intensively explored for this purpose owing to its great flexibility in handling different teacher-student architectures. However, existing attention-based methods usually transfer similar attention knowledge from the intermediate layers of deep neural networks, leaving the hierarchical structure of deep representation learning poorly investigated for knowledge distillation. In this paper, we propose a hierarchical multi-attention transfer framework (HMAT), in which different types of attention are used to transfer knowledge at different levels of deep representation learning. Specifically, position-based and channel-based attention knowledge characterize the knowledge from low-level and high-level feature representations, respectively, and activation-based attention knowledge characterizes the knowledge from both mid-level and high-level feature representations. Extensive experiments on three popular visual recognition tasks (image classification, image retrieval, and object detection) demonstrate that HMAT significantly outperforms recent state-of-the-art KD methods.
ISSN:	1551-6857, 1551-6865
DOI:	10.1145/3568679