Promoting CNNs with Cross-Architecture Knowledge Distillation for Efficient Monocular Depth Estimation
| Main authors | , , , |
|---|---|
| Format | Article |
| Language | eng |
| Subjects | |
| Online access | Order full text |
Abstract: Recently, the performance of monocular depth estimation (MDE) has been significantly boosted by the integration of transformer models. However, transformer models are usually computationally expensive, and their effectiveness in lightweight models is limited compared to convolutions. This limitation hinders their deployment on resource-limited devices. In this paper, we propose a cross-architecture knowledge distillation method for MDE, dubbed DisDepth, to enhance efficient CNN models with the supervision of state-of-the-art transformer models. Concretely, we first build a simple framework of convolution-based MDE, which is then enhanced with a novel local-global convolution module to capture both local and global information in the image. To effectively distill valuable information from the transformer teacher and bridge the gap between convolution and transformer features, we introduce a method to acclimate the teacher with a ghost decoder. The ghost decoder is a copy of the student's decoder, and adapting the teacher with the ghost decoder aligns its features to be student-friendly while preserving its original performance. Furthermore, we propose an attentive knowledge distillation loss that adaptively identifies features valuable for depth estimation. This loss guides the student to focus more on attentive regions, improving its performance. Extensive experiments on the KITTI and NYU Depth V2 datasets demonstrate the effectiveness of DisDepth. Our method achieves significant improvements on various efficient backbones, showcasing its potential for efficient monocular depth estimation.
DOI: 10.48550/arxiv.2404.16386
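
The attentive knowledge distillation loss is only described at a high level in the abstract. The sketch below is a minimal, hypothetical illustration of such an attention-weighted feature-distillation loss in Python/PyTorch: it assumes `student_feat` and `teacher_feat` are feature maps of matching shape (in DisDepth the teacher features would first be adapted through the ghost decoder, which is not modeled here), and it derives the attention weights from the magnitude of the teacher's activations. The function name, the temperature parameter, and the exact weighting scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def attentive_kd_loss(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor,
                      temperature: float = 0.5) -> torch.Tensor:
    """Attention-weighted feature-distillation loss (illustrative sketch only).

    Both inputs are assumed to be (B, C, H, W) feature maps of identical
    shape; in DisDepth the teacher features would first pass through the
    ghost decoder so that this assumption holds (not shown here).
    """
    b, _, h, w = teacher_feat.shape

    # Spatial attention derived from the teacher: locations with larger
    # activation magnitude are treated as more valuable for depth estimation.
    attn = teacher_feat.abs().mean(dim=1).reshape(b, -1)            # (B, H*W)
    attn = F.softmax(attn / temperature, dim=1).reshape(b, 1, h, w)

    # Per-pixel feature mismatch, re-weighted so the student focuses on
    # the attentive regions highlighted by the teacher.
    per_pixel = (student_feat - teacher_feat).pow(2).mean(dim=1, keepdim=True)
    return (attn * per_pixel).sum(dim=(1, 2, 3)).mean()


# Example usage with random tensors standing in for real decoder features.
if __name__ == "__main__":
    s = torch.randn(2, 64, 24, 80)   # student decoder features
    t = torch.randn(2, 64, 24, 80)   # ghost-decoder-adapted teacher features
    print(attentive_kd_loss(s, t))
```

Because the attention map is normalized with a softmax over spatial locations, the result is a weighted average of per-pixel feature errors, so regions the teacher activates strongly dominate the distillation signal; the temperature controls how sharply that focus is concentrated.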