DWRSeg: Rethinking Efficient Acquisition of Multi-scale Contextual Information for Real-time Semantic Segmentation
Format: | Article |
Language: | English |
Abstract: | Many current works directly adopt multi-rate depth-wise dilated convolutions to capture multi-scale contextual information simultaneously from one input feature map, thus improving the feature-extraction efficiency for real-time semantic segmentation. However, this design can make multi-scale contextual information hard to access because of its unreasonable structure and hyperparameters. To lower the difficulty of extracting multi-scale contextual information, we propose a highly efficient multi-scale feature extraction method that decomposes the original single-step method into two steps: Region Residualization and Semantic Residualization. In this method, the multi-rate depth-wise dilated convolutions take a simpler role in feature extraction: in the second step, they perform simple semantic-based morphological filtering, each with one desired receptive field, on the concise region-form feature maps provided by the first step, which improves their efficiency. Moreover, the dilation rates and the capacity of the dilated convolutions for each network stage are elaborated to fully utilize all the achievable region-form feature maps. Accordingly, we design a novel Dilation-wise Residual (DWR) module and a Simple Inverted Residual (SIR) module for the high-level and low-level stages of the network, respectively, and form a powerful DWR Segmentation (DWRSeg) network. Extensive experiments on the Cityscapes and CamVid datasets demonstrate the effectiveness of our method, which achieves a state-of-the-art trade-off between accuracy and inference speed while also being lighter in weight. Without pretraining or resorting to any training trick, we achieve an mIoU of 72.7% on the Cityscapes test set at 319.5 FPS on one NVIDIA GeForce GTX 1080 Ti card, exceeding the latest methods by 69.5 FPS and 0.8% mIoU. The code and trained models are publicly available. |
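
The abstract describes the two-step scheme only in prose. As a rough illustration, here is a minimal PyTorch sketch of what a Dilation-wise Residual (DWR) block along these lines could look like; the module name, branch count, dilation rates (1, 3, 5), and the plain 3x3 convolution used for Region Residualization are all assumptions made for the sketch, not the paper's verified design.

```python
import torch
import torch.nn as nn

class DWRModule(nn.Module):
    """Sketch of a Dilation-wise Residual block, reconstructed from the
    abstract alone: step 1 (Region Residualization) produces concise
    region-form features; step 2 (Semantic Residualization) applies
    depth-wise dilated convolutions, one dilation rate per branch."""

    def __init__(self, channels: int, dilation_rates=(1, 3, 5)):
        super().__init__()
        # Step 1: Region Residualization -- a plain 3x3 conv (an assumption)
        # that yields region-form feature maps for the dilated branches.
        self.region = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Step 2: Semantic Residualization -- each branch is a depth-wise
        # dilated conv doing morphological filtering at one receptive field.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                      groups=channels, bias=False)
            for r in dilation_rates
        )
        # Fuse the multi-scale branches back to the input width.
        self.fuse = nn.Sequential(
            nn.BatchNorm2d(channels * len(dilation_rates)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * len(dilation_rates), channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        region = self.region(x)
        multi_scale = torch.cat([b(region) for b in self.branches], dim=1)
        return x + self.fuse(multi_scale)  # residual connection
```

A block like this would slot into the high-level encoder stages, e.g. `DWRModule(64)` applied to an `(N, 64, H, W)` tensor; the abstract's point is that the dilated convolutions only filter already-prepared region-form features instead of extracting multi-scale context from the raw input in one step.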
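Similarly, the Simple Inverted Residual (SIR) module for the low-level stages is only named in the abstract. The sketch below assumes a standard inverted-residual layout (expand, depth-wise 3x3, project) with a residual connection and no dilated branches, since low-level features need smaller receptive fields; the expansion factor of 2 is a guess.

```python
import torch.nn as nn

class SIRModule(nn.Module):
    """Sketch of a Simple Inverted Residual block for the low-level
    stages; the layout is an assumption based on the module's name."""

    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),       # expand
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,
                      groups=hidden, bias=False),             # depth-wise
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),       # project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)
```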
DOI: | 10.48550/arxiv.2212.01173 |