Real-Time Semantic Segmentation via Auto Depth, Downsampling Joint Decision and Feature Aggregation

To satisfy the stringent requirements for computational resources in the field of real-time semantic segmentation, most approaches focus on the hand-crafted design of light-weight segmentation networks. To enjoy the ability of model auto-design, Neural Architecture Search (NAS) has been introduced t...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Veröffentlicht in:	International journal of computer vision 2021-05, Vol.129 (5), p.1506-1525
Hauptverfasser:	Sun, Peng, Wu, Jiaxiang, Li, Songyuan, Lin, Peiwen, Huang, Junzhou, Li, Xi
Format:	Artikel
Sprache:	eng
Schlagworte:	Agglomeration Artificial Intelligence Computer Imaging Computer Science Image Processing and Computer Vision Image segmentation Pattern Recognition Pattern Recognition and Graphics Real time Searching Semantic segmentation Semantics Vision Weight reduction
Online-Zugang:	Volltext
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	To satisfy the stringent requirements for computational resources in the field of real-time semantic segmentation, most approaches focus on the hand-crafted design of light-weight segmentation networks. To enjoy the ability of model auto-design, Neural Architecture Search (NAS) has been introduced to search for the optimal building blocks of networks automatically. However, the network depth, downsampling strategy, and feature aggregation method are still set in advance and nonadjustable during searching. Moreover, these key properties are highly correlated and essential for a remarkable real-time segmentation model. In this paper, we propose a joint search framework, called AutoRTNet, to automate all the aforementioned key properties in semantic segmentation. Specifically, we propose hyper-cells to jointly decide the network depth and the downsampling strategy via a novel cell-level pruning process. Furthermore, we propose an aggregation cell to achieve automatic multi-scale feature aggregation. Extensive experimental results on Cityscapes and CamVid datasets demonstrate that the proposed AutoRTNet achieves the new state-of-the-art trade-off between accuracy and speed. Notably, our AutoRTNet achieves 73.9% mIoU on Cityscapes and 110.0 FPS on an NVIDIA TitanXP GPU card with input images at a resolution of 768 × 1536 .
ISSN:	0920-5691 1573-1405
DOI:	10.1007/s11263-021-01433-3