TBFormer: three-branch efficient transformer for semantic segmentation
Published in: Signal, Image and Video Processing, 2024-06, Vol. 18 (4), p. 3661-3672
Main authors: ,
Format: Article
Language: English
Online access: Full text
Abstract: Semantic segmentation usually benefits from global context, spatial details, and boundary features, and dual-branch networks have demonstrated their superiority in this regard. However, CNN-based models are poor at handling long-range dependencies. Moreover, directly fusing high-frequency details with low-frequency context can let detail information overwhelm the contextual information around the target, and downsampling operations cause loss of spatial information and make boundary features difficult to extract. We therefore propose TBFormer, a transformer-based three-branch semantic segmentation architecture comprising a context branch, a spatial branch, and an edge branch. The context branch exploits the transformer's strengths to obtain rich global context features and introduces a feature refine module at its end that effectively captures and refines multi-scale information. The spatial branch uses small-stride convolutions to preserve high-resolution information. The edge branch strengthens the role of edge information in the model through a novel Sobel operator. On top of the three branches, we design a lightweight all-convolution upsample-concat decoder to fuse the features. TBFormer achieves the best balance between accuracy and computational complexity: our TBF-T reaches 78.5% mIoU at 58.5 GFLOPs on the Cityscapes dataset and 78.1% mIoU at 29.4 GFLOPs on the Pascal VOC dataset, significantly outperforming previous counterparts in both performance and efficiency.
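Two components named in the abstract translate naturally into code: the Sobel-based edge branch and the all-convolution upsample-concat decoder. The PyTorch sketch below illustrates one plausible reading of each; the paper describes a novel Sobel variant and its own layer configuration, so the class names, the use of the classic 3x3 Sobel kernels, and all channel sizes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SobelEdgeBranch(nn.Module):
    """Sketch of an edge branch: fixed depthwise Sobel filtering plus a
    learned 1x1 projection. The paper's "novel Sobel operator" is not
    public here; this uses the classic 3x3 kernels as an assumption."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Classic 3x3 Sobel kernels for horizontal and vertical gradients.
        gx = torch.tensor([[-1., 0., 1.],
                           [-2., 0., 2.],
                           [-1., 0., 1.]])
        kernel = torch.stack([gx, gx.t()])          # (2, 3, 3)
        kernel = kernel.repeat(in_channels, 1, 1)   # one (gx, gy) pair per channel
        self.register_buffer("kernel", kernel.unsqueeze(1))  # fixed, not learned
        self.groups = in_channels
        self.proj = nn.Conv2d(2 * in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Depthwise Sobel filtering: two gradient maps per input channel.
        edges = F.conv2d(x, self.kernel, padding=1, groups=self.groups)
        return self.proj(edges)


class UpsampleConcatDecoder(nn.Module):
    """Sketch of an all-convolution upsample-concat decoder: bring every
    branch output to one resolution, concatenate, fuse with convolutions.
    hidden_ch and the 3x3 fuse block are illustrative choices."""

    def __init__(self, context_ch: int, spatial_ch: int, edge_ch: int,
                 num_classes: int, hidden_ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(context_ch + spatial_ch + edge_ch, hidden_ch,
                      kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(hidden_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden_ch, num_classes, kernel_size=1),
        )

    def forward(self, context, spatial, edge):
        # Upsample the low-resolution context and edge maps to the
        # high-resolution spatial branch before concatenation.
        size = spatial.shape[-2:]
        context = F.interpolate(context, size=size, mode="bilinear",
                                align_corners=False)
        edge = F.interpolate(edge, size=size, mode="bilinear",
                             align_corners=False)
        return self.fuse(torch.cat([context, spatial, edge], dim=1))
```

A forward pass would then combine the three branch outputs, e.g. `decoder(context_feats, spatial_feats, edge_feats)`, producing per-pixel class logits at the spatial branch's resolution.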
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-024-03030-6