Focal Modulation Networks
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Abstract: | We propose focal modulation networks (FocalNets for short), in which
self-attention (SA) is completely replaced by a focal modulation mechanism for
modeling token interactions in vision. Focal modulation comprises three
components: (i) hierarchical contextualization, implemented with a stack of
depth-wise convolutional layers, to encode visual contexts from short to long
ranges; (ii) gated aggregation to selectively gather contexts for each query
token based on its content; and (iii) element-wise modulation or affine
transformation to inject the aggregated context into the query. Extensive
experiments show that FocalNets outperform state-of-the-art SA counterparts
(e.g., Swin and Focal Transformers) at similar computational cost on image
classification, object detection, and segmentation. Specifically, FocalNets at
tiny and base sizes achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K.
After pretraining on ImageNet-22K at 224 resolution, they attain 86.5% and
87.3% top-1 accuracy when finetuned at resolutions 224 and 384, respectively.
When transferred to downstream tasks, FocalNets exhibit clear superiority. For
object detection with Mask R-CNN, FocalNet base trained with a 1× schedule
outperforms its Swin counterpart by 2.1 points and already surpasses Swin
trained with a 3× schedule (49.0 vs. 48.5). For semantic segmentation with
UPerNet, FocalNet base at single scale outperforms Swin by 2.4 and beats Swin
at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we
achieve 58.5 mIoU on ADE20K semantic segmentation and 57.9 PQ on COCO panoptic
segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on
COCO minival and test-dev, respectively, establishing a new SoTA on top of much
larger attention-based models such as SwinV2-G and BEiT-3. Code and
checkpoints are available at https://github.com/microsoft/FocalNet. |
DOI: | 10.48550/arxiv.2203.11926 |
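
The three components named in the abstract (hierarchical contextualization, gated aggregation, and element-wise modulation) can be sketched in a few lines of PyTorch. This is a minimal illustration only: the number of focal levels, the kernel sizes, the use of 1×1 convolutions in place of linear layers, and the omission of normalization are simplifying assumptions; the official implementation in the linked repository is the reference.

```python
# Minimal sketch of focal modulation as described in the abstract.
# Hyperparameters and layer choices below are illustrative assumptions.
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    def __init__(self, dim, focal_levels=3, base_kernel=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection yields the query, the initial context map, and one
        # gate per focal level plus one gate for the global context.
        self.proj_in = nn.Conv2d(dim, 2 * dim + (focal_levels + 1), kernel_size=1)
        # (i) Hierarchical contextualization: stacked depth-wise convolutions
        # with growing kernel sizes, covering short- to long-range context.
        self.context_layers = nn.ModuleList()
        for level in range(focal_levels):
            kernel = base_kernel + 2 * level
            self.context_layers.append(nn.Sequential(
                nn.Conv2d(dim, dim, kernel, padding=kernel // 2,
                          groups=dim, bias=False),
                nn.GELU(),
            ))
        # (iii) Modulator projection and output projection.
        self.proj_mod = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        c = x.shape[1]
        q, ctx, gates = torch.split(self.proj_in(x),
                                    [c, c, self.focal_levels + 1], dim=1)
        # (ii) Gated aggregation: each query location blends the context from
        # every level according to its own gate values.
        aggregated = 0
        for level, layer in enumerate(self.context_layers):
            ctx = layer(ctx)
            aggregated = aggregated + ctx * gates[:, level:level + 1]
        # Global context from average pooling, gated like the other levels.
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)
        aggregated = aggregated + global_ctx * gates[:, self.focal_levels:]
        # (iii) Element-wise modulation of the query by the aggregated context.
        return self.proj_out(q * self.proj_mod(aggregated))


if __name__ == "__main__":
    block = FocalModulation(dim=96)
    tokens = torch.randn(2, 96, 56, 56)        # a small feature map
    print(block(tokens).shape)                 # torch.Size([2, 96, 56, 56])
```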