Applying Plain Transformers to Real-World Point Clouds
Main authors: | , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | To apply transformer-based models to point cloud understanding, many previous
works modify the architecture of transformers by using, e.g., local attention
and down-sampling. Although they have achieved promising results, earlier works
on transformers for point clouds have two issues. First, the power of plain
transformers is still under-explored. Second, they focus on simple and small
point clouds instead of complex real-world ones. This work revisits plain
transformers in real-world point cloud understanding. We first take a closer
look at some fundamental components of plain transformers, e.g., patchifier and
positional embedding, for both efficiency and performance. To close the
performance gap due to the lack of inductive bias and annotated data, we
investigate self-supervised pre-training with a masked autoencoder (MAE).
Specifically, we propose drop patch, which prevents information leakage and
significantly improves the effectiveness of MAE. Our models achieve SOTA
results in semantic segmentation on the S3DIS dataset and object detection on
the ScanNet dataset with lower computational costs. Our work provides a new
baseline for future research on transformers for point clouds. |
---|---|
DOI: | 10.48550/arxiv.2303.00086 |
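
The summary names "drop patch" as a way to prevent information leakage in MAE pre-training but does not define it. The following is a minimal sketch under one possible reading: a random subset of patches is discarded first, so that neither encoder nor decoder sees them, and MAE masking is then applied to the remaining patches. The function name `split_patches`, the parameters `drop_ratio` and `mask_ratio`, and the ratio values used are hypothetical and are not taken from the record.

```python
import random


def split_patches(num_patches, drop_ratio, mask_ratio, seed=None):
    """Partition patch indices into dropped, masked, and visible lists.

    Assumption (not stated in the summary): "drop patch" discards a random
    subset of patches before masking, and the MAE mask ratio applies only
    to the patches that remain.
    """
    if not (0.0 <= drop_ratio < 1.0 and 0.0 <= mask_ratio < 1.0):
        raise ValueError("ratios must lie in [0, 1)")

    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices

    num_dropped = int(num_patches * drop_ratio)
    num_kept = num_patches - num_dropped
    num_masked = int(num_kept * mask_ratio)

    dropped = sorted(order[:num_dropped])  # seen by no part of the model
    masked = sorted(order[num_dropped:num_dropped + num_masked])  # reconstruction targets
    visible = sorted(order[num_dropped + num_masked:])  # encoder input
    return dropped, masked, visible


# Illustrative ratios only; the record gives no values.
dropped, masked, visible = split_patches(100, drop_ratio=0.4, mask_ratio=0.75, seed=0)
print(len(dropped), len(masked), len(visible))
```

In this reading, only the `visible` patches would be encoded and only the `masked` patches reconstructed, while the positions of `dropped` patches are never exposed to the model.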