An enhanced vision transformer with wavelet position embedding for histopathological image classification
Saved in:
Published in: Pattern Recognition, 2023-08, Vol. 140, p. 109532, Article 109532
Main authors: , , , , ,
Format: Article
Language: English
Subjects:
Online access: Full text
Abstract:

• An enhanced vision transformer with wavelet position embedding is proposed to classify histopathological images.
• A wavelet position embedding module is employed to relieve the aliasing caused by downsampling operations.
• An external multi-head attention is proposed to replace self-attention in the transformer block, reducing parameters and computational cost while mining potential correlations between different samples.
• The proposed method has fewer parameters and lower FLOPs, and also achieves improved classification performance on histopathological images.
Histopathological image classification is a fundamental task in the pathological diagnosis workflow. It remains highly challenging due to the complexity of histopathological images. Recently, hybrid methods combining convolutional neural networks (CNNs) with vision transformers (ViTs) have been proposed for this field. These methods represent global and local contextual information well and achieve excellent classification performance. However, downsampling operations such as max-pooling, which ignore the sampling theorem, transmit jagged artifacts into the transformer and can lead to aliasing. This causes subsequent feature maps to focus on incorrect regions and degrades the final classification results. In this work, we propose an enhanced vision transformer with wavelet position embedding to tackle this challenge. In particular, a wavelet position embedding module, which introduces the wavelet transform into position embedding, is employed to smooth discontinuous feature information by decomposing sequences in pathological feature maps into amplitude and phase. In addition, an external multi-head attention built from two linear layers is proposed to replace self-attention in the transformer block. It reduces the computational cost and mines potential correlations between different samples. We evaluate the proposed method on three challenging public histopathological classification datasets and perform a quantitative comparison with previous state-of-the-art methods. The results empirically demonstrate that our method achieves the best accuracy while having the fewest parameters and very low FLOPs. In conclusion, the enhanced vision transformer shows high classification performance and significant potential for assisting pathologists in pathological diagnosis.
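The external attention described in the abstract replaces query–key–value self-attention with two small learnable memory matrices and two linear maps, making the cost linear in the number of tokens. A minimal single-head sketch in NumPy, following the double-normalization scheme of the general external-attention formulation; the memory size `d_mem`, the token count, and the normalization details are illustrative assumptions, not values from this paper:

```python
import numpy as np

def external_attention(x, m_k, m_v):
    """Single-head external attention (illustrative sketch).

    x   : (n, d)      token features for one sample
    m_k : (d_mem, d)  learnable key memory, shared across all samples
    m_v : (d_mem, d)  learnable value memory, shared across all samples
    """
    attn = x @ m_k.T                                   # (n, d_mem) similarities
    # double normalization: softmax over tokens, then l1-norm over memory slots
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)      # column-wise softmax
    attn = attn / (attn.sum(axis=1, keepdims=True) + 1e-9)
    return attn @ m_v                                  # (n, d) attended output

rng = np.random.default_rng(0)
n, d, d_mem = 196, 64, 32                              # e.g. 14x14 patches
x = rng.standard_normal((n, d))
out = external_attention(x,
                         rng.standard_normal((d_mem, d)),
                         rng.standard_normal((d_mem, d)))
print(out.shape)                                       # (196, 64)
```

The cost is O(n·d_mem·d) rather than the O(n²·d) of self-attention, and because `m_k` and `m_v` are learned over the whole training set rather than computed per sample, the attention map can reflect correlations between different samples, as the abstract notes.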
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2023.109532