Self-Supervised Lightweight Depth Estimation in Endoscopy Combining CNN and Transformer



Bibliographic Details
Published in: IEEE Transactions on Medical Imaging, May 2024, Vol. 43, No. 5, pp. 1934-1944
Authors: Yang, Zhuoyue; Pan, Junjun; Dai, Ju; Sun, Zhen; Xiao, Yi
Format: Article
Language: English
Subjects:
Online access: Order full text
Description
Abstract: In recent years, a growing number of medical engineering tasks, such as surgical navigation, pre-operative registration, and surgical robotics, have come to rely on 3D reconstruction techniques. Self-supervised depth estimation has attracted interest in endoscopic scenarios because it does not require ground-truth depth. Most existing methods improve performance by increasing the number of parameters, so designing a lightweight self-supervised model that still obtains competitive results is an active research topic. We propose a lightweight network that tightly couples a convolutional neural network (CNN) and a Transformer for depth estimation. Unlike other methods, which extract features with a CNN and a Transformer separately and fuse them only at the deepest layer, we use CNN and Transformer modules to extract features at every scale of the encoder. This hierarchical structure leverages the strengths of the CNN in texture perception and of the Transformer in shape extraction. At each scale, the CNN captures local features while the Transformer encodes global information. Finally, we add multi-head attention modules to the pose network to improve the accuracy of the predicted poses. Experiments on two datasets demonstrate that our approach obtains comparable results while substantially compressing the model parameters.
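The record itself contains no code; the PyTorch sketch below only illustrates the kind of per-scale CNN/Transformer coupling the abstract describes, with a convolutional branch for local texture and multi-head self-attention for global context at every encoder scale. The class names, channel widths, depthwise convolution, and additive fusion are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a per-scale CNN + Transformer encoder, loosely
# following the abstract: a convolutional path captures local texture while
# multi-head self-attention encodes global shape context at the same scale.
# Module names, channel sizes, and the additive fusion are illustrative only.
import torch
import torch.nn as nn


class HybridEncoderBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Local (CNN) branch: depthwise-separable conv keeps the block lightweight.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global (Transformer) branch: multi-head self-attention over flattened tokens.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)                             # local texture features
        tokens = x.flatten(2).transpose(1, 2)             # (B, H*W, C)
        tokens = self.norm(tokens)
        global_, _ = self.attn(tokens, tokens, tokens)    # global context
        global_ = global_.transpose(1, 2).reshape(b, c, h, w)
        return x + local + global_                        # fuse local and global features


class HybridEncoder(nn.Module):
    """Hierarchical encoder: downsample, then couple CNN and attention at every scale."""

    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        stages, in_c = [], 3
        for c in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(in_c, c, 3, stride=2, padding=1),  # halve the resolution
                HybridEncoderBlock(c),
            ))
            in_c = c
        self.stages = nn.ModuleList(stages)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # multi-scale features for a depth decoder with skip connections
        return feats


if __name__ == "__main__":
    feats = HybridEncoder()(torch.randn(1, 3, 96, 128))
    print([tuple(f.shape) for f in feats])
```

In the same spirit, the multi-head attention added to the pose network could reuse `nn.MultiheadAttention` on the deepest pose features before the 6-DoF regression head; that choice is likewise an assumption rather than the published design.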
ISSN: 0278-0062
eISSN: 1558-254X
DOI: 10.1109/TMI.2024.3352390