Temporal-based Swin Transformer network for workflow recognition of surgical video
Published in: International journal for computer assisted radiology and surgery, 2023-01, Vol. 18 (1), pp. 139-147
Format: Article
Language: English
Online access: Full text
Abstract
Purpose
Surgical workflow recognition has emerged as an important, and very challenging, component of computer-assisted intervention systems for the modern operating room. Although CNN-based approaches achieve excellent performance, the inductive bias inherent in convolution prevents them from learning global, long-range semantic interactions well.
Methods
In this paper, we propose a temporal-based Swin Transformer network (TSTNet) for the surgical video workflow recognition task. TSTNet contains two main parts: the Swin Transformer and the LSTM. The Swin Transformer uses the attention mechanism to encode long-range spatial dependencies and learn highly expressive representations. The LSTM can learn long-range temporal dependencies and is used to extract temporal information. TSTNet combines the two components to extract spatiotemporal features that carry richer contextual information. In particular, building on the natural structure of surgical video, we propose a priori revision algorithm (PRA) that exploits prior knowledge of the order of surgical phases. This strategy optimizes the output of TSTNet and further improves recognition performance.
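The abstract does not include code, so the following PyTorch sketch is only a rough illustration of the Swin-plus-LSTM design and of the phase-ordering idea behind PRA. The backbone choice (torchvision's swin_t), the hidden size, the seven-phase Cholec80 label space, and the monotonic revision heuristic are all assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch, not the authors' code: torchvision's swin_t stands in for the
# Swin backbone; hidden size, 7-phase label space, and the monotonic revision
# heuristic are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import swin_t


class SwinLSTM(nn.Module):
    """Per-frame Swin features fed through an LSTM for temporal context."""

    def __init__(self, num_phases: int = 7, hidden_size: int = 512):
        super().__init__()
        backbone = swin_t(weights=None)        # spatial encoder for each frame
        feat_dim = backbone.head.in_features   # 768 for swin_t
        backbone.head = nn.Identity()          # expose pooled features
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_phases)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W) -> logits: (batch, time, num_phases)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        temporal, _ = self.lstm(feats)         # long-range temporal modelling
        return self.classifier(temporal)


def revise_with_phase_order(preds: list[int]) -> list[int]:
    """Toy stand-in for PRA: assuming phase labels follow their typical
    chronological order, overwrite predictions that jump backwards."""
    revised = list(preds)
    for i in range(1, len(revised)):
        if revised[i] < revised[i - 1]:
            revised[i] = revised[i - 1]
    return revised


if __name__ == "__main__":
    model = SwinLSTM()
    logits = model(torch.randn(2, 8, 3, 224, 224))  # 2 clips of 8 frames
    print(logits.shape)                             # torch.Size([2, 8, 7])
    print(revise_with_phase_order([0, 1, 1, 0, 2, 3, 2, 4]))
```

In the paper, PRA corrects TSTNet's outputs using prior information about the surgical-phase sequence; the monotonic overwrite above is only one simple way such an ordering prior could be encoded.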
Results
We conduct extensive experiments on the Cholec80 dataset to validate the effectiveness of the TSTNet-PRA method. Our method achieves excellent performance on Cholec80, with an accuracy of up to 92.8%, substantially exceeding state-of-the-art methods.
Conclusion
By modelling long-range temporal information and multi-scale visual information, we propose the TSTNet-PRA method. Evaluated on a large public dataset, it shows a recognition capability superior to other spatiotemporal networks.
ISSN: 1861-6410, 1861-6429
DOI: 10.1007/s11548-022-02785-y