Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos
Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip. These methods often overlook the rich contextual and semantic information inherent in surgical videos, and their performance is limited by the reliance on accurate gesture identification. Inspired by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, which integrates contextual information from surgical videos step by step. It comprises two reasoning modules that simulate expert surgeons' decision-making: a Gestural-Visual Reasoning module that uses transformer and attention architectures for gesture prompting, and a Multi-Scale Temporal Reasoning module that employs a multi-stage temporal convolutional network with slow and fast paths for temporal information extraction. We validate our method on the JIGSAWS dataset and show improvements over the state of the art, achieving a 4.6% higher F1 score, 4.6% higher Accuracy, and 5.9% higher Jaccard index, with an average frame processing time of 6.69 milliseconds. This demonstrates our approach's potential to enhance RMIS safety and surgical education efficacy.
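To make the abstract's two-module pipeline concrete, here is a minimal, illustrative sketch of the COG idea: per-frame visual features are first refined by attention over a set of learned gesture prompts (Gestural-Visual Reasoning), then passed through dilated temporal convolutions with fast (short-range) and slow (long-range) paths (Multi-Scale Temporal Reasoning) before a per-frame error head. All names, dimensions, dilation choices, and the residual fusion are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a Chain-of-Gesture-style pipeline; all hyperparameters assumed.
import torch
import torch.nn as nn

class GesturalVisualReasoning(nn.Module):
    """Attend from frame features to learned gesture-prompt embeddings (assumed design)."""
    def __init__(self, feat_dim=256, num_gestures=15):
        super().__init__()
        # Hypothetical learnable gesture prompts (JIGSAWS defines 15 gestures).
        self.prompts = nn.Parameter(torch.randn(num_gestures, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        prompts = self.prompts.unsqueeze(0).expand(frames.size(0), -1, -1)
        ctx, _ = self.attn(frames, prompts, prompts)    # gesture-conditioned context
        return frames + ctx                             # residual fusion (assumed)

class TemporalStage(nn.Module):
    """One dilated-TCN stage; the dilation sets the temporal scale of the path."""
    def __init__(self, dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                               # x: (B, dim, T), length-preserving
        return x + torch.relu(self.conv(x))

class COGSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        self.gvr = GesturalVisualReasoning(feat_dim)
        self.fast = nn.Sequential(*[TemporalStage(feat_dim, d) for d in (1, 2)])  # short-range
        self.slow = nn.Sequential(*[TemporalStage(feat_dim, d) for d in (4, 8)])  # long-range
        self.head = nn.Linear(2 * feat_dim, num_classes)  # per-frame error / no-error

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        x = self.gvr(frames).transpose(1, 2)            # (B, feat_dim, T)
        multi = torch.cat([self.fast(x), self.slow(x)], dim=1)
        return self.head(multi.transpose(1, 2))         # (B, T, num_classes)

logits = COGSketch()(torch.randn(2, 100, 256))          # two clips, 100 frames each
print(logits.shape)                                      # torch.Size([2, 100, 2])
```

The small dilations in the fast path keep the receptive field tight for brief slips, while the larger dilations in the slow path aggregate longer context; concatenating both is one simple way to realize the "slow and fast paths" the abstract describes.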
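The reported gains are in frame-level F1, Accuracy, and Jaccard index. A hedged sketch of how these metrics are computed for binary error detection follows; the exact averaging and the paper's JIGSAWS evaluation protocol are assumptions here.

```python
# Frame-level binary metrics sketch (1 = error frame, 0 = normal frame).
import numpy as np

def frame_metrics(pred, target):
    tp = np.sum((pred == 1) & (target == 1))   # true positives
    fp = np.sum((pred == 1) & (target == 0))   # false positives
    fn = np.sum((pred == 0) & (target == 1))   # false negatives
    accuracy = np.mean(pred == target)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return accuracy, f1, jaccard

acc, f1, jac = frame_metrics(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 0]))
print(f"Acc={acc:.2f}  F1={f1:.2f}  Jaccard={jac:.2f}")  # Acc=0.80  F1=0.80  Jaccard=0.67
```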
Published in: | IEEE robotics and automation letters, 2024-12, Vol. 9 (12), p. 11513-11520 |
---|---|
Main authors: | Shao, Zhimin; Xu, Jialang; Stoyanov, Danail; Mazomenos, Evangelos B.; Jin, Yueming |
Format: | Article |
Language: | English |
Subjects: | Cognition; Computer vision for medical robotics; Kinematics; prompt engineering; Real-time systems; Robots; Semantics; Surgery; surgical error detection; Training; Transformers; video-language learning; Videos; Visualization |
Publisher: | IEEE |
ISSN/EISSN: | 2377-3766 |
CODEN: | IRALC6 |
DOI: | 10.1109/LRA.2024.3495452 |
Author ORCIDs: | 0000-0003-2324-7033; 0000-0003-0357-5996; 0000-0002-3078-0939; 0000-0003-3775-3877; 0000-0002-0980-3227 |
Online access: | Order full text |