Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos

Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip.

Detailed Description

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, 2024-12, Vol. 9 (12), p. 11513-11520
Authors: Shao, Zhimin; Xu, Jialang; Stoyanov, Danail; Mazomenos, Evangelos B.; Jin, Yueming
Format: Article
Language: English
Subjects:
Online access: Order full text
Description: Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip. These methods often overlook the rich contextual and semantic information inherent in surgical videos, with limited performance due to reliance on accurate gesture identification. Inspired by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, integrating contextual information from surgical videos step by step. It comprises two reasoning modules that simulate expert surgeons' decision-making: a Gestural-Visual Reasoning module using transformer and attention architectures for gesture prompting, and a Multi-Scale Temporal Reasoning module employing a multi-stage temporal convolutional network with slow and fast paths for temporal information extraction. We validate our method on the JIGSAWS dataset and show improvements over the state of the art, achieving a 4.6% higher F1 score, 4.6% higher Accuracy, and 5.9% higher Jaccard index, with an average frame processing time of 6.69 milliseconds. This demonstrates our approach's potential to enhance RMIS safety and surgical education efficacy.
DOI: 10.1109/LRA.2024.3495452
ISSN: 2377-3766
EISSN: 2377-3766
Record ID: cdi_ieee_primary_10750058
Source: IEEE Electronic Library (IEL)
Subjects:
Cognition
Computer vision for medical robotics
Kinematics
prompt engineering
Real-time systems
Robots
Semantics
Surgery
surgical error detection
Training
Transformers
video-language learning
Videos
Visualization
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T23%3A41%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Think%20Step%20by%20Step:%20Chain-of-Gesture%20Prompting%20for%20Error%20Detection%20in%20Robotic%20Surgical%20Videos&rft.jtitle=IEEE%20robotics%20and%20automation%20letters&rft.au=Shao,%20Zhimin&rft.date=2024-12&rft.volume=9&rft.issue=12&rft.spage=11513&rft.epage=11520&rft.pages=11513-11520&rft.issn=2377-3766&rft.eissn=2377-3766&rft.coden=IRALC6&rft_id=info:doi/10.1109/LRA.2024.3495452&rft_dat=%3Ccrossref_RIE%3E10_1109_LRA_2024_3495452%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10750058&rfr_iscdi=true