Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos
Despite advancements in robotic systems and surgical data science, ensuring safe execution in robot-assisted minimally invasive surgery (RMIS) remains challenging. Current methods for surgical error detection typically involve two parts: identifying gestures and then detecting errors within each gesture clip. These methods often overlook the rich contextual and semantic information inherent in surgical videos, and their performance is limited by the reliance on accurate gesture identification. Inspired by chain-of-thought prompting in natural language processing, this letter presents a novel, real-time, end-to-end error detection framework, Chain-of-Gesture (COG) prompting, which integrates contextual information from surgical videos step by step. It comprises two reasoning modules that simulate expert surgeons' decision-making: a Gestural-Visual Reasoning module that uses transformer and attention architectures for gesture prompting, and a Multi-Scale Temporal Reasoning module that employs a multi-stage temporal convolutional network with slow and fast paths for temporal information extraction. We validate our method on the JIGSAWS dataset and show improvements over the state of the art, achieving a 4.6% higher F1 score, 4.6% higher Accuracy, and 5.9% higher Jaccard index, with an average frame processing time of 6.69 milliseconds. This demonstrates our approach's potential to enhance RMIS safety and surgical education efficacy.
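To make the abstract's two-module pipeline concrete, here is a minimal, illustrative sketch of the COG idea: per-frame visual features are first refined by attention over a set of learned gesture prompts (Gestural-Visual Reasoning), then passed through dilated temporal convolutions with fast (short-range) and slow (long-range) paths (Multi-Scale Temporal Reasoning) before a per-frame error head. All names, dimensions, dilation choices, and the residual fusion are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a Chain-of-Gesture-style pipeline; all hyperparameters assumed.
import torch
import torch.nn as nn

class GesturalVisualReasoning(nn.Module):
    """Attend from frame features to learned gesture-prompt embeddings (assumed design)."""
    def __init__(self, feat_dim=256, num_gestures=15):
        super().__init__()
        # Hypothetical learnable gesture prompts (JIGSAWS defines 15 gestures).
        self.prompts = nn.Parameter(torch.randn(num_gestures, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        prompts = self.prompts.unsqueeze(0).expand(frames.size(0), -1, -1)
        ctx, _ = self.attn(frames, prompts, prompts)    # gesture-conditioned context
        return frames + ctx                             # residual fusion (assumed)

class TemporalStage(nn.Module):
    """One dilated-TCN stage; the dilation sets the temporal scale of the path."""
    def __init__(self, dim, dilation):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x):                               # x: (B, dim, T), length-preserving
        return x + torch.relu(self.conv(x))

class COGSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        self.gvr = GesturalVisualReasoning(feat_dim)
        self.fast = nn.Sequential(*[TemporalStage(feat_dim, d) for d in (1, 2)])  # short-range
        self.slow = nn.Sequential(*[TemporalStage(feat_dim, d) for d in (4, 8)])  # long-range
        self.head = nn.Linear(2 * feat_dim, num_classes)  # per-frame error / no-error

    def forward(self, frames):                          # frames: (B, T, feat_dim)
        x = self.gvr(frames).transpose(1, 2)            # (B, feat_dim, T)
        multi = torch.cat([self.fast(x), self.slow(x)], dim=1)
        return self.head(multi.transpose(1, 2))         # (B, T, num_classes)

logits = COGSketch()(torch.randn(2, 100, 256))          # two clips, 100 frames each
print(logits.shape)                                      # torch.Size([2, 100, 2])
```

The small dilations in the fast path keep the receptive field tight for brief slips, while the larger dilations in the slow path aggregate longer context; concatenating both is one simple way to realize the "slow and fast paths" the abstract describes.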
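The reported gains are in frame-level F1, Accuracy, and Jaccard index. A hedged sketch of how these metrics are computed for binary error detection follows; the exact averaging and the paper's JIGSAWS evaluation protocol are assumptions here.

```python
# Frame-level binary metrics sketch (1 = error frame, 0 = normal frame).
import numpy as np

def frame_metrics(pred, target):
    tp = np.sum((pred == 1) & (target == 1))   # true positives
    fp = np.sum((pred == 1) & (target == 0))   # false positives
    fn = np.sum((pred == 0) & (target == 1))   # false negatives
    accuracy = np.mean(pred == target)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return accuracy, f1, jaccard

acc, f1, jac = frame_metrics(np.array([1, 0, 1, 1, 0]), np.array([1, 0, 0, 1, 0]))
print(f"Acc={acc:.2f}  F1={f1:.2f}  Jaccard={jac:.2f}")  # Acc=0.80  F1=0.80  Jaccard=0.67
```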
Published in: | IEEE robotics and automation letters, 2024-12, Vol. 9 (12), p. 11513-11520 |
---|---|
Main authors: | Shao, Zhimin; Xu, Jialang; Stoyanov, Danail; Mazomenos, Evangelos B.; Jin, Yueming |
Format: | Article |
Language: | English |
Subjects: | Cognition; Computer vision for medical robotics; Kinematics; prompt engineering; Real-time systems; Robots; Semantics; Surgery; surgical error detection; Training; Transformers; video-language learning; Videos; Visualization |
Publisher: | IEEE |
ISSN/EISSN: | 2377-3766 |
CODEN: | IRALC6 |
DOI: | 10.1109/LRA.2024.3495452 |
Author ORCIDs: | 0000-0003-2324-7033; 0000-0003-0357-5996; 0000-0002-3078-0939; 0000-0003-3775-3877; 0000-0002-0980-3227 |
Online access: | Order full text |