ADAGENT: Anomaly Detection Agent With Multimodal Large Models in Adverse Environments

Multimodal Language Models (MMLMs), such as LLaVA and GPT-4V, have shown zero-shot generalization capabilities for understanding images and text across various domains. However, their effectiveness in open-world visual tasks, particularly anomaly detection under challenging conditions such as low light or poor image quality, has yet to be thoroughly investigated.

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE Access, 2024, Vol. 12, p. 172061-172074
Main Authors: Zhang, Miao; Shen, Yiqing; Yin, Jun; Lu, Shuai; Wang, Xueqian
Format: Article
Language: English
Subjects:
Online Access: Full text
container_end_page 172074
container_issue
container_start_page 172061
container_title IEEE access
container_volume 12
creator Zhang, Miao
Shen, Yiqing
Yin, Jun
Lu, Shuai
Wang, Xueqian
description Multimodal Language Models (MMLMs), such as LLaVA and GPT-4V, have shown zero-shot generalization capabilities for understanding images and text across various domains. However, their effectiveness in open-world visual tasks, particularly anomaly detection under challenging conditions such as low light or poor image quality, has yet to be thoroughly investigated. Assessing the robustness and limitations of MMLMs in these scenarios is essential to ensuring their reliability and safety in real-world applications, where input image quality can vary significantly. To address this gap, we propose a benchmark comprising 460 images captured under challenging conditions, including low light and blurring, specifically designed to evaluate the anomaly detection capabilities of MMLMs. We assess the performance of state-of-the-art MMLMs, including Qwen-VL-Max-0809, GPT-4V, Gemini-1.5, Claude3-opus, ERNIE-Bot-4, and SparkDesk-v3.5, across six diverse scenes. Our evaluations indicate that these MMLMs struggle with error detection in adverse scenarios, highlighting the need for further investigation into the underlying causes and potential improvement strategies. To tackle these limitations, we introduce the Anomaly Detection Agent (ADAGENT), an AI agent framework that combines a "Chain of Critical Self-Reflection (CCS)", specialized toolsets, and "Heuristic Retrieval-Augmented Generation (RAG)" to enhance anomaly detection performance with MMLMs. ADAGENT sequentially evaluates abilities such as text generation, semantic understanding, contextual comprehension, key information extraction, reasoning, and logical thinking. By implementing this framework, we demonstrate a 15% to 30% improvement in top-3 accuracy for anomaly detection tasks under adverse conditions, compared with baseline approaches.
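The paper's implementation is not reproduced in this record, but the abstract implies a control flow: retrieval-augmented hints condition the first query, and a critical self-reflection loop escalates to image-enhancement tools when the model reports low confidence. Below is a minimal Python sketch of such a loop; every name (query_mmlm, enhance_image, retrieve_hints, MAX_REFLECTIONS, the prompt wording) is an illustrative assumption, not the authors' API.

# Hypothetical ADAGENT-style detection loop; all helpers are placeholders.
from dataclasses import dataclass

MAX_REFLECTIONS = 3  # assumed bound on the self-reflection loop


@dataclass
class Verdict:
    anomalies: list[str]   # ranked candidate anomalies
    confident: bool        # model's self-assessed confidence


def query_mmlm(image_path: str, prompt: str) -> Verdict:
    """Placeholder for a call to any multimodal model (GPT-4V, Qwen-VL, ...)."""
    raise NotImplementedError


def enhance_image(image_path: str) -> str:
    """Placeholder for the specialized toolset, e.g. low-light enhancement
    or deblurring applied before re-querying the model."""
    raise NotImplementedError


def retrieve_hints(scene: str) -> str:
    """Placeholder for heuristic RAG: fetch scene-specific anomaly heuristics
    from a small curated knowledge base."""
    raise NotImplementedError


def detect_anomaly(image_path: str, scene: str) -> list[str]:
    # Heuristic RAG: ground the first query in scene-specific knowledge.
    hints = retrieve_hints(scene)
    prompt = f"List anomalies in this {scene} scene. Known heuristics: {hints}"
    verdict = query_mmlm(image_path, prompt)

    # Chain of Critical Self-Reflection: re-examine low-confidence answers,
    # invoking an image-enhancement tool before each re-query.
    for _ in range(MAX_REFLECTIONS):
        if verdict.confident:
            break
        image_path = enhance_image(image_path)
        critique = f"Critically re-check your answer {verdict.anomalies}. {prompt}"
        verdict = query_mmlm(image_path, critique)
    return verdict.anomalies[:3]  # top-3 candidates, matching the paper's metric


def top3_accuracy(predictions: list[list[str]], labels: list[str]) -> float:
    """Top-3 accuracy: fraction of images whose true anomaly appears among
    the three returned candidates."""
    hits = sum(label in preds[:3] for preds, label in zip(predictions, labels))
    return hits / len(labels)

Under these assumptions, top3_accuracy is the quantity the abstract reports improving by 15% to 30% over baselines that query the MMLM once without retrieval or reflection.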
doi_str_mv 10.1109/ACCESS.2024.3480250
format Article
fulltext fulltext
identifier ISSN: 2169-3536
ispartof IEEE access, 2024, Vol.12, p.172061-172074
issn 2169-3536
2169-3536
language eng
recordid cdi_proquest_journals_3131913658
source Directory of Open Access Journals; IEEE Xplore Open Access Journals; EZB Electronic Journals Library
subjects Accuracy
AI agent
Anomalies
Anomaly detection
Artificial intelligence
Benchmark testing
Benchmarks
Cognition
Context modeling
Error analysis
Error detection
Feature extraction
Image quality
Information retrieval
Lighting
Multimodal language model
Multisensory integration
Performance evaluation
Prompt engineering
Semantics
Training
Visual tasks
Visualization
title ADAGENT: Anomaly Detection Agent With Multimodal Large Models in Adverse Environments