Rethinking Self-Attention for Multispectral Object Detection

Bibliographic details
Published in: IEEE Transactions on Intelligent Transportation Systems, 2024-11, Vol. 25 (11), pp. 16300-16311
Authors: Hu, Sijie; Bonardi, Fabien; Bouchafa, Samia; Prendinger, Helmut; Sidibe, Desire
Format: Article
Language: English
Description: Data from different modalities, such as infrared and visible images, can offer complementary information, and integrating such information can significantly enhance the capabilities of a system to perceive and recognize its surroundings. Thus, multi-modal object detection has widespread applications, particularly in challenging weather conditions like low-light scenarios. The core of multi-modal fusion lies in developing a reasonable fusion strategy, which can fully exploit the complementary features of different modalities while preventing a significant increase in model complexity. To this end, this paper proposes a novel lightweight cross-fusion module named Channel-Patch Cross Fusion (CPCF), which leverages Channel-wise Cross-Attention (CCA), Patch-wise Cross-Attention (PCA) and Adaptive Gating (AG) to encourage mutual rectification among different modalities. This process simultaneously explores commonalities across modalities while maintaining the uniqueness of each modality. Furthermore, we design a versatile intermediate fusion framework that can leverage CPCF to enhance the performance of multi-modal object detection. The proposed method is extensively evaluated on multiple public multi-modal datasets, namely FLIR, LLVIP, and DroneVehicle. The experiments indicate that our method yields consistent performance gains across various benchmarks and can be extended to different types of detectors, further demonstrating its robustness and generalizability. Our codes are available at https://github.com/Superjie13/CPCF_Multispectral.
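
The abstract describes a fusion scheme built from channel-wise cross-attention, patch-wise cross-attention, and adaptive gating between two modality streams. The sketch below is a minimal, hypothetical PyTorch rendering of that idea; the module names, tensor shapes, and gating formulation are assumptions made for illustration, not the authors' implementation, which is available at the GitHub link above.

# Minimal, hypothetical sketch of channel-wise / patch-wise cross-attention fusion
# with an adaptive gate, loosely following the CPCF description in the abstract.
# All names and shapes are assumptions; the official code is at
# https://github.com/Superjie13/CPCF_Multispectral.
import torch
import torch.nn as nn


class ChannelCrossAttention(nn.Module):
    # Attention computed across the channel axis: queries come from one modality,
    # keys/values from the other, so each channel is rectified by the other stream.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, x, y):
        # x, y: (B, N, C) patch-token sequences from two modalities (e.g. visible and IR).
        q = self.q(x).transpose(1, 2)                      # (B, C, N)
        k = self.k(y).transpose(1, 2)                      # (B, C, N)
        v = self.v(y).transpose(1, 2)                      # (B, C, N)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)  # (B, C, C)
        return (attn @ v).transpose(1, 2)                  # back to (B, N, C)


class PatchCrossAttention(nn.Module):
    # Standard cross-attention over spatial patches/tokens.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, y):
        out, _ = self.attn(query=x, key=y, value=y)        # (B, N, C)
        return out


class CrossFusionBlock(nn.Module):
    # Fuses the two attention paths with a learned, element-wise gate.
    def __init__(self, dim):
        super().__init__()
        self.cca = ChannelCrossAttention(dim)
        self.pca = PatchCrossAttention(dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, rgb, ir):
        cca_out = self.cca(rgb, ir)
        pca_out = self.pca(rgb, ir)
        g = torch.sigmoid(self.gate(torch.cat([cca_out, pca_out], dim=-1)))
        # Gated mix of the two rectified features, added back to the visible stream.
        return rgb + g * cca_out + (1.0 - g) * pca_out


if __name__ == "__main__":
    rgb = torch.randn(2, 196, 256)      # (batch, patches, channels)
    ir = torch.randn(2, 196, 256)
    print(CrossFusionBlock(256)(rgb, ir).shape)            # torch.Size([2, 196, 256])

In this sketch the gate learns, per position and channel, how much of the channel-rectified versus patch-rectified signal to mix back into the visible-stream features.
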
DOI: 10.1109/TITS.2024.3412417
ISSN: 1524-9050
EISSN: 1558-0016
Source: IEEE Electronic Library (IEL)
Subjects: attention; Complexity theory; Computer Science; Deep learning; Feature extraction; Infrared imaging; intermediate fusion; Multispectral; Multispectral imaging; Object detection; Robustness; YOLO
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-08T08%3A13%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-hal_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Rethinking%20Self-Attention%20for%20Multispectral%20Object%20Detection&rft.jtitle=IEEE%20transactions%20on%20intelligent%20transportation%20systems&rft.au=Hu,%20Sijie&rft.date=2024-11-01&rft.volume=25&rft.issue=11&rft.spage=16300&rft.epage=16311&rft.pages=16300-16311&rft.issn=1524-9050&rft.eissn=1558-0016&rft.coden=ITISFG&rft_id=info:doi/10.1109/TITS.2024.3412417&rft_dat=%3Chal_RIE%3Eoai_HAL_hal_04620359v1%3C/hal_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10565297&rfr_iscdi=true