Conditional DETR V2: Efficient Detection Transformer with Box Queries

In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted post-processing such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence that introduced box queries (originally called spatial queries) for the internal decoder layers, we reformulate the object query into the format of a box query: a composition of the embedding of the reference point and the transformation of the box with respect to the reference point. This reformulation reveals the connection between the object query in DETR and the anchor box widely studied in Faster R-CNN. Furthermore, we learn the box queries from the image content, further improving the detection quality of Conditional DETR while keeping its fast training convergence. In addition, we adopt the idea of axial self-attention to reduce memory cost and accelerate the encoder. The resulting detector, called Conditional DETR V2, achieves better results than Conditional DETR, uses less memory, and runs more efficiently. For example, with the DC5-ResNet-50 backbone, our approach achieves 44.8 AP at 16.4 FPS on the COCO val set; compared to Conditional DETR, it runs 1.6× faster, saves 74% of the overall memory cost, and improves the AP score by 1.0.
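
As a reading aid, here is a minimal PyTorch sketch of the box-query composition the abstract describes. It is our own illustration under stated assumptions, not the authors' code: the names `BoxQuery`, `ref_head`, `transform_head`, and `sinusoidal_embed` are hypothetical, and the sketch simply composes the positional embedding of a content-predicted reference point with a content-dependent transformation, as the abstract states.

```python
# Illustrative sketch only -- module and variable names are assumptions,
# not taken from the paper's implementation.
import math

import torch
import torch.nn as nn


def sinusoidal_embed(xy: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Standard sinusoidal embedding of 2-D reference points in [0, 1]."""
    half = dim // 2                       # channels per coordinate
    freq = 10000 ** (torch.arange(half // 2, device=xy.device) / (half // 2))
    out = []
    for coord in (xy[..., 0], xy[..., 1]):  # x first, then y
        angles = coord[..., None] * 2 * math.pi / freq
        out.append(torch.cat([angles.sin(), angles.cos()], dim=-1))
    return torch.cat(out, dim=-1)         # (..., dim)


class BoxQuery(nn.Module):
    """Hypothetical box query: the embedding of a reference point,
    modulated by a transformation predicted from image content."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Both heads read content features, matching the abstract's
        # "learn the box queries from the image content".
        self.ref_head = nn.Linear(dim, 2)           # reference point (cx, cy)
        self.transform_head = nn.Linear(dim, dim)   # per-channel box transform

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        # content: (batch, num_queries, dim) features.
        ref = self.ref_head(content).sigmoid()            # points in [0, 1]
        pos = sinusoidal_embed(ref, content.shape[-1])    # point embedding
        return self.transform_head(content) * pos         # composed box query


# Usage: 300 queries of width 256, as in typical DETR variants.
queries = BoxQuery(256)(torch.randn(2, 300, 256))
print(queries.shape)  # torch.Size([2, 300, 256])
```

The design choice this sketch mirrors is the abstract's main point: making the spatial part of the query an explicit function of a reference point and a box transformation is what links DETR's object queries to Faster R-CNN's anchor boxes.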

Bibliographic details
creator Chen, Xiaokang; Wei, Fangyun; Zeng, Gang; Wang, Jingdong
format Article
creationdate 2022-07-18
rights http://creativecommons.org/licenses/by/4.0
identifier DOI: 10.48550/arxiv.2207.08914
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Conditional DETR V2: Efficient Detection Transformer with Box Queries
url https://arxiv.org/abs/2207.08914