MetaFuse: A Pre-trained Fusion Model for Human Pose Estimation

Cross-view feature fusion is key to addressing the occlusion problem in human pose estimation. Current fusion methods need to train a separate model for every pair of cameras, which makes them difficult to scale. In this work, we introduce MetaFuse, a pre-trained fusion model learned from a large number of cameras in the Panoptic dataset. The model can be efficiently adapted or finetuned for a new pair of cameras using a small number of labeled images. The strong adaptation power of MetaFuse is due in large part to the proposed factorization of the original fusion model into two parts: (1) a generic fusion model shared by all cameras, and (2) lightweight camera-dependent transformations. Furthermore, the generic model is learned from many cameras by a meta-learning-style algorithm to maximize its adaptation capability to various camera poses. Our experiments show that MetaFuse, finetuned on the public datasets, outperforms the state of the art by a large margin, which validates its practical value.
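The factorization the abstract describes can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch illustration, not the paper's actual code: the class and parameter names (GenericFusion, CameraAdapter, theta) and the choice of a 1x1 convolution for mixing are assumptions. It shows a generic fusion module whose weights are shared by all cameras, composed with a per-camera-pair transformation small enough to finetune from a handful of labeled images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenericFusion(nn.Module):
    """Generic part, shared by all camera pairs: mixes a view's own
    heatmap with a heatmap warped over from the other view."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, own: torch.Tensor, warped: torch.Tensor) -> torch.Tensor:
        return self.mix(torch.cat([own, warped], dim=1))

class CameraAdapter(nn.Module):
    """Camera-dependent part: a lightweight 2D affine warp. Its six
    parameters are all that must be adapted for a new camera pair."""
    def __init__(self):
        super().__init__()
        # Start at the identity transform.
        self.theta = nn.Parameter(torch.tensor([[1.0, 0.0, 0.0],
                                                [0.0, 1.0, 0.0]]))

    def forward(self, other_view: torch.Tensor) -> torch.Tensor:
        n = other_view.size(0)
        grid = F.affine_grid(self.theta.expand(n, -1, -1),
                             list(other_view.size()), align_corners=False)
        return F.grid_sample(other_view, grid, align_corners=False)

# Fusing 17-joint heatmaps from two 64x64 views:
fusion, adapter = GenericFusion(17), CameraAdapter()
own, other = torch.rand(2, 17, 64, 64), torch.rand(2, 17, 64, 64)
fused = fusion(own, adapter(other))   # shape: (2, 17, 64, 64)
```

Because only the adapter is camera-specific, adding a new camera pair costs a few parameters rather than a whole fusion network, which is the scaling advantage the abstract claims.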

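The meta-learning-style training of the generic model can likewise be sketched. Below is a hedged, first-order (MAML-like) approximation that reuses GenericFusion and CameraAdapter from the sketch above; the function names, the MSE heatmap loss, and the support/query task layout are illustrative assumptions, not details taken from the paper.

```python
import copy
import torch
import torch.nn.functional as F

def inner_adapt(adapter, fusion, batch, lr=1e-2, steps=5):
    """Simulate deployment: finetune a copy of one camera pair's
    adapter on a few labeled images, keeping the fusion model fixed."""
    adapter = copy.deepcopy(adapter)
    for _ in range(steps):
        own, other, target = batch            # heatmaps + ground truth
        loss = F.mse_loss(fusion(own, adapter(other)), target)
        # Differentiate w.r.t. the adapter only; the shared fusion
        # weights receive no gradient from the inner loop.
        grads = torch.autograd.grad(loss, list(adapter.parameters()))
        with torch.no_grad():
            for p, g in zip(adapter.parameters(), grads):
                p -= lr * g
    return adapter

def meta_train_step(fusion, adapters, tasks, meta_opt):
    """Outer loop: update the generic fusion weights so that a few
    inner steps on any camera pair's adapter already give low error
    (a first-order approximation of the meta-objective)."""
    meta_opt.zero_grad()
    for pair, (support, query) in tasks.items():
        adapted = inner_adapt(adapters[pair], fusion, support)
        own, other, target = query
        F.mse_loss(fusion(own, adapted(other)), target).backward()
    meta_opt.step()                           # steps fusion.parameters()
```

Training across many Panoptic camera pairs in this fashion pushes the shared weights toward an initialization that any new camera pair can adapt from with only a few labeled images.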
Bibliographic Details

Main Authors: Xie, Rongchang; Wang, Chunyu; Wang, Yizhou
Format: Article
Language: English
Published: 2020-03-30
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition
DOI: 10.48550/arxiv.2003.13239
Online Access: https://arxiv.org/abs/2003.13239