MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.

Detailed Description

Bibliographic Details
Main Authors: Jiang, Zeren, Guo, Chen, Kaufmann, Manuel, Jiang, Tianjian, Valentin, Julien, Hilliges, Otmar, Song, Jie
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Jiang, Zeren; Guo, Chen; Kaufmann, Manuel; Jiang, Tianjian; Valentin, Julien; Hilliges, Otmar; Song, Jie
description We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos.
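The layer-wise differentiable volume rendering mentioned in the description can be pictured as standard volume-rendering quadrature applied to a scene whose density and color at each ray sample are composited from several layers (one per person, plus the background). The sketch below is illustrative only: the function name, array shapes, and the density-weighted color merge are assumptions for a minimal example, not the paper's actual implementation.

```python
import numpy as np

def composite_layers(densities, colors, deltas):
    """Render one camera ray through a layered scene (illustrative sketch).

    densities: (L, S) per-layer density at S samples along the ray
    colors:    (L, S, 3) per-layer radiance at those samples
    deltas:    (S,) distances between consecutive samples
    Returns the rendered RGB value for the ray.
    """
    # Merge layers at each sample: total density is the sum of the layer
    # densities, and the merged color is their density-weighted mean.
    sigma = densities.sum(axis=0)                       # (S,)
    w = densities / np.clip(sigma, 1e-8, None)          # (L, S)
    color = (w[..., None] * colors).sum(axis=0)         # (S, 3)

    # Standard volume-rendering quadrature: opacity per sample, then
    # transmittance accumulated front to back.
    alpha = 1.0 - np.exp(-sigma * deltas)               # (S,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans                             # (S,)
    return (weights[:, None] * color).sum(axis=0)       # rendered RGB
```

Because every step is differentiable in the layer densities and colors, gradients from a photometric loss can flow back into each per-person model separately, which is what makes per-layer supervision from a single composited video possible.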
doi_str_mv 10.48550/arxiv.2406.01595
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2406.01595
language eng
recordid cdi_arxiv_primary_2406_01595
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T03%3A47%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MultiPly:%20Reconstruction%20of%20Multiple%20People%20from%20Monocular%20Video%20in%20the%20Wild&rft.au=Jiang,%20Zeren&rft.date=2024-06-03&rft_id=info:doi/10.48550/arxiv.2406.01595&rft_dat=%3Carxiv_GOX%3E2406_01595%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true