Cross-BERT for Point Cloud Pretraining


Bibliographic Details
Main Authors: Li, Xin; Li, Peng; Wei, Zeyong; Zhu, Zhe; Wei, Mingqiang; Hou, Junhui; Nan, Liangliang; Qin, Jing; Xie, Haoran; Wang, Fu Lee
Format: Article
Language: English
Abstract: Introducing BERT into cross-modal settings raises difficulties in its optimization for handling multiple modalities. Both the BERT architecture and training objective need to be adapted to incorporate and model information from different modalities. In this paper, we address these challenges by exploring the implicit semantic and geometric correlations between 2D and 3D data of the same objects/scenes. We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT. To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction. The first task, referred to as Point-Image Alignment, aims to align features between unimodal and cross-modal representations to capture the correspondences between the 2D and 3D modalities. The second task, termed Masked Cross-modal Modeling, further improves mask modeling of BERT by incorporating high-dimensional semantic information obtained by cross-modal interaction. By performing cross-modal interaction, Cross-BERT can smoothly reconstruct the masked tokens during pretraining, leading to notable performance enhancements for downstream tasks. Through empirical evaluation, we demonstrate that Cross-BERT outperforms existing state-of-the-art methods in 3D downstream applications. Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
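The first pretraining task, Point-Image Alignment, pulls together the embeddings of a point cloud and a 2D view of the same object while pushing apart mismatched pairs. A common way to realize such an objective is a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings; the sketch below is a minimal illustration under that assumption (the function names, shapes, and temperature value are hypothetical, not taken from the paper):

```python
import numpy as np

def _normalize(x, axis=-1):
    """L2-normalize feature vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def _softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def point_image_alignment_loss(point_feats, image_feats, temperature=0.07):
    """Illustrative symmetric InfoNCE-style alignment loss (a sketch,
    not the authors' exact formulation).

    point_feats: (B, D) embeddings of B point clouds
    image_feats: (B, D) embeddings of the paired 2D views
    Matched pairs share a batch index; all other pairs act as negatives.
    """
    p = _normalize(point_feats)
    v = _normalize(image_feats)
    logits = p @ v.T / temperature          # (B, B) cosine-similarity logits
    targets = np.arange(len(p))             # positives lie on the diagonal
    # Cross-entropy in both directions: points -> images and images -> points.
    loss_p2i = -np.log(_softmax(logits, axis=1)[targets, targets]).mean()
    loss_i2p = -np.log(_softmax(logits, axis=0)[targets, targets]).mean()
    return (loss_p2i + loss_i2p) / 2.0
```

The loss is minimized when each point cloud is most similar to its own paired image within the batch; the second task, Masked Cross-modal Modeling, builds on such aligned features by reconstructing masked point tokens with cross-modal context.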
DOI: 10.48550/arxiv.2312.04891
Published: 2023-12-08 (arXiv)
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition