Cross-BERT for Point Cloud Pretraining


Bibliographic Details
Main Authors: Li, Xin; Li, Peng; Wei, Zeyong; Zhu, Zhe; Wei, Mingqiang; Hou, Junhui; Nan, Liangliang; Qin, Jing; Xie, Haoran; Wang, Fu Lee
Format: Article
Language: English
Abstract: Introducing BERT into cross-modal settings raises difficulties in its optimization for handling multiple modalities. Both the BERT architecture and training objective need to be adapted to incorporate and model information from different modalities. In this paper, we address these challenges by exploring the implicit semantic and geometric correlations between 2D and 3D data of the same objects/scenes. We propose a new cross-modal BERT-style self-supervised learning paradigm, called Cross-BERT. To facilitate pretraining for irregular and sparse point clouds, we design two self-supervised tasks to boost cross-modal interaction. The first task, referred to as Point-Image Alignment, aims to align features between unimodal and cross-modal representations to capture the correspondences between the 2D and 3D modalities. The second task, termed Masked Cross-modal Modeling, further improves mask modeling of BERT by incorporating high-dimensional semantic information obtained by cross-modal interaction. By performing cross-modal interaction, Cross-BERT can smoothly reconstruct the masked tokens during pretraining, leading to notable performance enhancements for downstream tasks. Through empirical evaluation, we demonstrate that Cross-BERT outperforms existing state-of-the-art methods in 3D downstream applications. Our work highlights the effectiveness of leveraging cross-modal 2D knowledge to strengthen 3D point cloud representation and the transferable capability of BERT across modalities.
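The first pretraining task, Point-Image Alignment, pulls together the embeddings of a point cloud and a 2D view of the same object while pushing apart mismatched pairs. A common way to realize such an objective is a symmetric InfoNCE-style contrastive loss over a batch of paired embeddings; the sketch below is a minimal illustration under that assumption (the function names, shapes, and temperature value are hypothetical, not taken from the paper):

```python
import numpy as np

def _normalize(x, axis=-1):
    """L2-normalize feature vectors so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def _softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def point_image_alignment_loss(point_feats, image_feats, temperature=0.07):
    """Illustrative symmetric InfoNCE-style alignment loss (a sketch,
    not the authors' exact formulation).

    point_feats: (B, D) embeddings of B point clouds
    image_feats: (B, D) embeddings of the paired 2D views
    Matched pairs share a batch index; all other pairs act as negatives.
    """
    p = _normalize(point_feats)
    v = _normalize(image_feats)
    logits = p @ v.T / temperature          # (B, B) cosine-similarity logits
    targets = np.arange(len(p))             # positives lie on the diagonal
    # Cross-entropy in both directions: points -> images and images -> points.
    loss_p2i = -np.log(_softmax(logits, axis=1)[targets, targets]).mean()
    loss_i2p = -np.log(_softmax(logits, axis=0)[targets, targets]).mean()
    return (loss_p2i + loss_i2p) / 2.0
```

The loss is minimized when each point cloud is most similar to its own paired image within the batch; the second task, Masked Cross-modal Modeling, builds on such aligned features by reconstructing masked point tokens with cross-modal context.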
DOI: 10.48550/arxiv.2312.04891
Published: 2023-12-08 (arXiv)
Source: arXiv.org
Subjects: Computer Science - Computer Vision and Pattern Recognition