MoviNet: A novel network for cross-modal map extraction by vision transformer and CNN

Bibliographic details
Published in:	Knowledge-based systems, 2023-10, Vol. 278, p. 110890, Article 110890
Main authors:	Chen, Zheng; Fang, Junhua; Chao, Pingfu; Zhao, Pengpeng; Xu, Jiajie; Zhao, Lei
Format:	Article
Language:	English
Subjects:	Aerial images; Convolutional neural network; GPS trajectory data; Map extraction; Vision transformer
Online access:	Full text

Description
Abstract:	Map quality is of great importance to location-based services (LBS) applications such as navigation and route planning. Typically, a map can be extracted from either vehicle GPS trajectories or aerial images. Unfortunately, the quality of the extracted maps is usually unsatisfactory due to the inherent quality issues in the two data sources. Compared with extracting maps from a single data source, cross-modal map extraction methods consider both data sources and often achieve better results. However, almost all existing cross-modal methods are based on CNNs, which fail to sufficiently model global information. To overcome this problem, we propose MoviNet, a novel cross-modal map extraction method that combines a vision transformer (ViT) and a CNN. Specifically, instead of partially integrating global information in the fusion scheme as in previous works, MoviNet introduces a lightweight ViT model, MobileViT, as the encoder to enhance the model’s ability to capture global information. In addition, we introduce a new lightweight but effective fusion scheme that generates modal-unified fusion features from the features of the two modalities, enhancing the information representation ability of each modality. Extensive experiments conducted on the Beijing and Porto datasets show the superior performance of our proposed method over all baselines. Code: https://github.com/Chan6688/MoviNet
ISSN:	0950-7051, 1872-7409
DOI:	10.1016/j.knosys.2023.110890