Text-to-Event Retrieval in Aerial Videos
Published in: IEEE Geoscience and Remote Sensing Letters, 2024, Vol. 21, pp. 1-5
Main authors: , , , ,
Format: Article
Language: English
Keywords:
Online access: Order full text
Abstract: Recent years have seen a rise in the use of video sensors in remote sensing (RS), as they represent a rich source for understanding Earth dynamics and human activities. Accordingly, the volume of RS video data has grown rapidly. Accessing a video of interest in large repositories through text-to-video retrieval is preferable due to its flexibility and efficiency. In this letter, we propose a text-to-event retrieval model for aerial videos. The architecture of our model consists of two branches. The first is the video branch, which extracts frame-level features from the video using a vision transformer (ViT); these features are then concatenated into a unified representation and fed into a temporal attention module to incorporate temporal information. The second is the text branch, which extracts textual representations from the query using bidirectional encoder representations from transformers (BERT). The two branches are trained jointly on video-text pairs by minimizing a bidirectional contrastive loss. Experimental results on the CapERA dataset, an extension of the event recognition in aerial video (ERA) dataset, show the effectiveness of the proposed method. The dataset will be available at https://www.github.com/yakoubbazi/CapEra.
ISSN: 1545-598X, 1558-0571
DOI: 10.1109/LGRS.2023.3330174
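
As a rough illustration of the dual-branch design described in the abstract above, the following sketch pairs a ViT video encoder with a temporal attention layer and a BERT text encoder. It assumes PyTorch and Hugging Face `transformers`; the checkpoint names (`google/vit-base-patch16-224-in21k`, `bert-base-uncased`), the embedding dimension, and the mean-pooling step are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the dual-branch text-to-event retrieval architecture.
# Assumptions: PyTorch + Hugging Face transformers; checkpoint names,
# dimensions, and pooling choices are illustrative, not from the letter.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, BertModel

class VideoBranch(nn.Module):
    """Frame-level ViT features -> temporal attention -> video embedding."""
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        hidden = self.vit.config.hidden_size  # 768 for ViT-Base
        # Temporal attention over the sequence of per-frame features.
        self.temporal_attn = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=num_heads, batch_first=True
        )
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, frames):  # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        # Per-frame [CLS] features from the ViT backbone.
        feats = self.vit(pixel_values=frames.flatten(0, 1)).last_hidden_state[:, 0]
        feats = feats.view(b, t, -1)               # (B, T, hidden)
        feats = self.temporal_attn(feats)          # incorporate temporal context
        video_emb = self.proj(feats.mean(dim=1))   # pool over time
        return F.normalize(video_emb, dim=-1)

class TextBranch(nn.Module):
    """BERT query encoder -> text embedding in the shared space."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        text_emb = self.proj(out.pooler_output)
        return F.normalize(text_emb, dim=-1)
```

At retrieval time, a query embedding from the text branch would be compared by cosine similarity against precomputed video embeddings, with the highest-scoring videos returned.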
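The bidirectional contrastive loss mentioned in the abstract can be read as a symmetric InfoNCE objective over matched video-text pairs, as in CLIP-style training. The sketch below is one plausible formulation; the temperature value and the equal weighting of the two directions are assumptions.

```python
# Sketch of a bidirectional (symmetric) contrastive loss over a batch of
# matched video-text pairs. The temperature and 0.5/0.5 weighting are
# assumptions; the letter may use a different exact formulation.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (B, D), L2-normalized; row i of each is a match."""
    logits = video_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)       # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)   # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```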