Listen As You Wish: Audio based Event Detection via Text-to-Audio Grounding in Smart Cities
Main Authors:
Format: Article
Language: English
Subjects:
Online Access: Order full text
Abstract: With the development of Internet of Things technologies, a tremendous amount of sensor audio data has been produced, which poses great challenges to audio-based event detection in smart cities. In this paper, we target a challenging audio-based event detection task, namely text-to-audio grounding. In addition to precisely localizing all of the desired onsets and offsets in the untrimmed audio, this challenging new task requires extensive acoustic and linguistic comprehension, as well as reasoning about the cross-modal matching relations between the audio and the query. To address these issues, current approaches often treat the query as a single unit via a global query representation. We contend that this strategy has several drawbacks. First, the interactions between the query and the audio are not fully exploited. Second, it does not distinguish the importance of different keywords in a query. In addition, since audio clips are of arbitrary length, many segments are irrelevant to the query yet are not filtered out, which further hinders the effective grounding of the desired segments. Motivated by these concerns, we propose a novel Cross-modal Graph Interaction (CGI) model that comprehensively models the relations between the words in a query through a novel language graph. To capture the fine-grained relevance between the audio and the query, a cross-modal attention module is introduced to generate snippet-specific query representations and automatically assign higher weights to keywords with more important semantics. Furthermore, we develop a cross-gating module for the audio and the query to weaken irrelevant parts and emphasize the important ones.
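The record does not describe how the language graph is constructed, so the following is only a minimal sketch: a single attention-style message-passing round over a fully connected graph of query words, written in plain PyTorch. The class name, the projections, and the fully-connected structure are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGraphLayer(nn.Module):
    """Hypothetical sketch: one message-passing round over a fully
    connected graph whose nodes are the words of the query."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)  # query projection for word-word attention
        self.k = nn.Linear(dim, dim)  # key projection
        self.v = nn.Linear(dim, dim)  # value (message) projection

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (num_words, dim) embeddings of the query tokens
        scores = self.q(words) @ self.k(words).T / words.size(-1) ** 0.5
        adj = F.softmax(scores, dim=-1)     # soft adjacency over word pairs
        return words + adj @ self.v(words)  # residual update of each node
```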
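Likewise, the cross-modal attention module is only named in the abstract. A plausible reading, sketched below with assumed names and shapes, is that each audio snippet attends over the query words, so every snippet receives its own weighted query representation and the attention weights act as per-snippet keyword importances.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Hypothetical sketch: snippet-specific query representations via
    dot-product attention from audio snippets to query words."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # map audio features into the word space

    def forward(self, audio: torch.Tensor, words: torch.Tensor):
        # audio: (num_snippets, dim); words: (num_words, dim)
        scores = self.proj(audio) @ words.T / words.size(-1) ** 0.5
        weights = F.softmax(scores, dim=-1)  # keyword importance per snippet
        snippet_queries = weights @ words    # (num_snippets, dim)
        return snippet_queries, weights
```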
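Finally, "cross-gating" commonly denotes each modality computing a sigmoid gate that rescales the other; the abstract does not spell out the paper's variant, so the sketch below (assumed names, and assuming the query side is already snippet-aligned as above) shows only that generic pattern.

```python
import torch
import torch.nn as nn

class CrossGating(nn.Module):
    """Hypothetical sketch: each modality gates the other so that
    query-irrelevant audio snippets (and unimportant words) are weakened."""

    def __init__(self, dim: int):
        super().__init__()
        self.audio_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.query_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio: torch.Tensor, query: torch.Tensor):
        # audio, query: (num_snippets, dim), aligned along the snippet axis
        gated_audio = audio * self.audio_gate(query)  # suppress irrelevant snippets
        gated_query = query * self.query_gate(audio)  # suppress unimportant words
        return gated_audio, gated_query
```

Chained together, these three assumed pieces (graph over words, snippet-word attention, mutual gating) would yield per-snippet features that a grounding head could score to predict onsets and offsets.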
DOI: 10.48550/arxiv.2106.14136