PESFormer: Boosting Macro- and Micro-expression Spotting with Direct Timestamp Encoding
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | The task of macro- and micro-expression spotting aims to precisely localize
and categorize temporal expression instances within untrimmed videos. Given the
sparse distribution and varying durations of expressions, existing anchor-based
methods often represent instances by encoding their deviations from predefined
anchors. Additionally, these methods typically slice the untrimmed videos into
fixed-length sliding windows. However, anchor-based encoding often fails to
capture all training intervals, and slicing the original video as sliding
windows can result in valuable training intervals being discarded. To overcome
these limitations, we introduce PESFormer, a simple yet effective model based
on the vision transformer architecture to achieve point-to-interval expression
spotting. PESFormer employs a direct timestamp encoding (DTE) approach to
replace anchors, enabling binary classification of each timestamp instead of
optimizing entire ground truths. Thus, all training intervals are retained in
the form of discrete timestamps. To maximize the utilization of training
intervals, we also revise the preprocessing: instead of using the short clips
produced by the sliding-window method, we zero-pad the untrimmed training
videos to create uniform, longer videos of a predetermined duration. This
operation efficiently preserves the original training intervals and eliminates
video slice enhancement. Extensive qualitative and quantitative evaluations on three
datasets, CAS(ME)^2, CAS(ME)^3, and SAMM-LV, demonstrate that PESFormer
outperforms existing techniques. |
---|---|
DOI: | 10.48550/arxiv.2410.18695 |
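
The summary describes two ideas that a short sketch may make more concrete: labeling every timestamp as expression or non-expression rather than regressing offsets from anchors, and zero-padding a whole untrimmed video to a fixed length rather than slicing it into windows. The Python below is only an illustration under stated assumptions and is not taken from the paper: the function names (encode_timestamps, zero_pad, decode_intervals), the inclusive (onset, offset) frame convention, the validity mask, and the run-merging decoder are all hypothetical choices made for this example.

```python
def encode_timestamps(intervals, num_frames):
    """Return one binary label per frame: 1 inside any interval, else 0.

    Assumption: intervals are (onset, offset) frame indices, both inclusive.
    """
    labels = [0] * num_frames
    for onset, offset in intervals:
        # Clip to the valid frame range so no annotated frame is dropped.
        for t in range(max(onset, 0), min(offset, num_frames - 1) + 1):
            labels[t] = 1
    return labels


def zero_pad(features, target_len):
    """Pad per-frame feature vectors with zero vectors up to target_len.

    Returns the padded sequence and a mask (1 = real frame, 0 = padding).
    The mask is an assumption of this sketch, not something the summary states.
    """
    if len(features) > target_len:
        raise ValueError("video is longer than the predetermined duration")
    dim = len(features[0]) if features else 0
    pad = target_len - len(features)
    padded = [list(f) for f in features] + [[0.0] * dim for _ in range(pad)]
    mask = [1] * len(features) + [0] * pad
    return padded, mask


def decode_intervals(labels):
    """Merge runs of consecutive positive timestamps into (onset, offset) pairs.

    This simple run merging is a hypothetical stand-in for the
    point-to-interval step named in the summary.
    """
    intervals, start = [], None
    for t, y in enumerate(labels):
        if y and start is None:
            start = t
        elif not y and start is not None:
            intervals.append((start, t - 1))
            start = None
    if start is not None:
        intervals.append((start, len(labels) - 1))
    return intervals
```

With this encoding, every annotated frame contributes a positive label regardless of how long or short its interval is, which is the property the summary attributes to direct timestamp encoding. In this toy version, decoding the labels produced by encode_timestamps recovers the original intervals as long as they do not touch or overlap.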