SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
Main Authors:
Format: Article
Language: English
Abstract: We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor or complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, transforming each into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality while maintaining computational efficiency.
DOI: 10.48550/arxiv.2410.04690
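
To make the abstract's core idea concrete, the following is a minimal sketch of a segment-wise conditional INR. All class and function names here are illustrative assumptions, not the authors' implementation: each text embedding conditions a small MLP that maps a frame index within its segment to a frame-level feature, and a separate boundary head decides when the segment ends, so no external duration predictor is needed.

```python
# Hypothetical sketch of the segment-wise conditional INR idea; the module
# names, dimensions, and boundary mechanism are assumptions for illustration.
import torch
import torch.nn as nn

class SegmentINR(nn.Module):
    def __init__(self, embed_dim: int = 256, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        # The INR conditions on (text embedding, scalar frame index t).
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.feature_head = nn.Linear(hidden, feat_dim)   # frame-level feature
        self.boundary_head = nn.Linear(hidden, 1)         # P(segment ends here)

    def forward(self, text_emb: torch.Tensor, t: torch.Tensor):
        # text_emb: (batch, embed_dim); t: (batch, 1) frame index in the segment
        h = self.mlp(torch.cat([text_emb, t], dim=-1))
        return self.feature_head(h), torch.sigmoid(self.boundary_head(h))

@torch.no_grad()
def decode_segment(inr: SegmentINR, text_emb: torch.Tensor, max_frames: int = 50):
    """Roll out frames for one text embedding until the boundary head fires."""
    frames = []
    for i in range(max_frames):
        t = torch.full((text_emb.size(0), 1), float(i))
        feat, p_end = inr(text_emb, t)
        frames.append(feat)
        if (p_end > 0.5).all():  # segment boundary chosen by the model itself
            break
    return torch.stack(frames, dim=1)  # (batch, num_frames, feat_dim)

# Usage: one segment of frame-level features per text-token embedding.
inr = SegmentINR()
emb = torch.randn(1, 256)
seg = decode_segment(inr, emb)
print(seg.shape)
```

Because the boundary is predicted per segment rather than per utterance, decoding each text token's segment is independent of frame-level AR decoding over the whole sequence, which is consistent with the computational-efficiency claim in the abstract.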