ResiDual: Transformer with Dual Residual Connections
Main authors: | |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | Transformer networks have become the preferred architecture for many tasks due to their state-of-the-art performance. However, the optimal way to implement residual connections in the Transformer, which are essential for effective training, is still debated. Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers, which apply layer normalization after each residual block's output or before each residual block's input, respectively. While both variants have their advantages, they also suffer from severe limitations: Post-LN causes a gradient vanishing issue that hinders the training of deep Transformers, and Pre-LN causes a representation collapse issue that limits model capacity. In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN and thereby inherits their advantages while avoiding their limitations. We conduct both theoretical analyses and empirical experiments to verify the effectiveness of ResiDual. Theoretically, we prove that ResiDual has a lower bound on its gradient that avoids the vanishing issue, owing to the residual connection from Pre-LN. Moreover, ResiDual maintains diverse model representations that avoid the collapse issue, owing to the residual connection from Post-LN. Empirically, ResiDual outperforms both Post-LN and Pre-LN on several machine translation benchmarks across different network depths and data sizes. Thanks to its good theoretical and empirical performance, the ResiDual Transformer can serve as a foundation architecture for different AI models (e.g., large language models). Our code is available at https://github.com/microsoft/ResiDual. |
DOI: | 10.48550/arxiv.2304.14802 |
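The abstract describes the dual-stream design only at a high level. Below is a minimal, unofficial PyTorch sketch of how such a Pre-Post-LN block could be wired: the Post-LN stream is normalized after every residual addition, while a second dual stream accumulates the raw sublayer outputs and is normalized only once at the end, Pre-LN style. All class names, the toy feed-forward sublayer, and the dimensions here are illustrative assumptions; the authors' actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn


class ResiDualBlockSketch(nn.Module):
    """Illustrative sketch: one sublayer wrapped with dual residual connections.

    The Post-LN stream (x) is normalized after every residual addition, while
    the dual stream (d) accumulates raw sublayer outputs, Pre-LN style.
    """

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or feed-forward
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, d: torch.Tensor):
        out = self.sublayer(x)            # sublayer reads the Post-LN stream
        x = self.norm(x + out)            # Post-LN residual: normalize after adding
        d = d + out                       # dual residual: accumulate, no normalization
        return x, d


class ResiDualEncoderSketch(nn.Module):
    """Stack of sketch blocks; the two streams are fused at the very end."""

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            ResiDualBlockSketch(
                d_model,
                nn.Sequential(  # toy feed-forward stand-in for attention/FFN sublayers
                    nn.Linear(d_model, 4 * d_model),
                    nn.ReLU(),
                    nn.Linear(4 * d_model, d_model),
                ),
            )
            for _ in range(num_layers)
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x                              # both streams start from the embedding
        for block in self.blocks:
            x, d = block(x, d)
        return x + self.final_norm(d)      # fuse Post-LN stream with normalized dual stream


if __name__ == "__main__":
    enc = ResiDualEncoderSketch(d_model=64, num_layers=6)
    tokens = torch.randn(2, 10, 64)        # (batch, sequence, model dim)
    print(enc(tokens).shape)               # torch.Size([2, 10, 64])
```

In this wiring, gradients can always flow through the unnormalized dual stream (as in Pre-LN), while each layer still receives a normalized input (as in Post-LN), which mirrors the trade-off the abstract describes.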