GLU Variants Improve Transformer
Format: Article
Language: English
Abstract: Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.
DOI: 10.48550/arxiv.2002.05202
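
For illustration, here is a minimal PyTorch sketch of the gated feed-forward sublayer the abstract describes: the component-wise product of two linear projections, with one branch passed through a swappable gating function. The layer sizes (d_model=512, d_ff=2048), the bias-free projections, and the class and parameter names are assumptions made for this sketch, not code from the paper; the variant names (ReGLU, GEGLU, SwiGLU) follow the paper's naming for the activations substituted in place of sigmoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUFeedForward(nn.Module):
    """Transformer feed-forward sublayer with a GLU-style gate:
    FFN(x) = (activation(x W) * x V) W2.
    Dimensions and bias-free projections are illustrative assumptions."""

    def __init__(self, d_model=512, d_ff=2048, activation=torch.sigmoid):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)   # gated branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # linear branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection
        self.activation = activation

    def forward(self, x):
        # Component-wise product of two linear projections, one of which
        # is first passed through the (swappable) gating nonlinearity.
        return self.w2(self.activation(self.w(x)) * self.v(x))

# Swapping the gate function yields the variants the paper evaluates:
glu = GLUFeedForward(activation=torch.sigmoid)   # original GLU gate
reglu = GLUFeedForward(activation=F.relu)        # ReGLU
geglu = GLUFeedForward(activation=F.gelu)        # GEGLU
swiglu = GLUFeedForward(activation=F.silu)       # SwiGLU (Swish/SiLU gate)

x = torch.randn(2, 16, 512)                      # (batch, sequence, d_model)
print(swiglu(x).shape)                           # torch.Size([2, 16, 512])
```

Note the design choice implied by the abstract: the gate lives in the feed-forward sublayer only, replacing the usual ReLU/GELU position-wise network, so the attention sublayers are unchanged.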