Sparse Mixer Architecture



Bibliographic Details
Main Authors: Thorp, James Lee; Ainslie, Joshua Timothy
Format: Patent
Language: English
Description
Summary: Improved multi-layer machine learning model architectures are provided that exhibit increased accuracy, decreased training time, decreased inference compute cost, and/or increased stability during training. These improved models include a plurality of sequential layers, each layer comprising a mixing layer that feeds into a feedforward layer. These improved models achieve these benefits by 'enhancing' a subset of the feedforward layers with mixture-of-experts or other sparse multi-network architectures while 'degrading' a subset of the mixing layers to be simple linear mixing layers (e.g., that multiply inputs by one or more mixing matrices) rather than more complicated attentional mixing mechanisms (e.g., involving a number of matrix multiplications, dot products, and nonlinear operations). Such a combination of mixing layer modifications and feedforward layer modifications in a single multi-layer model exhibits synergistic improvements with respect to training time, inference computational cost, and training stability for a given level of model accuracy.
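To make the two modifications concrete, the following is a minimal NumPy sketch of one such layer, not the patented implementation: a 'degraded' linear mixing step that only multiplies its input by fixed mixing matrices (no attention scores or nonlinearities), followed by an 'enhanced' feedforward step using simple top-1 mixture-of-experts routing. All names, shapes, and the top-1 routing rule are illustrative assumptions.

```python
import numpy as np

def linear_mixing(x, token_mix, hidden_mix):
    # "Degraded" mixing layer: plain matrix multiplications over the
    # token and hidden dimensions -- no dot-product attention scores,
    # softmax, or other nonlinear operations.
    # x: (seq_len, d_model); token_mix: (seq_len, seq_len);
    # hidden_mix: (d_model, d_model)
    return token_mix @ x @ hidden_mix

def moe_feedforward(x, experts, router_w):
    # "Enhanced" feedforward layer: sparse mixture-of-experts with
    # top-1 routing (an illustrative choice), so each token activates
    # only one expert's feedforward network.
    logits = x @ router_w                  # (seq_len, num_experts)
    choice = logits.argmax(axis=-1)        # chosen expert per token
    out = np.empty_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        h = np.maximum(x[mask] @ w1, 0.0)  # ReLU feedforward
        out[mask] = h @ w2
    return out

# Illustrative usage: one layer = mixing then feedforward,
# each wrapped in a residual connection.
rng = np.random.default_rng(0)
seq, d, d_ff, num_experts = 8, 16, 32, 4
x = rng.standard_normal((seq, d))
token_mix = rng.standard_normal((seq, seq)) / seq
hidden_mix = rng.standard_normal((d, d)) / d
experts = [(rng.standard_normal((d, d_ff)) / d,
            rng.standard_normal((d_ff, d)) / d_ff)
           for _ in range(num_experts)]
router_w = rng.standard_normal((d, num_experts))

y = x + linear_mixing(x, token_mix, hidden_mix)
z = y + moe_feedforward(y, experts, router_w)
```

In a full model, only a subset of layers would use this cheap linear mixing and only a subset of feedforward layers would be replaced by the expert mixture; the remaining layers would keep conventional attention and dense feedforward blocks.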