Large model distributed training method and system for heterogeneous hardware cluster

Bibliographic details
Main author: AO YULONG
Format: Patent
Language: Chinese; English
Description
Summary: The invention discloses a large-model distributed training method and system for heterogeneous hardware clusters, and belongs to the field of artificial intelligence. The method comprises the following steps: automatically selecting a heterogeneous training mode according to the large model input by a user and the hardware cluster configuration, and parallelizing the large model across the cluster accordingly; in the heterogeneous pipeline parallel mode, mapping each network layer within the same micro-batch of the large model to a device of the corresponding hardware type in the cluster, and in the heterogeneous data parallel mode, mapping the multiple data-parallel instances of the large model to devices of the corresponding hardware type in the cluster; and training the large model iteratively in parallel on the devices of all hardware types, with communication between devices during training carried out through the heterogeneous communication library based on the selected heterogeneous...
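
The two mappings named in the summary can be illustrated with a minimal Python sketch. Everything in it is hypothetical and not taken from the patent: the `Device` record, the memory-based rule in `select_mode`, and the memory-proportional layer split in `map_pipeline` are assumptions chosen only to make the idea concrete. The summary states that a mode is selected automatically from the model and the cluster configuration, but it does not state the selection criterion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    hw_type: str       # hypothetical hardware-type label, e.g. "A" or "B"
    memory_gb: float


def group_by_type(devices):
    """Group devices by hardware type, preserving first-seen order."""
    groups = {}
    for d in devices:
        groups.setdefault(d.hw_type, []).append(d)
    return groups


def select_mode(model_memory_gb, devices):
    """Assumed rule (not from the patent): use data parallelism only
    when a full model replica fits on every device."""
    if all(d.memory_gb >= model_memory_gb for d in devices):
        return "heterogeneous_data_parallel"
    return "heterogeneous_pipeline_parallel"


def map_pipeline(num_layers, devices):
    """Assign a contiguous range of layers to each hardware type.
    Range sizes follow each type's share of total memory (assumed heuristic)."""
    groups = group_by_type(devices)
    total = sum(d.memory_gb for d in devices)
    types = list(groups)
    mapping, start = {}, 0
    for i, t in enumerate(types):
        if i == len(types) - 1:
            end = num_layers  # last type takes the remaining layers
        else:
            share = sum(d.memory_gb for d in groups[t]) / total
            end = min(num_layers, start + max(1, round(num_layers * share)))
        mapping[t] = range(start, end)
        start = end
    return mapping


def map_data_parallel(devices):
    """Place one data-parallel model instance on each hardware-type group."""
    groups = group_by_type(devices)
    return {f"replica_{i}": [d.name for d in devs]
            for i, devs in enumerate(groups.values())}
```

In this sketch each data-parallel instance stays within a single hardware type, and each pipeline stage owns a contiguous block of layers, so the only cross-type traffic is activations between stages or gradient synchronization between instances. That traffic is what the heterogeneous communication library mentioned in the summary would carry; the sketch does not model it.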