Systems and methods for fault tolerance recovery during training of a model of a classifier using a distributed system

Bibliographic details
Main authors:	WU ZUGUANG, PETERFREUND NATAN, TALYANSKY ROMAN, MELAMED ZACH
Format:	Patent
Language:	Chinese; English
Subjects:	CALCULATING; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Description
Abstract:	There is provided a distributed system for training a classifier, comprising: machine learning (ML) workers, each configured to compute a model update for a model of the classifier; a parameter server (PS) configured for parallel processing to provide the model to each of the ML workers, receive model updates from each of the ML workers, and iteratively update the model using each model update; gradient datasets, each associated with a respective ML worker and storing a model-update identification (delta-M-ID) indicative of the computed model update together with the respective model update; a global dataset that stores the delta-M-ID, an identification of the ML worker (ML-worker-ID) that computed the model update, and a model version (MODEL-VERSION) that marks a new model in the PS computed by merging the model update with a previous model in the PS; and a model download dataset that stores the ML-worker-ID and the MODEL-VERSION of each transmitted model.
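
The bookkeeping named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names, the list-of-floats model, the additive merge rule, the placeholder update, and the `replay` helper are all assumptions made for illustration. The abstract only names the three datasets and the fields they store, and it does not describe the recovery procedure itself.

```python
from dataclasses import dataclass, field


@dataclass
class MLWorker:
    """Hypothetical ML worker that keeps its own gradient dataset."""
    worker_id: str
    # Gradient dataset: delta-M-ID -> model update computed by this worker.
    gradient_dataset: dict = field(default_factory=dict)
    _next: int = 0

    def compute_update(self, model):
        # Placeholder for real training (assumption): shrink each weight by 10%.
        update = [-0.1 * w for w in model]
        delta_m_id = f"{self.worker_id}-{self._next}"
        self._next += 1
        self.gradient_dataset[delta_m_id] = update
        return delta_m_id, update


@dataclass
class ParameterServer:
    """Hypothetical parameter server holding the model and both PS-side datasets."""
    model: list
    model_version: int = 0
    # Global dataset: (delta-M-ID, ML-worker-ID, MODEL-VERSION) per merge.
    global_dataset: list = field(default_factory=list)
    # Model download dataset: (ML-worker-ID, MODEL-VERSION) per transmitted model.
    model_download_dataset: list = field(default_factory=list)

    def download(self, worker_id):
        self.model_download_dataset.append((worker_id, self.model_version))
        return list(self.model)

    def merge(self, worker_id, delta_m_id, update):
        # Additive merge is an assumption; the abstract only says "merging".
        self.model = [w + d for w, d in zip(self.model, update)]
        self.model_version += 1
        self.global_dataset.append((delta_m_id, worker_id, self.model_version))


def replay(initial_model, global_dataset, workers):
    """Assumed recovery idea: rebuild the model by re-applying logged updates in order."""
    model = list(initial_model)
    for delta_m_id, worker_id, _version in global_dataset:
        update = workers[worker_id].gradient_dataset[delta_m_id]
        model = [w + d for w, d in zip(model, update)]
    return model


workers = {wid: MLWorker(wid) for wid in ("w0", "w1")}
ps = ParameterServer(model=[1.0, 2.0])
for wid in ("w0", "w1"):
    local_model = ps.download(wid)
    ps.merge(wid, *workers[wid].compute_update(local_model))

recovered = replay([1.0, 2.0], ps.global_dataset, workers)
```

In this sketch the global dataset fixes the order in which updates were merged, the gradient datasets hold the update payloads, and the model download dataset records which model version each worker last received. Under the stated assumptions, that is enough for `replay` to rebuild the parameter server's model from the initial model.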