BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, 2021-09, Vol. 32 (9), pp. 2291-2302
Main authors: Chen, Jing; Fang, Jianbin; Liu, Weifeng; Yang, Canqun
Format: Article
Language: English
Description
Summary: Matrix factorization of sparse matrices has proven to be an effective approach for data mining and machine learning. However, prior parallel implementations of matrix factorization fail to capture the internal social property embedded in real-world use cases. This article presents BALS, an efficient implementation of the alternating least squares (ALS) algorithm built on top of a new sparse matrix format for parallel matrix factorization. The BALS storage format organizes the sparse matrix into 2D tiles to avoid repeated data loads and improve data reuse. We further propose a data reordering technique that sorts sparse matrices by nonzero count. The experimental results show that BALS yields superior performance compared with state-of-the-art implementations; i.e., our BALS generally runs faster than Gates' implementation across different latent feature sizes, with speedups of up to 2.08× on the K20C, 3.72× on the TITAN X, and 3.13× on the TITAN RTX. Compared with alternative matrix factorization algorithms, our BALS consistently outperforms CDMF, cuMF_CCD, and cuMF_SGD across various latent feature sizes and datasets. The reordering technique provides an extra improvement of up to 23.68 percent on the K20C, 19.87 percent on the TITAN X, and 20.38 percent on the TITAN RTX.
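
For orientation, below is a minimal, self-contained sketch of the alternating least squares algorithm that BALS accelerates, written as plain Python/NumPy over a CSR sparse matrix. This is not the paper's tiled GPU implementation; the function name, hyperparameters, and random test data are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Minimal CPU sketch of alternating least squares (ALS) for sparse matrix
# factorization. Illustrative only; BALS's 2D-tiled GPU kernels are not
# reproduced here. Names and hyperparameters are assumptions.

def als(R, k=16, reg=0.1, iters=10):
    """Factor the sparse m x n matrix R into U (m x k) and V (n x k)."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((m, k)) * 0.1
    V = rng.standard_normal((n, k)) * 0.1
    R = R.tocsr()
    Rt = R.T.tocsr()          # column-wise access for the V step
    I = np.eye(k)
    for _ in range(iters):
        # Fix V; solve one k x k regularized least-squares system per row.
        for u in range(m):
            idx = R.indices[R.indptr[u]:R.indptr[u + 1]]
            if idx.size == 0:
                continue
            Vi = V[idx]       # factors of the items rated in row u
            r = R.data[R.indptr[u]:R.indptr[u + 1]]
            U[u] = np.linalg.solve(Vi.T @ Vi + reg * I, Vi.T @ r)
        # Fix U; the column step is symmetric.
        for v in range(n):
            idx = Rt.indices[Rt.indptr[v]:Rt.indptr[v + 1]]
            if idx.size == 0:
                continue
            Ui = U[idx]
            r = Rt.data[Rt.indptr[v]:Rt.indptr[v + 1]]
            V[v] = np.linalg.solve(Ui.T @ Ui + reg * I, Ui.T @ r)
    return U, V

R = sparse_random(100, 80, density=0.05, format="csr", random_state=0)
U, V = als(R, k=8)
print("mean abs. error on nonzeros:",
      np.abs((U @ V.T)[R.nonzero()] - R.data).mean())
```

Note that each per-row solve is independent of the others, which is what makes ALS amenable to massive GPU parallelism; the 2D tiling and nonzero-count reordering described in the abstract serve to improve data reuse and load balance across those independent solves.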
ISSN: 1045-9219 (print); 1558-2183 (electronic)
DOI: 10.1109/TPDS.2021.3064942