Developing Biceps to completely compute in subquadratic time a new generic type of bicluster in dense and sparse matrices

Bibliographic details
Published in:	Data Mining and Knowledge Discovery, 2022-07, Vol. 36 (4), p. 1451-1497
Main authors:	Abreu, Bernardo; Ataide Martins, João Paulo; Cerf, Loïc
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Chemistry and Earth Sciences; Clustering; Computer Science; Data Mining and Knowledge Discovery; Data structures; Dynamic programming; Information Storage and Retrieval; Physics; Sparse matrices; Sparsity; Statistics for Engineering

Description
Summary:	Given an $m$-by-$n$ real matrix, biclustering aims to discover relevant submatrices. This article defines a new type of bicluster. In any of its columns, the values in the rows of the bicluster must all be strictly greater than those in the rows absent from it, hence the discovery of a binary clustering of the rows in the restricted context of the columns of the bicluster. To keep only the best bicluster among those carrying redundant information, its rows must not be a subset or a superset of the rows of another bicluster of greater or equal quality. Any computable function can be chosen to assign qualities to the biclusters. In that respect, the proposed definition is generic. Dynamic programming and appropriate data structures make it possible to exhaustively list the biclusters satisfying it within $O(m^2 n + m n^2)$ time, plus the time to compute $O(mn)$ qualities. After some adaptations, the proposed algorithm, Biceps, remains subquadratic if its complexity is expressed as a function of $m_{\text{non-min}}\, n$, where $m_{\text{non-min}}$ is the maximal number of non-minimal values in a column, i.e., for sparse matrices. Experiments on three real-world datasets demonstrate the effectiveness of the proposal in different application contexts. They also show that its good theoretical efficiency is practical as well: two minutes and 5.3 GB of RAM are enough to list the desired biclusters in a dense 801-by-20,531 matrix; 3.5 s and 192 MB of RAM suffice for a sparse 631,532-by-174,559 matrix with 2,575,425 nonzero values.
ISSN:	1384-5810, 1573-756X
DOI:	10.1007/s10618-022-00834-3
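
The separation condition stated in the summary (in every column of the bicluster, each value in a selected row is strictly greater than every value in an unselected row) can be illustrated with a small check. This is only a sketch of that one condition, not the Biceps algorithm: it does not enumerate biclusters, compute qualities, or apply the subset/superset filter. The function name `is_bicluster`, the matrix `toy`, and the handling of an empty or complete row set are assumptions made for illustration.

```python
from typing import Iterable, Sequence


def is_bicluster(matrix: Sequence[Sequence[float]],
                 rows: Iterable[int],
                 cols: Iterable[int]) -> bool:
    """Check the column-wise separation condition described in the summary.

    Hypothetical helper: for every selected column, the smallest value in
    the selected rows must be strictly greater than the largest value in
    the remaining rows.
    """
    inside = set(rows)
    outside = [i for i in range(len(matrix)) if i not in inside]
    # Assumption of this sketch: with no rows inside or no rows outside,
    # the condition is treated as vacuously satisfied.
    if not inside or not outside:
        return True
    for j in cols:
        lowest_inside = min(matrix[i][j] for i in inside)
        highest_outside = max(matrix[i][j] for i in outside)
        if lowest_inside <= highest_outside:
            return False
    return True


# Illustrative 4-by-3 matrix (made-up values).
toy = [
    [9.0, 1.0, 7.0],
    [8.0, 5.0, 6.0],
    [2.0, 4.0, 1.0],
    [3.0, 9.0, 0.5],
]

print(is_bicluster(toy, {0, 1}, {0, 2}))  # rows 0 and 1 dominate in columns 0 and 2
print(is_bicluster(toy, {0, 1}, {0, 1}))  # column 1 breaks the separation
```

In the toy matrix, rows 0 and 1 hold the two largest values of columns 0 and 2, so the first call reports a valid separation; adding column 1, where row 3 holds the largest value, makes the second call fail.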