Performance analysis of SVD algorithm on the Trident processor

Bibliographic details
Main authors:	Soliman, M.I.; Sedukhin, S.G.
Format:	Conference paper
Language:	English
Subjects:	Communication switching; Delay; Frequency; Matrix decomposition; Microarchitecture; Performance analysis; Registers; Singular value decomposition; Very large scale integration; Wire

Description
Summary:	Within the current decade, process technology promises one billion transistors on a single die, operating at frequencies from 6 to 10 GHz. As a direct result of the fundamental trends of increasing transistor density and switching speed, new technological and microarchitectural design constraints are introduced. Among them, wire delays will become critical. To take advantage of VLSI technology, we proposed the Trident processor, which emphasizes local communication. Like vector architectures, the Trident processor extends a scalar core with parallel lanes; each lane contains an execution datapath and a slice of the register file. However, the Trident processor uses ring and communication registers, which are based on local communication, to store and cyclically shift 1-D data within and across the lanes, respectively. By using parallel datapaths, ring registers, and communication registers, the Trident processor can effectively process not only vector but also matrix data. In this paper, the performance of the Trident processor on the singular value decomposition (SVD) algorithm is evaluated. On a 500×600 input matrix, a four-lane Trident processor significantly reduces the number of instructions (44 times fewer), loop overhead (30 times less), and load/store operations (3 times fewer) compared with scalar code. Moreover, the Trident processor is scalable: scaling requires only replicating lanes to process longer vectors or larger matrices (eight lanes speed up SVD by 2.5 times over four lanes).
DOI:	10.1109/CW.2002.1180865