Algorithm and Hardware Co-Design of Energy-Efficient LSTM Networks for Video Recognition with Hierarchical Tucker Tensor Decomposition
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Long short-term memory (LSTM) is a powerful type of deep neural network that
has been widely used in many sequence analysis and modeling applications.
However, the large model size of LSTM networks makes their practical
deployment very challenging, especially for video recognition tasks
that require high-dimensional input data. Aiming to overcome this limitation
and fully unlock the potential of LSTM models, in this paper we propose to
perform algorithm and hardware co-design towards high-performance
energy-efficient LSTM networks. At the algorithm level, we propose a fully
decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM,
which enjoys ultra-low model complexity while still achieving high accuracy. In
order to fully reap such attractive algorithmic benefit, we further develop the
corresponding customized hardware architecture to support the efficient
execution of the proposed FDHT-LSTM model. With a carefully designed memory
access scheme, the complicated matrix transformation can be efficiently
supported by the underlying hardware on the fly, without any access
conflict. Our evaluation results show that both the proposed
ultra-compact FDHT-LSTM models and the corresponding hardware accelerator
achieve very high performance. Compared with state-of-the-art compressed
LSTM models, FDHT-LSTM achieves both an order-of-magnitude reduction in model size
and a significant accuracy improvement across different video recognition
datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for
tensor-decomposed models, our proposed FDHT-LSTM architecture achieves
better throughput, area efficiency and energy efficiency
on the LSTM-Youtube workload. For the LSTM-UCF workload, our proposed
design also outperforms TIE with higher throughput, higher energy efficiency
and comparable area efficiency. |
---|---|
DOI: | 10.48550/arxiv.2212.02046 |
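
The summary attributes the compactness of FDHT-LSTM to hierarchical Tucker decomposition of the LSTM weights. As a rough illustration of why such a format is small, the sketch below compares the parameter count of a dense input-to-hidden weight matrix with the storage of a generic hierarchical Tucker format over a binary dimension tree. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the 57600-dimensional input, the 1024 stacked gate outputs, the mode factorizations, the uniform rank of 4, and the convention that leaf i has size m_i * n_i. The resulting numbers are not the paper's reported results.

```python
from math import prod


def dense_params(in_modes, out_modes):
    """Parameters in a dense weight matrix of shape prod(out) x prod(in)."""
    return prod(in_modes) * prod(out_modes)


def ht_params(in_modes, out_modes, rank):
    """Storage of a hierarchical Tucker format over a binary dimension tree.

    Assumptions (illustrative, not the paper's exact FDHT layout):
      * leaf i stores a (m_i * n_i) x rank frame matrix,
      * each interior node except the root stores a rank x rank x rank
        transfer tensor,
      * the root stores a rank x rank matrix.
    """
    d = len(in_modes)
    if d != len(out_modes) or d < 2:
        raise ValueError("need matching mode lists with at least two modes")
    leaves = sum(m * n * rank for m, n in zip(in_modes, out_modes))
    interior = (d - 2) * rank ** 3  # a binary tree with d leaves has d - 1 interior nodes
    root = rank ** 2
    return leaves + interior + root


# Hypothetical sizes: a 57600-dimensional frame and 4 gates x 256 hidden units.
in_modes = (8, 20, 20, 18)   # 8 * 20 * 20 * 18 = 57600
out_modes = (4, 4, 8, 8)     # 4 * 4 * 8 * 8 = 1024

dense = dense_params(in_modes, out_modes)
compact = ht_params(in_modes, out_modes, rank=4)
print(f"dense: {dense}, hierarchical Tucker: {compact}")
```

Under these assumed sizes the dense count grows with the product of all mode sizes, while the decomposed count grows with their sum times the rank plus a small cubic term in the rank. This is the general mechanism behind large reductions in model size for high-dimensional video input.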