Hardware designs for convolutional neural networks: Memoryful, memoryless and cached
Saved in:
Published in: | Integration (Amsterdam) 2024-01, Vol. 94, p. 102074, Article 102074 |
Main authors: | , , , |
Format: | Article |
Language: | eng |
Keywords: | |
Online access: | Full text |
Abstract: | This work presents three hardware architectures for convolutional neural networks with a high degree of parallelism and component reuse, implemented in a programmable device. The first design, termed the memoryful architecture, uses as much memory as necessary to store the input data and intermediate results. The second design, termed the memoryless architecture, defines and explores a specific input sequencing pattern to completely avoid the use of memory. The third design, termed the cache memory-based architecture, is an intermediate solution in which the input sequence is explored further: a cache memory stores some intermediate results and thereby improves processing performance. We compare the three designs in terms of power, area, and processing time. Allowing memory usage in the memoryful architecture increases the overall hardware cost but reduces processing time. Preventing all memory usage in the memoryless architecture increases operation parallelism but compromises processing time. The cache memory-based architecture achieves a trade-off between memory usage and processing performance: its processing time is 3× shorter than that of the memoryful architecture, running at a clock frequency about 20% higher, and about 13× shorter than that of the memoryless design, even at a clock frequency about 1.5% lower. The improvement in clock frequency and processing performance comes at a cost in hardware resources: depending on the cache size, the cached design may require up to 25% more logic elements than the memoryful and memoryless architectures. |
Highlights: |
• Three hardware architectures for CNNs with a high degree of parallelism and reuse.
• LeNet-5 hardware architecture using local memories and a 16-bit fixed-point format.
• Specific scanning pattern for operation execution to eliminate the use of memories.
• Cache memory implementation for balancing performance and memory use. |
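The cache memory-based design described in the abstract is an FPGA architecture; the paper's actual implementation is not reproduced here. As a rough software analogy only, the sketch below contrasts a "memoryful" 2D convolution that holds the entire input in memory with a streaming variant that keeps just a K-row line buffer (the "cache"), illustrating why a small cache suffices for intermediate results. All function names and the deque-based buffer are illustrative assumptions, not the paper's design.

```python
# Software analogy (assumption, not the paper's hardware design): a 2D
# convolution computed either from the full input (memoryful) or from a
# small rolling cache of the K most recent input rows (cache-based).
from collections import deque

def conv2d_full(image, kernel):
    """Memoryful analogy: the whole input image is stored."""
    H, W, K = len(image), len(image[0]), len(kernel)
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(K) for b in range(K))
             for j in range(W - K + 1)]
            for i in range(H - K + 1)]

def conv2d_line_buffer(rows, kernel):
    """Cache analogy: rows arrive as a stream; only K rows are buffered."""
    K = len(kernel)
    cache = deque(maxlen=K)  # the "cache": K rows instead of the full image
    out = []
    for r in rows:           # streaming input, one row at a time
        cache.append(r)
        if len(cache) == K:  # enough rows buffered to emit one output row
            W = len(r)
            out.append([sum(cache[a][j + b] * kernel[a][b]
                            for a in range(K) for b in range(K))
                        for j in range(W - K + 1)])
    return out
```

Both routines produce identical outputs; the streaming version only ever holds K rows, which is the memory/performance trade-off the cached architecture exploits.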
ISSN: | 0167-9260 1872-7522 |
DOI: | 10.1016/j.vlsi.2023.102074 |