Low cost and high throughput multiplierless design of a 16 point 1-D DCT of the new HEVC video coding standard

This paper presents the hardware design of a 16-points 1-D DCT used in the emerging video coding standard HEVC - High Efficiency Video Coding. The 1-D DCT is used by the 16×16 2-D DCT of the HEVC standard. The transforms stage is one of the innovations proposed by HEVC, not only because of the varia...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Jeske, R., de Souza, J. C., Wrege, G., Conceicao, R., Grellert, M., Mattos, J., Agostini, L.
Format:	Tagungsbericht
Sprache:	eng ; jpn
Schlagworte:	1-D DCT Algorithm design and analysis Computer architecture Discrete cosine transforms Equations FPGA design Hardware HEVC Mathematical model video coding
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

Beschreibung
Zusammenfassung:	This paper presents the hardware design of a 16-points 1-D DCT used in the emerging video coding standard HEVC - High Efficiency Video Coding. The 1-D DCT is used by the 16×16 2-D DCT of the HEVC standard. The transforms stage is one of the innovations proposed by HEVC, not only because of the variable size (from 4×4 to 32×32) but also because higher dimension transforms other than the traditional 4×4 and 8×8 are used. The hardware design presented in this work focuses on low cost and high throughput. To achieve such objectives, the 16-points algorithm from HEVC was simplified, so that a more efficient hardware design could be implemented. Some strategies were used during this simplification, such as operations reordering, factoring to compress the length of the operators, multiplications by constant turned into shifts and adds, sub-expressions sharing, among others. The architecture was designed in a fully combinational way in order to reduce hardware overhead. Synthesis results obtained using Altera FPGAs from the Cyclone II and Stratix III families showed hardware resources reduction reaching 72% when compared to an architecture described as a direct transcription of the non-optimized version of the algorithm. Even with a purely combinational implementation, the designed architecture achieved a throughput between 376 Msamples/s and 1.4 Gsamples/s. With these results, the architecture is capable of processing, in the worst case, more than 30 QFHD frames (3840×2160 pixels) per second. Therefore, the architecture is capable of processing videos with significantly high resolutions in real time. To the best of our knowledge, this is the first work in the literature that presents hardware results for the HEVC transforms.
DOI:	10.1109/SPL.2012.6211786