Variable-Rate Deep Image Compression With Vision Transformers


Bibliographic Details
Published in: IEEE Access, 2022, Vol. 10, pp. 50323-50334
Main Authors: Li, Binglin; Liang, Jie; Han, Jingning
Format: Article
Language: English
Online Access: Full text
Description
Summary: Recently, vision transformers have been applied to many computer vision problems because of their ability to learn long-range dependencies. However, they have not been thoroughly explored in image compression. We propose a patch-based learned image compression network that incorporates vision transformers. The input image is divided into patches before being fed to the encoder, and the patches reconstructed by the decoder are combined into a complete image. Different kinds of transformer blocks (TransBlocks) are applied to meet the varying requirements of the subnetworks. We also propose a transformer-based context model (TransContext) to facilitate coding based on previously decoded symbols. Since the computational complexity of the attention mechanism in transformers is a quadratic function of the sequence length, we partition the feature tensor into segments and apply the transformer within each segment to save computation. To alleviate compression artifacts, we use overlapping patches and apply an existing deblocking network to further remove the artifacts. Finally, a residual coding scheme is adopted to support variable bit rates. We show that our patch-based learned image compression with transformers obtains a 0.75 dB PSNR improvement at 0.15 bpp over the prior variable-rate compression work on the Kodak dataset. With the residual coding strategy, our framework maintains good PSNR performance, comparable to BPG (4:2:0). For MS-SSIM, we obtain higher results than BPG (4:4:4) across a range of bit rates (by 0.021 at 0.21 bpp) and than other variable-rate learned image compression models at low bit rates.
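
The segment-wise attention mentioned in the summary can be illustrated with a short sketch. The following is a minimal PyTorch illustration, not the paper's implementation; the class name SegmentedSelfAttention and the parameter segment_size are assumptions introduced here. Folding fixed-size segments into the batch dimension reduces the attention cost from quadratic in the full sequence length n to roughly n times the segment size.

import torch
import torch.nn as nn

class SegmentedSelfAttention(nn.Module):
    # Hypothetical sketch: self-attention runs independently inside each
    # fixed-size segment instead of over the whole sequence, so the cost is
    # O((n/s) * s^2) = O(n * s) rather than O(n^2).
    def __init__(self, dim: int, num_heads: int, segment_size: int):
        super().__init__()
        self.segment_size = segment_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); seq_len assumed divisible by segment_size.
        b, n, d = x.shape
        s = self.segment_size
        x = x.reshape(b * (n // s), s, d)   # fold segments into the batch dim
        out, _ = self.attn(x, x, x)         # attention within each segment only
        return out.reshape(b, n, d)         # unfold back to the full sequence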
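The overlapping-patch scheme can be sketched in the same spirit. Here codec is a hypothetical stand-in for the paper's per-patch encoder/decoder pair, and the patch and stride values are illustrative; the paper applies a separate deblocking network to the result, whereas this sketch simply averages the overlapping regions as one plausible blending choice.

import torch
import torch.nn.functional as F

def compress_by_patches(img, codec, patch=256, stride=224):
    # img: (1, C, H, W); H and W assumed to satisfy (H - patch) % stride == 0
    # and likewise for W, so the patch grid tiles the image exactly.
    patches = F.unfold(img, patch, stride=stride)        # (1, C*patch*patch, L)
    c = img.shape[1]
    L = patches.shape[-1]
    patches = patches.transpose(1, 2).reshape(L, c, patch, patch)
    recon = codec(patches)                               # run the per-patch codec
    recon = recon.reshape(1, L, c * patch * patch).transpose(1, 2)
    out = F.fold(recon, img.shape[-2:], patch, stride=stride)
    # F.fold sums overlapping regions; divide by the per-pixel overlap count
    # to average them (the paper instead feeds this to a deblocking network).
    ones = torch.ones_like(img)
    weight = F.fold(F.unfold(ones, patch, stride=stride),
                    img.shape[-2:], patch, stride=stride)
    return out / weight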
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3173256