Compressing Chinese text files using an adaptive Huffman coding scheme and a static dictionary of character pairs

Bibliographic details
Main authors:	Ong, G.H., Chong, W.T.
Format:	Conference proceedings
Language:	English
Subjects:	Arithmetic; Computer science; Context modeling; Data compression; Dictionaries; Encoding; Frequency; Huffman coding; Information systems; Natural languages

Description
Abstract:	The compression method for Chinese text files proposed in this paper is based on adaptive Huffman coding, a single-pass data compression technique. Chinese text files are modeled as containing not only ASCII characters, Chinese ideographic characters, and punctuation marks, but also commonly used Chinese character pairs. A static dictionary holds about 3000 of the most frequently occurring character pairs found in general Chinese texts; these pairs define the extension to the standard source alphabet in ideogram-based adaptive Huffman coding. The compression ratio and CPU execution time of the proposed method are evaluated against those of the adaptive byte-oriented Huffman coding scheme, the adaptive ideogram-based Huffman coding scheme, and the adaptive LZW method. The experimental results show that the proposed method, based on adaptive Huffman coding with an extended source alphabet, yields better compression on Chinese text files.
DOI:	10.1109/SICON.1993.515699
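
The extended source alphabet described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-entry PAIR_DICTIONARY is a hypothetical stand-in for the static dictionary of about 3000 pairs, the greedy left-to-right matching rule is an assumption (the abstract does not state how pairs are matched), and the code works on Unicode strings rather than on any particular byte encoding. It shows only how text might be split into symbols before those symbols are passed to an adaptive Huffman coder.

```python
# Hypothetical stand-in for the static dictionary of frequent character pairs.
# The real dictionary in the paper holds about 3000 entries; these three are
# illustrative assumptions only.
PAIR_DICTIONARY = frozenset({"我们", "中国", "可以"})


def tokenize(text, pairs=PAIR_DICTIONARY):
    """Split text into symbols of the extended alphabet.

    Assumed rule: scan left to right and emit a two-character symbol when the
    next two characters form a dictionary pair, otherwise emit one character.
    """
    symbols = []
    i = 0
    while i < len(text):
        candidate = text[i:i + 2]
        if candidate in pairs:
            symbols.append(candidate)  # one symbol covering a character pair
            i += 2
        else:
            symbols.append(text[i])    # single ASCII or ideographic character
            i += 1
    return symbols


sample = "我们可以在中国 use ASCII。"
symbols = tokenize(sample)
print(len(sample), "characters ->", len(symbols), "symbols")
```

Each dictionary hit replaces two characters with one symbol, so the coder sees a shorter symbol stream over a larger alphabet. Whether that trade-off pays off depends on the text and the dictionary, which is what the paper's experiments evaluate.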