A new text compression technique based on language structure

Bibliographic details
Published in:	Journal of information science 1995-01, Vol.21 (2), p.87-94
Main author:	Ibrahim Akman, K.
Format:	Article
Language:	eng
Subjects:	Algorithms; Computerized information storage and retrieval; Data Compression; Dictionaries; Exact sciences and technology; Examples; Formats. Markup languages. Codification. Conversion; Information and communication sciences; Information and document structure and analysis; Information processing and retrieval; Information science. Documentation; Language Classification; Languages; Mathematical Formulas; Methods; Morphology (Languages); Program Implementation; Sciences and techniques of general use; Studies; Text Compression; Theoretical Analysis; Turkish; Turkish language

Description
Abstract:	This paper describes a new data compression technique which utilises some of the common structural characteristics of languages. The proposed algorithm is designed to partition a word into its root and suffix(es), which are then replaced by shorter bit representations. The method uses three dictionaries in the form of binary search trees and one character array. The first two dictionaries hold roots, whereas the third holds suffixes. The character array is used both for searching for compressible words and for coding incompressible words. The number of bits used to represent a substring depends on the number of entries in the dictionary in which the substring is found. The proposed algorithm is implemented for the Turkish language and tested using three text groups of different lengths. The results indicate a compression of up to 47%.
ISSN:	0165-5515 1741-6485
DOI:	10.1177/016555159502100203
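
Illustration:	The abstract's two central ideas, splitting a word into a root plus suffixes and sizing each code by the number of entries in its dictionary, can be sketched roughly as follows. This is not the paper's implementation. The names (ROOTS, SUFFIXES, code_width, split_word, encoded_bits) and the tiny Turkish word lists are hypothetical. Plain Python sets stand in for the paper's binary search trees, a single root dictionary stands in for its two, a fixed-width code of ceil(log2(n)) bits is assumed, and the character array for incompressible words is left out.

```python
# Hypothetical toy dictionaries; the paper's real dictionaries are not given
# in the abstract.
ROOTS = {"ev", "göz", "kitap"}          # house, eye, book
SUFFIXES = {"da", "de", "lar", "ler"}   # locative and plural markers


def code_width(n_entries: int) -> int:
    """Bits needed for a fixed-width index into a dictionary of n_entries.

    Assumption: the code length is ceil(log2(n_entries)), at least 1 bit.
    """
    return max(1, (n_entries - 1).bit_length())


def split_suffixes(tail, suffixes):
    """Segment tail into known suffixes (longest match first, with
    backtracking). Returns a list, or None if no segmentation exists."""
    if not tail:
        return []
    for end in range(len(tail), 0, -1):
        if tail[:end] in suffixes:
            rest = split_suffixes(tail[end:], suffixes)
            if rest is not None:
                return [tail[:end]] + rest
    return None


def split_word(word, roots, suffixes):
    """Return (root, [suffix, ...]) for a compressible word, else None."""
    for end in range(len(word), 0, -1):
        if word[:end] in roots:
            parts = split_suffixes(word[end:], suffixes)
            if parts is not None:
                return word[:end], parts
    return None


def encoded_bits(word, roots, suffixes):
    """Bits for the root code plus one code per suffix, or None if the word
    cannot be split. Flag and delimiter bits, which a real format would
    need, are deliberately ignored here."""
    split = split_word(word, roots, suffixes)
    if split is None:
        return None
    _, parts = split
    return code_width(len(roots)) + len(parts) * code_width(len(suffixes))


if __name__ == "__main__":
    for w in ("evlerde", "kitaplar", "masa"):
        print(w, split_word(w, ROOTS, SUFFIXES), encoded_bits(w, ROOTS, SUFFIXES))
```

With these toy dictionaries, "evlerde" splits into "ev" + "ler" + "de", while "masa" has no known root and would fall through to whatever coding the scheme uses for incompressible words. The bit counts are only meaningful relative to the toy dictionary sizes and say nothing about the 47% figure reported in the abstract.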