Information Content of Protein Sequences

Detailed Description

Bibliographic Details
Published in: Journal of Theoretical Biology 2000-10, Vol. 206 (3), p. 379-386
Main authors: Weiss, Olaf; Jiménez-Montaño, Miguel A.; Herzel, Hanspeter
Format: Article
Language: English
Description
Abstract: The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimations of the entropy of the source are not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is in the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the redundancy observed. The findings are related to numerical and biochemical experiments with random polypeptides.
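
The kind of comparison the abstract describes, an entropy estimate and a compression-based redundancy estimate set against a randomized surrogate, can be illustrated with a short sketch. The following Python snippet is not the authors' procedure: the toy random sequence, the function names, and the choice of zlib as the compressor are assumptions made purely to show the type of measurement (order-0 entropy versus the log2(20) maximum, compressed size of the data versus a shuffled surrogate).

    import math
    import random
    import zlib
    from collections import Counter

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20-letter alphabet; log2(20) ~ 4.32 bits

    def shannon_entropy(seq):
        """Order-0 (single-residue) Shannon entropy in bits per residue."""
        counts = Counter(seq)
        n = len(seq)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def compressed_bits_per_residue(seq):
        """zlib-compressed size in bits per residue: a crude upper bound on
        the algorithmic complexity referred to in the abstract."""
        data = seq.encode("ascii")
        return 8 * len(zlib.compress(data, 9)) / len(data)

    # Toy stand-in for a real non-redundant protein set: a random sequence.
    random.seed(1)
    sequence = "".join(random.choice(AMINO_ACIDS) for _ in range(200_000))

    # Surrogate with the same residue composition but shuffled order.
    surrogate = "".join(random.sample(sequence, len(sequence)))

    print(f"max entropy, log2(20)     : {math.log2(20):.3f} bits/residue")
    print(f"order-0 entropy, sequence : {shannon_entropy(sequence):.3f} bits/residue")
    print(f"order-0 entropy, surrogate: {shannon_entropy(surrogate):.3f} bits/residue")
    print(f"zlib size, sequence       : {compressed_bits_per_residue(sequence):.3f} bits/residue")
    print(f"zlib size, surrogate      : {compressed_bits_per_residue(surrogate):.3f} bits/residue")

On real protein data, the abstract's claim is that both kinds of estimator end up only about 1% below the values obtained for the randomized surrogates; the toy random input above shows essentially no gap, by construction.
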
ISSN: 0022-5193, 1095-8541
DOI: 10.1006/jtbi.2000.2138