Long-Range Memory in Literary Texts: On the Universal Clustering of the Rare Words

Bibliographic details
Published in:	PLoS ONE, 2016-11, Vol. 11 (11), p. e0164658
Main authors:	Tanaka-Ishii, Kumiko; Bunde, Armin
Format:	Article
Language:	English
Subjects:	Autocorrelation function; Autocorrelation functions; Biology and Life Sciences; Cluster Analysis; Clustering; Data processing; Engineering and Technology; England; France; Germany; Humans; Intervals; Ireland; Language; Linguistics; Linguistics - methods; Mathematical Computing; Models, Theoretical; Physical Sciences; Probability; Research and Analysis Methods; Self-similarity; Semantics; Social Sciences; Statistical physics; Texts; Time series; Vocabulary
Online access:	Full text

Description
Abstract:	A fundamental problem in linguistics is how literary texts can be quantified mathematically. It is well known that the frequency of a (rare) word in a text is roughly inversely proportional to its rank (Zipf's law). Here we address the complementary question of whether the rhythm of the text, characterized by the arrangement of the rare words in the text, can also be quantified mathematically in a similarly basic way. To this end, we consider representative classic single-authored texts from England/Ireland, France, Germany, China, and Japan. In each text, we classify each word by its rank. We focus on the rare words with ranks above some threshold $Q$ and study the lengths of the (return) intervals between them. We find that for all texts considered, the probability $S_Q(r)$ that the length of an interval exceeds $r$ follows a perfect Weibull function, $S_Q(r) = \exp(-b(\beta)\, r^{\beta})$, with $\beta$ around 0.7. The return intervals themselves are arranged in a long-range correlated, self-similar fashion, where the autocorrelation function $C_Q(s)$ of the intervals follows a power law, $C_Q(s) \sim s^{-\gamma}$, with an exponent $\gamma$ between 0.14 and 0.48. We show that these features lead to a pronounced clustering of the rare words in the text.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0164658
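
The procedure outlined in the abstract (rank the words, keep those with rank above Q, measure the return intervals between them, then estimate the survival function and the autocorrelation) can be illustrated with a minimal Python sketch. This is not the authors' code. All function names are hypothetical, and several choices are assumptions made here for illustration: tokens are split on whitespace, rank ties are broken by order of first appearance, and b is fitted as a free parameter rather than being tied to β as the notation b(β) suggests.

```python
from collections import Counter
import math


def rank_words(tokens):
    """Map each word to its frequency rank (1 = most frequent).

    Assumption: ties are broken by order of first appearance.
    """
    counts = Counter(tokens)
    # Counter keeps insertion order and sorted() is stable,
    # so equally frequent words stay in first-appearance order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i + 1 for i, w in enumerate(ordered)}


def return_intervals(tokens, q):
    """Distances (in tokens) between consecutive words with rank > q."""
    ranks = rank_words(tokens)
    positions = [i for i, w in enumerate(tokens) if ranks[w] > q]
    return [b - a for a, b in zip(positions, positions[1:])]


def survival(intervals, r):
    """Empirical S_Q(r): fraction of intervals whose length exceeds r."""
    return sum(1 for x in intervals if x > r) / len(intervals)


def autocorrelation(intervals, s):
    """Sample autocorrelation C_Q(s) of the interval sequence at lag s."""
    n = len(intervals)
    mean = sum(intervals) / n
    var = sum((x - mean) ** 2 for x in intervals) / n
    cov = sum(
        (intervals[i] - mean) * (intervals[i + s] - mean) for i in range(n - s)
    ) / (n - s)
    return cov / var


def fit_weibull(intervals):
    """Least-squares fit of ln(-ln S(r)) = ln(b) + beta * ln(r).

    Returns (b, beta). Assumption: b is a free parameter here.
    Needs at least two values of r with 0 < S(r) < 1.
    """
    pts = []
    for r in range(1, max(intervals)):
        s = survival(intervals, r)
        if 0.0 < s < 1.0:
            pts.append((math.log(r), math.log(-math.log(s))))
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    beta = sum((x - mx) * (y - my) for x, y in pts) / sum(
        (x - mx) ** 2 for x, _ in pts
    )
    return math.exp(my - beta * mx), beta


if __name__ == "__main__":
    toy = "the cat sat on the mat the dog sat on the log".split()
    print(return_intervals(toy, 3))  # [4, 2, 4]
```

On a real text, one would compare the fitted β with the value of about 0.7 reported in the abstract, and check how `autocorrelation` decays with the lag s on a log-log scale. The toy sentence above is far too short for either estimate to be meaningful.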