NLP Seminar 2014–2015 @ ZIL IPI PAN - Workshop “Quantitative Models of Texts in Natural Language”

AUTHOR: Kumiko Tanaka-Ishii
AFFILIATION: Graduate School of Information Science and Electrical Engineering, Kyushu University
TITLE: Computational Constancy Measures of Texts

ABSTRACT:

In this talk, I will mainly speak about a mathematical and empirical
verification of computational constancy measures for natural language
text. A constancy measure characterizes a given text by a value that
stays invariant once the text exceeds a certain size. The study
of such measures has a 70-year history dating back to Yule's K, with
the original intended application of author identification. We examine
various measures proposed since Yule and reconsider reports made so
far, thus overviewing the study of constancy measures. We then explain
how K is essentially equivalent to an approximation of the
second-order Rényi entropy, thus indicating its significance within
language science. We then empirically examine constancy measure
candidates within this new, broader context. The approximated
higher-order entropy exhibits stable convergence across different
languages and kinds of text. We also show, however, that it cannot
identify authors, contrary to Yule's intention. Lastly, we apply K
to two unknown scripts, those of the Voynich manuscript and Rongorongo, and
show how the results support previous hypotheses about these scripts.
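
The relation between K and the second-order Rényi entropy mentioned
above can be illustrated with a minimal sketch. It assumes the common
textbook form of Yule's K with the scaling constant 10^4 and a plug-in
(relative-frequency) estimate of the entropy; the function names
yule_k and renyi2_plugin are hypothetical and are not taken from the
talk. Under these assumptions, the plug-in entropy equals
-log(K/10^4 + 1/N) for a text of N tokens.

```python
from collections import Counter
import math


def yule_k(tokens, c=1e4):
    """Yule's K in its common form: c * (sum_i f_i^2 - N) / N^2.

    f_i is the count of word type i, N is the number of tokens.
    The constant c = 10^4 is the conventional scaling (an assumption here).
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    sum_sq = sum(f * f for f in Counter(tokens).values())
    return c * (sum_sq - n) / (n * n)


def renyi2_plugin(tokens):
    """Plug-in estimate of the second-order Renyi entropy, in nats.

    H_2 = -log(sum_i p_i^2), with p_i estimated as f_i / N.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("tokens must not be empty")
    return -math.log(sum((f / n) ** 2 for f in Counter(tokens).values()))


if __name__ == "__main__":
    tokens = "a b a c".split()
    k = yule_k(tokens)
    h2 = renyi2_plugin(tokens)
    # The two quantities are tied by sum_i p_i^2 = K/c + 1/N.
    print(k, h2, -math.log(k / 1e4 + 1 / len(tokens)))
```

For long texts the 1/N term becomes negligible, so K is then roughly
proportional to the exponential of the negated plug-in entropy.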

AUTHOR: Łukasz Dębowski
AFFILIATION: Institute of Computer Science, Polish Academy of Sciences
TITLE: Hilberg's Hypothesis---Experimental Verification and Theoretical Results

ABSTRACT:

Can the generation of texts in natural language be described by a
probabilistic model? Researchers from several domains have worked on
this problem. The theoretical difficulties have inspired a few
important concepts in applied mathematics, such as Markov chains,
entropy, fractals, and algorithmic complexity. Hilberg's
conjecture is yet another mathematical model for natural
language. According to this hypothesis, the entropy of texts in
natural language grows as a power law. We will review the history
of empirical verification of this hypothesis and we will sketch
some mathematical developments that provide a rationale for
it. Briefly speaking, the following empirical observations
support Hilberg's conjecture: estimates of entropy by Shannon
using the guessing method (upper and lower bound), scaling of
compression rate for universal codes (upper bound), and scaling
of subword complexity and maximal repetition (lower bound). As
for the theoretical results, there are three main developments.
First, considerations concerning maximal repetition indicate that
universal codes such as the Lempel-Ziv code may fail to
efficiently compress sources that satisfy Hilberg's
conjecture. Second, Hilberg's conjecture implies the empirically
observed power-law growth of vocabulary in texts. Third,
Hilberg's conjecture can be explained by a hypothesis that texts
consistently describe an infinite random object, such as the
external world.
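
The compression-rate scaling mentioned above can be probed with a
rough sketch. It is not the procedure used in the talk: it assumes
zlib (DEFLATE, a Lempel-Ziv-type code with a limited window) as a
crude stand-in for a universal code, and it assumes the simple
power-law form in which entropy grows like n^beta, so that the
per-symbol rate would scale like n^(beta - 1). The helper names
compression_rates and fit_log_log_slope are hypothetical.

```python
import math
import zlib


def compression_rates(data: bytes, sizes):
    """Compressed bytes per input byte for prefixes of the given sizes.

    Each size must satisfy 0 < size <= len(data). The compressed length
    only upper-bounds the information content of the prefix.
    """
    return [len(zlib.compress(data[:n], 9)) / n for n in sizes]


def fit_log_log_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    For y proportional to x**a, the returned slope is a.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic check: a rate decaying like n**-0.5 gives slope -0.5,
    # which corresponds to beta = 0.5 under the assumed power-law form.
    sizes = [10, 100, 1000, 10000]
    rates = [3.0 * n ** -0.5 for n in sizes]
    print(fit_log_log_slope(sizes, rates) + 1.0)
```

On real text, compression_rates would be called on the bytes of a
corpus with increasing prefix sizes, and the fitted slope read only as
a rough indication, since a nonzero asymptotic entropy rate or the
limited window of zlib would distort the fit.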