ICS PAS

The Linguistic Engineering Group

Poliqarp 1.1


This directory contains the beta version of Poliqarp 1.1 with statistical extensions:

See http://korpus.pl/index.php?page=poliqarp for the stable Poliqarp 1.0, as well as pointers to documentation describing the basic query syntax.


Contents of poliqarp1_1.readme:

POLIQARP 1.1 - STATISTICAL EXTENSIONS TO QUERY SYNTAX
Aleksander Buczyński

INTRODUCTION

Poliqarp is a utility for searching large tagged corpora, which features
an expressive query language and support for positional tagsets.
Version 1.0 worked as a concordancer, which returned a list of matches
with contexts of selected width as a response to every query.

Version 1.1 adds support for statistical queries, which answer questions
such as:

- what is the frequency distribution of a given word's forms?
- what parts of speech occur directly after a given word?
- what verbs are used most often in a given style?

The new keyword "group by", followed by a list of attributes of a given
segment (word), tells the program that the user is not interested in
contexts of the occurrences of each match, but rather in the frequency
distribution of the specified attributes' values.

SIMPLE QUERIES

Frequency of forms of the word "korpus" (corpus):

[base=korpus] group by orth

As above, but with a distinction between homonymous cases:

[base=korpus] group by number, case

Or:

[base=woda] group by number, case, orth

Frequency of parts of speech:

[] group by pos

Frequency of parts of speech in books by Prus:

[] meta autor=Prus group by pos

The results of the first part of the query (before group by) can be
grouped according to a set of segment attributes (specified after group
by). Some attributes -- orth, length, base, pos -- are universal for all
corpora and segments, others -- like number, case, degree or gender --
may be dependent on the tagset used in the corpus. The examples given in
this document are based on the tagset used in the IPI PAN Corpus of
Polish.
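The effect of "group by" can be sketched in Python as counting how often
each combination of attribute values occurs among the matches. This is only
an illustration of the idea, not Poliqarp's implementation: the matches
list, its values and the group_by helper are all hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for the matches of [base=korpus]: one dict of
# attribute values per matched segment. The data is invented.
matches = [
    {"orth": "korpus", "number": "sg", "case": "nom"},
    {"orth": "korpus", "number": "sg", "case": "acc"},
    {"orth": "korpusu", "number": "sg", "case": "gen"},
    {"orth": "korpusy", "number": "pl", "case": "nom"},
    {"orth": "korpus", "number": "sg", "case": "nom"},
]


def group_by(matches, attributes):
    """Count how often each combination of attribute values occurs."""
    return Counter(tuple(m[a] for a in attributes) for m in matches)


# Roughly what "group by orth" and "group by number, case" ask for.
by_orth = group_by(matches, ["orth"])
by_number_case = group_by(matches, ["number", "case"])

for key, count in by_number_case.most_common():
    print(", ".join(key), count)
```

Grouping by orth alone merges the homonymous nominative and accusative
forms, while grouping by number and case keeps them apart.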

HANDLING MULTIWORD MATCHES

Poliqarp queries can return matches longer than a single segment. To
specify which segment's attribute should be used for grouping, you
should add the segment number before the name of the attribute. For
example, to find verbs occurring right after the word "woda" (water),
you could write:

[base=woda][pos=verb] group by 2.base

As above, but allowing an optional adverb between "woda" and the verb:

[base=woda][pos=adv]{0,1}[pos=fin] group by -1.base

-1. specifies the last segment of the result. Similarly, -2. is the
second last, -3. the third last, etc.

Three consecutive adverbs:

[pos=adv]{3} group by 1.base, 2.base, 3.base sort by freq

The "1." prefix can be omitted:

[pos=adv]{3} group by base, 2.base, 3.base sort by freq
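The segment numbering described in this section can be sketched in Python
as follows. The resolve function and the sample matches are hypothetical;
the sketch only assumes what the text states: positive numbers count from
the start of the match, negative numbers from its end, and a bare attribute
name refers to segment 1.

```python
def resolve(selector, match):
    """Return the attribute value named by a selector such as '2.base'.

    match is a list of segments (dicts of attribute values).
    """
    position, dot, attribute = selector.rpartition(".")
    index = int(position) if dot else 1  # bare "base" means "1.base"
    if index == 0:
        raise ValueError("segment numbers start at 1 or -1")
    # Selectors are 1-based from the front; negative ones count from the end.
    segment = match[index - 1] if index > 0 else match[index]
    return segment[attribute]


# Two invented matches of a query with an optional middle segment.
with_adverb = [{"base": "woda"}, {"base": "szybko"}, {"base": "płynąć"}]
without_adverb = [{"base": "woda"}, {"base": "płynąć"}]

for match in (with_adverb, without_adverb):
    print(resolve("2.base", match), resolve("-1.base", match))
```

When the match length varies, "2.base" picks the adverb in one match and the
verb in the other, whereas "-1.base" always picks the final segment.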

AMBIGUITIES

Each segment can have a number of interpretations. For example, the
Polish word "mam" can be a form of the verb "mieć" (to have) or the noun
"mama" (mom). In fact, only orth and length are attributes of the
segment itself -- all the other ones constitute its interpretation.

By default, one random interpretation of each segment is chosen for
grouping. But if you add "interp combine" after an attribute selector,
the value of the attribute will be calculated as a concatenation of all
the unique values of the attribute in all interpretations (separated by
a vertical bar -- "|").

For example:

[base~mieć] group by base interp combine

will return results like "mienie|mieć", "maić|mieć", etc.

Note: the results will vary significantly depending on the value of the
"Show only disambiguated results" option.
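The difference between the default behaviour and "interp combine" can be
sketched in Python. The segment layout and the attribute_value helper are
hypothetical, and the order in which the combined values are joined (sorted
by code point here) is an assumption, since this document does not specify
it.

```python
import random


def attribute_value(segment, attribute, combine=False, rng=random):
    """Pick the value of an interpretation-level attribute for grouping."""
    interpretations = segment["interps"]
    if combine:
        # All unique values across the interpretations, joined with "|".
        unique = sorted({interp[attribute] for interp in interpretations})
        return "|".join(unique)
    # Default: the value from one randomly chosen interpretation.
    return rng.choice(interpretations)[attribute]


# The ambiguous segment "mam" from the example above.
mam = {
    "orth": "mam",
    "interps": [
        {"base": "mieć", "pos": "fin"},
        {"base": "mama", "pos": "subst"},
    ],
}

combined = attribute_value(mam, "base", combine=True)
single = attribute_value(mam, "base")
print(combined, single)
```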

RESULTS SORTING AND SELECTION

... sort by freq -- by frequency, descending

... sort a fronte -- alphabetic order, ascending

... min N -- only results occurring at least N times
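As a rough Python illustration of these modifiers, applied to an invented
table of grouped counts (the function names are hypothetical, and plain code
point order stands in for whatever alphabetic order Poliqarp actually uses):

```python
# Invented result of a grouping: attribute values -> number of matches.
counts = {("korpusu",): 7, ("korpus",): 12, ("korpusy",): 1, ("korpusie",): 3}


def at_least(counts, n):
    """'min N': keep only results occurring at least n times."""
    return {key: c for key, c in counts.items() if c >= n}


def sort_by_freq(counts):
    """'sort by freq': most frequent results first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def sort_a_fronte(counts):
    """'sort a fronte': ascending alphabetic order of the grouped values."""
    return sorted(counts.items(), key=lambda item: item[0])


by_freq = sort_by_freq(at_least(counts, 2))
alphabetic = sort_a_fronte(counts)
print(by_freq)
print(alphabetic)
```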

SAMPLE SIZE

By default, statistics are created on the basis of 1000 random matches,
in order to quickly display approximate results. If you need more
accurate counts, the sample size can be modified by the keyword
"count". For example:

[base=korpus] group by orth sort a fronte count all

[] group by pos, orth sort by freq count 10000
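The trade-off behind "count" can be sketched in Python: statistics computed
on a random sample are quick but approximate, while "count all" uses every
match. The sample_matches helper, the seed parameter and the data are
hypothetical and do not reflect how Poliqarp draws its sample internally.

```python
import random
from collections import Counter


def sample_matches(matches, count=1000, seed=None):
    """Return `count` random matches, or all of them for count='all'."""
    if count == "all" or count >= len(matches):
        return list(matches)
    return random.Random(seed).sample(matches, count)


# Invented population of 10000 matches, reduced to their part of speech.
matches = ["subst"] * 6000 + ["verb"] * 3000 + ["adj"] * 1000

approximate = Counter(sample_matches(matches, seed=0))        # default size
exact = Counter(sample_matches(matches, count="all"))         # "count all"

print(approximate.most_common())
print(exact.most_common())
```

The sampled counts add up to the sample size, not to the total number of
matches, so they indicate proportions rather than absolute frequencies.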

PARTIAL GROUPING

Simple bigram frequencies:

[][] group by 1.base, 2.base sort by freq

These are often not enough. For example, in collocation detection, not
only the bigram frequency, but also the frequencies of its constituents
are taken into account. Therefore, a special separator -- semicolon ";"
-- has been introduced, which makes it possible to split the grouping
rules into two parts, for example:

[][] group by 1.base; 2.base sort by freq

will cause the program to group the results by:

- 1.base (the part before the semicolon)
- 2.base (the part after the semicolon)
- 1.base, 2.base (together)

For each line of the results of the last grouping, the results of
partial groupings will also be displayed.
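A minimal Python sketch of this two-part grouping, using invented bigrams of
base forms. The partial_grouping function is hypothetical; it only mirrors
the description above: one count per first part, one per second part, and
one per combination, reported together for every combination.

```python
from collections import Counter

# Invented matches of [][]: (1.base, 2.base) for each matched bigram.
bigrams = (
    [("czarny", "kawa")] * 3
    + [("czarny", "kot")] * 1
    + [("mocny", "kawa")] * 2
)


def partial_grouping(bigrams):
    """Group by the first part, the second part, and both together."""
    first_counts = Counter(first for first, _ in bigrams)
    second_counts = Counter(second for _, second in bigrams)
    joint_counts = Counter(bigrams)
    # One row per combination, most frequent first, with the partial
    # counts of its two constituents attached.
    return [
        (first, second, first_counts[first], second_counts[second], joint)
        for (first, second), joint in joint_counts.most_common()
    ]


rows = partial_grouping(bigrams)
for row in rows:
    print(*row, sep="\t")
```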

The modifiers "sort by" and "min N" apply to the last grouping. For
example, the query:

[][] group by 1.base; 2.base sort by freq

will return results in the same order as the query with a comma instead
of the semicolon.

Each of the grouping parts may include more than one attribute, but the
grouping may have no more than two parts. In other words: the query can
include any number of commas, but only zero or one semicolon.

Note: different grouping parts may reference the same segment, for
example:

[base=korpus] group by number; case

COLLOCATION FUNCTIONS

A few measures for statistical detection of collocations have been
implemented as a sorting parameter:

- sort by cp - conditional probability

- sort by scp - symmetric conditional probability

- sort by maxcp - maximum conditional probability

- sort by dice - Dice's formula

Example:

[pos!=interp]{2} group by base; 2.base sort by scp
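This document does not give the formulas behind these measures. The Python
sketch below uses definitions commonly found in the collocation literature,
expressed in terms of the counts of the first part (f_x), the second part
(f_y) and the whole grouping (f_xy). Treat them as assumptions: in
particular, the direction of the plain conditional probability and the exact
form of each measure in Poliqarp are not confirmed here.

```python
def cp(f_x, f_y, f_xy):
    """Conditional probability of the second part given the first (assumed)."""
    return f_xy / f_x


def scp(f_x, f_y, f_xy):
    """Symmetric conditional probability: P(y|x) * P(x|y)."""
    return f_xy ** 2 / (f_x * f_y)


def maxcp(f_x, f_y, f_xy):
    """The larger of the two conditional probabilities."""
    return max(f_xy / f_x, f_xy / f_y)


def dice(f_x, f_y, f_xy):
    """Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)


# Invented counts: x occurs 4 times, y 5 times, the bigram x y 3 times.
f_x, f_y, f_xy = 4, 5, 3
for measure in (cp, scp, maxcp, dice):
    print(measure.__name__, round(measure(f_x, f_y, f_xy), 3))
```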

Because dependency measures tend to favor rare bigrams, a minimum
frequency threshold is recommended, for example:

[pos!=interp]{2} group by base; 2.base sort by scp min 2

An alternative approach is to add some frequency bias to the dependency
test value, for example:

[pos!=interp]{2} group by base; 2.base sort by scp bias 0.5

"bias b" means "before sorting, multiply the value of the function by the
bigram frequency raised to the power b". The "bias" and "min" keywords can
be combined, for example:

[pos!=interp]{2} group by base; 2.base sort by scp bias 0.5 min 2
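The effect of "bias" and "min" on the ranking can be sketched in Python.
The scp formula used here is the common textbook definition and is an
assumption, as are the rank helper and the invented counts; only the rule
"multiply by the bigram frequency raised to the power b" comes from the text
above.

```python
def scp(f_x, f_y, f_xy):
    """Symmetric conditional probability (assumed textbook definition)."""
    return f_xy ** 2 / (f_x * f_y)


def rank(bigrams, bias=0.0, min_count=1):
    """Score bigrams by scp * f_xy**bias, dropping those below min_count."""
    scored = [
        (name, scp(f_x, f_y, f_xy) * f_xy ** bias)
        for name, f_x, f_y, f_xy in bigrams
        if f_xy >= min_count
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented counts: (label, f_x, f_y, f_xy).
bigrams = [
    ("rare pair", 1, 1, 1),            # seen once, both words unique
    ("frequent pair", 100, 120, 90),   # seen often, strongly associated
]

plain = rank(bigrams)                  # like "sort by scp"
biased = rank(bigrams, bias=0.5)       # like "sort by scp bias 0.5"
filtered = rank(bigrams, min_count=2)  # like "sort by scp min 2"
print(plain, biased, filtered, sep="\n")
```

Without any correction the bigram seen only once gets the maximum score;
the bias moves the frequent bigram to the top, and the threshold removes the
rare one altogether.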

RESULTS

Displayed results include up to 6 columns:

1. Concatenation of values of the first part of grouping (whole
  grouping, if no semicolon is used).

2. Concatenation of values of the second part of grouping.

3. Number of matches matching the given combination of values from the
  first part of the grouping.

4. Number of matches matching the given combination of values from the
  second part of the grouping.

5. Number of matches matching the given combination of values from the
  whole grouping.

6. Value of the collocation test (if used).

WIDE CONTEXT AND METADATA

After clicking a line of results, the program displays its wide context
or metadata of the selected match in the lower window. The same also
happens for statistical results, but the program has to select one
example from many matches included in that line. In version 1.1 it is
always the first suitable match found in the corpus.

TECHNICAL NOTES

Note 1. Once calculated, the results of the first part of the query can
be grouped many times without searching the corpus again. For example:

[base=woda] group by case

should compute significantly faster if issued directly after another
query containing "[base=woda]", for example:

[base=woda] group by number
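The behaviour described in Note 1 resembles keeping the matches of the most
recent search and regrouping them on demand. The Python sketch below is a
hypothetical model of that idea, not Poliqarp's code; the CachedSearcher
class, the fake_search function and the data are invented.

```python
from collections import Counter


class CachedSearcher:
    """Remember the matches of the last query and regroup them cheaply."""

    def __init__(self, search):
        self._search = search      # the expensive corpus search
        self._last_query = None
        self._last_matches = []
        self.searches = 0          # how many times the corpus was searched

    def group(self, query, attributes):
        if query != self._last_query:
            # Only a different first part of the query triggers a search.
            self._last_matches = self._search(query)
            self._last_query = query
            self.searches += 1
        return Counter(
            tuple(m[a] for a in attributes) for m in self._last_matches
        )


def fake_search(query):
    """Stand-in for a corpus search; returns invented matches."""
    return [
        {"number": "sg", "case": "nom"},
        {"number": "sg", "case": "gen"},
        {"number": "pl", "case": "nom"},
    ]


searcher = CachedSearcher(fake_search)
by_number = searcher.group("[base=woda]", ["number"])
by_case = searcher.group("[base=woda]", ["case"])  # reuses the matches
print(by_number, by_case, searcher.searches)
```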

Note 2. There is a hardcoded limit of 500000 results per query,
therefore the most generic queries, like:

[] group by pos count all

will be calculated only on the first 500000 segments of the corpus.




Creation Date: Thursday, November 2, 2006
Last Content Update: Thursday, November 2, 2006
Last Modified: Sat May 3 18:23:48 CEST 2008

Maintained by AP.