Poliqarp 1.1

This directory contains the beta version of Poliqarp 1.1 with statistical extensions.

See http://korpus.pl/index.php?page=poliqarp for the stable Poliqarp 1.0, as well as pointers to documentation describing the basic query syntax.

POLIQARP 1.1 - STATISTICAL EXTENSIONS TO QUERY SYNTAX

Aleksander Buczyński

INTRODUCTION

Poliqarp is a utility for searching large tagged corpora, which features an expressive query language and support for positional tagsets. Version 1.0 worked as a concordancer, which returned a list of matches with contexts of selected width as a response to every query. Version 1.1 introduces the possibility to make statistical queries, for example:

- what is the frequency distribution of a given word's forms?
- what parts of speech occur directly after a given word?
- what verbs are used most often in a given style?

The new keyword "group by", followed by a list of attributes of a given segment (word), tells the program that the user is not interested in contexts of the occurrences of each match, but rather in the frequency distribution of the specified attributes' values.

SIMPLE QUERIES

Frequency of forms of the word "korpus" (corpus):

    [base=korpus] group by orth

As above, but with a distinction between homonymous cases:

    [base=korpus] group by number, case

Or:

    [base=woda] group by number, case, orth

Frequency of parts of speech:

    [] group by pos

Frequency of parts of speech in books by Prus:

    [] meta autor=Prus group by pos

The results of the first part of the query (before group by) can be grouped according to a set of segment attributes (specified after group by). Some attributes -- orth, length, base, pos -- are universal for all corpora and segments, others -- like number, case, degree or gender -- may be dependent on the tagset used in the corpus. The examples given in this document are based on the tagset used in the IPI PAN Corpus of Polish.

HANDLING MULTIWORD MATCHES

Poliqarp queries can return matches longer than a single segment. To specify which segment's attribute should be used for grouping, you should add the segment number before the name of the attribute.
For example, to find verbs occurring right after the word "woda" (water), you could write:

    [base=woda][pos=verb] group by 2.base

As above, but allowing an optional adverb between "woda" and the verb:

    [base=woda][pos=adv]{0,1}[pos=fin] group by -1.base

The prefix "-1." specifies the last segment of the result. Similarly, "-2." is the second to last, "-3." the third to last, etc.

Three consecutive adverbs:

    [pos=adv]{3} group by 1.base, 2.base, 3.base sort by freq

The prefix "1." can be skipped:

    [pos=adv]{3} group by base, 2.base, 3.base sort by freq

AMBIGUITIES

Each segment can have a number of interpretations. For example, the Polish word "mam" can be a form of the verb "mieć" (to have) or the noun "mama" (mom). In fact, only orth and length are attributes of the segment itself -- all the other ones belong to its interpretation.

By default, one random interpretation of each segment is chosen for grouping. But if you add "interp combine" after an attribute selector, the value of the attribute will be calculated as a concatenation of all the unique values of the attribute in all interpretations (separated by a vertical bar -- "|"). For example:

    [base~mieć] group by base interp combine

will return results like "mienie|mieć", "maić|mieć", etc.

Note: the results will vary significantly depending on the value of the "Show only disambiguated results" option.

RESULTS SORTING AND SELECTION

    ... sort by freq    -- by frequency, descending
    ... sort a fronte   -- alphabetic order, ascending
    ... min N           -- only results occurring at least N times

SAMPLE SIZE

By default, statistics are created on the basis of 1000 random matches, in order to quickly display approximate results. If you need more accurate counts, the sample size can be modified by the keyword "count". For example:

    [base=korpus] group by orth sort a fronte count all
    [] group by pos orth sort by freq count 10000

PARTIAL GROUPING

Simple bigram frequencies:

    [][] group by 1.base, 2.base sort by freq

These are often not enough.
For example, in collocation detection, not only the bigram frequency, but also the frequencies of its constituents are taken into account. Therefore, a special separator -- the semicolon ";" -- has been introduced, which makes it possible to split the grouping rules into two parts. For example:

    [][] group by 1.base; 2.base sort by freq

will cause the program to group the results by:

- 1.base (the part before the semicolon)
- 2.base (the part after the semicolon)
- 1.base, 2.base (together)

For each line of the results of the last grouping, the results of partial groupings will also be displayed. Modifiers "sort by" and "min N" are applied to the last grouping. For example, the query:

    [][] group by 1.base; 2.base sort by freq

will return results in the same order as the query with a comma instead of the semicolon.

Each of the grouping parts may include more than one attribute, but the grouping may have no more than two parts. In other words: the query can include any number of commas, but only zero or one semicolon.

Note: different grouping parts may reference the same segment, for example:

    [base=korpus] group by number; case

COLLOCATION FUNCTIONS

A few measures for statistical detection of collocations have been implemented as a sorting parameter:

- sort by cp    -- conditional probability
- sort by scp   -- symmetric conditional probability
- sort by maxcp -- maximum conditional probability
- sort by dice  -- Dice's formula

Example:

    [pos!=interp]{2} group by base; 2.base sort by scp

Because dependency measures tend to favor rare bigrams, a minimum frequency threshold is recommended, for example:

    [pos!=interp]{2} group by base; 2.base sort by scp min 2

An alternative approach is to add some frequency bias to the dependency test value, for example:

    [pos!=interp]{2} group by base; 2.base sort by scp bias 0.5

"bias b" means "before sorting, multiply the function results by power b of bigram frequency".
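
To make the effect of "bias" concrete, here is a minimal Python sketch. It is not Poliqarp's code: this document does not give the exact formulas behind cp, scp, maxcp and dice, so the sketch assumes the common textbook definitions, and it assumes that cp is conditioned on the first segment. The function name collocation_scores and the sample frequencies are made up for illustration. Only the bias step (multiplying by the bigram frequency raised to the power b) follows the description above.

```python
def collocation_scores(f_xy, f_x, f_y, bias=0.0):
    """Score one bigram from its frequency f_xy and the frequencies
    f_x, f_y of its first and second constituent.

    ASSUMPTION: textbook definitions of the measures; the formulas
    actually used by Poliqarp are not stated in this document.
    """
    p_second_given_first = f_xy / f_x
    p_first_given_second = f_xy / f_y
    scores = {
        "cp": p_second_given_first,  # assumed direction of conditioning
        "scp": p_second_given_first * p_first_given_second,
        "maxcp": max(p_second_given_first, p_first_given_second),
        "dice": 2 * f_xy / (f_x + f_y),
    }
    # "bias b": multiply each result by the bigram frequency to the power b.
    weight = f_xy ** bias
    return {name: value * weight for name, value in scores.items()}


# A bigram seen once, made of two words that occur only in that bigram,
# versus a bigram seen 50 times whose words each occur 100 times.
rare = collocation_scores(f_xy=1, f_x=1, f_y=1)
common = collocation_scores(f_xy=50, f_x=100, f_y=100)
rare_biased = collocation_scores(f_xy=1, f_x=1, f_y=1, bias=0.5)
common_biased = collocation_scores(f_xy=50, f_x=100, f_y=100, bias=0.5)
```

Under these assumptions the rare bigram outranks the common one on plain scp, and the order flips once a bias of 0.5 is applied, which is the behaviour the threshold and bias options are meant to control.
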
Of course, the "bias" and "min" keywords can be combined, for example:

    [pos!=interp]{2} group by base; 2.base sort by scp bias 0.5 min 2

RESULTS

Displayed results include up to 6 columns:

1. Concatenation of values of the first part of grouping (whole grouping, if no semicolon is used).
2. Concatenation of values of the second part of grouping.
3. Number of matches matching the given combination of values from the first part of the grouping.
4. Number of matches matching the given combination of values from the second part of the grouping.
5. Number of matches matching the given combination of values from the whole grouping.
6. Value of the collocation test (if used).

WIDE CONTEXT AND METADATA

After clicking a line of results, the program displays the wide context or the metadata of the selected match in the lower window. The same also happens for statistical results, but the program has to select one example from the many matches included in that line. In version 1.1 it is always the first suitable match found in the corpus.

TECHNICAL NOTES

Note 1. Once calculated, the results of the first part of the query can be grouped many times without searching the corpus again. For example:

    [base=woda] group by case

should compute significantly faster if asked directly after another query containing "[base=woda]", for example:

    [base=woda] group by number

Note 2. There is a hardcoded limit of 500000 results per query, therefore the most generic queries, like:

    [] group by pos count all

will be calculated only on the first 0.5 million segments of the corpus.
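
As a rough illustration of how the first five result columns relate to a query with a semicolon, the following Python sketch counts a list of matches three ways: by the first grouping part, by the second part, and by both together. It is a sketch only, not Poliqarp's implementation; the function name partial_grouping, its min_freq parameter (standing in for "min N") and the sample base forms are invented for the example.

```python
from collections import Counter


def partial_grouping(matches, min_freq=1):
    """Build result rows for a two-part grouping.

    matches: iterable of (first_part_value, second_part_value) pairs,
    one pair per match. Returns rows holding columns 1-5 of the result
    layout, sorted by whole-grouping frequency, descending.
    """
    matches = list(matches)
    first = Counter(a for a, _ in matches)   # column 3
    second = Counter(b for _, b in matches)  # column 4
    both = Counter(matches)                  # column 5
    return [
        (a, b, first[a], second[b], n)
        for (a, b), n in both.most_common()
        if n >= min_freq  # "min N" applies to the whole grouping
    ]


# Hypothetical bigram matches, given as (1.base, 2.base) pairs.
sample = [
    ("zimny", "woda"),
    ("zimny", "woda"),
    ("zimny", "piwo"),
    ("czysty", "woda"),
]
rows = partial_grouping(sample)
frequent_rows = partial_grouping(sample, min_freq=2)
```

Here the row for ("zimny", "woda") carries the partial counts 3 and 3 next to the bigram count 2, which is the information a collocation measure needs for that line.
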

Creation Date: Thursday, November 2, 2006
Last Content Update: Thursday, November 2, 2006
Last Modified: Sat May 3 18:23:48 CEST 2008

Maintained by AP.

The Linguistic Engineering Group
