The Potential of the IPI PAN Corpus

Adam Przepiórkowski

In: PSiCL 41, pp. 31-48. 2005.


The aim of this article is to present the IPI PAN Corpus, a large, morphosyntactically annotated, XML-encoded corpus of Polish developed at the Institute of Computer Science, Polish Academy of Sciences, and to describe its potential and actual use in natural language processing and in linguistic research. In particular, the article provides various quantitative information about the corpus and its publicly available subcorpora, including sizes in terms of orthographic words and interpretable segments, tagset size measured in types and tokens, etc., as well as information reflecting interesting facts about Polish, i.e., frequencies of words of different lengths, frequencies of grammatical classes and some grammatical categories, and initial frequency lists of Polish lemmata.

Electronically available format (pre-final version):

BibTeX entry:

@string{psicl = "Poznań Studies in Contemporary Linguistics"}
@article{przepiorkowski2006potential,
  author =       "Adam Przepiórkowski",
  title =        "The Potential of the {IPI PAN} Corpus",
  journal =      psicl,
  volume =       41,
  pages =        "31--48",
  year =         2006}


Creation Date: Wednesday, August 25, 2004
Last Modified: Thu Jan 24 12:33:58 CET 2008