The IPI PAN Corpus in Numbers

Adam Przepiórkowski

In the Proceedings of the 2nd Language & Technology Conference, Poznań, Poland, 2005. This is a shorter version of this article.

Abstract:

The aim of this article is to present the IPI PAN Corpus (cf. http://korpus.pl/), a large, morphosyntactically annotated, XML-encoded corpus of Polish developed at the Institute of Computer Science, Polish Academy of Sciences. The article gives various quantitative information about the corpus and its publicly available subcorpora, including sizes in terms of orthographic words and interpretable segments and tagset size measured in types and tokens. It also gives information reflecting interesting facts about Polish, namely frequencies of words of different lengths and frequencies of grammatical classes and some grammatical categories.

Electronically available format:

PDF (76 KB)

BibTeX entry:

@InProceedings{prze:05c,
  author =       "Adam Przepiórkowski",
  title =        "The {IPI PAN} {C}orpus in Numbers",
  booktitle =    "Proceedings of the \emph{2nd Language \& Technology
                  Conference}",
  year =         2005,
  editor =       "Zygmunt Vetulani",
  address =      "Poznań, Poland"}

Creation Date: Monday, May 2, 2005
Last Modified: Tue Jun 7 22:24:16 CEST 2005