The Linguistic Engineering Group

TEI P5 Encoding of the National Corpus of Polish

This page offers examples of the TEI P5 encoding of various layers in the National Corpus of Polish (NKJP), as of 6 May 2011. The schemata and the examples may still change without notice. Examples in the "file in NKJP" column come from the full corpus (i.e., their linguistic annotations are generated automatically by the appropriate tools), while those in the "file in 1M" column come from the manually annotated 1-million-word subcorpus of NKJP.

file in NKJP	file in 1M	ODD	RelaxNG / DTD
text_structure.xml	–	NKJP_structure.xml	NKJP_structure.rng
–	text.xml	NKJP_text.xml	NKJP_text.rng
ann_segmentation.xml	ann_segmentation.xml	NKJP_segmentation.xml	NKJP_segmentation.rng
ann_morphosyntax.xml	ann_morphosyntax.xml	NKJP_morphosyntax.xml	NKJP_morphosyntax.rng
ann_senses.xml	ann_senses.xml	NKJP_senses.xml	NKJP_senses.rng
ann_words.xml	ann_words.xml	NKJP_words.xml	NKJP_words.rng
ann_named.xml	ann_named.xml	NKJP_named.xml	NKJP_named.rng
ann_groups.xml	ann_groups.xml	NKJP_groups.xml	NKJP_groups.rng
header.xml	header.xml	–	header.dtd

Corpus headers:

NKJP header: NKJP_header.xml (definitely not a final version!),
1M header: NKJP_1M_header.xml (definitely not a final version!).

How to validate particular files (with xmllint on Linux):

$ xmllint --noout --valid header.xml
$ xmllint --noout --xinclude --relaxng NKJP_text.rng text.xml
$ xmllint --noout --xinclude --relaxng NKJP_structure.rng text_structure.xml
$ xmllint --noout --xinclude --relaxng NKJP_segmentation.rng ann_segmentation.xml
$ xmllint --noout --xinclude --relaxng NKJP_morphosyntax.rng ann_morphosyntax.xml
$ xmllint --noout --xinclude --relaxng NKJP_senses.rng ann_senses.xml
$ xmllint --noout --xinclude --relaxng NKJP_words.rng ann_words.xml
$ xmllint --noout --xinclude --relaxng NKJP_groups.rng ann_groups.xml
$ xmllint --noout --xinclude --relaxng NKJP_named.rng ann_named.xml
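
The commands above follow a regular pattern, so they can be wrapped in a small script. The following is only a sketch, assuming that the schemata and the files to be validated sit in the current directory; the function names schema_for and validate_file are hypothetical and not part of the NKJP distribution. As in the commands above, header.xml is checked with --valid, i.e., against the DTD named in its own DOCTYPE declaration, so header.dtd is not passed on the command line.

```shell
#!/bin/sh
# Hypothetical helper: print the schema paired with a given file name
# in the table above. Returns 1 for file names not listed there.
schema_for() {
    case "$1" in
        header.xml)         echo "header.dtd" ;;
        text.xml)           echo "NKJP_text.rng" ;;
        text_structure.xml) echo "NKJP_structure.rng" ;;
        ann_segmentation.xml|ann_morphosyntax.xml|ann_senses.xml|\
        ann_words.xml|ann_named.xml|ann_groups.xml)
            layer=${1#ann_}        # strip the "ann_" prefix
            layer=${layer%.xml}    # strip the ".xml" suffix
            echo "NKJP_${layer}.rng" ;;
        *) return 1 ;;
    esac
}

# Hypothetical helper: run the xmllint command shown above for one file.
validate_file() {
    schema=$(schema_for "$1") || {
        echo "no schema known for $1" >&2
        return 2
    }
    case "$schema" in
        # DTD validation relies on the DOCTYPE inside the file itself.
        *.dtd) xmllint --noout --valid "$1" ;;
        *)     xmllint --noout --xinclude --relaxng "$schema" "$1" ;;
    esac
}

# Validate every file named on the command line (no-op without arguments).
for f in "$@"; do
    validate_file "$f" || echo "validation failed: $f" >&2
done
```

Saved under a name of your choice (say, validate.sh, again hypothetical), it could be run as: sh validate.sh header.xml text.xml ann_segmentation.xml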

Papers:

Adam Przepiórkowski. (2009). TEI P5 as an XML Standard for Treebank Encoding. In: Marco Passarotti, Adam Przepiórkowski, Savina Raynaud and Frank Van Eynde, eds., Proceedings of the Eighth International Workshop on Treebanks and Linguistic Theories (TLT8), pp. 149-160.
Adam Przepiórkowski and Piotr Bański. (2009). Which XML standards for multilevel corpus annotation? In the proceedings of LTC 2009, pp. 245-250.
Adam Przepiórkowski and Piotr Bański. (2009). XML Text Interchange Format in the National Corpus of Polish. To appear in the proceedings of PALC 2009.

Tools:

narzedzia.tgz – an in-house Windows tool for downloading HTML files and converting them into the TEI4NKJP XML format; instructions are available only in Polish (Windows code page 1250); YMMV…


Last Modified: Sun Aug 7 22:13:32 CEST 2011

Maintained by AP.