AUTHOR: Benoît Sagot
AFFILIATION: INRIA Rocquencourt / Université Paris 7
TITLE: Parsing French: resources, formalisms, parsers
ABSTRACT:
This talk will present three lines of work that share a common goal:
developing large-coverage parsers that are both efficient and
linguistically relevant, without the help of hand-crafted resources.
I will focus on French, for which no existing resource is currently
comparable to the Penn Treebank (both in terms of size and
richness), and for which syntactic lexicons are still an active area
of research.
Firstly, I will present and illustrate the general idea that
statistical analysis of automatically generated data (raw corpora,
tagged corpora, parsing results from fully symbolic parsers) can be
the basis for the development of rich and large-coverage resources
(lexicons, grammars, ...). To illustrate this point, I will briefly
describe three of the techniques I developed to apply this idea: a
method for automatically acquiring morphological lexicons, a simple
pattern-matching approach for inducing specific syntactic properties,
and an error-mining technique applied to parser output.
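
As a rough illustration of what error mining in parser output can look
like, the Python sketch below ranks word forms by how often they occur
in sentences the parser failed to analyse. The function name
`suspicion_rates`, the `min_count` threshold and the toy corpus are
assumptions made for this example; it computes a simple frequency
ratio and is not necessarily the technique presented in the talk.

```python
from collections import Counter


def suspicion_rates(sentences, min_count=2):
    """Return, for each word form, the share of its sentences that failed to parse.

    `sentences` is an iterable of (tokens, parsed_ok) pairs, where
    `parsed_ok` says whether the parser produced a full analysis.
    Forms seen in fewer than `min_count` sentences are ignored.
    """
    total = Counter()   # number of sentences containing each form
    failed = Counter()  # number of unparsed sentences containing each form
    for tokens, parsed_ok in sentences:
        for form in set(tokens):  # count each form once per sentence
            total[form] += 1
            if not parsed_ok:
                failed[form] += 1
    return {
        form: failed[form] / total[form]
        for form in total
        if total[form] >= min_count
    }


# Hypothetical toy corpus: "zorgle" stands for a form missing from the lexicon.
corpus = [
    (["le", "chat", "dort"], True),
    (["le", "chien", "dort"], True),
    (["le", "chat", "zorgle"], False),
    (["un", "chien", "zorgle"], False),
]
rates = suspicion_rates(corpus)
most_suspicious = max(rates, key=rates.get)
```

Forms with a high rate are candidates for a lexicon or grammar gap and
can be inspected by hand.
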
Secondly, I will present SxLFG, an efficient and robust parser
generator for LFG developed in collaboration with Pierre Boullier. I
will illustrate this by presenting our parsing system for French that
relies on SxLFG, on a large-coverage LFG grammar for French we
developed, and on the pre-syntactic processing chain SxPipe. In
particular, I will show that the efficiency of SxLFG allows the
parsing of large corpora without any probabilistic information, thus
creating annotated corpora on which probabilistic models can be
bootstrapped (for parsers, taggers, ...). I will give some insight into
ongoing work on learning such bootstrapped models and using them within
SxLFG, as well as on cascading modules that operate on the constituency
forest (chunk-based filter, probabilistic disambiguation on the
forest, f-structure computation, with possible backtracking in case of
failure).
Thirdly, I will briefly describe some limitations of standard
two-stage formalisms such as LFG, TAG or others ("linear" syntactic
backbone + unification-based decorations), and I will propose another
possible approach, namely "non-linear" formalisms ("non-linear"
meaning here "closed under intersection"). I will introduce Range
Concatenation Grammars (RCGs), a non-linear formalism parsable in
polynomial time, and explain why it is well suited to modeling natural
languages. In particular, I will briefly describe the medium-coverage
grammar I developed in a (still polynomial) syntactic extension of
RCGs, and how standard analyses (constituency, dependency,
topological boxes, predicate-argument semantics) can be obtained as
partial projections of the full analysis. I will conclude with
possible ways to merge the efficiency, robustness and large coverage
of our SxLFG-based parser with the non-linearity and linguistic
relevance of RCGs.
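
To make "closed under intersection" concrete: in an RCG, a clause such
as S(X) -> A(X) B(X) requires the same range X to be accepted by both
A and B, so S recognises the intersection of their languages. The
Python sketch below mimics this with two hand-written recognisers; the
function names are assumptions made for this example, and it is not an
RCG parser. It shows that intersecting a^n b^n c^m with a^m b^n c^n
gives a^n b^n c^n, a language that is not context-free.

```python
import re

# Splits a string into its runs of a's, b's and c's, in that order.
_ABC = re.compile(r"(a*)(b*)(c*)")


def anbn_cstar(s):
    """Recognise a^n b^n c^m (a context-free language)."""
    m = _ABC.fullmatch(s)
    return bool(m) and len(m.group(1)) == len(m.group(2))


def astar_bncn(s):
    """Recognise a^m b^n c^n (a context-free language)."""
    m = _ABC.fullmatch(s)
    return bool(m) and len(m.group(2)) == len(m.group(3))


def anbncn(s):
    """Recognise a^n b^n c^n as the intersection of the two languages above.

    Both predicates are applied to the same input, which is what the
    clause S(X) -> A(X) B(X) expresses.
    """
    return anbn_cstar(s) and astar_bncn(s)
```

For instance, "aabbcc" is accepted by both predicates and therefore by
their intersection, whereas "aabbc" is accepted by the first one only.
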