Grammar-based Treebanking
Stefanie Dipper
IMS, Universität Stuttgart

This talk reports first results on grammar-based treebanking in the context of the TIGER project. Besides the annotation tool annotate, which is based on a statistical tagger and parser, a broad-coverage symbolic LFG grammar is used to parse German newspaper text. After parsing, a transfer component converts the grammar's output into the TIGER treebank export format (cf. the workshop contribution by Jonas Kuhn et al.).

The LFG grammar used for parsing has been developed with the Xerox Linguistic Environment (XLE). The output of an LFG grammar basically consists of two representations: the constituent structure (c-structure) of the sentence being parsed, and its functional structure (f-structure). For an ambiguous sentence, XLE allows the different readings to be "packed" into one complex f-structure representation.

Usually, the grammar output is ambiguous. However, XLE provides a (non-statistical) mechanism for suppressing certain ambiguities automatically. This mechanism reduces the number of ambiguities considerably; the remaining ambiguities (on average 6 analyses per parsed sentence, median: 2 analyses) have to be resolved by a human annotator.
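
The gap between the average (6) and the median (2) suggests a skewed distribution, in which a few highly ambiguous sentences raise the average while most sentences keep only one or two analyses. The following minimal sketch illustrates this effect. The counts in it are hypothetical, invented solely to reproduce the two figures above, and are not data from the TIGER experiments.

```python
from statistics import mean, median

# Hypothetical numbers of remaining analyses for ten parsed sentences.
# These values are invented for illustration; they are NOT taken from
# the TIGER experiments.
analyses_per_sentence = [1, 1, 2, 2, 2, 2, 3, 4, 13, 30]

avg = mean(analyses_per_sentence)    # pulled upwards by the two outliers
med = median(analyses_per_sentence)  # unaffected by the outliers

print(f"mean = {avg}, median = {med}")
```

Under this assumption, the median describes the typical workload per sentence for the annotator, whereas the mean is dominated by the rare sentences with many remaining readings.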

In two experiments, the grammar's performance was investigated. The first addressed coverage and robustness, which are weak points in grammar-based annotation. However, after some automatic text preprocessing (such as adding header markers), the grammar's performance improved considerably.

In the second experiment, the grammar's analyses were evaluated. More than 80% of the parses contained the correct reading. Further preprocessing steps, such as completing the grammar's lexicon by extracting unknown words from the corpus, will certainly improve accuracy.

