From LFG Structures to TIGER Treebank Annotations
Jonas Kuhn, Heike Zinsmeister & Martin Emele
IMS, Universität Stuttgart

Annotating large newspaper corpora by hand is a time-consuming and costly task. Reconstructing syntactic annotations from already parsed corpora therefore seems to be an attractive alternative.

In the context of the TIGER project, one line of research explores how a large-scale LFG grammar of German can be used in syntactic annotation of a newspaper corpus (cf. Stefanie Dipper's workshop contribution). The Xerox Linguistic Environment (XLE) is used (i) to parse sentences from the corpus, and (ii) to explore the space of solutions assigned to each sentence by the grammar, using various browsing tools (cf. [King et al. 2000]). The annotator selects the correct reading in the given context (the presentation of choices may be guided by a statistical model trained for the grammar, [Riezler et al. 2000]).

In this talk we address the issue of how the representation format output by the parser can be converted to the format specified by the NEGRA/TIGER annotation scheme [Skut et al. 1997]. The annotation scheme includes a subset of the information given by the LFG grammar, organized in a different but related way (using primarily grammatical functions for structuring). The conversion routine should allow for a declarative specification of the mapping criteria. A further requirement is that it should be possible to respond to modifications of the grammar or the annotation scheme with local adjustments of the mapping criteria. Strictly mechanical, format-related conversion steps should be separated from steps involving structural re-organization.
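As an illustration only, a declarative mapping could be kept as a data table separate from the code that applies it, so that a change in the grammar or the annotation scheme only touches the table. The sketch below is not the conversion routine described in this talk: the function names, the triple format, and the particular pairing of LFG functions with NEGRA-style edge labels are simplified, hypothetical assumptions.

```python
# Hypothetical mapping table: LFG grammatical function -> NEGRA-style edge label.
# The pairings are a simplified assumption, not the actual mapping criteria.
FUNCTION_MAP = {
    "SUBJ": "SB",
    "OBJ": "OA",
    "ADJUNCT": "MO",
}

def relabel(f_structure_edges, mapping=FUNCTION_MAP):
    """Mechanical step: relabel (head, function, dependent) triples via the table.

    Triples whose function is not in the table are returned separately, so that
    they can be passed on to rules that perform structural re-organization.
    """
    converted, unhandled = [], []
    for head, function, dependent in f_structure_edges:
        if function in mapping:
            converted.append((head, mapping[function], dependent))
        else:
            unhandled.append((head, function, dependent))
    return converted, unhandled

# Made-up input triples for illustration.
edges = [
    ("sehen", "SUBJ", "Hans"),
    ("sehen", "OBJ", "Maria"),
    ("sehen", "XCOMP", "kommen"),
]
converted, unhandled = relabel(edges)
```

Keeping the table apart from `relabel` mirrors the requirement that mechanical conversion be separated from structural re-organization: entries the table does not cover are not forced into a label but handed on unchanged.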

Differing conceptualizations of linguistic phenomena within the two systems lead to structural mismatches, similar to the mismatches that occur in translation between different natural language representations. It can be demonstrated that, although the formats are related, the LFG -> NEGRA/TIGER conversion involves some non-trivial transformation tasks, which makes it useful to employ a transfer system for relating source and target representations.

We argue that existing transfer rewriting systems (as developed for multilingual transfer based on sets of feature structure descriptions, cf. [Emele & Dorna 1998], [Kay 1999]) fulfill the requirements for such a component; thus, format conversion can take advantage of systems that have been integrated and tested already.

The use of a transfer component raises a particular issue concerning access to tree configurations in the source (LFG) representation. It appears as if transformation rules would have to be specified with conditions based on recursively defined tree relations (which would exceed the expressive power of non-recursive rewriting systems). However, we point out that it is possible to precompute the relevant non-primitive relations (essentially a kind of transitive closure), such that the existing transfer component with all its advantages can be used after a canonical preprocessing step.
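The preprocessing idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that the source tree is available as a set of hypothetical `(mother, daughter)` facts. The transitive closure of this immediate-dominance relation is computed once, so that a rule condition can test dominance by a plain lookup instead of by recursion. All names and the example tree are illustrative.

```python
from collections import defaultdict

def transitive_closure(edges):
    """Return all (ancestor, descendant) pairs implied by (mother, daughter) edges."""
    children = defaultdict(set)
    for mother, daughter in edges:
        children[mother].add(daughter)

    closure = set()
    for start in list(children):
        # Iterative depth-first walk from each node that has daughters.
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure

# Hypothetical immediate-dominance facts for a small tree.
immediate = {("S", "NP"), ("S", "VP"), ("VP", "V"), ("VP", "NP2"), ("NP2", "N")}

# Canonical preprocessing step: computed once, before any rule is applied.
dominates = transitive_closure(immediate)

def rule_applies(facts, upper, lower):
    # Non-recursive condition: a single lookup in the precomputed relation.
    return (upper, lower) in facts
```

With the closure stored as ordinary facts, a condition such as `rule_applies(dominates, "S", "N")` needs no recursive definition at rule-application time, which is the property that lets a non-recursive rewriting system be used.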

References

[Emele & Dorna 1998]
Martin C. Emele and Michael Dorna. 1998. Ambiguity Preserving Machine Translation using Packed Representations. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-ACL '98), Montreal, Canada.
[Kay 1999]
Martin Kay. 1999. Chart Translation. In Proceedings of the Machine Translation Summit VII '99, Singapore.
[King et al. 2000]
Tracy Holloway King, Anette Frank, Jonas Kuhn, John Maxwell, Stefanie Dipper. 2000. Ambiguity Management in Grammar Writing. Ms. Xerox PARC, Xerox Research Centre Europe, IMS Stuttgart.
[Skut et al. 1997]
Wojciech Skut, Brigitte Krenn, Thorsten Brants, Hans Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of ANLP-97, Washington.
[Riezler et al. 2000]
Stefan Riezler, Detlef Prescher, Jonas Kuhn, Mark Johnson. 2000. Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training. Ms. IMS Stuttgart.
