The Tübingen VERBMOBIL Treebanks

E. Hinrichs
J. Bartels
Y. Kawata
E. Kordoni
H. Telljohann

The VERBMOBIL Treebanks of German, English, and Japanese are under construction as part of the VERBMOBIL project, whose overriding goal is to develop a speaker-independent system for the translation of spontaneous speech. Within this language technology project, the treebanks provide training data for machine translation modules and stochastic parsers. The three corpora consist of syntactic tree structures, annotated semi-automatically with the graphical annotation tool Annotate V2.3 (Plaehn 1998), which was developed in the NEGRA project of SFB 378 (Brants and Skut 1998) at the Universität des Saarlandes. Compared to entirely manual treebank construction, semi-automatic annotation can help to reduce the number of inconsistencies and annotation errors that inevitably arise in any treebank of significant size. This semi-automatic method also differs from the one used in the Penn Treebank, for instance, where human correction follows fully automatic parsing.

The annotated trees are based on data transcribed from spoken language dialogues in the following scenarios: appointment negotiation, travel planning, and hotel reservation. In contrast to written language, spontaneous speech is difficult to segment into sentences, because the specific characteristics of spoken dialogue have to be taken into account: repetitions, hesitations, false starts, etc. For this reason the dialogue turn, an uninterrupted contribution by one dialogue participant consisting of one or more sentences and/or phrases, has been defined as the primary domain of syntactic analysis and annotation.
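The notion of a dialogue turn as the unit of annotation can be illustrated with a minimal sketch. It assumes a transcript is available as a sequence of speaker-labelled segments; the `Segment` and `Turn` types, the `group_into_turns` function, and the sample utterances are hypothetical and are not taken from the VERBMOBIL data or tools.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Segment:
    """One transcribed stretch of speech (hypothetical input format)."""
    speaker: str
    text: str


@dataclass
class Turn:
    """An uninterrupted contribution by one dialogue participant."""
    speaker: str
    segments: List[str]


def group_into_turns(segments: List[Segment]) -> List[Turn]:
    """Merge consecutive segments by the same speaker into one turn."""
    turns = []
    for speaker, run in groupby(segments, key=lambda s: s.speaker):
        turns.append(Turn(speaker, [s.text for s in run]))
    return turns


# Invented sample dialogue with a repetition and a hesitation.
dialogue = [
    Segment("A", "well uh"),
    Segment("A", "when when would you have time"),
    Segment("B", "on Monday"),
    Segment("A", "good"),
]

turns = group_into_turns(dialogue)
for turn in turns:
    print(turn.speaker, turn.segments)
```

In this sketch the first two segments form a single turn, so a false start or repetition does not have to be forced into a sentence of its own before syntactic annotation begins.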

While basic design principles (e.g. the Longest Match Principle and the Flat Clustering Principle) are applied to all three treebanks, language-specific annotation schemes and guidelines have been developed for each of them. The linguistic inventory is based on a minimal set of assumptions concerning constituenthood, phrase attachment, and grammatical functions that are uncontroversial among the major syntactic theories. This ensures that the annotated data can be reused in both science and engineering.

The German treebank has exceeded 30 000 fully annotated trees. Based on the STTS tagset (Schiller et al. 1995), its annotation scheme adopts the theory of topological fields (Höhle 1985) as the primary clustering principle of a German sentence. The English treebank is approaching 30 000 trees. It adopts the Penn Treebank tagset, and its annotation scheme is HPSG-oriented, in accordance with the HPSG grammar of English developed for use in VERBMOBIL in the CSLI LinGO (Linguistic Grammars Online) Project. The Japanese treebank is approaching 20 000 trees. Its strings are transcribed into Roman characters, and its annotation scheme does not assume a specific theory but only a variety of context-free techniques in NLP, so that the data will be available to a wider group of researchers.

