TüBa-D/Z Release 9.1 (12/2014)

The TüBa-D/Z treebank is a syntactically annotated German newspaper corpus based on data taken from the daily issues of 'die tageszeitung' (taz). The treebank currently comprises 85,358 sentences (1,569,916 tokens; 3,444 newspaper articles). The annotation is performed manually. This is work in progress, and releases of more data will follow.

 

What's new in Release 9.1?

This minor release adds 17,910 manual annotations for a selected set of lemmas (30 nouns, 79 verbs), each annotated with its corresponding sense in the German wordnet GermaNet. The goal is to provide a gold standard for word sense disambiguation. See the word sense annotation page for more information. Please note that no new sentences have been added between release 9.0 and release 9.1. Only those formats that support word sense annotation are part of this minor release (NEGRA Export 3 and 4, CoNLL 2011/2012, Export XML). Other formats remain unchanged and can be obtained from release 9.0.

 

View and Search:

Browse and search the TüBa-D/Z treebank using the TüNDRA treebank search web application. Institutional login or CLARIN account is required.

Annotation layers:

The annotation comprises information on

  • inflectional morphology
  • lemmas
  • syntactic constituency
  • grammatical functions
  • (complex) named entities incl. semantic classification (organisation, person, location, geo-political entity, and other)
  • anaphora and coreference relations
  • GermaNet word senses
  • dependency relations (automatically created)
  • chunk annotation (automatically created)

 

The syntactic annotation is based on assumptions which are uncontroversial within major syntactic theories. The annotation scheme distinguishes four levels of syntactic constituency:

  • the lexical level
  • the phrasal level
  • the level of topological fields
  • the clausal level

 

The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted by descriptive linguists of German. In addition to constituent structure, annotated trees contain edge labels between nodes. These edge labels encode grammatical functions (as relation between phrases) and the distinction between heads and non-heads (as phrase-internal relations).

 

The annotation scheme is surface-oriented in that it relies on a context-free backbone and uses neither crossing branches nor traces. Instead, it describes long-distance relations by specific functional labels.
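
The layering described above can be illustrated with a small sketch. It is not taken from the treebank or its tools: the `Node` class, the helper `fields`, and the example sentence are hypothetical, and the category, field, and edge labels used here (SIMPX, VF, LK, MF, NX, VXFIN, ON, OA, HD, "-") are assumptions that should be checked against the stylebook. The sketch shows a clause node dominating topological fields, which dominate phrases, which dominate tagged words, with a functional or head label on each edge.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    label: str                   # clause, field, or phrase category, or a POS tag
    edge: str = "-"              # edge label to the parent ("-" = non-head, assumed)
    word: Optional[str] = None   # set only on terminals (lexical level)
    children: List["Node"] = field(default_factory=list)

    def words(self) -> List[str]:
        """Return the surface string in order; children are never reordered."""
        if self.word is not None:
            return [self.word]
        out: List[str] = []
        for child in self.children:
            out.extend(child.words())
        return out

    def bracket(self) -> str:
        """Render the tree as a labelled bracketing, LABEL:EDGE per node."""
        head = f"{self.label}:{self.edge}"
        if self.word is not None:
            return f"({head} {self.word})"
        return "(" + head + " " + " ".join(c.bracket() for c in self.children) + ")"


# Assumed inventory of topological field labels, for illustration only.
FIELD_LABELS = {"VF", "LK", "MF", "VC", "NF"}


def fields(clause: Node) -> List[Tuple[str, str]]:
    """List the topological fields of a clause together with their words."""
    return [(c.label, " ".join(c.words()))
            for c in clause.children if c.label in FIELD_LABELS]


# Hypothetical example sentence: "Er liest das Buch" ("He reads the book").
tree = Node("SIMPX", children=[
    Node("VF", children=[Node("NX", "ON", children=[Node("PPER", "HD", "Er")])]),
    Node("LK", children=[Node("VXFIN", "HD", children=[Node("VVFIN", "HD", "liest")])]),
    Node("MF", children=[Node("NX", "OA", children=[
        Node("ART", "-", "das"), Node("NN", "HD", "Buch")])]),
])

print(tree.bracket())
print(fields(tree))
```

Because every node owns an ordered list of children and nothing else, the yield of each constituent is a contiguous stretch of the sentence, which is what a context-free backbone without crossing branches or traces amounts to.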

 

All sentences of the treebank are enriched with anaphoric and coreference relations referring to nominal and pronominal antecedents. The linking relations were annotated in PALinkA with markables which were automatically extracted from TüBa-D/Z:

  • coreferential relations: 54,382
  • anaphoric relations: 50,706
  • cataphoric relations: 1,579
  • expletives: 7,960
  • bound relations: 2,603
  • split antecedents: 344
  • instances: 291
  • inherent reflexives: 9,138

 

For selected discourse connectives, the instances occurring in the treebank have been annotated with the discourse relation(s) conveyed by the connective instance. Portions of the treebank have been sense-annotated for the connectives nachdem (298 instances), während (531 instances), sobald (28 instances), seitdem (13 instances), als (169 instances), aber (161 instances), and bevor (119 instances). For annotation guidelines see Simon et al. (2011).

 

Another annotation layer contains structural information as well as implicit discourse relations for a subcorpus of 41 annotated newspaper articles (21,817 tokens) with 1,458 (explicit and implicit) discourse relations. For the annotation schema and numbers on agreement see Gastel et al. (2011).

 

An extensive description of the complete syntactic annotation scheme can be found in the stylebook:

 

Part-of-Speech tags are annotated with the "Stuttgart-Tübingen-TagSet" (STTS):

 

The annotation guidelines for anaphora and coreference relations can be found in the following manual: tuebadz-coreference-manual-2007.pdf.

The annotation guidelines for discourse connectives can be found in the following manual: tuebadz-Konnektorenhandbuch_A3_v1.1.pdf.

 

The treebank is available in different formats:

 

The NEGRA export format can be used with the annotation tool Annotate (no longer maintained), which was developed in the NEGRA project at the Department of Computational Linguistics, Saarland University, or with the TIGERSearch tool, which was developed in the TIGER project at the Institute for Natural Language Processing, University of Stuttgart. The XML data can be viewed with any XML viewer.
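
For readers who want to process the export files directly rather than through one of these tools, the following is a minimal reading sketch. It assumes the general export convention of `#BOS n` / `#EOS n` sentence delimiters, one whitespace-separated row per token or node, non-terminal rows whose first column is an id such as `#500`, and the column order form, lemma, tag, morphology, edge label, parent id. That column order, the function names, and the sample rows (including the morphology values) are assumptions made for illustration; the format documentation shipped with the release is authoritative.

```python
import re
from typing import Dict, List, NamedTuple


class Row(NamedTuple):
    form: str     # word form, or "#5xx" node id for non-terminals
    lemma: str
    tag: str      # POS tag or phrase/field/clause category
    morph: str
    edge: str     # edge label to the parent node
    parent: int   # id of the parent node, 0 for the root


# Invented sample in the assumed layout; not copied from the treebank.
SAMPLE = """\
#BOS 1
Er      er      PPER   nsm3  HD  500
liest   lesen   VVFIN  3sis  HD  501
#500    --      NX     --    ON  502
#501    --      VXFIN  --    HD  503
#502    --      VF     --    -   504
#503    --      LK     --    -   504
#504    --      SIMPX  --    --  0
#EOS 1
"""


def read_sentences(text: str) -> Dict[int, List[Row]]:
    """Group rows by sentence number; lines outside #BOS/#EOS are skipped."""
    sentences: Dict[int, List[Row]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("#BOS"):
            current = []
            sentences[int(line.split()[1])] = current
        elif line.startswith("#EOS"):
            current = None
        elif current is not None and line.strip():
            cols = line.split()
            # Only the first six columns are read; any further columns are ignored.
            current.append(Row(cols[0], cols[1], cols[2], cols[3], cols[4], int(cols[5])))
    return sentences


def terminals(rows: List[Row]) -> List[Row]:
    """Keep token rows, dropping non-terminal rows whose form is an id like #500."""
    return [r for r in rows if not re.fullmatch(r"#\d+", r.form)]


sentences = read_sentences(SAMPLE)
print([r.form for r in terminals(sentences[1])])
```

Keying the result by the number given on the `#BOS` line, rather than by position in the file, keeps the sentence identifiers of the release visible to downstream code.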

Because corrections occasionally changed the sentence segmentation, some sentence numbers differ between releases.

 

Funding for the treebank TüBa-D/Z has come from a variety of sources:

 

How to Obtain a License for TüBa-D/Z:

For academic research, the license is provided free of charge. For all other uses please contact Erhard Hinrichs for further details.

Please note that we do not give licenses to individuals.
Students who are interested in using TüBa-D/Z for a research project or a thesis project should contact their advisors to obtain a license for their academic institutions. The license agreement has to be signed by a duly authorized person.

For an academic research license, follow these steps:

  1. Print the License agreement for TüBa-D/Z (PDF).
  2. Fill out the license agreement and send it back by post, by fax, or as a scan to tuebadz-info. Please give a short description of the intended academic research use.
  3. After processing the license, we will send you a password for the download webpage.
  4. Download the treebank.

 

Contact:

Marie Hinrichs

Eberhard Karls University of Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen, Germany

Fax: +49 - (0)7071 - 29 5214