The corpus TüPP-D/Z

The creation of TüPP-D/Z was funded by the DEREKO project and the Kompetenzzentrum für Text- und Informationstechnologie (KIT), and received additional support from the A1 project of the Sonderforschungsbereich 441.

TüPP-D/Z is a collection of articles from the taz newspaper that have been automatically annotated with clause structure, topological fields, and chunks, in addition to lower-level annotation including parts of speech and morphological ambiguity classes. All texts are processed automatically, starting with segmentation into paragraphs, sentences, and word form tokens. Word forms include information about some regular types of named entities, including dates, telephone numbers, and number/unit combinations.

The current release of TüPP-D/Z is based on the 1999 HTML distribution (scientific edition) of the taz, which covers newspaper articles from September 2, 1986 to May 7, 1999 and consists of 11,512,293 sentences (204,425,497 tokens).
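As a rough sanity check on these figures, the average sentence length follows directly from the two counts. A minimal Python sketch (the variable names are illustrative and not part of the corpus distribution):

```python
# Counts quoted above for the current TüPP-D/Z release.
sentences = 11_512_293
tokens = 204_425_497

# Average sentence length in tokens.
avg_tokens_per_sentence = tokens / sentences
print(f"{avg_tokens_per_sentence:.2f} tokens per sentence")
```

This works out to a little under 18 tokens per sentence.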

A more in-depth description of the linguistic annotation can be found in the partial parsing stylebook, and information about the actual XML encoding of linguistic annotation can be found in the markup guide.

TüPP-D/Z is distributed in XML format. It comes with converters that help you produce other formats, for example bracketed vertical format.
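To give a feel for what such a conversion involves, the following Python sketch turns a small nested XML fragment into one token per line with bracket lines around structural units. It is only an illustration under stated assumptions: the element and attribute names (`s`, `chunk`, `t`, `f`, `pos`, `type`) and the output layout are invented for this example and are not the actual TüPP-D/Z markup or the behaviour of the distributed converters. The markup guide documents the real encoding.

```python
import xml.etree.ElementTree as ET

# Invented sample: element and attribute names are placeholders,
# NOT the actual TüPP-D/Z markup (see the markup guide for that).
SAMPLE = """<s>
  <chunk type="NC">
    <t f="Die" pos="ART"/>
    <t f="Zeitung" pos="NN"/>
  </chunk>
  <t f="erscheint" pos="VVFIN"/>
</s>"""


def to_vertical(elem, lines=None):
    """Emit one token per line, with bracket lines around each structural element."""
    if lines is None:
        lines = []
    if elem.tag == "t":
        # Token line: word form and part-of-speech tag, tab-separated.
        lines.append(f"{elem.get('f')}\t{elem.get('pos')}")
        return lines
    # Structural element: open a bracket, recurse into children, close it.
    label = elem.get("type", elem.tag)
    lines.append(f"[{label}")
    for child in elem:
        to_vertical(child, lines)
    lines.append("]")
    return lines


vertical = to_vertical(ET.fromstring(SAMPLE))
print("\n".join(vertical))
```

The recursive walk keeps the nesting of the XML visible in the flat output, which is the main point of a bracketed vertical layout.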

How to Obtain a License for TüPP-D/Z:

 

For academic research, the license is provided free of charge. For all other uses, please contact Erhard Hinrichs for further details.

     
  1. Please print the License agreement for TüPP-D/Z (PDF).
  2. Buy the "taz archive DVD". It is available for 50 € in the taz-shop.
     Purchase of the taz archive DVD is necessary for IPR reasons.
  3. Fill in the license agreement for TüPP-D/Z and send the license agreement to tuebadz-info (via scan, fax or post - see postal address below). Please include the proof of license/bill for the taz data.
  4. After processing the license, we will send you a password for the download webpage.
  5. Download the TüPP-D/Z.

 

Contact: 

Marie Hinrichs

Eberhard Karls Universität Tübingen
Department of Computational Linguistics
Wilhelmstr. 19
D-72074 Tübingen
Germany 

Tel.: +49 - (0)7071 - 29 78490
Fax: +49 - (0)7071 - 29 5214