GermaNet

Technical Notes for Lexicographers

Resources and Tools

A number of tools and resources have been developed to support the work of the lexicographers.

Corpora

A large German text corpus has been compiled to support corpus-based lexicography. It consists of texts from the German newspapers Frankfurter Rundschau (40 million tokens), Donau Kurier (8.5 million), and VDI Nachrichten (0.2 million), plus a collection of postings to German Usenet newsgroups (10 million). The texts are disambiguated with respect to part-of-speech categories and (pseudo-)lemmatized. The corpus is accessible under the label GCOLL (German Collection) via the IMS Corpus Workbench (CWB). See the CWB manuals and man pages (xkwic, cqp, decode, lexdecode) for details.

Lemma Lists

Frequency lists of word forms and lemmas have been generated for several corpora. The lists are available as plain text files in the directory

/home/lsd/data/Corpora/LemmaLists

There is a set of five files for each of the corpora German Collection (gcoll), Donau Kurier (dk), Frankfurter Rundschau (fr), Tübinger Newskorpus (tn), and VDI Nachrichten (vdi). The files follow the naming convention <corpus-name>.<suffix>, where <corpus-name> is one of the abbreviations given above and <suffix> is one of the following:

  • fpl: frequency-pos-lemma
  • words: frequency-wordform
  • 1000nouns: frequency-lemma for top 1000 lemmas tagged as nouns
  • 1000verbs: frequency-lemma for top 1000 lemmas tagged as verbs
  • 1000adj: frequency-lemma for top 1000 lemmas tagged as adjectives
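
The naming convention above can be sketched in a few lines of Python. This is only an illustration. The helper names lemma_list_path and parse_fpl_line are hypothetical, and the notes do not specify the field separator inside the files, so the parser assumes whitespace-separated fields in the order frequency, part of speech, lemma.

```python
from pathlib import Path

# Directory, corpus abbreviations and suffixes as listed in these notes.
LEMMA_LIST_DIR = Path("/home/lsd/data/Corpora/LemmaLists")
CORPORA = ("gcoll", "dk", "fr", "tn", "vdi")
SUFFIXES = ("fpl", "words", "1000nouns", "1000verbs", "1000adj")


def lemma_list_path(corpus, suffix, base=LEMMA_LIST_DIR):
    """Build <base>/<corpus-name>.<suffix>, rejecting unknown names."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return Path(base) / f"{corpus}.{suffix}"


def parse_fpl_line(line):
    """Parse one line of an .fpl (frequency-pos-lemma) file.

    ASSUMPTION: fields are separated by whitespace and appear in the
    order frequency, pos, lemma. The real file layout may differ.
    The lemma is taken as the remainder of the line.
    """
    freq, pos, lemma = line.strip().split(None, 2)
    return int(freq), pos, lemma
```

For example, lemma_list_path("gcoll", "fpl") yields the path /home/lsd/data/Corpora/LemmaLists/gcoll.fpl. If the files turn out to use a different separator, only the split call in parse_fpl_line needs to change.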

Various Software

For Verbal Information:

  • VPref lists information on possible prefixes for German simple verbs, based on Lingsoft's GERTWOL morphological analyser and the GCOLL corpus.
  • VComp lists information on subcategorization frames, based on the CELEX database.
  • VInfo combines VPref and VComp, listing possible prefixes together with their subcategorization frames.