Resources and Tools
A number of tools and resources have been developed to support the work
of the lexicographers.
Corpora
A large German text corpus has been compiled to support corpus-based
lexicography. It consists of texts from the German newspapers Frankfurter
Rundschau (40 million tokens), Donau Kurier (8.5 million) and VDI
Nachrichten (0.2 million), together with a collection of postings to
German Usenet newsgroups (10 million). The texts are disambiguated with
respect to part-of-speech categories and (pseudo-)lemmatized. The corpus
is accessible under the label GCOLL (German Collection) via the IMS Corpus
Workbench (CWB). See the CWB manuals and man pages (xkwic, cqp, decode,
lexdecode) for details.
Lemma Lists
A number of frequency lists of word forms and lemmas have been generated
for various corpora. The lists are available as plain text files in the
directory
/home/lsd/data/Corpora/LemmaLists
There is a set of five files for each of the corpora German
Collection (gcoll), Donau Kurier (dk), Frankfurter Rundschau (fr),
Tübinger Newskorpus (tn), and VDI Nachrichten (vdi).
The files follow the naming convention <corpus-name>.<suffix>,
where <corpus-name> is one of the abbreviations given in parentheses
above and <suffix> is one of the following:
- fpl: frequency-pos-lemma
- words: frequency-wordform
- 1000nouns: frequency-lemma for top 1000 lemmas tagged as nouns
- 1000verbs: frequency-lemma for top 1000 lemmas tagged as verbs
- 1000adj: frequency-lemma for top 1000 lemmas tagged as adjectives
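The naming convention above can be sketched in a few lines of Python. This is only an illustration, not part of the project's tools: the helper names lemma_list_path and parse_fpl_line are hypothetical, and the layout of an fpl line (whitespace-separated fields in the order frequency, part-of-speech tag, lemma) is an assumption based on the suffix description, since the file format itself is not documented here.

```python
from pathlib import Path

# Directory, corpus abbreviations and suffixes as listed in this section.
LEMMA_LIST_DIR = Path("/home/lsd/data/Corpora/LemmaLists")
CORPORA = {"gcoll", "dk", "fr", "tn", "vdi"}
SUFFIXES = {"fpl", "words", "1000nouns", "1000verbs", "1000adj"}


def lemma_list_path(corpus, suffix, base_dir=LEMMA_LIST_DIR):
    """Build <base_dir>/<corpus-name>.<suffix> after validating both parts.

    Hypothetical helper, not part of the documented tool set.
    """
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return Path(base_dir) / f"{corpus}.{suffix}"


def parse_fpl_line(line):
    """Split one fpl line into (frequency, pos, lemma).

    ASSUMPTION: fields are whitespace-separated and appear in the
    order frequency, part-of-speech tag, lemma. Adjust if the real
    files use a different layout.
    """
    freq, pos, lemma = line.split(None, 2)
    return int(freq), pos, lemma.strip()
```

For example, lemma_list_path("fr", "1000verbs") yields the path of the top-1000 verb list for the Frankfurter Rundschau corpus. Validating the corpus name and suffix up front turns a typo into an immediate error rather than a missing-file failure later on.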
Various Software
For Verbal Information:
- VPref lists information on possible prefixes for German simple verbs
on the basis of Lingsoft's GERTWOL morphological analyser and the GCOLL
corpus.
- VComp lists information on subcategorization frames on the basis of
the CELEX database.
- VInfo is a combination of VPref and VComp, listing possible prefixes
together with their subcategorization frames.