Corpus data are rarely used in theoretical linguistic circles. This
is partly because the use of corpus data has traditionally been
conflated with a quantitative analysis of such data (as relevant to
an investigation of language use and language pedagogical issues).
However, corpus searches can equally well provide example data for
generative theorizing, which generally relies on qualitative
analysis by introspection. Corpus data are particularly valuable
for generative work since they contain variation of known and
unknown properties, are natural and therefore easier to evaluate
introspectively, and can often include information on the
increasingly relevant notion of context.

One reason the usefulness of corpus data for theoretical linguistics
has largely been overlooked is that until recently it was very
difficult to search for examples which fit a particular,
theoretically interesting pattern. This has changed with the advent
of corpora which are annotated with various kinds of linguistic
knowledge such as part-of-speech annotations or syntactic structure.

In this seminar we want to explore the nature and use of syntactic
and morpho-syntactic corpus annotation and its relation to
linguistic knowledge. Linguistic knowledge here is meant as a cover
term for linguistic theorizing as well as pre-theoretic linguistic
insights. Relevant issues include:
- Nature of annotations:
  - Tagsets used for part-of-speech tagging, the nature of errors
    made by common automatic taggers, and how to correct them using
    other kinds of linguistic knowledge.
  - Shallow and deep syntactic annotation (e.g. in Penn Treebank,
    Verbmobil treebanks, Negra2 treebank, topological field
    annotation).
- Using annotations: How to search for syntactically relevant
  patterns in corpora annotated to different degrees.
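
To make the last point concrete, the following is a minimal sketch of a
pattern search over a part-of-speech tagged corpus. It is an illustration
only, not a tool used in the seminar: the toy corpus, its hand-assigned
Penn Treebank-style tags, and the function name find_tag_pattern are all
assumptions made for this example. Each pattern element is a regular
expression matched against one tag, so "NNS?" covers both singular and
plural common nouns.

```python
import re

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
# Tags follow the Penn Treebank style and were assigned by hand for illustration.
CORPUS = [
    [("The", "DT"), ("old", "JJ"), ("dog", "NN"), ("barked", "VBD"), (".", ".")],
    [("She", "PRP"), ("saw", "VBD"), ("a", "DT"), ("red", "JJ"), ("car", "NN"), (".", ".")],
    [("Dogs", "NNS"), ("bark", "VBP"), (".", ".")],
]


def find_tag_pattern(corpus, tag_regexes):
    """Return (sentence_index, words) for every contiguous span whose tags
    match the given regular expressions, one expression per token."""
    compiled = [re.compile(pattern) for pattern in tag_regexes]
    width = len(compiled)
    hits = []
    for s_idx, sentence in enumerate(corpus):
        # Slide a window of the pattern's width across the sentence.
        for start in range(len(sentence) - width + 1):
            window = sentence[start:start + width]
            if all(rx.fullmatch(tag) for rx, (_, tag) in zip(compiled, window)):
                hits.append((s_idx, [word for word, _ in window]))
    return hits


# Search for determiner + adjective + common noun (singular or plural).
hits = find_tag_pattern(CORPUS, ["DT", "JJ", "NNS?"])
for s_idx, words in hits:
    print(s_idx, " ".join(words))
```

A flat tag sequence like this only supports linear patterns. Queries over
hierarchical structure (for instance, dominance relations in a treebank)
need the deeper syntactic annotation mentioned above and a query mechanism
that can refer to tree nodes.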