ISCL Hauptseminar (Wintersemester 2014/15, Meurers)

Corpus Annotation: Linguistic Foundations and Computational Linguistic Analysis


Language data collected in electronic corpora can in principle provide important empirical insights for theoretical and computational linguistics. For theoretical linguistics, corpus examples can be used to validate or falsify linguistic generalizations. In computational linguistics, language models and classifiers can be trained on corpus data to learn how to predict or classify previously unseen data on that basis.

Effective querying of corpora for specific phenomena and the development of computational tools for the automatic analysis of language often require reference to annotations. Annotations essentially function as an index to classes of data which cannot easily be identified based on the surface form alone. For example, finding all sentences containing modal verbs using only the surface forms is possible, but would require a long list of all forms of the modal verbs. Even so, sentences where, for example, “can” is not actually a modal verb (as in “Pass me a can of beer” or “I can tuna for a living”) would be wrongly identified. Other search patterns, such as a query for all sentences containing past participle verbs, cannot even be specified in finite form using the surface string alone. The annotation of corpora thus serves an important function in providing abstractions which make it possible to access or generalize over large sets of examples.
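The contrast between surface-based and annotation-based querying can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy corpus, the hand-assigned part-of-speech tags (loosely following the Penn Treebank tagset, where MD marks modals and VBN past participles), and the helper names surface_query and annotation_query are hypothetical and not taken from any particular corpus or tool.

```python
# Toy corpus: each sentence is a list of (token, tag) pairs.
# The tags are hand-assigned for illustration only (Penn Treebank style:
# MD = modal, NN = noun, VBP = present-tense verb, VBN = past participle).
CORPUS = [
    [("You", "PRP"), ("can", "MD"), ("leave", "VB"), ("now", "RB"), (".", ".")],
    [("Pass", "VB"), ("me", "PRP"), ("a", "DT"), ("can", "NN"),
     ("of", "IN"), ("beer", "NN"), (".", ".")],
    [("I", "PRP"), ("can", "VBP"), ("tuna", "NN"), ("for", "IN"),
     ("a", "DT"), ("living", "NN"), (".", ".")],
    [("She", "PRP"), ("might", "MD"), ("come", "VB"), (".", ".")],
    [("The", "DT"), ("window", "NN"), ("was", "VBD"), ("broken", "VBN"), (".", ".")],
]

# A surface-based query needs an explicit list of modal word forms.
MODAL_FORMS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def text(sentence):
    """Join the tokens of a sentence into a plain string."""
    return " ".join(token for token, _ in sentence)


def surface_query(corpus, forms=MODAL_FORMS):
    """Return sentences containing any of the given word forms."""
    return [text(s) for s in corpus
            if any(token.lower() in forms for token, _ in s)]


def annotation_query(corpus, tag):
    """Return sentences containing at least one token with the given tag."""
    return [text(s) for s in corpus if any(t == tag for _, t in s)]


if __name__ == "__main__":
    print("Surface match for modal forms:")
    for sentence in surface_query(CORPUS):
        print("  ", sentence)
    print("Annotation match for tag MD:")
    for sentence in annotation_query(CORPUS, "MD"):
        print("  ", sentence)
    print("Annotation match for tag VBN:")
    for sentence in annotation_query(CORPUS, "VBN"):
        print("  ", sentence)
```

In this toy setting the surface query also returns the “can of beer” and “can tuna” sentences, while the query on the MD tag returns only the two sentences with a modal. The past participle query on VBN has no practical surface-form counterpart, since it would require listing every participle form in the language.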

This seminar will provide an overview of the creation and use of linguistically annotated corpora in theoretical and computational linguistics. It will cover basic questions, such as how to tokenize or sentence-segment a corpus, as well as conceptual considerations relevant to the creation of annotation schemes. It will then explore different types of corpora (from newspaper to learner corpora) and different types of annotations (morphological, constituency, dependency, semantic and formal pragmatic).

Instructor: Prof. Dr. Detmar Meurers

Course meets: 4 SWS in Seminarraum 1.13, Blochbau (Wilhelmstr. 19)

Credit Points:

Online syllabus:

Moodle page:

If you have not already used this Moodle installation for another course, please log on to it as soon as possible, create an account for yourself using your ordinary ZDV university login, and then enroll in our course.

Nature of course and my expectations: This is a Hauptseminar intended to provide an overview of the key issues and annotation schemes in this active research area. Each participant is expected to

  1. regularly and actively participate in class, read the papers assigned by any of the presenters, and post a question on each reading to the “Reading Discussion Forum” on Moodle no later than the day before it is discussed in class. (30% of grade)

    Note: According to the rules of the Fakultät, missing more than two meetings unexcused automatically results in failing the class.

  2. explore and present a topic (30% of grade):
  3. for a Hauptseminar Schein, write a term paper (40% of grade for Hauptseminar):

Academic conduct and misconduct: Research is driven by discussion and free exchange of ideas, motivations, and perspectives. So you are encouraged to work in groups, discuss, and exchange ideas. At the same time, the foundation of the free exchange of ideas is that everyone is open about where they obtained which information. Concretely, this means you are expected to always make explicit when you’ve worked on something as a team – and keep in mind that being part of a team always means sharing the work.

For text you write, you always have to provide explicit references for any ideas or passages you reuse from somewhere else. Note that this includes text “found” on the web, where you should cite the URL of the web site if no more official publication is available.

Class etiquette: Please do not read or work on materials for other classes in our seminar. Come to class on time and do not pack up early. When our seminar meets in the computer lab, only use the computers when you are asked to do a specific activity – do not read email or browse the web. All portable electronic devices such as cell phones should be switched off for the entire length of the flight – oops – class. If you must leave early or have to miss class for an important reason, please let me know before class.

Session plan:

Topics we can choose from

We focus on the conceptual issues, in particular questions relating to linguistic modeling. Which properties and insights can be and have been identified and annotated in corpora?


