ISCL Hauptseminar (Winter semester 2020, Meurers & Weiss)

Linguistic Corpus Annotation


Annotated corpus resources are the primary way in which linguistic insight still feeds current computational linguistic research and applications. This seminar will provide an overview of the creation and use of linguistically annotated corpora in theoretical and computational linguistics. The theoretical presentations and discussion will be complemented by practical tasks relating to the creation and use of corpus annotation. It will start with basic (but surprisingly non-trivial and consequential) questions such as how to tokenize or sentence-segment a corpus, as well as conceptual considerations relevant to the creation of annotation schemes. It will then explore different types of corpora and different types of annotations (morphological, constituency, dependency, semantic and formal pragmatic), covering both commonly used standard resources (such as the Penn Treebank) and linguistic corpus resources for languages other than English. The course comprises 6 SWS to support the integration of practical tasks and earns 9 CP (based on presentations and practical components, i.e., without a term paper).


Nature of course and our expectations: This is a research-oriented Hauptseminar, in which we jointly explore perspectives and approaches. You are expected to

  1. regularly and actively participate in class, read the papers assigned by any of the presenters, and post a meaningful question on each reading to the “Reading Discussion Forum” on Moodle no later than the day before it is discussed in class.
  2. explore and present a topic:
  3. complete the in-class project.

Academic conduct and misconduct: Research is driven by discussion and free exchange of ideas, motivations, and perspectives. So you are encouraged to work in groups, discuss, and exchange ideas. At the same time, the foundation of the free exchange of ideas is that everyone is open about where they obtained which information. Concretely, this means you are expected to always make explicit when you’ve worked on something as a team – and keep in mind that being part of a team always means sharing the work.

For text you write, you always have to provide explicit references for any ideas or passages you reuse from somewhere else. Note that this includes text “found” on the web, where you should cite the URL of the website if no more official publication is available.

We view the university as a community of scholars, where different ideas and opinions can be brought up, considered in terms of their merits, and freely discussed. Note that the right to speak up includes the obligation to voice your opinion in a way that respects fellow students or lecturers.

Session plan:

  1. 11./13.11. Overview of course and the topic [Detmar]
  2. 18.11. Text Processing with the Command Line Interface [Zarah]
    hands-on exercise: use the command line
  3. 20.11. (Topic 0) Tokenization and Sentence Segmentation
    homework: play around with segmentation tools
  4. 25.11. (Topic 1) PoS tagging and English PoS tag sets: SUSANNE, WSJ, BNC-CLAWS5/7
    hands-on exercise: compare tag sets
  5. 27.11. (Topic 2) PoS tag sets of other languages
  6. 02.12. (Topic 3) UD PoS tags
    hands-on exercise: compare English, non-English and UD tag sets
  7. 04.12. PoS annotation quality and corpus annotation error detection [Detmar]
  8. 09.12. (Topic 4) Constituency annotation for English
    hands-on exercise: do constituency parsing
  9. 11.12. (Topic 5) Constituency annotation for other languages
  10. 16.12. (Topic 6) Dependency analysis
    hands-on exercise: compare dependency annotations
  11. 18.12. (Topic 7) Universal Dependencies
  12. 23.12. Syntax and dependency annotation error detection
  13. 25.12./06.01. Christmas break
  14. 08.01. (Topic 8) query language basics
    homework: use of ANNIS
  15. 13.01. Corpus creation: Combining multi-layer annotation with Pepper in ANNIS
    hands-on: convert dependency parses across formats with ANNIS and extract information; combine different annotations in ANNIS
  16. 15.01. (Topic 9) Corpus annotation: WebAnno, EXMARaLDA
  17. 20.01. Inter-rater reliability and model performance metrics [Zarah]
    hands-on: Evaluate IAA and different metrics for treebank comparison (LAS, UAS, etc.)
  18. 22.01. finish hands-on work (no lecture)
  19. 27.01. mini-project intro (annotate and evaluate a linguistic property of choice)
  20. 29.01. mini-project
  21. 03.02. mini-project
  22. 05.02. mini-project
  23. 10.02. mini-project
  24. 12.02. mini-project
  25. 17.02. mini-project
  26. 19.02. mini-project
  27. 24.02. Discussion of project results
  28. 26.02. Wrap-Up



