ISCL Hauptseminar (Winter semester 2020, Meurers & Weiss)

Linguistic Corpus Annotation

Abstract:

Annotated corpus resources are the primary way in which linguistic insight still feeds current computational linguistic research and applications. This seminar will provide an overview of the creation and use of linguistically annotated corpora in theoretical and computational linguistics. The theoretical presentations and discussions will be complemented by practical tasks relating to the creation and use of corpus annotation. The seminar will start with basic (but surprisingly non-trivial and consequential) questions, such as how to tokenize a corpus or segment it into sentences, as well as conceptual considerations relevant to the creation of annotation schemes. It will then explore different types of corpora and different types of annotation (morphological, constituency, dependency, semantic, and formal pragmatic), covering both commonly used standard resources (such as the Penn Treebank) and linguistic corpus resources for languages other than English. The course comprises 6 SWS to support the integration of practical tasks and carries 9 CP (based on presentations and practical components, i.e., without a term paper).

Instructors:

Course meets: 6 SWS in
Zoom: https://zoom.us/j/96719107835?pwd=Q29rTERZSThsTGQwZjhQeFowZWM4QT09

Credit Points:

Online syllabus: http://purl.org/dm/20/ws/hs

Moodle page: https://moodle.zdv.uni-tuebingen.de/course/view.php?id=1275

Please enroll in this course by logging into the Moodle course above with your regular ZDV university login.

Nature of course and our expectations: This is a research-oriented Hauptseminar, in which we jointly explore perspectives and approaches. You are expected to

  1. participate regularly and actively in class, read the papers assigned by the presenters, and post a meaningful question on each reading to the “Reading Discussion Forum” on Moodle no later than the day before it is discussed in class.
  2. explore and present a topic:
  3. complete the in-class project

Academic conduct and misconduct: Research is driven by discussion and free exchange of ideas, motivations, and perspectives. So you are encouraged to work in groups, discuss, and exchange ideas. At the same time, the foundation of the free exchange of ideas is that everyone is open about where they obtained which information. Concretely, this means you are expected to always make explicit when you’ve worked on something as a team – and keep in mind that being part of a team always means sharing the work.

For text you write, you always have to provide explicit references for any ideas or passages you reuse from somewhere else. Note that this includes text “found” on the web, where you should cite the URL of the website if no more official publication is available.

We view the university as a community of scholars, where different ideas and opinions can be brought up, considered in terms of their merits, and freely discussed. Note that the right to speak up includes the obligation to voice your opinion in a way that respects fellow students or lecturers.

Session plan:

  1. 11./13.11. Overview of course and the topic [Detmar]
  2. 18.11. Text Processing with the Command Line Interface [Zarah]
    hands-on exercise: use command line
  3. 20.11. (Topic 0) Tokenization and Sentence Segmentation
    homework: play around with segmentation tools
  4. 25.11. (Topic 1) PoS tagging and English PoS tag sets: SUSANNE, WSJ, BNC-CLAWS5/7
    hands-on exercise: compare tag sets
  5. 27.11. (Topic 2) PoS tag sets of other languages
  6. 02.12. (Topic 3) UD PoS tags
    hands-on exercise: compare English, non-English and UD tag sets
  7. 04.12. PoS annotation quality and corpus annotation error detection [Detmar]
  8. 09.12. (Topic 4) Constituency annotation for English
    hands-on exercise: do constituency parsing
  9. 11.12. (Topic 5) Constituency annotation for other languages
  10. 16.12. (Topic 6) Dependency analysis
    hands-on exercise: compare dependency annotations
  11. 18.12. (Topic 7) Universal Dependencies
  12. 23.12. Syntax and dependency annotation error detection
  13. 25.12./06.01. Christmas break
  14. 08.01. (Topic 8) query language basics
    homework: use of ANNIS
  15. 13.01. Corpus creation: Combining multi-layer annotation with Pepper in ANNIS
    hands-on: convert dependency parses across formats with ANNIS and extract information; combine different annotations in ANNIS
  16. 15.01. (Topic 9) Corpus annotation: WebAnno, EXMARaLDA
  17. 20.01. Inter-rater reliability and model performance metrics [Zarah]
    hands-on: Evaluate IAA and different metrics for treebank comparison (LAS, UAS, etc.)
  18. 22.01. finish hands-on work (no lecture)
  19. 27.01. mini-project intro (annotate and evaluate a linguistic property of choice)
  20. 29.01. mini-project
  21. 03.02. mini-project
  22. 05.02. mini-project
  23. 10.02. mini-project
  24. 12.02. mini-project
  25. 17.02. mini-project
  26. 19.02. mini-project
  27. 24.02. Discussion of project results
  28. 26.02. Wrap-Up
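
For orientation on the hands-on part of session 17, the following is a minimal sketch of the kind of computation involved: Cohen's kappa for two annotators, and unlabeled/labeled attachment scores (UAS/LAS) for dependency annotations. The function names, the data layout (one label per item, one (head, label) pair per token), and the toy data are assumptions made for illustration only; they are not part of the course materials. Real evaluations often apply further conventions (for example, excluding punctuation), which this sketch ignores.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        # Kappa is undefined here; returning 1.0 is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for parallel lists of (head, label) pairs, one per token."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two equally long, non-empty analyses")
    n = len(gold)
    # UAS: the head matches; LAS: head and dependency label both match.
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return uas, las


if __name__ == "__main__":
    # Hypothetical toy data: PoS tags from two annotators for four tokens.
    tags_a = ["NN", "VB", "NN", "DT"]
    tags_b = ["NN", "VB", "JJ", "DT"]
    print("kappa:", round(cohen_kappa(tags_a, tags_b), 3))

    # Hypothetical toy data: (head index, relation) per token, 0 = root.
    gold = [(2, "det"), (3, "nsubj"), (0, "root")]
    pred = [(2, "det"), (3, "obj"), (0, "root")]
    print("UAS, LAS:", attachment_scores(gold, pred))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over a plain percentage when tag distributions are skewed.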

Materials

References

   Abeillé, A. (ed.) (2003). Treebanks: Building and using syntactically annotated corpora. Dordrecht: Kluwer.

   Abeillé, A., T. Brants & H. Uszkoreit (eds.) (2000). Proceedings of the Second Workshop on Linguistically Interpreted Corpora (LINC-00). Luxembourg. Workshop information at http://www.coli.uni-sb.de/linc2000/.

   Abeillé, A., L. Clément & F. Toussenel (2003). Building a Treebank for French. In Abeillé (2003).

   Artstein, R. & M. Poesio (2009). Survey Article: Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4), 1–42. URL http://www.mitpressjournals.org/doi/abs/10.1162/coli.07-034-R2.

   Atalay, N., K. Oflazer & B. Say (2003). The annotation process in the Turkish treebank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC).

   Bies, A., M. Ferguson, K. Katz & R. MacIntyre (1995). Bracketing Guidelines for Treebank II Style Penn Treebank Project. University of Pennsylvania. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz.

   Boyd, A., M. Dickinson & D. Meurers (2008). On Detecting Errors in Dependency Treebanks. Research on Language and Computation 6(2), 113–137. URL http://purl.org/dm/papers/boyd-et-al-08.html.

   Brants, T. (1995). Tagset reduction without information loss. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL 95). Cambridge, MA: MIT. URL http://www.coli.uni-sb.de/~thorsten/publications/Brants-ACL95.ps.gz.

   Brants, T. & W. Skut (1998). Automation of Treebank Annotation. In Proceedings of New Methods in Language Processing (NeMLaP-98). Sydney. URL http://www.coli.uni-sb.de/~thorsten/publications/Brants-Skut-NeMLaP98.ps.gz.

   Brill, E. (2000). Part-of-Speech Tagging. In R. Dale, H. Moisl & H. Somers (eds.), Handbook of Natural Language Processing, New York: Marcel Dekker. URL http://www.netLibrary.com/ebook_info.asp?product_id=47610.

   Choi, J. D., J. Tetreault & A. Stent (2015). It depends: Dependency parser comparison using a web-based evaluation tool. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 387–396.

   Cloeren, J. (1999). Tagsets. In van Halteren (1999), chap. 4, pp. 37–54.

   Díaz Negrillo, A., D. Meurers, S. Valera & H. Wunsch (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36(1–2), 139–154. URL http://purl.org/dm/papers/diaz-negrillo-et-al-09.html.

   Dickinson, M. & D. Meurers (2003a). Detecting Errors in Part-of-Speech Annotation. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03). Budapest, Hungary, pp. 107–114. URL http://purl.org/dm/papers/dickinson-meurers-03.html.

   Dickinson, M. & W. D. Meurers (2003b). Detecting Inconsistencies in Treebanks. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (TLT-03). Växjö, Sweden, pp. 45–56. URL http://purl.org/dm/papers/dickinson-meurers-tlt03.html.

   Dickinson, M. & W. D. Meurers (2005a). Detecting Annotation Errors in Spoken Language Corpora. In The Special Session on treebanks for spoken language and discourse at NODALIDA-05. Joensuu, Finland. URL http://purl.org/~dm/papers/dickinson-meurers-nodalida05.html.

   Dickinson, M. & W. D. Meurers (2005b). Detecting Errors in Discontinuous Structural Annotation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05). pp. 322–329. URL http://aclweb.org/anthology/P05-1040.

   Dienes, P. & C. Oravecz (2000). Bottom-up tagset design from maximally reduced tagset. In Abeillé et al. (2000), pp. 42–47. URL http://www.coli.uni-sb.de/~dienes/dior2000.ps.gz. Workshop information at http://www.coli.uni-sb.de/linc2000/.

   Duan, H., X. Bai, B. Chang & S. Yu (2003). Chinese word segmentation at Peking University. In Proceedings of the second SIGHAN workshop on Chinese language processing. pp. 152–155.

   Džeroski, S., T. Erjavec & J. Zavrel (2000). Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets. In Gavrilidou et al. (2000), pp. 1099–1104. URL http://nl.ijs.si/et/Bib/LREC00/lrec-tag.ps.

   Elworthy, D. (1995). Tagset Design and Inflected Languages. In Proceedings of the ACL-SIGDAT Workshop. Dublin. URL http://arXiv.org/abs/cmp-lg/9504002.

   Emms, M. (2008). Tree Distance and Some Other Variants of Evalb. In LREC.

   Forst, M., N. Bertomeu, B. Crysmann, F. Fouvry, S. Hansen-Schirra & V. Kordoni (2004). Towards a Dependency-Based Gold Standard for German Parsers. The TIGER Dependency Bank. In S. Hansen-Schirra, S. Oepen & H. Uszkoreit (eds.), 5th International Workshop on Linguistically Interpreted Corpora (LINC-04) at COLING. Geneva, Switzerland: COLING, pp. 31–38. URL http://aclweb.org/anthology/W04-1905.

   Gaizauskas, R. (1995). Investigations into the grammar underlying the Penn Treebank II. Tech. Rep. Research Memorandum CS-95-25, University of Sheffield. URL citeseer.ist.psu.edu/111349.html.

   Garside, R., G. Leech & T. McEnery (eds.) (1997). Corpus annotation: linguistic information from computer text corpora. Harlow, England: Addison Wesley Longman Limited.

   Gärtner, M. & K. Jung (2020). To Boldly Query What No One Has Annotated Before? The Frontiers of Corpus Querying. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 6307–6321. URL https://www.aclweb.org/anthology/2020.acl-main.562.

   Gavrilidou, M., G. Carayannis, S. Markantonatou, S. Piperidis & G. Steinhauer (eds.) (2000). Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-00). Athens.

   Gimpel, K., N. Schneider et al. (2011). Part-of-speech tagging for twitter: Annotation, features, and experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2. Association for Computational Linguistics, pp. 42–47.

   Grefenstette, G. (1999). Tokenization. In van Halteren (1999), chap. 9, pp. 117–133.

   Grefenstette, G. & P. Tapanainen (1994). What is a word, what is a sentence? In Proceedings of the 3rd International Conference on Computational Lexicography (COMPLEX-94). pp. 79–87. URL http://purl.org/dm/lib/Grefenstette.Tapanainen-94.pdf.

   Hajič, J., A. Böhmová, E. Hajičová & B. Vidová-Hladká (2003). The Prague Dependency Treebank: A Three-Level Annotation Scenario. In Abeillé (2003), chap. 7, pp. 103–127. URL http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf.

   Hajič, J., B. Vidová-Hladká & P. Pajas (2001). The Prague Dependency Treebank: Annotation Structure and Support. In Proceedings of the IRCS Workshop on Linguistic Databases. University of Pennsylvania, Philadelphia, pp. 105–114. URL http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHladkaPajas2001.pdf.

   Hajič, J. & B. Hladká (1998). Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In Proceedings of COLING-ACL Conference. Montreal, Canada, pp. 483–490.

   Hajič, J., J. Panevová, E. Buráňová, Z. Urešová & A. Bémová (1999). Annotations at Analytical Layer. Instructions for Annotators. Tech. rep., ÚFAL MFF UK, Prague, Czech Republic. URL http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/pdf/a-man-en.pdf. English translation by Zdeněk Kirschner.

   Hajičová, E., J. Panevová & P. Sgall (2000). A Manual for Tectogrammatic Tagging of the Prague Dependency Treebank. Tech. Rep. TR-2000-09, ÚFAL MFF UK, Prague, Czech Republic. In Czech.

   Hana, J. & D. Zeman (2005). A Manual for Morphological Annotation, 2nd edition. Tech. Rep. 27, ÚFAL MFF UK, Prague, Czech Republic. URL http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf.

   Hui-ming, Y. S.-w. D. & Z. X.-f. S. Bin (2002). The Basic Processing of Contemporary Chinese Corpus at Peking University SPECIFICATION [J]. Journal of Chinese Information Processing 5.

   King, T. H., R. Crouch, S. Riezler, M. Dalrymple & R. M. Kaplan (2003). The PARC 700 Dependency Bank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora, held at the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL-03). Budapest. URL http://www2.parc.com/isl/groups/nltt/fsbank/.

   Krause, T., U. Leser & A. Lüdeling (2016). graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora. J. Lang. Technol. Comput. Linguistics 31(1), 1–25.

   Krause, T. & A. Zeldes (2016). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities 31(1), 118–139.

   Kübler, S. & A. Wagner (2000). Evaluating POS Tagging under Sub-optimal Conditions. Or: Does Meticulousness Pay? In Proceedings of International Conference on Artificial and Computational Intelligence for Decision, Control and Automation in Engineering and Industrial Applications (ACIDCA’2000). Monastir, Tunisia. URL http://www.sfs.uni-tuebingen.de/~kuebler/papers/acidca.ps.

   Kulick, S., A. Bies, J. Mott, A. Kroch, B. Santorini & M. Liberman (2014). Parser evaluation using derivation trees: A complement to evalb. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 668–673.

   Leech, G. (1997). Grammatical Tagging. In Garside et al. (1997), chap. 2, pp. 19–33.

   Leech, G., R. Garside & M. Bryant (1994). CLAWS4: the tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Kyoto, Japan, pp. 622–628. URL http://citeseer.ist.psu.edu/geoffrey94claws.html.

   Liu, Y., Q. Tan & K. X. Shen (1994). The word segmentation rules and automatic word segmentation methods for Chinese information processing. Qing Hua University Press and Guang Xi p. 36.

   Lu, X. (2006). Hybrid Models for Chinese Unknown Word Resolution. Ph.D. thesis, The Ohio State University.

   Lu, X. (2014). Computational Methods for Corpus Annotation and Analysis. Springer.

   Marcus, M., G. Kim, M. Marcinkiewicz, R. MacIntyre, A. Bies, M. Ferguson, K. Katz & B. Schasberger (1994). The Penn treebank: Annotating predicate argument structure. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/arpa94.ps.gz.

   Marcus, M., B. Santorini & M. A. Marcinkiewicz (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2), 313–330. URL ftp://ftp.cis.upenn.edu/pub/treebank/doc/cl93.ps.gz.

   McEnery, T. & A. Wilson (1996). Corpus Linguistics. Edinburgh Textbooks in Empirical Linguistics. Edinburgh, UK: Edinburgh University Press.

   Meurers, D. (2005). On the use of electronic corpora for theoretical linguistics. Case studies from the syntax of German. Lingua 115(11), 1619–1639. URL http://purl.org/dm/papers/meurers-03.html.

   Meurers, D. & S. Müller (2009). Corpora and Syntax. In A. Lüdeling & M. Kytö (eds.), Corpus linguistics, Berlin: Mouton de Gruyter, vol. 2 of Handbooks of Linguistics and Communication Science, pp. 920–933. URL http://purl.org/dm/papers/meurers-mueller-09.html.

   Ng, H. T. & J. K. Low (2004). Chinese part-of-speech tagging: One-at-a-time or all-at-once? word-based or character-based? In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. pp. 277–284.

   Nilsson, J. & J. Nivre (2008). MaltEval: an Evaluation and Visualization Tool for Dependency Parsing. In LREC.

   Nivre, J., M.-C. de Marneffe et al. (2016). Universal Dependencies v1: A Multilingual Treebank Collection. In Proceedings of LREC. URL https://nlp.stanford.edu/pubs/nivre2016ud.pdf.

   Nivre, J., M.-C. de Marneffe et al. (2020). Universal dependencies v2: An evergrowing multilingual treebank collection. arXiv preprint arXiv:2004.10643.

   Nivre, J., J. Nilsson & J. Hall (2006). Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the fifth international conference on Language Resources and Evaluation (LREC-06). Genoa, Italy. URL http://stp.lingfil.uu.se/~nivre/docs/talbanken05.pdf.

   Oflazer, K., D. Z. Hakkani-Tür & G. Tür (1999). Design for a Turkish Treebank. In Uszkoreit et al. (1999), pp. 28–34.

   Oflazer, K., B. Say, D. Z. Hakkani-Tür & G. Tür (2003). Building a Turkish Treebank. In Abeillé (2003).

   Palmer, D. D. (2000). Tokenisation and Sentence Segmentation. In R. Dale, H. Moisl & H. Somers (eds.), Handbook of Natural Language Processing, New York: Marcel Dekker, pp. 11–35. URL http://www.netLibrary.com/ebook_info.asp?product_id=47610.

   Remus, S., H. Hedeland, A. Ferger, K. Bührig & C. Biemann (2019). WebAnno-MM: EXMARaLDA meets WebAnno.

   Sampson, G. & A. Babarczy (2003). Limits to annotation precision. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). pp. 61–68. URL http://www.grsampson.net/Alta.html.

   Santorini, B. (1990). Part-Of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision, 2nd printing). Ms., UPenn.

   Schiller, A., S. Teufel & C. Thielen (1995). Guidelines für das Taggen deutscher Textcorpora mit STTS. Tech. rep., IMS-CL, Univ. Stuttgart and SfS, Univ. Tübingen. URL http://www.cogsci.ed.ac.uk/~simone/stts_guide.ps.gz.

   Schmidt, T. (2004). Transcribing and annotating spoken language with EXMARaLDA. In Proceedings of the LREC-Workshop on XML based richly annotated corpora, Lisbon 2004. Paris: ELRA. URL http://www.exmaralda.org/files/Paper_LREC.pdf. EN.

   Shao, Y., C. Hardmeier, J. Tiedemann & J. Nivre (2017). Character-based joint segmentation and POS tagging for Chinese using bidirectional RNN-CRF. arXiv preprint arXiv:1704.01314.

   Suzuki, J., S. Takase, H. Kamigaito, M. Morishita & M. Nagata (2018). An Empirical Study of Building a Strong Baseline for Constituency Parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 612–618. URL https://www.aclweb.org/anthology/P18-2097.

   Taylor, A., M. Marcus & B. Santorini (2003). The Penn Treebank: An Overview. In Abeillé (2003), chap. 1, pp. 5–22.

   Teufel, S., H. Schmid, H. Heid & A. Schiller (1996). EAGLES Study of the relation between Tagsets and Taggers. Document eag clwg tags/v, EAGLES. URL ftp://ftp.ilc.pi.cnr.it/pub/eagles/lexicons/tags.ps.gz.

   Thielen, C. & A. Schiller (1996). Ein kleines und erweitertes Tagset fürs Deutsche. In H. Feldweg & E. W. Hinrichs (eds.), Lexikon und Text: wiederverwendbare Methoden und Ressourcen zur linguistischen Erschließung des Deutschen, Tübingen: Max Niemeyer Verlag, vol. 73 of Lexicographica: Series maior, pp. 215–226.

   Tufiş, D., P. Dienes, C. Oravecz & T. Váradi (2000). Principled Hidden Tagset Design for Tiered Tagging of Hungarian. In Gavrilidou et al. (2000). URL http://www.coli.uni-sb.de/~thorsten/tnt/papers/lrec2000-tufis-ea.pdf.

   Uszkoreit, H., T. Brants & B. Krenn (eds.) (1999). Proceedings of the Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway: Association for Computational Linguistics.

   van Halteren, H. (ed.) (1999). Syntactic Wordclass Tagging. Dordrecht: Kluwer Academic Publishers.

   Váradi, T. & C. Oravecz (1999). Morpho-syntactic ambiguity and tagset design for Hungarian. In Uszkoreit et al. (1999), pp. 8–12. URL http://www.inf.u-szeged.hu/~alexin/ILP/EACL99-Bergen.ps.gz.

   Voutilainen, A. & T. Järvinen (1995). Specifying a shallow grammatical representation for parsing purposes. In Proceedings of the 7th Conference of the EACL. Dublin, Ireland. URL http://www.aclweb.org/anthology-new/E95-1029.

   Wang, D., M. Fang, Y. Song & J. Li (2019). Bridging the gap: Improve part-of-speech tagging for Chinese social media texts with foreign words. In Proceedings of the 5th Workshop on Semantic Deep Learning (SemDeep-5). pp. 12–20.

   Yimam, S. M., C. Biemann, R. Eckart de Castilho & I. Gurevych (2014). Automatic Annotation Suggestions and Custom Annotation Layers in WebAnno. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, Maryland: Association for Computational Linguistics, pp. 91–96. URL http://www.aclweb.org/anthology/P14-5016.

   Zeldes, A., A. Lüdeling, J. Ritz & C. Chiarcos (2009). ANNIS: A search tool for multi-layer annotated corpora.