Prof. Dr. Walt Detmar Meurers
Detecting Annotation Errors in Spoken Language Corpora


Markus Dickinson and Walt Detmar Meurers


Proceedings of the Special session on treebanks for spoken language and discourse at the 15th Nordic Conference of Computational Linguistics (NODALIDA-05).


Consistency of corpus annotation is an essential property for the many uses of annotated corpora in computational and theoretical linguistics. While some research addresses the detection of inconsistencies in part-of-speech and other positional annotation, more recently work has also started to address errors in syntactic and other structural annotation.


Spoken language differs in many respects from written language, but to the best of our knowledge the issue of detecting errors in the annotation of spoken language corpora has not yet been systematically addressed. This is significant since spoken data is increasingly relevant for linguistic and computational research, and such corpora are starting to become more readily available. This paper addresses the issue, based on the variation n-gram error detection approach developed in Dickinson and Meurers (2003). We use the German Verbmobil treebank as an exemplar of a spoken language corpus and discuss properties of such corpora which are relevant when adapting the variation n-gram approach for detecting errors in syntactic annotation of spoken language corpora.



Electronically available file formats:

  • .pdf (104,893 bytes)



Bibtex entry:

  author =       {Markus Dickinson and W. Detmar Meurers},
  title =        {Detecting Annotation Errors in Spoken Language Corpora},
  booktitle =    {The Special Session on treebanks for spoken language 
  and discourse at NODALIDA-05},
  pages =        {},
  url =          {},
  year =         {2005},
  address =      {Joensuu, Finland}