Prof. Dr. Walt Detmar Meurers
Linguistically Annotated Learner Corpora: Aspects of a Layered Linguistic Encoding and Standardized Representation


Detmar Meurers and Holger Wunsch


Proceedings of Linguistic Evidence 2010.


Linguistically annotated corpora that are stored in standardized digital form can be a valuable source of empirical insight. They can help verify linguistic generalizations and support the formulation of new hypotheses. The linguistic annotation of such corpora is often crucial for exploring them effectively from a linguistic perspective. The annotation essentially serves as an index to the linguistic classes and phenomena realized in the corpus (cf., e.g., Meurers 2005).


The situation is in principle parallel in the field of Second Language Acquisition (SLA) research, where an increasing number of corpora consisting of language written by language learners have been compiled (Granger 2008). Yet the linguistic annotation of learner data has received virtually no attention so far, apart from so-called error annotation, which marks language properties that differ from native language patterns (Díaz-Negrillo and Fernández-Domínguez 2006). This is surprising given that prominent strands of SLA research are concerned with linguistic regularities in the stages of the acquisition process (cf., e.g., Pienemann 1998), irrespective of whether they are erroneous or not. Learner language is typically viewed as a linguistic system worth characterizing in its own right, the so-called interlanguage. Thus, learner corpora require systematic linguistic annotation of both correct and incorrect structures if they are to effectively support the empirical questions addressed by SLA research (Díaz-Negrillo, Meurers, Valera, Wunsch 2009; Meurers 2009; Rastelli 2009).


In this paper, we report on work in progress investigating two aspects of the linguistic annotation of learner corpora. First, we motivate a new perspective on the part-of-speech (POS) categories of learner language and report on its implications for automatic POS tagging. Second, addressing a technical prerequisite of this work, we argue for a standardized representation format for annotated learner data.



Electronically available:

  • Paper pdf (136,526 bytes)
  • Poster pdf (242,676 bytes)


Note: The electronic versions of the publications linked on this page are the last versions I had the copyright for.



