On Automatically Analyzing Learner Language

Detmar Meurers
Universität Tübingen
detmar.meurers@uni-tuebingen.de

The automatic analysis of learner language using algorithms and resources from computational linguistics is potentially relevant in several related contexts (cf. Meurers in press). In intelligent tutoring systems, it is needed to provide immediate feedback on learner language produced in activities supporting a range of well-formed and ill-formed variation. In learner corpus research, automatic natural language processing can be used to annotate large corpora with the goal of supporting the discovery and retrieval of examples and phenomena which are of relevance to theories of second language acquisition (SLA) and to foreign language teaching practice. In this talk, I consider aspects of both form and meaning analysis arising in such on-line and off-line processing of learner language.

Addressing the analysis of form, I want to raise some questions about the nature of the categories which are appropriate and useful for annotating learner language, and about what role the context and explicit tasks play in the interpretation of learner language. SLA research essentially observes correlations of linguistic properties exhibited in learner language, whether erroneous or not. Correspondingly, in contrast to the traditional focus on error annotation, I will argue in favor of annotating a range of linguistic properties at different levels of granularity to support the retrieval of examples that are of relevance to SLA research questions. This raises the challenge of defining linguistic annotation schemes for learner language and of automatically annotating learner corpora with such information – issues that have recently been gaining attention (de Haan 2000, de Mönnink 2000, van Rooy & Schäfer 2002, 2003, Myles & Mitchell 2004, Dickinson & Ragheb 2009, to appear, Rastelli 2009, Lu 2010, Sagae et al. 2010, Ott & Ziai 2010).

Based on our analysis in Díaz-Negrillo et al. (2010), I will show that conceptualizing the annotation of learner language as a task of robustly applying standard annotation schemes developed for native language fails to identify important interlanguage characteristics. Robustness essentially is the ability to ignore variation in the realization of a category to be identified, yet variation in the realization of a category arguably is an important characteristic of learner language. At a given level of annotation, schemes for learner language should therefore provide access to minimal observations. Where the variation is not of interest as such, it should be made explicit which type of evidence is systematically prioritized when the evidence conflicts.

Making this concrete, example (1), taken from the NOCE learner corpus (Díaz-Negrillo 2009), includes the underlined word choiced, which distributionally in this sentence seems to be a verb, and in terms of its past-tense -ed morphology also seems to be a verb, but lexically choice is a noun or adjective.

(1) People who speak another language have more opportunities to be choiced for a job

A tripartite part-of-speech tag annotating distribution, lexical information, and morphology separately provides access to those observations, while research into more abstract patterns such as part-of-speech sequences (Aarts & Granger 1998, Wiersma et al. 2011) can make the precise selection and prioritization of the empirical evidence used explicit.
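To make this concrete computationally, the minimal sketch below (in Python) shows how recording the three evidence sources in separate fields keeps the conflicting observations for choiced accessible; the class and tag values are hypothetical illustrations, not the actual scheme of Díaz-Negrillo et al. (2010).

```python
from dataclasses import dataclass

@dataclass
class TripartiteTag:
    """One POS observation per evidence source (hypothetical scheme)."""
    token: str
    distributional: str  # category suggested by the syntactic context
    morphological: str   # category suggested by inflectional morphology
    lexical: set         # categories the lexicon lists for the stem

    def conflicts(self) -> bool:
        """True if context/morphology point outside the lexical classes."""
        return not {self.distributional, self.morphological} <= self.lexical

# Example (1): distribution and -ed morphology suggest a verb,
# but the stem 'choice' is lexically a noun or adjective.
choiced = TripartiteTag(token="choiced",
                        distributional="V",
                        morphological="V",
                        lexical={"N", "ADJ"})
print(choiced.conflicts())  # True: the interlanguage mismatch is explicit
```

Queries over such records can then select and prioritize whichever evidence source a given study requires, instead of committing the corpus to a single robustly assigned tag.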
The value of encoding minimal observations receives independent support from recent SLA research (Zyzik & Azevedo 2009) showing that L2 learners have difficulty distinguishing between word classes among semantically related forms, apparently indicating a limitation in their ability to interpret syntactic and morphological cues. Going beyond the lexical level, I will suggest that decomposing the syntactic annotation of learner corpora into chunks, dependencies, and sentence topology will provide comparable benefits. Ultimately, the annotation of minimal observations also stands to improve inter-annotator agreement, highlighting which distinctions can reliably be annotated and which should be abandoned.

Turning from the linguistic categories to the obvious but often implicit fact that any analysis or annotation is naturally based on an interpretation of the learner data, I will argue for the importance of considering which types of activities or tasks support a reliable interpretation of learner language for which learners. As in language testing (Bachman & Palmer 1996), the relation between the task and the learner language produced is crucial for supporting valid analyses, whether in an intelligent tutoring system or in annotating a learner corpus. Making the task explicit also makes it possible to analyze and evaluate aspects of meaning. To illustrate this point, I will sketch our work in the CoMiC project (http://purl.org/icall/comic) on automatically evaluating the meaning of learner answers to reading comprehension questions. Such activities, with questions and a text linguistically encoding the information that the questions are about, support an investigation of the modeling needed to interpret and evaluate meaning in the face of significant well-formed and ill-formed variation in the learner responses.
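To give a feel for this kind of evaluation, the following sketch scores a learner answer against a target answer by simple token overlap; this is an illustrative baseline over invented data, not the CoMiC system itself, which aligns answers at several linguistic levels.

```python
def token_overlap(learner_answer: str, target_answer: str) -> float:
    """Share of target-answer tokens found in the learner answer.

    Illustrative baseline only: a system like CoMiC aligns tokens,
    chunks, and dependencies with richer NLP in order to handle the
    well-formed and ill-formed variation discussed above."""
    learner = set(learner_answer.lower().split())
    target = set(target_answer.lower().split())
    if not target:
        return 0.0
    return len(learner & target) / len(target)

# Hypothetical reading comprehension item: the question and source text
# make explicit which information the answer is expected to encode.
target = "He moved to Berlin because he found a job there"
learner = "He moved to Berlin becaus he find a job"
print(round(token_overlap(learner, target), 2))  # 0.67
```

Note how the misspelled becaus and the unexpected verb form find, precisely the well-formed and ill-formed variation at issue, are what such a surface baseline fails to credit, motivating deeper meaning analysis.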
References

Aarts, J. & S. Granger (1998). Tag Sequences in Learner Corpora: A Key to Interlanguage Grammar and Discourse. In S. Granger (ed.), Learner English on Computer. London: Longman, 132–141.

Bachman, L. F. & A. S. Palmer (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

de Haan, P. (2000). Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (eds.), Corpus Linguistics and Linguistic Theory. Rodopi, 69–79.

de Mönnink, I. (2000). Parsing a learner corpus. In C. Mair & M. Hundt (eds.), Corpus Linguistics and Linguistic Theory. Rodopi, 81–90.

Díaz-Negrillo, A. (2009). EARS: A User's Manual. Munich, Germany: LINCOM Academic Reference Books.

Díaz-Negrillo, A., D. Meurers, S. Valera & H. Wunsch (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36(1–2), 139–154. Special issue on corpus linguistics for teaching and learning. http://purl.org/dm/papers/diaz-negrillo-et-al-09.html

Dickinson, M. & M. Ragheb (2009). Dependency annotation for learner corpora. In Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT-8), Milan. http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-ragheb09.html

Dickinson, M. & M. Ragheb (to appear). Avoiding the comparative fallacy in the annotation of learner corpora. In Selected Proceedings of the 2010 Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics 15(4), 474–496.

Meurers, D. (in press). Natural Language Processing and Language Learning. In C. A. Chapelle (ed.), Encyclopedia of Applied Linguistics. Blackwell. http://purl.org/dm/papers/meurers-11.html

Myles, F. & R. Mitchell (2004). Using information technology to support empirical SLA research. Journal of Applied Linguistics 1(2), 169–196. http://www.equinoxjournals.com/JAL/article/viewArticle/1444

Ott, N. & R. Ziai (2010). Evaluating dependency parsing performance on German learner language. In Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), Tartu, Estonia. http://www.sfs.uni-tuebingen.de/~rziai/papers/Ott.Ziai-10.pdf

Rastelli, S. (2009). Learner corpora without error tagging. Linguistik online 38(2). http://www.linguistik-online.de/38_09/rastelli.html

Rosén, V. & K. De Smedt (2010). Syntactic annotation of learner corpora. In H. Johansen, A. Golden, J. E. Hagen & A.-K. Helland (eds.), Systematisk, variert, men ikke tilfeldig. Antologi om norsk som andrespråk i anledning Kari Tenfjords 60-årsdag [Systematic, varied, but not arbitrary: Anthology on Norwegian as a second language on the occasion of Kari Tenfjord's 60th birthday]. Oslo: Novus forlag, 120–132.

Sagae, K., E. Davis, A. Lavie, B. MacWhinney & S. Wintner (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language 37(3), 705–729.

van Rooy, B. & L. Schäfer (2002). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies 20, 325–335.

van Rooy, B. & L. Schäfer (2003). An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (eds.), Proceedings of Corpus Linguistics 2003, vol. 16 of University Centre for Computer Corpus Research on Language Technical Papers, 835–844.

Wiersma, W., J. Nerbonne & T. Lauttamus (2011). Automatically extracting typical syntactic differences from corpora. Literary and Linguistic Computing 26(1), 107–124. http://llc.oxfordjournals.org/content/26/1/107

Zyzik, E. & C. Azevedo (2009). Word class distinctions in second language acquisition. Studies in Second Language Acquisition 31, 1–29. http://journals.cambridge.org/production/action/cjoGetFulltext?fulltextid=3981776