On Automatically Analyzing Learner Language

Detmar Meurers
Universität Tübingen
detmar.meurers@uni-tuebingen.de

The automatic analysis of learner language using algorithms and resources from computational linguistics is potentially relevant in several related contexts (cf. Meurers in press). In intelligent tutoring systems, it is needed to provide immediate feedback on learner language produced in activities supporting a range of well-formed and ill-formed variation. In learner corpus research, automatic natural language processing can be used to annotate large corpora with the goal of supporting the discovery and retrieval of examples and phenomena which are of relevance to theories of second language acquisition (SLA) and to foreign language teaching practice. In this talk, I consider aspects of both form and meaning analysis arising in such on-line and off-line processing of learner language.

Addressing the analysis of form, I want to raise some questions about the nature of the categories which are appropriate and useful for annotating learner language, and about what role the context and explicit tasks play in the interpretation of learner language. SLA research essentially observes correlations of linguistic properties exhibited in learner language, whether erroneous or not. Correspondingly, in contrast to the traditional focus on error annotation, I will argue in favor of annotating a range of linguistic properties at different levels of granularity to support the retrieval of examples that are of relevance to SLA research questions. This raises the challenge of defining linguistic annotation schemes for learner language and of automatically annotating learner corpora with such information – issues that have recently been gaining attention (de Haan 2000, de Mönnink 2000, van Rooy & Schäfer 2002, 2003, Myles & Mitchell 2004, Dickinson & Ragheb 2009, to appear, Rastelli 2009, Lu 2010, Sagae et al. 2010, Ott & Ziai 2010).

Based on our analysis in Díaz-Negrillo et al. (2010), I will show that conceptualizing the annotation of learner language as a task of robustly applying standard annotation schemes developed for native language fails to identify important interlanguage characteristics. Robustness essentially is the ability to ignore variation in the realization of a category to be identified, yet variation in the realization of a category arguably is an important characteristic of learner language. At a given level of annotation, schemes for learner language should therefore provide access to minimal observations. Where the variation is not of interest as such, it should be made explicit which type of evidence is systematically prioritized when the evidence conflicts.

Making this concrete, example (1), taken from the NOCE learner corpus (Díaz-Negrillo 2009), includes the underlined word choiced, which distributionally in this sentence seems to be a verb, and in terms of its past-tense -ed morphology also seems to be a verb, but lexically choice is a noun or adjective.

(1) People who speak another language have more opportunities to be choiced for a job

A tripartite part-of-speech tag annotating distribution, lexical information, and morphology separately provides access to those observations, while research into more abstract patterns such as part-of-speech sequences (Aarts & Granger 1998, Wiersma et al. 2011) can make the precise selection and prioritization of the empirical evidence used explicit.
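To make this concrete computationally, the minimal sketch below (in Python) shows how recording the three evidence sources in separate fields keeps the conflicting observations for choiced accessible; the class and tag values are hypothetical illustrations, not the actual scheme of Díaz-Negrillo et al. (2010).

```python
from dataclasses import dataclass

@dataclass
class TripartiteTag:
    """One POS observation per evidence source (hypothetical scheme)."""
    token: str
    distributional: str  # category suggested by the syntactic context
    morphological: str   # category suggested by inflectional morphology
    lexical: set         # categories the lexicon lists for the stem

    def conflicts(self) -> bool:
        """True if context/morphology point outside the lexical classes."""
        return not {self.distributional, self.morphological} <= self.lexical

# Example (1): distribution and -ed morphology suggest a verb,
# but the stem 'choice' is lexically a noun or adjective.
choiced = TripartiteTag(token="choiced",
                        distributional="V",
                        morphological="V",
                        lexical={"N", "ADJ"})
print(choiced.conflicts())  # True: the interlanguage mismatch is explicit
```

Queries over such records can then select and prioritize whichever evidence source a given study requires, instead of committing the corpus to a single robustly assigned tag.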
The value of encoding minimal observations receives independent support from recent SLA research (Zyzik & Azevedo 2009) showing that L2 learners have difficulty distinguishing between word classes among semantically related forms, apparently indicating a limitation in their ability to interpret syntactic and morphological cues. Going beyond the lexical level, I will suggest that decomposing the syntactic annotation of learner corpora into chunks, dependencies, and sentence topology will provide comparable benefits. Ultimately, the annotation of minimal observations also stands to improve inter-annotator agreement, highlighting which distinctions can reliably be annotated and which should be abandoned.

Turning from the linguistic categories to the obvious but often implicit fact that any analysis or annotation is naturally based on an interpretation of the learner data, I will argue for the importance of considering which types of activities or tasks support a reliable interpretation of learner language for which learners. As in language testing (Bachman & Palmer 1996), the relation between the task and the learner language produced is crucial for supporting valid analyses, whether in an intelligent tutoring system or in annotating a learner corpus. Making the task explicit also makes it possible to analyze and evaluate aspects of meaning. To illustrate this point, I will sketch our work in the CoMiC project (http://purl.org/icall/comic) on automatically evaluating the meaning of learner answers to reading comprehension questions. Such activities, with questions and a text linguistically encoding the information that the questions are about, support an investigation of the modeling needed to interpret and evaluate meaning in the face of significant well-formed and ill-formed variation in the learner responses.
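To give a feel for this kind of evaluation, the following sketch scores a learner answer against a target answer by simple token overlap; this is an illustrative baseline over invented data, not the CoMiC system itself, which aligns answers at several linguistic levels.

```python
def token_overlap(learner_answer: str, target_answer: str) -> float:
    """Share of target-answer tokens found in the learner answer.

    Illustrative baseline only: a system like CoMiC aligns tokens,
    chunks, and dependencies with richer NLP in order to handle the
    well-formed and ill-formed variation discussed above."""
    learner = set(learner_answer.lower().split())
    target = set(target_answer.lower().split())
    if not target:
        return 0.0
    return len(learner & target) / len(target)

# Hypothetical reading comprehension item: the question and source text
# make explicit which information the answer is expected to encode.
target = "He moved to Berlin because he found a job there"
learner = "He moved to Berlin becaus he find a job"
print(round(token_overlap(learner, target), 2))  # 0.67
```

Note how the misspelled becaus and the unexpected verb form find, precisely the well-formed and ill-formed variation at issue, are what such a surface baseline fails to credit, motivating deeper meaning analysis.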
References

Aarts, J. & S. Granger (1998). Tag Sequences in Learner Corpora: A Key to Interlanguage Grammar and Discourse. In S. Granger (ed.), Learner English on Computer. London: Longman, 132–141.

Bachman, L. F. & A. S. Palmer (1996). Language Testing in Practice: Designing and Developing Useful Language Tests. Oxford: Oxford University Press.

de Haan, P. (2000). Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (eds.), Corpus Linguistics and Linguistic Theory. Rodopi, 69–79.

de Mönnink, I. (2000). Parsing a learner corpus. In C. Mair & M. Hundt (eds.), Corpus Linguistics and Linguistic Theory. Rodopi, 81–90.

Díaz-Negrillo, A. (2009). EARS: A User's Manual. Munich, Germany: LINCOM Academic Reference Books.

Díaz-Negrillo, A., D. Meurers, S. Valera & H. Wunsch (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum 36(1–2), 139–154. Special issue on corpus linguistics for teaching and learning. http://purl.org/dm/papers/diaz-negrillo-et-al-09.html

Dickinson, M. & M. Ragheb (2009). Dependency annotation for learner corpora. In Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories (TLT-8), Milan. http://jones.ling.indiana.edu/~mdickinson/papers/dickinson-ragheb09.html

Dickinson, M. & M. Ragheb (to appear). Avoiding the comparative fallacy in the annotation of learner corpora. In Selected Proceedings of the 2010 Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.

Lu, X. (2010). Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics 15(4), 474–496.

Meurers, D. (in press). Natural Language Processing and Language Learning. In C. A. Chapelle (ed.), Encyclopedia of Applied Linguistics. Blackwell. http://purl.org/dm/papers/meurers-11.html

Myles, F. & R. Mitchell (2004). Using information technology to support empirical SLA research. Journal of Applied Linguistics 1(2), 169–196. http://www.equinoxjournals.com/JAL/article/viewArticle/1444

Ott, N. & R. Ziai (2010). Evaluating dependency parsing performance on German learner language. In Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), Tartu, Estonia. http://www.sfs.uni-tuebingen.de/~rziai/papers/Ott.Ziai-10.pdf

Rastelli, S. (2009). Learner corpora without error tagging. Linguistik online 38(2). http://www.linguistik-online.de/38_09/rastelli.html

Rosén, V. & K. De Smedt (2010). Syntactic annotation of learner corpora. In H. Johansen, A. Golden, J. E. Hagen & A.-K. Helland (eds.), Systematisk, variert, men ikke tilfeldig. Antologi om norsk som andrespråk i anledning Kari Tenfjords 60-årsdag [Systematic, varied, but not arbitrary: Anthology on Norwegian as a second language on the occasion of Kari Tenfjord's 60th birthday]. Oslo: Novus forlag, 120–132.

Sagae, K., E. Davis, A. Lavie, B. MacWhinney & S. Wintner (2010). Morphosyntactic annotation of CHILDES transcripts. Journal of Child Language 37(3), 705–729.

van Rooy, B. & L. Schäfer (2002). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies 20, 325–335.

van Rooy, B. & L. Schäfer (2003). An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus. In D. Archer, P. Rayson, A. Wilson & T. McEnery (eds.), Proceedings of Corpus Linguistics 2003, vol. 16 of University Centre for Computer Corpus Research on Language Technical Papers, 835–844.

Wiersma, W., J. Nerbonne & T. Lauttamus (2011). Automatically extracting typical syntactic differences from corpora. Literary and Linguistic Computing 26(1), 107–124. http://llc.oxfordjournals.org/content/26/1/107

Zyzik, E. & C. Azevedo (2009). Word class distinctions in second language acquisition. Studies in Second Language Acquisition 31, 1–29. http://journals.cambridge.org/production/action/cjoGetFulltext?fulltextid=3981776