Prof. Dr. Walt Detmar Meurers
Native Language Identification using Recurring n-grams -- Investigating Abstraction and Domain Dependence


Serhiy Bykh and Walt Detmar Meurers


Proceedings of COLING 2012, the 24th International Conference on Computational Linguistics.


Native Language Identification tackles the problem of determining the native language of an author based on a text the author has written in a second language. In this paper, we discuss the systematic use of recurring n-grams of any length as features for training a native language classifier. Starting with surface n-grams, we investigate two degrees of abstraction incorporating parts-of-speech. The approach outperforms previous work employing a comparable data setup, reaching 89.71% accuracy for a task with seven native languages using data from the International Corpus of Learner English (ICLE). We then investigate the claim by Brooke and Hirst (2011) that a content bias in ICLE seems to result in an easy classification by topic instead of by native language characteristics. We show that training our model on ICLE and testing it on three other, independently compiled learner corpora dealing with other topics still results in high accuracy classification.
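As a rough illustration of the feature idea described in the abstract, the following Python sketch collects surface n-grams of every length and keeps those that recur across texts. It is not the authors' implementation. It assumes that "recurring" means occurring in at least two training texts and that each feature is binary (present or absent), neither of which the abstract spells out. The function names `ngrams`, `recurring_ngrams`, and `binary_vector` and the toy sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) of length n in one token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def recurring_ngrams(texts, min_texts=2):
    """Collect n-grams of any length found in at least `min_texts` texts.

    Assumption: "recurring" is read as a document-frequency threshold of 2.
    """
    doc_freq = Counter()
    for tokens in texts:
        seen = set()
        for n in range(1, len(tokens) + 1):
            seen |= ngrams(tokens, n)
        doc_freq.update(seen)  # each n-gram counts once per text
    return sorted(g for g, count in doc_freq.items() if count >= min_texts)


def binary_vector(tokens, features):
    """Encode one text as 0/1 indicators over the recurring n-grams."""
    present = set()
    for n in {len(f) for f in features}:
        present |= ngrams(tokens, n)
    return [1 if f in present else 0 for f in features]


# Toy training texts (invented for illustration only).
texts = [
    "i am agree with this".split(),
    "i am agree with him".split(),
    "she agrees with this".split(),
]
features = recurring_ngrams(texts)
vector = binary_vector("they agree with this".split(), features)
```

The same functions would apply to the more abstract representations the abstract mentions if some or all tokens were replaced by part-of-speech tags before extraction. The resulting vectors could then be passed to any standard classifier.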






BibTeX entry:

  author    = {Bykh, Serhiy  and  Meurers, Detmar},
  title     = {Native Language Identification using Recurring $n$-grams -- 
               Investigating Abstraction and Domain Dependence},
  booktitle = {Proceedings of the 24th International Conference on 
               Computational Linguistics (COLING 2012)},
  month     = {December},
  year      = {2012},
  address   = {Mumbai, India},
  publisher = {The COLING 2012 Organizing Committee},
  pages     = {425--440},
  url       = {},
  pdf       = {}