Interactions between Machine Learning and Corpus Linguistics

Hervé Déjean
University of Tübingen
dejean@sfs.nphil.uni-tuebingen.de

Abstract

The emergence of annotated and bracketed corpora has encouraged the use of Machine Learning techniques in Natural Language Processing. Annotated corpora are generally used as input data by learning systems, with no feedback from the learner to the corpus. We would like to show that the results of symbolic Machine Learning techniques can be used to analyse or evaluate the linguistic resources used as input data. For this purpose, we present a Machine Learning system, ALLiS, which aims to generate a regular expression grammar from bracketed corpora. Although the primary purpose of this system is to parse texts, it can also be diverted from this goal to check the consistency of annotated corpora, or to compare and evaluate tagsets.

