Building a gold standard parsed corpus for French

Anne Abeillé and Lionel Clément
Université Paris 7
abeille,lionel.clement@linguist.jussieu.fr

Abstract

Very few gold standard annotated corpora are currently available for French. We present an ongoing project to build a reference treebank for French, starting from a tagged newspaper corpus of 1 million words (Abeille&al 98), (Abeille&al 99). As in the Penn Treebank (Marcus&al 93), an automatic parsing phase is followed by a second phase of systematic manual validation and correction. As in the Prague treebank (Hajicova&al 98), we rely on several types of morphosyntactic and syntactic annotations, for which we define extensive guidelines. Our goal is to provide a theory-neutral, surface-oriented, error-free treebank for French. At present we annotate only major phrase boundaries; as in the Negra project (Brants&al 99), the next step will be to annotate both constituents and functional relations.

