Fall School 2002 in Sozopol. Course descriptions

COURSE DESCRIPTIONS

The following six courses will make up the instructional part of the school:

Computational Tools for Corpus Linguistics. Erhard Hinrichs, Tübingen; Sandra Kübler, Tübingen.
Corpus-Based Investigation of Issues in Pragmatics. Tilman Berger, Tübingen.
Head-Driven Phrase Structure Grammar for Slavic. Adam Przepiórkowski, Warsaw.
XML-based Corpus Linguistics. Kiril Simov, Sofia.
Morphological and Syntactic Tagging of Slavonic Languages. Vladimír Petkevic, Prague; Karel Oliva, Saarbrücken.
Applications of Text Corpora to Lexicography. Anatolij N. Baranov, Moscow.

Computational Tools for Corpus Linguistics

The availability of electronic text corpora has led to the development of various tools in computational linguistics for the construction, annotation and search of huge amounts of data.

The course will provide an overview of the most important aspects of processing text corpora:

Standards for the development of text corpora: XML and the Corpus Encoding Standard of the Text Encoding Initiative,
Tools and resources for the semi-automatic annotation of corpora: tokenizing, automatic recognition of parts of speech (''Tagging'' and the design of ''Tagsets''), morphological parsing, syntactic parsing (''Chunk Parsing'', grammar formalisms and grammar design, tree banks), and the disambiguation of the meaning of words,
Tools for the automatic search in corpora: design of query languages, graphical search tools.
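
As a rough illustration of two of the steps listed above, tokenizing and searching, the following minimal sketch splits a text into tokens and lists each occurrence of a keyword with its context. It is not one of the tools used in the course; the function names `tokenize` and `concordance`, the regular expression, and the sample sentence are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


# Hypothetical sample text, used only to exercise the two functions.
sample = "The corpus is large. A corpus needs annotation, and the corpus grows."
tokens = tokenize(sample)
for left, hit, right in concordance(tokens, "corpus"):
    print(f"{left:>25} [{hit}] {right}")
```

Real corpus tools handle abbreviations, clitics and language-specific rules that a single regular expression cannot capture.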

In the course we will, as far as possible, use tools and resources developed at the Special Research Program (Sonderforschungsbereich) 441. This will give the participants direct access to the relevant tools so that they can gain practical experience with developing and using them.

Lecturers: Erhard Hinrichs, Tübingen; Sandra Kübler, Tübingen.

Corpus-Based Investigation of Issues in Pragmatics

Initially, electronic corpora were used for lexicographic purposes. In recent years they have been increasingly employed for the investigation of grammatical questions. Pragmatic issues, however, have usually been ignored because of the traditional preference for using oral texts, which are practically unavailable as electronic corpora. This course will pursue the question of how well text corpora can nevertheless be used for pragmatic studies if the methodology is adjusted accordingly. More precisely, the following topics will be addressed:

Which types of written text support indirect conclusions on pragmatic issues and which precautions are necessary?
Which search mechanisms are suitable for finding pragmatic information? What kind of annotation is most sensible for these purposes?
Which mechanisms can be used for extracting information on the coordination of the concrete utterance situation from written corpora?

The issues sketched here will be treated with examples from deixis and the theory of speech acts (with emphasis on linguistic expression and politeness). The Slavic corpora of the Tübingen Special Research Program 441 will be used as materials.

Lecturer: Tilman Berger, Tübingen.

Head-Driven Phrase Structure Grammar for Slavic

The aim of this course is to introduce Head-driven Phrase Structure Grammar (HPSG), a constraint-based linguistic formalism. The empirical material used in the course will be drawn from Slavic languages, and the theoretical phenomena dealt with will include:

phrase structure;

agreement;

case assignment;

cliticization;

negation;

unbounded dependencies (so-called 'wh-movement');

other phenomena, depending on participants' interests.

After a presentation of analyses of these phenomena in the framework of HPSG, we will bridge the gap from work in theoretical linguistics to applications in corpus linguistics by investigating occurrences of the empirical phenomena at hand in corpora. This will give us an opportunity to re-evaluate the theoretical analyses in light of the properties of the data found in corpora, and it will highlight the advantages and research opportunities, but also the problems, of working with corpora in theoretical linguistics.

Lecturer: Adam Przepiórkowski, Warsaw.

XML-based Corpus Linguistics

This course will be based on the CLaRK system developed in the CLaRK Programme and actively used at the SfS and LML for the construction, management and exploration of annotated corpora of German and Bulgarian sentences. The course will cover the following topics:

Corpus linguistics. Basic notions and goals of corpus linguistics: Annotations, tasks, searching.

XML. Basic notions of the XML framework for document description and document exchange: DTD, document structure, elements, entities and attributes, well-formedness and validity.

Tokenizers. Basic and defined tokenizers in the CLaRK system.

Finite State Automata. The use of (cascaded) finite state automata (FSA) in the CLaRK system.

Searching. The XPath language for navigation in XML documents. FSA searching. Mixed mode of XPath and FSA searching.

Constraints. FSA and XPath constraints over XML documents. Linguistic applications of the constraints.
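
To give a feel for XPath-style searching in annotated XML, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. It does not use the CLaRK system or its document format; the element names `s` and `w`, the `pos` attribute, and the tag values are hypothetical and chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: one sentence <s> whose tokens <w> carry a
# part-of-speech tag in the "pos" attribute.
doc = """<s id="s1">
  <w pos="DET">The</w>
  <w pos="N">tagger</w>
  <w pos="V">labels</w>
  <w pos="DET">every</w>
  <w pos="N">token</w>
</s>"""


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(doc)

# XPath-like query: all <w> elements whose pos attribute equals "N".
nouns = [w.text for w in root.findall(".//w[@pos='N']")]
print(nouns)
```

Note that `is_well_formed` checks well-formedness only; validity against a DTD requires a validating parser, which the standard library module does not provide.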

All topics will be accompanied by practical exercises on the use of the CLaRK system. The exercises will include manual entry of mark-up, construction of simple FSA grammars, automatic transformation of documents, manual disambiguation of morphosyntactic information, searching, concordance construction, and the use of constraints to support linguistic annotation.

The style in which corpora are encoded in the CLaRK system will be compared with other styles of encoding corpora, especially with the referential annotation developed in the GATE system. The CLaRK system will be made available to all participants.

Lecturer: Kiril Simov, Sofia.

Morphological and Syntactic Tagging of Slavonic Languages

The objective of the workshop is to discuss issues concerning morphological, syntactic and other tagging of corpora of Slavic languages. This language family is characterized by specific morphological and syntactic features, which can be studied now that corpora of these languages exist. The workshop will enable researchers specializing in Slavic languages to inform one another about the latest results in the tagging of Slavic corpora.

One of the main topics will be the methodology of morphological tagging: a comparison of stochastic and rule-based tagging of Slavic languages, and an assessment of how tagging differs among the Slavic languages and between them and languages for which there is already plenty of tagging experience (English, German). Another key topic will be treebanks, i.e. syntactically annotated corpora: various approaches and methodologies used for syntactic annotation will be presented and compared.

One of the main aims of the workshop is to evaluate the current state of the art in Slavic corpora and their processing. Comparing the tagged and annotated corpora of Slavic languages with existing annotated corpora of Germanic and Romance languages could thus help reveal new typological differences between language families that could not have been discovered without corpora.

Lecturers: Vladimír Petkevic, Prague; Karel Oliva, Saarbrücken.

Applications of Text Corpora to Lexicography

At the heart of the course will be a comparison of traditional and computer-based approaches to the construction of dictionaries, with special emphasis on corpus-based lexicography. The question of a suitable form of text corpora for lexicographic purposes will be discussed on an introductory level. This will give rise to the discussion of issues of corpus annotation, the representativity of corpora, economy, and an optimized structuring of the data.

In the second part of the course, practical issues of text corpora in specific applications will be discussed on the basis of concrete corpus-oriented projects. The following projects could be used as starting points:

Research into discourse words in Russian
The project of creating a dictionary for Dostoevskij
The project of observing the development of the current political discourse on the basis of a text corpus of modern Russian journalistic texts (1995-1999)

Lecturer: Anatolij N. Baranov, Moscow.
