CZECH NATIONAL CORPUS (CNC)
Jana Klimova
HISTORY
The idea of CNC was first mentioned in 1991 in the statement
of intent which was signed by 8 signatories, representatives
of the following institutions: Faculty of Philosophy, Charles
University; Faculty of Mathematics and Physics, Charles
University; Masaryk University; Palacký University; and the
Institute of Czech Language, Academy of Sciences. The aim of
this group was to halt the continuing decline of Czech
lexicography, to bring it into the computer age, and to
coordinate the various workplaces in our country with an
interest in the computerization of language resources and
keep them mutually informed. Computer lexicography and
corpus linguistics had not existed as a separate branch of
linguistics in our country before that time, and the
political changes opened up possibilities for broader
cooperation with foreign countries, where these branches of
computational linguistics had already been developed.
It very soon became clear that the CNC project needed its
own workplace. Thus in September 1994 the Institute of CNC
(ICNC) was founded at the Faculty of Philosophy of Charles
University. Agreements on cooperation were signed with other
university workplaces in our country.
FINANCING
The original plan to build a corpus of about 20 million
running words in about 8 months was both overambitious and
not farsighted enough. The total budget was about 3.5
million Czech crowns. The group approached potential
sponsors, perhaps hundreds of them. The result was not very
good: over a period of 3 years we received 350,000 Czech
crowns from one of the most important banks in our country.
In 1993 the Grant Agency was established by the Czech
government, and we received a grant for building the
"Text Corpus of Czech Written Texts" for a 3-year period.
Moreover, the next year we obtained another grant for
developing the "Programming Tools for the Computer
Processing of the Czech Texts", also for a 3-year period.
Both grants provided us with about 2 million Czech crowns
for establishing a new workplace for the ICNC, for the
purchase of necessary hardware and software, etc. This year
we have started work on a new project named "The Czech
Language in the Age of Computers", which is to be finished
in 6 years in cooperation with other workplaces.
COLLECTING AND PROCESSING OF DATA
A. The data for the buildup of the corpus are collected in
three ways:
1. Texts are obtained in electronic form from publishing
houses and individual owners who have agreed to make them
available to us. With every publishing house we had to sign
an agreement stating that the texts will not be used for any
commercial activity but only for academic research. We get
all the texts free of charge. The majority of the texts
obtained are newspapers (about 60 per cent of the raw data),
the two most important dailies being LN and MF. We also get
quite a large variety of books, especially fiction.
2. Scanning with OCR is the second way of obtaining data. It
is mainly older Czech dictionaries, which do not exist in
electronic form, that are scanned. The dictionaries are an
important part of a future lexical database and also form
a basis for the future large dictionary of contemporary
Czech.
3. Transcribing will be used for obtaining some ephemeral
data.
B. All the raw texts, received in various forms (written
with various text editors), are converted into a unified
format and marked up. We have created our own DTD (document
type definition) in SGML for marking up our texts.
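The conversion step can be illustrated by a minimal sketch
that wraps plain paragraphs in SGML-style markup. The element
names (doc, p), the attribute names (id, source) and the
function name to_sgml are hypothetical and chosen only for
this example; they are not the tag set of the actual CNC DTD.

```python
from html import escape  # escapes &, < and >, which are also special in SGML


def to_sgml(raw_text, doc_id, source):
    """Wrap a plain-text document in simple SGML-style markup.

    Element and attribute names are illustrative assumptions,
    not the tag set of the real CNC DTD.
    """
    # Assumption: paragraphs are separated by blank lines in the raw text.
    paragraphs = [" ".join(p.split()) for p in raw_text.split("\n\n")]
    paragraphs = [p for p in paragraphs if p]

    lines = ['<doc id="%s" source="%s">' % (escape(doc_id), escape(source))]
    for p in paragraphs:
        # Whitespace inside a paragraph has been normalised to single spaces.
        lines.append("<p>%s</p>" % escape(p, quote=False))
    lines.append("</doc>")
    return "\n".join(lines)
```

A real converter would need one input filter per text editor
format; only the final, unified output stage is sketched here.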
The texts should also be cleaned, which means that some
parts of the texts (advertisements, sports results) should
be removed. We also have some Slovak texts from the period
of the common Czechoslovak state. The problem of cleaning
the texts is not yet solved: it must either be done by hand,
or tools for the task have to be developed.
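One possible direction for such a tool is a simple heuristic
filter. The sketch below is an assumption about how it might
look, not a description of an existing CNC tool: it flags
paragraphs in which a large share of tokens contain digits,
which is typical of results tables. The threshold 0.4 and the
function names are hypothetical, and advertisements would
need a different test.

```python
import re


def looks_like_results(paragraph, max_digit_share=0.4):
    """Guess whether a paragraph is a results table rather than running text.

    Hypothetical heuristic: the share of tokens containing a digit
    exceeds max_digit_share.
    """
    tokens = paragraph.split()
    if not tokens:
        return True  # empty paragraphs carry no text and are dropped too
    with_digits = sum(1 for t in tokens if re.search(r"\d", t))
    return with_digits / len(tokens) > max_digit_share


def clean(paragraphs, max_digit_share=0.4):
    """Return (kept, removed) so that removed items can be checked by hand."""
    kept, removed = [], []
    for p in paragraphs:
        if looks_like_results(p, max_digit_share):
            removed.append(p)
        else:
            kept.append(p)
    return kept, removed
```

Returning the removed paragraphs as well keeps the manual
check possible, since a heuristic of this kind will make
mistakes in both directions.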
C. The final step is processing the data with a corpus
tool. We are using CQP (Corpus Query Processor), developed
at the University of Stuttgart. This program makes it
possible to search the corpus for various kinds of
linguistic features. A complex corpus manager is currently
being developed, which should provide a user-friendly
interface.
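To give an idea of the kind of search involved, the sketch
below matches a sequence of regular expressions against
consecutive word forms. It is only an illustration in Python,
not CQP itself; in CQP notation a comparable query would be
written roughly as [word="česk.*"] [word="jazyk.*"]. The
function name find_sequences is hypothetical.

```python
import re


def find_sequences(tokens, patterns):
    """Return start positions where consecutive tokens match the patterns.

    Each pattern is a regular expression that must match a whole token.
    """
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for start in range(len(tokens) - len(compiled) + 1):
        window = tokens[start:start + len(compiled)]
        if all(rx.fullmatch(tok) for rx, tok in zip(compiled, window)):
            hits.append(start)
    return hits


tokens = "Ústav českého jazyka a český jazyk".split()
positions = find_sequences(tokens, ["česk.*", "jazyk.*"])
```

Because the corpus is not yet lemmatized, the patterns work
on word forms, so inflected variants have to be covered by
the regular expressions themselves.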
THE COMPOSITION OF CNC
A. The largest and most worked-on part is the synchronic
part, which contains texts of contemporary written Czech.
Originally the core of the project aimed at a corpus of 20
million word forms. This task has already been accomplished,
and the first part of CNC has been processed by CQP and is
available on the Internet at http://ucnk.ff.cuni.cz/cnc
It is planned to enlarge the CNC
a. up to 70 million word forms by the end of '97
b. up to 100 million word forms by the end of '99
c. up to 200 million word forms by the end of the project
started this year, that is, within a span of 6 years
It became evident that if the corpus is to serve as a basis
for a new dictionary, it should be balanced and
representative, which means large enough (at least 100
million word forms). A balanced corpus should contain
a great variety of texts, with the different text types
represented in given proportions. The proportions will be
established on the basis of sociological research into the
reception and production of texts.
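Once such proportions are known, turning them into target
sizes is simple arithmetic. The sketch below assumes
percentages per text type; the text types and percentages
used are placeholders for illustration only, since the real
proportions have yet to be established.

```python
def target_sizes(total_words, percentages):
    """Split a total corpus size into per-type targets.

    percentages maps a text type to its whole-number share in per cent.
    """
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {text_type: total_words * pct // 100
            for text_type, pct in percentages.items()}


# Placeholder shares, not results of the sociological research.
example = target_sizes(100_000_000,
                       {"newspapers": 50, "fiction": 30, "other": 20})
```

Comparing these targets with the word counts already
collected shows which text types still have to be acquired.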
The corpus is built on the following principles. Newspaper
texts are included from 1990 onwards. All texts are subject
to the following limitation: no author may have been born
before 1890. The texts for the corpus are processed in
a ratio of one text from the period 1960-1989 to six
published after 1990. The reason for that decision is that
since 1990 the book market has been driven by market
mechanisms, and the works published have a genuine
readership. In establishing the time frame of the texts it
had to be taken into consideration that, for the
lexicographic processing of contemporary Czech, the 1960s
represent an important phase in the development of the
language. Moreover, the significance of 1989, when all
censorship ended and the situation of the language changed
fundamentally, is also evident.
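These selection principles can be written down as a small
filter. The sketch below is an illustration under stated
assumptions: the record type Text and its fields are
hypothetical, 1990 itself is counted among the recent texts,
and 1960 is taken as the lower bound for publication.

```python
from collections import namedtuple

# Hypothetical record for one candidate text.
Text = namedtuple("Text", ["title", "author_born", "published", "newspaper"])


def eligible(t):
    """Apply the stated limits to one text."""
    if t.author_born < 1890:
        return False                 # no author born before 1890
    if t.newspaper and t.published < 1990:
        return False                 # newspapers only from 1990 onwards
    return t.published >= 1960       # assumed lower bound of the time frame


def select(texts, n_old=1, n_new=6):
    """Pick texts so that older (1960-1989) and recent (1990 onwards)
    texts stand in the ratio n_old : n_new, here 1 : 6."""
    pool = [t for t in texts if eligible(t)]
    older = [t for t in pool if t.published <= 1989]
    recent = [t for t in pool if t.published >= 1990]
    # Number of complete 1:6 groups that the available material allows.
    groups = min(len(older) // n_old, len(recent) // n_new)
    return older[:groups * n_old] + recent[:groups * n_new]
```

The sketch takes the first texts in each group; a real
selection would also have to respect the balance between
text types.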
B. The diachronic part of the CNC is being developed
separately, and this work proceeds quite slowly because all
texts have to be scanned or transcribed. All older stages of
Czech are being collected, from the beginning of written
Czech texts.
C. The oral part of CNC contains about 500,000 word forms of
recorded spoken Prague Czech. A similar subcorpus of the
spoken Moravian language, which is somewhat different, is
being set up in Brno.
D. The archive of the raw data is being built simultaneously
as the source for the representative corpus.
HARDWARE AND SOFTWARE
We are equipped with 2 Sun SPARCstations with about 15 GB of
hard disk capacity and several PCs with a further 8 GB of
hard disk capacity. We also have a scanner with Prolector
software. The most used software is the already mentioned
CQP, several SGML parsers and several programmes for the
processing of concordances and collocations.
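For readers unfamiliar with these two notions, the following
minimal sketch shows what such programmes compute. It is an
illustration only, not one of the programmes in use at the
ICNC; the function names and the window sizes are
assumptions.

```python
from collections import Counter


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context triples: (left context, keyword, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words occurring within `window` tokens of the keyword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts
```

Raw co-occurrence counts favour frequent words, so in
practice they are usually weighted by a statistical
association measure before being interpreted.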
GOALS AND FUTURE PLANS FOR CNC
The corpus will serve as the largest and an invaluable
source of data for the new dictionary of the Czech language,
for creating a lexical database, and for all kinds of
linguistic research. The corpus should be further enlarged
and kept balanced.
The problems of tagging, lemmatization and parsing should be
solved.
The user-friendly complex corpus manager should be finished
this year.