CZECH NATIONAL CORPUS (CNC)

Jana Klimova

HISTORY

The idea of the CNC was first raised in 1991 in a statement
of intent signed by 8 signatories, representatives of the
following institutions: the Faculty of Philosophy, Charles
University; the Faculty of Mathematics and Physics, Charles
University; Masaryk University; Palacký University; and the
Institute of Czech Language, Academy of Sciences. The aim of
this group was to halt the continuing decline of Czech
lexicography and bring it into the computer age, and also to
coordinate the various workplaces in our country with an
interest in such a computerization of language resources and
to keep them mutually informed. Computer lexicography and
corpus linguistics had not existed as a special branch of
linguistics in our country before that time, and the
political changes opened up possibilities for broader
cooperation with foreign countries where these branches of
computational linguistics had already been developed.
     It very soon became clear that the CNC project needed
its own workplace. Thus in September 1994 the Institute of
CNC was founded at the Faculty of Philosophy of Charles
University. Agreements on cooperation were signed with other
university workplaces in our country.

FINANCING

The original plan to build up a corpus of about 20 million
running words in about 8 months was both overambitious and
not farsighted enough. The total budget was about 3.5
million Czech crowns. The group approached potential
sponsors, perhaps hundreds of them. The result was not very
good: over a period of 3 years we received 350,000 Czech
crowns from one of the most important banks in our country.
In 1993 the Grant Agency was established by the Czech
government and we received a grant for building the
"Text Corpus of Czech Written Texts" for a 3-year period.
The following year we obtained another grant for developing
the "Programming Tools for the Computer Processing of the
Czech Texts", also for a 3-year period. Both grants provided
us with about 2 million Czech crowns for establishing a new
workplace for the Institute of CNC (ICNC), for the purchase
of the necessary hardware and software, etc. This year we
have started working on a new project named "The Czech
Language in the Age of Computers", which is to be completed
in 6 years in cooperation with other workplaces.

COLLECTING AND PROCESSING OF DATA

A. The data for building the corpus are collected in
three ways:
1. Texts are obtained in electronic form from publishing
houses and individual owners who have agreed to make them
available to us. With every publishing house we had to sign
an agreement stating that the texts will not be used for
any commercial activity but only for academic research. We
get all the texts free of charge. The majority of the texts
obtained are newspapers (about 60 per cent of the raw
data), the two most important dailies being LN and MF. We
also get quite a large variety of books, especially
fiction.
2. Scanning with OCR is the second way of obtaining data.
It is mainly older Czech dictionaries, which do not exist
in electronic form, that are scanned. The dictionaries are
an important part of a future lexical database and also
form a basis for the future large dictionary of
contemporary Czech.
3. Transcription will be used to obtain some ephemeral
data.

B. All the raw texts, received in various forms (written
with various text editors), are converted into a unified
format and marked up. We have created our own DTD (document
type definition) in SGML for marking up our texts.
     The texts should also be cleaned, which means that some
parts of texts (advertisements, sports results) should be
removed. We also have some Slovak texts from the period of
the common Czechoslovak state. The problem of cleaning the
texts is not yet solved: it must either be done by hand, or
tools for the task have to be developed.

C. The final step in processing the data is loading them
into a corpus tool. We are using CQP (Corpus Query
Processor), developed at the University of Stuttgart. This
program makes it possible to search for various kinds of
linguistic features in the corpus. A comprehensive corpus
manager, intended as a user-friendly interface tool, is
currently being developed.

THE COMPOSITION OF CNC

A. The largest and most worked-on part is the synchronic
part, which contains texts of contemporary written Czech.
Originally the core of the project aimed at a corpus of 20
million word forms. This goal has already been reached: the
first part of the CNC has been processed by CQP and is
available on the Internet at http://ucnk.ff.cuni.cz/cnc
It is planned to enlarge the CNC
a. up to 70 million word forms by the end of '97
b. up to 100 million word forms by the end of '99
c. up to 200 million word forms by the end of the project
   started this year, that is, within a span of 6 years

It became evident that if the corpus is to serve as a basis
for a new dictionary, it should be balanced and
representative, which means large enough (at least 100
million word forms). A balanced corpus should contain a
great variety of texts, with the different text types in
given proportions. The proportions will be established on
the basis of sociological research into the reception and
production of texts.
The corpus is built on the following principles: newspaper
texts are included from 1990 onwards, and all texts are
subject to the following limitation: no author may have
been born before 1890. The texts for the corpus are
processed in a ratio of one text from the period 1960-1989
to six published after 1990. The reason for this decision
is that since 1990 the book market has been driven by
market mechanisms, and works have been published that have
a genuine readership. In establishing the time frame of
the texts it had to be taken into consideration that, for
the lexicographic processing of contemporary Czech, the
1960s mark an important phase in the development of the
language. Moreover, the significance of 1989, when all
censorship ended and the situation of the language changed
fundamentally, is also evident.
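
The selection principles above can be sketched as a small
eligibility check plus a quota calculation. This is only an
illustration in Python, not the project's actual tooling: the
Text record, its field names, the "newspaper"/"book" labels
and the function names are all hypothetical. It also assumes
that "after 1990" includes 1990 itself, and it says nothing
about books published before 1960, because the principles
above do not cover them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Text:
    """Hypothetical metadata record for one candidate text."""
    title: str
    kind: str               # "newspaper" or "book" (assumed labels)
    year_published: int
    author_birth_year: int


def is_eligible(text: Text) -> bool:
    """Apply the two stated limits on authors and newspapers."""
    # No author may have been born before 1890.
    if text.author_birth_year < 1890:
        return False
    # Newspaper texts are included only from 1990 onwards.
    if text.kind == "newspaper" and text.year_published < 1990:
        return False
    return True


def period_of(text: Text) -> Optional[str]:
    """Assign a text to one of the two periods used for the ratio."""
    if 1960 <= text.year_published <= 1989:
        return "1960-1989"
    if text.year_published >= 1990:  # assumption: 1990 counts as "after"
        return "after 1990"
    return None  # earlier texts are not covered by the stated ratio


def period_quotas(total_texts: int) -> dict:
    """Split a planned number of texts in the ratio 1 : 6."""
    older = total_texts // 7
    return {"1960-1989": older, "after 1990": total_texts - older}
```

For example, under these assumptions a batch of 70 texts
would be split into 10 texts from 1960-1989 and 60 texts
from the later period.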

B. The diachronic part of the CNC is being developed
separately, and this work proceeds quite slowly because all
texts have to be scanned or transcribed. Texts from all
older stages of Czech are being collected, going back to
the beginning of written Czech.

C. The oral part of the CNC contains about 500,000 word
forms of recorded spoken Prague Czech. A similar subcorpus
of the spoken language of Moravia, which is somewhat
different, is being set up in Brno.

D. The archive of raw data is being built up at the same
time as the source for the representative corpus.

HARDWARE AND SOFTWARE

We are equipped with 2 Sun SPARCstations with about 15 GB
of hard-disk capacity and several PCs with a further 8 GB
of hard-disk capacity. We also have a scanner with
Prolector software.
The most used software is the already mentioned CQP, several
SGML parsers and several programmes for the processing of
concordances and collocations.
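
To illustrate what processing concordances and collocations
involves, here is a minimal sketch in Python. It is not one
of the programmes used at the ICNC, which are not named
here; the function names, the whitespace tokenization and
the fixed window width are assumptions made for the example.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, node, right context) for each hit."""
    key = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, keyword, width=3):
    """Count the words found within `width` tokens of the keyword."""
    key = keyword.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == key:
            window = tokens[max(0, i - width):i] + tokens[i + 1:i + 1 + width]
            counts.update(w.lower() for w in window)
    return counts


# Toy input; a real corpus would be tokenized far more carefully.
sample = "the corpus is large and the corpus is balanced".split()
for left, node, right in kwic(sample, "corpus", width=2):
    print(f"{left:>10} [{node}] {right}")
```

Raw co-occurrence counts like these are only a starting
point; collocation work normally also weighs them against
the overall frequency of each word.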

GOALS AND FUTURE PLANS FOR CNC

The corpus will serve as an invaluable source of data, and
the largest one, for the new dictionary of the Czech
language, for creating a lexical database, and for all
kinds of linguistic research. The corpus should be further
enlarged and kept balanced.
The problems of tagging, lemmatization and parsing remain to
be solved.
The user-friendly comprehensive corpus manager should be
finished this year.