COMPASS

An Intelligent Dictionary System for Reading Text in a Foreign Language

Electronic dictionaries have substantially simplified the time-consuming task of looking up words. This is particularly true when the text to be read is in electronic form, a reading situation that is becoming more and more significant with the increasing spread of computer networks and electronic books and documents.

However, at present neither electronic dictionaries themselves, nor the look-up techniques, are well suited to what is possible within an electronic medium. Dictionaries offer an electronic image of reference works in a print medium designed for human manipulation. Look-up techniques are generally restricted to comparing strings of characters in the text with the strings that occur as dictionary headwords: when the strings match, the corresponding entry is displayed. These systems do not take account of the intellectual abilities of the human dictionary user. Furthermore, they leave it up to the user to relate inflected forms to their base, to identify the part of speech, and to pick out the appropriate sense somewhere in an extended dictionary entry.
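The limitation of pure string comparison can be illustrated with a minimal sketch. The headword table and the function name `naive_lookup` are invented for illustration and do not come from any existing dictionary product.

```python
# Toy headword table; the entries are invented for illustration only.
HEADWORDS = {
    "singen": "to sing",
    "Plan": "plan",
}

def naive_lookup(token):
    """Return the entry whose headword equals the token exactly, else None."""
    return HEADWORDS.get(token)

print(naive_lookup("singen"))    # exact match with a headword
print(naive_lookup("gesungen"))  # inflected form, so no headword matches
```

The second call finds nothing, so the reader is left to work out that gesungen belongs to singen.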

The COMPASS project seeks to demonstrate that these restrictions on conventional electronic dictionaries can be overcome by the application of existing techniques. To this end a prototype of a computer programme is being developed, which accesses enhanced and structurally elaborated dictionaries with an intelligent, context-sensitive look-up procedure, presenting the information to the user through an attractive graphical interface.

The prototype's performance is being evaluated through a series of user tests. These have given rise to some quite ringing endorsements of the system by the test users. For example, in response to a question on whether COMPASS is more efficient than a paper dictionary, users have commented:

The results show that reading foreign-language texts is substantially easier with a system such as COMPASS, and a better understanding of the text can be gained. In fact we believe that in many cases where the reader already has a basic knowledge of the foreign language, the use of such a system can obviate the need for translation.

The sections below offer more detail on the components of the prototype and organization of the COMPASS project.

The Dictionaries

The lexicographic basis for the project is supplied by the Collins German Dictionary (German-English) [published by HarperCollins and Klett, Verlag für Wissen und Bildung] and the Oxford-Hachette Dictionary (English-French). Machine-readable versions of these dictionaries were licensed to the partners in the project for research purposes. With these two dictionaries the prototype is able to cover the English-French and German-English language pairs. By the terms of the licence, and in order to make effective use of limited staff time, only excerpts from these dictionaries were used for the prototype.

Technical Adaptation of the Dictionaries

The machine-readable versions of the dictionaries provided by the publishers were SGML-marked type-setting tapes. In order to enable selective access to the information in the dictionary entries, the entries had to undergo a thorough structural analysis. For this the dictionary parser Lexparse was used, which can recognize, and explicitly represent, the hierarchical micro-structure of dictionary entries using a grammar defined by the user. The Lexparse grammars developed for the two dictionaries cover as comprehensively as possible all the structures of the dictionary entries, excluding inconsistent and faulty entries, which make up a considerable part of the dictionary. The faulty entries were corrected manually and parsed a second time. The resulting SGML-annotated dictionaries, together with the DTD (document type definition) generated by Lexparse, could then be lexicographically adapted in an SGML editor.

Partly during the parsing, partly during the subsequent processing, some unpacking of, and corrections to, the mark-up were introduced. To create the index it was necessary to spell out lemma-variants and expand sub-entries. For the most part these tasks were performed automatically. Finally the two resulting "lexical databases", one derived from each dictionary, were converted into a common data structure used by the LOCOLEX look-up system.
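Spelling out lemma-variants for the index might look roughly like the sketch below. It assumes a hypothetical notation in which optional letters of a headword are written in parentheses; the notation actually used in the source dictionaries is not described here, and the names `expand_variants` and `build_index` are invented.

```python
import re
from itertools import product

def expand_variants(headword):
    """Spell out every variant of a headword with optional parts in parentheses."""
    # Split into literal text and optional groups: "colo(u)r" -> ["colo", "(u)", "r"]
    parts = re.split(r"(\([^()]*\))", headword)
    choices = []
    for part in parts:
        if part.startswith("(") and part.endswith(")"):
            choices.append(["", part[1:-1]])  # optional part absent or present
        else:
            choices.append([part])
    return ["".join(combo) for combo in product(*choices)]

def build_index(entries):
    """Map every spelled-out variant to the ids of the entries it belongs to."""
    index = {}
    for entry_id, headword in entries.items():
        for variant in expand_variants(headword):
            index.setdefault(variant, []).append(entry_id)
    return index

# Hypothetical entry id and headword, for illustration only.
index = build_index({"e1": "colo(u)r"})
print(sorted(index))
```

With an index of this kind either spelling found in a text leads to the same entry.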

Extensions to the Dictionaries

To make true "comprehension dictionaries" from the parsed dictionaries, various lexicographical adjustments were necessary. All information in an entry that is unnecessary for the understanding of the word has to be marked explicitly for suppression in the COMPASS system. For example, within a group of synonyms the most general translation should be given first, so that COMPASS can select this as a reduced representation of the entry. Further unpacking was sometimes necessary, e.g. to supply explicit translations where for reasons of space only implicit example phrases are given. Of course we also needed to supply missing variant forms, missing senses, completely absent headwords and multi-word expressions (MWE), the latter discovered from corpus excerpts and the automatic extraction of possible MWE from textual corpora.

Formalization of Context Patterns

The COMPASS system should recognize whether a word queried occurs in a definite context where a special translation is appropriate, and in that case select it. To make this possible, corresponding context patterns must be supplied in the COMPASS dictionary. For this purpose Rank Xerox uses a finite state formalism in which such context patterns are coded as regular expressions. The context formalization is restricted initially to the recognition of multi-word expressions and grammatical collocations.

The formalization is achieved through a number of steps. First the decision is made which contexts overall should be formalized. MWE and grammatical collocations are then reduced to a so-called "canonical" form, which also includes lexical variants. Morphologically variable elements are marked as such. On the basis of these canonical forms a regular expression is generated, which encompasses, e.g., the variations in word-order that German allows. Special ways in which particular MWE may allow variation are added by hand to the regular expressions.
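The step from a canonical form to a regular expression can be sketched as follows. This is only an approximation: it uses Python's `re` module in place of the Rank Xerox finite state formalism, and the function `mwe_pattern`, its parameters, and the hand-listed verb forms are assumptions made for the example. The German expression in Kauf nehmen ("to accept, put up with") serves as the canonical form, with the verb as the morphologically variable element.

```python
import re

def mwe_pattern(fixed, verb_forms, max_gap=4):
    """Build a regex for a fixed phrase plus a variable verb, in either order."""
    fixed_re = r"\s+".join(re.escape(word) for word in fixed.split())
    verb_re = "(?:" + "|".join(re.escape(form) for form in verb_forms) + ")"
    # Up to max_gap intervening words may separate the two parts.
    gap = r"(?:\s+\w+){0,%d}?\s+" % max_gap
    # Order 1: phrase before verb, as in a subordinate clause ("in Kauf nimmt").
    # Order 2: verb before phrase, as in a main clause ("nimmt ... in Kauf").
    return re.compile(
        r"\b(?:%s%s%s|%s%s%s)\b"
        % (fixed_re, gap, verb_re, verb_re, gap, fixed_re)
    )

# Inflected forms are listed by hand here; a real system would take them
# from morphological analysis.
pattern = mwe_pattern("in Kauf", ["nehmen", "nimmt", "nahm", "genommen"])
print(bool(pattern.search("Er nimmt das Risiko in Kauf")))
print(bool(pattern.search("weil er es in Kauf nimmt")))
```

The `max_gap` limit is a design choice of this sketch: it keeps the pattern from matching a verb and a phrase that merely happen to occur far apart in the same sentence.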

The LOCOLEX Look-up System

The basis of the look-up system is the LOCOLEX system, developed and patented by Rank Xerox. The kernel of LOCOLEX performs the actual look-up and loads the relevant parts of a dictionary entry on the basis of a linguistic analysis of the word's environment. To speed access to individual dictionary entries it uses an index of headwords and their variants. The LOCOLEX software is largely system-independent. It can be developed on, and ported onto, a variety of computer architectures.

The components for linguistic analysis of the source language (the so-called "language model") are not a direct part of the LOCOLEX kernel. Language models are developed separately for languages as required and attached to the LOCOLEX kernel as finite automata. Among the most important components of a language model are algorithms for morphological analysis and identification of parts of speech. Over and above these, the language model includes definitions of the macros and variables for finite automata which are used to recognize multi-word expressions.

Morphological Analysis

The morphological analysis reduces inflected words to their base-form and thus allows inflected words to access their dictionary entries (e.g. of gesungen to the headword singen). It also provides morphosyntactic information (part of speech, case, number and gender) which is used in subsequent steps of the analysis to select the correct meaning.
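The interface of such an analysis can be pictured with a small table-driven sketch. The LOCOLEX language models are attached as finite automata; the dictionary `ANALYSES` and the function `analyse` below are invented stand-ins that only show the shape of the output.

```python
# Hand-written analyses for two German word forms, for illustration only.
ANALYSES = {
    "gesungen": [
        ("singen", {"pos": "verb", "form": "past participle"}),
    ],
    "einen": [
        ("ein", {"pos": "article", "case": "accusative",
                 "number": "singular", "gender": "masculine"}),
        ("einen", {"pos": "verb", "form": "infinitive"}),
    ],
}

def analyse(word):
    """Return all (base form, morphosyntactic features) readings of a word."""
    return ANALYSES.get(word, [(word, {"pos": "unknown"})])

for base, features in analyse("einen"):
    print(base, features["pos"])
```

The two readings returned for einen are exactly the kind of ambiguity that the next analysis step has to resolve.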

Part of Speech Disambiguation

If morphological analysis results in ambiguous syntactic information (e.g. article or verb for einen in German, noun or verb for plan in English) this ambiguity is resolved by a Part of Speech Disambiguation component. This uses a probabilistic procedure known as a Hidden Markov Model. This component is especially important for English and French, where many content words are ambiguous as to their part of speech.
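How a Hidden Markov Model resolves such an ambiguity can be shown with a small Viterbi decoder. The tag set and all probabilities below are invented toy values, not figures from the COMPASS language models, and the function name `viterbi` is simply the conventional one.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for the words under the HMM."""
    # best[i][t] = (probability of the best path ending in tag t, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Trace back from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

# Toy model with invented probabilities.
TAGS = ["DET", "PRON", "NOUN", "VERB"]
START = {"DET": 0.4, "PRON": 0.4, "NOUN": 0.1, "VERB": 0.1}
TRANS = {
    "DET":  {"DET": 0.0,  "PRON": 0.0,  "NOUN": 0.9, "VERB": 0.1},
    "PRON": {"DET": 0.05, "PRON": 0.05, "NOUN": 0.1, "VERB": 0.8},
    "NOUN": {"DET": 0.1,  "PRON": 0.1,  "NOUN": 0.2, "VERB": 0.6},
    "VERB": {"DET": 0.4,  "PRON": 0.2,  "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 1.0},
    "PRON": {"we": 1.0},
    "NOUN": {"plan": 0.5},
    "VERB": {"plan": 0.5},
}

print(viterbi(["the", "plan"], TAGS, START, TRANS, EMIT))
print(viterbi(["we", "plan"], TAGS, START, TRANS, EMIT))
```

In this toy model plan is equally likely as noun or verb in isolation, so the decision rests entirely on the preceding word.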

Loading the Relevant Parts of a Dictionary Entry

The output of morphological analysis and part of speech disambiguation is used to select the parts of a dictionary entry relevant to a given context. The complete dictionary entry is loaded into main memory via an index. This procedure converts the given SGML structure of the dictionary entry into a largely dictionary-independent system-internal data-structure, and the part selected by the disambiguation is specially marked.
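The marking of the relevant part can be sketched on a simplified entry. The nested dictionary below is an invented stand-in for the system-internal data structure, whose actual layout is not described here, and the French translations are chosen only for illustration.

```python
# Simplified English-French entry, invented for illustration only.
ENTRY = {
    "headword": "plan",
    "sections": [
        {"pos": "noun", "translations": ["plan", "projet"]},
        {"pos": "verb", "translations": ["planifier", "prévoir"]},
    ],
}

def mark_relevant(entry, pos):
    """Return a copy of the entry with a 'selected' flag on each section."""
    return {
        "headword": entry["headword"],
        "sections": [
            dict(section, selected=(section["pos"] == pos))
            for section in entry["sections"]
        ],
    }

marked = mark_relevant(ENTRY, "verb")
for section in marked["sections"]:
    print(section["pos"], section["selected"])
```

Marking rather than discarding the other sections keeps the complete entry available if the user asks for it.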

Recognition of Multi-Word Expressions

If the selected word is part of a multi-word expression and coded as such in the dictionary entry, the system returns the translation of the whole MWE. This is a further step towards selecting the information relevant to the context from the dictionary entry. For this the MWEs coded as regular expressions in the selected dictionary entry are compared with the input text. If a regular expression matches the sentence context, the translation of the corresponding MWE is marked specially and displayed first to the user as an answer to the query.
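The comparison of coded expressions with the sentence context might be sketched as below. Python's `re` module again stands in for the finite state machinery, and the entry layout, the pattern and the function `ranked_translations` are assumptions made for this example.

```python
import re

# Invented English-French entry with one multi-word expression.
ENTRY = {
    "headword": "bucket",
    "translations": ["seau"],
    "mwes": [
        {
            "pattern": r"\bkick(?:s|ed|ing)?\s+the\s+bucket\b",
            "translation": "casser sa pipe",
        },
    ],
}

def ranked_translations(entry, sentence):
    """Put translations of matching multi-word expressions before the general ones."""
    hits = [
        mwe["translation"]
        for mwe in entry["mwes"]
        if re.search(mwe["pattern"], sentence)
    ]
    return hits + entry["translations"]

print(ranked_translations(ENTRY, "He kicked the bucket last year."))
print(ranked_translations(ENTRY, "She filled the bucket."))
```

When no pattern matches, the list falls back to the general translations of the headword.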

Graphical User Interface

For the representation of texts and dictionary entries a special graphical user interface has been developed for Apple Macintosh computers. The kernel of this user interface is a so-called "reader", a simple editor program that permits the display of texts, and annotation of individual words with translations, but also changes to the text itself. Accordingly, this reader offers three modes: read, assist and edit.

For application as a reading aid the assist mode is of particular interest. In this mode a look-up and analysis process can be activated by simple selection of a word with the mouse. Reacting to a mouse-click, a small help window appears, placed close to the selected word so as to cover as little as possible of the context. The window displays a list of the translations that appear relevant in the light of the analysis of the context:

The user is offered various options in the help window:

Session Storage

In addition to the representation of the relevant lexical information on the screen some data is recorded in a storage file. The nature and scope of this data can be set by the user. This function, for example, makes it possible later to review the unknown vocabulary in a text.
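A minimal sketch of such a storage function follows. The one-record-per-line JSON format, the field names and the functions `record_lookup` and `review_vocabulary` are assumptions; the actual format of the COMPASS storage file is not specified here.

```python
import io
import json

def record_lookup(storage, lookup, fields=("word", "translation")):
    """Append one look-up to the storage file, keeping only the chosen fields."""
    row = {name: lookup[name] for name in fields if name in lookup}
    storage.write(json.dumps(row, ensure_ascii=False) + "\n")

def review_vocabulary(storage):
    """Return the distinct words recorded in the storage file, sorted."""
    storage.seek(0)
    rows = [json.loads(line) for line in storage if line.strip()]
    return sorted({row["word"] for row in rows if "word" in row})

# An in-memory file is used here; a real session would open a file on disk.
storage = io.StringIO()
record_lookup(storage, {"word": "gesungen", "translation": "sung", "pos": "verb"})
record_lookup(storage, {"word": "Kauf", "translation": "purchase", "pos": "noun"})
print(review_vocabulary(storage))
```

The `fields` parameter plays the role of the user setting: anything not listed, such as the part of speech in this usage, is left out of the file.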

User Tests

The first evaluation of the prototype was conducted in the summer of 1995 in user tests at the Universities of Bournemouth (for German-English) and Lyon 2 (for English-French). For each of the two source languages German and English there were two designated newspaper articles, read with the help of the COMPASS system by test users with a basic knowledge of the language. The test users' reading comprehension was examined at the end by comprehension questions. In addition the test users were asked to complete a questionnaire to assess the various COMPASS functions.

The results have been overwhelmingly positive, even at the first test phase. A second test phase will be conducted with an improved version of the prototype at the beginning of 1996.

Project Data

The official title of the project is: COMPASS: Adapting bilingual dictionaries for on-line COMPrehension ASSistance. The project is supported within the framework of the Linguistic Research and Engineering programme as project no. 62-080 by DG-XIII of the European Commission from April 1994 to March 1996. The project partners are:
Helmut Feldweg