On regular dependency trees
Leonid Mitjushin
This paper considers dependency trees that can be constructed by
the bottom-up process. The results presented were partly published in
(Mitjushin 1985). Theorems 2 and 4 are new; for Theorem 3 a simpler
proof has been found. This text does not contain proofs, except for a
sketch of the proof of Theorem 4.
1 Regular dependency trees
By a 'sentence' we understand a finite sequence of nodes which
represent words. A 'segment' is a subsequence of a sentence containing
one or more consecutive nodes.
A 'dependency tree' is a directed and rooted tree whose set of
nodes is a sentence. A 'fragment' (of a dependency tree) is a directed
and rooted tree on a segment. A dependency tree and a one-node segment
are "extreme" examples of fragments.
'Neighbouring segments' are segments going after one another.
Precisely speaking, two segments are 'neighbouring' if they do not
intersect, and their union is a segment. 'Neighbouring fragments' are
fragments whose sets of nodes are neighbouring segments.
If we have two neighbouring fragments, then drawing a directed arc
from any node of one of the fragments to the root of the other creates
a new fragment on the union of segments occupied by the original
fragments.
This operation is called 'linking' and forms the basis for bottom-
up construction of dependency trees (Leykina and Tseytin 1975). At the
initial moment we have only one-node fragments. Then we construct
larger fragments by repeated application of the linking operation. At
last we arrive at complete dependency trees, i.e. fragments on the
whole sentence.
Not every dependency tree can be constructed by the bottom-up
process. The simplest example is as follows:
.-------------------------.
| |
| .-----------. |
| | | |
| | .---------------. |
| | | | | |
| | | | | |
o o o o
(with any direction of arcs allowed in a rooted tree). In this
situation we cannot make any linking, as no two neighbouring nodes are
connected with an arc.
Trees that can be constructed from one-node fragments by (repeated)
linking will be called 'regular'. As direction of arcs is not
important for regularity (all directed trees with the same underlying
non-directed tree are either regular or non-regular simultaneously),
non-directed trees will be considered till the end of this section.
DEFINITION 1. Consider a tree with ordered nodes. A 'tangle' is a
quadruple of nodes (a, b, c, d), where a < b < c < d, and there is
a path in the tree of the form
c , a , . . . , d , b .
Note that nodes a, b, c, d are not necessarily neighbours with
respect to the order considered: there may be other nodes between
them.
THEOREM 1 (criterion for regularity) (Mitjushin 1985). A tree with
ordered nodes is regular if and only if it does not contain tangles.
DEFINITION 2. Consider a tree with ordered nodes. A 'generalized
tangle' is a quadruple of nodes (a, b, c, d), where a < b < c < d,
and there is a path in the tree of the form
c , . . . , a , . . . , d , . . . , b .
The simple tangle is a special case of the generalized tangle.
THEOREM 2 (another criterion for regularity). A tree with ordered
nodes is regular if and only if it does not contain generalized
tangles.
Cutting an edge (a,b) of a tree produces two trees not connected with
each other. Assuming a < b, denote their sets of nodes as V(a) and V(b),
and denote { x | a <= x <= b } as S(a,b) (the segment covered by the edge
(a,b)).
COROLLARY to Theorem 2. If the tree is regular, then, for any edge
(a,b), V(a) and V(b) intersected with S(a,b) are segments.
2 Length of arcs
For a tree with ordered nodes, the length of an arc (or edge)
between nodes a and b is the number of nodes lying between them,
plus 1. (Hence, the length of an arc between neighbouring nodes is
equal to 1.)
There exists a tendency to make arcs of dependency trees shorter.
Some cases of so-called 'postponement' are probably due to this
tendency, for example:
The time had come to decorate the house for Christmas.
The problem arose of what to do with money.
Other examples can be found in (Mitjushin 1985) and (Lin 1996).
In real English and Russian sentences the average length of arcs in
dependency trees usually does not exceed 3. Sentences with greater
values of this parameter often seem "awkward". (Perhaps, for long
sentences the limit should be slightly increased.) To be more precise,
this is true for the specific method of assigning dependency trees to
sentences used by Mel'cuk (1974) and Apresjan et al. (1989, 1992).
However, the statement seems to remain true also for other "natural"
methods.
The tendency to make arcs shorter turns out to be connected with
regularity. In a sense, it may be regarded as "the reason" for
regularity.
Imagine that we have a dependency tree which is not yet linearized.
That is, its nodes are not arranged in any particular linear order,
and our task is to find a "good" arrangenent.
Usually there exist certain linguistically motivated constraints on
admissible order of nodes. These constraints are often imposed on
pairs of nodes connected with arcs, but they may as well be of more
general form.
Consider a directed tree with an unordered set of nodes. Let each
arc z of the tree be assigned a positive 'weight' p[z], and let D be
the 'weighted' total length of arcs: D is the sum of p[z]*d[z], where
d[z] is the length of arc z, and the sum includes all arcs of the
tree.
THEOREM 3 (Mitjushin 1985). Let all constraints on admissible order
of nodes be of the form "a < b", where a and b are connected with an
arc (in any direction). Then any admissible order of nodes at which D
reaches its minimum over such orders produces a regular tree.
3 Parsing with an 'oracle'
In this section we consider dependency trees in the form proposed
by Mel'cuk (1974) and used in the ETAP system developed by Apresjan
et al. (1989, 1992).
For each word of the sentence, all its lexico-morphological
interpretations (called 'homonyms') are considered. A homonym consists
of a lexeme name, a part-of-speech marker, and a list of values of
morphological features.
In this new framework, a fragment is a set of homonyms occupying
one or more successive positions in the sentence (one homonym in each
position), plus a set of directed arcs connecting those homonyms. Each
arc is labelled with a name of syntactic relation. The homonyms and
arcs of a fragment must make up a rooted tree.
As in Section 1, 'linking' will mean drawing an arc between two
neighbouring fragments, but now the arc is labelled with a name of
syntactic relation.
Let the words of the sentence be numbered 1, 2, . . . , n from
left to right. A dependency tree of the sentence is a fragment on
segment [1, n] (i.e. the homonyms which constitute the set of nodes of
the tree occupy positions 1, 2, . . . , n). Regular trees are defined
in the same way as in Section 1.
A sentence can have many different grammatically correct dependency
trees. By an 'intended' dependency tree we understand a dependency
tree which is grammatically correct and agrees with the semantic and
pragmatic content of the sentence. As a rule, such a tree is unique.
A fragment is called 'true' if it is included in the intended
dependency tree of the sentence. An 'oracle' is a device which can say
about any fragment containing more than one homonym whether that
fragment is or is not true. It seems that, in a sense, a human has
such an 'oracle' (working at a subliminal level), though its degree of
certainty, is, say, 99% instead of 100%.
A dependency tree is called 'weakly projective' if its arcs do not
intersect, that is, no quadruples of nodes (a, b, c, d) exist for
which a < b < c < d, a and c are connected with an arc (in any
direction), and b and d are connected with an arc (in any direction).
Intended dependency trees are weakly projective for the vast majority
of real English and Russian sentences.
By Theorem 1, every weakly projective tree is regular. A weakly
projective tree can be constructed by the bottom-up process in order
of increasing length of arcs drawn between fragments.
THEOREM 4. If the intended dependency tree of a sentence is unique
and weakly projective, it can be found in no more than C * H * D
steps, where each step includes one call of the 'oracle'. Here C is a
constant proportional to the number of names of syntactic relations,
H is the squared maximum number of homonyms for one word in the
sentence, and D is the total length of arcs in the intended tree.
A SKETCH OF THE PROOF. The intended tree will be assembled in order
of increasing length of arcs. Let A be the growing set of fragments
constructed from homonyms by the linking operation. At the beginning
of the process, A contains all homonyms of the sentence (and no other
fragments).
Define parameter k which successively takes values 1, 2, . . . ,
n - 1 (k is the current length of arcs to be established). For each k
consider words of the sentence from left to right. If for the current
word i no arcs have been drawn into its homonyms at previous steps,
try to draw arcs of the form (i-k) --> i and i <-- (i+k), provided
these arcs connect neighbouring fragments from A. After drawing each
arc, present the new fragment to the 'oracle', and if it is true,
add it to A.
A more "left-to-right" version of this procedure is also possible.
References
Jurij Apresjan, Igor Boguslavsky, Leonid Iomdin, Aleksandr Lazurskij,
Nikolaj Pertsov, Vladimir Sannikov, and Leonid Tsinman. 1989.
Lingvisticheskoe Obespechenie Sistemy ETAP-2. Nauka, Moscow. ('The
linguistics of the ETAP-2 system', in Russian)
Jurij Apresjan, Igor Boguslavsky, Leonid Iomdin, Aleksandr Lazurskij,
Leonid Mitjushin, Vladimir Sannikov, and Leonid Tsinman. 1992.
Lingvisticheskij Protsessor dlja Slozhnykh Informatsionnykh Sistem.
Nauka, Moscow. ('A linguistic processor for complex information
systems', in Russian)
Berta Leykina and Gregory Tseytin. 1975. Sintaksicheskaja model' s
dopushcheniem ogranichennoj neproektivnosti. In "Mezhdunarodnyj
Seminar po Mashinnomu Perevodu", Moscow, pp. 72 - 74. (`A syntactic
model allowing limited non-projectivity', in Russian)
Dekang Lin. 1996. On the structural complexity of natural language
sentences. In "Proceedings of COLING-96", Vol. 2, Copenhagen,
pp. 729 - 733.
Igor Mel'cuk. 1974. Opyt Teorii Lingvisticheskikh Modelej "Smysl -
Tekst". Nauka, Moscow. (`Toward a theory of Meaning - Text
linguistic models', in Russian)
Leonid Mitjushin. 1985. Dlina sintaksicheskikh svjazej i induktivnye
struktury. "Semiotika i informatika", No. 26, pp. 34 - 51. ('Length
of syntactic links and inductive structures', in Russian)