Grammar Extraction and Refinement from an HPSG Corpus

Kiril Simov
BulTreeBank Project
(www.BulTreeBank.org)
Linguistic Modeling Laboratory, Bulgarian Academy of Sciences
kivs@bultreebank.org
ESSLLI'2002 Workshop on
Machine Learning Approaches in Computational Linguistics
August 5 - 9, 2002

Plan of the Talk
DOP model
An HPSG Corpus - definition
Formalism for HPSG
Extraction of HPSG Grammar from HPSG Corpus
Refinement of an HPSG grammar
Conclusions

DOP Model [Bod 1998]
A DOP model has four components:
  Grammar formalism for the target grammar
  Procedure for the construction of sentence analyses in the chosen grammar formalism
  Decomposition procedure, which extracts a grammar in the target grammar formalism from the structures in the corpus
  A performance model guiding the analysis of new sentences with respect to some desirable conditions

DOP Model (2)
Two additional unspoken assumptions are:
  The structures in the corpus are decomposable into the grammar formalism
  The extracted grammar should neither overgenerate nor undergenerate with respect to the training corpus
    This assumption refers to the quality of the corpus

Corpus in a Grammar Formalism
A corpus C in a given grammatical formalism is a sequence of analyzed sentences, where each analyzed sentence is a member of the set of structures defined as the strong generative capacity SGC(G) of a grammar G in this formalism:
$\forall S.\; S \in C \rightarrow S \in SGC(G)$
and
$\forall S.\; S \in C \rightarrow \forall S'.\,(S' \in G(s(S)) \rightarrow S' \in C)$

HPSG Corpus
Strong Generative Capacity in HPSG is defined by (King 1999) and (Pollard 1999) in technically the same way
In our work we consider the elements of Strong Generative Capacity in HPSG to be a special kind of feature graphs based on a logic of HPSG: King's logic, SRL

Feature Graphs (1)
$\Sigma = \langle S, F, A \rangle$ - SRL finite signature (species S, features F, appropriateness A)
$G = \langle N, V, r, T \rangle$ is a feature graph iff G is a directed, connected and rooted graph such that
  N is a set of nodes,
  $V : N \times F \rightarrow N$ is a partial arc function,
  r is the root node,
  $T : N \rightarrow S$ is a total species assignment function

Feature Graphs (2)
Some notions:
  Subsumption based on isomorphism
  Unification - there is no most general unifier
  Complete feature graphs - all information from the signature is present
  Paths
  Subgraphs

Feature Graphs (3)
Feature graphs can be interpreted via translation to SRL clauses
Exclusive matrices can be represented as a set of feature graphs (exclusive set of graphs)
An SRL finite theory with respect to an SRL finite signature can be represented as a set of feature graphs
A sentence analysis can be represented as a complete feature graph

Feature Graphs (4)
Complete feature graphs are a good representation for an HPSG corpus
Feature graphs are a good representation for an HPSG grammar (exclusive set of graphs)
Important property:
  For each node in a graph in the corpus there exists exactly one graph in the grammar which subsumes the subgraph starting at that node

Corpus Grammar
A grammar G such that the corpus C is a subset of its strong generative capacity is called a Corpus Grammar
$C \subseteq SGC(G)$
In feature graph terms:
  For each complete graph in the corpus, the grammar contains a graph which subsumes it
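
The feature-graph definition and the corpus-grammar condition can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the class and function names, the species and feature names, and the two example graphs are hypothetical, and "subsumption based on isomorphism" is read here as a root-preserving embedding that keeps species and arcs and never merges two nodes. The slides do not spell out that definition, so treat it as one plausible reading.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureGraph:
    """Feature graph <N, V, r, T>; N is implied by the keys of `species`."""
    root: str                                   # r, the root node
    species: dict                               # T: node -> species (total)
    arcs: dict = field(default_factory=dict)    # V: (node, feature) -> node (partial)


def subsumes(general, specific):
    """Assumed reading of subsumption: `general` embeds into `specific`,
    root onto root, preserving species and arcs, with no two nodes merged."""
    mapping = {}                                # node of general -> node of specific
    pending = [(general.root, specific.root)]
    while pending:
        g_node, s_node = pending.pop()
        if g_node in mapping:
            if mapping[g_node] != s_node:
                return False                    # one node would need two images
            continue
        if general.species[g_node] != specific.species[s_node]:
            return False
        mapping[g_node] = s_node
        for (source, feature), target in general.arcs.items():
            if source != g_node:
                continue
            image = specific.arcs.get((s_node, feature))
            if image is None:
                return False                    # arc missing in the specific graph
            pending.append((target, image))
    return len(set(mapping.values())) == len(mapping)


def is_corpus_grammar(grammar, corpus):
    """Every complete graph in the corpus is subsumed by some grammar graph."""
    return all(any(subsumes(g, c) for g in grammar) for c in corpus)


# Hypothetical sentence analysis and a more general grammar graph.
analysis = FeatureGraph(
    root="n0",
    species={"n0": "phrase", "n1": "word", "n2": "noun"},
    arcs={("n0", "HEAD-DTR"): "n1", ("n1", "HEAD"): "n2"},
)
rule = FeatureGraph(
    root="m0",
    species={"m0": "phrase", "m1": "word"},
    arcs={("m0", "HEAD-DTR"): "m1"},
)

print(subsumes(rule, analysis))               # the rule is the more general graph
print(subsumes(analysis, rule))               # the analysis has an arc the rule lacks
print(is_corpus_grammar([rule], [analysis]))
```

Because V is a function, following arcs from the two roots in parallel fixes the only possible embedding, so no search over candidate mappings is needed in this sketch.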

Grammar Extraction (1)
Grammar extraction from an HPSG corpus C is a graph fragmentation operation which produces a set of graphs from which a grammar can be constructed. The result is a set of graphs, GF
Each extracted fragment has to
  contain all features for the root node, and
  subsume at least one complete graph in the corpus

Grammar Extraction (2)
The set GF is ordered by the subsumption relation
The complete graphs from the corpus are at the bottom
Each set of graphs G such that G contains only graphs from GF and, for each complete graph M in GF, at least one feature graph in G subsumes M is a corpus grammar of C

Grammar Extraction (3)
All corpus grammars extracted in this way can be ordered by set inclusion of their strong generative capacity
A grammar from this hierarchy can be chosen by specifying additional constraints over it, such as:
  it is the most general one that does not overgenerate or undergenerate over the corpus, or
  it satisfies some external condition, such as the shortest inference over the corpus
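
The extraction steps above might be sketched in Python as follows. Everything here is an assumption made for illustration: graphs are encoded as `(root, species, arcs)` tuples, the function names and the example sentence are hypothetical, fragments are produced only by cutting a complete graph at a fixed depth below its root (one simple family of fragments, not the full fragmentation operation), and subsumption is read as a root-preserving embedding that keeps species and arcs.

```python
from collections import deque


def subsumes(general, specific):
    """Assumed reading of subsumption: a root-preserving embedding that keeps
    species and arcs and maps distinct nodes to distinct nodes."""
    (g_root, g_species, g_arcs), (s_root, s_species, s_arcs) = general, specific
    mapping, pending = {}, [(g_root, s_root)]
    while pending:
        g, s = pending.pop()
        if g in mapping:
            if mapping[g] != s:
                return False
            continue
        if g_species[g] != s_species[s]:
            return False
        mapping[g] = s
        for (source, feature), target in g_arcs.items():
            if source == g:
                if (s, feature) not in s_arcs:
                    return False
                pending.append((target, s_arcs[(s, feature)]))
    return len(set(mapping.values())) == len(mapping)


def fragment(graph, depth):
    """Cut `graph` below `depth` arcs from the root (depth >= 1), so the
    fragment keeps every feature of the root node."""
    root, species, arcs = graph
    dist, queue = {root: 0}, deque([root])
    while queue:
        node = queue.popleft()
        if dist[node] == depth:
            continue                      # do not expand nodes on the cut
        for (source, feature), target in arcs.items():
            if source == node and target not in dist:
                dist[target] = dist[node] + 1
                queue.append(target)
    kept = {(s, f): t for (s, f), t in arcs.items() if s in dist and dist[s] < depth}
    return (root, {n: species[n] for n in dist}, kept)


def extract_fragments(corpus, max_depth):
    """GF (simplified): depth-limited fragments plus the complete graphs."""
    gf = []
    for graph in corpus:
        gf.extend(fragment(graph, d) for d in range(1, max_depth + 1))
        gf.append(graph)
    return gf


def is_corpus_grammar(grammar, corpus):
    """Every complete graph in the corpus is subsumed by some grammar graph."""
    return all(any(subsumes(g, c) for g in grammar) for c in corpus)


# Hypothetical one-sentence corpus.
sentence = (
    "a0",
    {"a0": "phrase", "a1": "word", "a2": "phrase", "a3": "word"},
    {("a0", "HEAD-DTR"): "a1", ("a0", "COMP-DTR"): "a2", ("a2", "HEAD-DTR"): "a3"},
)
corpus = [sentence]
gf = extract_fragments(corpus, max_depth=2)

print(all(subsumes(g, sentence) for g in gf))   # each fragment subsumes its source
print(is_corpus_grammar([gf[0]], corpus))       # the shallowest fragment alone
print(is_corpus_grammar([sentence], corpus))    # the bottom of the ordering alone
```

Both single-graph subsets pass the corpus-grammar check, which is the situation the hierarchy of corpus grammars describes: the choice between them has to come from an additional constraint, which this sketch does not model.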

The set GF as a Grammar
This is the original idea behind the DOP Model
GF contains all generalizations over the corpus
GF will overgenerate over the corpus
GF will accept ungrammatical sentences
Thus a special inference mechanism is necessary in order to use GF as a grammar

Grammar Refinement
In the process of creating an HPSG corpus, the annotators use an HPSG grammar
This grammar could be used as a starting point for extraction of a better grammar. This process is called Grammar Refinement
We can choose as the new grammar the most general grammars that refine the original grammar

Conclusions
We define an HPSG corpus as a set of complete graphs
We define an HPSG grammar as a set of graphs
We define a procedure for extraction of corpus grammars from the corpus
We define a refinement of a grammar on the basis of a corpus