Trans-European Language Resources Infrastructure - II
Information Structure and Coreference
in the Prague Dependency Treebank
Eva Hajičová
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic
e-mail: hajicova@ufal.mff.cuni.cz
The Prague Dependency Treebank (PDT) project is conceived of as
a collection of syntactic tree structures representing sentences of (a part of) the Czech
National Corpus (CNC), tagged on both the analytical and the tectogrammatical
levels in addition to the morphological tags. The tagging on the tectogrammatical
level is based on the theoretical framework of Functional Generative Description (FGD).
The units processed by the tagging procedures, both automatic and manual, are sentences
(as they occur in the texts of the corpus), but the human annotators are instructed
to assign (disambiguated) structures according to the meaning of the
sentence in its environment, taking contextual (and factual) information into account.
In the full text of the paper, we will concentrate on two issues
connected with intersentential links as captured in the tectogrammatical tree
structures (TGTSs):
- A special attribute TFA is established for representing the
topic-focus articulation (information structure) of the sentence. This attribute
receives one of the following three values:
  T: a non-contrastive contextually bound node, with a lower degree of
     communicative dynamism (CD) than its governor;
  F: a contextually non-bound node (a "new" piece of information);
  C: a contrastive (part of) topic; in the present stage, this value is
     assigned only in cases in which the node concerned is in a non-projective
     position.
The order of the nodes in the TGTSs reflects the underlying
word order (communicative dynamism), which is also semantically relevant.
- Three attributes are introduced to tentatively capture the
intersentential links: COREF (with the lexical value of the
antecedent of the given expression), CORNUM (the value of which is the
serial number of the antecedent), and CORSTC (with the value Prev if the
antecedent is in the previous sentence, and 0 if it is in the same sentence;
if neither case applies, this attribute receives the value
NA, for not applicable).
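
The attributes described above can be pictured as fields of a node record. The following is only a minimal sketch in Python, not the PDT annotation format: the class Node, the fields serial and lemma, the helper functions, and the example words are all assumptions made for illustration. Only the attribute names TFA, COREF, CORNUM, and CORSTC and their values are taken from the text. The sketch also assumes that COREF and CORNUM are still filled in when CORSTC is NA, which the text does not specify.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TFA(Enum):
    """The three TFA values named in the text."""
    T = "T"  # non-contrastive contextually bound node
    F = "F"  # contextually non-bound node ("new" information)
    C = "C"  # contrastive (part of) topic


@dataclass
class Node:
    # Hypothetical record: 'serial' and 'lemma' are illustrative field names;
    # only TFA, COREF, CORNUM and CORSTC come from the text.
    serial: int
    lemma: str
    tfa: TFA
    coref: Optional[str] = None   # COREF: lexical value of the antecedent
    cornum: Optional[int] = None  # CORNUM: serial number of the antecedent
    corstc: str = "NA"            # CORSTC: "Prev", "0" or "NA"


def corstc_value(sentence_distance: Optional[int]) -> str:
    """Map the distance (in sentences) back to the antecedent onto CORSTC."""
    if sentence_distance == 0:
        return "0"     # antecedent in the same sentence
    if sentence_distance == 1:
        return "Prev"  # antecedent in the previous sentence
    return "NA"        # neither case applies


def link(node: Node, antecedent: Node, sentence_distance: int) -> None:
    """Fill the three coreference attributes of 'node' (assumed procedure)."""
    node.coref = antecedent.lemma
    node.cornum = antecedent.serial
    node.corstc = corstc_value(sentence_distance)


# Invented example: "Petr" in one sentence, a pronoun in the next one.
petr = Node(serial=1, lemma="Petr", tfa=TFA.F)
he = Node(serial=1, lemma="on", tfa=TFA.T)
link(he, petr, sentence_distance=1)
print(he.coref, he.cornum, he.corstc)
```

Keeping CORSTC as a plain string mirrors the three literal values given in the text; an enumeration would work equally well if stricter checking were wanted.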