Information Structure and Coreference in the Prague Dependency Treebank

Eva Hajičová
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic

The Prague Dependency Treebank (PDT) project is conceived as a collection of syntactic tree structures representing sentences from (a part of) the Czech National Corpus (CNC). The trees are tagged on both the analytical and the tectogrammatical levels, in addition to carrying morphological tags. The tagging on the tectogrammatical layer is based on the theoretical framework of Functional Generative Description (FGD). The units processed by the tagging procedures, both automatic and manual, are sentences as they occur in the texts of the corpus; the human annotators, however, are instructed to assign disambiguated structures according to the meaning of the sentence in its environment, taking contextual (and factual) information into account.

In the full text of the paper, we will concentrate on two issues connected with intersentential links as captured in the tectogrammatical tree structures (TGTSs):

  1. A special attribute TFA is established for representing the topic-focus articulation (information structure) of the sentence. This attribute receives one of the following three values:
    T (a non-contrastive contextually bound node, with a lower degree of communicative dynamism, CD, than its governor),
    F (a contextually non-bound node, a "new" piece of information),
    C (a contrastive (part of) topic; at the present stage, this value is assigned only in cases in which the node concerned is in a non-projective position).
    The order of the nodes in the TGTSs reflects the underlying word order (communicative dynamism), which is also semantically relevant.
  2. Three attributes are introduced to tentatively capture the intersentential links: COREF (with the lexical value of the antecedent of the given expression), CORNUM (whose value is the serial number of the antecedent), and CORSTC (with the value Prev if the antecedent is in the previous sentence and 0 if it is in the same sentence; if neither of these cases applies, this attribute receives the value NA, for non-applicable).
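
The attributes described in the two points above can be pictured as fields on a tree node. The following Python sketch is only an illustration under stated assumptions and is not the PDT data format: the class names, the field names `serial` and `lemma`, the helper `link_antecedent`, and its `sentence_distance` parameter are all hypothetical. It also reads the NA value literally, assigning it whenever the antecedent is in neither the same nor the previous sentence.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TFA(Enum):
    """Values of the TFA attribute as listed in the text."""
    T = "T"  # non-contrastive contextually bound node
    F = "F"  # contextually non-bound node ("new" information)
    C = "C"  # contrastive (part of) topic


@dataclass
class Node:
    """Hypothetical, simplified node of a tectogrammatical tree."""
    serial: int                   # assumed: serial number of the node
    lemma: str                    # assumed: lexical value of the node
    tfa: TFA
    coref: Optional[str] = None   # COREF: lexical value of the antecedent
    cornum: Optional[int] = None  # CORNUM: serial number of the antecedent
    corstc: str = "NA"            # CORSTC: "Prev", "0", or "NA"


def link_antecedent(node: Node, antecedent: Node, sentence_distance: int) -> None:
    """Fill the three coreference attributes of `node`.

    `sentence_distance` is an assumed convention: 0 means the antecedent
    is in the same sentence, 1 means it is in the previous sentence.
    """
    node.coref = antecedent.lemma
    node.cornum = antecedent.serial
    if sentence_distance == 0:
        node.corstc = "0"
    elif sentence_distance == 1:
        node.corstc = "Prev"
    else:
        # Neither the same nor the previous sentence: non-applicable.
        node.corstc = "NA"


# Usage: a pronoun whose antecedent stands in the previous sentence.
antecedent = Node(serial=1, lemma="Peter", tfa=TFA.F)
pronoun = Node(serial=1, lemma="he", tfa=TFA.T)
link_antecedent(pronoun, antecedent, sentence_distance=1)
```

Storing both the lexical value and the serial number is redundant in this sketch, but it mirrors the three separate attributes named in the text.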

