Trans-European Language Resources Infrastructure - II

First Steps in Development of Morphological Classification for Computer-Aided Lexicon of Latvian

Everita Milconoka
Artificial Intelligence Laboratory
Institute of Mathematics and Computer Science
University of Latvia

Lexicography is usually understood as the comprehensive collection of lexical items together with descriptions of the way they are used. B. Quemada (1994) proposes a distinction between two notions: "lexicography" and "dictionarics". The former covers the collection and analysis of the forms and meanings of lexical units; its main objective is to construct lexicographical databases, without necessarily being directly linked to the making of a specific dictionary. The role of "dictionarics" is to address the development and distribution of language dictionaries of various types.

The computer-based lexicon plays a significant role in computational linguistics. The lexicon has typically been viewed as a mere list of entries containing idiosyncratic information associated with individual words.

A computer-based lexicon covering about 4000 words is being created at the Artificial Intelligence Laboratory, Institute of Mathematics and Computer Science, University of Latvia. The present paper deals with the morphological features introduced for this computer-aided lexicon of Latvian.

Since Latvian is an inflected language, only the stem is stored in the dictionary for regular words. For irregular words, as well as for words with consonant alternations, the full paradigm of word forms is stored. The lexicon is linked with morphological rules describing the case and form generation system of Latvian. These cover rules for the declension of nouns, adjectives and pronouns, as well as rules for the conjugation of verbs.
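The split between stored stems and stored paradigms can be sketched as follows. This is only an illustration, not the laboratory's actual data format: the names STEM_ENTRIES, FULL_PARADIGMS and word_form are hypothetical, only singular forms are shown, and the examples (galds "table", brālis "brother") are written in standard Latvian orthography with diacritics.

```python
# Hypothetical ending set for one declension class (1st declension, singular only).
FIRST_DECLENSION_SG = {
    "nominative": "s",
    "genitive": "a",
    "dative": "am",
    "accusative": "u",
    "locative": "ā",
}

# Regular words: only the stem and its ending set are stored.
STEM_ENTRIES = {
    "galds": ("gald", FIRST_DECLENSION_SG),
}

# Words with consonant alternation (here l -> ļ in the genitive):
# the full paradigm is stored instead of a stem.
FULL_PARADIGMS = {
    "brālis": {
        "nominative": "brālis",
        "genitive": "brāļa",
        "dative": "brālim",
        "accusative": "brāli",
        "locative": "brālī",
    },
}


def word_form(lemma: str, case: str) -> str:
    """Return the singular form of `lemma` in the given case."""
    # A stored paradigm takes priority over rule-based generation.
    if lemma in FULL_PARADIGMS:
        return FULL_PARADIGMS[lemma][case]
    stem, endings = STEM_ENTRIES[lemma]
    return stem + endings[case]


print(word_form("galds", "dative"))     # generated from stem + ending
print(word_form("brālis", "genitive"))  # looked up in the stored paradigm
```

Checking the stored paradigm first keeps the generation rules free of exceptions, at the cost of a larger lexicon for irregular words.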

Although the aim is to develop a universal description of the Latvian lexicon, at the moment some morphological features depend on the application in which the lexicon is being tested, i.e., UNL (Universal Networking Language).

The features are of two kinds: those described and named as in traditional Latvian grammar, and those introduced specifically for this computer-aided lexicon.

For instance, besides case and declension (1-6), we introduced two additional features for our lexicon. DECL marks a word stem with the corresponding set of endings, from which the computer can produce the right word form. UNDECL marks a ready word form and is used for three groups: 1) indeclinable word forms (e.g., divstavu "two-storey"; these are nouns used only in the Genitive case); 2) ready word forms stored with their endings, which concerns the suppletive forms of pronouns (e.g., kas, ko, kam etc.); and, as a peculiarity of this lexicon, 3) nouns in the Genitive case which correspond to adjectives in other languages; for instance, not angliska valoda, but angļu valoda "the English language".
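A minimal sketch of how the DECL and UNDECL features could steer form generation is given below. The NounEntry class, the ENDINGS table and the realize function are hypothetical names chosen for illustration; only the first declension (singular) is filled in, and the Latvian examples use standard orthography with diacritics.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ending table, keyed by declension number (only class 1, singular).
ENDINGS = {
    1: {"nominative": "s", "genitive": "a", "dative": "am",
        "accusative": "u", "locative": "ā"},
}


@dataclass(frozen=True)
class NounEntry:
    form: str                         # a stem for DECL, a ready word form for UNDECL
    feature: str                      # "DECL" or "UNDECL"
    declension: Optional[int] = None  # 1-6, meaningful only for DECL


def realize(entry: NounEntry, case: str) -> str:
    """Produce the word form for an entry in the requested case."""
    if entry.feature == "UNDECL":
        # Ready word form: returned unchanged, whatever case is requested.
        return entry.form
    if entry.feature == "DECL":
        return entry.form + ENDINGS[entry.declension][case]
    raise ValueError(f"unknown feature: {entry.feature}")


galds = NounEntry("gald", "DECL", declension=1)  # regular noun, "table"
divstavu = NounEntry("divstāvu", "UNDECL")       # group 1: genitive-only noun
anglu = NounEntry("angļu", "UNDECL")             # group 3: genitive used attributively

print(realize(galds, "genitive"))
print(realize(divstavu, "dative"))
print(realize(anglu, "nominative") + " valoda")
```

In this sketch an UNDECL entry ignores the requested case entirely, which mirrors the idea of a ready word form; a fuller model of the pronoun group would need one such entry per stored form.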

Verbs: all basic stems of a verb (the infinitive stem, the present stem and the past (preterite) stem) are stored in the lexicon. In addition to the grammatical information about the verb, we record some syntactic information, i.e., its valence and modality.
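A verb entry of this kind might be represented as in the sketch below. The VerbEntry class and its field names are hypothetical, and so is the encoding of valence (a number of arguments) and modality (a plain label), since the actual encoding is not specified here. The example verb is dziedāt "to sing", written in standard orthography with diacritics, and only three endings are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerbEntry:
    infinitive_stem: str
    present_stem: str
    past_stem: str
    valence: int    # assumed encoding: number of arguments the verb takes
    modality: str   # assumed encoding: a free-text label


# Illustrative entry; the valence and modality values are placeholders.
DZIEDAT = VerbEntry(
    infinitive_stem="dziedā",
    present_stem="dzied",
    past_stem="dziedāj",
    valence=2,
    modality="non-modal",
)


def verb_form(entry: VerbEntry, category: str) -> str:
    """Build a form by picking the stem that the category requires."""
    if category == "infinitive":
        return entry.infinitive_stem + "t"
    if category == "present_1sg":
        return entry.present_stem + "u"
    if category == "past_1sg":
        return entry.past_stem + "u"
    raise ValueError(f"unsupported category: {category}")


for category in ("infinitive", "present_1sg", "past_1sg"):
    print(category, verb_form(DZIEDAT, category))
```

Storing all three stems means that the rules only need to attach endings and never have to derive one stem from another.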

There are, however, some issues we have not yet managed to solve. One of them is how to describe past active participles such as dziedajis, where one participle has three stems: dziedaj-, dziedajus- and dziedaju-. The issue is how the computer can choose the right stem to which the ending is added. For other participles the computer does this automatically.

It should be emphasised that so far we have operated with grammatical features and rules that are basic and stable. To obtain more information, we need to study a corpus to find out exact usage and patterns. To build an adequate lexicon, we must start with usage. The first attempts at corpus building are under way in Latvia. With a good corpus, we can create a good corpus-based dictionary.

© TELRI, 19.11.1999