Trans-European Language Resources Infrastructure - II

The Uzbek Language Automaton

Hamdam Arzikylov, Andrey Isambaev
Language Engineering Department
Samarkand State Institute for Foreign Languages, Samarkand, Uzbekistan

The Samarkand Language Engineering team, supported by TELRI, continues to develop a multilingual and polyfunctional linguistic automaton (LINGTON). The main functions of LINGTON are:

  • text transliteration,
  • text spell-checking,
  • statistical text processing,
  • text indexing and abstracting,
  • machine translation.

Geopolitical changes in Central Asia have stimulated the cultural, economic and linguistic integration of the Uzbek peoples. Hence, developing text processing systems for the Uzbek language has become a high priority. With this in mind, the Samarkand partner of TELRI is developing an English-Uzbek machine translation system. An automatic dictionary (AD), the morphologic analysis/synthesis modules and, in part, the syntactic analysis/synthesis modules have already been developed.

The Uzbek input vocabulary is divided into two files. The first (about 12,000 entries) includes nominal and verbal stems as well as indeclinable words (e.g. adjectives, adverbs, and numerals). The second (591 entries) includes declensional and derivational affixes, arranged by rank into 4 pairs of graphs. Uzbek text processing begins by separating each word form into its lexical stem and its chain of affixes. This is done by cutting characters from the word sequentially, from left to right and from right to left, and comparing the remaining part of the word with the vocabulary entries of both files. The segmentation is accepted when the results of both passes match. In that case, all the morphologic and semantic attributes from the AD are assigned to the word form. Lexical analysis is then complete, and the syntactic analysis module can be activated.
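The stem and affix separation described above can be illustrated with a minimal Python sketch. It is not the LINGTON implementation: the two dictionaries are tiny hypothetical stand-ins for the stem file and the affix file, the function names are invented for the illustration, and the affix ranks and graphs are not modelled. One pass looks for known stems from the left, the other strips known affixes from the right, and a segmentation is accepted only where both passes agree on the cut position.

```python
# Hypothetical stand-ins for the two vocabulary files (toy data, not the real AD).
STEMS = {
    "kitob": {"pos": "noun", "gloss": "book"},
    "maktab": {"pos": "noun", "gloss": "school"},
}
AFFIXES = {
    "lar": "plural",
    "im": "1sg possessive",
    "ni": "accusative",
    "da": "locative",
    "ga": "dative",
}


def forward_pass(word):
    """Left to right: every cut position i where word[:i] is a known stem."""
    return {i for i in range(1, len(word) + 1) if word[:i] in STEMS}


def backward_pass(word):
    """Right to left: strip known affixes from the end of the word.

    Returns a dict mapping each cut position reached to the affix chain
    that follows it (the end of the word maps to an empty chain).
    """
    chains = {len(word): []}
    frontier = [len(word)]
    while frontier:
        end = frontier.pop()
        for start in range(end):
            affix = word[start:end]
            if affix in AFFIXES and start not in chains:
                chains[start] = [affix] + chains[end]
                frontier.append(start)
    return chains


def analyze(word):
    """Return (stem, affix chain, stem attributes), or None if the passes disagree."""
    stem_cuts = forward_pass(word)
    chains = backward_pass(word)
    matches = {cut for cut in stem_cuts if cut in chains}
    if not matches:
        return None
    cut = max(matches)  # assumption: prefer the longest stem when several cuts match
    stem = word[:cut]
    return stem, chains[cut], STEMS[stem]


print(analyze("kitoblarimni"))
print(analyze("maktabda"))
```

With this toy data, "kitoblarimni" is segmented as the stem "kitob" followed by the chain "lar", "im", "ni", and a word for which the two passes never meet is rejected with None.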

The syntactic analysis module builds the nominal and verbal groups as well as the sentence structures from the lexical units, according to their morphologic attributes and the grammar rules. Three levels of groups are available: nominal groups, verbal groups, and sentences (clauses). The groups formed at one level are treated as indivisible units at the next level of analysis. The syntactic analysis module is implemented with the Augmented Transition Network technique, which defines the hierarchy and the structure of the groups at every level. The result of the syntactic analysis module is an internal representation that can be transformed into the surface structures of the target language. The transformation also includes assigning the morphologic information of the target language to every lexical unit. These last two steps are carried out by the syntactic and morphologic synthesis modules.
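The level-by-level grouping can be sketched in Python as follows. This is a deliberately simplified transition network with registers, not a full Augmented Transition Network (there are no recursive push arcs and no arc tests), and the networks, category labels and function names are assumptions made for the illustration rather than the project's actual grammar. Each level rewrites the unit sequence, so a group built at one level is handled as a single unit at the next.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    label: str       # category such as ADJ, NOUN, ADV, VERB, NG, VG, S
    parts: tuple     # the word itself, or the units the group was built from
    registers: dict  # named slots filled while traversing a network


# Hypothetical networks: state -> list of (category, next state, register to fill).
NETWORKS = {
    "NG": {"q0": [("ADJ", "q0", "modifiers"), ("NOUN", "end", "head")]},
    "VG": {"q0": [("ADV", "q0", "modifiers"), ("VERB", "end", "head")]},
    "S": {"q0": [("NG", "q0", "arguments"), ("VG", "end", "predicate")]},
}


def traverse(name, units, pos):
    """Walk network `name` from units[pos]; return (group, next position) or None."""
    state, registers, start = "q0", {}, pos
    while state != "end":
        if pos >= len(units):
            return None
        for category, next_state, register in NETWORKS[name][state]:
            if units[pos].label == category:
                registers.setdefault(register, []).append(units[pos])
                state, pos = next_state, pos + 1
                break
        else:
            return None  # no arc accepts the current unit
    return Unit(name, tuple(units[start:pos]), registers), pos


def build_level(units, network_names):
    """Replace every span accepted by one of the networks with a single group unit."""
    out, pos = [], 0
    while pos < len(units):
        for name in network_names:
            found = traverse(name, units, pos)
            if found:
                group, pos = found
                out.append(group)
                break
        else:
            out.append(units[pos])  # leave the unit unchanged for a later level
            pos += 1
    return out


def parse(tagged_words):
    units = [Unit(tag, (word,), {}) for word, tag in tagged_words]
    for level in (["NG"], ["VG"], ["S"]):  # nominal groups, verbal groups, clauses
        units = build_level(units, level)
    return units


# Toy tagged input: "yaxshi talaba kitobni tez o'qidi" (the good student read the book quickly).
sentence = [("yaxshi", "ADJ"), ("talaba", "NOUN"), ("kitobni", "NOUN"),
            ("tez", "ADV"), ("o'qidi", "VERB")]
result = parse(sentence)
print([unit.label for unit in result[0].parts])
```

Keeping the registers on each group gives the later synthesis steps something to work from, for example the head of a nominal group, without re-reading the words inside it.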

