INTEX for Windows Description of Bulgarian Lexical and Grammatical Knowledge

Svetla Koeva
Computer modeling Department
Institute for Bulgarian language
Bulgarian Academy of Sciences
Sofia, Bulgaria

"INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words. INTEX can build lemmatized concordances and indices of large texts with respect to all types of Finite State patterns. INTEX is used in over 30 research centers as an information retrieval system, to analyze literary texts, to quantify language variations, to teach second languages, as a terminological extractor, as well as to teach computational linguistics to graduate students." (Max Silberztein, [])

A large-coverage representation of Bulgarian for INTEX for Windows 95-NT (INTEX 4.0) is being developed by a small group of researchers at the Bulgarian Academy of Sciences (BAS) within a joint research project with the Laboratory for Information Retrieval Systems (SNRS). Here we present the results achieved so far in describing Bulgarian lexical and grammatical knowledge within the framework of the INTEX for Windows system.

INTEX includes tools for preprocessing texts, that is, for high-precision identification of:

  • sentence borders,
  • unambiguous compounds,
  • some special cases such as contractions and elisions.

We should point out that an important feature of INTEX is that texts, dictionaries and grammars are all represented by Finite State Transducers (FSTs). We provide an FST for sentence delimitation in Bulgarian, a dictionary of Bulgarian unambiguous compound words and several FSTs for identifying some specific Bulgarian tokens.
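The sentence-delimitation step can be illustrated with a minimal Python sketch. It is a regular-expression approximation, not the Bulgarian FST described here: the abbreviation list is hypothetical and deliberately tiny, and the `{S}` marker is assumed to be the sentence-boundary tag written into the text.

```python
import re

# Hypothetical, deliberately tiny abbreviation list; a real resource
# would contain many more entries.
ABBREVIATIONS = {"г.", "ул.", "проф.", "др."}

# Candidate boundary: '.', '!' or '?' followed by whitespace and an
# uppercase Latin or Cyrillic letter.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-ZА-Я])")

def mark_sentences(text: str) -> str:
    """Insert the marker {S} after each detected sentence end."""
    pieces = []
    last = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)
        # Last whitespace-delimited token up to and including the punctuation.
        token = text[:end].split()[-1]
        if token in ABBREVIATIONS:
            continue  # e.g. "ул." does not end a sentence
        pieces.append(text[last:end] + "{S}")
        last = end
    pieces.append(text[last:])
    return "".join(pieces)

print(mark_sentences("Той живее на ул. Раковски. Работи в БАН."))
```

A transducer-based delimiter encodes the same kind of contextual decision (punctuation, following capital, known exceptions) as paths in a graph rather than as procedural code.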

After preprocessing, the simple words and the compound words can be identified with the corresponding dictionaries and FSTs. INTEX dictionaries must be in the format of a DELAF (for simple forms) or a DELACF (for compounds). These dictionaries associate tokens with a lemma and linguistic information - part of speech (e.g. Noun), inflectional information (e.g. first person singular present) etc. The Bulgarian DELAF dictionary consists of about 1 300 000 simple word forms, essentially all inflected forms of the simple words. It is constructed from the Bulgarian Grammatical Dictionary (Gramatik 2000 - 80 000 entries), created during the last year under the auspices of Plovdiv University, with the technical support of students from Plovdiv University and the additional support of Technocomp. Some lexical entries (e.g. numerals) will be represented by FSTs. The result of applying the DELAF dictionaries and FSTs is a list of all recognized words, a list of words that have not been found in the dictionaries, and lists of ambiguous and unambiguous compounds.
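To make the dictionary format concrete, here is a minimal Python sketch that parses one DELAF-style line of the form `form,lemma.POS+feature:inflection`. The sample entries and their tag codes (`N`, `F`, `s`, `p`) are hypothetical illustrations, not taken from the Bulgarian dictionary, and the sketch ignores escaped commas and dots inside forms.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DelafEntry:
    form: str               # inflected form as it appears in text
    lemma: str              # canonical form
    pos: str                # part-of-speech code
    features: List[str]     # extra codes introduced by '+'
    inflections: List[str]  # inflectional codes introduced by ':'

def parse_delaf_line(line: str) -> DelafEntry:
    """Parse one line such as 'книги,книга.N+F:p' (hypothetical codes)."""
    form, rest = line.strip().split(",", 1)
    lemma, codes = rest.split(".", 1)
    grammatical, *inflections = codes.split(":")
    pos, *features = grammatical.split("+")
    # An empty lemma field means the lemma is identical to the form.
    return DelafEntry(form, lemma or form, pos, features, inflections)

entry = parse_delaf_line("книги,книга.N+F:p")
print(entry.lemma, entry.pos, entry.inflections)
```

Looking up a token then amounts to retrieving every entry whose `form` matches it; a token with several entries is lexically ambiguous.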

The integration of Bulgarian lexical and grammatical resources in INTEX allows various kinds of processing of Bulgarian texts:

  • indexing all occurrences of a given word, of a list of words (listed in a dictionary), of a given category or, more generally, of any syntactic pattern given in the form of a regular expression or a Finite State Automaton;
  • extracting corpora from the text, building concordances and analyzing texts with INTEX statistical tools.
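The indexing and concordance functions listed above can be sketched in a few lines of Python. This is only an illustration of a keyword-in-context listing for a given list of word forms; the function name, the context width and the sample sentence are assumptions, and INTEX itself matches far richer patterns (categories, regular expressions, automata).

```python
import re
from typing import Iterable, List, Tuple

# A token is a maximal run of word characters (Cyrillic included).
TOKEN = re.compile(r"\w+")

def concordance(text: str, forms: Iterable[str],
                width: int = 20) -> List[Tuple[str, str, str]]:
    """Return (left context, matched word, right context) for each hit."""
    wanted = {form.lower() for form in forms}
    hits = []
    for match in TOKEN.finditer(text):
        if match.group().lower() in wanted:
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            hits.append((left, match.group(), right))
    return hits

sample = "Книгата е на масата. Купих две книги."
for left, word, right in concordance(sample, ["книгата", "книги"], width=10):
    print(f"{left:>10} [{word}] {right}")
```

Passing all inflected forms of one lemma as `forms` gives a simple lemmatized concordance.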

One of the important functions of INTEX is the removal of word ambiguity by means of local grammars. We will present a large list of Bulgarian local grammars for disambiguation.
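The idea behind disambiguation with a local grammar can be sketched as follows. This is a simplified procedural stand-in, not an FST and not one of the Bulgarian grammars: the tag names (`PREP`, `N`, `V`) and the single rule (a word directly after an unambiguous preposition is not a verb) are hypothetical illustrations.

```python
from typing import List, Set, Tuple

TaggedWord = Tuple[str, Set[str]]  # (word form, candidate tags)

def drop_verb_after_preposition(sentence: List[TaggedWord]) -> List[TaggedWord]:
    """Remove the 'V' reading of an ambiguous word that follows a preposition."""
    result = []
    previous_tags: Set[str] = set()
    for form, tags in sentence:
        remaining = set(tags)
        # Apply the rule only when another reading would survive.
        if previous_tags == {"PREP"} and "V" in remaining and len(remaining) > 1:
            remaining.discard("V")
        result.append((form, remaining))
        previous_tags = remaining
    return result

# "води" is ambiguous between a noun and a verb reading.
ambiguous = [("във", {"PREP"}), ("води", {"N", "V"})]
print(drop_verb_after_preposition(ambiguous))
```

Each such rule removes only readings that the local context excludes, so several rules can be applied in sequence without affecting words they do not mention.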

In general, we supply INTEX with Bulgarian lexical and grammatical knowledge as follows:

  • FST for sentence recognition;
  • Dictionary for unambiguous compounds;
  • FSTs for special tokens;
  • DELAF dictionary with over 1 300 000 entries;
  • DELACF dictionary;
  • FSTs for some lexical entries;
  • FSTs - local grammars for disambiguation.

INTEX is a research tool with many applications in computational linguistics, corpus-based linguistics, information retrieval, etc., and we believe that the description of Bulgarian lexical and grammatical knowledge will contribute to multilingual lexicography as well as to the unification of language resources.

