INTEX for Windows
Description of Bulgarian Lexical and Grammatical Knowledge
Svetla Koeva

"INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words. INTEX can build lemmatized concordances and indices of large texts with respect to all types of Finite State patterns. INTEX is used in over 30 research centers as an information retrieval system, to analyze literary texts, to quantify language variations, to teach second languages, as a terminological extractor, as well as to teach computational linguistics to graduate students." (Max Silberztein, [http://www.ladl.jussieu.fr/INTEX/])

A large-coverage Bulgarian representation for INTEX for Windows 95-NT (INTEX 4.0) is being developed by a small group of researchers at the Bulgarian Academy of Sciences (BAS), within a joint research project with the Laboratory for Information Retrieval Systems (SNRS). Here we present the results achieved so far in describing Bulgarian lexical and grammatical knowledge in the framework of the INTEX for Windows system.

INTEX includes tools for preprocessing texts, that is, for the high-precision identification of:
After preprocessing, simple words and compound words can be identified with the corresponding dictionaries and finite-state transducers (FSTs). INTEX dictionaries must be in the DELAF format (for simple forms) or the DELACF format (for compounds). These dictionaries associate each token with a lemma and with linguistic information: part of speech (e.g. Noun), inflectional information (e.g. first person singular present), etc.

The Bulgarian DELAF dictionary consists of about 1 300 000 simple words, covering essentially all inflected simple word forms. It is constructed from the Bulgarian Grammatical Dictionary (Gramatik 2000, 80 000 entries), created over the past year under the auspices of Plovdiv University, with the technical support of students from Plovdiv University and the additional support of Technocomp. Some lexical entries (e.g. numerals) will be represented by FSTs.

Applying the DELAF dictionaries and FSTs to a text yields a list of all recognized words, a list of words that have not been found in the dictionaries, and lists of ambiguous and unambiguous compounds. The integration of Bulgarian lexical and grammatical resources in INTEX allows different kinds of processing of Bulgarian texts:
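As a rough illustration of the dictionary lookup described above, the following Python sketch parses a few DELAF-style entries and splits the tokens of a text into recognized and unknown words. It is not INTEX code. The entry shape `form,lemma.POS:features` is an assumption about the DELAF layout rather than something stated in this text, the tags `N:s`, `N:p` and `V:P1s` are invented placeholders and not the actual Bulgarian tag set, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    form: str        # inflected word form as it appears in texts
    lemma: str       # canonical (dictionary) form
    pos: str         # part-of-speech code, e.g. "N" or "V"
    features: tuple  # inflectional codes (placeholders in this sketch)


# Hypothetical sample entries; the tag codes are illustrative only.
SAMPLE_ENTRIES = [
    "книга,книга.N:s",
    "книги,книга.N:p",
    "чета,чета.V:P1s",
]


def parse_delaf_line(line):
    """Parse one entry of the assumed shape 'form,lemma.POS:feat1:feat2'."""
    form, rest = line.strip().split(",", 1)
    lemma, codes = rest.split(".", 1)
    pos, *features = codes.split(":")
    return Entry(form, lemma, pos, tuple(features))


def build_index(lines):
    """Map each inflected form to the list of entries that describe it."""
    index = {}
    for line in lines:
        if line.strip():
            entry = parse_delaf_line(line)
            index.setdefault(entry.form, []).append(entry)
    return index


def apply_dictionary(tokens, index):
    """Split tokens into those found in the index and those not found."""
    recognized, unknown = [], []
    for token in tokens:
        if token.lower() in index:
            recognized.append(token)
        else:
            unknown.append(token)
    return recognized, unknown


if __name__ == "__main__":
    index = build_index(SAMPLE_ENTRIES)
    found, missing = apply_dictionary(["Книги", "чета", "xyz"], index)
    print("recognized:", found)
    print("unknown:", missing)
```

A form that maps to more than one entry in the index would be lexically ambiguous. Resolving such cases is a separate step that this sketch does not attempt.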
One of the important functions of INTEX is the removal of word ambiguity by means of local grammars. We will present a large list of Bulgarian local grammars for disambiguation. In general, we supply INTEX with Bulgarian lexical and grammatical knowledge as follows:
INTEX is a research tool with many applications in computational linguistics, corpus-based linguistics, information retrieval, etc., and we believe that the description of Bulgarian lexical and grammatical knowledge will contribute to multilingual lexicography as well as to the unification of language resources.
© TELRI, 19.11.1999