1st Seminar

Jan Laciga (ByllBase, Prague):

BYLLBASE - A FULL TEXT RESEARCH METHOD USING LINGUISTIC METHOD

There are principally two approaches to retrieving information from textual data: (i) to select texts according to the index terms (key words) assigned to each text, or (ii) to search for a word or combination of words directly in the texts and thus to select the documents that discuss the issues referred to by the given word or string of words.
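The difference between the two approaches can be illustrated with a minimal sketch. The documents, key words and function names below are invented for illustration only and are not taken from ByllBase.

```python
import re

# Hypothetical mini-collection: each document carries manually assigned
# key words (approach i) and its full text (approach ii).
DOCS = {
    1: {"keywords": {"banking", "loans"},
        "text": "The bank approved the loan within a week."},
    2: {"keywords": {"legislation"},
        "text": "The new act regulates every bank and savings institution."},
}

def select_by_keywords(docs, keyword):
    """Approach (i): select documents whose assigned key words contain the term."""
    return {doc_id for doc_id, doc in docs.items() if keyword in doc["keywords"]}

def select_by_full_text(docs, words):
    """Approach (ii): select documents whose text contains all the given words."""
    hits = set()
    for doc_id, doc in docs.items():
        tokens = set(re.findall(r"\w+", doc["text"].lower()))
        if all(word.lower() in tokens for word in words):
            hits.add(doc_id)
    return hits

print(select_by_keywords(DOCS, "banking"))   # only document 1 was indexed with this term
print(select_by_full_text(DOCS, ["bank"]))   # both documents mention the word in their text
```

In this toy collection the key word lookup finds only the document that an indexer happened to label with the term, while the full text search finds every document in which the word actually occurs.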

The system developed by our company belongs to type (ii), which we consider more convenient for large-scale applications. We had to develop a system specifically designed for Czech because the available systems, built mostly for English, are not applicable: the inflectional character of Czech (in contrast to English) brings problems connected with the large number of forms of a single lexical item.

The first commercially available text retrieval system for Czech, called ByllBase, has been developed in cooperation with the computational linguistics group at Charles University in Prague. Its special feature is the integration of a Czech lemmatizer into the system. The lemmatizer also makes it possible to distinguish among homonyms. Integrating the lemmatizer enables the user to formulate queries in a natural form, speeds up the whole process, and lowers the memory requirements for the auxiliary files. At the present stage, we are extending the semantic analysis so that homonyms can be distinguished without human intervention.
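The role of a lemmatizer in full text retrieval can be sketched as follows. This is only an illustration under stated assumptions: the form-to-lemma table is a tiny hand-written stand-in for a real Czech lemmatizer, and the names LEMMAS, lemmatize, build_index and search are hypothetical, not part of ByllBase.

```python
import re
from collections import defaultdict

# Toy form-to-lemma table (assumption: a real lemmatizer would cover the
# whole vocabulary). "ženu" is a homonym: accusative of "žena" (woman)
# or first person singular of "hnát" (to chase).
LEMMAS = {
    "banka": ["banka"], "banky": ["banka"], "bance": ["banka"],
    "banku": ["banka"], "bankou": ["banka"], "bankách": ["banka"],
    "žena": ["žena"], "ženy": ["žena"],
    "ženu": ["žena", "hnát"],
}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def lemmatize(token):
    """Return all candidate lemmas; unknown forms fall back to the form itself."""
    return LEMMAS.get(token, [token])

def build_index(docs):
    """Build an auxiliary index that maps each lemma to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            for lemma in lemmatize(token):
                index[lemma].add(doc_id)
    return index

def search(index, query):
    """Return documents containing every query word in any inflected form."""
    result = None
    for token in tokenize(query):
        hits = set()
        for lemma in lemmatize(token):
            hits |= index.get(lemma, set())
        result = hits if result is None else result & hits
    return result or set()

DOCS = {1: "Úvěr v bance", 2: "Nové banky v Praze", 3: "Vidím ženu"}
index = build_index(DOCS)
print(search(index, "banka"))  # documents 1 and 2, although neither contains "banka" literally
print(search(index, "hnát"))   # document 3, a false hit caused by the homonym "ženu"
```

Because the index is keyed by lemmas, the user can type the word in any form, and all forms of one lexical item share a single index entry. The second query shows why homonym disambiguation matters: without it, the ambiguous form "ženu" is indexed under both candidate lemmas.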

ByllBase is now used at such large institutions as the Czech savings bank Ceska sporitelna, the Czech National Bank, the city councils of Brno and Bratislava, some industrial plants, editorial offices, etc. One of the successful installations of ByllBase is the legal system ASPI, a highly complex and widely used automatic retrieval system for legal documents in the Czech Republic and Slovakia, which contains Czech legal documents and legal literature dating back to 1811.