No. 2

Wolfgang Teubert, Coordinator of TELRI

One year ago TELRI was set up by 22 institutions in 17 European countries. Meanwhile, more institutions have joined, if only as members of the Advisory Board: Belgrade, Zagreb, and, quite recently, Moscow. TELRI is a Concerted Action. Its primary goal is to pool existing language resources, corpora, and lexicons and to make them available to the growing NLP community. New resources are being created; all resources will be standardised. Together they will allow for the development of a new generation of powerful multilingual language technology applications. They are already used for corpus-based dictionaries and lexical databases. The TELRI Resources Catalogue is available on the WWW ( /index.html). The success of the TELRI network depends on the organisation of national networks bringing together academic research and the commercial language industry. Therefore, TELRI and the partner project ELSNET Goes East have joined forces to conduct an in-depth survey of the leading actors in the field of language technology, their resources, and their activities. Data collection is about to begin. Results will quickly be made available on the Internet and in printed form.
Our first European Seminar, "Language Resources for Language Technology", held in Tihany, Hungary, on September 15 and 16, clearly demonstrated the need for a continuous platform for research and industry. All applications more sophisticated than spelling checkers rely heavily on linguistic data and on knowledge extracted from the data. Successful applications are those that have an accuracy rate of more than 95%. The higher your goal on this scale, the more linguistic knowledge is required. This knowledge is available at academic research centres. Thus, if you want to upgrade your language technology products, contact us now.
Meanwhile, we have selected some areas of ongoing research. We set out to produce electronic text versions of Plato's "Republic" in as many languages as possible. Nine (including Chinese) are already available. This small but very helpful parallel corpus will be used to test corpus alignment software and to find new ways to detect translation equivalents of multi-word units, collocations, and phraseologisms.
Another area is the validation of textual and lexical resources: here we want to complement similar activities in Western Europe and to assess generic tools such as alignment software. Validation of language resources will have to be based upon linguistic specifications like morphosyntactic tagsets for corpora or data categories for lexicons. Some of these are language independent (e.g., parts of speech), others (e.g., gender) are language specific. TELRI will contribute to a joint European register of linguistic specifications to be used as a standard for language resources and generic software. If you would like to hear more about these activities or to participate in them, please contact us.

TELRI Partners

In the first issue of our newsletter we published information about the TELRI partners. For technical reasons, the following descriptions could not be included there, so we publish them in the present issue.

Institute of Linguistics and Literature
Prof. Dr. Bahri Beci
Tirana, Albania

The Institute of Linguistics and Literature was founded in 1972. Its first nucleus was the section of language, literature and folklore in the Institute of Studies (1946). As its scientific activity expanded, five sectors were set up: grammar and dialectology, lexicology and lexicography, terminology, history and literature and folklore.
The fundamental task of the Institute of Linguistics and Literature in the field of linguistics is the study of the Albanian language and its dialects, both in their present state and from a historical perspective. This includes, in particular, the study of the national literary language, the elaboration of a scientific grammar, the compiling of explanatory dictionaries of the Albanian language, the elaboration of terminology and the compiling of multilingual terminological dictionaries, the treatment of onomastic problems, and language culture.

Centre of Scientific Research of the Slovene Academy of Arts and Sciences
Institute of the Slovene Language "Fran Ramovs"
Primoz Jakopin
Ljubljana, Slovenia

Under the cover project "Lexicology, grammar and dialectology of the Slovene language," the Institute is currently engaged in research on the following projects:

(1) the word corpus of the contemporary standard language (the one-volume dictionary of the contemporary standard Slovene language; the orthographical dictionary; the backwards dictionary of the contemporary standard Slovene language, based on the five-volume dictionary of the contemporary standard Slovene language; dictionary of standing expressions of the contemporary standard Slovene language; synonymity in the contemporary standard Slovene language; the transfer of the word-list Besedisce slovenskega knjiznega jezika to machine-readable form; the preparation of the five-volume dictionary of the contemporary standard Slovene language for edition in electronic form)

(2) the morphological and word-formational analysis of the contemporary standard Slovene language

(3) comparative and etymological investigations of the Slovene language (the publication of volume 3, and the preparation of volume 4, of the Etymological Dictionary of the Slovene Language)

(4) historical dictionaries (the dictionary of the writings of Slovene protestant writers of the 16th century; the dictionary of the one-time standard language based on the Prekmursko dialect)

(5) Dialect atlases and dictionaries (the preparation of dialect maps for the Slovene linguistic atlas; the participation in the international projects The Pan-Slavic Linguistic Atlas [OLA] and The European Linguistic Atlas [ALE]. The preparation of dialect dictionaries: of the Kostelsko dialect, of the dialect of Zadrecka dolina between Gornji grad and Nazarje. Monographs: the tonemes in the word-formation of the contemporary standard Slovene language, contrasted with the dialect of Vnanje Gorice; the dialect of Kropa)

(6) Terminological dictionaries (the terminological dictionary of law; the dictionary of general technical terminology; the terminological dictionary of medicine; the terminological dictionary of veterinary sciences; the terminological dictionary of railways).

The cover project "Lexicology, grammar and dialectology of the Slovene language" is a long-term research project designed to ensure (a) a full inventory of the Slovene lexical material, and (b) a systematic analysis and interpretation of the language facts at all levels of grammar, from a historical as well as from a descriptive point of view. The main purpose of the cover project is to produce the basic publications in the field of the Slovene language, publications that will (1) deepen the insight into the Slovene language as it is now and as it was in the past, and (2) contribute towards the equitable treatment of the Slovene language in international linguistic circles.

News from TELRI Working Groups

Co-ordinator: Wolfgang Teubert

In this Working Group, we had a very ambitious goal: each TELRI partner was to engage in three small-size joint ventures with small- and medium-sized language industry enterprises. In general, our partner was to contribute the necessary language resources and the linguistic knowledge required, while the company would produce the result: an NLP application, a dictionary, or some other product that could be marketed. We wanted to show that cooperation between research and industry is not only possible but can also be profitable for both partners involved.
Our goal could not be reached everywhere. In some countries, the commercial language industry is still in its infancy. The small software houses have very little money to invest, and they are looking for quick returns. However, good and solid language technology applications require staying power, and some companies will still have to learn that you need more than a traditional small dictionary and a smart programmer to develop sophisticated NLP software.
National networks will bring together research and industry, providers and users of language resources. Their experience, knowledge, and information will be shared; however, we also need national programmes to encourage this kind of cooperation and to induce the language industry to engage in more sophisticated applications. At our TELRI Seminar, a number of successful projects demonstrated that cooperation between research and industry is not only possible but also profitable.

Co-ordinator: Ruta Marcinkeviciene

The most important piece of news about WG2 is that it has started coordinating information collection and documentation with two other projects involved in the same activities, ELSNET and ELSNET Goes East. Representatives of all three projects met in London on August 12 and decided to pool their efforts in creating a widely accessible database of the language and speech technology groups in industry and academia. They discussed what sort of information still has to be gathered, where information is available, and how the task should be approached in a practical way. They also agreed to develop a common set of questionnaires. In addition, in order to avoid duplication of activities, to increase the response rate, and to use EC funds more effectively, it was decided to assign the Western European countries to ELSNET and to divide up the Central and Eastern European countries between ELSNET Goes East and TELRI.

Co-ordinator: Tomaz Erjavec

The initial work plan for WG5 (Tool assessment; 1995 - tagger assessment) was found to be unattainable, due to the lack of human, language and computational resources. At the Tihany meeting it was therefore decided to change the name and tasks of the WG: WG5 Tool Availability.

WG5 will aim to increase the availability of language engineering tools by:
_ making information on such tools available, via WWW;
_ providing, via WWW, the public tools of WG5 members and TELRI partners;
_ improving such tools by adapting them to various languages and platforms.
The WG will concentrate on multilingual or language independent tools for developing and exploiting textual resources.

The above goals will be pursued in cooperation with:
_ WG9 Joint Research (`Cascade' project);
_ WG7 Networking (organisation and utilisation of WWW);
_ WG10 User Needs (tools for corpora validation);
_ Copernicus JP MULTEXT-East.

Current results:
_ a WWW page providing information on various public and commercial language engineering tools (;
_ Gothenburg site has started efforts to make their lexicon building tool robust and publicly available.

_ WG5 Ljubljana site, in cooperation with Prague will make the MULTEXT tools publicly available and work on adapting these tools for various languages and platforms;
_ WG5 Prague site has agreed to make their morphological analyser publicly available;
_ WG10 plans to use the MULTEXT tools for assessing corpora;
_ WG9 Birmingham site has plans to make available their suite of corpus handling tools (`Cascade').

Co-ordinator: Pierre Lafon

Due to specific circumstances, the coordination of this Working Group was temporarily assigned to Dan Tufis. At short notice he prepared the meeting in Tihany and afterwards wrote this report on the decisions and agreements that were made concerning future work.
The main issues fell into three categories:
_ types of services the TELRI consortium might offer
_ legal aspects concerning services
_ future plans


The question of services was raised mainly in view of the already established associations with similar aims. It was agreed that the fact that TELRI includes (either as partners or associates) representatives from most European countries gives it a very strong advantage in acting as a bridge to the Central and Eastern European language market for associations that are almost exclusively based in Western Europe and overseas. Besides the standard services provided by such associations as ELRA or LDC, some others were discussed:
_ evaluation of language resources (including assessment of lingware) for one's own language
_ porting/extending software to cover missing features
_ implementation teams - design teams
_ linguistic assistance
In a previous phase, IDS Mannheim compiled a list of existing resources at the sites of the TELRI partners. It was decided that this list should be updated regularly and further extended with a precise statement of the services (including rates, fees, copyright problems, etc.) that the holders of these resources could provide.


The legal issue was quite a hot one, as legislation (mainly copyright law) appears to differ considerably among the member countries of the TELRI consortium (in Romania, there is no copyright law yet). These differences might make it difficult to ensure a uniform service. This is particularly true of corpus-related services, which involve material subject to copyright regulations. A specific user agreement (used by INL for the use of a retrieval system) was discussed in order to identify the important items that might be included in a generic TELRI service agreement. One general piece of advice: avoid being too specific. According to most existing copyright regulations, permission has to be requested for each use not agreed on in writing. The main items (contributed by T. Kruyt, based on the INL experience) are the following:

_ parties involved in the agreement: names of the provider and the user
_ topic of the agreement
_ purpose of the agreement
_ permission
_ time schedule agreement
_ details of delivery
_ guarantees


As for future plans, the members of this WG, with support from all the other members of TELRI, will survey existing and potential services which could be of interest both inside and outside the project. This information should be made available to as large an audience as possible. WG6 should act as a broker for the pooled services. It was decided to dedicate one of the Newsletters to the services offered by the TELRI consortium. This special issue of the TELRI Newsletter will also present the available linguistic resources. By the end of November a questionnaire will be circulated among the TELRI partners in order to update the information already collected and to extend it with new types of services.

WG8 Linking
Co-ordinator: Wolfgang Teubert

This Working Group aims at broadening TELRI as a European platform for the creation, enrichment, distribution and exchange of high quality monolingual and multilingual language resources. Working Group Linking is setting up close operational ties with the recently established PAROLE Consortium, an association of all leading national language centres in the European Union. This Consortium has compiled linguistic specifications for generic written resources and is now creating comparable reference corpora and lexicons. TELRI will engage in these activities on a complementary basis. TELRI has also formed relations with Professor Vladislav Mitrofanovich Andrjuscenko and his Institute for Russian Language, the Russian centre of corpus linguistics. Furthermore, it has established links with ELSNET and ELSNET Goes East. Together these projects will conduct an in-depth survey of actors, data and activities in the field of language resources in Central and Eastern Europe (including the New Independent States).

Co-ordinator: Andrej Spektors

The results of the work on surveying potential users and their needs were discussed. When comparing the results of the distributed questionnaires, we found some features common to the Central and Eastern European countries. The existing resources are mostly in text file format, and they have not been validated. Potential users are just starting to show interest in linguistic software. User needs are very close to the results achieved by the NERC and PAROLE projects. The only problem in the CEE countries is that user needs are not yet strictly formulated. A plan for future work has been accepted. The plan is based on the development of validation methods. The main objective proposed is the ability to validate corpora according to SGML, TEI and EAGLES standards and recommendations.

TELRI Events

TELRI Seminar, Tihany, Hungary, September 15-16 1995

The European Seminar "Language Resources for Language Technology" was the first of its kind organized by TELRI. It took place at the Institute for Limnology in Tihany, Hungary, on 15-16 September 1995.
The aim of the seminar was to bring together scholars, software and lingware developers, and end-users to exchange information. Several state-of-the-art papers were presented by the invited speakers; the fields covered ranged from speech processing through machine translation to corpus applications. We even had a presentation on language resources and software in China, and we could also broaden our knowledge of the American market for linguistic data. Some talks dealt with other language engineering COPERNICUS projects. Representatives of the Hungarian government and the European Commission also gave lectures.
The members of the TELRI user group presented some joint venture case studies; among them we saw a modern railway dictionary, two spell-checkers, a new project for collecting neologisms from corpora, and the use of the computer fund of Russian. We also had a chance to see several demonstrations, some of them presented by TELRI members or partners, some of them by external participants. One of the most challenging was the demonstration of the LANGMaster Multimedia System for Language Teaching. The more than 70 participants came from 24 countries. Besides TELRI members and partners we had guests from universities, research institutes, private software companies and publishing houses. Beyond the usual advantages of conferences, the seminar offered an opportunity for communication between software developers and users, and as a result, some business agreements were concluded.
The seminar was a successful attempt to establish a new forum for researchers in the field of corpus linguistics and natural language processing, and for the possible users of their results.
Julia Pajzs


Demonstrations of NLP systems of many different kinds were one of the most interesting parts of the Tihany Workshop Programme. We publish descriptions of some of the demonstrated systems, both for information about where we stand and for inspiration.

Primoz Jakopin (Ljubljana)

The program presented is a text editor which has, since 1985, evolved into a tool used for processing a sizeable number of text corpora and for preparing dictionaries in the Slovenian academic environment. EVA started on a Sinclair Spectrum (EVE) and was ported to the ATARI ST in 1986 (STEVE); the DOS version has been in use since 1991. Porting to Windows NT/Windows 95 is under way.
EVA has been designed from the start to be as flexible as possible, so that users can adapt it to different needs and situations themselves. It is more or less self-contained, with its own keyboard, screen characters, DTP mode, graphics editor and an OCR facility. To conform to modern character set standards such as UNICODE, EVA can process either 8- or 16-bit characters. If a line of text contains only characters with codes below 256, it is stored as 8-bit, in RAM as well as on disk; if, on the other hand, it contains one or more characters with codes above 255, it is stored as a 16-bit entity. All internal line and data record buffering is of course 16-bit. Database functions include general-purpose routines such as sorting or searching and more specialized functions such as splitting of text into sentences, wordwise translation and markup, or computation of entropy.
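The per-line choice between 8-bit and 16-bit storage described above can be illustrated with a minimal sketch. It is not EVA's actual code: the function names are hypothetical, Unicode code points stand in for EVA's character codes, and Python's `latin-1` and `utf-16-le` codecs are used only as convenient one-byte and two-byte encodings.

```python
from typing import Tuple


def encode_line(line: str) -> Tuple[int, bytes]:
    """Return (width_in_bits, payload) for one line of text.

    Assumption: a line is stored with one byte per character when every
    code is below 256, and with two bytes per character otherwise.
    """
    if any(ord(ch) > 0xFFFF for ch in line):
        raise ValueError("this sketch only handles 16-bit character codes")
    if all(ord(ch) < 256 for ch in line):
        # latin-1 maps codes 0..255 to single bytes one-to-one.
        return 8, line.encode("latin-1")
    # utf-16-le uses exactly two bytes for every 16-bit code point.
    return 16, line.encode("utf-16-le")


def decode_line(width: int, payload: bytes) -> str:
    """Restore a line from the representation chosen by encode_line."""
    if width == 8:
        return payload.decode("latin-1")
    if width == 16:
        return payload.decode("utf-16-le")
    raise ValueError("width must be 8 or 16")
```

A line such as "abc" would take three bytes, while a line containing a single character above 255 (for example the Slovenian "č") doubles in size; this is the trade-off the per-line decision avoids for purely 8-bit text.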
Currently EVA is also used in the production of a lemmatization dictionary of Slovenian, based on the 93,500-entry Dictionary of the Slovenian Literary Language. So far nouns (54,522 lemmas generating 468,281 word forms) and adjectives (22,861 lemmas and 277,831 word forms) have been completed.

Prof. Dr. Elena Paskaleva (lingware) / Bojaka Zaharieva (software)(Sofia):

The object of the demonstration is the system SUPERLINGUA. SUPERLINGUA is a tagging tool for highly inflected languages in extreme conditions: if the morphological component is missing or unusable, or if the language is an NLP virgin (not having been processed at all). The system is language independent, the tagging is flexible and user-friendly, and a special interface is provided for the optimal distribution between the system's and the user's linguistic knowledge. The programming language is CLIPPER 5.2 in a DOS environment. The system is expected to be available in the public domain in 6-9 months in DOS and WINDOWS environments. The product is made entirely by the software specialists of the Laboratory of Linguistic Modelling.

Dr. Andrejs Spektors (Riga):

The system was designed as a tool for people who have problems with writing documents in foreign languages (Latvian, Russian, English).
How it works: The system is developed for PC computers and works under DOS. It consists of three items: general term dictionary, thematic dictionary and case generation tool.
Developer: AI Lab. of Institute of Mathematics and Computer Science, University of Latvia.
Can be obtained: by agreement with AI Lab

Dr. Andrejs Spektors (Riga):

The system is used to generate wordforms in Latvian for different purposes, e.g., morphological analyzer, vocabulary for spelling-checker, computer aided language learning. During the work a database of anomalous words was developed.
How it works: The system is developed for PC computers and works under DOS. It includes case generation for nouns, adjectives and verbs. The system can be used in different modes, i.e., demonstration or learning mode and wordform generation mode for lexicon.
Developer: AI Lab. of Institute of Mathematics and Computer Science, University of Latvia.
Can be obtained: by agreement with AI Lab

Dr. Andrejs Spektors (Riga):

The system is developed for further usage in syntactic analysis and for lexicalization.
How it works: The system is developed for PC computers and works under DOS. It analyses sentences separately and returns base forms of each word as well as part of speech and grammatical information. For homonyms all possible solutions are produced.
Developer: AI Lab. of Institute of Mathematics and Computer Science, University of Latvia.
Can be obtained: by agreement with AI Lab

Dr. Andrejs Spektors (Riga):

The system provides a user interface to a monolingual Latvian-Latvian dictionary.
How it works: The dictionary works under MS Windows. The list of all words in the dictionary is presented to the user and can be scrolled or incrementally searched for some word. In another window area dictionary entries can be viewed, and the user can easily get possible base form(s) of any word form present in dictionary entries.
Developers: New Mexico State University, U.S.A. (prof. J.Reinfelds), University of Latvia, Department of Baltic languages, AI Lab. of Institute of Mathematics and Computer Science, University of Latvia.
Can be obtained: by agreement with NMSU

Jan Laciga (ByllBase, Prague):

There are principally two approaches to the task of information retrieval of textual data: (i) to select the text according to indexes (key words) assigned to each text, or (ii) to retrieve a word or combination of words directly in the texts and thus to select documents where the issues referred to by the given (string of) words are discussed.
The system developed by our company belongs to type (ii), which we consider to be more convenient for large-scale applications. We had to develop a system specifically designed for Czech because the systems available, mostly for English, are not applicable: the inflectional character of Czech (in contrast to English) brings problems connected with the abundance of forms of a single lexical item.
The first commercially available text retrieval system for Czech, called ByllBase, has been developed in cooperation with the computational linguistics group at Charles University in Prague. Its special feature is the integration of a Czech lemmatizer into the system. This lemmatizer also makes it possible to distinguish among homonyms. It enables the user to formulate queries in a natural form, speeds up the whole process, and lowers the memory requirements for the auxiliary files. At present, we are extending the semantic analysis so that homonyms can be distinguished without human intervention.
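Why a lemmatizer helps full-text retrieval in an inflected language can be shown with a small sketch. It does not reproduce ByllBase: the lemma table, the documents and all function names are hypothetical, and a real lemmatizer would rely on a full morphological dictionary of Czech rather than a hand-written table.

```python
from collections import defaultdict

# Hypothetical toy table: word form -> set of possible lemmas.
LEMMAS = {
    "banka": {"banka"}, "banky": {"banka"}, "bance": {"banka"},
    "banku": {"banka"},
    "zákon": {"zákon"}, "zákona": {"zákon"}, "zákony": {"zákon"},
}


def lemmatize(form):
    """Return the possible lemmas of a form (the form itself if unknown)."""
    form = form.lower()
    return LEMMAS.get(form, {form})


def build_index(docs):
    """Index documents by lemma, so inflected forms share one entry."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            for lemma in lemmatize(token.strip(".,;:!?")):
                index[lemma].add(doc_id)
    return index


def search(index, query):
    """Return documents containing some form of every query word."""
    result = None
    for word in query.split():
        hits = set()
        for lemma in lemmatize(word):
            hits |= index.get(lemma, set())
        result = hits if result is None else result & hits
    return result or set()


DOCS = {1: "Zákon o bance.", 2: "Nové zákony.", 3: "Banky a zákony."}
INDEX = build_index(DOCS)
```

With this toy data, a query for "banka" finds documents 1 and 3 even though neither contains that exact form, and the index holds one entry per lemma rather than one per word form, which is the kind of saving in auxiliary files mentioned above.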
ByllBase is used nowadays at such big institutions as the Czech savings bank Česká spořitelna, the Czech National Bank, the city councils of Brno and Bratislava, some industrial plants, editorial offices, etc.
One of the successful installations of ByllBase is the legal system ASPI, a most complex and widespread automatic retrieval system of legal documents in the Czech Republic and in Slovakia, which contains Czech legal documents and legal literature since 1811.
The system was developed by ByllBase in close cooperation with the researchers of the computational linguistics team at Charles University, Prague.

Prof. Dr. Dan Tufis (Bucharest):

MAC-ELU is an integrated unification-based system aimed at developing reversible linguistic descriptions. It consists of a morphological analyser/generator, a chart parser, a head-driven generator and a transfer module, all of them relying on unification mechanisms for dealing with grammatical constraints. The morphological processor works on a continuation-class basis, with the usual clustering of morphemes into distinct dictionaries (called continuation classes). Successful transitions from one cluster to another, corresponding either to the analysis of a word-form structure or to the generation of a word-form by concatenation, are constrained by specific restrictions introduced by means of a powerful macro-definition mechanism. The implementation of Romanian morphology (and NP analysis/generation) will be exemplified and the structure of the lexical entries will be discussed. Further development plans will be mentioned.
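The idea of continuation classes can be sketched in a few lines. This is only an illustration under strong simplifying assumptions, not the MAC-ELU implementation: the tiny lexicon, the class names and the feature names are hypothetical, and plain dictionary merging (later morphemes override earlier values) stands in for the unification and macro-definition machinery described above.

```python
# Each continuation class is a small dictionary:
# morpheme -> (features contributed, name of the next class).
# Hypothetical fragment for the Romanian noun "casă" (house).
LEXICON = {
    "Root": {
        "cas": ({"lemma": "casă", "pos": "noun"}, "NounEnding"),
    },
    "NounEnding": {
        "ă": ({"number": "sg", "definite": False}, "End"),
        "a": ({"number": "sg", "definite": True}, "End"),
        "e": ({"number": "pl", "definite": False}, "PluralArticle"),
    },
    "PluralArticle": {
        "": ({}, "End"),                    # no article: stay indefinite
        "le": ({"definite": True}, "End"),  # enclitic definite article
    },
}


def analyse(word, cls="Root"):
    """Return all feature sets that segment `word` starting in class `cls`."""
    if cls == "End":
        return [{}] if word == "" else []
    results = []
    for morpheme, (features, next_cls) in LEXICON[cls].items():
        if word.startswith(morpheme):
            for rest in analyse(word[len(morpheme):], next_cls):
                # Simplification: merge features instead of unifying them.
                results.append({**features, **rest})
    return results
```

Calling `analyse("casele")` walks Root, NounEnding and PluralArticle in turn and yields a single plural definite reading, while a string that cannot be segmented along the allowed transitions yields an empty list. Because the same tables drive every transition, they could in principle also be traversed in the generation direction, which is what makes such descriptions reversible.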
The ELU system was developed by ISSCO (Rod Johnson, Mike Rosner, Graham Russel, Afzal Balim, Amy Winarske) and runs in ALLEGRO-COMMON LISP on SUN machines. The MAC-ELU version was ported to Macintosh Common Lisp, running on Macintoshes, by Dan Tufis and Octav Popescu. The ported version includes some new facilities and a menu-based interface, and its code was optimized to ensure a reasonable response time on the smaller machines. The Romanian morphology was implemented by Dan Tufis, Octav Popescu, Lidia Diaconu, Calin Diaconu and Ana-Maria Barbu. Most of the NP rule set is due to Lidia Diaconu, Calin Diaconu and Cristian Dumitrescu.
For information on how to obtain this system, contact: Dan Tufis: e_mail address:

Prof. Dr. Dan Tufis (Bucharest):

This is an extension of Tomita's GLR parser which departs significantly from the original algorithm in that it works with feature-based grammars. In addition, the data structures introduced by Tomita to handle conflicting entries in the LR tables (graph-structured stack, packed shared parse forests) have been enhanced to allow for non-monadic grammar categories (DAGs). To deal with the complexity problems raised by the unification extension of the GLR parser, we used special data structures (virtual copying vectors). Although test data are not yet available, we expect GULiveR to provide very good response times in real applications.
This parser was developed in 1992 by Dan Tufis and Octav Popescu. The initial implementation (Golden Common Lisp for PC) was ported this year to Macintoshes (MCL2.0.1) by Stefan Bruda and Mihai Ciocoiu.

Prof. Dr. Dan Tufis (Bucharest):

This demo, based on a larger educational software package called PAIL, is a tutorial system for learning about parsing. It presents two different paradigms: the old procedural ATN approach and the declarative one supported by a parametrized chart parser. The graphical interface, the steppers, the browsers, the animated graphics, the demos and the on-line documentation make this system a very effective educational tool, highly appreciated by students. The larger PAIL system (which includes, besides the NLP modules, several other interesting Artificial Intelligence systems: theorem proving, rule-based systems, neural networks, back propagation, constraint satisfaction programming, inductive learning, genetic algorithms) was initially implemented at IDSIA-Lugano by Rod Johnson, Mike Rosner, Paolo Cattaneo and Fabio Baj (with contributions from some others) in Allegro Common Lisp on SUN workstations. The system was ported to MCL for Macintoshes by a joint team consisting of Mike Rosner and Paolo Cattaneo from IDSIA and Dan Tufis, Octav Popescu, Stefan Trausan and Adrian Boangiu from the Romanian Academy and ICI.

Prof. Dr. Dan Tufis (Bucharest):

This system is based on a natural language generator (ALLP), developed by Sue Felshin and Stuart Malone of MIT ATHENA's group. ALLP takes as input a highly verbose interlingual representation of a syntactic structure (GB flavoured) and produces natural language text. KRIL allows for input specified at a highly conceptual level. On the basis of the linguistic information already present in the lexicon, it automatically generates the lengthy structures needed by ALLP. The overhead added by KRIL is less than 10% of the overall generation time (the average response time is below 1 second for a 6-8 word sentence). The KRIL generator makes the linguistic processing fully transparent to a client application (such as, for instance, an intelligent tutoring system in second language learning). The KRIL interface was implemented by Dan Tufis.

Jan Průcha (Dr. Lang Group, Prague):

The LANG Master Teaching System is a system of computer programs and instructions designed for the teaching of foreign languages by means of a computer. The whole project consists of three main parts:

1. LANG Master Technology
LANG Master Technology is a procedure that prepares data for LANG Master Presentation. Specifically, it converts a chosen language course or dictionary from book form into the form of computer data.

2. LANG Master Presentation
LANG Master Presentation is a powerful multimedia application designed to present LANG Master computer courses and dictionaries for the teaching of foreign languages.

3. RE-WISE Method
The aim of the method is to keep in the student's memory all the expressions learnt and, at the same time, to minimise the frequency of revision.

Vladimír Benko (Bratislava):
Concise Dictionary of the Slovak Language - Electronic Version

The Concise Dictionary of the Slovak Language (KSSJ, Krátky slovník slovenského jazyka, VEDA, Bratislava 1987) is a one-volume everyday-use explanatory dictionary of present-day Slovak, covering some 55,000 headwords (36,000 main entries). The last `paper' edition appeared in 1989.
The electronic version of KSSJ is based on the typesetting tape of the dictionary's first edition, which was transformed into a (slightly) tagged MRD form, (extensively) validated and (manually) updated to match the second printed edition. The reformatted data have been indexed by means of the WordCruncher corpus processing package.
The current MRD version of KSSJ has been used as one of the reference sources to compile the new Slovak Synonyms Dictionary (Synonymický slovník slovenčiny, VEDA, Bratislava, in print), as a tool for various research projects in Slovak lexicology and as teaching material at the Comenius University's Faculty of Education. A CD-ROM version of KSSJ is under consideration; it would appear simultaneously with the third edition of KSSJ, which is being prepared for publication at the end of 1996.

Dr. Truus Kruyt (Leiden):

The Institute for Dutch Lexicology INL is a research institute subsidized by the Dutch and Belgian governments. Corpus development at the INL dates from the mid-seventies. Up to 1990, the INL text corpora were mainly developed for lexicographical purposes. At present, they are used for a broad variety of research and applications. INL text corpora of present-day Dutch include two linguistically annotated corpora which can be consulted via the Internet: the 5 Million Words Corpus 1994, which covers a variety of topics and text types, and the 27 Million Words Newspaper Corpus 1995. The retrieval program developed for the latter will be demonstrated.

Characteristics of the 27 Million Words Newspaper Corpus 1995:
The newspaper texts, dating from 1994 and 1995, were obtained in machine-readable form under a contract with the publishing company, which specifies the conditions of use. The texts served as input for automatic linguistic encoding: part of speech (POS) and headword were assigned to the word forms in the electronic texts by a lemmatizer/POS-tagger developed by the INL. Most of the data have not been corrected, either at the level of the text itself or at the level of POS and headword. The linguistically encoded texts were loaded into an on-line retrieval system developed by the INL. Queries may concern the whole corpus, or a subcorpus defined by the user by year and month of publication. The system allows the user to search for single words or word patterns, including some predefined (and still rather primitive) syntactic patterns which the user can revise. Search definitions may refer to word forms, POS and headwords, separately or in combination, using Boolean operators and proximity searches. The output is most often a list of items or a series of concordances with a user-defined context size. Within the limits imposed by copyright, search output can be transferred to the user's computer by e-mail (transferring complete texts or substantial text fragments is not allowed). Other facilities include wild cards and various sorting options.
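To make the kind of query described above more concrete, the following is a minimal sketch in Python of a search that combines word form, POS and headword criteria and returns concordance lines with a user-defined context size. It is not the INL retrieval system: the `Token` structure, the function names and the tags in the toy data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str      # word form as it appears in the text
    pos: str       # part-of-speech tag (hypothetical tagset)
    headword: str  # headword (lemma) assigned by the tagger


def matches(token, form=None, pos=None, headword=None):
    """True if the token satisfies every criterion given (Boolean AND)."""
    return ((form is None or token.form.lower() == form.lower())
            and (pos is None or token.pos == pos)
            and (headword is None or token.headword == headword))


def concordance(tokens, context=2, **criteria):
    """Return keyword-in-context lines for all tokens matching the criteria."""
    lines = []
    for i, tok in enumerate(tokens):
        if matches(tok, **criteria):
            left = " ".join(t.form for t in tokens[max(0, i - context):i])
            right = " ".join(t.form for t in tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{tok.form}] {right}".strip())
    return lines


# Toy data; the tags are illustrative, not the INL tagset.
corpus = [
    Token("De", "ART", "de"), Token("trein", "N", "trein"),
    Token("rijdt", "V", "rijden"), Token("snel", "ADJ", "snel"),
    Token("en", "CONJ", "en"), Token("wij", "PRON", "wij"),
    Token("reden", "V", "rijden"), Token("mee", "ADV", "mee"),
]

# All verb forms of the headword "rijden", with two words of context.
for line in concordance(corpus, context=2, pos="V", headword="rijden"):
    print(line)
```

Searching on the headword rather than the word form is what lets one query retrieve both "rijdt" and "reden"; a real system would add OR/NOT operators, proximity constraints and wild cards on top of this.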

Access to the 27 Million Words Newspaper Corpus 1995:
Consultation of the corpus is free for non-commercial purposes. Please contact the director of the INL, Prof. dr. P.G.J. van Sterkenburg, about the conditions for commercial applications. To get access to the corpus, an individual user agreement has to be signed. An electronic user agreement form can be obtained from our mailserver Mailserv@Rulxho.Leidenuniv.NL. Type in the body of your e-mail message: SEND [27MLN95]AGREEMNT.USE. Please make a hard copy of the agreement form, sign it, keep a copy yourself, and return a signed copy to: Institute for Dutch Lexicology INL, P.O. Box 9515, 2300 RA Leiden. After receipt of the signed user agreement, you will be informed about your username and password. Use of a VT 220 (or higher) terminal, or an appropriate terminal-emulator (e.g. Kermit) is recommended. If you need additional information, please send an e-mail message to Helpdesk@Rulxho.Leidenuniv.NL, or send a fax to Mrs. dr. J.G. Kruyt (31 71 27 2115).


One of the aims of the TELRI project is to promote cooperation between academia and industry. Contributions devoted to some joint ventures that were presented at the Tihany seminar have shown the usefulness of such cooperation.

Dr. Truus Kruyt (Leiden):

Dr. Truus Kruyt and Prof. Dr. Sterkenburg,
Institute for Dutch Lexicology INL, Leiden, The Netherlands.

Dutch Spelling Guides: 1954, 1990, 1995
The most recent official Dutch spelling guide, compiled by order of the governments of the Netherlands and Belgium, dates from 1954. The Belgian Spelling Resolution of 1946 and the Dutch Spelling Law of 1947 were applied to the Dutch and Flemish vocabulary by a Dutch-Belgian spelling committee consisting of 12 experts in the field.
In the past decades, this spelling has been considered too complicated. New spelling principles were proposed by several official and unofficial committees, without success until October 1994, when the Dutch and Belgian governments agreed on moderate principles for a spelling revision. A new guide is being compiled by the Institute for Dutch Lexicology INL by order of the Dutch-Belgian government body `Nederlandse Taalunie', and will be published in printed and in electronic form by the `Staatsdrukkerij en Uitgeverij' SDU.
In the meantime, in 1990, the INL and the SDU published an unofficial spelling guide, including the ca. 65,000 entries of the 1954 guide and an additional ca. 30,000 new entries, which for the most part represent words that have come into use since 1954. INL was responsible for the contents of the guide, SDU for its publication. The division of the revenues is laid down by contract.

Dutch Spelling Guides 1990, 1995 and Language Resources
The spelling guides not only list entries with their correct orthography, but also provide information on spelling variants, hyphenation, gender, conjugation and inflection, etc. Both the selection of entries (macrostructure) and the contents of the information categories per entry (microstructure) are determined by evidence from a collection of electronic written language resources, containing over 150 million words, available at INL. The resources include three text corpora (5, 27 and 50 million words, resp.) which are linguistically annotated for headword and part of speech (POS) and accessible on these parameters by a retrieval program (cf. demo '27 Million Words Corpus of Dutch Newspaper Texts via Internet'). The word forms in the additional textual resources still had to be lemmatized, and the texts made accessible, for this purpose. The main criteria for the empirical basis of the information in the guides are frequency and coverage.
INL acquires the textual materials from several publishing houses on a contract basis. Because the publishing houses use different systems for text preparation, the acquired texts come in different formats. The texts had to be converted, stripped of information not relevant to this application, and formally harmonized to some extent, so as to make them suitable as input for further processing and consultation.

Future cooperation
Apart from this project, the INL resources have proven to be of interest for other product development projects of commercial companies. Future cooperation could be supported and improved by more uniform standards for text preparation, data exchange and consultation of linguistic data.

Gabor Proszeky (Morphologic, Budapest):
HUMOR, a Morphological System for Corpus Analysis

Humor, a reversible, string-based unification approach to lemmatization and disambiguation, has been introduced both for corpus analysis at the Research Institute for Linguistics and for creating a variety of other lingware applications for the general public, such as spell-checking and hyphenation. The system is language independent and therefore supports multilingual applications: besides agglutinative languages (e.g. Hungarian, Turkish) and highly inflectional languages (e.g. Polish, Rumanian) it has been applied to languages of major economic and demographic significance (e.g. English, German, French).
The basic strategy of Humor is inherently suited to parallel execution: searches in the main dictionary, the secondary dictionaries and the affix dictionaries can run simultaneously. Moreover, in the near future it will be extended with a disambiguator based on the same strategy. This new parallel processing method for levels above morphology is called HumorESK (Humor Enhanced with Syntactic Knowledge). Both Humor and HumorESK follow a very simple and clear strategy based on surface-only analyses, with no transformations; all the complexity of the systems is hidden in the graphs describing morpho-syntactic behavior.
Humor has been rigorously tested by "real" end-users. The Hungarian version has been used in everyday work since 1991 both by lexicographers and other researchers of the Research Institute for Linguistics of the Hungarian Academy of Sciences, and by users of word-processing tools (Humor-based linguistic modules have been licensed by Microsoft, Lotus, Inso and other software developers). The lemmatizer shares some of the extra features of Helyes, the speller derived from Humor, because lexicographers need a fault-tolerant lemmatizer that can cope with simple orthographic errors and frequent mistypings. It is also useful in analyzing Hungarian texts from the 19th century, when Hungarian orthography was not yet standardized.
Humor's Hungarian version, the largest and most precise implementation, contains nearly 100,000 stems, which cover all (approx. 70,000) lexemes of the Concise Explanatory Dictionary of the Hungarian Language. The suffix dictionaries contain all the inflectional suffixes and the productive derivational morphemes of present-day Hungarian. With the help of these dictionaries Humor is able to analyze and/or generate several billion(!) different well-formed Hungarian word-forms. The whole software package is written in standard C using C++-like objects. It runs on any platform for which a C compiler is available.
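The idea of a surface-only analysis driven by stem and suffix dictionaries can be illustrated with a toy sketch in Python. This is not Humor's algorithm or data format: the two small dictionaries, the single `harmony` feature used as a stand-in for unification, and all function names are assumptions made for illustration, and suffix ordering is deliberately not modelled.

```python
# Toy stem dictionary: each stem carries a category and a vowel-harmony feature.
STEMS = {
    "ház": {"cat": "N", "harmony": "back"},    # 'house'
    "kert": {"cat": "N", "harmony": "front"},  # 'garden'
}

# Toy suffix dictionary: each suffix carries a tag and the harmony it requires.
SUFFIXES = {
    "ak": {"tag": "PL", "harmony": "back"},
    "ek": {"tag": "PL", "harmony": "front"},
    "ban": {"tag": "INE", "harmony": "back"},
    "ben": {"tag": "INE", "harmony": "front"},
}


def segment_suffixes(rest, harmony):
    """Yield every split of `rest` into known suffixes that agree in harmony."""
    if not rest:
        yield []
        return
    for end in range(1, len(rest) + 1):
        entry = SUFFIXES.get(rest[:end])
        if entry and entry["harmony"] == harmony:
            for tail in segment_suffixes(rest[end:], harmony):
                yield [(rest[:end], entry["tag"])] + tail


def analyze(word):
    """Return all stem + suffix analyses of a surface word form."""
    analyses = []
    for cut in range(1, len(word) + 1):
        stem = STEMS.get(word[:cut])
        if stem is None:
            continue
        for suffixes in segment_suffixes(word[cut:], stem["harmony"]):
            tags = "".join(f"+{tag}" for _, tag in suffixes)
            analyses.append(f"{word[:cut]}[{stem['cat']}]{tags}")
    return analyses


print(analyze("házakban"))  # 'in houses'
print(analyze("kertben"))   # 'in (the) garden'
print(analyze("kertban"))   # harmony mismatch, so no analysis is found
```

Because only the surface string is segmented and all linguistic knowledge sits in the dictionaries, the same lookup code would serve another language once its dictionaries are swapped in, which is the property the description above emphasizes.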

Primoz Jakopin (Ljubljana):

The two partners involved are Slovenske zeleznice, the Slovenian Railway (Railway Traffic Institute), and the Institute for the Slovenian Language at the Scientific Research Centre of the Slovenian Academy of Sciences and Arts. Work on the project, the Dictionary of Railway Terminology (Zelezniski terminoloski slovar), began in January 1994 and is to be completed by the end of 1998.
The dictionary is part of a larger European undertaking, Rail-lex Europe, carried out through the coordinated efforts of 29 members of the UIC, Union internationale des chemins de fer (International Union of Railways). UIC consists of 97 railway and other transport organizations from Europe and other parts of the world. The Rail-lex project has so far produced, in 1994, an 11-language CD-ROM, Rail Lexic, with over 12,000 keywords (English, German, French, Italian, Spanish, Esperanto, Hungarian, Dutch, Polish, Portuguese, Swedish). Its aim is to put together a modern, multilingual communications infrastructure, to promote links among railways and between railways and industry, research and commerce, and to contribute to the standardization of railway terminology. Rail-lex is coordinated by UIC's European Rail Research Institute (ERRI), based in the Netherlands.
On the Slovenian side, the head of the project is mag. Peter Verlic, leader of the team at the Railway Traffic Institute in Ljubljana. He is aided by Marjan Vrabl, who leads the team in Maribor, the second largest Slovenian city, where a new set of railway codes, manuals and other documentation is being prepared. Since Slovenia became independent in 1991, changes have been needed to bring the railway closer to UIC's standards. The bulk of the keywords from Rail Lexic have now been translated; together with additional keywords reflecting the social and other specific circumstances of the Slovenian railway, they form the first draft of a 15,000-keyword dictionary. The draft will be open to criticism from railway staff and a wider audience until the end of 1996, by which time a revision by the Institute for the Slovenian Language will also be completed.

Norbert Volz (Institut für deutsche Sprache, Mannheim,

CORDON, a multinational concerted project jointly carried out by academic and industrial partners, aims to provide a modular, language-independent client/server software solution for the automatic detection of neologisms (new words or multi-word units denoting new concepts) in texts using monitor corpora.
New concepts reflecting changes in culture, society, industry and science quickly leave their mark on language. New words or multi-word units emerge, enabling these concepts to be integrated into the communication process. Identifying and documenting those changes is therefore of major importance for keeping language resources, language processing tools and terminology databases up to date.
Monitor corpora can be used to recognise and trace the changing patterns of collocations and similar phenomena that give clues to the emergence of new terms. Basically, two types of tools are needed for this purpose:
- a tool to correlate lexical and terminological items with temporal intervals, based on frequency and distribution over text types, using statistical methods such as χ²-tests to assess the significance of noticeable irregularities in the distribution of words in a corpus within a certain time
- a statistics-driven tool to establish context patterns for lexical and terminological items, reflecting their various usages, e.g. by examining the verbal environment of recurring instances of words and looking for repetitions and regularities within that environment.
A combination of these tools working on monitor corpora will enable the identification of "candidates" for neologisms, which then can be listed and processed for further analyses and applications.
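As an illustration of the first kind of tool, the sketch below applies a Pearson χ² test to a 2x2 table comparing a word's frequency in a recent time slice of a monitor corpus with its frequency in earlier material. It is a minimal Python sketch under stated assumptions, not the CORDON design: the function names, the 5% significance threshold and the toy counts are all invented for illustration.

```python
def chi_square_2x2(word_in_period, total_in_period, word_elsewhere, total_elsewhere):
    """Pearson chi-square statistic for a 2x2 table:
    (this word / all other tokens) x (recent period / earlier material)."""
    a = word_in_period
    b = total_in_period - word_in_period
    c = word_elsewhere
    d = total_elsewhere - word_elsewhere
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Critical value of chi-square with 1 degree of freedom at p = 0.05.
CRITICAL_VALUE_P05 = 3.841


def neologism_candidates(recent_counts, recent_total, earlier_counts, earlier_total):
    """List words whose relative frequency has risen significantly."""
    candidates = []
    for word, recent in recent_counts.items():
        earlier = earlier_counts.get(word, 0)
        # Keep only words that became more frequent, not less.
        if recent / recent_total <= earlier / earlier_total:
            continue
        score = chi_square_2x2(recent, recent_total, earlier, earlier_total)
        if score > CRITICAL_VALUE_P05:
            candidates.append((word, round(score, 1)))
    return sorted(candidates, key=lambda item: -item[1])


# Invented counts for two time slices of 10,000 tokens each.
recent = {"internet": 40, "train": 50}
earlier = {"internet": 2, "train": 48}
print(neologism_candidates(recent, 10_000, earlier, 10_000))
```

With these toy counts only "internet" is flagged, since the small rise of "train" is well within chance variation. In practice the test would be run per text type as well as per period, and very rare words would need a more robust statistic, because χ² is unreliable when expected counts are small.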
The envisaged software product will be a generic, modular solution that makes minimal assumptions, so that any user can adapt it to their own texts and corpora regardless of language. Possible applications will mainly be within lingware products, e.g. machine translation systems, multilingual termbanks, databases etc. CORDON will also prove useful for the automatic updating and expansion of natural language lexicons and translation memories.
The project consortium consists of four academic and four industrial partners. The academic partners will provide research facilities and staff. The industrial partners will be responsible for project management, supervision, validation, evaluation and assessment of the final product in order to guarantee maximum response to user needs.
The project will run for two years. At the end of this phase, the result of the CORDON project will be a demonstrable, robust prototype that will work on existing applications and corpora.
The proposal for this project will be handed in under the current TELEMATICS call within the 4th Framework Programme of the European Commission.

Elena Paskaleva (Sofia):

The CEU RSS (Central European University-Research Support Scheme) has sponsored a project with 5 participants from 3 countries: 2 from CRLF, 2 from GMS-Berlin and 1 from LML (Linguistic Modeling Laboratory - Bulgarian Academy of Sciences). Limited resources have been granted for applying 10,000 dictionary entries from Ozhegov's Dictionary of the Russian Language to the Russian part of the data in a METAL-type system for Machine Translation.

