Wolfgang Teubert, Coordinator of TELRI
One year ago TELRI was set up by 22 institutions in 17 European
countries. Since then, more institutions have joined, some as members
of the Advisory Board: Belgrade, Zagreb, and, most recently, Moscow. TELRI
is a Concerted Action. Its primary goal is to pool existing language resources,
corpora, and lexicons and to make them available to the growing NLP community.
New resources are being created; all resources will be standardised. Together
they will allow for the development of a new generation of powerful multilingual
language technology applications. They are already used for corpus-based
dictionaries and lexical databases. The TELRI Resources Catalogue is available
on the WWW ( /index.html).
The success of the TELRI network depends on the organisation of national
networks bringing together academic research and commercial language industry.
Therefore, TELRI and the partner project ELSNET Goes East have joined forces
to conduct an in-depth survey of the leading actors in the field of language
technology, their resources, and their activities. Data collection is about
to begin. Results will be made available quickly on the Internet and in printed
form.
Our first European Seminar, "Language Resources for Language Technology",
held in Tihany, Hungary on September 15 and 16, clearly demonstrated
the need for a continuous platform for research and industry. All applications
more sophisticated than spelling checkers rely heavily on linguistic data
and on knowledge extracted from those data. Successful applications are those
with an accuracy rate of more than 95%. The higher your goal on this
scale, the more linguistic knowledge is required. This knowledge is available
at academic research centres. So if you want to upgrade your language
technology products, contact us now.
Meanwhile, we have selected some areas of ongoing research. We set out
to produce electronic text versions of Plato's "Republic" in
as many languages as possible. Nine (including Chinese) are already available.
This small but very helpful parallel corpus will be used to test corpus
alignment software and to find new ways of detecting translation equivalents
of multi-word units, collocations, and phraseologisms.
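To illustrate what such alignment software does, here is a minimal length-based one-to-one sentence aligner in the spirit of Gale and Church: translations of a sentence tend to have similar lengths, so a dynamic program can pair sentences cheaply. This is a sketch only, not the software actually tested on the Plato corpus, and the example sentences are invented.

```python
# Minimal one-to-one sentence aligner driven by character length, in the
# spirit of Gale & Church: translations of a sentence tend to have similar
# lengths. A sketch only -- not the alignment software actually tested on
# the Plato corpus -- and the example sentences are invented.

def align(src, tgt):
    """Align two sentence lists; cost favours pairs of similar length."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    skip = 10  # penalty for leaving a sentence unaligned
    # dp[i][j]: best cost of aligning the first i source / j target sentences
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == INF:
                continue
            if i < n and j < m:  # pair the next two sentences
                cost = dp[i][j] + abs(len(src[i]) - len(tgt[j]))
                if cost < dp[i + 1][j + 1]:
                    dp[i + 1][j + 1] = cost
                    back[i + 1][j + 1] = (i, j, "match")
            if i < n and dp[i][j] + skip < dp[i + 1][j]:  # skip a source sentence
                dp[i + 1][j] = dp[i][j] + skip
                back[i + 1][j] = (i, j, "skip")
            if j < m and dp[i][j] + skip < dp[i][j + 1]:  # skip a target sentence
                dp[i][j + 1] = dp[i][j] + skip
                back[i][j + 1] = (i, j, "skip")
    pairs, i, j = [], n, m          # trace the best path back
    while back[i][j] is not None:
        pi, pj, op = back[i][j]
        if op == "match":
            pairs.append((src[pi], tgt[pj]))
        i, j = pi, pj
    return list(reversed(pairs))

src = ["The city is just.", "Yes.", "And the soul likewise."]
tgt = ["Die Stadt ist gerecht.", "Ja.", "Und ebenso die Seele."]
print(align(src, tgt))  # each sentence pairs with its translation
```

Real aligners add richer costs (2-1 and 1-2 merges, lexical anchors), but the length-correlation core is the same.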
Another area is the validation of textual and lexical resources: here we
want to complement similar activities in Western Europe and to assess generic
tools such as alignment
software. Validation of language resources will have to be based upon linguistic
specifications like morphosyntactic tagsets for corpora or data categories
for lexicons. Some of these are language independent (e.g., parts of speech),
others (e.g., gender) are language specific. TELRI will contribute to a
joint European register of linguistic specifications to be used as a standard
for language resources and generic software. If you would like to hear
more about these activities or to participate in them, please
contact us.
In the first issue of our newsletter we published information about TELRI
partners. For technical reasons, the following descriptions could not
be included there, so we publish them in the present issue.
Institute of Linguistics and Literature
Prof. Dr. Bahri Beci
Tirana, Albania
The Institute of Linguistics and Literature was founded in 1972. Its first
nucleus was the section for language, literature, and folklore of the
Institute of Studies (1946). As its scientific activity expanded, five
sectors were set up: grammar and dialectology; lexicology and lexicography;
terminology; history; and literature and folklore.
The fundamental task of the Institute in the field of linguistics is the
study of the Albanian language and its dialects, both in their current
state and from a historical perspective: in particular, the study of the
national literary language, the elaboration of a scientific grammar, the
compilation of explanatory dictionaries of Albanian, the elaboration of
terminology and the compilation of multilingual terminological dictionaries,
the treatment of onomastic problems, and the cultivation of the language.
Centre of Scientific Research of the Slovene Academy of Arts and Sciences
Institute of the Slovene Language "Fran Ramovs"
Primoz Jakopin
Ljubljana, Slovenia
Under the umbrella project "Lexicology, grammar and dialectology of the
Slovene language," the Institute is currently engaged in research
on the following projects:
(1) the word corpus of the contemporary standard language (the one-volume
dictionary of the contemporary standard Slovene language; the orthographical
dictionary; the reverse dictionary of the contemporary standard Slovene
language, based on the five-volume dictionary of the contemporary
standard Slovene language; the dictionary of fixed expressions of the
contemporary standard Slovene language; synonymy in the contemporary
standard Slovene language; the transfer of the word-list Besedisce
slovenskega knjiznega jezika to machine-readable form; the preparation of
the five-volume dictionary of the contemporary standard Slovene language
for publication in electronic form)
(2) the morphological and word-formational analysis of the contemporary
standard Slovene language
(3) comparative and etymological investigations of the Slovene language
(the publication of volume 3, and the preparation of volume 4, of the
Etymological Dictionary of the Slovene Language)
(4) historical dictionaries (the dictionary of the writings of Slovene
protestant writers of the 16th century; the dictionary of the one-time
standard language based on the Prekmursko dialect)
(5) Dialect atlases and dictionaries (the preparation of dialect maps for
the Slovene linguistic atlas; the participation in the international projects
The Pan-Slavic Linguistic Atlas [OLA] and The European Linguistic Atlas
[ALE]. The preparation of dialect dictionaries: of the Kostelsko dialect,
of the dialect of Zadrecka dolina between Gornji grad and Nazarje. Monographs:
the tonemes in the word-formation of the contemporary standard Slovene
language, contrasted with the dialect of Vnanje Gorice; the dialect of
Kropa)
(6) Terminological dictionaries (the terminological dictionary of law;
the dictionary of general technical terminology; the terminological dictionary
of medicine; the terminological dictionary of veterinary sciences; the
terminological dictionary of railways).
The umbrella project "Lexicology, grammar and dialectology of the Slovene
language" is a long-term research project designed to ensure (a) a
full inventory of Slovene lexical material, and (b) a systematic
analysis and interpretation of the language facts at all levels of grammar,
from a historical as well as from a descriptive point of view. The main
purpose of the umbrella project is to produce the basic publications in the
field of the Slovene language, publications which will (1) deepen the insight
into the Slovene language as it is now and as it was in the past, and (2)
contribute towards the equitable treatment of the Slovene language in
international linguistic circles.
WG1 TELRI USER GROUP
Co-ordinator: Wolfgang Teubert
In this Working Group, we had a very ambitious goal: each TELRI partner
was to engage in three small-scale joint ventures with small and medium-sized
language industry enterprises. In general, our partner was to contribute
the necessary language resources and the required linguistic knowledge,
while the company would produce the result: an NLP application, a dictionary,
or some other product that could be marketed. We wanted to show that
cooperation between research and industry is not only possible but can also
be profitable for both partners involved.
Our goal could not be reached everywhere. In some countries, the commercial
language industry is still in its infancy. The small software houses have
very little money to invest, and they are looking for quick returns. However,
good and solid language technology applications require a long-term effort, and
some companies still have to learn that you need more than a traditional
small dictionary and a smart programmer to develop sophisticated NLP software.
National networks will bring together research and industry, providers
and users of language resources. Their experience, knowledge, and information
will be shared; however, we also need national programmes to encourage
this kind of cooperation and to induce language industry to engage in more
sophisticated applications. Through a number of successful projects, our TELRI
Seminar has demonstrated that cooperation between research and industry
is not only possible but also profitable.
WG2 DOCUMENTATION
Co-ordinator: Ruta Marcinkeviciene
The most important piece of news about WG2 is that it has started coordinating
the collection and documentation of information with the two other projects
involved in the same activities, ELSNET and ELSNET Goes East. Representatives
of all three projects met in London on August 12 and decided to pool
their efforts in creating a widely accessible database of the language
and speech technology groups in industry and academia. They discussed what
sort of information still has to be gathered, where information is available,
and how the task should be approached in a practical way. They also agreed
to develop a common set of questionnaires. In addition, in order to avoid
duplication of activities, to increase the response rate, and to use EC
funds more effectively, it was decided to assign the Western European countries
to ELSNET and to divide up the Central and Eastern European countries between
ELSNET Goes East and TELRI.
WG5 TOOL AVAILABILITY
Co-ordinator: Tomaz Erjavec
The initial work plan for WG5 (Tool Assessment; 1995 - tagger assessment)
was found to be unattainable, due to the lack of human, language, and
computational resources. At the Tihany meeting it was therefore decided to
change the name and tasks of the WG to: WG5 Tool Availability.
WG5 will aim to increase the availability of language engineering tools
by:
_ making available, via WWW, information on existing language engineering tools;
_ providing, via WWW, the public tools of WG5 members and TELRI partners;
_ improving such tools by adapting them to various languages and platforms.
The WG will concentrate on multilingual or language independent tools for
developing and exploiting textual resources.
The above goals will be pursued in cooperation with:
_ WG9 Joint Research (`Cascade' project);
_ WG7 Networking (organisation and utilisation of WWW);
_ WG10 User Needs (tools for corpora validation);
_ Copernicus JP MULTEXT-East.
Current results:
_ a WWW page providing information on various public and commercial language
engineering tools (http://nl.ijs.si/~tomaz/pub_tools);
_ Gothenburg site has started efforts to make their lexicon building tool
robust and publicly available.
Plan:
_ the WG5 Ljubljana site, in cooperation with Prague, will make the MULTEXT
tools publicly available and work on adapting these tools to various languages
and platforms;
_ WG5 Prague site has agreed to make their morphological analyser publicly
available;
_ WG10 plans to use the MULTEXT tools for assessing corpora;
_ WG9 Birmingham site has plans to make available their suite of corpus
handling tools (`Cascade').
WG6 SERVICE POOL
Co-ordinator: Pierre Lafon
Due to specific circumstances, the coordination of this Working Group was
temporarily assigned to Dan Tufis. At short notice he prepared the
meeting in Tihany and afterwards this report on the decisions and agreements
that have been made concerning future work.
The main issues fell into three categories:
_ types of services the TELRI consortium might offer
_ legal aspects concerning services
_ future plans
A. TYPES OF SERVICES THE TELRI CONSORTIUM MIGHT OFFER
This issue was raised mainly in view of the already established associations
with similar aims. It was agreed that the fact that TELRI includes
(either as partners or associates) representatives from most European
countries gives it a very strong advantage in acting as a bridge to the Central
and Eastern European language market for those associations that are almost
exclusively based in Western Europe and overseas countries. Besides the
standard services provided by associations such as ELRA or the LDC, some others
were discussed:
_ evaluation of language resources (including assessment of lingware) for
one's own language
_ porting/extending software to cover missing features
_ implementation teams - design teams
_ linguistic assistance
In a previous phase, IDS Mannheim compiled a list of existing resources
at the sites of the TELRI partners; it was decided that this list will be
updated regularly and further extended with a precise statement of the services
(including rates, fees, copyright issues, etc.) that could be provided by
the holders of these resources.
B. LEGAL ASPECTS CONCERNING THE SERVICES
This issue was quite a hot one, as the legislation (mainly copyright
law) differs considerably among the member countries of the TELRI consortium
(in Romania, there is no copyright law yet). These differences in legislation
might be a source of difficulties in ensuring a uniform service. This is
particularly true of corpus-related services, which involve material subject
to copyright regulations. A specific user agreement (used by INL for
the use of a retrieval system) was discussed in order to identify the important
items that might be included in a generic TELRI service agreement. A piece of
general advice: avoid being too specific. According to most existing
copyright regulations, permission has to be requested for each use not agreed
on in writing. The main ideas (contributed by T. Kruyt, based on
the INL experience) are the following:
KEY ELEMENTS
_ parties involved in the agreement: names of the provider and the user
_ topic of the agreement
_ purpose of the agreement
_ permission
_ time schedule agreement
_ details of delivery
_ guarantees
C. FUTURE PLANS
The members of this WG, with support from all the other members of TELRI,
will survey existing and potential services which could be of interest
both within and outside the project. This information should be
made available to as large an audience as possible. WG6 should act
as a broker for the pooled services. It was decided to dedicate one of the
Newsletters to the services offered by the TELRI consortium.
This special issue of the TELRI Newsletter will also present the available
linguistic resources. By the end of November a questionnaire will be circulated
among the TELRI partners to update the information already collected
and to extend it with new types of services.
WG8 Linking
Co-ordinator: Wolfgang Teubert
This Working Group aims at broadening TELRI as a European platform for
the creation, enrichment, distribution and exchange of high quality monolingual
and multilingual language resources. Working Group Linking is setting up
close operational ties with the recently established PAROLE Consortium,
an association of all leading national language centres in the European
Union. This Consortium has compiled linguistic specifications for generic
written resources and is now creating comparable reference corpora and
lexicons. TELRI will engage in these activities on a complementary basis.
TELRI has also formed relations with Professor Vladislav Mitrofanovich
Andrjuscenko and his Institute for Russian Language, the Russian
centre of corpus linguistics. Furthermore, it has established links with
ELSNET and ELSNET Goes East. Together these projects will conduct an in-depth
survey of actors, data, and activities in the field of language resources
in Central and Eastern Europe (including the New Independent States).
WG10 USER NEEDS
Co-ordinator: Andrej Spektors
The results of the work on surveying potential users and their needs
were discussed. Comparing the results of the distributed questionnaires,
we found some features common to the Central and Eastern European countries.
The existing resources are mostly in plain text file format, and they have not
been validated. Potential users are only just beginning to show interest in
linguistic software. User needs are very close to the results obtained by the
NERC and PAROLE projects. The one outstanding problem in the CEE countries is
that user needs have not yet been strictly formulated. A plan for future work
has been accepted. The plan is based on the development of validation methods.
The possibility of validating corpora according to SGML, TEI, and EAGLES
standards and recommendations is proposed as the main objective.
The European Seminar "Language Resources for Language Technology"
was the first of its kind organized by TELRI. It took place in the Institute
for Limnology in Tihany, Hungary, 15-16 September 1995.
The aim of the seminar was to bring together scholars, software and lingware
developers, and end-users to exchange information. Several state-of-the-art
papers were presented by the invited speakers, covering fields ranging
from speech processing through machine translation to corpus applications.
We even had a presentation on language resources and software in China,
and we could also broaden our knowledge of the American market for linguistic
data. Some talks dealt with other COPERNICUS language engineering projects.
Representatives of the Hungarian government and the European Commission
also gave lectures.
The members of the TELRI user group presented some joint-venture case studies;
among them we saw a modern railway dictionary, two spell-checkers, a new
project for collecting neologisms from corpora, and the use of the Computer
Fund of Russian. We also had a chance to see several demonstrations, some
of them presented by TELRI members or partners, some by external
participants. One of the most challenging was the demonstration of the
LANGMaster Multimedia System for Language Teaching. The more than 70
participants came from 24 countries. Besides TELRI members and partners we had
guests from universities, research institutes, private software companies, and
publishing houses. Beyond the usual advantages of conferences, the seminar
offered a possibility of communication between software developers and
users, and as a result, some business agreements were concluded.
The seminar was a successful attempt to establish a new forum for researchers
in the field of corpus linguistics and natural language processing, and
for the potential users of their results.
Julia Pajzs
DEMONSTRATIONS
Demonstrations of NLP systems of the most varied kinds were one of the most
interesting parts of the Tihany Workshop programme. We publish descriptions
of some of the demonstrated systems, for information on where we are and
for inspiration.
Primoz Jakopin (Ljubljana)
EVA - A TEXTUAL DATA PROCESSING TOOL
A text editing program is presented which has, since 1985, evolved into a
tool used for processing a sizeable number of textual corpora and for the
preparation of dictionaries in the Slovenian academic environment. EVA
started on a Sinclair Spectrum (EVE) and was ported to the ATARI ST
in 1986 (STEVE); the DOS version has been in use since 1991. Porting to
Windows NT/Windows 95 is under way.
EVA has been designed from the start to be as flexible as possible,
to allow accommodation to different needs and situations by the users
themselves. It is more or less self-contained, with its own keyboard, screen
characters, DTP mode, graphics editor, and an OCR facility. To conform to
modern character set standards such as UNICODE, EVA can
process either 8- or 16-bit characters. If a line of text contains only
characters with codes below 256, it is stored, in RAM as well as on disk,
as 8-bit; if, on the other hand, it contains one or more characters with
codes above 255, it is stored as a 16-bit entity. All internal line and
data record buffering is, of course, 16-bit. Database functions include
general-purpose routines such as sorting or searching and more specialized
functions such as splitting text into sentences, wordwise translation
and markup, or computation of entropy.
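The per-line storage rule can be sketched as follows; this is a simplified model of the rule described above, not EVA's actual file format:

```python
# Sketch of the per-line storage rule described above: a line whose
# characters all have codes below 256 is stored as one byte per character;
# a line containing any character with a code above 255 is stored as
# 16-bit units. A simplified model only -- not EVA's actual file format.

def store_line(line):
    """Return (unit width in bits, encoded bytes) for one line of text."""
    if all(ord(ch) < 256 for ch in line):
        return 8, bytes(ord(ch) for ch in line)
    out = bytearray()
    for ch in line:               # 16-bit: low byte, then high byte
        code = ord(ch)
        out += bytes((code & 0xFF, code >> 8))
    return 16, bytes(out)

width, data = store_line("abeceda")     # Latin-only line
print(width, len(data))                 # → 8 7
width, data = store_line("\u017eena")   # "žena": 'ž' has code 382 > 255
print(width, len(data))                 # → 16 8
```

The payoff is the one described in the text: a predominantly 8-bit corpus takes half the space of a uniformly 16-bit one, while lines containing characters beyond code 255 lose nothing.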
Currently EVA is also used in the production of a lemmatization dictionary
of Slovenian, based on the 93,500-entry Dictionary of the Slovenian
Literary Language. So far nouns (54,522 lemmas generating 468,281 word
forms) and adjectives (22,861 lemmas and 277,831 word forms) have been
completed.
Prof. Dr. Elena Paskaleva (lingware) / Bojaka Zaharieva (software) (Sofia):
THE ARDUOUS TAGGING OF HIGHLY INFLECTIVE LANGUAGES
(NON-ENGLISH; NON-LATIN ALPHABET; NO MORPHO AUTOMATION)
The object of the demonstration is the system SUPERLINGUA. SUPERLINGUA
is a tagging tool for highly inflected languages under extreme conditions:
when the morphological component is missing or unusable, or when the language
has not been processed at all. The system is language
independent, the tagging is
flexible and friendly, and a special interface is provided for the optimal
distribution of linguistic knowledge between the system and the user. The
programming language is CLIPPER 5.2 in a DOS environment. The system is
expected to be available in the public domain in 6-9 months in DOS and WINDOWS
environments. The product is made entirely by the software specialists
of the Laboratory of Linguistic Modelling.
Dr. Andrejs Spektors (Riga):
MULTILINGUAL OFFICE-TERM DICTIONARY
Purpose: The system was designed as a tool for people who have problems
with writing documents in foreign languages (Latvian, Russian, English).
How it works: The system is developed for PC computers and works
under DOS. It consists of three items: general term dictionary, thematic
dictionary and case generation tool.
Developer: AI Lab. of Institute of Mathematics and Computer Science,
University of Latvia.
Can be obtained: by agreement with AI Lab
Dr. Andrejs Spektors (Riga):
AUTOMATED CASE GENERATION SYSTEM FOR LATVIAN
Purpose: The system is used to generate wordforms in Latvian for different
purposes, e.g., morphological analyzer, vocabulary for spelling-checker,
computer aided language learning. During the work a database of anomalous
words was developed.
How it works: The system is developed for PC computers and works
under DOS. It includes case generation for nouns, adjectives and verbs.
The system can be used in different modes, i.e., demonstration or learning
mode and wordform generation mode for lexicon.
Developer: AI Lab. of Institute of Mathematics and Computer Science,
University of Latvia.
Can be obtained: by agreement with AI Lab
Dr. Andrejs Spektors (Riga):
MODEL OF LATVIAN MORPHOLOGICAL ANALYSER
AND REDUCTION TO THE BASE FORM
Purpose: The system is developed for further usage in syntactic analysis
and for lexicalization
How it works:The system is developed for PC computers and works
under DOS. It analyses sentences separately and returns base forms of each
word as well as part of speech and grammatical information. For homonyms
all possible solutions are produced.
Developer: AI Lab. of Institute of Mathematics and Computer Science,
University of Latvia.
Can be obtained: by agreement with AI Lab
Dr. Andrejs Spektors (Riga):
ELECTRONICALLY TRACTABLE LATVIAN DICTIONARY
Purpose: providing a user interface to a monolingual Latvian-Latvian dictionary.
How it works: The dictionary works under MS Windows. The list of
all words in the dictionary is presented to the user and can be scrolled
or incrementally searched for some word. In another window area dictionary
entries can be viewed, and the user can easily get possible base form(s)
of any word form present in dictionary entries.
Developers: New Mexico State University, U.S.A. (Prof. J. Reinfelds),
University of Latvia, Department of Baltic Languages, AI Lab of the Institute
of Mathematics and Computer Science, University of Latvia.
Can be obtained: by agreement with NMSU
Jan Laciga (ByllBase, Prague):
BYLLBASE - A FULL-TEXT RETRIEVAL SYSTEM USING LINGUISTIC METHODS
There are principally two approaches to the task of information retrieval
of textual data: (i) to select the text according to indexes (key words)
assigned to each text, or (ii) to retrieve a word or combination of words
directly in the texts and thus to select documents where the issues referred
to by the given (string of) words are discussed.
The system developed by our company belongs to type (ii), which we
consider to be more convenient for large-scale applications. We had to
develop a system specifically designed for Czech, because the available
systems, built mostly for English, are not applicable: the inflectional
character of Czech (in contrast to English) raises problems connected with
the abundance of forms of a single lexical item.
The first commercially available text retrieval system for Czech,
called ByllBase, has been developed in cooperation with the computational
linguistics group at Charles University in Prague; its special feature is
the integration of a lemmatizer of Czech into the system. This lemmatizer
also makes it possible to distinguish among homonyms. It enables the
user to formulate queries in a natural form, speeds up the whole
process, and lowers the memory requirements for the auxiliary
files. At the present stage, we are making amendments to the semantic analysis
to make it possible to distinguish among homonyms without human
intervention.
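The principle behind lemma-based retrieval can be sketched as follows; the tiny lemma table and documents are invented for illustration, and ByllBase's real Czech lemmatizer and index format are of course far more elaborate:

```python
# Sketch of lemma-based full-text retrieval: documents are indexed under
# lemmas rather than surface word forms, so a query in the base form also
# matches inflected variants. The tiny lemma table and documents below are
# invented; ByllBase's real Czech lemmatizer and index are far richer.
from collections import defaultdict

# toy lemma table: a few Czech forms of "banka" (bank) and "zakon" (law)
LEMMAS = {"banky": "banka", "bance": "banka", "banku": "banka",
          "banka": "banka", "zakon": "zakon", "zakony": "zakon"}

def lemmatize(word):
    """Map a surface form to its lemma (identity if unknown)."""
    return LEMMAS.get(word.lower(), word.lower())

def build_index(docs):
    """Build an inverted index: lemma -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[lemmatize(word)].add(doc_id)
    return index

def search(index, query):
    """Find documents containing any inflected form of the query's lemma."""
    return sorted(index.get(lemmatize(query), set()))

docs = {1: "nova banka", 2: "v bance", 3: "zakony a banku"}
idx = build_index(docs)
print(search(idx, "banka"))  # → [1, 2, 3]: every inflected form matches
```

A query for the base form *banka* thus retrieves documents containing *banky*, *bance*, or *banku*, which is exactly what a surface-string index cannot provide for an inflectional language.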
ByllBase is nowadays used at such large institutions as the Czech savings
bank Èeská spoøitelna, the Czech National Bank, the
city councils of Brno and Bratislava, some industrial plants, editorial
offices, etc.
One of the successful installations of ByllBase is the legal system ASPI,
one of the most comprehensive and widespread automatic retrieval systems for
legal documents in the Czech Republic and Slovakia, which contains Czech
legal documents and legal literature since 1811.
The system was developed by ByllBase in close cooperation with the researchers
of the computational linguistics team at Charles University, Prague.
Prof. Dr. Dan Tufis (Bucharest):
UNIFICATION-BASED IMPLEMENTATION OF A WIDE-COVERAGE ROMANIAN MORPHOLOGY
MAC-ELU is an integrated unification-based system aimed at developing
reversible linguistic descriptions. It consists of a morphological
analyser/generator, a chart parser, a head-driven generator, and a transfer
module, all of them relying on unification mechanisms for dealing with
grammatical constraints.
The morphological processor works on a continuation-class basis, with
the usual clustering of morphemes into distinct dictionaries (called
continuation classes). Successful transitions from one cluster to another,
corresponding either to the analysis of a word-form structure or to the
generation of a word form by concatenation, are constrained by specific
restrictions introduced by means of a powerful macro-definition mechanism.
The implementation of Romanian morphology (and NP analysis/generation) will
be exemplified and the structure of the lexical entries will be discussed.
Further development plans will be mentioned.
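The continuation-class mechanism can be sketched as follows; the entries and class names are invented for illustration and are not MAC-ELU's actual lexicons, which also attach unification constraints to each transition (omitted here):

```python
# Sketch of continuation-class morphology: morphemes are clustered into
# small lexicons (continuation classes), and each entry lists the classes
# that may legally follow it. The Romanian-flavoured entries and class
# names are invented; MAC-ELU's real lexicons also carry unification
# constraints on features, omitted here.

# class name -> {morpheme: (gloss, classes that may follow)}
CLASSES = {
    "ROOTS": {"lup": ("wolf", ["NOUN_SUFF", "END"]),
              "cas": ("house", ["NOUN_SUFF", "END"])},
    "NOUN_SUFF": {"i": ("plural", ["END"]),
                  "ul": ("def.article", ["END"])},
}

def analyse(word, cls="ROOTS", path=()):
    """Return every segmentation of `word` reachable via the classes."""
    if cls == "END":
        return [path] if not word else []   # accept only if fully consumed
    results = []
    for morph, (gloss, conts) in CLASSES[cls].items():
        if word.startswith(morph):
            rest = word[len(morph):]
            for nxt in conts:
                results += analyse(rest, nxt, path + ((morph, gloss),))
    return results

print(analyse("lupi"))   # → [(('lup', 'wolf'), ('i', 'plural'))]
print(analyse("casul"))  # → [(('cas', 'house'), ('ul', 'def.article'))]
```

Because the same table is traversed left to right for analysis and can be enumerated for generation, the description stays reversible, which is the point of the continuation-class design.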
The ELU system was developed by ISSCO (Rod Johnson, Mike Rosner, Graham
Russel, Afzal Balim, Amy Winarske), running in ALLEGRO COMMON LISP on
SUN machines. The MAC-ELU version was ported to Macintosh Common
Lisp running on Macintoshes by Dan Tufis and Octav Popescu. The ported
version includes some new facilities and a menu-based interface, and it was
code-optimized in order to ensure a reasonable response time on the smaller
machines. The Romanian morphology was implemented by Dan Tufis, Octav Popescu,
Lidia Diaconu, Calin Diaconu, and Ana-Maria Barbu. Most of the NP rule set
is due to Lidia Diaconu, Calin Diaconu, and Cristian Dumitrescu.
For information on how to obtain this system, contact Dan Tufis, e-mail:
tufis@u1.ici.ro
Prof. Dr. Dan Tufis (Bucharest):
GULiveR: A GENERALIZED UNIFICATION LR PARSER
This is an extension of Tomita's GLR parser, with the significant departure
from the original algorithm that it works with feature-based grammars. Also,
the data structures introduced by Tomita to take care of conflicting
entries in the LR tables (graph-structured stack, packed shared parse forests)
have been enhanced to allow for non-monadic grammar categories (DAGs).
To deal with the complexity problems raised by the unification extension
of the GLR parser, we used special data structures (virtual copying vectors).
Although test data are not available yet, we expect GULiveR to provide
very good response times in real applications.
This parser was developed in 1992 by Dan Tufis and Octav Popescu. The initial
implementation (Golden Common Lisp for PC) was ported this year to Macintoshes
(MCL 2.0.1) by Stefan Bruda and Mihai Ciocoiu.
Prof. Dr. Dan Tufis (Bucharest):
PARSING PORTABLE LABORATORY
This demo, based on a larger educational software package called PAIL, is a
nice tutorial system for learning about parsing. It presents two different
paradigms: the old procedural ATN approach and a declarative one supported
by a parametrized chart parser.
The graphical interface, the steppers, the browsers, animated graphics,
demos, and on-line documentation make this system a very effective educational
tool, highly appreciated by students. The larger PAIL system (which includes,
besides the NLP modules, several other interesting Artificial Intelligence
systems - theorem proving, rule-based systems, neural networks, back
propagation, constraint satisfaction programming, inductive learning, genetic
algorithms) was initially implemented at IDSIA-Lugano by Rod Johnson, Mike
Rosner, Paolo Cattaneo, and Fabio Baj (with contributions from others) in
Allegro Common Lisp on SUN workstations. The system was ported to MCL for
Macintoshes by a joint team consisting of Mike Rosner and Paolo Cattaneo from
IDSIA, and Dan Tufis, Octav Popescu, Stefan Trausan, and Adrian Boangiu from
the Romanian Academy and ICI.
Prof. Dr. Dan Tufis (Bucharest):
KRIL - A KNOWLEDGE REPRESENTATION INTERFACE
TO AN INTERLINGUAL NATURAL LANGUAGE GENERATOR
This system is based on a natural language generator (ALLP) developed
by Sue Felshin and Stuart Malone of MIT's ATHENA group. ALLP takes as input
a highly verbose interlingual representation of a syntactic structure (GB
flavoured) and produces natural language text. KRIL allows for a highly
conceptually specified input. On the basis of the linguistic information
already present in the lexicon, it automatically generates the lengthy
structures needed by ALLP. The overhead added by KRIL is less than 10% of the
overall generation time (the average response time is below 1 second for a
6-8 word sentence). The KRIL generator makes the linguistic processing fully
transparent to a client application (such as, for instance, an intelligent
tutoring system for second language learning). The KRIL interface was
implemented by Dan Tufis.
Jan Prùcha (Dr. Lang Group, Prague):
THE LANG MASTER TEACHING SYSTEM
The LANG Master Teaching System is a system of computer programs and
instructions designed for the teaching of foreign languages by means of a
computer. The whole project consists of three main components:
1. LANG Master Technology
LANG Master Technology is a procedure that prepares data for LANG Master
Presentation. Specifically, it converts a chosen language course or dictionary
from book form into the form of computer data.
2. LANG Master Presentation
LANG Master Presentation is a powerful multimedia application designed
to present LANG Master computer courses and dictionaries for the teaching
of foreign languages.
3. RE-WISE Method
The aim of the method is to keep in the student's memory all the expressions
learnt and, at the same time, to minimise the frequency of revision.
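The RE-WISE method itself is not specified in this newsletter; the sketch below shows only the generic expanding-interval idea behind minimising revision frequency, with invented parameters:

```python
# Generic expanding-interval revision scheduler: each correct recall
# multiplies the time until the next revision, so well-retained items
# are revised ever less often, while forgotten items come back the next
# day. All parameters are invented for illustration; they are not the
# RE-WISE method's actual values.

def next_interval(interval_days, correct, factor=2.5):
    """Days until the next revision of one expression."""
    if not correct:
        return 1                                # forgotten: revise tomorrow
    return max(1, int(interval_days * factor))  # known: stretch the gap

def schedule(results, start=1):
    """Simulate a sequence of recall results; return successive intervals."""
    intervals, interval = [], start
    for correct in results:
        interval = next_interval(interval, correct)
        intervals.append(interval)
    return intervals

print(schedule([True, True, True, False, True]))  # → [2, 5, 12, 1, 2]
```

Whatever the exact formula, the effect is the one the text describes: the total number of revisions per expression falls sharply once an item is reliably retained.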
Vladimír Benko (Bratislava):
Concise Dictionary of the Slovak Language
_ Electronic Version
The Concise Dictionary of the Slovak Language (KSSJ, Krátky
slovník slovenského jazyka, VEDA, Bratislava 1987) is a one-volume
everyday-use explanatory dictionary of present-day Slovak, covering some
55,000 headwords (36,000 main entries). The last 'paper' edition appeared
in 1979.
The electronic version of KSSJ is based on the typesetting tape of the
dictionary's first edition, which was transformed into a (slightly)
tagged MRD form, (extensively) validated and (manually) updated to match
the second printed edition. The reformatted data have been indexed by means
of the WordCruncher corpus processing package.
The current MRD version of KSSJ has been used as one of the reference sources
for compiling the new Slovak Synonyms Dictionary (Synonymický slovník
slovenčiny, VEDA, Bratislava, in print), as a tool for various research
projects in Slovak lexicology, and as teaching material at the Comenius
University's Faculty of Education. A CD-ROM version of KSSJ is under
consideration; it would appear simultaneously with the third edition of
KSSJ, which is being prepared for the end of 1996.
Dr. Truus Kruyt (Leiden):
ACCESS TO A LINGUISTICALLY ANNOTATED 27 MILLION WORD CORPUS
OF DUTCH NEWSPAPER TEXTS VIA INTERNET
The Institute of Dutch Lexicology INL is a research institute subsidized
by the Dutch and Belgian governments. Corpus development at the INL dates
from the mid-seventies. Up to 1990, the INL text corpora were mainly developed
for lexicographical purposes. Presently, they are used for a broad variety
of research and applications. INL text corpora of present-day Dutch include
two linguistically annotated corpora which can be consulted via Internet:
the 5 Million Words Corpus 1994, which covers a variety of topics and text
types, and the 27 Million Words Newspaper Corpus 1995. The retrieval program
developed for the latter will be demonstrated.
Characteristics of the 27 Million Words Newspaper Corpus 1995:
The newspaper texts, dating from 1994 and 1995, were obtained in machine-readable
form, on a contract basis with the publishing company. The contract specifies
the conditions of use. The texts were input for automatic linguistic encoding.
Part of speech (POS) and headword were automatically assigned to the word
forms in the electronic texts by a lemmatizer/POS-tagger developed by the
INL. Most of the data has not been corrected, either on the level of the
text proper or on the level of POS and headword. The linguistically encoded
texts were loaded into an on-line retrieval system developed by the INL.
Queries may concern the whole corpus, or a subcorpus defined by the user
along the parameters year and month of publication. The system allows the
user to search for single words or word patterns, including some, still
rather primitive, predefined syntactic patterns which can be revised by
the user. Search definitions may include references to word forms, POS
and head words, both separately and in combination by use of Boolean operators
and proximity searches. The output is most often a list of items or a
series of concordances with a user-defined context size. With limitations
due to copyright, the output of searches can be transferred to the user's
computer by e-mail (it is not allowed to transfer complete texts or substantial
text fragments). Among the other facilities are the use of wild cards and
various sorting facilities.
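The INL retrieval system itself is not publicly documented, but the kind of search it offers can be sketched as a minimal KWIC (concordance) lookup over a toy tagged corpus. The token format (word form, headword, POS) and the `concordance` function below are illustrative assumptions, not the INL interface.

```python
# Minimal sketch of concordance (KWIC) retrieval over a POS-tagged corpus.
# The token triples (word form, headword, POS) and the query parameters are
# illustrative assumptions; the INL system itself is proprietary.

corpus = [
    ("De", "de", "ART"), ("banken", "bank", "N"), ("verhogen", "verhogen", "V"),
    ("de", "de", "ART"), ("rente", "rente", "N"), (".", ".", "PUNC"),
    ("De", "de", "ART"), ("rente", "rente", "N"), ("stijgt", "stijgen", "V"),
]

def concordance(corpus, word=None, headword=None, pos=None, context=2):
    """Return KWIC lines for tokens matching all given criteria (ANDed)."""
    hits = []
    for i, (w, h, p) in enumerate(corpus):
        if word and w.lower() != word.lower():
            continue
        if headword and h != headword:
            continue
        if pos and p != pos:
            continue
        left = " ".join(t[0] for t in corpus[max(0, i - context):i])
        right = " ".join(t[0] for t in corpus[i + 1:i + 1 + context])
        hits.append(f"{left} [{w}] {right}")
    return hits

for line in concordance(corpus, headword="rente", context=2):
    print(line)
```

Combining the `word`, `headword` and `pos` parameters corresponds to the Boolean AND of search criteria described above; the `context` parameter plays the role of the user-defined context size.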
Access to the 27 Million Words Newspaper Corpus 1995:
Consultation of the corpus is free for non-commercial purposes. Please
contact the director of the INL, Prof. dr. P.G.J. van Sterkenburg, about
the conditions for commercial applications. To get access to the corpus,
an individual user agreement
has to be signed. An electronic user agreement form can be obtained from
our mailserver Mailserv@Rulxho.Leidenuniv.NL. Type in the body of your
e-mail message: SEND [27MLN95]AGREEMNT.USE. Please make a hard copy of
the agreement form, sign it, keep a copy yourself, and return a signed
copy to: Institute for Dutch Lexicology INL, P.O. Box 9515, 2300 RA Leiden.
After receipt of the signed user agreement, you will be informed about
your username and password. Use of a VT 220 (or higher) terminal, or an
appropriate terminal-emulator (e.g. Kermit) is recommended. If you need
additional information, please send an e-mail message to Helpdesk@Rulxho.Leidenuniv.NL,
or send a fax to Mrs. dr. J.G. Kruyt (31 71 27 2115).
JOINT VENTURE STUDIES
One of the aims of the TELRI project is to promote cooperation between
academia and industry. Contributions devoted to some joint ventures that
were presented at the Tihany seminar have shown the usefulness of such
cooperation.
Dr. Truus Kruyt (Leiden):
A NEW DUTCH SPELLING GUIDE
Dr. Truus Kruyt and Prof. Dr. Sterkenburg,
Institute for Dutch Lexicology INL, Leiden, The Netherlands.
Dutch Spelling Guides: 1954, 1990, 1995
The most recent official Dutch spelling guide, commissioned by the
governments of the Netherlands and Belgium, dates from 1954. The Belgian
Spelling Resolution of 1946 and the Dutch Spelling Law of 1947 were applied
to the Dutch and Flemish vocabulary by a Dutch-Belgian spelling committee
consisting of 12 experts in the field.
In the past decades, this spelling came to be considered too complicated. New
spelling principles were proposed by several official and unofficial committees,
without success, until October 1994, when the Dutch and Belgian governments
agreed on not too radical principles for a spelling revision.
A new guide is being compiled on behalf of the Dutch-Belgian government
body 'Nederlandse Taalunie' by the Institute for Dutch Lexicology INL,
and will be published in printed and in electronic form by the 'Staatsdrukkerij
en Uitgeverij' SDU.
In the meantime, in 1990, the INL and the SDU published an unofficial spelling
guide, including the ca. 65,000 entries of the 1954 guide and additionally
ca. 30,000 new entries, which for the most part represent words that have
come into use since 1954. INL was responsible for the contents of the guide,
SDU for its publication. The division of the revenues is established by
contract.
Dutch Spelling Guides 1990, 1995 and Language Resources
The spelling guides not only list entries with their correct orthography,
but also provide information on spelling variants, hyphenation, genus,
conjugation and inflexion, etc. Both the selection of entries (macrostructure)
and the contents of the information categories per entry (microstructure)
are determined by evidence coming from a collection of electronic written
language resources, containing over 150 million words, available at INL.
The resources include three text corpora (5, 27 and 50 million words, resp.)
which are linguistically annotated for headword and part of speech (POS)
and accessible on these parameters by a retrieval program (cf. demo '27
Million Words Corpus of Dutch Newspaper Texts via Internet'). The word
forms in the additional textual resources still needed to be lemmatized,
and the texts to be made accessible for the purpose. The main criteria for
the empirical basis of the information in the guides are frequency and
coverage.
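Frequency and coverage as selection criteria can be sketched as follows. The thresholds and the (lemma, source) input format are illustrative assumptions, not INL's actual procedure.

```python
from collections import defaultdict

# Sketch of frequency-and-coverage selection of candidate entries. The
# thresholds and the (lemma, source_id) data format are illustrative
# assumptions, not INL's actual criteria or values.

def select_entries(occurrences, min_freq=5, min_sources=2):
    """Lemmas that are both frequent and spread over several sources."""
    freq = defaultdict(int)
    sources = defaultdict(set)
    for lemma, src in occurrences:
        freq[lemma] += 1
        sources[lemma].add(src)
    return sorted(l for l in freq
                  if freq[l] >= min_freq and len(sources[l]) >= min_sources)

data = [("computer", "krant-A")] * 4 + [("computer", "krant-B")] * 3 \
     + [("zwerfwoord", "krant-A")] * 6     # frequent, but in one source only
print(select_entries(data))   # only "computer" meets both criteria
```

Requiring spread over several sources filters out words that are frequent only because a single text happens to repeat them, which is the point of using coverage alongside raw frequency.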
INL acquires the textual materials from several publishing houses on a
contract basis. Due to the use of different systems for text preparation
by the publishing houses, the acquired texts have different formats. The
texts had to be converted, filtered of information not relevant to this
application, and formally harmonized to some extent, so as to make them
appropriate as input for further processing and consultation.
Future cooperation
Beyond this project, the INL resources have proven to be of interest for
other product development projects of commercial companies. Future cooperation
could be supported and improved by more uniform standards, at the levels
of text preparation, data exchange and consultation of linguistic data.
Gabor Proszeky (Morphologic, Budapest):
HUMOR, a Morphological System for Corpus Analysis
Humor, a reversible, string-based unification approach to lemmatizing
and disambiguation, has been introduced both for corpus analysis in the
Research Institute for Linguistics and for creating a variety of other lingware
applications, like spell-checking, hyphenation, etc., for the wider public.
The system is language independent, that is, it allows multilingual applications:
besides agglutinative languages (e.g. Hungarian, Turkish) and highly inflectional
languages (e.g. Polish, Romanian), it has been applied to languages of major
economic and demographic significance (e.g. English, German, French).
The basic strategy of Humor is inherently suited to parallel execution.
Search in the main dictionary, secondary dictionaries and affix dictionaries
can happen simultaneously. Moreover, in the near future it is going
to be extended with a disambiguator based on the same strategy: a new
parallel processing method operating at levels higher than morphology,
called HumorESK (Humor Enhanced with Syntactic Knowledge). Both Humor and
HumorESK have a very simple and clear strategy based on surface-only analyses;
no transformations are used, and all the complexity of the systems is hidden
in the graphs describing morpho-syntactic behavior.
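The stem-plus-suffix lookup that Humor performs (in parallel, in the real system) can be sketched sequentially. The toy dictionaries, tag names and the single-suffix restriction below are simplifying assumptions; the real system handles suffix sequences and lexicons of tens of thousands of stems.

```python
# Sketch of surface-only morphological analysis with separate stem and
# suffix dictionaries, in the spirit of Humor. The tiny dictionaries, tag
# names and single-suffix restriction are simplifying assumptions.

stems = {"ház": "N", "könyv": "N", "olvas": "V"}         # stem -> category
suffixes = {"ak": ("N", "PL"), "ban": ("N", "INE"),      # suffix -> (attaches to, tag)
            "om": ("V", "1SG"), "": (None, "")}

def analyze(word):
    """Split word into stem + suffix pairs licensed by both dictionaries."""
    analyses = []
    for cut in range(len(word), 0, -1):
        stem, suffix = word[:cut], word[cut:]
        if stem in stems and suffix in suffixes:
            cat, tag = suffixes[suffix]
            # surface-only check: the suffix must attach to the stem's category
            if cat is None or cat == stems[stem]:
                analyses.append((stem, stems[stem], tag or "BASE"))
    return analyses

print(analyze("házban"))   # [('ház', 'N', 'INE')]
```

Because the stem and suffix lookups are independent dictionary searches, they can run concurrently in a parallel implementation, which is the property the paragraph above describes.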
Humor is rigorously tested by "real" end-users. The Hungarian
version has been used in everyday work since 1991 both by lexicographers
and other researchers of the Research Institute of Linguistics of the Hungarian
Academy of Sciences, and by users of word-processing tools (Humor-based linguistic
modules have been licensed by Microsoft, Lotus, Inso and other software
developers). The lemmatizer shares some of the extra features of Helyes,
the speller derived from Humor, because lexicographers need a fault-tolerant
lemmatizer that is able to overcome simple orthographic errors and frequent
mistypings. This is useful in analyzing Hungarian texts from the 19th century,
when Hungarian orthography was not yet standardized.
Humor's Hungarian version, the largest and most precise implementation, contains
nearly 100,000 stems, which cover all (approx. 70,000) lexemes of the Concise
Explanatory Dictionary of the Hungarian Language. Suffix dictionaries contain
all the inflectional suffixes and the productive derivational morphemes
of present-day Hungarian. With the help of these dictionaries Humor is
able to analyze and/or generate several billions(!) of different well-formed
Hungarian word-forms. The whole software package is written in standard
C using C++-like objects. It runs on any platform where a C compiler can
be found.
Primoz Jakopin (Ljubljana):
RAIL-LEX SLOVENIA - A MODERN RAILWAY DICTIONARY
The two partners involved are Slovenske zeleznice, the Slovenian Railway
(Railway Traffic Institute) and the Institute for Slovenian Language at
the Scientific Research Centre of the Slovenian Academy of Sciences and
Arts. The work on the project, Dictionary of the Railway Terminology (Zelezniski
terminoloski slovar) began in January, 1994 and is to be completed by the
end of 1998.
The dictionary is a part of a larger European undertaking, Rail-lex Europe,
under way by coordinated efforts of 29 members of the UIC, Union internationale
des chemins de fer (International Union of Railways). UIC consists of 97
railway and other transport organizations from Europe and other parts of
the world. The aim of the Rail-lex project, which in 1994
produced an 11-language CD-ROM, Rail Lexic, with over 12,000 keywords (English,
German, French, Italian, Spanish, Esperanto, Hungarian, Dutch, Polish,
Portuguese, Swedish), is to put together a modern, multilingual communications
infrastructure, to promote links between railways themselves and between
railways and industry, research and commerce, and to contribute to the
standardization of railway terminology. Rail-lex is coordinated by UIC's
European Rail Research Institute (ERRI), based in the Netherlands.
On the Slovenian side, the head of the project is mag. Peter Verlic, leader of the
team at the Railway Traffic Institute in Ljubljana, aided by Marjan Vrabl,
who is leading the team in Maribor, the second largest Slovenian city,
where a new set of railway codes, manuals and other documentation is being
prepared. After Slovenia became independent in 1991, changes were needed
to bring the railway closer to UIC's standards. The bulk
of the keywords from Rail Lexic have now been translated and, together
with additional keywords reflecting the social and other specific circumstances
of the Slovenian railway, they now form the first draft of a 15,000-keyword
dictionary. It will be open to criticism from the railway staff and a wider
audience until the end of 1996, when a revision by the Institute
for the Slovenian Language will also be completed.
Norbert Volz (Institut für deutsche Sprache, Mannheim,
E-Mail: volz@mx300c.ids-mannheim.de):
CORDON – CORPUS-ORIENTED DETECTION OF NEOLOGISMS
CORDON, a multinational concerted project jointly carried out by academic
and industrial partners, aims to provide a modular, language-independent
client/server software solution for the automatic detection of neologisms
– new words or multi-word units denoting new concepts – in texts, using
monitor corpora.
New concepts reflecting changes in culture, society, industry and science
quickly show their influence on language. New words or multi-word units
emerge, enabling the integration of these concepts into the communication
process. The identification and documentation of such changes is therefore
of major importance for keeping language resources, language processing
tools and terminology databases up to date.
Monitor corpora can be used to recognise and trace the changing
patterns of collocations and similar phenomena that give clues to the emergence
of new terms. Basically, two types of tools are needed for this purpose:
- a tool to correlate lexical and terminological items with temporal intervals,
based on frequency and distribution over text types, using statistical
methods such as χ²-tests to assess the significance of noticeable
irregularities in the distribution of words of a corpus within a certain
time interval;
- a statistics-driven tool to establish context patterns for lexical and
terminological items, reflecting their various usages, e.g. by examining
the verbal environment of repeated instances of words, looking for
repetitions and regularities within that environment.
A combination of these tools working on monitor corpora will enable the
identification of "candidates" for neologisms, which then can
be listed and processed for further analyses and applications.
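The first of the two tools, flagging words whose frequency in a recent time slice deviates significantly from that in the older corpus, can be sketched with a 2x2 χ² test. The counts, the threshold and the function below are illustrative assumptions, not CORDON's actual design.

```python
# Sketch of a CORDON-style frequency test: a 2x2 chi-squared statistic
# comparing a word's frequency in a recent time slice against the older
# corpus. Counts and the significance threshold are illustrative assumptions.

def chi_squared(word_new, total_new, word_old, total_old):
    """Chi-squared statistic for the 2x2 table (word vs. rest, new vs. old)."""
    table = [[word_new, total_new - word_new],
             [word_old, total_old - word_old]]
    grand = total_new + total_old
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # expected count under the hypothesis of no frequency change
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / grand
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# A word appearing 40 times in 100,000 recent tokens but only 5 times in
# 400,000 older tokens is a strong candidate neologism.
score = chi_squared(40, 100_000, 5, 400_000)
print(score > 3.84)   # True: significant at the 5% level (1 degree of freedom)
```

Words whose statistic exceeds the chosen critical value would be collected as the "candidates" mentioned above and passed on to the second, context-pattern tool for further analysis.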
The envisaged software product will be a minimal-assumption, generic, modular
solution that any user can adapt to their own texts and corpora, regardless
of language. Possible applications will mainly be within lingware products,
e.g. machine translation systems, multilingual termbanks, databases, etc.
CORDON will also prove useful for the automatic updating and expansion
of natural language lexicons and translation memories.
The project consortium consists of four academic and four industrial partners.
The academic partners will provide research facilities and staff. The
industrial partners will be responsible for project management, supervision,
validation, evaluation and assessment of the final product in order to
guarantee maximum response to user needs.
Project duration will be two years. At the end of this phase, the result
of the CORDON project will be a demonstrable, robust prototype that
will work on existing applications and corpora.
The proposal for this project will be submitted under the current TELEMATICS
call within the 4th Framework Programme of the European Commission.
Elena Paskaleva (Sofia):
EUROPEAN LANGUAGE RESOURCES AND THE COMPUTERIZED RUSSIAN LANGUAGE FUND
CEU RSS (Central European University - Research Support Scheme) has sponsored
a project with 5 participants from 3 countries: 2 from CRLF, 2 from GMS-Berlin
and 1 from LML (Linguistic Modeling Laboratory, Bulgarian Academy of Sciences).
Limited resources have been granted for the application of 10,000 dictionary
entries from Ozhegov's Dictionary of the Russian Language to the Russian
part of the data in a METAL-type system for Machine Translation.