1. Editorial - Wolfgang Teubert
2. Topic of the issue: Multilingual technology
3. TELRI Event - Montecatini 3rd TELRI seminar
4. On the TELRI Newsletter
Wolfgang Teubert, Coordinator of TELRI
This TELRI Newsletter contains the abstracts of the presentations at the Third European Seminar: "Translation Equivalence - Theory and Practice", which will take place in Montecatini, Italy, from October 16 to 18. As with the two preceding TELRI Seminars, our goal is to set up a forum where industry and academia trade expertise, exchange tools and resources, and prepare for the challenge of the multilingual global information society.
The synergy of 25 focal national language centers from all over Europe will give rise to new ideas and approaches for the next generation of multilingual technology: authoring tools, information retrieval and translation aids. This new generation of tools will be based on language data derived from multilingual resources: comparable and parallel corpora covering all the languages involved.
Methodologies for extracting, processing, and applying multilingual linguistic knowledge from corpora are now being developed. TELRI has undertaken a joint study on parallel texts, and the results will be presented at this Seminar. Other speakers working on related projects will demonstrate alternative methodologies.
We hope that the Seminar will stimulate current multilingual NLP research and, like the preceding TELRI Seminars, will lay the foundation for new joint ventures between academic institutions, the language industry, and dictionary publishers all over Europe.
Multilingual Tools at the Xerox Research Centre
Xerox Research Centre
The Xerox Research Centre (see http://www.rxrc.xerox.com for more information) pursues a vision of document technology where language, physical location and medium - electronic, paper or other - impose no barrier to effective use.
Our primary activity is research. Our second activity is a Program of Advanced Technology Development, to create new document services based on our own research and that of the wider Xerox community. We also participate actively in exchange programs with European partners.
Language plays an important role in the production and use of documents and is therefore a central theme of our research activities. More particularly, our Centre focuses on multilingual aspects of Natural Language Processing (NLP). Our current developments cover more than ten European languages and some non-European languages such as Arabic. Some of these developments are conducted through direct collaboration with academic institutions all over Europe.
The present article is an introduction to our basic linguistic components and to some of their multilingual applications.
The MLTT (Multilingual Theory and Technology) team creates basic tools for linguistic analysis, e.g. morphological analysers, taggers, parsing and generation platforms. These tools are used to develop descriptions of various languages and the relation between them. They are later integrated into higher level applications, such as terminology extraction, information retrieval or translation aid. The Xerox Linguistic Development Architecture (XeLDA) developed by the Advanced Technology Systems group incorporates the MLTT language technology.
Finite-state technology is the fundamental technology on which Xerox language R&D is based. It encompasses both work on the basic calculus and on linguistic tools, in particular in the domain of morphology and syntax.
The basic calculus is built on a central library that implements the fundamental operations on finite-state networks. It is based on long-term Xerox research, originated at PARC in the early 1980s. The most recent development in the finite-state calculus is the introduction of the replace operator. The replacement operation is defined in a very general way, allowing replacement to be constrained by input and output contexts, as in two-level rules but without the restriction of only single-symbol replacements. Replacements can be combined with other kinds of operations, such as composition and union, to form complex expressions.
The finite-state calculus is widely used in our linguistic development, to create tokenisers, morphological analysers, noun phrase extractors, shallow parsers and other language-specific linguistic components.
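As a rough approximation of such a contextual replace rule, the sketch below uses regular-expression lookaround in place of the actual finite-state compilation; the rule itself (a classical two-level-style alternation, n becomes m between a vowel and p) is invented for illustration:

```python
import re

def replace_in_context(text, target, replacement, left, right):
    """Approximate the replace rule  target -> replacement || left _ right
    with lookbehind/lookahead; the real calculus compiles such rules into
    finite-state transducers that can be composed with other networks."""
    pattern = rf"(?<={left}){re.escape(target)}(?={right})"
    return re.sub(pattern, replacement, text)

# Toy alternation: n -> m only between a vowel and 'p'
print(replace_in_context("kanpat", "n", "m", "[aeiou]", "p"))
# -> kampat
```

Unlike single-symbol two-level rules, the replace operator of the calculus also allows multi-symbol targets and replacements, which the same function mimics by taking arbitrary strings.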
The MLTT work on morphology is based on the fundamental insight that word formation and morphological or orthographic alternation can be solved with the help of finite automata:
1. the allowed combinations of morphemes can be encoded as a finite-state network;
2. the rules that determine the form of each morpheme can be implemented as finite-state transducers;
3. the lexicon network and the rule transducers can be composed into a single automaton, a lexical transducer, that contains all the morphological information about the language including derivation, inflection, and compounding.
Lexical transducers have many advantages. They are bi-directional (the same network for both analysis and generation), fast (thousands of words per second), and compact.
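A toy illustration of points 1-3, with small finite relations standing in for full networks (the morpheme strings and rules are invented for illustration):

```python
# Transducers modelled as finite sets of (input, output) string pairs.
def compose(r1, r2):
    """Relational composition: feed the outputs of r1 into r2."""
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

# 1. Allowed morpheme combinations (the lexicon network):
#    lexical form -> intermediate morpheme string ('^' marks a boundary)
lexicon = {("cat+Sg", "cat"), ("cat+Pl", "cat^s"), ("fly+Pl", "fly^s")}

# 2. Alternation rules as a transducer: y^s -> ies, otherwise ^s -> s
rule = {("cat", "cat"), ("cat^s", "cats"), ("fly^s", "flies")}

# 3. Their composition is the lexical transducer: analysis <-> surface
lexical_transducer = compose(lexicon, rule)

def generate(analysis):   # analysis -> surface form
    return {s for (a, s) in lexical_transducer if a == analysis}

def analyse(surface):     # surface form -> analysis (same network, reversed)
    return {a for (a, s) in lexical_transducer if s == surface}

print(generate("fly+Pl"))   # {'flies'}
print(analyse("cats"))      # {'cat+Pl'}
```

The bi-directionality described above falls out for free: the same relation is simply read in either direction.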
We have created comprehensive morphological analysers for many languages including English, German, Dutch, French, Italian, Spanish, and Portuguese. More recent developments include Czech, Hungarian, Polish, Russian, Scandinavian languages and Arabic.
The general purpose of a part-of-speech tagger is to associate each word in a text with its morphosyntactic category (represented by a tag), as in the following example:
This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
The process of tagging consists of three steps:
1. tokenisation: break a text into tokens
2. lexical lookup: provide all potential tags for each token
3. disambiguation: assign to each token a single tag
Each step is performed by an application program which uses language specific data:
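The three steps can be sketched as follows; a tiny invented lexicon and a naive pick-the-first-tag disambiguator stand in for the real language-specific data and statistical or constraint-based disambiguation:

```python
import re

# Step 1: tokenisation - break a text into tokens
def tokenise(text):
    return re.findall(r"\w+|[^\w\s]", text)

# Step 2: lexical lookup - all potential tags for each token (toy lexicon)
LEXICON = {
    "this": ["PRON", "DET"], "is": ["VAUX_3SG"],
    "a": ["DET"], "sentence": ["NOUN_SG", "VERB"], ".": ["SENT"],
}

def lookup(token):
    return LEXICON.get(token.lower(), ["NOUN_SG"])  # guesser fallback

# Step 3: disambiguation - assign a single tag; here naively the first
# listed tag, where the real tagger consults context
def disambiguate(tags_per_token):
    return [tags[0] for tags in tags_per_token]

tokens = tokenise("This is a sentence.")
tags = disambiguate([lookup(t) for t in tokens])
print(" ".join(f"{t}+{g}" for t, g in zip(tokens, tags)))
# This+PRON is+VAUX_3SG a+DET sentence+NOUN_SG .+SENT
```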
Incremental finite-state parsing
Finite-state parsing is an extension of finite-state technology to the level of phrases and sentences.
Our work concentrates on shallow parsing of unrestricted texts. We compute syntactic structures, without fully analysing linguistic phenomena that require deep semantic or pragmatic knowledge. For instance, PP-attachment, co-ordinated or elliptic structures are not always fully analysed. The annotation scheme remains underspecified with respect to yet unresolved issues. On the other hand, such phenomena do not cause parse failures, even on complex sentences.
Syntactic information is added at the sentence level in an incremental way, depending on the contextual information available at a given stage. The implementation relies on a sequence of networks built with the replace operator. The current system has been implemented for French and is being expanded to new languages.
The parsing process is incremental in the sense that the linguistic description attached to a given transducer in the sequence relies on the preceding sequence of transducers, covers only some occurrences of a given linguistic phenomenon and can be revised at a later stage.
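As a rough sketch of such a cascade, plain regular expressions over tagged text stand in below for the compiled replace-operator transducers; the two stage rules are invented for illustration:

```python
import re

# Each stage plays the role of one transducer in the sequence: a contextual
# replace over the tagged string. Later stages see, and rely on, the markup
# added by earlier ones - the incremental aspect described above.
STAGES = [
    # Stage 1: mark simple noun phrases
    (re.compile(r"((?:\w+\+DET )?(?:\w+\+ADJ )*\w+\+NOUN)"), r"[NP \1]"),
    # Stage 2: an NP immediately before a finite verb is marked as subject
    (re.compile(r"\[NP ([^\]]+)\](?= \w+\+VERB)"), r"[NP/SUBJ \1]"),
]

def parse(tagged_sentence):
    for pattern, template in STAGES:
        tagged_sentence = pattern.sub(template, tagged_sentence)
    return tagged_sentence

print(parse("the+DET small+ADJ cat+NOUN sleeps+VERB"))
# [NP/SUBJ the+DET small+ADJ cat+NOUN] sleeps+VERB
```

A later stage could revise the SUBJ annotation in further contexts without the earlier stages failing, which is what keeps the cascade robust on unrestricted text.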
The parser output can be used for further processing such as extraction of dependency relations over unrestricted corpora. In tests on French corpora (technical manuals, newspapers), precision is around 90-97% for subjects (84-88% for objects) and recall around 86-92% for subjects (80-90% for objects).
The LFG PARGRAM project
The LFG PARGRAM project is a collaborative effort involving researchers from Xerox PARC in Palo Alto, the Xerox Research Centre in Grenoble, France, and the University of Stuttgart in Stuttgart, Germany. The aim of the project is to produce wide coverage LFG grammars for English, French, and German which are written collaboratively, based on a common set of linguistic principles and with a commonly-agreed-upon set of grammatical features.
The grammarians use a new platform, the Xerox Linguistic Environment, which is still under development; a unification-based generator is also under development.
The grammars consist of phrase-structure rules and abbreviatory rule macros; LFG allows the right-hand side of phrase structure rules to consist of regular expressions (including the Kleene Star notation) and arbitrary Boolean combinations of regular predicates, so the rules in the grammar actually abbreviate a large set of rules written in a more conventional framework. The lexicons used by the sites consist of entries for stems, template definitions, and lexical rules. The Xerox Linguistic Environment allows for an interface to an external finite-state morphological analyser, and so the lexicons include entries for the information about morphological inflection supplied by the analyser.
LOCOLEX: a Machine Aided Comprehension Dictionary
LOCOLEX is an on-line bilingual comprehension dictionary which aids the understanding of electronic documents written in a foreign language. It displays only the appropriate part of a dictionary entry when a user clicks on a word in a given context. The system disambiguates parts of speech and recognises multiword expressions such as compounds (e.g. heart attack), phrasal verbs (e.g. to nit pick), idiomatic expressions (e.g. to take the bull by the horns) and proverbs (e.g. birds of a feather flock together). In such cases LOCOLEX displays the translation of the whole phrase and not the translation of the word the user has clicked on.
For instance, someone may use a French/English dictionary to understand the following text written in French:
Lorsqu'on évoque devant les cadres la séparation négociée, les
rumeurs fantaisistes vont apparemment toujours bon train.
When the user clicks on the word cadres, LOCOLEX identifies its POS and base form. It then displays the corresponding entry, here the noun cadre, with its different sense indicators and associated translations. In this particular context, the verb reading of cadres is ignored by LOCOLEX. Actually, in order to make the entry easier to use, only essential elements are displayed:
cadre I: nm
1: *[constr,art] (of a picture, a window) frame
2: *(scenery) setting
3: *(milieu) surroundings
4: *(structure, context) framework
5: *(employee) executive
6: *(of a bike, motorcycle) frame
The word train in the same example above is part of the verbal multiword expression aller bon train. In our example, the expression is inflected and two adverbs have been inserted between the head verb and its complement. Still, LOCOLEX retrieves only the equivalent expression in English, to be flying round, and not the entire entry for train.
train I: nm
5 : * [rumeurs] aller bon train : to be flying round
LOCOLEX uses an SGML-tagged bilingual dictionary (the Oxford-Hachette French English Dictionary). To adapt this dictionary to LOCOLEX required the following:
The lookup process itself may be represented as follows:
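In outline - with a hypothetical, much simplified entry structure standing in for the real SGML dictionary (the names ENTRY, MWE_PATTERNS and lookup are invented for illustration) - the process might look like this:

```python
import re

# Hypothetical miniature of a bilingual entry base; the real system works
# on the full SGML-tagged Oxford-Hachette dictionary.
ENTRY = {
    ("cadre", "nm"): [("of a picture, a window", "frame"),
                      ("employee", "executive")],
}
# Multiword expressions: a pattern over lemmatised text that tolerates
# inflection and intervening adverbs, plus the translation of the phrase.
MWE_PATTERNS = {
    "train": (re.compile(r"\baller\b(?: \w+)* bon train\b"),
              "to be flying round"),
}

def lookup(clicked_lemma, pos, lemmatised_sentence):
    # 1. A multiword expression takes priority over the single word.
    if clicked_lemma in MWE_PATTERNS:
        pattern, translation = MWE_PATTERNS[clicked_lemma]
        if pattern.search(lemmatised_sentence):
            return translation
    # 2. Otherwise display the senses of the disambiguated (lemma, POS) entry.
    return ENTRY[(clicked_lemma, pos)]

# "les rumeurs ... vont apparemment toujours bon train", lemmatised:
print(lookup("train", "nm",
             "le rumeur fantaisiste aller apparemment toujours bon train"))
# -> to be flying round
```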
Besides being an effective tool for understanding, LOCOLEX could also be useful in the framework of language learning. The LOCOLEX work also shows that existing on-line dictionaries, even when organised like a database rather than a set of type-setting instructions, are not necessarily suitable for NLP applications. By adding grammar rules to the dictionary in order to describe the possible variations of multiword expressions we add a dynamic feature to this dictionary: SGML functions no longer point to text but to programs.
Multilingual Information Retrieval
Many of the linguistic tools being developed at our Centre are being used in applied research into multilingual information retrieval. Multilingual information retrieval allows the interrogation of texts written in a target language B by users asking questions in source language A.
In order to perform this retrieval, the following linguistic processing steps are performed on the documents and the query:
This morphological analysis, tagging, and subsequent lemmatisation of analysed words has proved to be as useful an improvement for information retrieval as any information-retrieval-specific stemming. To process a given query, an intermediate form of the query must be generated which relates the normalised language of the query to the indexed text of the documents. This intermediate form can be constructed by replacing each word with target-language words through an on-line bilingual dictionary. The intermediate query, which is in the same language as the target documents, is passed along to a traditional information retrieval system, such as SMART. This simple word-based method is the first approach we have been testing. Initial runs indicate that incorporating multi-word expression matching can significantly improve results. The multi-word expressions most interesting for information retrieval are terminological expressions, which most often appear as noun phrases in English.
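The word-based approach can be sketched as follows; the lemma table and bilingual lexicon are tiny invented samples standing in for the real morphological analyser and on-line dictionary:

```python
# A French query is normalised (lemmatised), each lemma is replaced by its
# dictionary translations, and the result is handed as a boolean query to
# a standard monolingual retrieval engine such as SMART.
LEMMAS = {"maladies": "maladie", "cardiaques": "cardiaque"}
BILINGUAL = {"maladie": ["disease", "illness"],
             "cardiaque": ["cardiac", "heart"]}

def translate_query(source_query):
    terms = []
    for word in source_query.lower().split():
        lemma = LEMMAS.get(word, word)               # analysis + lemmatisation
        terms.append(BILINGUAL.get(lemma, [lemma]))  # bilingual lookup
    # each source word becomes a disjunction of its candidate translations
    return " AND ".join("(" + " OR ".join(t) + ")" for t in terms)

print(translate_query("maladies cardiaques"))
# (disease OR illness) AND (cardiac OR heart)
```

Matching multi-word terminological expressions, as noted above, would replace whole noun phrases rather than single words in this intermediate query.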
Callimaque: a collaborative project for virtual libraries
Digital libraries represent a new way of accessing information distributed all over the world via a computer connected to the Internet. Whereas a physical library deals primarily with physical data, a digital library deals with electronic documents such as texts, pictures, sounds and video.
We expect more from a digital library than only the possibility of browsing its documents. A digital library front-end should provide users with a set of tools for querying and retrieving information, as well as annotating pages of a document, defining hyper-links between pages or helping to understand multilingual documents.
Callimaque is one of our projects dealing with such new functionalities for digital libraries. More precisely, Callimaque is a collaborative project between the Xerox Research Centre and research/academic institutions of the Grenoble area (IMAG, INRIA, CICG). The goal is to build a virtual library that reconstructs the early history of information technology in France. The project is based on a similar project, the Class project, which was started by Cornell University several years ago under the leadership of Stuart Lynn to preserve brittle old books. The Class project runs over conventional networks and all scanned material is in English.
The Callimaque project includes the following steps:
Salah Aït-Mokhtar, Jean-Pierre Chanod. “Incremental finite-state parsing”. In Proceedings of Applied Natural Language Processing 1997, Washington, DC, April 1997.
Salah Aït-Mokhtar, Jean-Pierre Chanod. “Subject and Object Dependency Extraction Using Finite-State Transducers”. ACL Workshop on Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications, Madrid, 1997.
D. Bauer, F. Segond, A. Zaenen. “LOCOLEX: the translation rolls off your tongue.” in Proceedings of the ACH-ALLC conference, Santa Barbara, pp. 6-8, 1995.
Jean-Pierre Chanod, Pasi Tapanainen. “Tagging French -- comparing a statistical and a constraint-based method” in Seventh Conference of the European Chapter of the ACL. Dublin, 1995.
Gregory Grefenstette. Explorations in Automatic Thesaurus Discovery. Kluwer Academic Press, Boston, 1994.
Gregory Grefenstette, Ulrich Heid and Thierry Fontenelle. “The DECIDE project: Multilingual Collocation Extraction.” Seventh Euralex International Congress, University of Gothenburg, Sweden, Aug 13-18, 1996.
Barbora Hladka and Jan Hajic. “Probabilistic and Rule-based Tagger of an Inflective Language”. In Proceedings of Applied Natural Language Processing 1997, Washington, DC, April 1997.
Ronald M. Kaplan, Martin Kay. “Regular Models of Phonological Rule Systems”. Computational Linguistics, 20:3 331-378, 1994.
Kaplan, Ronald M. and Joan Bresnan. 1982. Lexical-Functional Grammar: A formal system
for grammatical representation. In Joan Bresnan, editor, The Mental Representation of
Grammatical Relations. The MIT Press, Cambridge, MA, pages 173--281.
Kaplan, Ronald M. and John T. Maxwell. 1996. LFG grammar writer's workbench. Technical
report, Xerox PARC.
Lauri Karttunen. “Constructing Lexical Transducers”. In Proceedings of the 15th International Conference on Computational Linguistics, Coling, Kyoto, Japan, 1994.
Lauri Karttunen. “The Replace Operator”. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, ACL-95, 16-23, Boston, 1995.
Kimmo Koskenniemi. “A General Computational Model for Word-Form Recognition and Production”. Department of General Linguistics, University of Helsinki, 1983.
Julian Kupiec and Mike Wilkens. The dds tagger guide version 1.1. Technical report, Xerox Palo Alto Research Center, 1994.
Maxwell, III, John T. and Ronald M. Kaplan. 1991. A method for disjunctive constraint
satisfaction. In Masaru Tomita, editor, Current Issues in Parsing Technology. Kluwer
Academic Publishers, Dordrecht, pages 173--190.
John Nerbonne, Lauri Karttunen, Elena Paskaleva, Gabor Proszeky and Tiit Roosmaa. “Reading more into Foreign Languages”. In Proceedings of Applied Natural Language Processing 1997, Washington, DC, April 1997.
F. Segond and P. Tapanainen. “Using a finite-state based formalism to identify and generate multiword expressions”. Technical Report MLTT-019, Xerox Research Centre, Grenoble.
(EuroWordNet and Czech Wordnet)
Faculty of Informatics,
Brno, Czech Republic
1. What is WordNet and EuroWordNet?
WordNet is a database of English word meanings with basic semantic relations between them, such as synonymy, hyponymy (between expressions denoting specific and more general concepts), meronymy (between expressions denoting parts and wholes), causal and entailment relations etc. By means of these relations all meanings can be interconnected, constituting a huge network or wordnet. Such a wordnet can be used for making various semantic inferences about the meanings of words (e.g. which words can name diseases), for finding alternative expressions or wordings, or simply for expanding words to sets of semantically related or close words in information retrieval. This approach has been developed in Princeton by G. A. Miller and his colleagues (Miller et al. 1991) and its latest version is known as WordNet 1.5.
EuroWordNet is conceived as a generic multilingual semantic database, the first of its kind. At present it contains the basic semantic information for Dutch, Italian, Spanish and English, and each of these resources is linked to a shared inter-lingua. This database can be used directly for semantic information retrieval in each of these languages, but also for multilingual retrieval across them. The next step is to extend EuroWordNet with French and German wordnets so that all major European languages are covered. The basic monolingual databases for German and French are already being produced with national and private funding. Finally, wordnets will also be produced for two Eastern-Middle European languages, Czech and Estonian, so that typologically different languages are included and a standard for multilingual semantic resources for a variety of language types is produced.
In EuroWordNet (EWN-1) the wordnet for each language is structured along the same lines as in the Princeton WordNet 1.5 (Miller et al. 1991), in such a way that they contain synsets (set of synonymous word meanings) and basic semantic relations between these synsets. In addition each synset has an equivalence relation to a so-called Inter-Lingual-Index, mainly based on the synsets of WordNet 1.5. Via the Inter-Lingual-Index all synsets are interconnected, thus constituting a flexible and powerful multi-lingual system.
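The architecture can be sketched as follows; the synset identifiers, gloss and helper function are invented for illustration:

```python
# Minimal sketch of the EuroWordNet design: one wordnet per language,
# each synset linked by an equivalence relation to an Inter-Lingual-Index
# record, through which all languages are interconnected.
ILI = {"ili-00123": "a building for human habitation"}

SYNSETS = {
    "en": {"en-1": {"words": ["house", "dwelling"], "ili": "ili-00123"}},
    "nl": {"nl-7": {"words": ["huis", "woning"], "ili": "ili-00123"}},
}

def equivalents(lang_from, synset_id, lang_to):
    """Cross-lingual lookup via the shared Inter-Lingual-Index."""
    ili_key = SYNSETS[lang_from][synset_id]["ili"]
    return [s["words"] for s in SYNSETS[lang_to].values()
            if s["ili"] == ili_key]

print(equivalents("en", "en-1", "nl"))   # [['huis', 'woning']]
```

Adding a new language means adding one more per-language table and its ILI links; none of the existing wordnets need to change, which is what makes the index-based design flexible.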
The preparation of EuroWordNet in this framework is now being designed within the EuroWordNet-2 EC project, whose co-ordinator is Piek Vossen of the University of Amsterdam.
The usefulness of the EuroWordNet lexical database is obvious: it will represent a resource essential for providing non-expert users access to the multilingual and multi-cultural European information society. Obviously, EuroWordNet is restricted to a few European languages and therefore only partially addresses the multi-linguality problem.
Furthermore, semantic networks give information about the lexicalization patterns of a language, the conceptual density of the vocabulary areas and the semantic distinctions that play a role (i.e. which meanings and which relations play a role in different semantic fields). Internet browsers are just one example of the relevance of multilingual semantic information about words for applications in the area of information retrieval. Other applications that can directly benefit from multilingual semantic resources are:
- information-acquisition tools,
- language-learning tools,
A prototypical area for the application of such resources are Internet search engines, which are already well established in the information market. Although the number of users and the usage of these services are increasing exponentially, the potential quality of results is far from being fully exploited. This holds especially for the areas of search-term expansion and multilinguality. Queries are typically restricted to the enumeration (and logical combination) of mere keywords, which provide no information about terms related to the keywords. In the present systems, a search for "health" will not disclose documents that use clearly related terms such as "disease", "disorder", "stress", "deafness" and "headache", unless the documents also include the term "health" itself. In addition, the multilingual nature of the information society is not reflected by these engines: they offer no means of simultaneous access to documents in the variety of different languages to which they have, in principle, access.
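Wordnet-based search-term expansion of this kind can be sketched as follows; the mini-hierarchy is invented for illustration:

```python
# A query term is expanded with its hyponyms (and their hyponyms in turn),
# so that documents mentioning only more specific related terms are found.
HYPONYMS = {
    "health problem": ["disease", "disorder", "stress"],
    "disease": ["deafness", "headache"],
}

def expand(term):
    result = {term}
    for hyponym in HYPONYMS.get(term, []):
        result |= expand(hyponym)      # recurse down the hierarchy
    return result

print(sorted(expand("health problem")))
# ['deafness', 'disease', 'disorder', 'headache', 'health problem', 'stress']
```

With the Inter-Lingual-Index, the same expansion could collect the corresponding synsets in other languages, turning a monolingual query into a multilingual one.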
One of the main goals of EuroWordNet-2 is to include newly developed wordnets of the two Eastern-Middle European languages, particularly Czech and Estonian. The integration of these upcoming national wordnets into the EuroWordNet framework will ensure maximum compatibility between the wordnets for the individual languages and will allow true multilinguality by linking the additional resources to the shared inter-lingual database of EuroWordNet. The extension will also strengthen the role of EuroWordNet's technology and data format as a de facto standard for the representation of lexical semantic data for Europe's information society. Such a standard will not only allow for future incorporation of further languages, but also provide a unique interface for software developers in the information industry to lexical semantic data. In a longer run the wordnets will become the backbone of any semantic database of the future and will open up a whole range of new applications and services in Europe at a trans-national and trans-cultural level.
To provide non-expert users flexible access to the information society it is crucial to develop tools that can expand their general and common words in a specific language to any possible variant or term in any other language. The user should be able to get around the choice of words in a document or the choice of key words by matching meanings rather than words. Such tools depend on the availability of generic resources with semantic information on words in each of the languages, preferably with cross-linguistic links.
2. Preparation of Czech WordNet
The development of Czech WordNet will go along the lines outlined above. The main tasks are:
a) To define a common set of Base Concepts for Czech: this is a set of meanings that play a key role in the individual wordnets. Estimated size = 1,000 synsets: 700 nominal synsets, 300 verbal synsets.
b) To encode the language-internal relations and the equivalence relations around the Base-Concepts for Czech. This should result in a Czech core wordnet of at most 10,000 synsets: 7,000 nouns and 3,000 verbs.
c) To encode the language-internal relations and equivalence relations for adjectives in Czech (and, of course, to establish links to English, Dutch, Italian, Spanish, German, French and Estonian wordnets).
d) To include Czech Base Concepts into the Inter-Lingual-Index and in this way to integrate it into EuroWordNet as its part.
The starting point for building the basic set of Czech synsets are the following resources:
i) Dictionary of Czech Synonyms (Pala, Všianský, 1994) which exists both in printed and electronic form and contains about 20 000 headwords,
ii) newly developed Electronic English-Czech and Czech-English dictionary (Ševeček, 1997) containing at present approximately 25 000 headwords,
iii) list of Czech verbs with their verb frames comprising now about 12 000 items,
iv) Czech morphological analyser and lemmatiser (Ševeček, 1996) able to retrieve the complete inventory of Czech word forms.
2.1 Techniques and/or approaches used
The scope of Czech WordNet lexical database is limited to the basic semantic relations that are well-understood - i.e. to the relations between synonyms, hyponyms, hypernyms, meronyms and holonyms plus causal relations and also verb frames.
Establishing these relations within the selected collection of Czech lexical units should be done in part semi-automatically and in part manually. The selection of the set of Base Concepts will follow the corresponding sets in English and other languages within EuroWordNet.
The basic techniques will mainly rely on semi-automatic extraction of data from the above-mentioned electronic resources (machine-readable dictionaries) and also on the use of the Novell toolkit.
We assume that it will be necessary to interconnect the Czech wordnet with the mentioned lemmatiser: in practice we have to expect that queries will undergo morphological analysis, which in Czech - a highly inflected language - is a sine qua non for any realistic processing.
Miller, G.A., et al, Five Papers on Wordnet, Princeton, 1991.
Pala, K., Všianský, J., Dictionary of Czech Synonyms (Slovník českých synonym), Lidové Noviny, Praha, 1994.
Ševeček, P., Electronic English-Czech Dictionary, Langea, Brno, 1997.
Ševeček, P., Morphological Analyser and Lemmatiser for Czech, program in C (for DOS, Unix and Macintosh platforms), Brno, 1996.
Abstracts of the papers submitted for the 3rd TELRI seminar
Translation Equivalence - Theory and Practice
Department of Linguistics,
home page: www.ruf.rice.edu/~barlow
This paper provides a brief overview of some practical and theoretical issues related to parallel corpora (i.e., texts that are translations). After a very brief description of a parallel concordancer, ParaConc, we will examine the potential of such a program, in conjunction with monolingual text analysis programs, to provide insights into the form and function of languages.
Taking a language to consist of form-meaning links, what we have in parallel corpora are two sets of form-meaning linkings, one for each language. And since the two texts are translations, the meaning part - the description of an event - can be assumed to be approximately the same in both texts. Thus we are able to see how two different languages encode equivalent meanings. The art of translation is undeniably complex, involving many different kinds of processes, and there are known problems associated with the use of translation texts, but we can fruitfully examine three main aspects of translation, namely, language-particular encodings of
Each of these areas of form-meaning mapping can be profitably analysed using parallel corpora to yield results of interest to linguists, lexicographers, translators and language teachers. In this talk I will concentrate on the use of parallel corpora to investigate language-particular preferences with respect to the structuring of the conceptual domain in terms of metaphor and image schemas. We will see, for example, how the up-down image schema (representing the vertical dimension) is exploited to markedly different degrees in the structuring of English and French.
(with respect to their Czech-English equivalents in the text and their elaboration in the dictionaries)
Faculty of Philosophy,
Prague, Czech Republic
The topic of this contribution was chosen in accordance with the nature of the text studied - the dialogue form: the participants react to each other's statements, questions etc. Therefore, a rather great number of verbal forms in the 1st person present indicative appear in the text, such as I tell you, I ask you, I agree, I admit x říkám, tvrdím, souhlasím etc. The performative verbs were defined by J. L. Austin in his book "How to Do Things with Words" (1962) and then by J. R. Searle in his work "Speech Acts" (1969). The theory of speech acts was summarized in the work "Pragmatics" by S. Levinson (1983). Levinson criticized Searle's typology of speech acts and stated that "the 'fundamental part' of human communication is carried out... by specific classes of communicative intention" (p. 241). According to Levinson there are three basic sentence-types, i.e. interrogative, imperative and declarative, and they seem to be universal across most languages. These three sentence-types may contain performative phrases or prefixes, e.g. I request you to, i.e. explicit performative verbs. These sentence-types were taken as the basis for our analysis.
The declarative sentences are represented in the greatest number of occurrences. Most frequent performative forms in Czech are the following:
tvrdím / netvrdím 16x
shoduji se s tebou 6x
jsem zajedno 1x
Most frequent English equivalents: I affirm, I say, I agree, I admit, I tell you
As an example we can take the text equivalents of the Czech performative "(já) ne/tvrdím":
I affirm that 3x, I say 4x, I concur 1x, I (don't) mean 2x, I am trying to say 1x, I am ready to admit 1x, we say 1x, I will say 1x.
1) Pak nebudeme múzicky vzdělaní, tvrdím při bozích, ani my...
Then, by heaven , am I not right in saying that by the same token we shall never be true musicians, either
2) ... pokud ty tady netvrdíš něco jiného. - Netvrdím, řekl.
unless you have something different to say." - "No, nothing," said he;
3) tu tvrdím, nemohl by k tomuto poznání nikdy dojít
I would never say that he really learns
One of the text variants which appears as an equivalent of other verbs is, of course, the auxiliary verb, i.e. I do. According to the Czech-English dictionary by I. Poldauf the equivalents of tvrdit are: to insist, to claim, to assert, to affirm, to allege, to aver, to predicate, to vindicate, to submit, to argue, to contend, to maintain, to warrant. It is quite interesting that among these equivalents the most common text equivalent, to say, does not occur.
In a similar way, the interrogative and imperative sentences will be analysed within the contribution.
Eva Hajičová, Zdeněk Kirschner
Charles University, Prague
1. Adding new words to the lexical stock of natural language is an endless process. Therefore, every lexicon is an open list, even if based on corpora of hundreds of millions of word occurrences. For purposes of multilingual applications one needs to think of "fail-soft" measures to cover the text as a whole, without blanks substituted for the unknown words.
One possible solution is to study productive word formation processes as a basis for an automatic "transduction" of the given unknown lexical unit of the source language.
2. Such a transducing device was developed by Zd. Kirschner within the project of English-to-Czech machine translation in the eighties. It was based on the observation that most newly coined Czech words in the domain of technology and science are taken over from English, as loans from Latin and Greek with slight (and mostly regular) modifications as for endings and orthography. Based on this observation, a set of about 60 rules was formulated to cover the most productive modifications.
The first step consists in the interpretation of the unrecognized words according to their typical and (mostly) productive suffixes (the inflectional endings being detached and dictionary forms reconstructed by morphemic analysis in the preceding steps), and in the assignment of POS and semantic information. Thus e.g. words ending in -er, -or, -graph, -ode and some others are interpreted as nouns, concrete, denoting actors/instruments (e.g. adapter, detector, cyclograph, cathode); words ending in -ce, -cy, -ess, -tude are supposed to be nouns, abstract, properties and forming a regular adjective in Czech (equivalence - ekvivalence, ekvivalentní; tendency - tendence, tendenční; absurdness - absurdnost, absurdní; altitude - altituda, altitudní); the same characteristics are assigned to unknown words ending in -ity, -sm, -ship, -hood, -thm, except for the morphemic information on the formation of adjectives (selectivity - selektivita, *selektivitní; isomorphism - izomorfismus, *izomorfizní; dictatorship - diktátorství, etc.); the endings -fy, -ate, -ise(-ize), -duce indicate verbs that can be both transitive and intransitive, of causative and (semi)terminological character, yet not allowed to form adjectives of the purposive character (calcify - kalcifikovat; alternate - alternovat; formalize - formalizovat; induce - indukovat). A number of adjectival endings were covered by the rules as well, e.g. -ary, -al, -rse, -ive, -ous, -ic, -ble, -less, -anar, -lear, -near, -olar, -ular (evolutionary - evoluční; global - globální; disperse - disperzní). The transducing device covers about 50 classes of nouns, 13 classes of adjectives and 4 classes of verbs.
In the next step, the English suffixes are replaced by the Czech ones, and, finally, the word bases are scanned for spelling configurations to be transformed or adapted to Czech orthography. Thus (as the above examples illustrate), e.g., ph is replaced by f, th by t, c preceding a, l, o, r, t, u by k, s preceded by a, e, i, n, o, r, y and followed by a, e, i, o is replaced by z, etc.
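The two steps just described, suffix replacement and orthographic adaptation of the word base, can be sketched as follows. The rule set below is a small hypothetical subset chosen for illustration, not Kirschner's original 60 rules.

```python
import re

# Hypothetical subset of suffix rules (English suffix -> Czech suffix);
# the original system used about 60 such rules.
SUFFIX_RULES = [
    ("ify", "ifikovat"),   # calcify -> kalcifikovat (verb class)
    ("ize", "izovat"),     # formalize -> formalizovat (verb class)
    ("ic", "ický"),        # photolithographic -> fotolitografický (adj.)
    ("al", "ální"),        # global -> globální (adjective class)
]

def adapt_spelling(base):
    """Adapt the word base to Czech orthography, using a subset of the
    rules quoted in the text: ph -> f, th -> t, c -> k before a,l,o,r,t,u."""
    base = base.replace("ph", "f").replace("th", "t")
    return re.sub(r"c(?=[alortu])", "k", base)

def transduce(word):
    """Return a tentative Czech form for an unknown English term,
    or None if no suffix rule applies."""
    for en_suffix, cz_suffix in SUFFIX_RULES:
        if word.endswith(en_suffix):
            return adapt_spelling(word[:-len(en_suffix)]) + cz_suffix
    return None
```

With these rules, `transduce("photolithographic")` yields fotolitografický and `transduce("calcify")` yields kalcifikovat, matching the examples in the text.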
To give some more examples, photolithographic is translated as fotolitografický, cyclotron as cyklotron, operational as operační, etc. This is not to say that the transduction always results in an existing Czech word, but in most cases it does, and in most of the remaining cases the transduction leads at least to a satisfactory classification of the word as to its POS and morphemic properties, so that a specialist in the domain gets a reasonable picture of the structure of the whole sentence (e.g. the English word amplifier would be translated as amplifikátor rather than as the correct zesilovač, but this transduction would not mislead the reader).
Such a transduction device, of course, must be based on a careful empirical analysis of word formation in the given pair of languages; otherwise, the process may result in unpleasant misinterpretations. Thus, in an early phase of our experiments, the transduction procedure, first tested on texts from the domain of electronics, was applied to a more general domain, where a source text included the collocation international conference. Since one of the rules rewrites the ending -ational as the Czech ending -ační, the resulting translation was internační konference rather than the correct internacionální konference (inter-national); however, the adjective internační does exist in Czech with the meaning 'internment': an internment conference (especially under a totalitarian regime) is far from an international conference.
To estimate the coverage of the transducing procedure formulated within the MT project mentioned above, we scanned the inverse dictionary of English and counted how many words would be treated correctly by the transducing device. The set of about 60 rules covers about 20,000 lexical entries from the dictionary.
4. These good results encouraged us to try and test this fail-soft measure in the Czech-to-Russian MT system developed by our research team (Bémová and Kuboň 1990). The initial expectation was that with closely related languages the idea of a transduction dictionary could be applied to an even larger extent. A contrastive analysis of Czech and Russian has shown that many items can indeed be translated in the algorithmic way illustrated above. A large class of Czech words can be translated into Russian by mere transcription (at least in the nominative or nom./acc. case), cf. e.g. elektroskop - elektroskop, expozimetr - ekspozimetr, cyklograf - ciklograf, agregát - agregat, demontáž - demontaž. Another group contains words whose form must (also) be modified by a regular procedure, cf. e.g. the derivation suffixes and inflectional endings in formalismus - formalizm, linearizace - linearizacija, extrakce - ekstrakcija, tendence - tendencija, stenogram - stenogramma, homeostaze - gomeostazis, báze - bazis, hypotaxe - gipotaksis, galium - galij, selektivita - selektivnost', helium - gelij, specialista - specialist. The third group includes semantically uniform and productive classes of words, e.g. deverbative nouns (-ání -> -anie), nouns denoting a property (-ost -> -ost'), nouns with the meaning of a certain place or space (-ště -> -šče), and nouns with a meaning with a feature of property (-tví -> -tvo).
However, in the course of the long-term development of these languages, semantic shifts of the word bases preclude translating these word types solely by means of the word-formation correspondences of the transducing dictionary. This point can be illustrated by the example of deverbative nouns in -ání, -ení: projektování = projektirovanie (designing), referování = referirovanie (refereeing), but simulování = imitacija or modelirovanie (simulation or modelling) rather than simulacija. In several cases it is possible to apply the regularity of sound changes between Czech and Russian: the Czech prefix pře- can be transduced as pere- (přejmenování = pereimenovanie), but we also face cases such as přetečení (= perepolnenie, overflow) and přepínání (pereključenie), which cannot be translated in such a mechanical way. In such cases, the transducing dictionary cannot do more than specify the word class or the gender, that is, the information to be used in the syntactic analysis of the source language, but in the Russian output the word has to be marked as "not found in the dictionary".
The above remarks are just an illustration of the possibilities and limitations of applying a transducing procedure to a pair of closely related languages: our experience indicates that with closely related languages there is a greater danger of "coining" false equivalents than with a pair of languages that belong to different families but share the tendency to coin new words from the same (Latin or Greek) bases.
Nevertheless, we hope to have illustrated that even with large-scale multilingual corpora one has to look for fail-soft measures that take care of the outcome of the dynamic processes of neologism formation. One such measure has been described in this contribution.
Bémová A. and V. Kuboň (1990), Czech-to-Russian Transducing Dictionary. In: COLING-90, Papers presented to the 13th International Conference on Computational Linguistics, Helsinki, 314-316.
Hajičová E. and Z. Kirschner (1987), Fail-Soft ("Emergency") Measures in a Production-Oriented MT System. In: Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, Copenhagen, 104-108.
Vassar College, USA
Word Sense Disambiguation (WSD) is one of the foremost problems facing research in natural language processing today. Polysemous words present obstacles in areas as diverse as machine translation, document retrieval, and speech synthesis. Recent work on WSD suggests that aligned parallel corpora offer a ready-made solution to sense disambiguation, since different senses of a polysemous word are often translated differently. For example, the word "sentence" in English is translated into French as "phrase" in its sense as a grammatical construct, and as "peine" in its sense as a prison term. To disambiguate an occurrence of the word "sentence" in an aligned English-French corpus, then, one need only consult the French side to see which translation is used. However, this disambiguation method is only partially reliable, for several reasons. First, in many cases the ambiguity is preserved in the translation (e.g., "interest" in English is "intérêt" in French regardless of its sense). Second, translation is not always word-for-word, and semantic mappings may vary with subtleties of use, etc.
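The lookup just described can be sketched as follows; the sense labels and the tokenized alignment are illustrative assumptions, not part of any published system.

```python
# Hypothetical sense inventory for the ambiguous English word "sentence":
# each known French translation points back to one English sense.
SENSE_BY_FRENCH = {
    "phrase": "sentence (grammatical construct)",
    "peine": "sentence (prison term)",
}

def disambiguate(english_tokens, french_tokens, target="sentence"):
    """Disambiguate `target` in an aligned sentence pair by inspecting
    the French side; returns None when the method fails (ambiguity
    preserved, free translation, or expected word absent)."""
    if target not in english_tokens:
        return None
    for french_word, sense in SENSE_BY_FRENCH.items():
        if french_word in french_tokens:
            return sense
    return None
```

The `None` return path mirrors the abstract's caveats: when the French side does not contain any of the known discriminating translations, the method simply fails rather than guessing.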
So far, all work using parallel corpora for WSD has involved alignment between only two languages. However, the availability of parallel corpora in multiple languages (e.g., Plato's Republic and Orwell's "1984" in several languages) offers new potential for exploiting this resource in WSD work. Such corpora are all the more promising because they involve translations into languages from different linguistic families, in which sense ambiguity is less likely to be preserved, and where it is more likely that at least one parallel text will provide the information required for disambiguation. The potential of such corpora for WSD needs to be systematically explored, in order to determine how many and which kinds of languages are required for effective WSD and which kinds of information need to be extracted from the parallel translations; to identify potential problem areas; to develop appropriate methodologies; etc. This paper is intended to assess the potential of multiple parallel translations for WSD, and to provide some principles and methods based on the results.
Trados (Schweiz) AG,
Whereas in the past, automation of the professional translation process was mostly associated with machine translation (MT), this has changed significantly in the last few years. Today, the keywords for professional translators are computer-aided translation tools (CAT-Tools) and, notably, one key component: the translation memory. The general idea of a translation memory is very simple: all translations made by a translator are stored in a database and are then immediately retrievable whenever the same or similar text has to be translated again.
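A minimal sketch of that idea, using plain string similarity for fuzzy matching (commercial systems use more refined sentence-level measures and indexing):

```python
import difflib

class TranslationMemory:
    """Toy translation memory: stores source/target sentence pairs and
    retrieves the best fuzzy match above a similarity threshold."""

    def __init__(self):
        self._pairs = []

    def store(self, source, target):
        """Record a finished translation for later reuse."""
        self._pairs.append((source, target))

    def lookup(self, sentence, threshold=0.75):
        """Return (stored source, stored target, score) for the closest
        stored sentence, or None if nothing is similar enough."""
        best, best_score = None, threshold
        for source, target in self._pairs:
            score = difflib.SequenceMatcher(None, sentence, source).ratio()
            if score >= best_score:
                best, best_score = (source, target, score), score
        return best
```

An exact repetition scores 1.0 and is returned immediately; near-repetitions above the threshold are offered as fuzzy matches for the translator to post-edit.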
Modern CAT-Tools, in most cases an integration of several functionalities into one "workbench", are gaining more and more ground as a standard tool in the hands of professional translators. Except for literary translations or generally idiosyncratic text types, the use of CAT-Tools has been extended to almost every type of translation work, including political, administrative, technical, advertising, biographical, and other text types.
Nowadays companies are faced with a rapidly growing volume of documentation that needs to be produced in ever shorter production cycles while still maintaining the high quality standards expected by their international clients.
This is one of the many reasons why the Trados CAT-Tools are used by companies and institutions like the European Commission and Microsoft. These tools, consisting of a terminology database and a translation memory system, make translation work much more efficient. Clients using the Translator's Workbench, the Trados translation memory system, report time savings of 30-50% on texts with a certain percentage of repetition, along with a higher quality standard.
But let's have a closer look at the following products on the basis of a translation project: "computer manual - English > German".
And what does the future bring?
Terminology search on the Internet/Intranet possible today - with MultiTerm Web Interface.
This will be shown on the basis of a World Wide Web online search on the database of the European Parliament, "Euterpe", and the database of Credit Suisse.
TRADOS, founded in 1984, based in Stuttgart (Germany), develops and markets tools for professional translators, providing a full range of products and services in this field. Today, with a network of sales and support offices throughout Europe and the US, TRADOS is considered to be one of the leading tools vendors in this market.
School of English,
University of Birmingham,
Birmingham, Great Britain
Center for Advanced Research in Machine Learning, NLP and Cognitive Modelling,
This talk describes the process of adapting a part-of-speech tagger originally developed for English to work language-independently. It is shown that a probabilistic tagging approach works well if the language-specific information can be separated from the processing engine. An evaluation on Romanian data showed encouraging results: with only about 200K words of training data, a rate of 97.5% correct tag assignments could be achieved.
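The separation argued for here can be illustrated with a generic bigram Viterbi engine whose only language knowledge comes from externally supplied tables; the toy tag set and probabilities below are invented for demonstration and are not the abstract's actual model.

```python
def viterbi_tag(words, tags, emit, trans, start="START", floor=1e-6):
    """Generic bigram HMM tagger. All language-specific knowledge lives
    in `tags`, `emit` (P(word|tag)) and `trans` (P(tag2|tag1)), which
    would be loaded from resource files separate from this engine."""
    # probability and best tag path for the first word
    table = {t: (trans.get((start, t), floor) * emit.get((words[0], t), floor),
                 [t])
             for t in tags}
    for word in words[1:]:
        new_table = {}
        for t in tags:
            # pick the best predecessor tag for t
            prev, (p, path) = max(
                table.items(),
                key=lambda item: item[1][0] * trans.get((item[0], t), floor))
            new_table[t] = (p * trans.get((prev, t), floor)
                            * emit.get((word, t), floor),
                            path + [t])
        table = new_table
    return max(table.values(), key=lambda v: v[0])[1]
```

Porting the tagger to a new language then means supplying new `emit` and `trans` tables (estimated from training data) without touching the engine, which is the point the abstract makes.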
Consortium for the Training & Development of SMEs,
The paper deals generally with the problem of translating technical-scientific documents within the area of the small and medium-sized industries in Italy that export abroad. Mastering the art of translating, and not only from a simply practical point of view, is a necessity felt ever more deeply by the industrial world, where foreign trading is becoming an essential part of the economy. The modern post-industrial society lives on communication, and documentation constitutes the most important information vehicle; a document able to concretely inform the specialised reader in fact transforms the latter into a confident and knowledgeable user.
In the case of technical document translation, the fast evolution of specialist languages makes dictionaries obsolescent and terminologically inadequate - and these dictionaries are the traditional sources of reference, often still the translator's only working tools. However, in spite of the uncertainties of dictionaries, the scarcity of alternative reliable sources means that they are often regarded as gospel, with predictably poor results.
What makes a technical-scientific document hard to translate is mostly the lack of sure definitions and reliable terminological sources and references. Furthermore, terminological work presents the typical difficulties of a strongly comparative and relational activity, and terminological analysis constitutes the initial and primary part of a technical translator's work. The quality of the terminology employed in a technical document is determined by its level of definition and influences the degree of uniformity and coherence achievable, and thus the degree of ambiguity in the text. Jargon expressions are often used indiscriminately, conforming most of the time only to realities within their own context, precisely because they are used in an inappropriate or incorrect manner.
A terminological data bank can make available to users (even non-advanced ones) a tool of easy and prompt consultation, yet at the same time efficient and exhaustive, complementing traditional references and flexible in its updating when employed in systematic translation. The small and medium-sized enterprises in Italy, and especially in Emilia Romagna, are ripe for a proper introduction to advanced systematic translation and specialised data banks, and would greatly benefit from them if educated to understand their principles and advantages. The author expresses his interest in any suggestions originating from the TELRI Seminar, and would be ready to disseminate any relevant information to the industrial world of Emilia Romagna, keeping in mind also the current legislation relating to linguistic requisites within the EEC.
It is clear that, besides competences in the foreign language (certainly not only of grammatical and lexical nature), a translator would need to possess specific skills of five broad orders:
1) encyclopaedic knowledge of the topic treated;
2) capacity to identify and manipulate concepts;
3) knowledge of textual strategies;
4) expressive capacities of writing in the target language;
5) capacity to manipulate transcultural phenomena.
To the author's knowledge, no complete didactic programme exists that articulates these five competencies in a progression and thus satisfies the current autonomous needs of Italian SMEs. Educational institutions give priority to training in foreign languages, both in university curricula and in translation/interpreting schools, complemented by notions of "general culture", the "civilisation" of the country of origin and the study of international organisations. All this is quite inadequate in the light of the translating problems currently existing in the field, and the result basically is that apprentice translators still deem knowledge of a foreign language the decisive factor.
It is a fact that the skills for translating are not acquired solely by learning the foundations of a foreign language; solid linguistic knowledge should be complemented by a good acquaintance with the topic to be dealt with, as well as a noticeable dose of precision and creativity. Writing, in fact, is still an arduous and demanding task, and acquaintance with the subject is fundamental to the elaboration of technical documents, because on it depends the capacity to properly transfer technical-scientific information.
Documentation is the interface between user and product or service, and should enable the user to make use of it, not only by transmitting information but also by matching the product's or service's functions with the user's needs and expectations; it should thus constitute an integrating element of the product.
Within this perspective, the paper asserts that a translation should be deemed a product which is modular and functional with respect to the product it integrates: the result of an independent activity, yet an essential and inseparable part of the product and its generating process. Therefore, perfect transposition cannot be achieved outside the product with which it is associated, and each process phase should be guided by the requisites of the final product. Similarly, the operators of such transposition ought to be deemed part of an integrated group, and their activity should be taken into consideration during the product planning stages as well as during the stages defining the production cycle and product life, so that the different versions of the product may follow the same evolution as the original product.
The third Chapter discusses at greater length some points excerpted from the statistical research conducted by COFIMP in April 1996: it appears evident that there is a great need for appropriate technical translation systems and expertise among the SMEs of Italy, which are constantly expanding their export markets, now including eastern European countries and Asia. Yet there is a reluctance on their part to approach the problem - since it is a problem, given the SMEs' lack of adequate translating structures - in a serious and professional way, because thus far the SMEs have not really been informed of the resources currently available. COFIMP's research has shown, however, that the SMEs have a clear concept of what they would require of a "translator", should they accept an in-house presence (be it physical or "electronic") instead of randomly subcontracting their language requirements to unreliable local translation agencies. This Chapter should be read in conjunction with the actual COFIMP Survey, available in hardcopy during the course of the Seminar.
Form and Sense Relations as Seen Through Parallel Corpora.
University of Joensuu
Savonlinna School of Translation Studies
The paper starts from the position that translated texts constitute a valuable component of any representative corpus of a natural language; the contrast between languages is seen as relevantly embodied in the practices of bilingual users. The particular focus of the paper is on the value of parallel corpora for contrastive language study in the light of a corpus of English texts and their translations into Finnish. By taking a single common lexical item as a point of departure (the lemma think), the paper shows that the translation equivalents in the corpus have a different profile for each of the forms of think. The target language equivalents provided by professional translators in real contexts can thus be seen as reflecting the sense profiles of the source language word forms. This finding throws doubt on the common practice in contrastive analysis of taking the equivalence between lemmas as the basis of comparison, and, by extension, on the usual practice of compiling bilingual dictionaries.
In addition to reflecting the source language, the juxtaposition of a source language and translations also allows insights into the target language: for example, the study reported here discovered certain delexicalised uses of a major Finnish equivalent of thought. A parallel corpus can thus be seen as a unique source of insights into both the languages concerned; as well as offering material for developing hypotheses for further testing with monolingual corpora, it also provides a data-driven starting-point for contrastive analysis.
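The per-form equivalence profile described above is trivial to compute once word alignment is available; the English/Finnish pairs below are illustrative examples, not data from the study.

```python
from collections import Counter, defaultdict

def form_profiles(aligned_word_pairs):
    """Given word-aligned (source form, target equivalent) pairs,
    count which target equivalents occur for each source word form.
    Word alignment itself is assumed to be done upstream."""
    profiles = defaultdict(Counter)
    for source_form, target_equiv in aligned_word_pairs:
        profiles[source_form][target_equiv] += 1
    return profiles

# Illustrative aligned pairs for forms of the lemma "think":
pairs = [("think", "ajatella"), ("thinks", "ajatella"),
         ("thought", "ajatus"), ("thought", "mielestä"),
         ("thinking", "ajattelu")]
```

Distinct distributions per form (here, "thought" splitting between a noun equivalent and a delexicalised one) are exactly the kind of evidence the paper cites against comparing languages at the lemma level.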
Linguistic Modeling Laboratory,
Bulgarian Academy of Sciences,
The paper describes the system MARK-ALISTeR for automatic alignment and search of translation equivalents in large bilingual corpora. In MARK-ALISTeR, the Gale-Church algorithm was chosen as the alignment procedure for parallel texts, and Ted Dunning's method based on likelihood ratios was adopted for the search for translation equivalents. Special attention is paid to the extension of the system for finding exact translation equivalents of words and phrases. This implementation is related to the BILEDITA #790 Copernicus'94 Joint Research Project, in which a French-Bulgarian bilingual terminological dictionary was automatically extracted from parallel legal texts. An evaluation of the results of the search for translation equivalents is presented.
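Dunning's likelihood-ratio score for a candidate source/target word pair can be sketched from a 2x2 cooccurrence table over aligned segments; this is a minimal textbook version, not the MARK-ALISTeR implementation.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table:
    k11 = aligned segment pairs containing both the source word and the
    candidate target word, k12 = source word only, k21 = candidate only,
    k22 = neither. High scores suggest translation equivalence."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    n = k11 + k12 + k21 + k22
    return 2.0 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                  - xlogx(k11 + k12) - xlogx(k21 + k22)
                  - xlogx(k11 + k21) - xlogx(k12 + k22)
                  + xlogx(n))
```

Under independence the score is near zero; strongly associated pairs (words that consistently occur in each other's aligned segments) score high, which is what makes the statistic useful for ranking candidate equivalents.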
Göteborg, Sweden
It goes without saying that parallel texts must offer a wealth of information that can be used in a translation context. Much energy is spent on isolating translation equivalents for words from the general language and technical terms. Most of these approaches get us quite far, but rarely far enough that we can say for sure exactly what is equivalent to what below the level of a "sentence." These problems are continually supplying us academics with the raw material for further research and investigation.
In the meantime, the translators are waiting for the simple tools we promised them. This presentation will exemplify how preliminary results of research projects are lifted out of their academic context and combined with tools that are already on the market in order to offer translators assistance over and beyond what is available in paper format. This is illustrated by showing how preliminary results from the parallel text project in Gothenburg have been integrated with MULTITERM, a commercial terminology management system from TRADOS.
The main points to be dealt with are the implications that corpus-based multilingual lexicography has for the structure of the lexical database, and how we implement some tentative results of collocational studies in a production environment using MULTITERM for students from the translator training programme at Göteborg University.
Center for Information Research,
Moscow State University,
The bilingual Thesaurus on Modern Life in Russia is part of the Information System RUSSIA project. The Thesaurus is being developed as a component of the NLP technology and serves both for indexing and categorizing full-text documents and as a search instrument. It is being translated into English in order to enable foreign specialists to query IS RUSSIA, and as a tool for searching English-language documents on Internet sites and producing their index and event categorization in Russian. The Thesaurus incorporates more than 30,000 linked entries (including a geographical part of 7,000 entries) and is being created by the joint work of programmers, linguists and experts in the social, political and economic sciences. 150 MB of Russian political texts were processed in semi-automatic mode to produce thesaurus entries. The translated Thesaurus is comparable to the most sophisticated ones: the Legislative Indexing Vocabulary of the US Congressional Research Service, L.C.; the LegiSlate Thesaurus; the United Nations Thesaurus; the WestLaw Thesaurus; and EUROVOC (the thesaurus of the Commission of the European Communities). It is also arranged to meet the standards set by UNESCO to ensure its international compatibility.
The Information System RUSSIA (IS RUSSIA) is an integrated computer-based information resource for Internet access to data and documents on government and politics in the Russian Federation. The IS RUSSIA project initially pursued the main goal of creating a free computer-based library for general public access, functioning as a data archive for research and education in the human sciences. A special part of the project, the NLP technology, provides for automatic processing of large amounts of data and for value-added (analytical) services. This component is especially important for human studies, given how large a volume of information (including full-text documents) has to be processed daily to monitor and analyze social developments. Another special part of the project is the bilingual complex, which includes a friendly interface and help screens, developed search tools, and abridged versions of the reference databases. The thesaurus-based search tools allow advanced query expansion based on the concept relationships encoded in the thesaurus. This makes the search more intelligent, efficient, rational, and time- and cost-saving.
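Thesaurus-based query expansion of the kind mentioned can be sketched as follows; the entry shown is a hypothetical fragment invented for illustration, not taken from the actual Thesaurus.

```python
# Hypothetical thesaurus fragment: term -> typed concept relations.
THESAURUS = {
    "privatization": {
        "broader": ["economic reform"],
        "narrower": ["voucher privatization"],
        "related": ["state property"],
    },
}

def expand_query(term, relations=("narrower", "related")):
    """Expand a search term with the concepts linked to it in the
    thesaurus, following the requested relation types."""
    expanded = [term]
    entry = THESAURUS.get(term, {})
    for relation in relations:
        expanded.extend(entry.get(relation, []))
    return expanded
```

A query for a broad concept thus automatically retrieves documents indexed under its narrower and related concepts, which is what makes thesaurus-backed search more effective than plain keyword matching.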
The IS RUSSIA project is being developed by a non-commercial organization - the Center for Information Research and is housed at the Scientific Computer Center of the Moscow State University. Financial support was provided by foreign charitable funds, Russian government and scientific funds: the MacArthur Foundation, USA, (1993, 1995, 1996), the Ministry of Science of Russia (1995, 1996), the Ford Foundation, USA, (1996), the Russian Fund for Fundamental Research (1997), the Russian Humanitarian Scientific Fund (1997). Two specialists working with the team have received individual grants from the Soros Foundation in 1995 and the MacArthur Foundation in 1997.
IS RUSSIA has been available on the Internet (http://terminus.srcc.msu.su) since April 1997. User access is currently limited by hardware capabilities.
On the TELRI Newsletter
Eva Hajičová, Barbora Hladká
The main task of the working group "Newsletter" was to prepare and publish the TELRI Newsletter at regular intervals (three times per year) to inform the academic community, their industrial partners and also prospective users about the activities of the individual TELRI working groups, about available resources and about methods for their processing.
The first issue of the Newsletter was printed and distributed for the September 1995 Tihany meeting. In this issue readers could become acquainted with the Trans-European Language Resources Infrastructure, i.e. with the TELRI partners, working groups and planned TELRI events.
The second issue (December 1995) was devoted to the first European Seminar, "Language Resources for Language Technology", held in Tihany, Hungary. Demonstrations of NLP systems of the most varied kinds were among the most interesting parts of the Seminar. No. 2 brings short descriptions of the demonstrations and contributions devoted to some joint ventures.
In 1996, issues No. 3 (June 1996) and No. 4 (October 1996) were put together, edited and printed. In No. 3 we introduced a new column called "topic of this issue"; the first topic discussed was "syntactic tagging". In No. 4, we continued the discussion of "syntactic tagging". The Nancy workshop was an example of joint activities between the members of the working groups; some workshop participants' remarks are presented on the pages of that issue.
The contents of No. 5 (April 1997) described mainly the results of the Ljubljana workshop, which concentrated on the work on the electronic text version of the sample text, Plato's "Republic".
The second European Seminar, "Language Applications for a Multilingual Europe", was held in Kaunas, Lithuania. Newsletter No. 6 (August 1997) focused mainly on descriptions of the demonstrations in Kaunas. As in No. 5, lexicons were the core theme of the issue.
The present issue (No. 7) is the last issue of the TELRI Newsletter and is devoted mainly to the third European Seminar, "Translation Equivalence - Theory and Practice", which takes place in Montecatini, Italy. The main topic of the seminar, multilingual aspects of corpus processing, is reflected in most of the contributions to this issue.
We would like to acknowledge the efforts of all contributors who made our work easier. We hope that the Newsletter has been a useful and functional link between the partners and the language engineering community.