Trans-European Language Resources Infrastructure - II


TELRI Newsletter No. 8




It is a pleasure and a privilege, on behalf of the Coordinators of TELRI II and the TELRI Association Board, to welcome all of you, the TELRI II Project partners, to another three-year run of TELRI's joint ideas, work and people. As you know, the Project will reach into the next millennium, ending in 2001. This time, the number of participants is even greater, drawn from 21 countries, which makes this Concerted Action programme truly pan-European and one of the greatest research events in Europe. The work, centred on five Work Packages, each subdivided into various Work Groups, will be as diverse as your needs and will, no doubt, produce impressive results to match those of its predecessor, TELRI (I). To quote from Annex I, our joint task and primary objectives will be:

  • to strengthen the Pan-European infrastructure for the multilingual language research and development community;
  • to collect, promote, and make available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge;
  • to offer a customized comprehensive service to academic and industrial users;
  • to prepare and organize research and development projects focusing on translation aids, multilingual authoring systems, and information retrieval;
  • to provide a forum where experts from academia and industry share and assess approaches, and engage in joint activities;
  • to make available the expertise of its partner institutions to the research community, to the public, and to language industry.

I hope our joint efforts will lead us to better mutual cooperation, valuable insights, papers, results and, ultimately, projects. In doing this, I wish all of us every success.

Professor Dr. František Čermák
The Institute of the Czech National Corpus
Faculty of Arts, Charles University, Prague



Wolfgang Teubert
Corpus Linguistics - A Partisan View

Corpus Linguistics as a Theoretical Approach

Corpus linguistics contributes to the acquisition of empirical linguistic knowledge by combining three methods: the automatic extraction of language data from corpora, the processing of this output by mostly statistical procedures, and finally the validation and interpretation of the processed data. While the first two steps are, or should be, fully algorithmic, the third step involves intentionality and human reasoning.

Corpus linguistics is based on the concept that language is a fundamentally social phenomenon, which can be observed and described first and foremost in the empirical data readily available, that is, in communication acts. Corpora are cross-sections of a discourse universe comprising all communication acts. The texts they monitor are principally transient communication acts.

To view language as a social phenomenon entails that we do not know or strive to know how the speaker or the hearer understands the words, sentences, or texts they say or hear. As a social phenomenon, language manifests itself in texts, texts that can be observed, recorded, described, and analyzed. Internal, mute texts are also texts, but they cannot be observed and are, therefore, not a social phenomenon. Other texts, as well, do not, by their origin, qualify as communication acts, like utterances made in isolation, be it in spoken form (e.g., soliloquies) or in written form (e.g., diary entries). Most texts occur as communication acts, that is, as the interactions between members of a language community. The universe of discourse is, in principle, composed of all the communication acts of a language community. For a long time, the large majority of these acts occurred unrecorded. For classical Greek, all recorded texts extant today are not too large for today's corpus tools. They easily fit into a manageable corpus. But for our age with its exponential growth of recorded communication acts (on magnetic tapes or in written form), we cannot aspire to catch a full glimpse of the discourse universe. The totality of what is recorded and accessible is too large for analysis. It must be condensed down to a sample in view of the special phenomena which are to be described. It is the task of the linguist to define and delimit the scope of the discourse universe she or he is interested in in such a way that it can be reduced to a corpus. Parameters can be language, time segment, region, situation, external and internal properties of texts, and many others.

By instituting expert arbitration or by exercising its own sovereignty, the language community has come to agree on what makes a sentence grammatically correct and on what it means. Such accords are as dynamic as they are implicit; they are not cast in stone. Each communication act can introduce a new syntactic structure or syntagmatic pattern, can add a new word or a new collocation, can suggest a new semantic interpretation. These alterations can modify the existing consensus if enough other communication acts follow suit. Natural language's unique capacity to accommodate metalinguistic statements also helps to keep language dynamic. Existing rules can be discussed, contested, and rejected, and new rules can be propagated in communication acts. Meanings, in particular, are negotiable (one reason why classical linguistics finds it difficult to deal with them) in spite of all the efforts of lexicographers to standardize them. The apparent durability of morphological and syntactic rules shows more the impact of language teaching and of the social importance attributed to them than the workings of a language organ. If there are immutable language laws, they are of little interest here because they are universal and would be found in all languages. Corpus linguistics describes single natural languages, not universal linguistic properties.

As we cannot access other people's minds (or even our own), we do not know if and how the linguistic conventions of a language community are inscribed there. We find them only in communication acts, in texts. Dictionaries, grammars, and language course books are part of the discourse universe; and in as far as they represent socially accepted standards, we must give them fair credit for their privileged position. But the account they provide is neither comprehensive nor always factual. Corpus linguistics therefore looks at the full picture to discover the conventions of a language community. For corpus linguistics, language is a virtual, dynamic phenomenon that can only be captured and made accessible in the form of samples of a discourse universe of texts.

Corpus Typology

A corpus is an electronic collection of texts in a uniform representation. Corpora must be appropriate for the task at hand, and their composition must be well-founded. One (arguable) special case of a corpus is the reference corpus, which should in principle be "saturated" with reference to selected parameters. "Saturation," a notion introduced into corpus linguistics by Cyril Belica, is a statistical property of the type-token ratio; it is founded on corpus-internal rather than corpus-external features. A corpus, divided from beginning to end into segments of equal size (measured in tokens), is saturated with respect to lexical items if a new segment of the same size added to the corpus yields about the same number of new lexical items (defined as tokens) as the previous segment; that is, it is saturated when the growth of lexical items has stabilized and the curve of lexical growth has become asymptotic. Saturation is a property that can actually be measured; therefore, it is superior to concepts such as representativity or balancedness, which cannot be operationalized.
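The measurement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Belica's procedure: the function names, the 10% tolerance, and the decision to count a lexical item as "new" when its word form has not occurred in any earlier segment are all hypothetical choices made for the example.

```python
def new_items_per_segment(tokens, segment_size):
    """Count, for each successive equal-size segment, the items not seen before."""
    seen = set()
    growth = []
    # Only full segments are used, so every count refers to the same number of tokens.
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        new_items = set(segment) - seen
        growth.append(len(new_items))
        seen |= new_items
    return growth


def is_saturated(growth, tolerance=0.1):
    """Assumed criterion: the last segment adds about as many new items as the one before."""
    if len(growth) < 2:
        return False
    previous, last = growth[-2], growth[-1]
    return abs(last - previous) <= tolerance * max(previous, 1)


tokens = "a b c d a b e f a b e g a b e h".split()
growth = new_items_per_segment(tokens, segment_size=4)
print(growth, is_saturated(growth))
```

In practice the segments would hold many thousands of tokens, and the tolerance would have to be justified for the corpus at hand.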

Opportunistic corpora are corpus collections from which linguists can set up the corpus they need for the research they want to carry out. Monitor corpora, introduced by John Sinclair, have a diachronic dimension; they document language change. Therefore, they must have the same composition for each time period, related to selected and well-founded text-external characteristics. Comparable corpora are multilingual corpora of similar or identical composition, relative to selected parameters. Parallel corpora are also multilingual corpora which consist of texts in the original language and their translations in target language(s). Reciprocal parallel corpora are multilingual corpora which contain, for all languages included, original texts as well as their translations into all the languages included.

Corpus Linguistics and Language

Corpus linguistics provides a methodology to describe language under new aspects. This methodology is realized as a set of computational, fully or partly automatic procedures to extract language data from corpora and to process them further for an intellectual analysis.

The new aspect that corpus linguistics contributes to classical linguistics is the notion of intratextual and intertextual co-occurrence of textual elements. As long as it was not possible to analyze large amounts of language data in a systematic, procedural way, there was no way to capture the complex relations of co-occurrence between text segments (e.g., words) other than by syntactic rules. These describe the behavior of classes of isolated elements (e.g., nouns) in relation to other classes of isolated elements (e.g., attributive adjectives). Collocations consisting of two or more lexical elements and exhibiting specified co-occurrence properties corresponding to some kind of semantic coagulation cannot be exhaustively described by syntax rules; nor does lexicology provide an operational conception that would allow their identification. Only corpus linguistics is methodologically equipped to deal with these co-occurrence phenomena. Collocation dictionaries in the true sense are corpus-based. Before corpus linguistics, dictionaries and lexica were, in principle, oriented towards single, isolated lexical units. They did not strive for a systematic coverage of semantic coagulations, and they did not fully acknowledge the impact of context on the meaning of words in texts.

For corpus linguistics, words should be analyzed in their contexts. Words are first and foremost text elements and not entries of a lexicon or a dictionary. Corpus linguistics does not endeavor, as classical lexicography does, to abstract all essential semantic features of words from their occurrence as text elements and to describe their meaning in splendid isolation, and it does not view the single word (often defined as a chain of characters uninterrupted by blanks) as its main research focus. Instead, corpus linguistics is centrally interested in text segments, which can consist of one or more text elements, which exhibit their cohesion by a specific co-occurrence pattern and which are embedded in contexts.

The embeddedness of text segments in their surrounding contexts is essential for corpus linguistics. A major part of our general language vocabulary consists of lexical items with fuzzy meanings, words like sorrow or shame, and it is only in connection with their contexts that these words acquire a concrete sense. From the frequency data for the items that make up the context, statistical procedures can generate context profiles that help to group similar contexts together. A classification of text segments based on their context profiles seems to correlate better with our intuitive, practice-driven perception of word meaning than those (often futile) attempts to identify a text word with one of the senses offered by the dictionary.
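A context profile of this kind can be approximated very simply. The sketch below assumes a tokenized corpus held in a list and a fixed window of neighboring tokens; both the function name and the window size are illustrative assumptions, and a real procedure would normally weight the raw frequencies statistically.

```python
from collections import Counter


def context_profile(tokens, target, window=3):
    """Frequency of the items found within `window` tokens of each occurrence of `target`."""
    profile = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        profile.update(left + right)
    return profile


tokens = "deep sorrow filled the house and deep sorrow remained".split()
print(context_profile(tokens, "sorrow", window=2).most_common())
```

Profiles built this way for different occurrences of a word can then be compared in order to group similar contexts together.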

Corpus linguistics focuses our attention on the embedding of words in their contexts as well as on text segments larger than the single word: multi-word units, collocations, and set phrases oscillating between stability and flexibility. Co-occurrence patterns are the output of statistical analysis; the research task is to validate and to interpret this output, relying on our human understanding of texts and text segments. Thus, corpus linguistics can close the gap between syntax and the lexicon. The individual co-occurrence patterns of textual elements from which the language data are processed are candidates for semantic coagulations.

Corpus Linguistics and the Minimal Assumption Postulate

Any systematic exploration of corpora with the goal to generate new linguistic knowledge must depend on theoretical premises, categories that have proven to be useful to classify these data. Preconceptions entered into corpus analysis run the risk, however, of distorting the data by creating a self-fulfilling corroboration of underlying assumptions. In the early nineties, John Sinclair introduced the principle of minimal assumption to filter out this kind of unwelcome bias. It postulates that any corpus analysis must be based only on those premises that have no influence on the research target. If the goal is to find a context-driven distinction between participles used as attributes and attributive adjectives, we cannot rely on standard POS-tagging as it would anticipate the results we wanted to obtain. If, on the other hand, our research target is a list of collocation candidates of the type adjective+noun, POS-tagging would be useful and would not bias the research target.
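The second case, where POS-tagging serves the research target without biasing it, might look as follows. The sketch assumes that the corpus has already been tagged by some external tool and is available as (word, tag) pairs; the tag names "ADJ" and "NOUN", the function name, and the frequency threshold are hypothetical.

```python
from collections import Counter


def adjective_noun_candidates(tagged_tokens, min_count=2):
    """List adjacent adjective+noun pairs that occur at least `min_count` times."""
    pairs = Counter()
    for (word1, tag1), (word2, tag2) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag1 == "ADJ" and tag2 == "NOUN":
            pairs[(word1.lower(), word2.lower())] += 1
    # The result is a list of candidates only; validation remains a human task.
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


tagged = [("strong", "ADJ"), ("tea", "NOUN"), ("is", "VERB"),
          ("strong", "ADJ"), ("tea", "NOUN"),
          ("hot", "ADJ"), ("water", "NOUN")]
print(adjective_noun_candidates(tagged))
```

The same filter would be unsuitable for the first case, because the tags themselves encode the participle/adjective distinction that is under investigation.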

The minimal assumption postulate does not, in principle, question the results of classical linguistics. There is a broad consensus on the central tenets of syntax, and this consensus is nourished by the plausibility of the basic categories and rules and by their obvious translatability from one syntactic calculus to another. Of course there are nouns, verbs, and adjectives, complements and adjuncts, inflexion rules and rules of congruency. These categories and rules have proven their usefulness over centuries. It is not the core of these categories and rules that are in doubt; it is the margins where the evidence can be presented and described in different ways and where it is possible to disagree with a particular definition. The methodology of corpus linguistics helps to sift the evidence and thus prepare the way for better informed decisions.

There is, however, one important difference between classical linguistics and corpus linguistics. The methodology of corpus linguistics extracts language data from corpora by employing algorithmic procedures, procedures that can be carried out by computers and that do not presuppose intentionality. Where intellectual analysis includes inference by analogy and inductive reasoning, computers merely carry out instructions. This may have consequences for the definition of categories and rules. Not all that is well-defined for a thinking mind is equally well-defined for a mindless computer. The minimal assumption postulate makes it possible to review traditional categories in light of the strictly algorithmic methodology of corpus linguistics.

All computational techniques for extracting and elaborating data from a corpus require, in their developmental phase, the repeated intellectual analysis of sufficiently large output samples as feedback. This is the case, for example, for all syntactic parsers. A frequent but different approach is to manually annotate a sufficiently large corpus sample and to use the annotated sample as a benchmark for the performance of the corpus tool. Here, the developmental task is to fine-tune the tool to parse new sentences along the same rules used for the annotation of the sample. In this case, the parser will be used to validate the preconceptions guiding the annotation. By parsing large corpora, it will show whether all sentences can be parsed with the underlying rules or whether there are sentences where these rules cannot be applied, where they do not fit or suffice. Validation does not tell us if the rules are appropriate; it does not enrich our linguistic knowledge. Syntactic parsing has little to do with linguistics; it belongs to natural language processing for language technology applications. Only if it is applied scrupulously, bearing in mind the minimal assumption postulate, can it serve as a filtering technique for the extraction of language data from corpora.

Research and Development

Corpus linguistics contributes to basic linguistic research by introducing a new focus on linguistic phenomena that adds new research items to the agenda of linguistics. Corpus linguistics addresses a hitherto largely unnoticed layer of linguistic analysis between syntax and the lexicon. This is the layer of statistically determined co-occurrence patterns between text elements based on the fundamental type-token relation. Thus, the central issue of corpus linguistics is the dichotomy of stability vs. change, under the diachronic aspect, under the aspect of language variation (regional, social, and functional), and, most important, under the aspect of semantic coagulation. Just as chaos theory tries to grasp phenomena which defy description by traditional laws of nature, corpus linguistics helps us to correlate co-occurrence patterns with patterns of semantic coagulation and determination, that is, to project an order where we formerly only saw the random emergence of syntactic constituents and the unpredictability of lexical insertion.

Corpus linguistics thus paves the way for a new generation of language technology, including information retrieval and knowledge extraction. A virtual corpus of Internet documents belonging to a particular knowledge domain, structured to reveal temporal change and theoretical, regional, and functional variation, can be subjected to computational procedures extracting highly relevant information like theoretical innovation, the emergence of new ideas, and paradigmatic change. The single word level is not sufficient to detect text sequences indicative of novelty, not even if combined with syntactic analysis. Information is contained in items of semantic coagulation and contextual determination. Whenever new ideas are formulated, they co-occur with text segments signifying a metalinguistic level, text segments like "we call A from now on X" or "in this presentation Y is used to denote B". The text segments A, B, X, and Y are rarely single words (neologisms); they are commonly semantic coagulations. Therefore, most of the computational procedures in this field of information technology will have to combine classical approaches in language technology like syntactic analysis and conceptual ontologies with the pattern-oriented dynamic co-occurrence analysis of corpus linguistics.
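As a rough illustration of how such metalinguistic text segments could be located, the following sketch matches the two example formulations with regular expressions. The pattern list and the function name are assumptions made for this example; a usable inventory of metalinguistic frames would have to be derived from corpus evidence, and the spans captured for A, B, X, and Y are only candidates.

```python
import re

# Hypothetical frames modelled on the two example formulations in the text.
# Each captured span is limited to one clause (no period, semicolon, or comma).
METALINGUISTIC_PATTERNS = [
    re.compile(r"\bwe call (?P<a>[^.;,]+?) from now on (?P<x>[^.;,]+)", re.IGNORECASE),
    re.compile(r"\bin this presentation (?P<y>[^.;,]+?) is used to denote (?P<b>[^.;,]+)",
               re.IGNORECASE),
]


def find_metalinguistic_statements(text):
    """Return the named spans of every match of the assumed frames."""
    hits = []
    for pattern in METALINGUISTIC_PATTERNS:
        for match in pattern.finditer(text):
            hits.append(match.groupdict())
    return hits


sample = ("We call such recurrent patterns from now on semantic coagulations. "
          "In this presentation text segment is used to denote any unit of analysis.")
for hit in find_metalinguistic_statements(sample):
    print(hit)
```

Note that the captured spans are multi-word segments, in line with the observation that A, B, X, and Y are rarely single words.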

Corpus linguistics also provides a new approach to machine-aided translation. By extracting translation units and their corresponding translation equivalents from parallel corpora, it can offer reliable candidates for translating 98% of the semantic units in an average general language text that have been translated before. The short history of corpus linguistics has demonstrated that there is a remarkable synergy between the aspect of basic research and the aspect of developing operational applications in language technology and what is called knowledge extraction.
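One simple way to obtain candidates for translation equivalents from a parallel corpus is to score co-occurrence across aligned sentence pairs. The sketch below uses the Dice coefficient, which is an assumption of this example and not a method named in the text; it also simplifies by treating single tokens as translation units, whereas real units are often multi-word segments. The function name and the sample data are hypothetical.

```python
from collections import Counter


def translation_candidates(aligned_pairs, source_item, top_n=3):
    """Rank target-language items by their Dice coefficient with `source_item`.

    `aligned_pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    """
    source_count = 0
    target_counts = Counter()
    joint_counts = Counter()
    for source_tokens, target_tokens in aligned_pairs:
        target_set = set(target_tokens)
        target_counts.update(target_set)
        if source_item in source_tokens:
            source_count += 1
            joint_counts.update(target_set)
    scores = {
        target: 2 * joint / (source_count + target_counts[target])
        for target, joint in joint_counts.items()
    }
    # Highest score first; ties are broken alphabetically to keep the output stable.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]


pairs = [
    (["the", "house"], ["das", "haus"]),
    (["the", "garden"], ["der", "garten"]),
    (["a", "house"], ["ein", "haus"]),
]
print(translation_candidates(pairs, "house"))
```

The ranked list only proposes candidates; whether a candidate is an acceptable equivalent in a given context remains a matter of human judgment.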

Corpus Semantics

Linguistics deals with language. The main function of language is to mean. Words, sentences, and texts represent meanings. It is the core task of linguistics to explain and describe how language means. After tremendous advancements in phonology, in syntax, and in many other areas of linguistics, there is still no sound theory of semantics. Semantics is the main challenge of linguistics.

Corpus linguistics studies language as a social phenomenon. It is not possible to know (and it makes no sense to strive to know) how the speaker and the hearer understand the words, sentences, and texts they say or hear. Understanding is a psychological, a mental phenomenon. It is a central issue of psychology and of cognitive linguistics, in particular. As a social phenomenon, language manifests itself in texts, and only there. However, if understanding words, texts, and sentences is a psychological phenomenon, how can linguistics deal with semantics?

Sentences and words are basic concepts in classical linguistics. Corpus semantics looks at language from a new point of view. Its focus is on the intermediate layer between the lexicon and syntax. Corpus semantics is interested in semantic coagulations, composed of smaller parts and forming some kind of unit which may be more or less stable, where stability means a high measure of recurrence of a pattern and of the majority of the parts which comprise the pattern. Corpus semantics has provided evidence that such coagulations are at least as dominant as the single word that classical linguistics has focused upon. Compounds, multi-word units, collocations, and set phrases deserve as much attention as the single lexical item. This is why, in this paper, I do not talk about words but about text segments and their parts.

Unlike understanding, meaning is a linguistic phenomenon. The meaning of a text segment comprises the history of its previous occurrences, and this includes everything that has been said there about its meaning and about the meaning of the parts it is composed of. We have to distinguish two types of occurrences. Most of them are on the object level: text segments are being used within their contexts; they are not an object of reflection. A statistical analysis of the contexts in which these text segments occur will yield context profiles correlating with different usages. This data can be interpreted; that is, it can be elaborated into lexical knowledge. Different usages represent different meanings. More important are occurrences of text segments as object of reflection, that is, metalinguistic utterances; here we find statements like: A should rather be called B, B does not mean the same as A, some people talk of A when they mean B, etc. Metalinguistic statements let us view the discourse universe as a network of (mostly implicit) references, but not cross-references: all references point to prior communication acts. This shows the diachronic dimension of the discourse universe. The discourse universe also includes dictionaries. They try to capture the essence of chiefly single word meanings by isolating their essential semantic features (questionable as they may be) from what traditional lexicographers take to be the background noise of the context. In some language communities more than in others, these definitions enjoy a privileged status among all other metalinguistic statements on text segments in as far as they are accepted as devices for language standardization. All of these textual data on the object level and on the metalinguistic level taken together constitute meaningful material; indeed, they constitute the meaning. 
But the meaning is only accessible once it is condensed and interpreted, that is, paraphrased into a text (traditionally called a definition). The interpretation becomes part of the language universe, and subsequent texts can refer to it.

To paraphrase and to interpret textual data is a human task. It presupposes intentionality and, being an action, it cannot be emulated by a computational process. As described above, it should be clear that the meaning of a textual element is nothing fixed or stable. New occurrences on the object level in new contexts can introduce a new semantic potential, and existing interpretations can always be superseded by new ones. The default value for these interpretations is not correctness (which could not be ascertained) but their acceptance by the language community.

In the sequence of texts constituting the language universe, we can observe that text segments (and their elements) are resumed in the same text and in subsequent texts. They are cited, paraphrased, reformulated, interpreted, ascertained, or repudiated. With the exception of neologisms, they have had a long history of being used and being referred to. The discourse universe presents itself as a diachronically ordered network of textual segments.

This notion of semantics, which we shall call corpus semantics, does entirely without psychology or cognitive semantics. Corpus semantics is not interested in mental concepts or in beliefs people have; it is not concerned with the mental act of understanding a text or a text segment. All evidence we may have concerning mental representations comes, after all, from hearsay, not from direct observation; and this kind of evidence is not readily admissible into empirical linguistics. What corpus semantics deals with are only verbalizations of what may be mental representations. Such verbalizations are texts, and therefore part of the language universe.

Corpus semantics holds that the concrete meaning of text segments can only be derived from the context in which they occur. However, this is true only for general language text segments and not for terminological units occurring in a domain-specific language. In theory, terminological units do not have a meaning; rather, they designate a concept that is defined in a language-neutral way and has a unique position within a conceptual ontology. It seems that for the part of natural language vocabulary denoting what are now commonly called natural kinds, and for similar words, interpretations of meaning often include nonverbal deictic acts. The meanings of words like tiger, elm, shell, etc. are combinations of textual data and deictic acts involving sensory perception; some dictionaries honor this facet by incorporating illustrations in their definitions.

Corpus semantics rejects the view that the meaning of a text segment (or a sentence or a text) exists independently of its expression, for example, in the mind of the speaker or the hearer. Meaning and expression belong together; they are, like photons as corpuscles and as waves, nothing but different aspects of one entity, in our case the linguistic symbol. Most cognitive linguists agree that mental concepts are symbols, as well, and these concepts, too, cannot exist as pure meaning; they must have a form or an expression. There is no meaning without form, either in language or in the mind. Corpus semantics, therefore, rejects the view that the speaker encodes a meaning (the message) into the language and that the hearer decodes this message from the utterance. For corpus semantics, any text, text segment, or part of a text segment can be viewed under two aspects: form and meaning. Form and meaning cannot be separated, either in a linguistic sign or in any other symbol.

Contexts can be analyzed under three aspects: rules, lists, and frequency. Corpus linguistics uses them all. The occurrence of a noun phrase preceded by a particular preposition in the close context of an ambiguous segment can be indicative of one of several senses. Here it is the rules of syntax that are used for disambiguation. The occurrence of a particular word like briefcase in the context of the text segment diary may indicate that it is not the book under the pillow but the book with spaces for each day of the year to jot down appointments. Here the lexicon is used for disambiguation. The lexicon covers the aspect of lists. Corpus linguistics thus makes use of the knowledge elaborated by classical linguistics, while paying heed to the minimal assumption postulate. The rule-based and the list-based approach are fundamental for the first step of corpus methodology: the extraction of data.

The second step is the calculation of context profiles for text elements. This is done by statistical procedures that measure co-occurrence data for text elements found in the contexts of all occurrences of a text element in a corpus. Looking at the context, we often find that not a single word but rather a specific co-occurrence pattern disambiguates different usages or meanings of a text segment like green action as either a political, environmental activity or as, let us say, an artistic happening. Statistics deals with the aspect of frequency and its impact. All three aspects together will help to disambiguate the meaning of a text segment. This shows us that word (or better: text segment) sense disambiguation often can be achieved automatically. But we can never be sure that output from any set of computational operations, well designed as they may be, will correlate with human understanding. Strictly speaking, automatic corpus analyses can only yield candidates for sense disambiguation. Whether two occurrences of the same text segment have the same meaning or different meanings is always a matter of interpretation. Well-designed tools may achieve a 98% correlation, and such a result is certainly good enough for most language technology applications. But we have to remain aware of the theoretical difference in status.
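The comparison of context profiles can be illustrated with a cosine similarity between frequency vectors. This is a minimal sketch under stated assumptions: cosine similarity is one common choice and is not prescribed by the text, and the reference profiles for the two usages of "green action" are invented for the example. In keeping with the caveat above, the function returns ranked candidates, not a decision.

```python
import math
from collections import Counter


def cosine(p, q):
    """Cosine similarity between two frequency profiles (Counter objects)."""
    dot = sum(p[key] * q[key] for key in p if key in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def rank_usages(occurrence_context, usage_profiles):
    """Order the usage labels by similarity to the context of one occurrence."""
    profile = Counter(occurrence_context)
    scored = [(cosine(profile, reference), label)
              for label, reference in usage_profiles.items()]
    return sorted(scored, reverse=True)


# Invented reference profiles for two usages of "green action".
usage_profiles = {
    "political": Counter({"party": 3, "environment": 2, "protest": 2}),
    "artistic": Counter({"gallery": 3, "paint": 2, "performance": 2}),
}
print(rank_usages(["protest", "environment", "march"], usage_profiles))
```

Whether the top-ranked candidate really matches a reader's understanding of the occurrence is exactly the question that the procedure itself cannot answer.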

For corpus semantics, the meaning of text segment types is documented in the history of the corresponding text segment tokens. The finer details of meaning, for example, semantic shades, deontic and pragmatic aspects, and particularly the semantic features discussed controversially by the language community in metalinguistic statements, can be extracted from the contexts, condensed and paraphrased into a text that describes the meaning of the text segment. The production of such descriptions is a human activity, and it cannot be reduced to a computational process. Whatever the mental representation of a text segment may be, corpus semantics does not claim to have access to it. It only has access to the text paraphrasing the meaning.

Standard Approaches to Multilingual Semantics

The traditional approach to bilingual lexical semantics is the bilingual dictionary, or, in the case of computer applications, the bilingual lexicon. Some better modern bilingual dictionaries are almost sufficient to translate a text from a foreign language into one's native language. This does not mean that the information provided is so explicit that it can be read as a set of instructions. The bilingual dictionary offers material and gives hints how to use it; but a user must employ his or her mental faculty to draw analogies, to use inductive reasoning, and to handle metonymy, to mention just a few features. More important, she or he must, to a large extent, understand the text to be translated. Bilingual dictionaries, even the best ones, are no great help when we want to translate a text from our native into a foreign language that we do not speak well. Often we do not find enough information for choosing wisely between the alternatives offered, and even where our translation may be formally correct, it commonly differs from what a native speaker would have said. Bilingual dictionaries do not help to translate a text from a language A into a language B if the translator does not understand one of the languages quite well and the other language at least to some extent. Text understanding is a precondition of using dictionaries.

Computers do not understand texts. This is why machine translation based on dictionary-derived lexica cannot work for general language. Lexica for machine translation must not rely on the mental faculty for drawing analogies, for inductive reasoning, and for understanding nonlexicalized metaphors. They must contain explicit, exhaustive, and algorithmic instructions for selecting the proper translation equivalent for every translation unit. Such explicitness would require a translation knowledge that is nowhere in sight, and it would require that this translation knowledge can be formalized, that is, computerized. However, this is often not possible.

With the development of machine translation systems supporting more than two languages, an approach using conceptual ontologies was taken over from artificial intelligence. It had soon become evident that machine translation for general language texts was far beyond the reach of any conceivable system, and, therefore, the focus shifted to the translation of domain-specific technical language with a high saturation of terminology, where the translation issue shows many parallels with the knowledge bases of expert systems. Today machine translation works best for what is called controlled documentation language, a language similar to a formal calculus where, in principle, all syntactic and lexical/terminological ambiguity is suppressed. Translation for controlled languages indeed can be reduced to the permutation of uninterpreted symbols according to algorithmic instructions.

Both the ontological and the lexicon approach are doomed to failure when it comes to machine translation of general language texts, and no other approach is conceivable that would solve this problem. General natural language is different from controlled languages. For in the case of general language, we cannot disregard the aspect of meaning. Meaning is captured in paraphrases and interpretations; paraphrase presupposes understanding. Understanding involves intentionality, and intentionality requires consciousness. There is no alternative to human translation for general language. All we can hope for is computational support for this human task.

Multilingual Corpus Semantics and Meaning

The translation of a text in another language is a paraphrase of the original text. It embodies the meaning of the original text in the same way as a paraphrase in the same language would. If a text is paraphrased or translated by several people, not one of the paraphrases or translations will be identical. This is another indication that the act of paraphrasing or translating cannot be broken down to an algorithm, to a process.

For multilingual corpus semantics, the meaning of a text segment in language A is its translation into language B. The default is translation equivalence. The empirical basis for multilingual corpus semantics is a multilingual discourse universe consisting of all translated texts and all their translations. This virtual corpus is realized by parallel corpora. They are composed of original texts in one language and their translations in other languages. Reciprocal parallel corpora are composed of original texts in all of the languages involved with translations in all of the other languages of this set.

As in the case of monolingual corpus semantics, meaning is a strictly linguistic or, to be more precise, a strictly textual notion. Meaning is paraphrase. The complete meaning of a text segment as a semantic coagulation is contained in the multilingual discourse universe, and it is captured as the sum of all the translations of this text segment that we can find there. For corpus semantics, conceptual ontologies may be reflected in terminologies, but not in the vocabulary of general language. Conceptual ontologies have nothing to do with meaning.

The basic unit of multilingual corpus semantics is the translation unit; the unit which is translated as a whole into the corresponding translation equivalent. Translation units are the smallest units of translation. Although they may consist of many words, they are translated as a whole and not by translating part by part. Translation units correspond to the text segments or semantic coagulations of monolingual corpus semantics.

In multilingual corpus semantics, it makes sense to say that the meaning of a translation unit is its translation equivalent in another language. Such a circumscription repeats the basic tenet of corpus linguistics that semantic coagulations are not fixed units but that there is a wide range between fixed stability and absolute variability. It is a matter of interpretation. Whether a given co-occurrence of words is a translation unit or a concatenation of lexical elements will be shown in the translation. This has two consequences. What turns out to be one integral translation unit for one target language can be a straightforward concatenation of single words for another target language. Even within one target language, we may find both alternatives realized, depending on the predilections of different translators. It is really the community of translators for a given language pair who decides what constitutes a translation unit, as it is the monolingual language community who decides on what is a text segment, a coagulation.

What constitutes a translation unit, therefore, depends on the target language as well as on the common practice of translators, as it can be extracted from parallel corpora. A text segment is a translation unit only in respect to those languages into which it is translated as a whole. Translation units are not metaphysical entities; they are the contingent results of translation acts. The analysis of parallel corpora has shown that more than half of the translation units in an average general language text are larger than a single word.

The meaning of a translation unit is its paraphrase, that is, the translation equivalent in the target language. For ambiguous translation units, this means that the unit has as many senses as there are non-synonymous translation equivalents. Defining meaning in this strict sense implies that the meaning or the number of senses a translation unit has depends on the language into which it is translated. For a given translation unit of language A, we may find two non-synonymous translation equivalents in language B and three non-synonymous translation equivalents in language C. A few illustrations may not be amiss. For English sorrow, we commonly find three translation equivalents in French: chagrin, peine, and tristesse. Two of these, chagrin and peine, seem to be fairly synonymous in a number of contexts (pointing to a cause responsible for the ensuing emotion), while tristesse is the variety of sorrow which, unlike chagrin and peine, is not caused by a particular event. In German, there are also three translation equivalents for sorrow: Trauer (caused by loss), Kummer (caused by an infelicitous event, intense and usually of limited duration), and Gram (also caused by an infelicitous event, not necessarily intense, more a disposition than an emotion, but of possibly unlimited duration). These three translation equivalents are non-synonymous, and they do not map on the French equivalents mentioned above.
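The dependence of the number of senses on the target language can be illustrated with a minimal sketch. The synonym groupings are entered by hand and simply restate the judgments made in the paragraph above; the function name count_senses and the data layout are assumptions made for illustration, not part of any existing corpus tool.

```python
# Translation equivalents of English "sorrow", grouped by hand into sets of
# equivalents judged synonymous (the judgments are those given in the text).
SYNONYM_GROUPS = {
    "French": [{"chagrin", "peine"}, {"tristesse"}],
    "German": [{"Trauer"}, {"Kummer"}, {"Gram"}],
}


def count_senses(groups):
    """One sense per group of mutually synonymous translation equivalents."""
    return len(groups)


senses = {lang: count_senses(groups) for lang, groups in SYNONYM_GROUPS.items()}
print(senses)  # {'French': 2, 'German': 3}
```

The counting itself is trivial. The grouping into synonym sets is the step that requires a human judgment, and a different judgment about chagrin and peine would change the French count.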

Our example shows that the notion of synonymy cannot be reduced to a computational process. To declare two expressions synonyms presupposes understanding of what these expressions mean. If we look at how Greek proseuchomai in the first sentence of Plato's Republic has been translated into English, we find, in seven translations, five different equivalents: to make my prayers (3x), to say a prayer, to offer up my prayers, to pay my devoirs, to pay my devotions. It is up to us to decide if we declare the Greek verb ambiguous or just fuzzy and therefore the equivalents as representing two or more senses or as being synonymous. This example also shows that the notion of synonymy can only be applied locally, on units or text segments occurring in a context. It makes sense to look at the equivalents listed above as synonyms, as we can infer that for Plato's Greek readers the translation unit was perhaps fuzzy but not ambiguous. But viewed as English text segments, we would assume that in most contexts to offer prayers cannot be substituted for to pay one's devoirs without changing the meaning. Our example also shows that the translation equivalent which computers would single out as most common, to make my prayers, is not necessarily the one that modern experts for classical Greek mentality would choose, namely to pay my devotions. This is a fine illustration of the dynamics of meaning and of the fact that translation is necessarily a human act and not a computational process.

Neither for choosing the proper equivalent for English sorrow nor the Greek proseuchomai can we define formal instructions a computer could carry out. We need to understand the texts and their text segments to be able to paraphrase or to translate them.

The Application of Multilingual Corpus Semantics

Neither the bilingual dictionary nor the language-independent conceptual ontology solves the translation problem. Appropriate translations cannot be generated just by following instructions without understanding the text. Multilingual corpus semantics circumvents the obstacle of text understanding. Instead of modeling and emulating mental activity, it builds upon the results of this activity as it has been carried out by translators time and again, results that often lead to widely accepted translation equivalents.

Parallel corpora are repositories of human translations. They contain translation units with their translation equivalents. Looking at new general language texts, we find that perhaps 99% of all text segments they contain have occurred before and can be identified in a monolingual reference corpus of sufficient size. Neology accounts for the remaining 1%. However, only a rather small portion of texts are translated. Parallel corpora will never be as comprehensive as monolingual reference corpora. But in a reasonably sized parallel corpus, we should be able to detect 95% of all text segments that we find in a new text that we want to translate. With appropriate corpus tools, we should be able to identify the respective translation equivalents. If there is more than one equivalent for a translation unit, corpus tools will analyze the context and tell us if the unit is ambiguous, and they will, if necessary, map its senses with the different equivalents.
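What it means to detect a given percentage of the text segments of a new text can be sketched as a simple coverage measure. The sketch assumes that the segments have already been identified; the function name coverage and the toy data are hypothetical.

```python
def coverage(new_segments, corpus_segments):
    """Share of the segments of a new text (counted as running segments,
    not as distinct types) that are attested in the corpus."""
    if not new_segments:
        return 0.0
    known = set(corpus_segments)
    hits = sum(1 for segment in new_segments if segment in known)
    return hits / len(new_segments)


# Toy data: segments attested in a corpus and segments found in a new text.
corpus = ["pay attention", "of course", "sorrow", "take place"]
new_text = ["of course", "sorrow", "take place", "cyberfatigue"]

print(coverage(new_text, corpus))  # 0.75: three of the four segments are attested
```

In this toy case the unattested segment is an invented neologism, which corresponds to the remainder that the text attributes to neology.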

Multilingual corpus linguistics does not attempt to be a solution to the quest for machine translation. But it can provide computational support to human translators by presenting possible translation equivalents for the units which have to be translated. It offers, strictly speaking, candidates for translation equivalents, candidates that have been extracted automatically from parallel corpora. Automatic extraction uses tools that reflect the methodology of corpus linguistics. They search for complex patterns of all occurrences of the translation unit and its equivalents to account for different usages. They then process for statistically relevant co-occurrence patterns indicative of semantic coagulation. But whether the translation equivalent candidate presented as a result of processing all relevant language data available does indeed correlate with the implied meaning is up to the translator to decide.
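The extraction of candidates from statistically relevant co-occurrence patterns can be illustrated with a minimal sketch. It assumes sentence-aligned text, lets single words stand in for translation units, and uses the Dice coefficient as one common association measure; the text does not say which statistics the actual tools use, and the function name dice_candidates and the toy sentence pairs are hypothetical.

```python
from collections import Counter


def dice_candidates(aligned_pairs, source_unit):
    """Rank target words by their Dice score with source_unit,
    computed over sentence-aligned (source, target) pairs."""
    source_count = 0          # aligned pairs whose source side contains the unit
    target_counts = Counter() # aligned pairs whose target side contains each word
    joint_counts = Counter()  # aligned pairs containing both
    for source_sentence, target_sentence in aligned_pairs:
        target_words = set(target_sentence.lower().split())
        target_counts.update(target_words)
        if source_unit in source_sentence.lower().split():
            source_count += 1
            joint_counts.update(target_words)
    scores = {
        word: 2 * joint_counts[word] / (source_count + target_counts[word])
        for word in joint_counts
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy English-French sentence pairs, invented for illustration.
pairs = [
    ("her sorrow was deep", "son chagrin était profond"),
    ("he hid his sorrow", "il cachait son chagrin"),
    ("the house was deep in snow", "la maison était sous la neige"),
    ("his father was here", "son père était ici"),
]

ranked = dice_candidates(pairs, "sorrow")
print(ranked[0])  # ('chagrin', 1.0)
```

The ranking only proposes candidates. Whether the top candidate is the appropriate equivalent in a given context remains the translator's decision.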

Semantics and the Multilingual Corpus

Multilingual corpus semantics is in a privileged position in comparison to monolingual corpus semantics. In monolingual corpora, we do not systematically find paraphrases of texts and text segments. There we only find data still in need of interpretation. Working with parallel corpora can help to solve the question of meaning. For multilingual corpus semantics gives access to linguistic practice, not the linguistic knowledge of textbooks, grammars, and dictionaries but the knowledge translators use in their translations. If we agree that the meaning of a text or a text segment is only accessible to us by a paraphrase of it, grounded in a multitude of previous occurrences, then parallel corpora are repositories of such paraphrases. Of course, dictionaries, in our case bilingual dictionaries, also provide paraphrases. But while dictionaries have to treat lexical items as words in isolation, the paraphrases generated by translators present the meanings of text segments in their contexts. And translators do not focus on the single word as the essential unit. It is they who decide what is a translation unit. If they translate a multi-word unit, a collocation, or a set phrase as a whole, they establish what we have called a semantic coagulation. The evidence of parallel corpora can be used to validate the candidates for semantic coagulations generated by the methodology of corpus linguistics.

Translation equivalents show us what a text segment means. They also show us that meanings and their paraphrases are nothing definite. In a different context, for a different text type, with a different bias, a translator will come up with a different equivalent. Much of what a translator does is idiosyncratic. It may or may not become an established practice. Parallel corpora contain successful translation equivalents and those never occurring again. The common practice of translators for a given language pair reveals the conventions of this bilingual language community.

There is no reason to assume that the conventions of a monolingual language community concerning the meanings of text segments differ in principle from those of a community of translators. If at all, they are probably even less reflected and less explicit, since they are not stabilized by a continual practice of paraphrase. It is linguistic practice, the way people deal with language, that linguistics tries to capture. It aims at transforming this implicit competence into explicit knowledge. Multilingual corpus semantics contributes to this goal.


Belica, Cyril: 1996. "Analysis of Temporal Changes in Corpora." International Journal of Corpus Linguistics 1(1): 61-74.

Sinclair, John: 1996. "Corpus Typology. A Framework for Classification." EAGLES.www:

Teubert, Wolfgang: 1998. "Korpus und Neologie." In: Wolfgang Teubert (ed.): Neologie und Korpus. Tübingen: Gunter Narr. 129-170.


TELRI II Summary

The Concerted Action TELRI II is a pan-European alliance of currently 27 focal national language (technology) institutions with the emphasis on Central and Eastern European and NIS countries. It is planned to extend this alliance during the course of the Concerted Action with at least 3 new nodes in CEE/NIS.

TELRI II's primary objectives are:

  • to strengthen the pan-European infrastructure for the multilingual language research and development community;
  • to collect, promote, and make available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge;
  • to offer a customized comprehensive service to academic and industrial users;
  • to prepare and organize research and development projects focusing on translation aids, multilingual authoring systems, and information retrieval;
  • to provide a forum where experts from academia and industry share and assess tools and resources, assess software, evaluate new trends, investigate alternative approaches, and engage in joint activities;
  • to make available the expertise of its partner institutions to the research community, to the public, and to language industry.

TELRI II will implement these objectives in the following activities:

  • Networking: This work package includes liaising with related infrastructure activities, centres, and institutions, promoting TELRI activities (newsletter, webpage, TELRI list), and strengthening the permanent infrastructure of the TELRI Association.
  • TELRI Seminars: The series of successful TELRI seminars will be continued with annual seminars in CEE/NIS countries.
  • TRACTOR Service: This work package comprises promotion, support, and availability of customized service to the TRACTOR User Community.
  • TRACTOR Tools and Resources: This work package focuses on the acquisition of attractive tools and resources for TRACTOR, the TELRI Research Archive of Computational Tools and Resources.
  • Organizing Joint Research: TELRI partners will prepare European R&D projects with strong industrial involvement focusing on multilingual language and terminology issues.

The TELRI II Concerted Action will yield as concrete results: seminar proceedings (in print and electronic), TELRI Newsletters (semiannual, in print and electronic), TRACTOR Service Directory (annual, in print and electronic), Webpage and TELRI list (ongoing, electronic), TRACTOR resources and tools (CD-ROM), project proposals, research publications, etc. All results will be strictly public domain.



Kemal Oflazer
Dan Cristea
Martin Wynne

Kemal Oflazer
Bilkent University
Department of Computer Engineering and Information Science
TR-06533 Ankara

The Center for Turkish Language and Speech Processing at Bilkent University, Ankara, was established in 1998 as a joint initiative of the Department of Computer Engineering and Information Science and the Department of Electrical and Electronics Engineering. The Department of Computer Engineering and Information Science had until then been active in developing language processing technology, while the Department of Electrical and Electronics Engineering was active in speech processing technology. It was felt that the time was ripe for collaboration to develop more comprehensive language and speech applications, and the Center was founded.

Prior to the founding of the Center, NLP work at Bilkent University was supported by a NATO Science for Stability Grant (of about $600,000 over 5 years). Thanks to this grant, extensive resource and application development work on Turkish was undertaken. We can cite some of these as follows:

  • Turkish Morphological Analyzer: This wide-coverage analyzer determines all morphological interpretations of a Turkish word. It uses XRCE finite state technology and knows about 30,000 root words and 35,000 proper nouns. It has been tested on millions of words and is quite fast (about 2000 words/sec on an UltraSparc workstation).
  • Turkish Morphological Disambiguator: This constraint-based disambiguator determines the correct morphological interpretation of a Turkish word in context. It can achieve about 96-97% recall and 94-95% precision on previously unseen Turkish text.
  • On-line Turkish Dictionary: A full-fledged Turkish dictionary has now been made accessible from Bilkent University via the World Wide Web. This dictionary has over 55,000 entries, over 90,000 word senses, and 30,000 usage examples, and it can be consulted with inflected and/or misspelled forms.
  • Turkish Parser: We have developed a wide-coverage parser for Turkish using the Lexical-Functional Grammar (LFG) formalism.
  • Turkish Generator: We have developed a wide-coverage tactical generator for Turkish and are currently using this in two machine translation system prototypes.
  • Text Resources: We have compiled a quite substantial amount of Turkish text for various uses. We have tagged some of these texts. We have also compiled aligned Turkish/English texts.
  • Miscellaneous: In addition to the resources mentioned above, a number of software technologies for spelling checking and correction for Turkish have been developed.
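The recall and precision figures quoted for the morphological disambiguator can be read as follows, assuming the convention usual for constraint-based disambiguation, in which more than one analysis may be left on a token. This convention is an assumption here, since the text does not define the measures; the function name and the toy tags are hypothetical.

```python
def recall_precision(kept_analyses, gold):
    """kept_analyses: one set of remaining analyses per token.
    gold: the correct analysis for each token.
    Recall    = tokens whose correct analysis survives / all tokens.
    Precision = surviving correct analyses / all surviving analyses."""
    correct = sum(1 for kept, right in zip(kept_analyses, gold) if right in kept)
    total_kept = sum(len(kept) for kept in kept_analyses)
    return correct / len(gold), correct / total_kept


# Toy example with four tokens; the second token is left ambiguous,
# and the correct analysis of the fourth has been discarded.
kept = [{"Noun"}, {"Noun", "Verb"}, {"Verb"}, {"Adj"}]
gold = ["Noun", "Verb", "Verb", "Noun"]

recall, precision = recall_precision(kept, gold)
print(recall, precision)  # 0.75 0.6
```

Under this reading, leaving a token ambiguous protects recall at the cost of precision, which is why the two figures can differ.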

We have also developed two prototype machine translation applications, in collaboration with the Center for Machine Translation at Carnegie Mellon University (using their KANT system) and with SRI Cambridge Labs (using their CLE system).

The current work at the Center is now proceeding along a number of directions:

  • Speech Applications: We are developing a large vocabulary discrete word speech recognition system for Turkish. This will be the first such system for Turkish.
  • Statistical Language Modeling: We are working on statistical techniques to apply to Turkish language modeling, for both morphological disambiguation and for structural disambiguation.
  • Finite State Techniques: We have been working on using finite state techniques for low level light parsing and for dependency parsing of Turkish.
  • Information Extraction: We are commencing work on various information extraction systems for Turkish. This will use most of our technology for morphological analysis and disambiguation, light parsing and statistical modeling.
  • Turkish Treebank Compilation: We have completed the preliminary design of a Treebank coding of Turkish sentences using a dependency-oriented representation.

A final goal of our Center is to extend the know-how we have developed to the needs of other closely related Turkic languages, and to develop applications for translating to and from those languages.

More information about the Center, including copies of publications, can be found at

Dan Cristea
University "A.I.Cuza" Iasi
Bd. Copou, 11
RO-6600 Iasi

The roots of the Faculty of Computer Science at the University "Alexandru I. Cuza" of Iasi lie in the Section of Computing Machines established inside the Faculty of Mathematics back in 1965. Since 1992 it has functioned as the Faculty of Computer Science, at present still the only faculty with a Computer Science profile among the non-technical universities of Romania. It trains computer scientists either through a long programme of 4 years followed by a one-year master's course, or through a short programme (3 years) in the College of Information Technology.

Associate Professor Dr. Dan Cristea was the first faculty member in Romania to introduce into his Artificial Intelligence course a significant segment exclusively dedicated to NLP (as early as 1985). He now also teaches a course on Computational Linguistics and leads a Natural Language Processing group in his Faculty. Dr. Cristea is known in the field of Language Technology especially for his earlier work on question-answering, natural language interfaces to databases and multilingual morphology acquisition. More recently, his interests have focussed on discourse theory, where he has developed a method of incremental discourse parsing, an architecture for parallel annotation of documents, and Veins Theory. The last of these was presented at ACL/Coling in Montreal in 1998, in collaboration with Dr. Nancy Ide (Vassar College) and Laurent Romary (LORIA, Nancy).

The NLP Group at the Faculty of Computer Science of the "Alexandru I. Cuza" University of Iasi is a team of students and young scientists, led by Dan Cristea, working mainly on problems related to discourse processing. Some members of the group are writing their graduation papers or master's dissertations. The current focus of the team is to build a system aimed at interpreting unrestricted texts. One paper on this project is available on-line. It describes the architecture and behaviour of a system that integrates several ideas from artificial intelligence and natural language processing in order to build a semantic representation for discourse. It shows how modules contributing different kinds of expertise (syntactic, semantic, common sense inference, discourse planning, anaphora resolution, cue-words and temporal) can be placed around a skeleton made up of a POS/morphological tagger and an incremental discourse parser. The performance of the system is affected by, but does not vitally depend on, the presence of any one of the contributing expert modules.

Linked to this main stream of research, work is also being pursued on the annotation of corpora in a style appropriate for studying anaphora, discourse structure, and the correlation between them. An annotation tool, GLOSS (documentation available on-line), has been built to support the interactive annotation of documents in SGML. The system produces database images of the annotated objects, allows several documents to be open simultaneously, and can collapse independent annotation views of the same original document (which also allows for layer-by-layer annotation in different sessions and by different annotators, including automatic ones). It permits discourse structure annotation by offering a pair of building operations (adjoining and substitution) and a set of remaking operations (undo, delete parent-child link and tree dismember), and it offers an attractive interface to the user.

One current direction of research is the study of the relation between discourse structure and references. To do that, we employ part of a MUC corpus (generously provided by Daniel Marcu) annotated for RST structure and co-references; we use GLOSS to unify these two annotations in a single SGML document, and we add the markings suggested by Veins Theory. A piece of software then makes it possible to verify the conjectures of Veins Theory against these data. The final goal is to arrive at an anaphora resolution algorithm able to work in tandem with an incremental discourse parser. A workshop dedicated to the relation between discourse structure and references is being organised by Dan Cristea, Nancy Ide and Daniel Marcu, following this year's ACL meeting in Maryland.

The Faculty of Computer Science of the University of Iasi initiated the EUROLAN series of Summer Schools in Human Language Technology in 1993. Since 1995 these events, which bring together professors and researchers from all over the world to teach modern HLT topics mainly to students from Central and Eastern Europe, have been organised together with The Center for Advanced Research in Machine Learning, Natural Language Processing and Conceptual Modelling of the Romanian Academy (director Dan Tufis). Over the years other institutions have co-operated in the organisation and support of the EUROLAN events, among others: The European Union - Directorate XIII, ACM, TELRI, LIMSI/CNRS Paris-Sud, LIFL/CNRS Université de Lille 1 and DFKI Saarbrücken.

In 1999 the School will reach its fourth edition. This year's edition is dedicated to Lexical Semantics and Multilinguality, and will be honoured by the presence of many distinguished researchers and professors from the United States and Europe. Those interested are invited to visit the EUROLAN'99 pages at any of the following addresses:

Martin Wynne
Department of English Language
Lodz University
Al. Kosciuszki 65,
90-514 Lodz

A short profile of the new TELRI partners in Lodz:

The corpus research team is based in the Department of English Language at Lodz University in Poland. Our work until now has been based around the PELCRA Project. The principal work has been the building of 2 varieties of corpora, L1 (Polish) and L2 (Polish learner English) corpora. Our ultimate objectives are the development of ELT resources for Polish learners, based on corpus data.

The acronym PELCRA stands for Polish and English Language Corpora for Research and Applications. The PELCRA Project is a British-Council-funded joint project between the Department of English Language at Lodz University, the Department of Linguistics and Modern English Language at Lancaster University, and international publishers, in particular, Routledge. The Project has been in existence for 2 years and official link documents were signed on 14 March, 1998. British Council funding for the Project, excluding hidden costs, totals 8600 pounds over three years (1997-2000). The objectives of the PELCRA Project are simply stated. We intend to create and/or exploit the following corpora for research and practical purposes:

Corpus 1

The British National Corpus (BNC), 100,000,000 words of running text, already exists. This will be our English reference corpus and our benchmark for the creation of other corpora.

Corpus 2

The Polish National Corpus (PNC) will, as far as it is appropriate and possible, mirror the BNC in terms of genres and its coverage of written and spoken language. Ideally, we want to collect 100,000,000 words of running text.

Corpus 3

The Polish Learner English Corpus (PLE) will be unique to Poland. We intend to collect learner data from a range of genres.

Corpus 4

This will be a virtual comparable corpus. Because the BNC and PNC will be mirror images of one another, we will be able to link texts from each corpus by genre, thus creating a comparable corpus as a side effect.

Corpus 5

We also intend to develop a Polish-English parallel corpus. To this end, we have already successfully and accurately aligned the English and Polish versions of Plato's Republic.

Data collection has been under way throughout the academic years 1997-99, and considerable progress has been made in the collection of raw data. Needless to say, all of the data has to be accurately coded and proper profiles prepared for the respective donors. Preliminary work on a POS-tagging scheme for Polish is also under way, with the ultimate goal of developing an automatic tagger.

The creation of the above corpora will provide us with vast resources for technical, academic and practical research. All corpora can be examined in isolation or, as is intended, in interaction with each other. For example, the value of the PLE corpus for research purposes is enormous.

Members of the PELCRA group in Lodz include Professor Barbara Lewandowska-Tomaszczyk, James Melia, Martin Wynne, Rafal Uzar, Dr. Agnieszka Lenko-Szymanska, Jacek Walinski, Krisztof Kredens and Staszek Godz-Roszkowski. In addition to our involvement in the above project goals, members of the group have wide-ranging corpus-related research interests, including lexicography, forensic linguistics, translation (especially of legal and medical language) and POS tagging (of English and Polish). The second biennial Practical Applications in Language Corpora (PALC) conference takes place in Lodz in April 1999.


TELRI Event - Bratislava 4th TELRI European Seminar

Text Corpora and Multilingual Lexicography

First Call for Papers

The fourth in the series of TELRI (Trans-European Language Resources Infrastructure) seminars will be held on 5-7 November 1999 in Bratislava, Slovakia, with the theme "Text Corpora and Multilingual Lexicography". There will be project-internal meetings directly before the seminar, on 3-4 November (draft timetable below).

TELRI-II invites proposals from TELRI project partners for papers on topics relevant to multilingual lexicography. Papers and presentations should be 30 minutes long including discussion. A selection will be published after the seminar. The papers should present innovative ideas, promising research tracks and novel solutions in the field of multilingual HLT.

No fees are charged for TELRI project partners. In addition, TELRI-II can fund participation for one member from each partner site (cheapest advance air fare or train/car, plus daily allowance).

Young Researchers Workshop

A pre-seminar workshop on Thursday, November 4th, is dedicated to presentations of work in progress by young researchers, i.e. graduate students and young research fellows. Seminar fees will be waived for young researchers presenting at the Seminar. TELRI will investigate further possibilities for financial support.

Software Demonstrations

The programme will include sessions devoted to demonstrating industrial and public domain software in the field of multilingual HLT. Seminar fees will be waived for software presenters.

Fees
Academic participants: EUR 50/20*
Industrial participants: EUR 100/30*
Students: EUR 20/10*
* The second figure is the fee for participants from CEE/NIS

Important Dates
Submission Deadline: June 1st 1999
Notification of Acceptance: July 10th 1999
Submission Format: abstracts of 600 words maximum
Submission Address:

For more details please see: Bratislava seminar page or contact

Draft Timetable
Wednesday 3rd November
am: arrival
pm: TELRI Association plenary Meeting

Thursday 4th November
am: Internal TELRI meetings and Working Group meetings
pm: Young Researchers Workshop
even: welcome evening

Friday 5th November
am: Seminar - Session One
pm: Seminar - Demo Sessions and Presentations
even: excursion/reception

Saturday 6th November
am: Seminar - Session Two (topic decided depending on papers)
pm: Seminar - Session Three (topic decided depending on papers)

Sunday 7th November
pm: departure


© TELRI, 5.10.1999