Wolfgang Teubert, Coordinator of TELRI
1. General Problems
In spite of the relatively smooth progress that TELRI made in the reporting period (July-December 1995), our experience was that some overall conditions were not entirely beneficial to the objectives of this Concerted Action. We can identify the following problem areas:
1.1 The constitution of Concerted Actions
In the COPERNICUS Programme, the goal of Concerted Actions is to bring together as partners focal institutions in Central and Eastern Europe (CEE) with their counterparts in Western Europe. Therefore, TELRI consists of 22 institutions in 17 countries, among them 12 Central and Eastern European countries. Through membership in the TELRI Advisory Board, three further institutes, soon to be joined by more, in CEE and NIS (Newly Independent States) countries are closely linked to TELRI. As a typical infrastructure organization, TELRI does not carry out research, but tries to set up a common platform for the exchange of expertise, software, language resources, information, and new ideas, and to establish a common identity that will facilitate cooperation in projects aimed at multilingual language technology.
At the same time, all TELRI partners are involved in a number of research projects, within their own institution and on a multilateral basis. The projects provide funding for the actual research, and this creates a situation with which TELRI cannot really compete. Most of the TELRI budget goes into coordination and travel expenses. With few exceptions, reimbursement for work dedicated to TELRI is not possible.
Considering this situation, the high level of motivation, effort, and concrete results achieved in TELRI is quite remarkable. To keep this spirit alive, however, we will have to bridge the gap between research and infrastructure activities by preparing a small list of more research-oriented projects that can complement TELRI's wider goals. The VALIDATOR proposal submitted to the last COPERNICUS call is a good example. In the second year of TELRI, we will therefore explore the possibilities for such multilateral projects, which would have a beneficial impact on infrastructure while at the same time providing adequate funding for research work.
1.2. Coordination of COPERNICUS activities
In the area of language resources and language engineering, we find the following action lines in the COPERNICUS Programme:
- Concerted Actions for the creation of a pan-European infrastructure;
- projects leading to concrete results (resources, tools, or applications);
- Awareness Days, workshops, summer schools, etc.
All these activities have their own individual profile. Still, everyone involved agrees that even though efforts at concertation between certain activities already exist, more could and should be done. Perhaps a project complementary to UNGLINK in Western Europe could promote additional synergy effects.
1.3. Lack of support by Western European links
The first year of TELRI gave rise to the impression that some institutions, activities, and circles in Western Europe are not very fond of the idea of a pan-European infrastructure and not very supportive with respect to the needs of their counterparts in Central and Eastern Europe. TELRI is working hard to make the vision of pan-European cooperation attractive. But more encouragement by the European Commission is needed to open up existing Western European standardization, information, and distribution networks not just to select individuals in CEE but to every eligible institution on a fair basis.
1.4. The emergent EU infrastructure
The year 1995 saw the rapid growth of an operational infrastructure for language resources in the EU. ELRA was founded with substantial financial support by the European Community; the academic partners in the PAROLE I Project founded the PAROLE Association; and EAGLES delivered standards and specifications which are intended to be used in the whole of Europe. Some partners in CEE and NIS countries feel uneasy about these developments which will have an impact on them, but which they cannot join. More about this problem in the following chapter.
2. A Changing Environment for Language Resources
In the year 1995, we witnessed an explosion of data, images, sounds, tables, figures, process protocols, options, and visions distributed globally via ever-expanding information superhighways. If these data are to be intelligible, if they are to make sense, they must be bound together by language. Without natural language processing, information remains incomprehensible. For the emergent global information society, we have to develop a language technology that meets the multilingual challenge. It will have to support the production, revision, conversion, presentation, publication, documentation, and, last but not least, translation of texts in technical and everyday language; and it will have to enable language-independent retrieval through sophisticated interaction modes based on natural languages.
Europe is determined to remain a multicultural and multilingual society. Where other information technology markets, like North America or Southeast Asia, can restrict themselves to monolingual or, at the most, bilingual applications, Europe has to develop a language technology that creates a truly multilingual information society by helping its citizens to overcome language borders.
Multilingual textual and lexical resources employing the same standards and closely linked in their composition are essential for the development of multilingual applications. Therefore, with the financial support of the European Commission, important steps towards a language resources infrastructure in Western Europe were taken in 1995. On the organizational level, ELRA (European Language Resources Association) and, for written resources, the PAROLE Association were founded. The new PAROLE II project was prepared and finally accepted. ELRA delivered first recommendations for standards as well as guidelines and specifications. This had consequences on four levels:
Infrastructure level: Ties between focal language resources institutions were strengthened (PAROLE Association); links between academic research and private industry were established (ELRA).
Standardization and validation level: standards and specifications for text representation, lexicon markup, and morphosyntactic and syntactic features were adopted (EAGLES, but also MULTEXT, MECOLB, and PAROLE I); first outlines for the validation of written resources were designed.
Distribution level: ELRA was set up as a European distribution center for language resources.
Production level: PAROLE II was prepared with the goal of creating a first generation of comparable resources for multilingual applications.
These developments, which so far have largely left out CEE and NIS countries, demand adequate responses by TELRI. It was necessary to strengthen TELRI activities in the area of standardization, validation, and distribution, and to find ways to participate in the creation of comparable language resources. The following chapter will deal with these accommodations.
3. Accommodation of the TELRI Workplan
3.1. Introduction of new work items
The Working Group User Needs (Coordinator: Andrejs Spektors, Riga) completed its first survey on industrial user needs by late fall 1995. A final survey on user needs will be carried out by Working Group User Groups (Coordinator: Wolfgang Teubert, Mannheim) in 1997 on the basis of an analysis of joint ventures carried out by TELRI members. The Working Group of Andrejs Spektors has now adopted the work item Validation. Its first task is the preparation of a proposal for a COPERNICUS project, VALIDATOR. This project will ensure the participation of CEE and NIS partners in the design of written resources validation and thus establish a uniform and homogeneous approach to validation in Europe.
The Working Group Seminars (Coordinator: Julia Pajzs, Budapest) was given the new work item Morphosyntactic Features. The organization of the Tihany Seminar made it clear that subsequent seminars will be organized by the local partner and Mannheim alone, thus making a Working Group Seminars superfluous. On the other hand, the various (and rather heterogeneous) categories of recommendations put forward by ELRA, MULTEXT, MECOLB, and PAROLE were not seen by TELRI partners as suiting the peculiarities of the Slavic and Baltic languages, nor of Hungarian or Estonian. The reconstituted Working Group Morphosyntactic Features will endeavor to unify existing recommendations, complement them with necessary features of languages not yet covered, and propose a synthesis permitting various levels of granularity for different applications. It will seek to establish close links with all relevant activities mentioned as well as with COPERNICUS activities like MULTEXT East. This is a necessary step that has to be taken for the development of truly comparable pan-European language resources.
For 1996, TELRI plans to prepare a proposal for a new COPERNICUS project, PAROLE East, with the goal of creating complementary standardized resources for those CEE and NIS countries where some convertible resources already exist. TELRI will set up an informal Working Group Bridge Dictionaries for the concerted preparation of localized versions of the COBUILD Student Dictionary (English entry words, but descriptions in local languages based on the English original). As electronic versions, these dictionaries can easily be linked and thus converted into a multilingual lexicon. In addition, these individual projects will provide useful experience for joint ventures between academic research and private industry, and they will also generate some income for participating institutions.
Working Group Documentation (Coordinator: Ruta Marcinkeviciene, Kaunas) is cooperating with ELSNET Goes East in the preparation of a new and comprehensive edition of a survey of CEE and NIS institutions, enterprises, and organizations active in the field of language resources and language engineering. TELRI will, in 1996, explore the feasibility of a project Multilingual Terminological Database for Language Resources and Language Engineering with partners all over the world.
3.2. Additional activities
The papers given at the Tihany European Seminar Language Resources for Language Technology will be published as a book.
Working Group Organizing Joint Research (Coordinator: John Sinclair, Birmingham) will link up with Working Group Multilingual Lexicons of the PAROLE Association with the goal of developing a methodology for the realization of translation equivalents. TELRI will expand and regularly update its Web pages and, in addition, set up an open TELRI mailing list for increased visibility of the TELRI Concerted Action.
Institute of Formal and Applied Linguistics,
Faculty of Mathematics and Physics
Prague, Czech Republic
Under the given technical conditions, text corpora are often conceived of as containing not only data on the part-of-speech membership of the individual lexical occurrences, but also information on their morphemic values and syntactic functions. Only if this information is sufficiently rich and reliable can the corpus serve as a valuable source for large-scale exploitation in the most diverse areas of research on language, including not only language and its structure itself, but also the theory of literature and neighboring disciplines.
One of the important questions is how to select and organize the grammatical data in the tagged corpus. It goes without saying that data on morphemics should be maximally detailed and that they should be patterned in such a way that it is easy to use them with different theoretical frameworks. The latter issue is more complex with regard to syntax and its relationships to semantics and pragmatics. In these domains, a theory-independent approach to tagging cannot be understood as using only concepts independent of any linguistic theory, but rather in the sense mentioned, i.e. as applying sets of categories (with their values) and decision procedures that allow the linguist using the corpus to classify the tag symbols in accordance with the needs of as many existing (or reasonably imaginable) theoretical frameworks as possible.
Another condition requires the tagging procedure to be simple and modular enough to make a semi-automatic approach possible. To this aim, the basic and most frequent phenomena should be described by means of relatively perspicuous categories and values, not too distant from an intuitive view of the sentence structure and of the grammatical properties of lexical units. From this it follows that errors occurring in the output of a first version of the tagging procedure (which contains a parser, perhaps based on a combination of grammatical and statistical steps) may be identified by individual checking, and the quality of the procedure could be improved by solutions avoiding the most frequent errors.
A rather general assumption on which syntactic tagging may be based is that syntactic relations (and several aspects of morphological information) are, in the prototypical case, expressed by morphs (prepositions and other function words, endings, or affixes), whereas surface word order serves to express the topic-focus articulation. In English, French, or Chinese as well, the "given" (contextually bound) information usually precedes the "new" part of the contents of a sentence. The grammatical function of the SVO order will certainly be used in parsing languages in which this kind of configurational structure is present; however, it would not be appropriate to base the identification of syntactic relations on such a starting point in cases where the word order is "free", i.e. not grammaticalized, be it in languages with a higher degree of "free" word order, or e.g. in the order of some adverbials in English.
Taking this assumption into account, we come to the following conclusions:
(i) the function words should be rendered in the system of tags by symbols indicating the corresponding functions, i.e. morphological values of the corresponding autosemantic words (e.g. values of tense, number, definiteness, degrees of comparison) and syntactic relations (specifiers, arguments and adjuncts, or complements, modifications, and so on); if possible, not only the differences between subject, direct, indirect and "second" object (the latter present e.g. in Fred was elected the chairman) are to be distinguished, but also several tens of kinds of adjuncts (adverbials such as Locative, Manner, Means, Condition, several Directionals, Temporal adverbials, and so on - corresponding to primary and secondary meanings of prepositions, subordinating conjunctions and other means); these values should be indicated in any case, be they expressed by function words, affixes, stem alternations or word order;
(ii) for every autosemantic occurrence other than the main verb of the sentence it should be indicated whether it is a complementation (argument, adjunct, etc., see above) of a certain head or a part of a coordinated construction (for which again the head it depends on would be specified);
(iii) the order of the autosemantic lexical occurrences in the output of the tagging should not differ from their surface word order; this would allow for an analysis of the topic-focus articulation of the sentence; if it is probable that the intonation center of the sentence (when read aloud) would be placed elsewhere than at the end of the sentence (as, e.g., in English, with sentences containing a word like yesterday after the verb and its complement, or with short sentences containing a cleft construction), the bearer of the intonation center (constituting the focus proper of the sentence) should be marked by a specific index.
Points (i) and (ii) ensure, at least to a certain degree, that for theories requiring a further classification of syntactic relations it will be possible to specify the additional specification (e.g. the subject of an active verb may be identified as an Actor, corresponding in a cognitive layer either to Agentive, or to Experiencer, Theme, and so on, according to the context; or it may be classified as the NP constituting an immediate constituent of the sentence).
The output of the tagging procedure may have the form of a bracketed string with indices (with every dependent word and every coordinated construction being enclosed in its pair of parentheses, an index of this pair identifying the syntactic function of the word or the kind of coordination, and a set of indices at each word indicating its morphemic values). Only in the exceptional cases in which the condition of projectivity (adjacency, continuity of constituents) is not met would it be necessary to indicate the position of the head, e.g. by its serial number (this concerns especially long-distance dependencies).
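The bracketed output format described above can be sketched in a few lines of code. The following is an illustrative toy only, not the actual tagging procedure; all function and morphemic labels (Pred, Sb, Obj2, Det, sg, past, def) are invented placeholders, not a proposed tagset.

```python
# Sketch of the indexed bracketed-string output: each dependent (sub)tree
# is enclosed in a pair of parentheses indexed by its syntactic function,
# and each word carries its morphemic value indices.

class Node:
    def __init__(self, word, function, morph, left=(), right=()):
        self.word = word            # surface form
        self.function = function    # syntactic-function index of the bracket pair
        self.morph = morph          # morphemic value indices of the word
        self.left = list(left)      # dependents preceding the head
        self.right = list(right)    # dependents following the head

def bracket(node):
    """Render a dependency tree as an indexed bracketed string,
    preserving the surface word order of the words."""
    head = node.word + "/" + "+".join(node.morph)
    inner = " ".join([bracket(d) for d in node.left] + [head] +
                     [bracket(d) for d in node.right])
    return "(" + inner + ")" + node.function

# `Fred was elected the chairman', with the "second" object labeled Obj2:
tree = Node("elected", "Pred", ["past"],
            left=[Node("Fred", "Sb", ["sg"])],
            right=[Node("chairman", "Obj2", ["sg"],
                        left=[Node("the", "Det", ["def"])])])
print(bracket(tree))
# ((Fred/sg)Sb elected/past ((the/def)Det chairman/sg)Obj2)Pred
```

A non-projective dependent would additionally carry the serial number of its head, as the paragraph above suggests; this sketch covers only the projective case.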
Certainly, most parsers available today or in the near future will not go that far (e.g. in what concerns the oppositions of different functions of prepositions, or the identification of the intonation center). However, tagged corpora will make it possible to analyze the relevant syntactic issues in monographs, dissertations, etc., for individual languages and their groups, and we may hope that results of such research can then be used to amend the analytic procedures.
Jan Hajič*, Eva Hajičová*, Alexandr Rosen**
*Institute of Formal and Applied Linguistics
Faculty of Mathematics and Physics
**Institute of Theoretical and Computational Linguistics
Faculty of Philosophy
Prague, Czech Republic
Building treebanks is a prerequisite for various experiments and research tasks in the area of NLP. Under a recently awarded grant,1 we are developing (i) a formal definition of a (dependency based) tree, and (ii) a mid-size treebank based on this definition. The annotated corpus is designed to have three layers: morphosyntactic (linear) tagging, syntactic dependency annotation, and the tectogrammatical annotation. The project is being carried out jointly at the authors' Institutes.
1 The Current State and Motivation
Recent decades have seen a shift towards expressing linguistic knowledge in ways which allow its verification and processing by formal means. Tools originating in mathematics, logic and computer science have been applied to human language to model its structure and functioning. Various aspects of different languages are being described within formally defined frameworks proposed by a number of interacting linguistic theories.
The proposals deal with various levels of linguistic description, starting from the level of sounds (phonetics) up to the level of meaning. Partial grammars and lexicons now exist for many languages within various formal frameworks, and collections of linguistic analyses of text and speech are accumulated to be employed both in theoretical research and applications. Besides approaches relying on symbolic means and `rationalist' efforts which result in language models consisting of grammar rules and lexical entries, alternative methods employ statistics computed from input text or its analysis to produce a stochastic model.2
1 Grant of the Grant Agency of the Czech Republic No. 405/96/0198, which has now become an integral part of a newly awarded long-term grant of the same agency, No. 405/96/K214.
However, a common and crucial issue cutting across all types of enterprise in this domain is the need to adopt or design an adequate formal representation of language structures in order to accommodate relevant linguistic knowledge in its relation to the actual language data. There are a number of tasks which typically require a soundly defined formal representation of language structures:
1. analysis (parsing) of input text or speech into a representation, tagging of text or speech collections;
2. synthesis (generation) of output text or speech from a representation;
3. mapping of one representation onto another, i.e. transfer (typically in machine translation systems).
These are the elementary tasks which are parts of many natural language processing applications, some of which are listed below:
machine translation systems;
natural language interface to knowledge bases, question answering systems;
automatic abstracting and knowledge acquisition systems;
automatic acquisition of linguistic data and its integration into a language model.
Formal representations of language structures which have been proposed by different linguistic theories and/or used in natural language processing applications reflect their context in many respects, and suitable candidates for an intended more general use are difficult to find. This is due to various aspects of their design, such as (i) specific theoretical commitment, (ii) limited expressive power in partial coverage of language phenomena and restriction to certain levels of linguistic analysis, (iii) difficulties in expressing relationships between different levels of analysis, (iv) hard-wired reliance on some characteristics of a certain language or language group and the resulting difficulty in adapting the framework to a typologically different language, and, finally, (v) application-specificity. Thus, it is difficult to express a full-fledged syntactic analysis of a `free word-order' language by means of word-class labels and constituent brackets used for tagging (mostly English) texts.
2 When a linguistic description is implemented on computers, the usual goal is to parse sentences and produce representations of their analyses, thereby verifying the framework, the linguistic theory and the description itself. Another way to obtain (morphological and syntactic) analysis of sentences is by employing statistical methods on large samples of (already analyzed) texts in order to process a new text afterwards, performing some degree of linguistic analysis on the basis of the data acquired in the `learning' phase. Both these kinds of efforts converge, and their increasing potential is reflected in the growing amount of text and speech data analyzed to a different degree for various purposes.
Although it is not likely that a single framework could become a universally accepted vehicle of linguistic knowledge, we believe that a higher degree of generality and flexibility can be achieved for the benefit of both theoretical studies and application-oriented projects.
2 Characteristics of a Satisfactory Solution
From the conceptual point of view, an adequate design of formal representation should be able to express linguistic facts related to the following levels of description:
1. level of phonetics, phonology, graphemics: specification of phonemes, stress and prosodic patterns, etc.;
2. level of morphology: morphemes, morphological categories;
3. level of syntax: syntactic categories, syntactic structure (trees);
4. level of (linguistic) meaning: disambiguation of lexical meaning, specification of underlying structure and function, communicative dynamism and topic-focus articulation, anaphora resolution.
There are several important features that should be reflected in the design to make it really useful:
It should be possible to describe a language structure in all its aspects simultaneously, i.e., to be able to relate facts from all levels of linguistic analysis in a straightforward fashion. At the same time, the design should permit access to specific aspects of the description without other aspects intervening. Thus, a user interested only in syntactic structure should be able to filter out any other information.
If a certain aspect of linguistic description can be structured and viewed differently depending on theoretical commitments, the design should provide an option to derive the desired way of presenting the linguistic facts from a common representation. Thus, both phrase-structure and dependency trees could be derived from the description.
The design should be capable of accommodating typologically different languages without substantial modifications, especially, it should provide space for stating the relation between word-order variations and higher levels and for the interplay between morphology and syntax in the case of complex expressions.
A related requirement concerns the possibility to express links between parallel structures and their analyses in different languages. This feature is important if parallel bi- or multilingual data are to be analyzed and studied as contrastive language structures.
The design should provide space for as few or as many linguistic facts concerning a language structure as it is possible or practical to collect or express. This feature would permit integrating text or speech samples with their analyses in a stepwise fashion, possibly starting with a bare text/speech string and leaving some levels unspecified.
It should be possible to represent at least some linguistic facts in an underspecified form. Wherever possible, an option to use a quantitative measure should accompany such cases. Disjunctions restricted to local domains, underspecified descriptions and weights could be the means to achieve this requirement.
The formal representation should be convertible to another format, as required by an application or desired by another specification covering compatible conceptual issues.
The design should be flexible in the sense that it should contain as few inherent restrictions to its extensions and modifications as possible.
3 Background, Methods and Problems
Without attempting to preview the results, the following points can be made to sketch the starting point situation, the outlines of the goal, and the path towards its achievement:
1. The project will be able to profit from theoretical results and practical experience gained in the field of formal description of natural language at our sites.
The fruitful results concerning word-order variations and their relation to meaning, as well as the richness of syntactic studies based on a dependency-oriented model, both widely acknowledged and faithful to the high standards of the Prague School linguistic tradition, provide a wealth of stimulating material.
At both sites, a number of application-oriented research projects have been at least in some respects tackling the problems of an adequate representation of language structures. The projects include machine translation, natural language interface to knowledge bases, automatic abstracting, automatic knowledge acquisition from texts and grammar checking.
2. The smallest piece of information (typically, a linguistic category) is expressed as an attribute and its value (i.e., a `feature'). A collection (conjunction) of such pairs is used to describe a linguistic object (typically an aspect of linguistic description of a word or a collocation), allowing for partial information (underspecification) and entering into more complex structures, where some attribute values are not atoms but structures. Through the recursive nature of such a representation, linguistic structures of arbitrary complexity can be described. Two or more attributes can share a single value, which is a possible way to implement relations between linguistic facts at different levels of description.
As structures of this type have become a kind of standard in modern linguistic research, the issues of compatibility with other approaches will be substantially simplified on many levels.
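The attribute-value representation described in point 2 can be sketched with plain Python dictionaries; sharing one dictionary object between two attributes models a shared value, and leaving attributes out of the conjunction models underspecification. All attribute and value names below are hypothetical illustrations, not the project's actual feature inventory.

```python
# Minimal sketch of feature structures as nested attribute-value pairs.

agreement = {"number": "sg", "person": 3}               # one value ...
subject = {"cat": "NP", "agr": agreement}               # ... shared by
verb = {"cat": "V", "tense": "pres", "agr": agreement}  # two attributes

# Updating the shared value is visible from both structures, which is
# one way to implement relations between facts at different levels:
agreement["gender"] = "fem"
assert verb["agr"] is subject["agr"]        # literally the same object
assert subject["agr"]["gender"] == "fem"

# Underspecification: attributes simply absent from the conjunction.
partial = {"cat": "N"}                      # number, case, ... unspecified
```

Because attribute values may themselves be structures, representations of arbitrary complexity arise from this one recursive device.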
3. The design will be tested by its application on language data in at least two typologically different languages. A sample of bilingual parallel text data will be provided to test the parallel link option between analyses of linguistic structures.
There are a few challenging issues which call for an inventive solution:
The relation between the surface string of graphemes/phonemes, hierarchical syntactic structure and the ordering of meaning-bearing elements according to the degrees of communicative dynamism is far from straightforward. This concerns especially cases of crossing dependency (non-projective structures). If the representation is to accommodate descriptions on all levels in an integral form, a non-trivial solution has to be found.
Complex expressions like idioms, compound words and morphological categories realized by discontinuous sequences of auxiliary words present another problem of a similar kind.
The integration of all kinds of linguistic knowledge in a single formal framework capable of application to the widest range of language structures is a unique enterprise. Disregarding the undoubtedly immense practical profit for a moment, the project will probably bring the most precious theoretical fruit precisely in this domain.
4 The Treebank
The formalism developed within this project will be applied towards a mid-size treebank, mainly on the Czech material. There will be three layers in the treebank.
tation problem in such a complex and unified way. Also, the development of the past ten years will lead to novel approaches in the representation theory.
However, the idea of the `development cycle' involving immediate, large-scale evaluation and verification on real texts has not been exploited previously for such a theoretical issue as a formal representation of language structures undoubtedly is. There are various projects, mainly in the United States, which do use the repetitive evaluation strategy to get valuable feedback, but they are more application-oriented. We feel that an appropriate modification and proper usage of such methods would mean a qualitative leap in the search for a theoretical result in a non-technical discipline. We would like to cooperate as much as possible with the centers doing a lot of work in this direction, namely the LDC (Linguistic Data Consortium) at the University of Pennsylvania, and to use their materials, especially for the evaluation phase of the English side.
There are also projects whose results (or at least some of them) would help this project; this would also make very effective use of funds spent on other grants and research activities both within and outside of the Czech Republic. We envisage the use of some of the results obtained in the following projects: Grammar Checking for Slavic Languages (a PECO project, funded by the EU), from which we would like to obtain some ideas about representations of ill-formed input; the Czech National Corpus project (funded by GAČR), as a resource of Czech textual material; and MATRACE (also funded by GAČR), as a starting point for comparison (and later, unification) of structural representations developed for the purpose of machine translation between two typologically different languages.
5 A Summary of the Goals
There are two main goals to be achieved:
A specification and thorough description of a single formal representation of language structure, integrating and enhancing previous theoretical results and at the same time adding new contributions (especially the representation of topic/focus, coreference, relations of discontinuous elements, etc.);
An experimental verification of the above, i.e. the markup of a substantial portion of diverse, real text samples using the formal specification developed under the grant. In other words, building a treebank. Two typologically different languages will be used for the experiments, Czech and English.
We consider the two goals mutually indispensable, as we believe that only a rigorous testing of any formal representation theory will put it on solid ground, and it will make immediate feedback possible.
- Petr Sgall, Alla Goralčíková, Ladislav Nebeský and Eva Hajičová, Functional Approach to Syntax, American Elsevier, New York, 1969
- Petr Sgall, Eva Hajičová and Jarmila Panevová, The Meaning of the Sentence in Its Semantic and Pragmatic Aspects, D. Reidel Publishing Company, Dordrecht, 1986
- Vladimír Petkevič, A New Formal Specification of Underlying Representations, Theoretical Linguistics 21, 7-61, 1995
Institute for Dutch Lexicology (INL)
Leiden, The Netherlands
INL annotates large text corpora with PoS and lemma information, using rule-based and stochastic taggers/lemmatisers. For the application of PoS-tagging, syntactic analysis can be quite useful: it may establish locality between an ambiguous PoS and its resolvent, allowing locally operating models (such as Hidden Markov Models) to resolve the ambiguity. At INL, exploratory investigations into syntactic tagging are being carried out, at present primarily for the purpose of improving PoS-tagging. The investigations address the problem of grouping the context between a PoS ambiguity and its resolvent into constituents.
Two `classical' parsers have been implemented: a CYK (chart) parser and a deterministic shift-reduce (Marcus) parser. No large grammars have been written for these parsers yet. The parsers are being used to study the intertwining of syntactic knowledge with the PoS-disambiguation rules of INL's rule-based tagger/lemmatiser DutchTale.
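The CYK algorithm mentioned here can be sketched in a few lines. The following is a minimal recognizer for a toy grammar in Chomsky normal form; the grammar, lexicon and example sentence are invented for illustration and do not reproduce anything used at INL:

```python
# Minimal CYK recognizer for a grammar in Chomsky normal form (CNF).
# The toy grammar and lexicon below are illustrative only.
GRAMMAR = {                 # binary rules: (B, C) -> set of left-hand sides
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {                 # terminal rules: word -> set of categories
    "the": {"Det"}, "student": {"N"}, "buys": {"V"}, "edition": {"N"},
}

def cyk_recognize(words, start="S"):
    n = len(words)
    # table[i][j] holds the categories spanning words[i..j] inclusive
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(LEXICON.get(w, ()))
    for span in range(2, n + 1):          # increasing span length
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):         # every split point
                for b in table[i][k]:
                    for c in table[k + 1][j]:
                        table[i][j] |= GRAMMAR.get((b, c), set())
    return start in table[0][n - 1]

print(cyk_recognize(["the", "student", "buys", "the", "edition"]))  # True
```

The chart (table) is exactly what makes such a parser attractive for tagging research: every constituent over every span is available for inspection, not just the single best parse.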
An alternative approach, boundary marking, produces shallow syntactic representations without fine-grained internal structure: it generates top-level phrasal boundaries, as in:
- [The student]-[will buy]-[the cheap edition].
Boundary markers do not need large grammars. A prototypical boundary marker has been implemented, using a small set of boundary-placing rules. It is as yet unclear whether PoS-tagging needs to address syntactic structure of greater sophistication than the shallow structure produced by boundary markers.
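A boundary marker of this kind can be sketched with a handful of rules over PoS tags. The tag set and the single rule below (open a new chunk at phrase-initial tags) are invented for illustration and are far simpler than the INL prototype:

```python
# Toy boundary marker: brackets a PoS-tagged sentence into top-level
# chunks by starting a new chunk at phrase-initial tags.
# The tag set and the rule are invented for illustration.
CHUNK_STARTERS = {"Det", "Aux"}   # tags assumed to open a new phrase

def mark_boundaries(tagged):
    """tagged: list of (word, tag) pairs; returns '[...]-[...]' chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in CHUNK_STARTERS and current:
            chunks.append(current)    # close the running chunk
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return "-".join("[" + " ".join(c) + "]" for c in chunks)

sent = [("The", "Det"), ("student", "N"), ("will", "Aux"),
        ("buy", "V"), ("the", "Det"), ("cheap", "Adj"), ("edition", "N")]
print(mark_boundaries(sent))
# → [The student]-[will buy]-[the cheap edition]
```

Even such a crude rule set reproduces the example above, which illustrates why boundary marking needs no large grammar: the rules refer only to tag bigrams, never to recursive phrase structure.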
In contrast to these approaches, self-organising models are investigated as well. A backpropagation neural network, currently used at INL for morphosyntactic disambiguation, can be trained on context-free grammar rules and supplemented with tree construction routines to behave like a parser. It is possible to train the net on a relatively small core grammar and let the net produce creative solutions for patterns outside the coverage of the training grammar (robustness).
A radical solution to the problem of writing large grammars for syntactic tagging would be the use of self-organising maps (SOMs), which can be used to construct a topology of syntactic clusters without previously formulated linguistic knowledge. These clusters can be interpreted as syntactic categories. This will be a topic of interest in the near future.
Co-ordinator: Ruta Marcinkeviciene
Since the October 1996 meeting of participants from the three projects ELSNET, ELSNET goes East and TELRI, the latter two projects have joined their efforts in documenting Eastern European NLP and speech sites, for the sake of greater efficiency and cost reduction in carrying out their mutual tasks.
During the first half of 1996, a revised joint questionnaire meeting the needs of all three projects was prepared and sent out to both Western (by ELSNET) and Central and Eastern European (by ELSNET goes East) countries. The questionnaires were sent out both by e-mail and surface mail from Amsterdam, in the hope of a slightly increased rate of response. 249 questionnaires were sent out to 11 countries: the Baltic countries, Belarus, Bulgaria, the Czech Republic, Hungary, Poland, Romania, Slovakia and Slovenia. 167 were distributed by e-mail, the remaining ones by surface mail. By the end of March, about 50 of them had come back answered to Amsterdam, and more keep arriving. Most answers come from the e-mail sites. The greatest percentage of answers came from the Baltic countries, the Czech Republic and Poland.
TELRI WG 2 actively participated in the creation of the new joint questionnaire, with the aim of carefully documenting language and speech resources according to the accepted pattern. We supplied the list of addresses of language and speech engineering organizations with 43 addresses, mostly from those countries which participate in only one of the projects, i.e. TELRI. TELRI participants are now responsible for those completed questionnaires which come in by surface mail. The next task for both projects is to prepare the European NLP and Speech Survey in electronic and paper versions.
Co-ordinator: Andrej Spektors
The aim of WG 10 for the coming period will be to work out proposals for projects dealing with computational methodology and software for the semi-automatic validation of corpora and lexicons. At present there are no strict standards adhered to by all resource developers, although no one objects to the adoption of a standard. Existing standards and guidelines developed in the course of various projects are mostly used at the level of recommendations, and resource developers do not always observe them to the full extent. It has to be noted that the development of linguistic resources in Central and Eastern European (CEE) countries is still in its initial stage; therefore the timely introduction of already developed standards during the course of resource creation would be beneficial, resulting in a considerable economy of financial resources. Of course, any standard can be introduced only through gradual acceptance by the Natural Language Processing community. Therefore, already accepted and validated standards have to be offered.
The lack of appropriate tools for the validation of written language resources constitutes a serious impediment to the wide-scale commercial exploitation of these resources. Prospects for introducing semi-automatic methods of resource validation in practice are especially good in CEE countries, where the creation of resources in the national languages is in its initial stage. Previous experience shows that existing resources there mostly do not conform to standards or specifications and are not harmonised in content and form. Existing resources in CEE countries are most often set up for the internal use of the producing institution and are not commercialised. Semi-automatic validation of formal properties would facilitate the distribution of all written language resources built up in accordance with the SGML and TEI formalisms.
The exploitation of linguistic resources in CEE countries is at present in an early stage; therefore the timely development of a methodology for validating newly created resources is of utmost importance, providing grounds for minimising the financial resources necessary for error elimination and standardisation in the future. The practical usability of standards and recommendations for different languages, which are more highly inflected than English and other Western European languages, would be tested during the realisation of such further projects.
Tools for verification in accordance with standards will be designed to create the necessary means for testing the correspondence of language resources to as many existing standards and recommendations as possible. The possibility of adding tools for testing resource compatibility with new standards and recommendations in the future will be supported. The development of tools will start with the collection of information on all participants' national languages and with a co-ordinated evaluation of this information. After the information has been collected and evaluated, experimental software will be created and tested for all national languages. Possibilities of reducing other tagging methods to the SGML standards will be inspected.
National language engineering standard centres will be established in the WG 10 participants' countries, where interested persons will be able to study existing standards, specifications, recommendations and evaluation methods in computational linguistics. During the work on the above-mentioned projects, the participants have to study all standards in detail. Therefore, standards (guidelines, recommendations, specifications) have to be collected together, and their requirements have to be carefully studied. Specific recommendations for the use of standards and specifications for the corresponding language will be developed in these standard centres.
A proposal for checking correctness according to SGML and other recommendations will be created. First, statistics on which SGML tags are used and how often, along with all other possible statistics about resource tagging, will be collected. The possible use of these statistics for automated resource evaluation will be investigated. Such statistics will be collected by each WG 10 participant for the respective national language, and algorithms will be developed. Methods and algorithms will also be developed to test the correspondence of language resources to the TEI and EAGLES recommendations and the specifications developed by the PAROLE project.
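Collecting tag-usage statistics of the kind proposed can start from something as simple as counting the SGML/TEI tags occurring in a document. The sketch below uses a regular expression rather than a full SGML parser, so it is only a first approximation (it ignores DTDs, entities and minimised tags); the sample document is invented:

```python
import re
from collections import Counter

# Count which SGML/TEI tags occur in a document and how often.
# A regular expression is only a first approximation; a real validator
# would use a proper SGML parser driven by the DTD.
TAG_RE = re.compile(r"<\s*(/?)([A-Za-z][\w.-]*)")

def tag_statistics(text):
    """Return (opening-tag counts, closing-tag counts) for an SGML text."""
    opens, closes = Counter(), Counter()
    for slash, name in TAG_RE.findall(text):
        (closes if slash else opens)[name.lower()] += 1
    return opens, closes

sample = "<p>One <hi>two</hi> three</p><p>four</p>"
opens, closes = tag_statistics(sample)
print(opens)                                        # Counter({'p': 2, 'hi': 1})
# tags whose open/close counts disagree hint at markup errors:
print([t for t in opens if opens[t] != closes[t]])  # []
```

Such counts already support the automated evaluation envisaged above: a tag inventory per resource, frequency profiles to compare against a standard tag set, and a cheap first check for unbalanced markup.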
1. General comments on the Seminar organisation
During January 29-30, 1996, the Romanian Academy (Center for Advanced Research in Machine Learning, Natural Language Processing and Conceptual Modelling) organised in Bucharest the National Seminar "Language and Technology", fully funded by the European Commission under the programme "Awareness Campaign on Language Technology". The organisation of the Seminar was supported by the Department for European Integration of the Romanian Government and the Research Institute for Informatics.
The National Seminar was a very successful event, attended by more than 250 participants from research, industry and public administration. The policy-making sector was represented by high-level officials, and public administration by several heads of departments in key ministries. Some big private companies in Romania were represented by their directors. Big state industrial and development companies had a significant representation in the audience and in the scientific program. Various SMEs expressed their interest in the Seminar not only by taking an active part in the event but also by establishing contacts aimed at finding possibilities for marketing some of the systems demonstrated during the Seminar.
The academic community made up more than 50% of the Seminar participants, and most representative Romanian scholars attended. Most of them came from the field of traditional linguistics and philology, but computer scientists, mathematicians and cognitive scientists were very well represented, too.
The Seminar had a truly national character, being attended by people representing all the important university towns of Romania: Bucharest, Iasi, Cluj, Timisoara, Craiova, Constanta, Sibiu, Brasov. The Seminar was widely advertised in the mass media. Press announcements were published in nation-wide newspapers and weekly journals (Academica, Economistul). The Chairman of the Organising Committee gave three pre-seminar interviews on national radio broadcasting programs. During the Seminar days, more than 15 persons (including EC officials) were interviewed. Two popular TV broadcasting companies included images of and comments on the Seminar in their news.
2. The Seminar Program
The Seminar lectures were given in the Magna Aula of the Romanian Academy, the most prestigious conference room in Romania. The demonstrations were given in the Presidium Room, next to the Magna Aula, specially equipped for this event with a heterogeneous local network (5 Pentiums, 4 IBM 486s, 1 SUN Sparc 4, and 2 Macintoshes). For the entire period of the Seminar, simultaneous ear-phone translation between Romanian and English was provided by specialised translators. The work they did was gratefully acknowledged by both the organisers and the participants of the Seminar.

Owing to the initial intention to have the contributions to the Seminar published by the Romanian Academy Publishing House, the most prestigious publishing house in Romania, a reviewing committee was formed and all the submitted contributions (except for the invited talks) were independently reviewed. Out of the 43 submissions, 29 papers were accepted for presentation. The volume, which also included the 12 invited papers, is considered quite representative of the state of the art in Romania as far as language technology addressing Romanian is concerned.

The Seminar was opened by the Vice-President of the Romanian Academy, Professor Aureliu Săndulescu. Professor Marius Guran, presidential advisor on science and technology, presented a salute message on behalf of Ion Iliescu, President of Romania. Secretary of State Dr. Ghiorghi Prisăcaru, Head of the European Integration Department of the Romanian Government, presented a salute message on behalf of the Romanian Government. Mrs. Karen Fogg, Head of the European Commission Delegation in Bucharest, presented a warm salute from the European Commission, highly appreciated by the audience. Secretary of State Mircea Petrescu, President of the National Commission for Informatics, gave a keynote speech on the informatisation strategy in Romania.
The second keynote speech was given by Jan Roukens from the European Commission-DGXIII, on one of the hottest issues of our present-day society: "Breaking the Language Barrier: Towards a Multilingual Information Society in Europe".
After the Opening Session, there were 3 sequential communication sessions:
Language Resources, Machine Translation, and Speech Technology; in parallel, there were several demonstrations of language technology systems implemented in Romania and addressing the Romanian language.
Five invited talks were given on the first day of the Seminar:
Wolfgang Teubert (IDS-Mannheim) - "Language Resources for Language Technology"
Svetlana Cojocaru (Academy of Sciences of the Republic of Moldova) - "Romanian Lexicon: Instrument, Implementation, Use"
Walther von Hahn (University of Hamburg) - "Machine Translation"
Rajmund Piotrowski (University of Sankt Petersburg) - "Machine Translation in New Russia"
Peter Roach (University of Reading) - "Speech Technology"
The first day of the Seminar concluded with a round table on "Bridging the Gap between Theoretical Linguistics and Linguistic Engineering" (moderators E. Simion and M. Guran), with panelists from both communities: Wolfgang Teubert, Rajmund Piotrowski, Marius Sala, Alexandra Cornilescu, Marian Papahagi, Peter Roach, Walther von Hahn, Alfred Leţia and Dan Tufiş. For an hour and a half the panelists analysed the existing gap between researchers of the two disciplines and pleaded for synergetic action for the benefit of language technology. The role of education was emphasised as a key factor in bridging the gap, and there were reports of some progress in this respect. The Technical Universities of Bucharest, Cluj and Iasi (the Computer Science Departments) have included in their curricula courses on natural language processing and linguistic theories (HPSG, GB). The philological faculties (University of Bucharest, University "Babes-Bolyai" in Cluj) have included in their curricula optional courses on text processing and computational linguistics.

The program of the second day contained two sequential sections, "Applications: Research, Industry, Users" and "International Cooperation", followed by a round table on the topic "How could international cooperation help the technology of the Romanian language", with EC representatives and Romanian decision makers as panelists.
There were 5 invited talks in the two sections:
Gabor Proszeky (MorphoLogic, Budapest) - "How to Reach the LT Market?"
Poul Andersen (EC-DGXIII, Brussels) - "Cooperation with Central and Eastern Europe: The European Commission's Strategy"
Steven Krauwer (OTS, Utrecht) - "European Cooperation: The ELSNET Experience"
Eva Hajičová (Charles University, Prague) - "Natural Language Processing in the Czech Republic: National Projects and International Cooperation"
Tomaz Erjavec (Jožef Stefan Institute, Ljubljana) - "International Cooperation in Slovenia"
The technical program of the second day of the Seminar concluded with the round table "How could international cooperation help the technology of the Romanian language" (chaired by D. Cristea). The panelists were Jan Roukens and Poul Andersen from the European Commission, and Marius Guran, Mircea Petrescu, Florin Teodor Tonisescu and Eugen Simion from key governmental institutions of Romania.
After some comments by the Romanian decision makers on the necessity of further concerted actions by local institutions towards more focused R&D activity in the field of language technology, and statements concerning governmental support, the EC officials summarised some key principles of international cooperation, emphasising the need for openness and distribution of tasks. Several questions raised from the audience were answered by the panelists. The Seminar ended with concluding remarks by Marius Guran, Mircea Petrescu and Jan Roukens. All three speakers judged the National Seminar "Language and Technology" a very significant, well-managed event for the Romanian scientific community and expressed their hopes for positive and synergetic follow-ups. Special thanks were addressed by the Romanian officials to the European Commission for making the Awareness Seminar in Bucharest possible.
Besides the European Commission, several individuals must be mentioned as recipients of our gratitude.
During the preparation of the Seminar, the organisers benefited from the assistance of Mrs. Grazyna Woszcieszko and Mrs. Helene du Callatay. Their readiness and their fast and precise answers to the issues raised during the organisation of the Seminar were extremely supportive. Special thanks are due to Mr. Poul Andersen, who involved himself deeply in preparing the Seminar (it suffices to mention a dozen calls, more than 100 e-mail exchanges and three thorough face-to-face discussions on different meeting occasions). Besides his extremely useful experience, his patience and understanding are warmly acknowledged. The invited speakers delivered carefully prepared, high-level presentations. Their efforts are sincerely acknowledged here.
Primoz Jakopin, SLOVENIA
The first TELRI workshop took place at the University of Birmingham in the week from October 10 to October 13, 1995. The TELRI Steering Committee accepted my application for a short-term visit, and so I could attend the event, which can rightly be described as most useful.
The workshop took place at the Corpus Linguistics Department of the School of English and at the COBUILD institution. There were 8 participants: Barbora Hladka from Prague, Ruta Marcinkeviciene and Vytautas Zinkevicius from Kaunas, Madis Saluveer and Tiit Roosmaa from Tartu, Ana Maria Barbu and Maria Lidia Diaconu from Bucharest, and myself. We all knew that the Birmingham corpus of English texts, the Bank of English with 210.5 million words, is the biggest in existence, but seeing it and its use on the spot is very different from knowing it only from the literature.
We were also very pleased by the warm reception and overall hospitality of our host, Prof. Dr. John M. Sinclair (JMS), and of his team. They spared no effort to help us with our task: to see the essential workings and benefits of such a corpus in the span of a few days. Lectures were accompanied by rich descriptions on paper, including examples, and Prof. Sinclair generously provided every one of us with several books, including the New COBUILD Dictionary of English. As most of us arrived in Birmingham a day earlier, the University library proved its worth on Monday. It is well stocked in the field of computational linguistics, and I could also find a lot of new foreign titles, most notably German ones, such as those from the QUANTITATIVE LINGUISTICS series.
The stay at Lucas House, only a short walking distance from the Department of Corpus Linguistics, was also very agreeable. The institution of the English breakfast was new to me, and it surprised me with its variety and richness, especially as I came with the false preconception that English food is mainly limited to fish and chips. Even good weather, for the lack of which the island is well known, contributed to the success of the workshop. It held throughout, and from a rented bicycle I could even catch a glimpse of the Birmingham countryside, with its vast network of navigable canals from the end of the 18th century, lately furnished with sidewalks. It was interesting to see how the old can be put to good use at the present time, and Prof. Sinclair's suggestion that the quickest way to get from the University to the very centre of Birmingham is by the canal sidewalk proved very accurate. It was 11 minutes by bike.
The workshop started on Monday with a reception in the Westmere main building. After some introductory words by Dr. Wolfgang Teubert, head of the TELRI project, and by Prof. Sinclair, there was an opportunity to discuss matters with workshop teachers and the people from Cobuild. The remark of Ramesh Krishnamurthy, corpus manager at Cobuild, that the lemmatisation of languages with rich inflection, such as Slovenian, should be easier than that of English, as there is more information for the mechanism to catch on, was highly interesting and provocative.
On Tuesday morning Prof. Sinclair gave an overview of corpus linguistics: from its first beginnings at the end of the sixties, through recent achievements coming out of it, such as the Bank of English, more efficient teaching, new ways of looking at the phenomena of language and better dictionaries, to what can be expected in the future. Elena Tognini-Bonelli followed with an interesting, fresh approach to how corpus data, especially collocations, can be put to good use in resolving ambiguity problems and the proper use of words in translation. It was illustrated by examples in an English-Italian context; as the summer school of Czech language in Prague also taught me a fair amount of Italian, I enjoyed it very much. In the afternoon Tim Johns, who is involved in teaching English for the International Students Unit (2,000 students) at the University of Birmingham (12,000 students in all), described the concept of data-driven learning. The accompanying teaching material on paper and his own software, CONTEXTS, showed how to teach a language in an enjoyable yet very efficient way. From the lecture it was very clear that the work of JMS has taken deep root in Birmingham; corpus-driven learning is no novelty there.
Wednesday morning was devoted to a visit to Cobuild, the COllins Birmingham University International Language Database (acronym invented by JMS), a joint venture between an academic partner (the University of Birmingham) and an industrial one (Harper-Collins). The University expected support in building a large-scale text corpus, while the other side expected an increased competitive edge through better dictionaries, with entries and explanations selected by their actual frequency and not at the discretion of dictionary authors. The project started in 1980 with a 50:50 share and pushed on with substantial funding from Collins: 1.5 million GBP from 1980 to 1984 and an additional 1 million GBP from 1984 to 1987 made Cobuild the largest joint project in the humanities worldwide. The database grew from 7.3 million words in 1983 to the 211 million now in the Bank of English, and the staff to today's 20 full-time employees.
On the hardware side, the work started on the University's ICL 1900 mainframe in 1980 and expanded in 1982 with the purchase of a DEC PDP 11/34 minicomputer (256 KB of RAM, 134 MB on disk), their first machine with a UNIX (MULTIX) operating system. Independence from the University mainframe was achieved in 1987 through their own network of RISC workstations (IBM 6150). A network of PCs was considered but dropped, owing to the belief that PCs would not be up to the task, while the workstations, though much more expensive, would. It would be interesting to see what the decision would be today, as the margin in capability between high-end PCs and workstations is vanishing fast. The Cobuild network was upgraded in 1991 to Sun SPARCstations (2 servers and 18 diskless workstations) and Tektronix terminals (16, with 17-inch screens). The software used at the beginning was the concordance builder COCOA by Atlas, supplemented by their own software (XLOOKUP) after 1983.

Later in the course of the project, the Collins publishing house was acquired by the media entrepreneur Rupert Murdoch, who also wanted greater control over Cobuild. His Harper-Collins now owns 75% of Cobuild; the University's 25% share, however, excludes it from vital decisions. The good side of the new parent arrangement is that Cobuild gets free access to all the publications of the media empire just mentioned, such as the newspapers the Independent, the Times, the Economist, New Scientist and the Guardian. The Bank of English (BOE) contains 75% British English material and 25% American. It does not include poetry, drama or child language.
The program XLOOKUP, which is used to retrieve data from the BOE, indeed performs impressively. It can also be tested, with limited access to a 20-million-word subset of the BOE, via the Internet. The relevant address, login and password are: titania.cobuild.collins.co.uk, login: cobdemo, password: cobdemo
Such a tool at one's disposal augments the possibilities of most research in the field of English language by an order of magnitude. The online access to collocational information on words is also of immense value for anyone writing in English.
Wednesday afternoon was devoted to concordance and collocation software (WordSmith Tools) for standard desktop PCs, written and presented by Mike Scott from the University of Liverpool. The software, intended for lexical analysis on PCs and for studying the output of larger computers on smaller machines, is planned for publication by Oxford University Press. It runs in the Windows environment, is quite impressive, and reflects the author's great experience in the field.
On Thursday, the home-grown software tools from the Department of Corpus Linguistics were shown and demonstrated by Oliver Jacobs. Owing to the great pressure put on the Cobuild staff in the new circumstances to produce as much marketable output as possible in the form of new dictionaries, the needs of the academic side (with its smaller share in the company) evidently had to be put aside in the development of XLOOKUP. The necessity for the Department to have its own software has become urgent and will be met in the next several months. It was, however, interesting to see how, in the world of workstations, PCs are inevitable as well. All the demonstrations were performed on PCs serving as UNIX terminals of larger workstations; the reason seemed to be the lack or unavailability of LCD overhead projection facilities for workstations.
The three C's in the title of the famous book by JMS, Corpus, Concordance, Collocation, had for most participants of the workshop only terminological value. In Birmingham we all gained an understanding of what it takes to construct a real textual corpus, to maintain it, and how to exploit it fully once it is ready.
Composing the Bank of English was no straightforward task, and many temptations had to be avoided. One of the most difficult points in such considerations, especially if the size of the database is expected to grow from megabytes to gigabytes, is what to do with errors, typos and the like. Prof. Sinclair's answer to the question was highly illustrative: "Errors are part of text. If you correct them, you lose information." His other remark, on which words to include in a future dictionary and which not, is also worth noting here: "If a word has a frequency of more than 15 in your corpus, you must have very good reasons not to put it in; if less than 15, for including it."
There are three other points worthy of further consideration:
1. The XLOOKUP program would benefit greatly, as I see it, from a graphical user interface (GUI), which it now lacks. Proportional screen fonts would allow much wider word surroundings, and colours would help the collocations, especially non-adjacent ones, stand out better.
2. In addition to displaying the current state of English, the Bank of English increasingly has encyclopaedic value. It could prove very useful, and would attract a much wider academic and non-academic audience, if the Bank included all the collected material and not only the current selection. It would be technically feasible, even now, to have a database larger by an order of magnitude. It should be accessible to inland users and the world community via the Internet, and be housed in an institution with a status and funding similar to those now characteristic of the National Library.
3. I also very much missed a good statistical description of the Bank of English at the character, word and sentence level. In the short time of the workshop it was not possible to obtain the word frequencies I would need for plotting the rank-frequency curve, which I would like to compare with data from smaller samples. It is my great hope that this will be possible in the not-so-distant future.
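The rank-frequency data mentioned in point 3 are straightforward to compute once word frequencies are available. The sketch below, run on an invented sample sentence rather than corpus data, produces the (rank, frequency) pairs one would plot on a Zipf-style curve:

```python
import re
from collections import Counter

# Compute rank-frequency pairs for a text: the most frequent word
# gets rank 1, the next rank 2, and so on (Zipf-style data).
def rank_frequency(text):
    words = re.findall(r"[a-zA-Z']+", text.lower())
    freqs = Counter(words)
    return [(rank, count)
            for rank, (_word, count) in enumerate(freqs.most_common(), start=1)]

sample = "the cat saw the dog and the dog saw the cat"
print(rank_frequency(sample))
# → [(1, 4), (2, 2), (3, 2), (4, 2), (5, 1)]
```

Plotted on log-log axes, such pairs from a large corpus would give the rank-frequency curve to compare against smaller samples.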
All in all, the knowledge gathered in Birmingham widened my horizons greatly. Together with the overview I obtained during a visit to the Institut für deutsche Sprache in Mannheim two years ago, it will help a great deal in any similar future task for Slovenian.