NEWSLETTER

No. 6


Contents:

1. Editorial - Ruta Marcinkeviciene

2. TELRI Event - Kaunas 2nd TELRI seminar

3. TELRI Event - Birmingham workshop

4. On the lexicons (continued)

5. New prospective member of the TELRI advisory board

 


 



Editorial

 

Ruta Marcinkeviciene, Vytautas Magnus University

 

It is rather symbolic that the second TELRI seminar, "Language Applications for a Multilingual Europe", took place in Kaunas, Lithuania, not far from the geographic centre of Europe, in a country between East and West that has emerged from the Eastern bloc and is seeking its place in the European Union. Lithuania was also a suitable location for the seminar because, as a small country speaking one of the most ancient Indo-European languages, it is well aware of the importance of communication and language preservation.

Before the seminar there was a plenary meeting of TELRI, which is a three-year project. As this is the final year of the project, we were able to look back and evaluate, in the separate working groups, both the results achieved so far and the tasks still unaccomplished. It was possible to recall the beginnings of TELRI, which now seem so far away, and to proclaim with John Sinclair, "We all went a long way" - from shy opening discussions about what kind of computers we work with, to bolder collaborative endeavours culminating in proposals for new projects. These proposals - TELPROM, PAROLE East and the TELRI Association - indicate that TELRI has achieved its goal of creating a trans-European language resources infrastructure. We will reap the fruits of our labour in the future, but participation in the project itself is immeasurably valuable, especially for newcomers to the field. This seminar, which gathered more than 70 participants from 25 countries, was remarkable for Dr Wolfgang Teubert's well-balanced programme (industry versus academia, reports versus demos, theory versus practice) and for the participants' genuine interest in topical issues such as standardisation.

Discussions, which began with Jeremy Clear's report, continued during the breaks and carried on into the farewell party. The party began with Lithuanian folk songs and turned into multilingual singing directed by Antonio Zampolli. The general merriment expressed itself in various activities, from searching for night life in Kaunas to an accidental exchange of jackets, which could have had sad consequences had the name-tags not indicated the true owners. Although the seminar has concluded, there remain discussions to finish, topics to explore and projects to work on. Hopefully this will be accomplished during the 3rd TELRI seminar in Italy in October. Wishing the organisers energy and success, I add: "Don't forget name-tags".

 

 

 



KAUNAS - 2nd TELRI seminar

  • Industry and Academia: The Turning Point
  • The English-Slovak & Slovak-English Dictionary
  • Morphological Analyzer
  • Parallel Corpora and Equivalents not Found in Bilingual Dictionaries
  • CUE - A Software System for Corpus Analysis
  • Marking, Aligning and Searching Translation Equivalents
  • Czech lexicon by two-level morphology

     

    Industry and Academia: The Turning Point

     Uri Zernik, OpenSource Inc., New York

    Somewhere in the last decade, a role reversal has taken place in our professional world. Traditionally, Academia (mostly the exclusive club of the ten leading American universities plus the major research labs) enjoyed a tremendous lead in what we call the "hi-tech" fields.

    Today, however, the picture is totally different. With the advent of effective communications over the Internet, and free access to vast text resources on-line, knowledge has spread across the board. New text-processing products are being developed, along with new experimental techniques, in companies such as Netscape, Yahoo, Microsoft, Lexis, and many other smaller companies.

    We now live in an egalitarian world. We all have a chance to contribute to the game, and to be active players in the information marketplace. All one needs is a solid natural language processing technique, a PC, and a hook-up to the Internet.

    In my talk I will describe these trends based on my own personal experience. I will chart some ways in which we computational linguists can play our role in the global game.

     

     

     

    A 'Hopeless' Project:

    The English-Slovak & Slovak-English Dictionary (ESSE)

     Vladimír Benko, Comenius University, Bratislava

     e-mail: jazybenk@savba.savba.sk

    The presentation describes a joint venture between a commercial publishing house (Slovak Pedagogical Publishers) and our Laboratory, in the framework of which a methodology has been developed for tagging a dictionary text that had (by a rather incompetent decision during a previous stage of the project) originally been keyboarded as 'plain' text without any mark-up. (By the time we joined it, the project, described as 'hopeless' by the publisher, had already been going on for more than ten years.)

    After analysing the data, a simple set of software tools was designed to provide semi-automatic assignment of dictionary entry structure tags. Based on regular grammar rules, the system is able to recognise headwords, morphological information, grammatical and stylistic labels, and sense numbers. Moreover, it is able (to a certain extent) to tag the English and Slovak parts of the example phrases. The implementation is based on simple tools (mostly written in lex) and statistics computed in a 'bootstrap' way.
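
    To illustrate the kind of rule-based recognition described above, here is a minimal Python sketch of tagging one plain-text entry line with regular expressions. The entry layout, the tag names and the patterns are invented for the example; the actual ESSE tools are written in lex and operate on the real dictionary text.

```python
import re

# Hypothetical entry layout: headword, optional grammatical label in brackets,
# then numbered senses. The patterns below are illustrative, not the ESSE rules.
HEADWORD = re.compile(r"^(?P<hw>[A-Za-z-]+)")
LABEL    = re.compile(r"\[(?P<label>[a-z.]+)\]")
SENSE    = re.compile(r"(?P<num>\d+)\.\s*(?P<text>[^0-9]+)")

def tag_entry(line: str) -> str:
    """Wrap the recognised parts of one entry line in SGML-like tags."""
    out = []
    m = HEADWORD.match(line)
    if m:
        out.append("<hw>%s</hw>" % m.group("hw"))
    m = LABEL.search(line)
    if m:
        out.append("<gram>%s</gram>" % m.group("label"))
    for s in SENSE.finditer(line):
        out.append("<sense n=%s>%s</sense>" % (s.group("num"), s.group("text").strip()))
    return " ".join(out)

print(tag_entry("abandon [v.] 1. opustit 2. vzdat sa"))
# -> <hw>abandon</hw> <gram>v.</gram> <sense n=1>opustit</sense> <sense n=2>vzdat sa</sense>
```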

     

     

    Morphological Analyzer

     Svetlana Stoyanova-Goranova, Bulgarian Academy of Sciences, Sofia

     e-mail: jpen@bgearn.bitnet

    A morphological analyzer for the recognition of the word forms of Bulgarian has been developed in the Department for Computer Modelling of Bulgarian over the last two years. It is based on the system "Plain" of Prof. Peter Hellwig and programmed in Turbo C. The linguistic database is a base lexicon comprising a dictionary of stems and a dictionary of inflections. The former includes 3 000 units from all parts of speech. The latter is the basis for the analysis of all word forms in the inflectional paradigms of the stems. In cases of homonymy, all and only the correct analyses are obtained. This year an extension of the dictionary of stems has begun.

    A further step is a lemmatizer, which can be used to determine the lemmas of words; besides lemmatization it also provides explicit morphological analyses and automatic updating of the dictionary of stems. As a rule, lemmatization is performed automatically without the participation of an operator. Only in cases of homonymy of word forms - when a word form has more than one analysis and more than one lemma - does the operator select the appropriate lemma from the variants offered, while the rest are discarded.

    The module for automatic updating of the dictionary is activated when the text contains a word form which cannot be recognized by the morphological analyzer. The operator then has to insert the stem(s) of the inflectional paradigm, determine the part of speech and select from the menu offered the inflections which each of the new stems takes. The menu does not contain the inflections for all the forms of the inflectional paradigm, but only those necessary and sufficient to assign each of the new stems to a given inflectional type; this assignment is made and stored in the database automatically. This makes possible the recognition and lemmatization of any word form of the paradigm at its next occurrence in the text.

    The product can be used for automatic lemmatization of texts of arbitrary length, including corpora, replacing manual lemmatization, which depends entirely on the qualification of the operator and is often error-prone, especially in cases of homonymy.
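
    The core lookup mechanism described above - a dictionary of stems combined with a dictionary of inflections - can be sketched roughly as follows. The stems, inflectional types and tags are invented toy examples, not the Bulgarian data of the actual system.

```python
# Minimal sketch of stem + inflection lookup. The entries are toy examples only;
# the real analyzer covers 3 000 Bulgarian stems and full inflectional paradigms.

# stem -> (lemma, inflectional type)
STEMS = {
    "kotk": ("kotka", "noun_f1"),
    "grad": ("grad", "noun_m1"),
}

# inflectional type -> {ending: grammatical tag}
INFLECTIONS = {
    "noun_f1": {"a": "sg.indef", "ata": "sg.def", "i": "pl.indef"},
    "noun_m1": {"": "sg.indef", "yt": "sg.def", "ove": "pl.indef"},
}

def analyze(wordform):
    """Return all (lemma, tag) analyses whose stem + ending spell the word form."""
    analyses = []
    for i in range(len(wordform), 0, -1):
        stem, ending = wordform[:i], wordform[i:]
        if stem in STEMS:
            lemma, infl_type = STEMS[stem]
            tag = INFLECTIONS[infl_type].get(ending)
            if tag is not None:
                analyses.append((lemma, tag))
    return analyses

print(analyze("kotkata"))  # [('kotka', 'sg.def')]
print(analyze("gradove"))  # [('grad', 'pl.indef')]
```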

     

    Alexandra Jarošová,

    Slovak Academy of Sciences,

    Bratislava

     

    e-mail: sasaj@juls.savba.sk

     

    Parallel Corpora and Equivalents not Found in Bilingual Dictionaries: An Attempt at Their Generalisation

     

    An analysis of the English and Slovak translations of Plato's Republic was done with respect to the three groups of translation equivalents (TEs) proposed by W. Teubert:

    A. TEs found in the bilingual dictionary (BD)

    B. TEs not found in the BD and not regarded as suitable

    C. TEs not found in the BD but regarded as suitable

    The results of the A list analysis were presented at the TELRI seminar in Ljubljana (1997) and published in TELRI Newsletter No 5 (April 1997). The C list contains, as a rule, context-sensitive but recurrent TEs.

    The following types of TEs are missing in the English-Slovak Dictionary:

    1. Slovak verbs as equivalents of English noun-verb collocations. The absence of these equivalents in a given English noun entry is the result of insufficient treatment of collocations consisting of the noun (headword) and verbs "denoting creation and activation" (the BBI Combinatory Dictionary's expression), e.g. offer up prayers to sb., to catch sight of st./sb., take hold of st./sb.

    2. Slovak adverbs as equivalents of English prepositional phrases. The absence of these equivalents in a given English noun entry is caused by inconsistency in presenting the noun as part of prepositional phrases, e.g. at present, in justice, from a distance.

    3. Synonyms of prototypical equivalents as translational devices applicable in specific contexts:

    a) The synonym of a prototypical equivalent compensates for the collocational restrictions of the latter.

    b) Even where there are no restrictions on the use of the prototypical equivalent in a given lexical environment, the synonym of the canonical equivalent may have a more specialised meaning for a given lexical partner.

     

    Two related issues will be considered:

    What is the nature of the prototypical equivalent in existing bilingual dictionaries?

    The lexicosyntactical environment of the lemma and the problem of arrangement of BD entry structure.

    Oliver Mason,

    University of Birmingham,

    Birmingham

     

    e-mail: O.Mason@bham.ac.uk

     

    CUE - A Software System for Corpus Analysis

     

    Since I might not be able to give a `live' demo (depending on the accessibility of a suitable machine) I have prepared a more general presentation. This will start off by explaining the qualitative differences in handling between small and large corpora and the problems that one faces when working with large corpora. Then the solutions adopted in CUE will be explained and its main features described. If possible a `hands-on' demonstration follows.

     

    Stoyan Mihov,

    Bulgarian Academy of Sciences,

    Sofia

     

    e-mail: stoyan@lml.acad.bg

     

    MARK ALISTeR: Marking, Aligning and Searching Translation Equivalents

     

    MARK ALISTeR is a system for automatic aligning and searching of translation equivalents in large bilingual corpora. It performs sentence alignment of parallel texts using the Gale-Church algorithm, with resulting correctness of more than 95%.
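
    As a rough illustration of the length-based idea behind the Gale-Church algorithm mentioned above, here is a minimal Python sketch of the dynamic programme over sentence beads. The prior probabilities and the cost function are simplified, illustrative values, not the ones used in MARK ALISTeR.

```python
import math

# Prior probabilities for alignment patterns (source sentences : target sentences).
PRIORS = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01, (2, 1): 0.045, (1, 2): 0.045}

def length_cost(src_len, tgt_len, c=1.0, s2=6.8):
    """Penalty for pairing segments of these character lengths.
    Assumes the target length is roughly Normal(c * src_len, s2 * src_len)."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / c) / 2.0
    z = (c * src_len - tgt_len) / math.sqrt(s2 * mean)
    return z * z / 2.0  # negative log of a Gaussian density, up to a constant

def align(src_sents, tgt_sents):
    """Dynamic programme over sentence beads; returns a list of aligned beads."""
    n, m = len(src_sents), len(tgt_sents)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                if i + di > n or j + dj > m:
                    continue
                sl = sum(len(s) for s in src_sents[i:i + di])
                tl = sum(len(t) for t in tgt_sents[j:j + dj])
                cand = cost[i][j] + length_cost(sl, tl) - math.log(prior)
                if cand < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cand
                    back[i + di][j + dj] = (di, dj)
    # Trace back the best path.
    beads, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        beads.append((src_sents[i - di:i], tgt_sents[j - dj:j]))
        i, j = i - di, j - dj
    return list(reversed(beads))
```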

    MARK ALISTeR accepts input files in different formats: .txt files (with or without line breaks), WinWord files, files with Ventura markers, and SGML-marked text (with or without sentence marking). The correctness of the marking is checked as well. Automatic searching for, and display of, translation equivalents within the synchronised flow of the parallel texts is another extremely helpful function of MARK ALISTeR.

    MARK ALISTeR was developed at the Linguistic Modelling Laboratory, Bulgarian Academy of Sciences. It is an MS Windows application running on all Intel-based Windows systems after Version 3.1, with a user interface written in Delphi. The system was designed and implemented in order to facilitate our tasks of aligning corpora in GLOSSER #343 COPERNICUS'94 JRP and BILEDITA #790 COPERNICUS'94 JRP.

    In its current version MARK ALISTeR is a language-independent tool. The quality of its alignment results can be improved by:

    (1) decreasing the noise level in the input data by integrating language-specific information for correct recognition of sentence boundaries;

    (2) elaborating the editing facilities of the system.

     

    Hana Skoumalová,

    Charles University,

    Prague

     

    e-mail: Hana.Skoumalova@ff.cuni.cz

     

    Czech lexicon by two-level morphology

     

    In my paper I show how I converted an existing Czech lexicon to a two-level morphology system. The existing lexicon was originally designed for simple C programs that only attach 'endings' or 'suffixes' to 'stems'. The quotation marks in the previous sentence indicate that the terms stem, ending and suffix are used technically rather than linguistically. All alternations were handled inside the endings or suffixes, which required creating more paradigms than really exist in the language.

    In the two-level approach it is possible to work with a morphonological level and then to treat the phonological and/or orthographical changes by separate rules - two-level rules. In our lexicon I first had to redesign the set of paradigms. Paradigms that differed only in phonological alternations were merged, and they were rewritten from the orthographical form to the morphonological form.

    The next step was to create a set of two-level rules. In my work I did not try to cover all alternations that occur in the language, but only those that are frequent and productive. Other alternations are either treated as exceptions (e.g. shortening the vowel in a noun stem) or several stems are introduced for one lemma (e.g. six stems for irregular verbs - for infinitive, present indicative, imperative, past participle, present participle and transgressive). The three main types of alternations covered by the two-level rules are palatalization, assimilation and epenthesis.
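
    Real two-level rules are parallel constraints over pairs of lexical and surface symbols; the short Python sketch below only approximates the idea with ordered rewrite rules, and the two rule instances (a palatalization and an epenthesis) are simplified examples rather than the author's actual rule set.

```python
import re

# Illustrative sketch only: the rules are approximated by ordered rewrites over
# a morphonological form in which '+' marks a morph boundary and '0' a zero ending.
RULES = [
    # palatalization: stem-final k becomes c before the ending -i (e.g. vlk+i -> vlci)
    (re.compile(r"k\+(?=i)"), "c"),
    # epenthesis: insert -e- into a final consonant cluster before a zero ending
    (re.compile(r"([bcčdfghjklmnprsštvzž])([kcčn])\+0$"), r"\1e\2"),
]

def surface(lexical_form: str) -> str:
    """Map a morphonological form with '+' morph boundaries to a surface form."""
    out = lexical_form
    for pattern, repl in RULES:
        out = pattern.sub(repl, out)
    # remove any remaining boundary markers and zero endings
    return out.replace("+0", "").replace("+", "")

if __name__ == "__main__":
    print(surface("vlk+i"))   # -> 'vlci'  (palatalization; nom. pl. of 'vlk')
    print(surface("matk+0"))  # -> 'matek' (epenthesis; gen. pl. of 'matka')
```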

    The whole converted lexicon contains about 17 million word forms representing about 35 million grammatical forms, which covers about 96% of running text.

     

     



    TELRI event - Birmingham workshop
    Kiril Ribarov, Czech Republic

    In TELRI Newsletter 3 (June 1996) one could read a very enthusiastic and promising report on the first TELRI workshop, held October 10-13, 1995, in Birmingham. The latest event, Birmingham revisited (May 26-30, 1997), successfully demonstrated the continuation of the corpus-oriented activities and fulfilled the expectations one could hold after reading the first report; it also brought additional strength and encouragement. The workshop took place at the School of English at the University of Birmingham, situated in the red-brick heart of the University, close to the gracious University Tower (which very much resembles a classical Italian one) and the University Library.

    There were sixteen participants present: Janusz Bien and Zygmunt Saloni from Warsaw, Kadri Vider from Tartu, Aneta Dineva and Iordan Penčev from Sofia, Andrejs Spektors from Riga, Alexandra Jarošová from Bratislava, Tamas Varadi from Budapest, Tomaž Erjavec from Ljubljana, and František Čermák, Karel Kučera, Jan Hajič, Zdeňka Urešová, Jaroslava Hlaváčová, Alena Böhmová and myself from Prague. Many of us were young researchers and students.

    The four-day workshop enriched us in many ways: carefully selected lectures, open dialogue, new experience, all of it accompanied by the frank and warm hospitality of our organiser, Ann Lawson, with local help from James Williams during the event itself. And, guided by English habits, all of us enjoyed the warm, sunny weather. I will allow myself to say, and I hope the participants will agree, that the organiser arranged everything we could have thought of, starting with a smiling welcome on our arrival, accommodation with the rich English breakfast to which everybody got used very quickly, access to the Internet and to the university facilities including the University Library, and plenty of information about cultural events, the university campus, the town of Birmingham and, of course, the Cadbury chocolate factory. Writing about chocolate reminds me of two things: Roald Dahl and his book "Charlie and the Chocolate Factory", and the pleasant and very rich dinners all of us enjoyed in a French and several Indian restaurants, again thanks to our organisers. Our conversations also continued many times after the lectures in the tempting atmosphere of English pubs.

    In what follows I will focus on the lectures. After the opening by Ann Lawson on Monday, the workshop started with the lecture "Demonstration of Bank of English and tools for collocational analysis" by Geoff Barnbrook, who talked about the historical dimension of corpus research using the up-to-date Cobuild tools. Cobuild, which we visited on Wednesday, was present in all the workshop lectures: it prepares, preserves and analyses the evidence of the English language - the Bank of English, a corpus structured into subcorpora (including British, American and Australian English) and currently containing more than 323 million nodes. The nodes (words) are POS-tagged and lemmatised. Most of the functions one needs to analyse the behaviour of nodes within the Bank of English are gathered in the XLOOKUP modules, the starting point for collocational analysis, whose tenet is, as stated by the speaker, that words do not occur accidentally or randomly: there are constraints which cause some words to appear more often in the vicinity of others. With the XLOOKUP tools it is possible to access a node or group of nodes (by defining a simplified regular expression). The output can be declared an autonomous unit, a subcorpus, on which further analysis can then be performed. An interesting feature is the so-called picture of a node, presented as a table computed from collocations. The picture shows the most frequent words occurring at the first, second, third, etc. positions on both sides of the node, mutually unrelated and ordered by frequency of occurrence. This is a useful tool which summarises the information from what are often long lists of collocates. There are also tools which provide purely statistical information, such as the t-score, which compares the expected random occurrence of a collocate with its actual occurrence in the corpus, and other scores based on the mutual information principle. These statistical measures may serve as sorting criteria.

    We were shown how the corpus and the XLOOKUP tools may be used, for example to trace the history of English spelling; the finding is that recorded spelling changes tend to favour irrational spellings. It is known that the 16th century, the time when English had become more self-confident, introduced more complicated spelling: it was thought that if the spelling were more difficult (irrational), it would be more respectable. Rather provocative was the statement the speaker made at the beginning, namely that he sees no difference between literature and linguistics.
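
    For readers unfamiliar with the collocation statistics mentioned above, here is a minimal sketch of the t-score and mutual information measures as they are commonly defined in corpus linguistics; the exact formulas used inside XLOOKUP are not given in this report.

```python
import math

def collocation_scores(f_node, f_collocate, f_pair, corpus_size, span=4):
    """f_node, f_collocate: corpus frequencies of the two words;
    f_pair: how often the collocate occurs within `span` words of the node;
    corpus_size: total number of tokens in the corpus."""
    expected = f_node * f_collocate * span / corpus_size
    t_score = (f_pair - expected) / math.sqrt(f_pair) if f_pair else 0.0
    mi = math.log2(f_pair / expected) if f_pair else float("-inf")
    return t_score, mi

# Example with made-up counts: a collocate seen 120 times near a node.
print(collocation_scores(f_node=5000, f_collocate=30000, f_pair=120,
                         corpus_size=323_000_000))
```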

    The second talk, "Standards for Data Reference", was delivered by the host of the previous Birmingham workshop and the father of corpus research at the University of Birmingham, Prof. John Sinclair. He focused on crucial, basic but extremely important questions concerning the preservation and representation of information encoded in natural language. One could only agree that the TEI standards are set too high to be acceptable and that there is a need for meta-characters above ordinary characters. He pointed out three modes in which language is available: spoken (linear), written (non-linear) and electronic (linear), which he regards as a distinct mode because it is already electronically encoded. In any further processing of texts it is very important to preserve the textual integrity of the document, which can be achieved by preserving the original while making a digitised copy of it. Sinclair distinguishes the digitised copy from the so-called working copy (a subset of the digitised copy), which is a single transcription of the digitised copy. Any further processing is done on the working copy only, which is aligned to the digitised copy by an aligner. This approach avoids the obstacles of current practice, which is based on a single linear string, keeps to the tradition of manual mark-up, lacks the subjective/objective distinction and tends to impose document models (the document must look as the style defines it, so the author cannot change the document without changing the style first). The general approach Sinclair is trying to build is intended to be automatic, on-line, multiple-stream data processing.

    As stated earlier, Cobuild was "everywhere" - a reception took place at Westmere, the former Cobuild site, still a place full of literature, where Shakespeare plays are performed in its beautiful gardens. By then everybody was well acquainted with the workshop environment, and the reception was a very suitable place for exchanging first impressions and getting to know each other better.

    We also met some of the people from Cobuild when the workshop continued on Wednesday morning at their new site on the other side of the campus, with a rich and nicely organised programme. After the opening talk by Jeremy Clear, we heard about "Corpus-based English grammar analysis" from Gill Francis. She explained how grammar is described in the Cobuild dictionaries, namely through verb frames. After that, Ramesh Krishnamurthy talked about "The Bank of English". In an open discussion we were acquainted with the types of material included in the corpus and with the difficulties and costs involved in collecting them, from GBP 60 for electronically available materials up to GBP 15,000 for the manual keying of texts available on audio tape (per 1 million words). In order to "weight" the corpus with a variety of text types, they collect so-called ephemera, a collection of newspaper and magazine headers and advertisements, which are typed in manually.

    We were given the opportunity to have a direct look at how the lexicographers compile the dictionaries and how they use the XLOOKUP tools to analyse a dictionary entry; Ros Combley, Jenny Watson, Laura Wedgeworth and John Williams were of great help and showed considerable patience.

    Then our visit continued in three parallel sessions: "Phraseology" by Rosamund Moon, "Dictionary project management" by Stephen Bullon and "Computational aspects of Cobuild work" by Jeremy Clear. I owe my thanks to the colleagues who attended the other two parallel sessions, explained to me what happened there and allowed me to use their materials in this report.

    R. Moon talked about possible approaches to corpus analysis for meaning extraction, pointing out, with a wide spectrum of warnings, the difficulties one faces in doing so. She pointed out that in a large corpus it is necessary to define combinations of words accurately, drawing mostly on the experience of their work on definitions of idioms and phrases on the basis of collocational patterning (1987); this experience was negative, since it proved very difficult to locate the boundaries of collocations. This is an area of strong diversity and dependence on style (formal, conversation, fiction, non-fiction). Idioms were introduced as a special kind of phrase. They tried to capture their frequency in usage and to answer questions about their heritage, being aware of new American idioms in British English (mostly in journalism), of different metaphor shifts (health metaphors in financial text), etc. Other varieties are involved as well, since patterning in spoken English is different from that of written English. Research on meaning extraction is very close to that on locating idioms. In this area some statistics were reconstructed: counting frequencies of groups of immediate neighbours, or frequencies of literal versus idiomatic meanings in the Bank of English. When asked why English is so idiomatic, she answered: "We develop new concepts - so we need new expressions for them, but instead of creating brand new words, we use already existing words and put them together in order to bring new senses. Evidence for this can be found in the corpora."

    The second parallel session, "Dictionary project management", was led by Stephen Bullon, who drew our attention to the more practical side of combining dictionary development with current market conditions. On the one hand, all of us heard what we had already somehow experienced ourselves; on the other hand, it was a kind of relief to have the actual marketing problems confirmed: the financial and marketing situation almost always influences the final product in many ways.

    Jeremy Clear is the author of the XLOOKUP system, and he talked about the "Computational aspects of Cobuild work". In his talk he gave a nice survey of corpora and their analysis, and of compiling a dictionary and its final printing. He presented a more profound background to the work done at Cobuild and documented their slogan: "The bigger, the better!"

    In the afternoon, after a short walk to Westmere, we heard two lectures on "Data-driven learning", the first being "Monolingual and Multilingual Data and Software" by Philip King. The work on multilingual corpora started in 1994, based on an idea of Francine Roussel (Université Nancy II). The project currently has partners in Denmark, Finland, France, Germany, Italy, Spain and the UK, and it incorporates Danish, English, French, German, Greek and Italian (from the start in 1994), with Finnish, Portuguese, Spanish and Swedish officially added in 1997. There is also a group of unofficially added languages: Afrikaans, Dutch, Hungarian, Lithuanian, Polish, Russian, Welsh and Zulu. This very ambitious project plans to incorporate Chinese, Japanese and possibly other languages in the future. The aim is mostly pedagogical: it should serve language teachers and learners as well as translation trainers and trainees. The pedagogic focus requires easy mark-up, easy input of one's own text pairs, a user-friendly interface with student control, the possibility of test creation and effective feedback between programmers and users. We were given the chance to get fully acquainted with this software. Future development will include more languages, more text types, more pooling of experience, greater interactivity and more local autonomy.

    After a short break, Tim Jones continued the afternoon with a talk on "Monolingual and Multilingual Teaching Materials". The roots of the methods of corpus linguistics go back to the 1960s. Incidentally, some of the participants recalled Tim Jones as their English teacher at their local universities! Even now he regularly helps foreign students with English. He is guided by the metaphor of the learner as a researcher, testing hypotheses and revising them in the light of data - or as a detective, finding and interpreting linguistic clues. Data-driven learning changes the perception not only of how to organise learning, but also of what is to be learned. We enjoyed his approach and methods, which he illustrated with rich lists of examples.

    An English breakfast and sunny weather opened a new day in Birmingham, which everybody had been looking forward to. "English Words in Use - Compiling a Dictionary of Collocations" by Ann Lawson was the first Thursday lecture. She presented part of the work she undertook while still based in Birmingham, before moving to the IDS in Mannheim, bringing us a new, untraditional dictionary of collocations instead of, I would say, limited explanations. As stated by A. Lawson, collocations are hard for a standard dictionary to describe, since they are flexible and discontinuous, and introspection and intuition about them are unreliable. From the learner's point of view they are opaque and tricky, require experience and account for many mistakes. Everyday, frequent collocations are also very easy to miss and thus difficult to find. One of the basic aims of this dictionary is therefore to catch those kinds of collocates. The dictionary is due to be finished at the end of 1997. Let me express my wish for the success of this novel approach.

    Originality of approach persisted through the next lecture by Oliver Mason, "Lexical Gravity". He stated that collocations have not been parametrised in existing work. He tried to do so by defining the following collocation parameters: environment span; cut-off/threshold (throw away words with frequency < n, where n is small - they are either misspellings or very rare words, depending on the choice of n); node preprocessing - groupings (semantic, uppercase, lowercase, etc.); collocate preprocessing; significance evaluation (mutual information, t-score, etc.); and reference frequency. He further specified a context (span) as something which defines a specialised sub-sample (sub-corpus), is motivated by syntax (sentence, phrase), distance (window) or adjacency, and influences both the results and the computational costs.

    After the definitions, the author concentrated on a formalism for measuring the influence of nodes on each other. To do so, he accepts the following assumptions/predictions: the variability of the environment is influenced by the node; there are different patterns for grammatical and lexical items; there are individual patterns of influence for each word; there exists an upper limit on the span of influence; and the results should be independent of the sample taken. The procedure for measuring mutual influence calculates the TTR (type-token ratio) for each relative position around a node word, once a collection of its instances has been made. The result of this procedure should be a certain threshold of significance. The graphical interpretation is strongly reminiscent of a gravitational well, hence 'lexical gravity'. Oliver Mason's analysis reached the following conclusions: lexical gravity is not symmetrical; different words have different patterns; different forms have different patterns; there exists a constant across different corpora; lexical gravity becomes more stable with increasing corpus size; and the existence of a certain, so-called 'negative' gravity for certain grammatical words was postulated.
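
    The type-token ratio per relative position that underlies the lexical gravity measure can be sketched as follows; thresholds, normalisation and other details of the original procedure are not given in the report, so this is only an approximation of the general idea.

```python
from collections import defaultdict

def ttr_by_position(tokens, node, max_span=5):
    """For each offset -max_span..+max_span, collect the words co-occurring with
    `node` at that offset and return types/tokens for each position."""
    seen = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-max_span, max_span + 1):
            if offset == 0:
                continue
            j = i + offset
            if 0 <= j < len(tokens):
                seen[offset].append(tokens[j])
    return {off: len(set(ws)) / len(ws) for off, ws in sorted(seen.items()) if ws}

# Usage with a toy token list; a low TTR at an offset means the node strongly
# constrains which words appear at that position.
text = "the cat sat on the mat and the dog sat on the rug".split()
print(ttr_by_position(text, "sat", max_span=2))
```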

    His future plans include the classification of words according to their gravity patterns and the separation of different meanings of forms; he also plans to take multiword expressions and fixed phrases into consideration and to investigate how other languages behave in the described formal sense.

    From my own experience I would like to add that if a similar analysis is run on letters instead of words, a certain isomorphism can be observed. The experiments are even more encouraging, since something similar to lexical gravity can be reconstructed from the autocorrelation function when run on either letters or nodes.

    Factual proof of what has been said so far, illustrated by plenty of examples, was provided by Prof. Frank Knowles in the next session, "Corpus Analysis for LSP", where the lecturer tried to explain the vagueness of words as compared to their forms (the words themselves).

    The last day of the workshop was a day of open dialogue, when participants had the opportunity to present their own work. Jan Hajič from Prague presented the Czech National Corpus and the Prague Dependency Treebank (see TELRI Newsletters 4 and 5 for more details about both of these projects).

    The last presentation, by Mike Scott, was about the "WordSmith Tools". WordSmith Tools is a package of programs that help to see how words behave in texts: the WordList tool provides a list of all words or word clusters in a text, set out in alphabetical or frequency order; Concord, a concordancer, gives one the chance to see any word or phrase in context; KeyWords finds the key words in a text.
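
    As a small illustration of two of the functions mentioned above, here is a sketch of a word frequency list and a keyword-in-context concordance; WordSmith Tools itself is, of course, a far richer package.

```python
from collections import Counter
import re

def wordlist(text):
    """Return (word, frequency) pairs in descending frequency order."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    return Counter(words).most_common()

def concordance(text, keyword, width=30):
    """Print each occurrence of `keyword` with `width` characters of context."""
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), text, re.IGNORECASE):
        start, end = m.start(), m.end()
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width]
        print(f"{left}  {m.group(0)}  {right}")

sample = "Words behave in texts, and texts show how words behave."
print(wordlist(sample)[:5])
concordance(sample, "words")
```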

    We were a group that was, on the one hand, big enough to raise issues and contrast ideas and, on the other hand, small enough to cooperate and to sit around one table, say in a nice English pub or restaurant. For some of us it was the first visit to Birmingham and the University campus, and we brought back wonderful memories. For the younger among us it was encouraging to meet other young researchers, both from the workshop group and from the host university itself.

    The weather was still sunny, even the day we had to leave.

     

     

    "It was a pleasure to welcome on behalf of TELRI a mixed and interesting group of researchers to the four-day workshop. The enthusiasm, not to say stamina, of the participants, together with a varied programme, made for a very enjoyable and productive time. I especially enjoyed the Czech presentation as an example of real bilateral participation in the workshop. An additional side-effect was that the visitors' requests prompted me into discovering that it is possible to ascend the clock tower in the centre of campus, which we promptly did. During almost ten years at the University I had never done this and, surprisingly enough, the English weather smiled on us to grant us wonderful views. In summary, a very worthwhile and thought-provoking time was had by all."

    Ann Lawson, IDS, Mannheim

     

     

     



    New prospective member of the TELRI advisory board

     

    Prof. Dr. Alexandr Zoubov,

    Minsk Linguistic University,

    Department of Computer Science and Applied Linguistics,

    Minsk, Belarus

     

    The Department of Computer Science and Applied Linguistics was established in February 1975. At present the Department consists of 9 lecturers, 3 post-graduates, 3 engineers and 10 members of technical staff. Ten lecturers from other departments of the University collaborate with the unit.

     

    Current working projects of the Department:

    - computational learning theory and MULTIMEDIA programs development (English, French, German, Spanish, Russian, Belarussian);

     

    - formalization of text structure and development of programs for text generation (French: advertisement, tales, proverbs, riddles, technical descriptions; English: technical descriptions, advertisement; Russian: poetry);

     

    - automatic estimation of the lexical stock of foreign-language textbooks (on the basis of statistical coefficients);

     

    - computer understanding and the development of programs for information compression of texts (scientific, technical and socio-political texts).

     

    Our text databases:

    1. Scientific Russian texts on the theme "Linguistic Computer Science" (nearly 200 000 units)

    2. Scientific Belarussian texts on the theme "Linguistic Computer Science" (nearly 60 000 units)

    3. English texts by A. Clarke, G. Greene, I. Murdoch and H. Golding (about 300 000 units)

     

    Our lexical resources:

    1. English-Russian dictionary on the theme "Computers, numeric control, data processing in computer networks, flexible production systems". The dictionary contains 43 500 words, word combinations and abbreviations.

    2. German-Russian dictionary on the theme "Computers, informatics and robot technology". The dictionary contains 40 200 words and word combinations.

    3. Russian dictionary on the theme "Computer technology and programming". It contains the 200 most frequently used words of the Russian language, 5 000 terminological word forms and 38 000 stems of Russian words.

    4. Russian dictionary of poetry. It contains 3 000 words.

    5. Frequency lists (on paper) of 6-letter combinations (190 807 entries), 5-letter combinations (357 504 entries), 4-letter combinations (75 045) and 3-letter combinations (20 355) from Russian texts. The texts comprised 520 000 characters of belles-lettres, 170 000 characters of legal texts and 310 000 characters of scientific and technical texts.

    6. Frequency lists (on paper) of English and French words and word combinations on the theme “Specific systems of communication and computers” (text of each language included 200 000 entries).

    7. Russian-English, English-Russian, Russian-French and French-Russian dictionaries. Each language pair contains 1 110 - 1 500 words and 170 - 290 word combinations in 12 topic areas (on paper).

    8. Russian-English, English-Russian, Russian-French, French-Russian, Russian-German, German-Russian, Spanish-Russian, Russian-Spanish, Italian-Russian and Russian-Italian dictionaries. The dictionaries for each language pair contain 1 740 words from 19 topic areas (on paper).

     

    National Project: Creation of Belarussian Computer Fund

     

    International Project: TELRI