Abstracts accepted for the seventh TELRI seminar, Dubrovnik, 26-29 September 2002
Definition parser and dictionary translation
During the early 1990s software was developed at the University of Birmingham by Geoff Barnbrook and John Sinclair to parse English definition sentences of the kind used in the Cobuild dictionaries. The development and operation of this parser and the local grammar of definition associated with it are described in Defining Language (Barnbrook, forthcoming). One of the potential applications identified for the parser after its development was its use in the translation of dictionary definitions into other languages and the production of bilingual bridge dictionaries based on the Cobuild range. This paper explores the specific areas where the parser could be used in this process and invites suggestions for collaborative projects making use of the parser. It will draw on the actual and potential use of the grammar in the production of current and proposed dictionaries.
Vladimir Benko Web and/as Corpora: Linguistic Data on Internet
There has recently been a growing tendency to consider the web as a large multi-language corpus, out of which dynamic corpora can be selected for individual languages or language sets, both monolingual and parallel, by means of Language Technology tools. To analyse those dynamic corpora, the output of conventional search engines can be used, either in its "pure" form or with minimal post-processing that does not involve downloading the web pages returned by the search. To be able to use a search engine in this way, several conditions must be fulfilled:
(1) The search engine must correctly process texts in all representations (character sets) used for the respective language on the web. While this is easy to meet for English, which usually does not use accented characters, for most western "Latin-1" languages it requires coping with at least the ISO-8859 and UTF-8 encodings. For "Latin-2" languages, such as Slovak, Czech or Hungarian, at least three code sets have to be handled (Win-1250, ISO-8859-2 and UTF-8); and for "Cyrillic" languages like Russian no fewer than four sets must be considered (Win-1251, ISO-8859-5, KOI8-R and UTF-8). The local MAC and DOS encodings can usually be ignored without great loss of language data, though in general they might also be considered. The "correct interpretation" of the respective character set typically involves the ability to distinguish between alphabetic and non-alphabetic characters, to provide correct upper/lower case conversion, to accept all valid alphabetic characters in search expressions, and to display the search results in a legible way. The easiest and most generally accepted solution is the one introduced by AllTheWeb and recently adopted also by Google, which convert everything into UTF-8 before creating the index and use UTF-8 as the basic output coding when displaying the results.
(2) A reasonably selective language filter for the respective language must exist to eliminate unwanted web pages, either as part of the index-building strategy (e.g. Yandex for Russian and other Cyrillic languages) or via the user interface (Google, AllTheWeb, AltaVista).
(3) A reasonable number of results must be shown (typically several hundred), so that medium-frequency words and expressions in the respective language can also be looked for (Google, AllTheWeb, Yandex).
(4) The search results page must display at least a one-line context of the expression searched for (Google, Yandex).
(5) For languages with a rich morphology, a morphological analyzer/generator should be part of the user interface to enable searching for a word or expression in all its forms (Yandex for Russian). This can be partially supplemented by a regular-expression search facility (AltaVista).
In any case, the user of the web as a corpus must be aware of the specific nature of this resource when compared to a "normal" corpus: the data is very noisy and unstable (web pages appear and disappear in an unpredictable way), there are typically no word-list or lexical-statistics operations available, and there is no control over the representativeness and register of the language data. On the other hand, this need not be a great problem when, e.g., low-frequency lexical phenomena are sought while compiling a dictionary of a language with no large-scale corpus available, such as Slovak.
The article will present examples of "lexical evidence mining" for rare Slovak words, as well as estimates of the size of the web subcorpora and of the "recall" and "precision" values for some TELRI languages.
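A minimal sketch of the encoding-normalisation step discussed above, assuming a pipeline that converts raw page bytes to UTF-8 before indexing; the candidate encoding list follows the "Latin-2" example in the abstract, and the ordering (strict UTF-8 first) is an assumption, not the strategy of any particular search engine:

```python
# Minimal sketch (assumption): normalise raw web-page bytes for a "Latin-2"
# language to UTF-8 before indexing, trying the encodings named in the abstract.
CANDIDATE_ENCODINGS = ["utf-8", "cp1250", "iso-8859-2"]  # UTF-8, Win-1250, ISO-8859-2

def to_utf8(raw: bytes) -> str:
    """Return the page text decoded with the first encoding that succeeds."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode permissively so the page is not lost entirely.
    return raw.decode("utf-8", errors="replace")

print(to_utf8("čaša, škola".encode("cp1250")))   # -> čaša, škola
```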
F. Čermák, A. Klégr Modality in Czech and English: Possibility Particles and the Conditional Mood in a Parallel Corpus
Summary: The paper examines two kinds of modality exponents and their interlingual relationships, using an aligned parallel minicorpus of two contemporary Czech originals (a drama and a novel) and their English translations. It focuses on the four most frequent Czech adverbial particles of possibility/approximation (snad, možná, asi, nejspíše) and the Czech conditional mood marker by in the texts, and on their equivalents. It contrasts the findings with the equivalents given in the latest and largest Czech-English dictionary. The results confirm that in either case the lexicographic description is insufficient both in the range of equivalents offered and in their respective representativeness.
Montserrat Arevalo, Montserrat Civit, and Maria Antonia Marti MICE: a module for NERC in Spanish
Named Entity recognition and classification (NERC) is a core problem in IR and IE technologies. In this paper we present MICE, a system for NERC based on syntactic as well as semantic information. MICE is a module in a pipeline process for corpus processing and annotation developed by the CLiC-TALP groups (University of Barcelona and Universitat Politècnica de Catalunya). This module runs after the morphosyntactic tagger, once proper names (strong NEs) have been identified, and before the syntactic chunker. We have defined two types of NE: strong and weak NEs. Strong NEs include only proper names written with capital letters; weak NEs are defined in terms of syntactic and semantic characteristics. We distinguish between simple and complex weak NEs. Complex NEs include coordination of one or more constituents and some kinds of subordinate complements, such as relative clauses. Weak NEs have specific syntactic patterns and all of them include at least one trigger word. Trigger words carry the semantic as well as the morphosyntactic (POS) information of the whole NE. Semantic information is expressed in terms of a set of types, compatible with the MUC classification, where each type is associated with a set of trigger words. The MICE module processes the corpus, identifying the nominal phrases containing a trigger word and assigning its type and POS to the whole nominal phrase. This process is carried out by means of a context-sensitive chunk grammar. Up to now we have only dealt with strong NEs and simple weak NEs. Future work will focus on developing the module that will deal with complex weak NEs. To do so, we need to develop a treebank with syntactic as well as semantic information.
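As an illustration of the trigger-word mechanism described above, the following sketch assigns a MUC-style type to a noun-phrase chunk when it contains a trigger word; the trigger lexicon, type labels and chunk representation are invented for illustration and are not the CLiC-TALP resources:

```python
# Minimal sketch (assumption): type a weak NE via a trigger word found inside
# an NP chunk. The toy trigger lexicon below is not the MICE trigger lexicon.
TRIGGERS = {
    "universidad": "ORGANIZATION",
    "ministerio": "ORGANIZATION",
    "río": "LOCATION",
}

def classify_np(np_tokens):
    """Return (type, NP) if some token is a trigger word, else None."""
    for tok in np_tokens:
        ne_type = TRIGGERS.get(tok.lower())
        if ne_type:
            return ne_type, " ".join(np_tokens)
    return None

print(classify_np(["la", "Universidad", "de", "Barcelona"]))
# ('ORGANIZATION', 'la Universidad de Barcelona')
```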
Tomaz Erjavec An Experiment in Automatic Bi-Lingual Lexicon Construction from a Parallel Corpus
The IJS-ELAN corpus (Erjavec 2002) contains 1 million words of annotated parallel Slovene-English texts. The corpus is sentence aligned and both languages are word-tagged with context-disambiguated morphosyntactic descriptions and lemmas. In the talk we discuss an experiment in automatic bi-lingual lexicon extraction from this corpus. Extracting such lexica is one of the prime uses of parallel corpora, as manual construction is an extremely time-consuming process, yet the resource is invaluable for lexicographers, terminologists and translators, as well as for machine translation systems. For the experiment we used two statistics-based programs for the automatic extraction of bi-lingual lexicons from parallel corpora: the Twente software (Hiemstra, 1998) and the PWA system (Tiedemann, 1998). We compare the two programs in terms of availability, ease of use and the type and quality of the results. We experimented with several different choices of input to the programs, using varying amounts of linguistic information. We compared extraction using the word-forms from the corpus with extraction using lemmas: the latter normalises the input and abstracts away from the rich inflections of Slovene. Following the lead of Tufis and Barbu (2001) we also restricted the translation lexicon to lexical items of the same part of speech, i.e. we made the assumption that a noun is always translated as a noun, a verb as a verb, etc. This again reduces the search space for the algorithms and could thus lead to superior results. Finally, we experimented with taking the whole corpus as input, and opposed this to processing the corpus components separately. The reasoning here is that different components are likely to contain distinct senses of polysemous words, which will be translated into different target words. For such words there would therefore be no benefit in amalgamating different texts, while the final precision might in fact be lower. Preliminary results show that the precision of the extracted translation lexicon is much improved by using lemmas with an identical part of speech in the source and target languages; this argues in favour of linguistic pre-processing of the corpus. However, the recall of the system tends to be lower, as it misses out on conversion translations. In the conclusion we discuss this and other findings, as well as current results on extracting translation equivalents of collocations.
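The following sketch illustrates one common statistics-based approach of the kind used by such tools: scoring candidate (lemma, POS) pairs from aligned sentences with the Dice coefficient and applying the same-part-of-speech restriction. It is a toy illustration, not the algorithm of the Twente or PWA systems:

```python
# Minimal sketch (assumption): Dice-scored translation candidates from a
# sentence-aligned corpus of (lemma, POS) tokens, keeping only same-POS pairs.
from collections import Counter

def dice_lexicon(aligned, min_pairs=1):
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned:
        for s in set(src_sent):
            src_freq[s] += 1
        for t in set(tgt_sent):
            tgt_freq[t] += 1
        for s in set(src_sent):
            for t in set(tgt_sent):
                if s[1] == t[1]:          # same-POS restriction
                    pair_freq[(s, t)] += 1
    scores = {
        (s, t): 2 * f / (src_freq[s] + tgt_freq[t])
        for (s, t), f in pair_freq.items() if f >= min_pairs
    }
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [([("korpus", "N"), ("velik", "A")], [("corpus", "N"), ("large", "A")]),
          ([("korpus", "N")], [("corpus", "N")])]
print(dice_lexicon(corpus)[:2])
```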
References
Erjavec, T. (2002). The IJS-ELAN Slovene-English Parallel Corpus. International Journal of Corpus Linguistics, 7/1 (in print). http://nl.ijs.si/elan/
Kata Gabor Making correspondences between morphosyntactic and semantic patterns
The project my paper describes aims at extracting information from a 1,500,000-word corpus of Hungarian business news by matching semantic patterns to the input sentence, yielding an XML-tagged output with a detailed description of its semantic structure. The corpus is composed of short business news items that contain one or two sentences. Semantic tagging implies the identification of the so-called 'main event' of the sentence, which is most frequently represented by the predicate of the main clause, and the 'participants' of the event, which take the form of complements of the predicate. A semantic pattern consists of the event type and a set of the corresponding participants and their roles in the event. The input text is first subjected to a morphological analysis and a shallow syntactic parsing that defines sentence and clause boundaries and labels word sequences as VPs, NPs, APs etc. The task of finding the main event and labelling argument phrases as participants and circumstances is performed by an intermediate rule-based module. Rules can refer to morphosyntactic information and output semantic tags. The main difficulties arising while transforming morphosyntactic information into semantic information are reference and coreference relations, homonymy, and the different syntactic behaviour of words and phrases belonging to the same semantic pattern. As a solution to these problems, an extensive lexical database of syntactic patterns describing verbal and nominal argument structures is integrated into the syntactic parsing module. The database contains 11,000 argument structures of the 3,500 most frequent Hungarian verbs and 11,500 argument structures for 9,000 nouns from general vocabulary and business terminology. Lexical entries of argument structure patterns contain a detailed morphosyntactic and semantic description of the arguments. These patterns are associated with one or more meanings of the lemma. However, the problem of homonymy is reduced to a minimum, since ambiguity is far less frequent between argument structure patterns than between lemmas. The syntactic analysis module looks up and labels complements of the predicate and of noun phrases, and transmits all information to the intermediate module. Matching of event-type patterns is then performed among predicates and their arguments. Other phrases that do not match any complement in the lexical entry are treated as optional adjuncts that typically represent not participants of events but circumstances. Argument structure patterns, besides their usefulness in disambiguation, offer the advantage of facilitating the task of making correspondences between syntactic and semantic arguments by restricting the number of phrases to be investigated. They also make it easier to write generalising rules over syntactic arguments, since the syntactic analysis module provides these arguments with a special link pointing to the governing word.
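A toy sketch of the kind of frame matching the intermediate module performs: a chunked clause is matched against an argument-structure entry and its complements are mapped to participant roles. The verb frame, case labels and role names are invented for illustration and do not come from the project's lexical database:

```python
# Minimal sketch (assumption): map case-marked complements of a verb to
# participant roles via a toy argument-structure lexicon ('elad' = 'sell').
FRAMES = {
    ("elad", frozenset({"NOM", "ACC", "DAT"})): {
        "event": "SALE", "NOM": "seller", "ACC": "goods", "DAT": "buyer"},
}

def label_clause(verb, complements):
    """complements: dict case -> NP string; returns event type and roles."""
    frame = FRAMES.get((verb, frozenset(complements)))
    if frame is None:
        return None
    roles = {frame[case]: np_text for case, np_text in complements.items()}
    return {"event": frame["event"], "participants": roles}

print(label_clause("elad", {"NOM": "a cég", "ACC": "a részleget", "DAT": "a banknak"}))
```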
Marko Grobelnik and Dunja Mladenic Efficient Visualization of Large Text Corpora
Visualization is one of the important ways of dealing with large amounts of textual data. Text visualization techniques are most frequently applied when one needs to understand or explain the structure and nature of a large quantity of typically unlabelled and poorly structured textual data in the form of documents. The usual approach is first to transform the text data into some form of high-dimensional data and then, in a second step, to carry out some kind of dimensionality reduction down to two or three dimensions, which allows the data to be visualized graphically. There are several (but not too many) approaches and techniques offering different insights into the text data, such as: showing the similarity structure of documents in the corpora (e.g. WebSOM, ThemeScape), showing the time line or topic development through time in the corpora (e.g. ThemeRiver), or showing frequent words and phrases and the relationships between them (Pajek). One of the most important issues when dealing with visualization techniques is the scalability of the approach, to enable processing of very large amounts of data. In this paper, our contributions are two procedures for text visualization working in linear time and space complexity. The first procedure is a combination of the K-Means clustering procedure and a technique for nice graph drawing. The idea is first to build a certain number of document clusters (with the K-Means procedure), which are in a second step transformed into a graph structure where more similar clusters are connected and bound more tightly. The third step performs a kind of multidimensional scaling by drawing the graph aesthetically. Each node in the graph represents a set of similar documents, represented by the most relevant and distinguishing keywords denoting the topic of the documents. The second procedure performs a hierarchical K-Means clustering, producing a hierarchy of document clusters. In the next step the hierarchy is drawn into a two-dimensional area split according to the hierarchy splits. As in the first approach, each cluster (group of documents) in the hierarchy is represented by the set of the most relevant keywords. Both approaches will be demonstrated on a number of examples, visualizing e.g. the Reuters text corpora (over 800k documents) and various web sites.
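A minimal sketch of the first procedure as described above, under the assumption that scikit-learn's K-Means and cosine similarity are acceptable stand-ins for the authors' own implementation; clusters whose centroids are sufficiently similar are connected into a graph that a force-directed layout could then draw:

```python
# Minimal sketch (assumption): cluster documents with K-Means, then connect
# similar cluster centroids into a graph for later aesthetic drawing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

docs = ["stock markets fell", "markets rally on earnings",
        "new corpus of Slovene", "parallel corpus alignment"]
X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

sim = cosine_similarity(km.cluster_centers_)
edges = [(i, j, sim[i, j]) for i in range(len(sim)) for j in range(i + 1, len(sim))
         if sim[i, j] > 0.1]          # keep only sufficiently similar clusters
print(km.labels_, edges)
```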
Primož Jakopin Extraction of lemmas from a web index wordlist
The paper deals with a large list of words and their frequencies, as obtained from the main Slovenian web index NAJDI.SI (http://www.najdi.si). The list (March 2002) contains 7,591,414 units with a total frequency of 578,745,747, obtained from 1,447,602 web pages in which 33 languages could be identified (Slovenian 920,215 pages, English 493,894, German 12,730, Croatian 4,892, Serbian 2,625, Italian 2,530, French 2,063, Russian 1,851, Spanish 1,084, Hungarian 848, Romanian 606, Polish 582, Danish 580, Finnish 547, Czech 499, Portuguese 471, Japanese 383, Latin 305, Dutch 248, Slovak 181, Swedish 161, Bosnian 147, Norwegian 82, Bulgarian 20, Albanian 18, Korean 17, Ukrainian 10, Icelandic 4, Arabic 3, Macedonian 3, Chinese 1, Greek 1 and Thai 1). As expected, the wordlist is very varied; hapax legomena amount to 49.3%, but it is nevertheless probably the most complete source of neologisms in Slovenian. As it was not possible to use the context of the words (the entire index was not available) or to check the list manually (because of the size of the list and the fact that stemming is used during the internet search), an algorithm and an associated software utility have been devised which separate Slovenian words in the list from other units (nonwords), check for noise and assign lemmas to wordforms. The algorithm is based on inflection rules for Slovenian nouns, verbs and adjectives from the Dictionary of Standard Slovenian (SSKJ), on word frequencies from the 80-million-word text corpus Nova beseda, where the wordlist has been manually inspected and corrected, and on word-tag frequencies from a 1-million-word subcorpus which has been POS tagged. In the paper preliminary results obtained by applying the algorithm are presented. An illustration is given in Table 1, which shows the lemmas beginning with besed- (word- in English) from the NAJDI.SI wordlist:
*beseda 125829 besedilnooblikovalen 2 *besediti 5 besedoljub 4 *besedar 2 besedilnooblikoven 6 besedivka 1 besedoljubiteljski 1 besedaren 1 besedilnoorganizacijski 2 *besedje 332 *besedolomen 1 besedati 2 besedilnoskladenjski 2 besedko 1 besedoslovec 1 besedca 3 besedilnost 35 *besednica 4 besedosloven 24 *beseden 8081 besedilnotipski 9 besedničar 1 *besedoslovje 106 besedenje 3 besedilnovrsten 6 *besednik 33 besedospreminjevalen 2 besedeslovje 1 *besedilo 151994 besednjačenja 3 besedotvorec 3 *besedica 1248 besedilodajalec 4 *besednjak 1872 *besedotvoren 426 *besedičenje 114 besedilopisec 7 besednjakov 32 *besedotvorje 382 *besedičiti 34 besedilopisen 2 besednjaški 1 besedotvorno 93 besedijana 6 besedilopisje 1 besednomotoričen 2 besedotvornopomenski 3 besedijski 2 besedilopiska 1 besednooblikovalski 2 besedotvornozgodovinski 2 besedika 1 besediloslovec 2 besednopomenski 8 besedovadba 4 besedilce 37 besedilosloven 94 besednoreden 20 *besedovalec 4 *besedilen 2054 *besediloslovje 207 besednoskladenjski 2 besedovalen 5 besediljenje 8 besedilotvorec 4 *besednost 4 *besedovanje 77 besedilnik 16 besedilotvoren 26 besednotvoren 2 *besedovati 23 besedilnoanalitičen 1 besedin 5 besednoumetniški 3 besedovec 4 besedilnoanalitski 1 *besedišče 1093 besednoumetnosten 10 besedovrsten 1 besedilnogradivski 2 besediščen 19 *besednovrsten 80 besedozvezen 2 besedilnolingvističen 1 besediše 13 besednozvezen 36 *besedun 2
Table 1: 88 lemmas of wordforms beginning with besed- from the NAJDI.SI index (* = also in SSKJ)
There were 501 such wordforms, of which 221 were noise; from the 280 remaining wordforms the 88 lemmas in the table were obtained.
25 lemmas, marked with an asterisk, can also be found in the Dictionary of Standard Slovenian; the other 63 are new words.
Svetla Koeva The structure of hyperonym relations
Hyperonymy and hyponymy are inverse, asymmetric and transitive relations, which correspond to the notion of class inclusion: if W1 is a kind of W2, then W2 is a hyperonym of W1 and W1 is a hyponym of W2. The relation implies that the hyperonym may substitute for the hyponym in a context, but not the other way around. All phonetic strings expressing the same semantic meaning are defined as a word W. The semantic meaning is considered as a set of semantic components with no specification of their number and features. A word can have multiple co-hyponyms. Co-hyponyms have to inherit the same set of semantic components from their immediate hyperonym. In WordNet, multiple co-hyperonyms have occasionally been encoded. In the English database approximately 0.5 percent of words receive two hyperonyms (never more); in our assessment only the conjunction operation between them is considered. It could be efficient to encode multiple hyperonymy relations more comprehensively. In every hyperonymy/hyponymy relation n hyperonyms could appear, where n ≥ 1. If n = 1, the set of semantic components of the hyperonym is a proper subset of the semantic components of the hyponym. If n = 2, there are two options: union or intersection of the hyperonyms. In the first option, the union of the semantic components of two hyperonyms W1 and W2 is inherited by their immediate hyponym W0, which means that W0 also inherits the union of the higher hyperonyms W11 and W22, etc. In the second option, the hyperonyms W1 and W2 have an intersection which is equal to their common immediate hyperonym W. The hyponym W0 inherits the semantic components from either W1 or W2 (disjunction is applied) and thus from the higher W. Consequently, we accept nodes in the structure that are not lexicalized but constructed via implication. There are different combinations of hyperonyms in terms of union and intersection if n ≥ 3. A single hyperonym in a node could appear only once in the structure. If a hyperonym shares the node with other hyperonyms, it could appear more than once at different levels. The result of such an approach should be the avoidance of some artificial hierarchies between words, while the correspondence with the WordNet structure would remain. The hierarchical structure should therefore be tested against a corpus or by some task, to verify its quality.
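The two options for n = 2 can be illustrated with plain set operations, assuming toy semantic components (the abstract itself leaves the components unspecified):

```python
# Minimal sketch (assumption): illustrate the two options for a hyponym with
# two hyperonyms, using Python sets for invented semantic components.
W1 = {"entity", "artifact", "vehicle"}
W2 = {"entity", "artifact", "toy"}

# Option 1: the hyponym W0 inherits the UNION of its two hyperonyms' components.
W0_union = W1 | W2

# Option 2: the hyperonyms share only their INTERSECTION, which equals their
# common immediate hyperonym W; the hyponym inherits from W1 or W2 (disjunction).
W_common = W1 & W2

print(W0_union)   # {'entity', 'artifact', 'vehicle', 'toy'}
print(W_common)   # {'entity', 'artifact'}
```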
Cvetana Krstev and Duško Vitas Multilingual concordances using INTEX
Intex (Silberztein, 1993) is a flexible environment for the development of linguistic resources and tools. The user of Intex can develop his own applications using functions incorporated in the system, the available lexical resources, such as a system of electronic dictionaries for simple and compound words, and a graphical interface for the construction of finite transducers. The paper describes the procedures involved in the development of aligned concordances for a text (source) and its translation (target). They are produced on the basis of texts that were already processed with Intex independently of one another (Krstev, 1994; Vitas 2002). The main aim of this process is to identify lexical elements that are translated 'literally' from the source language to the target language, using concordances of both texts. The role of these elements in newspaper texts was described in depth in (Krstev, 2001). Once these elements are identified, transducers can be constructed that generate texts encoded with XML-like tags, as well as auxiliary files containing pointers to such tags (Gross, 1997). The paper describes an application that is being developed for texts tagged in the abovementioned way, which attaches to the generated concordances of a text the 'corresponding context' in the target language. The 'corresponding context' is defined as a segment between XML-like tags with comparable attributes. The method will be illustrated with aligned texts of Plato's Republic, Voltaire's Candide, Flaubert's Bouvard et Pécuchet, Verne's Le tour du monde en quatre-vingts jours, and a sample of texts from the monthly "Le Monde diplomatique".
Silberztein, M. D. (2001): INTEX, (http://www.bestweb.net/~intex/downloads/Manuel.pdf)
Kristīne Levāne & Normunds Grūzītis Automatic Text Mark-up Facilities for Building the Latvian Literature Corpus
The Latvian Corpus at the Artificial Intelligence Laboratory of IMSC covers ca 30 million running words; ca 3.5 million running words are in the Latvian literature corpus, which is the part of the corpus freely accessible on the web. This part is not copyright protected, and the corpus of the classics is interesting both for academic users and others. At the moment only simple navigation possibilities are provided on the web, so the main task of this project is to facilitate the use of the literature corpus. The experience gained serves as a basis for the other software tools of the Latvian Corpus, which are under development. The development of the Latvian literature corpus software tools has been under way since February 2002. The conception, requirements and desired tasks have been settled. So far we have had no common text structure standards, and the content of the corpus was HTML tagged. First, a structure conception and standards based on XML technologies were created. Second, software tools and methods for the automatic transformation of the present corpus into the newly built tagging system were developed. The presentation will deal with solutions and issues concerning this process. DTD grammars have been created for each Latvian literature genre (poetry, drama and prose). The first DTD was made for poetry, because this genre is the most complicated. Different collections of poetry were examined, the aim being to combine all the features in one grammar. In order to detect automatization problems, a tagging tool was developed. For the drama DTD, the grammar by J. Bosak (http://www.ibiblio.org/bosak) for Shakespeare's plays was used, which is a widely used example of drama structuring. The grammar for prose is relatively simpler. The current results of the literature corpus transformation are available at www.ailab.lv/users/normundsg. The next stage of the project is to create the whole corpus system and to develop software tools (navigation, concordance, statistics, and search) for end-users. A web interface will be provided, giving the possibility to address a wider audience and providing for the effective further development of the literature corpus.
Michaela Mahlberg The textlinguistic dimension of corpus linguistics
One of the major achievements of corpus linguistics is that it stressed the necessity of revising the widely accepted ideas of lexis and grammar. The established separation of lexis and grammar is just an illusion that is destroyed as soon as natural language is looked at. New corpus linguistic models have led to a completely new way of describing (the English) language (e.g. Sinclair 1999a, Hunston & Francis 1999). But the potential of corpora can take us a step further. There is a dimension to corpus linguistics which has not received enough attention so far: the 'textlinguistic dimension'. If we look at a text as a communicative unit, the meanings of words in a given text can comprise more than what is normally listed in dictionaries. Functions such as giving emphasis or expressing attitudes and feelings can be part of the meaning of words in text. Corpus data suggests that there are groups of words which tend to share certain textlinguistic functions that contribute to the meanings of these words. General nouns like thing, way, man, or move form one of these groups. Among the functions that characterise these nouns we find the 'support function'. A general noun fulfils the support function if it occurs in a construction where it does not contribute much meaning by itself, but helps to represent information according to the communicative needs of the speaker/writer and hearer/reader. The support by a general noun can create various effects. For instance, a general noun can help to structure a sentence according to the information principle, as in: The man who played that part was Norman Lumsden, and [...] (BNC). Here, the clause begins with the general noun man, whose postmodifier refers back to given information which is then supplemented by new information towards the end so that the information load increases gradually. In other cases, the support by a general noun can be interpreted as an economic or effective way of packing information. The general noun way can, for example, introduce both finite and non-finite postmodifying structures into clauses (way in which/of/to ...) and thereby contribute to "the flexibility and extendibility of the syntax" (Sinclair 1999b: 169), as in: The way in which specialist health services for the elderly are provided nationally varies considerably (BNC). The concept of the support function results from the interpretation of corpus data from the BNC and the Bank of English.
Rūta Marcinkevičienė and Vidas Daudaravičius Detection of the boundaries of collocation
There are methods to detect and extract collocations from a text, such as mutual information, which helps to identify two words occurring in conjunction. Nevertheless, this method does not work for longer collocations. In the case of a multiword unit it is hard to detect the exact boundaries of a collocation, even if it has clear-cut boundaries and is not fuzzy at the edges. A method for detecting the boundaries of collocations in large corpora is suggested. A collocation is assumed to consist of a sequence of co-occurring words. It is detected according to high-frequency word pairs that form the collocation itself or, in the case of longer collocations, part of it. The boundaries of a particular collocation are then detected by measuring the variety of possible contextual partners to the left and to the right of the collocation. Low variety signals the continuation of the same collocation, while high variety of contextual partners is the sign of a collocation boundary.
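A minimal sketch of the boundary-detection idea, assuming a simple token list and an illustrative variety threshold; the real method works over large corpora and presumably uses frequency-weighted measures:

```python
# Minimal sketch (assumption): extend a candidate collocation to the right as
# long as the variety of right-hand neighbours stays low; high variety marks
# the boundary. The threshold and the toy text are illustrative only.
from collections import defaultdict

def right_neighbours(tokens, phrase):
    """Collect the words that follow each occurrence of `phrase` in `tokens`."""
    n, seen = len(phrase), defaultdict(int)
    for i in range(len(tokens) - n):
        if tokens[i:i + n] == list(phrase):
            seen[tokens[i + n]] += 1
    return seen

def extend_right(tokens, phrase, max_variety=1):
    phrase = list(phrase)
    while True:
        neigh = right_neighbours(tokens, phrase)
        if not neigh or len(neigh) > max_variety:   # high variety = boundary
            return tuple(phrase)
        # low variety: the most frequent neighbour continues the collocation
        phrase.append(max(neigh, key=neigh.get))

text = "the prime minister said the prime minister of latvia met the prime minister".split()
print(extend_right(text, ("prime",)))   # -> ('prime', 'minister')
```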
Soledad Garcia Martinez and Anna Fagan Academic Conflict in Research Articles: A Cross-Disciplinary Study of Chemistry and Tourism Articles
The aim of this communication is to present the results of a research project carried out in the Faculty of Modern Languages at the University of La Laguna (Tenerife) on the way criticism is presented in research articles to the scientific community. In today's competitive academic world the pressure to publish is continually increasing, and, in order to justify publication of their research articles (RAs), writers must create a research space which permits them to present their new claims to the other members of the academic community. This mainly implies the indication of a knowledge gap and/or the criticism of a weak point in work previously published by other researchers or by the academic community itself. The latter phenomenon has been termed academic conflict (AC), a critical speech act whose rhetorical expression ranges from blunt criticism to the use of subtle hedging devices, aimed at an individual or at the community in general. The study of citation practices across the disciplines carried out by the members of our project has revealed a dichotomy between the so-called "hard" and "soft" sciences; thus there may also be significant interdisciplinary differences in the rhetorical strategies used to express AC and in the frequency of the critical speech act itself. In this study we discuss the development of the taxonomy we have created to describe the rhetorical choices writers make when performing the critical speech act, and the application of this taxonomy to 50 RAs from two distinct disciplines: Tourism, representing the soft disciplines, and Chemistry, the hard disciplines. The application of this taxonomy, which categorises AC according to directness, writer mediation, and the target of the criticism, has yielded some surprising results. These findings may indicate, inter alia, that a more delicate taxonomy should be applied to the study of AC.
Goran Nenadic, Irena Spasic, and Sophia Ananiadou What Can Be Learnt and Acquired from Non-disambiguated Corpora: A Case Study in Serbian
Every NLP system needs to incorporate a certain amount of relevant linguistic knowledge acquired from theory and/or corpora. One of the main challenges is the efficient customisation of such systems to a new task or domain by automatic learning and acquisition of specific constraints [5, 9]. In this paper we discuss possible approaches to learning various lexical and grammatical features from non-disambiguated corpora in a morphologically rich language such as Serbian. While reliable tagging tools for such languages are lacking [6, 8], electronic texts are widely available, and we therefore concentrate on learning from initially tagged [7] but non-disambiguated text. We present three case studies based on the computation of a minimal representation (i.e. intersection) of features from non-disambiguated corpora. Each case concentrates on learning a different type of linguistic information. In the first case, we used a genetic algorithm approach [4] to learn the cases required by a specific preposition. We computed the minimal set of cases for each preposition so that every corresponding (non-disambiguated) NP from the learning corpus keeps at least one case from the set (Figure 1). The results coincide with the corresponding (theoretical) grammars, thus proving that this feature can be learnt from corpora. Further, the learning method is unsupervised, as no prior knowledge has to be provided. In the second case, we used a general NP structure and obligatory agreements between NP constituents [2] to learn structures for specific named entities (namely names of companies and of educational and governmental institutions). The initial set of entities was identified by using specific designators (e.g. 'preduzece' (Eng. company)) as anchors [3]. Then, we computed a minimal set of lexical and morpho-syntactic features that were inherent in every NP from the set, producing lexicalised local grammars [1] that describe the structure of specific types of named entities. Finally, we used particles (e.g. 'kao' (Eng. like)) as anchors to learn frozen multiword adverbial expressions (e.g. 'kao grom iz vedra neba' (Eng. surprisingly)). Simple expressions like 'kao NP' (e.g. 'kao konj' (Eng. hardly)) were not considered. The remaining expressions are "minimised" by conflating some grammatical features (e.g. pronouns in 'kao da su PRON:dative sve ladje potonule' (Eng. disappointedly)). As these studies show, some basic grammatical constraints (like cases) and specific lexical preferences (like lexicalised NP structures and multiword adverbials) can be learnt automatically even in a morphologically rich language. However, although the precision of grammar-related constraints is promising, broader coverage of lexical learning remains a challenge.
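A sketch of the first case study's objective, using a greedy set-cover approximation in place of the genetic algorithm used in the paper: find a small set of cases such that every ambiguous NP observed after a given preposition retains at least one of them. The candidate case sets below are invented:

```python
# Minimal sketch (assumption): greedy approximation of the smallest case set
# covering every non-disambiguated NP observed after a preposition.
def minimal_case_set(occurrences):
    """occurrences: list of sets of candidate cases for the NP after a preposition."""
    uncovered, chosen = list(occurrences), set()
    while uncovered:
        # pick the case that covers the most still-uncovered occurrences
        candidates = sorted({c for occ in uncovered for c in occ})
        best = max(candidates, key=lambda c: sum(c in occ for occ in uncovered))
        chosen.add(best)
        uncovered = [occ for occ in uncovered if best not in occ]
    return chosen

# toy ambiguity sets for NPs observed after the preposition 'u' (Eng. in)
print(minimal_case_set([{"acc", "loc"}, {"loc", "dat"}, {"acc"}]))  # e.g. {'acc', 'dat'}
```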
Julia Pajzs The corpus based comparison of the meaning of the word loyal in English and lojális in Hungarian
Several years ago, in a particular communicative situation, I suddenly realised that there is a difference between the meaning of the word loyal in English and that of the corresponding word lojális in Hungarian. While its core meaning in English, according to the OALD 2002, is "remaining faithful to sb/sth and supporting them or it: a loyal friend / supporter; She has always remained loyal to her political principles", in Hungarian the meaning, despite many similarities, is surprisingly different: "Valamely politikai rendszerhez, ill. államhoz hű..." ('faithful to a given political system or state').
The corpus-based analysis of the most frequent collocates of these words will certainly help in identifying the underlying similarities and differences in their usage. This examination can serve as a model for similar studies on the parallel investigation of languages and cultures.
As a candidate for joining the EU, Croatia faces a challenging task: translating the Acquis Communautaire, an extensive body of legislation comprising approximately 150,000 pages. To speed up this process and to increase the consistency of translation, we have developed a tool to suit the needs of translators. The input consists of the original documents, converted into XML format, which is the standard accepted by the corpus linguistics community today. The EUROVOC glossary (the official EU legislative terminology lexicon translated into Croatian, Bratanić (2000, 2001)) is also converted to XML and stored as a separate document. The tool searches the source English document, identifies the English terms existing in EUROVOC, marks these terms in the original document and offers the established Croatian translation equivalents. The processing is based on traversing XML documents with extensive use of the XML Document Object Model, which provides a range of possibilities for different output formats. The standard output is an HTML document, one of the most widely used formats today and easily readable on any platform, with terms marked and their Croatian TEs available at the user's request. Trial processing was carried out on a sample document, namely the Stabilization and Association Agreement between the EU and Croatia. The authors argue that this tool provides a method for significantly increasing the consistency of translations (the number of translators engaged by the Ministry of European Integration of the Republic of Croatia exceeds 100) and for reducing the time human translators need to fulfil the task. The tool will also simplify the second phase of the translation process, the revision of the translated documents. Furthermore, documents with terms marked can also be used in any other type of terminological research.
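A minimal sketch of the term-marking step, assuming a toy two-entry glossary and plain-string matching; the actual tool traverses the XML documents through the DOM and handles the full EUROVOC lexicon:

```python
# Minimal sketch (assumption): mark EUROVOC terms found in an English sentence
# and attach the established Croatian equivalent, producing HTML-like output.
EUROVOC = {
    "customs union": "carinska unija",
    "free trade": "slobodna trgovina",
}

def mark_terms(sentence: str) -> str:
    marked = sentence
    for en_term, hr_term in EUROVOC.items():
        marked = marked.replace(
            en_term, f'<span class="term" title="{hr_term}">{en_term}</span>')
    return marked

print(mark_terms("The agreement establishes a customs union and free trade."))
```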
The apparent paradox that I would like to explore is that (a) language interpreted as a formal system cannot account for the creation of meaning.
Assuming that these statements can be supported, we must look for the ways in which language creates meaning within itself, but not through its organisation as a formal system.
The clues are to be found in the evidence provided by corpora. Relatively independent items form meaningful units by coselection; frequent collocation adds meaning through "contagion". Individuals compare meanings through averral, from which truth value is derived. Meanings are related to each other inside the language system through paraphrase, which is the non-formal process that allows language to retain aspects of a formal system without submitting to the full rigours of it.
In most cases, when information is extracted from large corpora, the units that are searched for belong to the category of nominals: proper names, common nouns and noun groups are distinguished and interpreted linguistically. In my work, the target of exploration and formal description is language constructs in Bulgarian that are identifiable as verb complexes. They form the first layer of meaningful patterns within the group of the predicate.
The modeling of the sentential structure is performed in the setting of the BulTreeBank project [Simov et al. 2002a], where relations are defined and an interface is to be established between the formal representation necessary for large-coverage computing techniques such as chunk parsing and the sophisticated HPSG-conformant [Pollard, Sag 1994] linguistic descriptions attached to the sentences in the treebank. The segmentation of the verb complex into reliable patterns is based on the philosophy of easy-first parsing outlined by Abney in [Abney 1991] and [Abney 1996]. The parsing technique uses reliable patterns consisting of categories and regular expressions that enter finite-state automata operating in a so-called cascade, that is, a sequence of levels of phrase recognition. The regular grammar cascade for the verb complex consists of two subsequent levels of phrase recognition, where on the basis of the smaller segments recognised at the first level, bigger segments are defined at the second level by the application of corresponding groups of pattern-matching rules. The regular grammar engine used is part of the software environment provided by the CLaRK system [Simov et al. 2001].
The sentence elements that immediately surround the verb and form the first layer of meaningful patterns fall into two main groupings: 1) elements that are generally considered pronominal clitics; 2) auxiliary verb forms and functional words. The interdependence between the very rich tense and mood paradigm of Bulgarian verbs and the idiosyncrasies of clitic behaviour leads to verb complexes with different numbers and types of elements, which occur in different combinations and generate a variety of semantic connotations.
The patterns in the verb complex are defined in such a way as to be compatible with a semantic model of the Bulgarian temporal system developed by Gerdzhikov [Gerdzhikov 1999] where four types of tenses are distinguished: 1) non-relative, non-perfect; 2) relative, non-perfect; 3) non-relative, perfect; 4) relative, perfect.
In this way the segmentation of the verb complex at the level of chunk parsing is interfaced with the feature structure descriptions of the sentences in the treebank of Bulgarian, where syntactic and semantic information is incorporated.
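A toy sketch of such a two-level cascade over a POS-tagged string: the first level groups clitics and auxiliaries, the second recognises the verb complex. The tagset, patterns and example are simplified inventions, not the actual BulTreeBank/CLaRK grammars:

```python
# Minimal sketch (assumption): two-level cascade of regular patterns over a
# token/TAG string; level 1 groups clitics and auxiliaries, level 2 builds the
# verb complex. Tags and the transliterated example are invented for illustration.
import re

LEVEL1 = [
    (r"(?:\S+/CLITIC\s+)+", "CL "),            # one or more pronominal clitics
    (r"(?:\S+/AUX\s+)+", "AUXG "),             # a group of auxiliary forms
]
LEVEL2 = [
    (r"(?:AUXG\s+)?(?:CL\s+)?\S+/V\b", "VC"),  # optional aux/clitics + main verb
]

def cascade(tagged: str) -> str:
    for level in (LEVEL1, LEVEL2):
        for pattern, label in level:
            tagged = re.sub(pattern, label, tagged)
    return tagged

print(cascade("az/PRON sum/AUX mu/CLITIC go/CLITIC dal/V knigata/N"))
# -> az/PRON VC knigata/N
```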
The paper investigates lexical and grammatical varieties of terminological translation equivalents. Parallel corpora provide a useful resource for identifying terms in the source language and for checking the consistency of translations of terms in target language texts where no TE variation is permitted. An XML-based search tool is applied to a sentence-aligned parallel corpus consisting of several original EU legal documents and their Croatian translations. The input consists of the EUROVOC terminology (a glossary of terms in English and their translations into Croatian, which should be used with 100% consistency). The tool compares the translation equivalents set by EUROVOC in advance with the actual varieties of translation equivalents found in the Croatian translated texts. The tool works in a sequence of comparison steps. First, input English sentences are compared with the English side of the EUROVOC glossary. After locating terms in the original English sentences, the next step is a further comparison between the corresponding Croatian sentences translated by human translators and the matching pair of terms from the glossary. Sentences where terms in the translations do not match the already established and expected term translations from the glossary are marked and left for manual examination. Differences at the lexical and grammatical level resulting from inconsistent terminological use by Croatian translators will be presented, and the typology and frequency of these varieties will be discussed. We assume this kind of corpus evidence will be a practical guide for translators in producing terminologically consistent translations where such a requirement is an absolute necessity, as in the translation of legal texts.
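A minimal sketch of the consistency check, assuming a one-entry glossary and literal string matching; the real tool works on the sentence-aligned XML corpus, and inflected Croatian forms would in practice require lemmatisation rather than literal matching:

```python
# Minimal sketch (assumption): flag aligned sentence pairs where an English
# EUROVOC term occurs but the expected Croatian equivalent is absent, leaving
# them for manual examination. The glossary entry is toy data.
GLOSSARY = {"customs union": "carinska unija"}

def check_pair(en_sentence: str, hr_sentence: str):
    problems = []
    for en_term, hr_term in GLOSSARY.items():
        if en_term in en_sentence.lower() and hr_term not in hr_sentence.lower():
            problems.append((en_term, hr_term))
    return problems   # empty list = consistent

print(check_pair("The customs union enters into force.",
                 "Carinska unija stupa na snagu."))   # [] -> consistent
```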
This paper intends to present the main lines of work in progress based on the exploration of large corpora as a source of quantitative information about language. The focus is on some problems relating to the morpho-syntactic annotation of corpora and on some statistical techniques, showing their effectiveness on a specific case study.
The work is mainly based on CORIS/CODIS, a synchronic 100-million-word corpus of contemporary written Italian developed at CILTA (University of Bologna), which is being lemmatised and annotated with part-of-speech (POS) tags. Usually the set of tags is pre-established by the linguist, who uses his/her competence to identify the different word classes. Our very first experiments revealed that the traditional part-of-speech distinctions used for Italian are often inadequate to represent the syntactic features of words in context, especially for complex classes such as adverbs, pronouns, prepositions and conjunctions.
In the literature there is wide acceptance of the distinction, mainly based on the concept of open and closed sets of words, between lexical words (content words) and grammatical words (or functional words). Thus, it is possible to postulate four main categories of words: three belonging to the set of lexical words (nouns, verbs, qualitative adjectives) and one large class that collects all the grammatical words (and also adverbs of manner). Using this distinction, a subpart of CORIS has been automatically tagged and statistical techniques have been applied to retrieve context information for some target words, obtaining a distributional fingerprint for every word considered in this study. The approach is based on the hypothesis that two syntactically and semantically different words will usually appear in different contexts and will have different fingerprints. Some adverbs of manner were chosen and different clustering techniques were applied to the corresponding fingerprints. The main tools used were hierarchical clustering and the Self-Organising Map. The clusters derived by applying these techniques suggested a clear syntactic behaviour of the adverbs considered. It emerged, as stated in various bibliographic references, that Italian adverbs of manner tend to modify either sentences or verbs and adjectives. These two syntactic schemas act as the extreme poles of a continuum in adverb-of-manner behaviour. Some adverbs prefer to modify mainly sentences, but sometimes also verbs or adjectives. Other adverbs prefer to modify mainly verbs or adjectives, and in rare cases also sentences. Moreover, the adverbs of manner that prefer to modify sentences cluster very well with a class of words that, in a previous work, have been defined as soft connectives (Tamburini et al. 2002).
The global behaviour of each adverb of manner can thus be represented as a preference in the modification of other linguistic objects, which can be expressed by probability values. This corresponds to what is required by stochastic part-of-speech taggers. The tagging procedure will assign both categories to each adverb of manner, disambiguating them using the derived probabilities. This appears to be a suitable way of managing this kind of linguistic phenomenon.
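A minimal sketch of the fingerprint-and-cluster idea, assuming invented context counts for three adverbs and using hierarchical clustering from SciPy in place of the authors' tools:

```python
# Minimal sketch (assumption): build a "distributional fingerprint" for each
# adverb from counts of coarse context categories, then cluster the fingerprints
# hierarchically. The adverbs and counts below are invented toy figures.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# rows: adverbs; columns: context counts, e.g. (modifies VERB, ADJ, whole SENTENCE)
fingerprints = np.array([
    [120,  40,   5],   # 'rapidamente'  -> mostly modifies verbs/adjectives
    [100,  55,   8],   # 'lentamente'
    [ 10,   5, 150],   # 'chiaramente'  -> mostly sentence-level use
], dtype=float)
fingerprints /= fingerprints.sum(axis=1, keepdims=True)   # normalise to proportions

Z = linkage(fingerprints, method="average", metric="cosine")
print(fcluster(Z, t=2, criterion="maxclust"))   # e.g. [1 1 2]
```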
Multilingual alignment of semantic lexicons (lexical ontologies) usually relies on some kind of language-independent conceptualization of their semantic content. In EuroWordNet and its follow-up BALKANET, such a conceptualization is called the InterLingual Index (ILI). Two meanings in two different language-specific semantic lexicons which are mapped onto the same conceptual representation are taken to be semantically equivalent or, to put it otherwise, linguistic realizations of the same concept. The usual procedure assumes that monolingual ontologies are independently mapped, according to a commonly agreed protocol, onto the interlingual index. Usually, this lexical projection is carried out by humans, and its accuracy is affected by their lexicographic experience, subjectivity and tiredness. However, the most important element that affects the consistency of the projection is the difference in granularity between a given lexical ontology and the interlingual index. EuroWordNet and BALKANET adopted as the Interlingual Index a set of unstructured concepts corresponding to the meanings explicitly recorded in WordNet 1.5, plus a few concepts lexicalized in other languages. In this case, as noted by several researchers, the sense distinctions in the Interlingual Index are too fine-grained to expect an accurate and consistent mapping of multiple language-specific ontologies. Recently, a lot of interest has arisen around the idea of so-called "soft concept clustering" of the Interlingual Index. The idea is that instead of defining crosslingual semantic equivalence on the basis of lexical projection onto the same ILI record, one should consider lexical projection onto the same cluster of ILI records. This weaker definition of crosslingual semantic equivalence is more realistic and easier to meet and operationalize in computer applications.
We propose a method and its implementation for both checking consistency of the monolingual mappings over the Interlingual Index and for pinpointing the concepts in the Interlingual Index that should be "soft-clustered". The methodology builds on our most recent results in sense clustering using automatic extraction of translation equivalents, and on the recording of human failures in consistent mapping of language specific senses onto Interlingual Index records.
The parallel corpus we used in our experiments is the "1984" corpus, based on Orwell's novel, developed in the MULTEXT-EAST project and further cleaned up in the TELRI and CONCEDE projects.
The paper will show how these resources are used in checking the consistency of the mapping onto the Interlingual Index of several lexical ontologies as built in the BALKANET project, and how this checking could provide hints for ILI soft clustering.
The paper describes a new corpus project under development at the University of Łódź. The Department of English prides itself on the quality of its students of translation; however, there is always room for improvement. With this aim in mind the author has begun work on a corpus research project which will give the departmental translator trainers a different view both of their own work and of their students' work.
Within the practical application of language corpora to second language learning, corpora can be loosely divided into three groups:
a) monolingual
b) bilingual (parallel or comparable)
c) learner
For the purposes of translation training each of these corpora has its advantages and disadvantages. The project described in this paper will utilize all three kinds of corpus in an attempt to gain a different perspective on translation and the process of translation.
The PELCRA project was set up in 1997 to produce extensive corpus resources at both a local and a national level. The project consists of a variety of corpora:
1. A Polish monolingual corpus
2. An English learner corpus
Translator trainees are free to make use of both types of corpora found in PELCRA as a whole, both as a guide to avoiding learner errors or erroneous learner tendencies and as a reference point, by using the Polish national corpus. The students also have access to the BNC and in this way have at hand two monolingual reference corpora, one for each of the languages they are working in.
Students translate from the foreign language into the mother tongue, which is generally considered the norm, and are also encouraged to attempt translation from the mother tongue into the foreign language (i.e. Polish into English). It is with the latter that the learner corpus comes into its own, becoming a useful tool and guide for the translator.
Extensive work by the PELCRA team (e.g. Leńko-Szymańska, 2000; Lewandowska-Tomaszczyk, McEnery, Leńko-Szymańska, 2000) has given our students valuable clues to dangerous areas in the production of FL texts.
Our trainees have access to a wide range of translations. However, the need for a more specialized learner corpus, one created with translation in mind, seemed apparent. The paper describes the production of a learner translation corpus which allows the analysis of errors and patterns specific to translation and to student translation.
The paper is concerned with the editing of corpora, and of tagged corpora in particular. It introduces a specialised corpus editor (the program CED) and a library for work with corpora (libkorplib.a). The CED system as a whole displays the following functions and properties:

The library libkorplib.a provides an effective interface regardless of the physical data storage, so it is possible to access data in various formats (text files, SQL databases etc.).

The tool can be used for editing any corpora, making quite complicated corrections in them, and modifying tagged corpora after adjustments caused by changes in the respective tagsets. It can also cooperate closely with external programs such as morphological analyzers or morphological databases (or other dynamic resources), from which the appropriate options for the desired changes can be selected. The tool has recently (2001-2002) been tested in the NLP Laboratory FI MU during the development of the grammatically tagged corpus DESAM: we have used it for correcting both tagging errors and errors such as split words or misprints, and also for the task of marking sentence boundaries and other aggregate changes.

One of the main purposes of the CED system is to considerably speed up the development of tagged corpora with the number of mistakes reduced to a reasonable minimum -- these expectations have been fulfilled in the course of building the corpus DESAM, and the same has now been experienced with its larger version DESAM2.
The paper presents an experiment in automatic translation from Slovenian to English based on SMT (Statistical Machine Translation). EGYPT, the result of a summer workshop at Johns Hopkins University, is currently the most widely used toolbox for processing bilingual parallel corpora for the production of translation systems. The IJS-ELAN corpus contains 1 million words of annotated, sentence-aligned parallel Slovene-English texts, with both languages word-tagged with context-disambiguated morphosyntactic descriptions and lemmas. The corpus is encoded in XML, according to the TEI Guidelines P4.
A Slovene to English translation system was produced using the EGYPT toolbox and the IJS-ELAN corpus. We discuss the motives for source/target language selection, i.e. why we chose to train the system for Slovenian to English translation rather than vice-versa.
We performed a basic evaluation of this system. The initial model was then extended using the corpus annotations, in particular the context-disambiguated lemmas, which abstract away from the rich inflections of Slovene. In this way the main disadvantage of our model, namely that it was derived from a corpus of relatively modest size, is at least to some extent overcome.
The new translations were evaluated and results compared with the translations of the initial system. The translations were evaluated using two methods:
- WER, word error rate, a variant of edit distance. Translations produced by our system are compared with reference translations. The corpus is divided into training and test pairs; the test pairs are used as reference translations and compared with the new translations. All substitutions, insertions and deletions are counted and the total is normalised.
- SSER, subjective sentence error rate. Translations produced by our system are marked by experts and distributed into five classes ranging from "perfect translation" to "perfect nonsense".
The results are presented and discussed.
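For reference, a minimal sketch of WER as normalised word-level edit distance, with invented example sentences:

```python
# Minimal sketch: word error rate as word-level edit distance (substitutions,
# insertions, deletions) divided by the length of the reference translation.
def wer(hypothesis, reference):
    h, r = hypothesis.split(), reference.split()
    # dynamic-programming edit distance over words
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(h)][len(r)] / len(r)

print(wer("the corpus is sentence aligned",
          "the corpus is aligned by sentence"))   # -> 0.5
```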
This research is based on the College Learner English Corpus, which was completed at Shanghai Jiao Tong University in 1999. The major objectives are to examine 1) whether the distribution of key words is closely related to the subject matter of the essays; 2) how the words are associated with each other in the essays on the same topic; and 3) how the words are inter-related in the essays across different topics. Moreover, the relation between the words in association and their collocational links has also been investigated. A general survey of the core word associations demonstrates that there is a high overlapping frequency for the core words to be used together in terms of lexical sets. In the learners' use of lexical words there are sets of core words that are highly productive. These core words are often less marked and are superordinates in the learners' mental matching of the semantic fields of the two languages. The core words are often used in association with each other, and it is possible that the words in association can also be used as collocational links. Moreover, much of the learners' use of lexical words is topic-dependent, and there is high vocabulary concentration within the texts on the same topic with regard to the whole corpus. The core words of one subject matter are often inter-related semantically, and the various relationships between the lexical sets for one topic may often be attributed to the very theme of the text. The findings show that the learners tend to use words in close relation to the meaning organization in their mental lexicon, and that the choice of one word is largely determined by how their schematic knowledge about the world is activated and by how the other words in association are chosen in their vocabulary network. Therefore the success of vocabulary use depends to a great extent on whether the learners can successfully perceive complex lexical relations such as topical relations, association and collocational links and represent them accurately in their language production. These observations have implications for EFL teaching in that lexical words may be better taught when word association and topical relations are taken into account, and new words may be more accessible and easier to retain if they are presented on the basis of the subject matter with which they are connected.
For the advanced learner of English there is no shortage of excellent dictionaries to choose from. It is accepted that lexicographers will cover the general needs of the learner and that more specific usage is the prerogative of the terminologist. However, for the non-native writer of research papers terminology may not be the major problem; it is the words that go around the terms that prove difficult. Several problems arise here. One is that this population rarely buys dictionaries, and when it does, the users do not wish to wade through complex entries for which the examples do not conform to domain- and genre-specific usage. Another problem arises from the nature of terminology. To reach agreement on ideal usage a "term" is generally defined outside of context; however, when looking at terms within a corpus we may find that research writing does not respect the standard definition, as authors strive to impose their view of the phenomenon under study. This means that corpus examples may not be to the taste of everyone in a controversial and developing field. Then, in developing a corpus-driven approach to specialised language, another problem arises, that of grammatical norms. Insofar as specialised corpora are inevitably composed of both native and non-native productions, some of the grammatical usage may not be that of the particular syntax of some so-called sublanguage, but simply bad English. In reference corpora minor variations are lost in the mass of data, but this is not necessarily so in the much smaller special language corpora.
The Parasitic Plant Dictionary project is an attempt to build a data-driven pedagogical dictionary in a specialised field. This means looking at both specialised and non-specialised items and displaying their usage in context. Two lexical units will be discussed here, "control", as verb and noun in general and scientific usage, and "haustorium", a domain specific term in parasitic plant biology. In looking at "control" we shall see the difference between the complex entry required in a pedagogical dictionary and that adopted here to show the usage of this word in specialised contexts. For "haustorium" we shall see the difficulties in extracting an entry from corpus data that will show usage, whilst not upsetting the terminological requirements of the leaders in the field.
The Oxford Text Archive is a large repository of electronic texts and text corpora. At present the archive works in much the same way that it has since its inception. The user consults the catalogue, selects a text or a number of texts and then completes the relevant procedure in order to download the text or texts to their computer. The main development in terms of resource delivery in the past 25 years is that many of the resources can now be downloaded directly from the website, rather than being sent by post on magnetic media or downloaded by ftp. The user is then left to their own devices to find software to analyse the texts and to try to extract information from them.
In order to make the archive more useful and usable for linguistics researchers, a system for querying texts and corpora online is being developed at the Oxford Text Archive. It is further proposed that the user will be able to construct a corpus of texts from the archive for downloading or querying online. It will be possible to select texts for the corpus on the basis of any of the resource metadata categories, or simply by picking and choosing from a list of texts.
Online concordancing is not new. Many sites and corpus projects offer this facility. Furthermore, the ability to select the texts on the fly and thus construct a virtual corpus is not new. This paper reviews some existing resources and services in this area.
The specific challenge of providing a service of this type using the holdings of the Oxford Text Archive is that there are more than 2400 texts in the archive and they have been collected and documented over a period of more than twenty-five years, and as such reflect a multitude of different practices in the encoding of the texts, in the construction of collections of texts, and in the documentation of the resources. The size and diversity of the archive makes it a potentially extremely rich linguistic resource.
It is however a precondition for the type of functionality which is proposed here that the textual data and the metadata be interoperable. The OTA's response to this challenge is examined in this paper. There is also an examination of the extent to which the framework which is being developed can be generalised.
The selection of translation equivalents in MT (Machine Translation) depends on the differentiation of translation divergences between the Source Language (SL) and the Target Language (TL). In this paper, the different types of translation divergence in MT are discussed: divergence in lexical selection, in tense, in thematic relations, in head switching, in structure, in category, and in conflation. The syntactic, semantic and contextual ambiguities related to translation divergence are also discussed. The author suggests using feature vectors to represent co-occurrence clusters, and a co-occurrence-cluster-based approach to the selection of translation equivalents is described in detail.
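A minimal sketch of the suggested approach, assuming invented context features, candidate equivalents and counts: the candidate whose co-occurrence-cluster vector is closest (by cosine similarity) to the source word's context vector is selected.

```python
# Minimal sketch (assumption): choose among candidate translation equivalents
# by comparing the context of the source word with feature vectors standing
# for co-occurrence clusters of each candidate. All data below is invented.
import numpy as np

FEATURES = ["money", "water", "deposit", "fish"]        # context feature dimensions
CLUSTERS = {                                            # one vector per target sense
    "bank (financial)": np.array([5.0, 0.0, 4.0, 0.0]),
    "bank (river)":     np.array([0.0, 6.0, 0.0, 3.0]),
}

def pick_equivalent(context_counts):
    ctx = np.array([context_counts.get(f, 0.0) for f in FEATURES])
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(CLUSTERS, key=lambda name: cos(ctx, CLUSTERS[name]))

print(pick_equivalent({"money": 2, "deposit": 1}))   # -> 'bank (financial)'
```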