Norbert Volz (Mannheim):

CORDON - CORPUS-ORIENTED DETECTION OF NEOLOGISMS


CORDON, a multinational concerted project carried out jointly by academic and industrial partners, aims to provide a modular, language-independent client/server software solution for the automatic detection of neologisms - new words or multi-word units denoting new concepts - in texts, using monitor corpora.

New concepts reflecting changes in culture, society, industry and science quickly show their influence on language. New words or multi-word units emerge, enabling the integration of these concepts into the communication process. The identification and documentation of those changes is therefore of major importance for keeping language resources, language processing tools and terminology databases up to date.

Monitor corpora can be used to recognise and trace the changing patterns of collocations and similar phenomena that give clues to the emergence of new terms. Basically, two types of tools are needed for this purpose:
- a tool to correlate lexical and terminological items with temporal intervals, based on frequency and distribution over text types, using statistical methods such as chi-square tests to assess the significance of noticeable irregularities in the distribution of words in a corpus within a given period;
- a statistics-driven tool to establish context patterns for lexical and terminological items, reflecting their various usages, e.g. by examining the verbal environment of repeated instances of a word and looking for repetitions and regularities within that environment.
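
As an illustration of the first tool, the following is a minimal sketch of a significance test on a word's frequency in two time intervals of a monitor corpus. It is not the CORDON implementation: the function names, the 2x2 table layout and the 5% significance threshold are assumptions made for this example only.

```python
# Hypothetical sketch: compare a word's frequency in a recent time slice of a
# monitor corpus with its frequency in an earlier reference slice.

# Chi-square critical value for 1 degree of freedom at p = 0.05 (assumed threshold).
CRITICAL_VALUE_P05 = 3.841


def chi_square_2x2(word_recent, total_recent, word_reference, total_reference):
    """Pearson chi-square statistic for the 2x2 table
    (word / all other tokens) x (recent slice / reference slice)."""
    a = word_recent
    b = total_recent - word_recent
    c = word_reference
    d = total_reference - word_reference
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column total is zero, so the test is undefined.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def is_candidate(word_recent, total_recent, word_reference, total_reference):
    """True if the word is relatively more frequent in the recent slice and
    the difference is significant at the assumed threshold."""
    if total_recent == 0 or total_reference == 0:
        return False
    rising = word_recent / total_recent > word_reference / total_reference
    statistic = chi_square_2x2(
        word_recent, total_recent, word_reference, total_reference
    )
    return rising and statistic > CRITICAL_VALUE_P05
```

For example, is_candidate(50, 10000, 2, 10000) flags a word that occurs 50 times in 10,000 recent tokens but only twice in 10,000 reference tokens. For very low counts the chi-square approximation is unreliable, so a real tool would probably need a different test or a minimum-frequency cut-off.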

A combination of these tools working on monitor corpora will enable the identification of "candidates" for neologisms, which can then be listed and processed for further analysis and applications.
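
The context-pattern tool in the list above could, in its simplest form, count the words that recur in the immediate environment of a candidate item. The sketch below assumes a pre-tokenised text, a fixed window of two tokens on either side and a minimum recurrence count of two; all names and parameters are hypothetical and chosen for illustration only.

```python
from collections import Counter


def context_profile(tokens, target, window=2):
    """Count the words found within `window` tokens on either side of
    every occurrence of `target` in the token list."""
    profile = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        profile.update(left + right)
    return profile


def recurring_collocates(tokens, target, window=2, min_count=2):
    """Return (word, count) pairs for context words that recur at least
    `min_count` times around `target`, most frequent first."""
    profile = context_profile(tokens, target, window)
    return [(word, count) for word, count in profile.most_common()
            if count >= min_count]
```

Applied to the token list of "the new modem works and the new modem fails" with the target "modem", this returns "the" and "new" with a count of two each. A realistic tool would weight such counts statistically rather than rely on raw recurrence.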

The envisaged software product will be a minimal-assumption, generic, modular solution that users can adapt to their own texts and corpora regardless of language. Possible applications will mainly be within lingware products, e.g. machine translation systems, multilingual termbanks, databases etc. CORDON will also prove useful for the automatic updating and expansion of natural language lexicons and translation memories.

The project consortium consists of four academic and four industrial partners. The academic partners will provide research facilities and staff. The industrial partners will be responsible for project management, supervision, validation, evaluation and assessment of the final product in order to guarantee maximum response to user needs.

The project will run for two years. At the end of this phase, the result of the CORDON project will be a robust, demonstrable prototype that works with existing applications and corpora.

The proposal for this project will be handed in under the current TELEMATICS call within the 4th Framework Programme of the European Commission.


For further information, contact: telri-admin@nytud.hu

