Generic Bilingual Word-Alignment and Terminology Acquisition Tools

Ido Dagan
Bar Ilan University and TextRay Ltd., Israel

Bilingual word alignment has been recognized as a useful technology for various translation-related tasks. A particularly attractive architecture results when bilingual word alignment algorithms serve as the basis for semi-automatic, corpus-based tools that assist human translators, mainly in translating technical terminology. We discuss the principles and components of such an architecture and exemplify a possible realization of it by describing two systems:

  • A comprehensive word-alignment system that was applied to the Hebrew-English language pair. A major design goal of this system is to assume as little as possible about its input and about the relative nature of the two languages being aligned, while allowing the use of minimal monolingual pre-processing resources. The system receives as input a pair of raw parallel texts, and requires only a tokeniser (or lemmatiser) for each language. After tokenisation (or lemmatisation), a rough initial alignment of the texts is obtained using a version of Fung and McKeown's DK-vec algorithm (Fung and McKeown, 1997). The initial alignment is given as input to a version of the word_align algorithm (Dagan, Church and Gale, 1993), an extension of Model 2 in the IBM statistical translation model. Word_align produces a word-level alignment for the texts and a probabilistic bilingual dictionary.
  • A semi-automatic tool, Termight, that supports terminology translation. Termight consists of two components that address the two sub-tasks in bilingual glossary construction: (a) preparing a monolingual list of technical terms in a source language document, and (b) finding the translations for these terms in parallel source-target documents. As a first step (in each component) the tool automatically extracts candidate terms and candidate translations, based on term extraction and an input word alignment. It then performs several additional pre-processing steps that greatly facilitate human post-editing of the candidate lists. These steps include grouping and sorting the candidates and associating example concordance lines with each candidate. Finally, the data prepared in pre-processing is presented to the user via an interactive interface that supports quick post-editing operations.
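The pre-processing steps named for Termight (grouping, sorting, and attaching concordance lines) can be illustrated with a minimal sketch. This is not Termight's implementation: the function name `build_candidate_list`, the choice to group candidates by their final word, the frequency-based sort order, and the `max_examples` parameter are all assumptions made for illustration, and the candidate terms are taken as already extracted.

```python
import re
from collections import defaultdict


def _normalise(text):
    """Lower-case the text and keep only word tokens, padded with spaces."""
    return " " + " ".join(re.findall(r"\w+", text.lower())) + " "


def build_candidate_list(sentences, candidates, max_examples=2):
    """Group candidate terms, sort them, and attach concordance lines.

    Assumptions (hypothetical, not from the abstract): candidates are
    grouped by their last word, and groups and terms are sorted by
    descending frequency, with alphabetical order breaking ties.
    """
    normalised = [(s, _normalise(s)) for s in sentences]
    groups = defaultdict(dict)

    for term in candidates:
        needle = _normalise(term)
        # Concordance lines: every sentence containing the term as whole words.
        lines = [s for s, norm in normalised if needle in norm]
        if lines:
            group_key = term.split()[-1].lower()
            groups[group_key][term] = lines

    result = []
    for group_key, terms in groups.items():
        ordered = sorted(terms.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        entries = [(t, len(ls), ls[:max_examples]) for t, ls in ordered]
        total = sum(count for _, count, _ in entries)
        result.append((group_key, total, entries))

    # Most frequent groups first, so the post-editor sees them early.
    result.sort(key=lambda g: (-g[1], g[0]))
    return result
```

Each returned group is a tuple of the grouping word, the total number of matching lines, and a list of `(term, count, example_lines)` entries. A post-editing interface could render such a structure directly, letting the user accept or reject candidates while reading them in context.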

The above architecture demonstrates how sophisticated algorithms can provide the basis for highly effective tools that facilitate human productivity, yielding a useful application of natural language technology.

