Generic Bilingual Word-Alignment and Terminology Acquisition Tools
Ido Dagan
Bar Ilan University and TextRay Ltd., Israel
e-mail: dagan@bimacs.cs.biu.ac.il
Bilingual word alignment has been recognized as a useful technology
for various translation-related tasks. A particularly attractive architecture
is obtained when deploying bilingual word alignment algorithms as the
basis for semi-automatic corpus-based tools that assist human translators, mainly
in translating technical terminology. We discuss the principles and
components of such an architecture and exemplify a possible realization of it by
describing two systems:
- A comprehensive word-alignment system that was applied to the
Hebrew-English language pair. A major goal of the design of this system is to
assume as little as possible about its input and about the relative nature of
the two languages being aligned, while allowing the use of minimal
monolingual pre-processing resources. The system receives as input a pair
of raw parallel texts, and requires only a tokeniser (or lemmatiser) for
each language. After tokenisation (or lemmatisation), a rough initial
alignment is obtained for the texts using a version of Fung and McKeown's
DK-vec algorithm (Fung and McKeown, 1997). The initial alignment is given
as input to a version of the word_align algorithm (Dagan, Church and
Gale, 1993), an extension of Model 2 in the IBM statistical translation
model. Word_align produces a word level alignment for the texts and
a probabilistic bilingual dictionary.
- A semi-automatic tool, Termight, that supports terminology
translation. Termight consists of two components which address the two
sub-tasks in bilingual glossary construction: (a) preparing a monolingual list
of technical terms in a source language document, and (b) finding the
translations for these terms in parallel source-target documents. As a first
step (in each component), the tool automatically extracts candidate terms
and candidate translations, based on term extraction and an input word
alignment. It then performs several additional pre-processing steps that
greatly facilitate human post-editing of the candidate lists. These steps
include grouping and sorting of candidates and associating example
concordance lines with each candidate. Finally, the data prepared in pre-processing
is presented to the user via an interactive interface that supports quick
post-editing operations.
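
The pre-processing steps for candidate translations described above can be illustrated with a minimal sketch. It is not Termight's actual implementation. It assumes the word alignment is supplied as a mapping from source positions to target positions, that a candidate translation is the target span between the leftmost and rightmost aligned words of a term occurrence, and that candidates are sorted by frequency. The function names, the data layout and the toy English-German sentences are all hypothetical and invented for illustration.

```python
from collections import defaultdict


def find_occurrences(tokens, term):
    """Yield the start index of every occurrence of `term` in `tokens`."""
    n = len(term)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term:
            yield i


def candidate_translations(term, corpus):
    """Collect candidate translations of `term`, each with concordance lines.

    `corpus` is a list of (source_tokens, target_tokens, alignment) triples,
    where `alignment` maps a source position to a target position
    (an assumed representation of the input word alignment).
    Returns (candidate, frequency, concordance_lines) tuples, most frequent first.
    """
    term = list(term)
    examples = defaultdict(list)
    for src, tgt, align in corpus:
        for start in find_occurrences(src, term):
            positions = [align[i] for i in range(start, start + len(term))
                         if i in align]
            if not positions:
                continue  # no word of this occurrence was aligned
            lo, hi = min(positions), max(positions)
            candidate = " ".join(tgt[lo:hi + 1])
            # Keep the sentence pair as an example concordance line.
            examples[candidate].append(" ".join(src) + " ||| " + " ".join(tgt))
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(examples.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(cand, len(lines), lines) for cand, lines in ranked]


# Toy data, invented for illustration only.
corpus = [
    (["close", "the", "dialog", "box"],
     ["Dialogfeld", "schliessen"],
     {0: 1, 2: 0, 3: 0}),
    (["the", "dialog", "box", "closes"],
     ["das", "Dialogfeld", "wird", "geschlossen"],
     {0: 0, 1: 1, 2: 1, 3: 3}),
    (["a", "dialog", "box", "appears"],
     ["ein", "Dialog", "erscheint"],
     {0: 0, 1: 1, 3: 2}),
]

results = candidate_translations(("dialog", "box"), corpus)
for candidate, freq, lines in results:
    print(freq, candidate)
    for line in lines:
        print("    " + line)
```

In this sketch the translator would see the most frequent candidate first, together with the sentence pairs that support it, and could accept or correct it. A partially aligned occurrence (the third sentence) still yields a candidate, which is why human post-editing remains part of the loop.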
The above architecture demonstrates how sophisticated algorithms
can provide the basis for highly effective tools that facilitate human
productivity, yielding a useful application of natural language technology.