TELRI
Trans-European Language Resources Infrastructure - II

Automatic Extraction of Terminological Translation Lexicon from Czech-English Parallel Texts

Martin Čmejrek, Jan Cuřín
Institute of Formal and Applied Linguistics
Charles University
Prague, Czech Republic
e-mail: curin@ufal.mff.cuni.cz, cmejrek@ufal.mff.cuni.cz

The primary motivation for our research was to create a translation lexicon for the terminology of a particular discipline. Many disciplines lack relevant dictionaries, or the existing dictionaries are obsolete because the discipline develops quickly. We assume that the core of a translation lexicon can be generated automatically from a parallel corpus of previously translated texts and then edited manually.

We build on work in the field of automatic sentence alignment (Gale and Church, 1993) and automatic extraction of translation dictionaries (Brown et al., 1993; Wu and Xia, 1994). These works exploited very large corpora of parallel texts from parliaments in bilingual countries, such as Canada and Hong Kong. The first two (Gale and Church; Brown et al.) used the Canadian Hansards English-French corpus; the third used the HKUST English-Chinese corpus. These corpora are very large (around 2 million and 0.4 million sentence pairs, respectively) and mostly contain highly equivalent, literal and tight translations. The situation in our country is different: we lack such a good source of large bilingual data.

We used a smaller corpus of texts from one particular discipline, a computer-oriented corpus. It consists of operating system messages from IBM AIX and of operating system guides for IBM AS/400 and VARP 4. The translations are literal and tight. In most cases the texts are translated sentence by sentence, which means that there is a one-to-one correspondence between an English sentence and a Czech sentence. On the other hand, it is typical of this kind of text that the majority of operating system messages and a large part of the sentences from the guides do not contain a verb. This corpus contains 119,886 sentence pairs.

We also have access to data from the Reader's Digest Výběr magazine. Between 30% and 60% of the articles in this magazine have been translated from English into Czech. The translations in Reader's Digest are mostly very free. This corpus contains 58,137 sentence pairs. The experiments were carried out on this corpus as well, and the results were compared with those obtained from the computer-oriented corpus.

To identify corresponding English and Czech paragraphs and sentences, we implemented an automatic statistical method based on the lengths of paragraphs and sentences, respectively (Gale and Church, 1993). The accuracy, measured as the proportion of correctly aligned sentence pairs, was 96% on the computer-oriented corpus and 85% on the fiction (Reader's Digest) corpus. We compared the distributions of alignment categories in the Canadian Hansards and in the Czech-English computer-oriented and fiction corpora.
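
The length-based alignment can be illustrated with a minimal sketch in the spirit of Gale and Church (1993). It is not our implementation: the category priors and the constants C and S2 are the values published for the Canadian Hansards, the normalization by the mean length is a choice made here to avoid division by zero, and all function names are hypothetical.

```python
import math

# Prior probabilities of alignment categories (n_source, n_target),
# as published by Gale and Church (1993) for the Canadian Hansards.
PRIORS = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
          (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}
C = 1.0    # assumed expected number of target characters per source character
S2 = 6.8   # assumed variance of that ratio per character


def match_cost(len_src, len_tgt, prior):
    """Return -log(prior * P(|delta|)) for one candidate alignment bead."""
    mean = (len_src + len_tgt / C) / 2.0
    if mean == 0:
        return float("inf")
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    # Two-sided tail probability of a standard normal variable.
    tail = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(prior * max(tail, 1e-300))


def align(src_lens, tgt_lens):
    """Align two lists of sentence lengths (in characters) by dynamic programming.

    Returns a list of beads (source_indices, target_indices).
    """
    n, m = len(src_lens), len(tgt_lens)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + match_cost(sum(src_lens[i:ni]),
                                            sum(tgt_lens[j:nj]), prior)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Follow the back pointers from the end to recover the best path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Sentences of similar length are then paired one to one, while two short source sentences facing one long target sentence are grouped into a single 2-1 bead. The same procedure can be run first on paragraph lengths and then on sentence lengths within aligned paragraphs.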

The majority of terms in the computer-oriented corpus occur in the form of a noun phrase. Our idea for automatically extracting the terminological translation lexicon is to concatenate the words of potential phrases into one string, i.e. to treat these constructions as single words, and then to use a statistical model based on word-by-word translation probabilities. We developed a tool based on a regular grammar that marks noun phrase boundaries. Czech phrases and words are converted into their base forms: nominative for nouns, adjectives and pronouns, infinitive for verbs.
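
The concatenation step can be sketched as follows. This is only an illustration under strong assumptions: the one-letter tagset (A, N, O) and the pattern "any adjectives or nouns followed by a noun" are hypothetical stand-ins, not the regular grammar used in our tool, and the input is assumed to be already tagged.

```python
import re

# Hypothetical one-letter tags: A = adjective, N = noun, O = anything else.
# The pattern below is an assumed, simplified noun phrase grammar.
NP_PATTERN = re.compile(r"[AN]*N")


def join_noun_phrases(tagged):
    """Join the words of each matched noun phrase into one token.

    `tagged` is a list of (word, tag) pairs; every tag must be one character
    so that positions in the tag string correspond to positions in the list.
    """
    tags = "".join(tag for _, tag in tagged)
    out = []
    pos = 0
    for match in NP_PATTERN.finditer(tags):
        # Copy the words that precede the phrase unchanged.
        out.extend(word for word, _ in tagged[pos:match.start()])
        # Replace the phrase by a single underscore-joined token.
        out.append("_".join(word for word, _ in tagged[match.start():match.end()]))
        pos = match.end()
    out.extend(word for word, _ in tagged[pos:])
    return out


sentence = [("the", "O"), ("operating", "A"), ("system", "N"),
            ("message", "N"), ("is", "O"), ("displayed", "O")]
print(join_noun_phrases(sentence))
```

After this step a word-by-word translation model sees "operating_system_message" as a single vocabulary item, so a multi-word term can be paired with its translation as a whole.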

We implemented Models 1 and 2 of translation probability (Brown et al., 1993) and estimated their parameters, the probabilistic dictionaries, with the EM algorithm.
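
A minimal sketch of EM training for Model 1 is given below; Model 2, which adds position-dependent alignment probabilities, is not covered. The function name, the "<NULL>" token and the toy sentence pairs are assumptions made for illustration, not our implementation or data.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(tgt_word, src_word)] = P(tgt_word | src_word) by EM.

    `pairs` is a list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (tgt, src) links
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            src_n = ["<NULL>"] + list(src)  # empty word for unaligned targets
            for w in tgt:
                # E-step: distribute one count for w over all source words.
                norm = sum(t[(w, s)] for s in src_n)
                for s in src_n:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float,
                        {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


toy_pairs = [(["new", "file"], ["nový", "soubor"]),
             (["new", "system"], ["nový", "systém"]),
             (["file"], ["soubor"])]
t = train_model1(toy_pairs)
print(sorted(((p, w) for (w, s), p in t.items() if s == "file"), reverse=True))
```

The resulting table is the probabilistic dictionary: for each source word it gives a probability distribution over target words, which can then be filtered.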

The output of the training procedure is filtered to produce a smaller, more useful and more reliable dictionary. We tested several filtering criteria and looked for their optimal combination. The size of the resulting dictionaries is around 6,000 entries. After significance filtering, the weighted precision is 86.4% for the computer-oriented Czech-English dictionary and 70.7% for the fiction dictionary.

References:

Brown, Peter F.; Della Pietra, S. A.; Della Pietra, V. J.; Mercer, Robert L. 1993. "The Mathematics of Statistical Machine Translation: Parameter Estimation". In Computational Linguistics, 19(2): 263-331.

Čmejrek, Martin. 1998. "Automatická extrakce dvojjazyčného pravděpodobnostního slovníku z paralelních textů". MSc Thesis, Institute of Formal and Applied Linguistics, Charles University, Prague. 82 pp. (in Czech)

Cuřín, Jan. 1998. "Automatická extrakce překladu odborné terminologie". MSc Thesis, Institute of Formal and Applied Linguistics, Charles University, Prague. 89 pp. (in Czech)

Gale, William A.; Church, Kenneth W. 1993. "A Program for Aligning Sentences in Bilingual Corpora". In Computational Linguistics, 19(1): 75-102.

Hajič, Jan; Hladká, Barbora. 1998. "Tagging Inflective Languages: Prediction of Morphological Categories for Rich, Structured Tagset". In Proceedings of Coling/ACL’98, Montreal, Canada.

Wu, Dekai; Xia, Xuanyin. 1994. "Learning an English-Chinese Lexicon from a Parallel Corpus". Association for Machine Translation in the Americas, Oct. 94: 206-213, Columbia, USA.


Back to Newsletter no. 9.

© TELRI, 19.11.1999