Trans-European Language Resources Infrastructure - II

Distance Between Languages as Measured by a Minimal-Entropy Model

Primoz Jakopin
Institute for Slovene Language "Fran Ramovs"
Slovene Academy for Sciences and Arts
Ljubljana, Slovenia

This paper discusses a language model based on the probabilities of n-grams in a Slovenian text sample (60 books of 41 authors, 46 original and 16 translations, 2.7 million words). A Huffman tree is generated from all the n-grams (n = 1 to 32) with a frequency of 2 or more, and a Huffman code is computed for every leaf of the tree. To apply the model, any observed text is cut into n-grams (n = 1 to 32) in such a way that the sum of the Huffman code lengths of all the parts is as close to the minimum as possible.

In Table 1 the model, built on the Slovenian sample described above, is applied to all 16 translations of Plato's Republic provided on the TELRI CD-ROM (ed. Erjavec, Lawson and Romary 1998). The table is sorted in ascending order of the average number of bits per character achieved when each text is coded by the model. Every row of the table has 8 entries: the sequence number of the language, the language name, the translator, the publication year of the edition used for the electronic version of the text (usually the same as the year of translation), the first person of the team responsible for the transfer into electronic form, the number of words in the translation, the number of characters, and the average number of bits per character produced by the model. Missing values are shown by a hyphen.

All the texts have been, within limits, unified and transcribed into Latin script. The Bulgarian translation had already been transcribed; only the hacek characters were still left as ch, sh, zh. The Russian translation was supplied in Cyrillic encoding. The texts in Czech, Slovak and Polish have been stripped of the diacritics on their vowels to obtain more realistic behavior of the model.
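The model described above (Huffman codes over n-grams, then a minimal-cost cut of the observed text) can be illustrated with a small sketch. It is not the author's implementation. Several details are assumptions made only for illustration: the n-grams are taken to be character n-grams, the cut is found exactly by dynamic programming (the paper only says "as close to minimum as possible"), and characters that never occur in the training sample are charged a fixed escape cost, unknown_bits, which the paper does not specify. All function names are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def ngram_counts(text, max_n=32, min_freq=2):
    """Count character n-grams (n = 1..max_n); keep those seen min_freq times or more."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # unique tie-breaker so heap entries never compare the lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every leaf under the merged node gets one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), s1 + s2))
    return lengths


def min_bits(text, lengths, max_n=32, unknown_bits=16):
    """Smallest total code length over all cuts of text into known n-grams."""
    best = [0.0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            cost = lengths.get(text[end - n:end])
            if cost is None and n == 1:
                cost = unknown_bits  # assumed escape cost for an unseen character
            if cost is not None:
                best[end] = min(best[end], best[end - n] + cost)
    return best[-1]


def bits_per_char(text, lengths, max_n=32):
    """Average number of bits per character when text is coded by the model."""
    return min_bits(text, lengths, max_n) / len(text)


# Toy usage: train on a tiny sample, then code a familiar and an unfamiliar text.
sample = "ena dva tri ena dva tri ena dva"
lengths = huffman_code_lengths(ngram_counts(sample, max_n=4))
print(bits_per_char("ena dva", lengths, max_n=4))
print(bits_per_char("xyz xyz", lengths, max_n=4))
```

In this sketch, a text that resembles the training sample is covered by long, frequent n-grams with short codes and so needs fewer bits per character, which is the quantity reported in the last column of Table 1.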

As expected, according to the model the two languages closest to Slovenian are Serbian and Croatian, followed by Bulgarian, Czech, Polish, Russian and Slovak. It is interesting to note that, at least according to the model, Finnish is closer to Slovenian than Hungarian, although these are the two most distant languages.

No.  Language       Translator              Year  Electronic version  Words    Characters  Bits/char
     Slovenian      Joz Kosar               1976  Primoz Jakopin      92,741   565,604     2.37
1.   Serbocroatian  A. Vilhar, B. Pavlovič  1983  Dusko Vitas         107,506  613,082     3.77
2.   Croatian       Damir Salopek           1976  Marko Tadic         92,870   532,497     3.84
3.   Bulgarian      Patrice Bonhomme        -     -                   112,676  678,131     3.96
4.   Czech          Radislav Hosek          1993  Frantisek Cermak    110,466  636,201     4.10
5.   Polish         Wladyslaw Witwicki      1991  -                   107,559  645,532     4.32
7.   Slovak         Julius Spanar           1990  Alexandra Jarosová  99,661   622,463     4.46
8.   Latvian        Gustavs Lukstins        1982  Andrejs Spektors    45,238   290,508     4.74
9.   Lithuanian     Jonas Dumcius           1981  -                   85,144   584,318     4.94
10.  English        Paul Shorey             -     -                   129,331  692,058     5.40
12.  German         Karl Vretska            1982  Joachim Hohwieler   104,876  641,333     5.69
13.  Romanian       Andrei Cornea           1986  Dan Tufis           131,064  658,804     5.76
14.  Finnish        Marja Itkonen-Kaila     -     Anna Mauranen       75,800   582,522     6.11
15.  Hungarian      Szabo Miklos            1984  Tamas Varadi        105,538  728,501     6.47

Table 1: Minimal-entropy model, applied to Plato's Republic in 16 languages.

