TELRI-II

Trans-European Language Resources Infrastructure - II


Distance Between Languages as Measured by a Minimal-Entropy Model

Primoz Jakopin
Institute for Slovene Language "Fran Ramovs"
Slovene Academy for Sciences and Arts
Ljubljana, Slovenia
e-mail: primoz.jakopin@uni-lj.si

This paper discusses a language model based on the probabilities of n-grams in a Slovenian text sample (60 books by 41 authors, 46 original and 16 translations, 2.7 million words). A Huffman tree is generated from all the n-grams (n = 1 to 32) with a frequency of 2 or more, and the corresponding Huffman code is computed for every leaf of the tree. To code an observed text, the model cuts it into n-grams (n = 1 to 32) so that the sum of the Huffman code lengths of all the parts is as close to the minimum as possible.

In Table 1 the model, built from this Slovenian sample, has been applied to all 16 translations of Plato's Republic provided on the TELRI CD-ROM (ed. Erjavec, Lawson and Romary 1998). The table is sorted in ascending order of the average number of bits per character that the model achieved when coding each text. Every line of the table has 8 entries: the sequence number of the language, the language name, the translator, the publication year of the edition used for the electronic version of the text (usually the same as the year of translation), the first person of the team responsible for the transfer into electronic form, the number of words in the translation, the number of characters, and the average number of bits per character produced by the model. Missing values are shown by a hyphen.

All the texts have been, within limits, unified and transcribed into the Latin alphabet. The Bulgarian translation had already been transcribed; only the hacek characters were left as ch, sh, zh. The Russian translation was supplied in Cyrillic encoding. The Czech, Slovak and Polish texts have been stripped of the diacritics on their vowels to obtain more realistic behavior of the model.
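The coding procedure described above (Huffman codes over frequent n-grams, followed by a segmentation that minimizes the total code length) can be illustrated with a minimal Python sketch. It is not the author's implementation. The function names, the toy limit `max_n=4` (the paper uses n up to 32), the exact dynamic-programming search (the paper only says "as close to minimum as possible") and the fixed `unknown_bits` penalty for characters absent from the model are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def ngram_counts(text, max_n=4, min_freq=2):
    """Count all n-grams (n = 1..max_n); keep those seen at least min_freq times."""
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman tree built over freqs."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    # Heap entries: (weight, tie breaker, symbols contained in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge pushes these leaves one level deeper
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def min_bits(text, lengths, max_n=4, unknown_bits=16):
    """Smallest total code length over all cuts of text into known n-grams."""
    best = [0.0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            cost = lengths.get(text[end - n:end])
            if cost is None and n == 1:
                # Assumed escape cost for a character the model has never seen.
                cost = unknown_bits
            if cost is not None:
                best[end] = min(best[end], best[end - n] + cost)
    return best[-1]


def bits_per_char(text, lengths, max_n=4):
    """Average number of bits per character, the measure reported in the table."""
    return min_bits(text, lengths, max_n) / len(text)


if __name__ == "__main__":
    sample = "the cat sat on the mat " * 20  # toy stand-in for the training sample
    model = huffman_code_lengths(ngram_counts(sample))
    print(bits_per_char("the cat sat on the mat", model))
    print(bits_per_char("zzzz qqqq xxxx", model))
```

In this sketch a text that resembles the training sample is covered by long, cheap n-grams and so needs fewer bits per character, while a text with unfamiliar character sequences falls back to short n-grams or the escape cost. This is the sense in which bits per character serves as a distance from Slovenian.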

As expected, the two languages closest to Slovenian according to the model are Serbian and Croatian, followed by Bulgarian, Czech, Polish, Russian and Slovak. It is interesting to note that, at least according to the model, of the two most distant languages Finnish is closer to Slovenian than Hungarian is.

No.  Language       Translator              Year  Electronic version  Words    Characters  b/c

     Slovenian      Joz Kosar               1976  Primoz Jakopin      92.741   565.604     2,37

1.   Serbocroatian  A. Vilhar, B. Pavlovič  1983  Dusko Vitas         107.506  613.082     3,77
2.   Croatian       Damir Salopek           1976  Marko Tadic         92.870   532.497     3,84
3.   Bulgarian      Patrice Bonhomme        -     -                   112.676  678.131     3,96
4.   Czech          Radislav Hosek          1993  Frantisek Cermak    110.466  636.201     4,10
5.   Polish         Wladyslaw Witwicki      1991  -                   107.559  645.532     4,32
6.   Russian        -                       -     -                   99.503   649.102     4,46
7.   Slovak         Julius Spanar           1990  Alexandra Jarosová  99.661   622.463     4,46
8.   Latvian        Gustavs Lukstins        1982  Andrejs Spektors    45.238   290.508     4,74
9.   Lithuanian     Jonas Dumcius           1981  -                   85.144   584.318     4,94
10.  English        Paul Shorey             -     -                   129.331  692.058     5,40
11.  French         -                       1993  -                   142.624  817.658     5,67
12.  German         Karl Vretska            1982  Joachim Hohwieler   104.876  641.333     5,69
13.  Romanian       Andrei Cornea           1986  Dan Tufis           131.064  658.804     5,76
14.  Finnish        Marja Itkonen-Kaila     -     Anna Mauranen       75.800   582.522     6,11
15.  Hungarian      Szabo Miklos            1984  Tamas Varadi        105.538  728.501     6,47

Table 1: Minimal-entropy model, applied to Plato's Republic in 16 languages.


Back to Newsletter no. 9.