Distance Between Languages as Measured by a Minimal-Entropy Model
This paper discusses a language model based on n-gram probabilities drawn from a Slovenian text sample (60 books by 41 authors, 46 originals and 16 translations, 2.7 million words). A Huffman tree is generated from all n-grams (n = 1 to 32) that occur at least twice, and a Huffman code is computed for every leaf of the tree. To apply the model, an observed text is cut into n-grams (n = 1 to 32) so that the sum of the Huffman code lengths of all the parts is as close to the minimum as possible.
In Table 1 the model, built from the Slovenian sample described above, is applied to all 16 translations of Plato’s Republic provided on the TELRI CD-ROM (ed. Erjavec, Lawson and Romary 1998). The table is sorted in descending order of the average number of bits per character that the model needed to code each text. Every row has 8 entries: the sequence number of the language, the language name, the translator, the publication year of the edition used for the electronic version (usually the same as the year of translation), the first-named member of the team responsible for the electronic version, the number of words in the translation, the number of characters, and the average number of bits per character produced by the model. Missing values are shown by a hyphen.
All texts were, within limits, unified and transcribed into the Latin alphabet. The Bulgarian translation had already been transcribed; only the hacek characters were left as ch, sh, zh. The Russian translation was supplied in a Cyrillic encoding. The Czech, Slovak and Polish texts were stripped of the diacritics on their vowels so that the model would behave more realistically.
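The coding procedure described above (count n-grams, derive Huffman code lengths, then cut a text into the cheapest sequence of n-grams) can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names are hypothetical, the segmentation uses dynamic programming (the source only says the sum should be "as close to minimum as possible"), and the fixed penalty `UNSEEN_BITS` for characters absent from the model is an assumption made so that every text can be coded.

```python
import heapq
from collections import Counter
from itertools import count

MAX_N = 32        # longest n-gram, as stated in the text
MIN_FREQ = 2      # keep n-grams seen at least twice, as stated in the text
UNSEEN_BITS = 32  # ASSUMPTION: cost of a character the model has never seen


def count_ngrams(sample, max_n=MAX_N, min_freq=MIN_FREQ):
    """Count all character n-grams (1..max_n) and keep the frequent ones."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(sample) - n + 1):
            counts[sample[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}


def huffman_code_lengths(freqs):
    """Return {symbol: Huffman code length in bits} for the given frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): 1}
    tie = count()  # unique tie-breaker so the heap never compares the lists
    heap = [(f, next(tie), [s]) for s, f in freqs.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    while len(heap) > 1:
        f1, _, group1 = heapq.heappop(heap)
        f2, _, group2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for s in group1 + group2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, next(tie), group1 + group2))
    return lengths


def min_code_bits(text, lengths, max_n=MAX_N, unseen_bits=UNSEEN_BITS):
    """Minimal total code length over all ways of cutting text into n-grams."""
    best = [0] + [float("inf")] * len(text)
    for end in range(1, len(text) + 1):
        for n in range(1, min(max_n, end) + 1):
            cost = lengths.get(text[end - n:end])
            if cost is None:
                if n > 1:
                    continue          # unknown long n-gram: not usable
                cost = unseen_bits    # unknown single character: penalty
            best[end] = min(best[end], best[end - n] + cost)
    return best[-1]


def bits_per_char(text, lengths, max_n=MAX_N):
    """Average number of bits per character, the measure used in Table 1."""
    return min_code_bits(text, lengths, max_n) / len(text) if text else 0.0
```

In this sketch a model would be built once with `huffman_code_lengths(count_ngrams(sample))` and then passed to `bits_per_char` for each translation; a lower value means the text is cheaper to code with the Slovenian model. Counting every n-gram up to length 32 over a multi-million-word sample in this naive way would need a great deal of memory, so the sketch is only practical on small samples.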
As expected, according to the model the two languages closest to Slovenian are Serbian and Croatian, followed by Bulgarian, Czech, Polish, Russian and Slovak. Interestingly, of the two most distant languages, Finnish is closer to Slovenian than Hungarian, at least according to the model.
Table 1: Minimal-entropy model, applied to Plato's Republic in 16 languages.
© TELRI, 19.11.1999