TELRI
Trans-European Language Resources Infrastructure - II


Encoding and Presenting an English-Slovene Dictionary and Corpus

Tomaz Erjavec
Department for Intelligent Systems E-8
Jozef Stefan Institute
Ljubljana, Slovenia
e-mail: tomaz.erjavec@ijs.si

The paper presents the markup conversion of a bilingual dictionary that we are working on within the scope of the EU Concede project, and a Web implementation of a sample of the dictionary enriched with additional examples retrieved from a bilingual corpus.

The dictionary we use is an English-Slovene dictionary, currently being produced by the Slovene publishing house DZS and based on the Oxford-Hachette English-French dictionary. The corpus is the IJS-ELAN English-Slovene corpus; it contains one million words and is composed of 15 sentence-aligned and tokenised bi-texts. One of the texts, the novel "1984" by G. Orwell, produced in the scope of the MULTEXT-East project, is also tagged for part of speech.
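To make the structure of the corpus concrete, the following minimal sketch (in Python) shows one way a sentence-aligned bi-text can be represented once the alignment links are resolved; the identifiers and sentence pair are invented for illustration and are not taken from the IJS-ELAN corpus itself.

    # One aligned English-Slovene segment pair; ids and texts are invented.
    aligned_pairs = [
        ("en.1", "sl.1",
         "This is an example sentence.",
         "To je primer stavka."),
    ]

    for en_id, sl_id, en_text, sl_text in aligned_pairs:
        print("%s <-> %s" % (en_id, sl_id))
        print("  EN: " + en_text)
        print("  SL: " + sl_text)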

The paper first discusses standardised digital encoding, outlining the Text Encoding Initiative (TEI) Guidelines and explaining the TEI document types for dictionaries and for aligned corpora. We then detail the process of converting the original dictionary and corpus data into the standardised TEI encoding. In the case of the dictionary, we simply chose the TEI.dictionary base module, while the corpus is encoded as a parametrisation of TEI, in a manner similar to that used for Translation Memories.
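As an illustration of the target encoding, the fragment below builds a skeletal TEI.dictionary-style entry with Python's standard library. The element names follow the TEI dictionary tag set, but the headword and translation are invented examples, not material from the DZS dictionary.

    # A minimal sketch of a TEI dictionary entry; the lexical content is invented.
    from xml.etree.ElementTree import Element, SubElement, tostring

    entry = Element("entry")
    form = SubElement(entry, "form")
    SubElement(form, "orth").text = "dictionary"   # headword (illustrative)
    gram = SubElement(entry, "gramGrp")
    SubElement(gram, "pos").text = "n"              # part of speech
    sense = SubElement(entry, "sense", n="1")
    trans = SubElement(sense, "trans")
    SubElement(trans, "tr").text = "slovar"         # Slovene equivalent (illustrative)

    # Prints the entry as a single line of XML.
    print(tostring(entry, encoding="unicode"))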

The TEI encodings are used as the basis for a pilot exploitation of the resources. We present the conversion of the dictionary sample to HTML, discuss the manner of format down-conversion, and explain the software environment (Unix, Omnimark, Atril, MULTEXT tools) and the on-screen rendering of TEI elements, especially the dictionary entries.
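The down-conversion itself is implemented with Omnimark; purely as a sketch of the kind of mapping involved, the following Python fragment turns a TEI-style entry into a simple HTML rendering. The entry content and the formatting choices (bold headword, italic part of speech) are illustrative assumptions, not the actual stylesheet used.

    # A stand-in sketch for TEI-to-HTML down-conversion of a dictionary entry.
    from xml.etree.ElementTree import fromstring

    TEI_ENTRY = ("<entry><form><orth>dictionary</orth></form>"
                 "<gramGrp><pos>n</pos></gramGrp>"
                 "<sense n='1'><trans><tr>slovar</tr></trans></sense></entry>")

    def entry_to_html(xml_text):
        e = fromstring(xml_text)
        orth = e.findtext("form/orth")
        pos = e.findtext("gramGrp/pos")
        translations = [tr.text for tr in e.findall(".//tr")]
        return ('<p class="entry"><b>%s</b> <i>%s</i> %s</p>'
                % (orth, pos, ", ".join(translations)))

    print(entry_to_html(TEI_ENTRY))
    # <p class="entry"><b>dictionary</b> <i>n</i> slovar</p>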

Next we delve into the enhancement of the WWW presentation of the dictionary with examples retrieved from corpus data. The integration of such examples with dictionary information is a complex process, potentially involving all levels of lingware tools and linguistic analysis, from tokenisation and lemmatisation to part-of-speech tagging, syntactic chunking, and sense discrimination. We discuss the steps performed in our experiment, which exploit part-of-speech tagging to improve the recall and precision of locating headwords in the corpus. We evaluate the results of automatic querying and suggest the next steps for improving performance; two open issues are sense-specific querying and multiword entries, e.g. compounds and idioms. We also touch on the issue of selecting the translation equivalents of the query terms from the corpus data and matching these to the content of the dictionary entries.
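To indicate the flavour of the headword-location step, here is a minimal sketch assuming a corpus that is already tokenised, lemmatised and part-of-speech tagged. The token structure and the tag strings are invented for illustration and do not reproduce our actual query tools; the point is that matching on lemma plus part of speech skips false hits that plain string matching would return.

    # Find corpus sentences containing a headword as a lemma with a
    # compatible part-of-speech tag (tags and sentences are invented).
    def find_examples(sentences, headword, pos_prefix):
        hits = []
        for sent in sentences:
            for word, lemma, pos in sent:        # token = (wordform, lemma, pos)
                if lemma == headword and pos.startswith(pos_prefix):
                    hits.append(" ".join(w for w, _, _ in sent))
                    break
        return hits

    corpus = [
        [("Books", "book", "Nc"), ("sell", "sell", "Vm"), (".", ".", "PUN")],
        [("I", "I", "Pp"), ("book", "book", "Vm"),
         ("flights", "flight", "Nc"), (".", ".", "PUN")],
    ]

    # Querying for the noun reading of "book" returns only the first
    # sentence and skips the verbal use in the second.
    print(find_examples(corpus, "book", "N"))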




© TELRI, 19.11.1999