Gabor Proszeky (Morphologic, Budapest):

HUMOR, a Morphological System for Corpus Analysis


Humor, a reversible, string-based unification approach to lemmatization and disambiguation, has been introduced both for corpus analysis at the Research Institute for Linguistics and for building a variety of other lingware applications for the general public, such as spell-checking and hyphenation. The system is language independent and therefore supports multilingual applications: besides agglutinative languages (e.g. Hungarian, Turkish) and highly inflectional languages (e.g. Polish, Romanian), it has been applied to languages of major economic and demographic significance (e.g. English, German, French).

The basic strategy of Humor is inherently suited to parallel execution: searches in the main dictionary, the secondary dictionaries and the affix dictionaries can run simultaneously. In the near future it is to be extended with a disambiguator based on the same strategy. This new method, called HumorESK (Humor Enhanced with Syntactic Knowledge), processes several levels above morphology in parallel. Both Humor and HumorESK follow a very simple and clear strategy based on surface-only analyses. No transformations are used; all the complexity of the systems is hidden in the graphs describing morpho-syntactic behavior.
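The surface-only strategy can be illustrated with a minimal sketch. It is not Humor's actual code or data format: the two toy dictionaries, the tags and the function names are invented for illustration, and a single vowel-harmony feature is a crude stand-in for the compatibility checks a unification-based system would perform. The word is only ever cut into surface segments, and each segment is looked up in a stem dictionary or a suffix dictionary. Because these lookups do not depend on each other, they could be dispatched in parallel; the sketch runs them sequentially for brevity.

```python
# Toy dictionaries, invented for illustration (not Humor's data or format).
# Each entry maps a surface string to (tag, vowel-harmony class).
STEMS = {"ház": ("N", "back"), "kert": ("N", "front")}
SUFFIXES = {
    "ak": ("PL", "back"), "ek": ("PL", "front"),     # plural
    "ban": ("INE", "back"), "ben": ("INE", "front"),  # inessive case
}


def suffix_chains(rest, harmony):
    """Yield every split of `rest` into known suffixes that agree with `harmony`."""
    if not rest:
        yield []
        return
    for i in range(1, len(rest) + 1):
        entry = SUFFIXES.get(rest[:i])
        if entry is None or entry[1] != harmony:
            continue  # unknown segment, or feature clash with the stem
        for tail in suffix_chains(rest[i:], harmony):
            yield [(rest[:i], entry[0])] + tail


def analyze(word):
    """Return all stem + suffix-chain segmentations of a surface word-form."""
    analyses = []
    for i in range(1, len(word) + 1):
        entry = STEMS.get(word[:i])
        if entry is None:
            continue
        tag, harmony = entry
        for chain in suffix_chains(word[i:], harmony):
            analyses.append([(word[:i], tag)] + chain)
    return analyses
```

With these toy entries, `analyze("házakban")` returns `[[("ház", "N"), ("ak", "PL"), ("ban", "INE")]]`, while the ill-formed `analyze("házekben")` returns an empty list because the suffixes clash with the harmony class of the stem.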

Humor has been rigorously tested by "real" end-users. The Hungarian version has been in everyday use since 1991, both by lexicographers and other researchers at the Research Institute for Linguistics of the Hungarian Academy of Sciences and by users of word-processing tools (Humor-based linguistic modules have been licensed by Microsoft, Lotus, Inso and other software developers). The lemmatizer shares some of the extra features of Helyes-e?, the speller derived from Humor, because lexicographers need a fault-tolerant lemmatizer that can overcome simple orthographic errors and frequent mistypings. This fault tolerance is also useful when analyzing Hungarian texts from the 19th century, when Hungarian orthography was not yet standardized.
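One way to picture a fault-tolerant lemmatizer is the following sketch. It uses a generic edit-distance-1 candidate search, which is an assumption made for illustration and not necessarily the method used by Humor or Helyes-e?. The toy lexicon, the alphabet string and the function names are likewise hypothetical. An exact match is tried first; only if it fails are forms one deletion, transposition, replacement or insertion away looked up.

```python
# Toy lexicon of word-forms mapped to lemmas; invented for illustration.
LEXICON = {"cukor": "cukor", "házak": "ház", "kertben": "kert"}
# Letters tried for replacements and insertions (an illustrative subset).
ALPHABET = "aábcdeéfghiíjklmnoóöőprstuúüűvz"


def edits1(word):
    """Return all strings one simple edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def lemmatize(word):
    """Return (lemma, matched form) pairs, preferring an exact match."""
    if word in LEXICON:
        return [(LEXICON[word], word)]
    candidates = sorted(c for c in edits1(word) if c in LEXICON)
    return [(LEXICON[c], c) for c in candidates]
```

With this toy lexicon, the mistyped `lemmatize("hzáak")` returns `[("ház", "házak")]`, and the older spelling `lemmatize("czukor")` returns `[("cukor", "cukor")]`, since deleting the `z` yields a known form.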

Humor's Hungarian version, the largest and most precise implementation, contains nearly 100,000 stems, which cover all (approx. 70,000) lexemes of the Concise Explanatory Dictionary of the Hungarian Language. The suffix dictionaries contain all the inflectional suffixes and the productive derivational morphemes of present-day Hungarian. With the help of these dictionaries Humor is able to analyze and/or generate several billion (!) different well-formed Hungarian word-forms. The whole software package is written in standard C using C++-like objects. It runs on any platform for which a C compiler is available.
