Tagging with Combined Language Models and Large Tagsets

Dan Tufis

The paper discusses experiments and results related to tagging a highly inflectional language, based on multiple register-diversified language models (LMs). The texts are accurately disambiguated (98.5% on average) in terms of a large tagset (615 tags) in two processing steps (tiered processing). The underlying tagger simultaneously uses multiple register LMs and chooses the final annotation by means of a combined-classifier decision-making procedure.

With a small price in tagging accuracy (as compared to a reduced tagset approach), and practically no price in computational resources, it is possible to tag a text with a large tagset by using LMs built for reduced tagsets and, consequently, small training corpora. We call this way of tagging tiered tagging. In general terms, tiered tagging uses a hidden tagset (we call it the C-tagset) of a smaller size (in our case 92 tags, of which 10 are used for punctuation), on which an LM is built. This LM serves for a first level of tagging. A post-processor then deterministically replaces each tag from the small tagset with one or more tags from the large tagset (we call it the MSD-tagset). The words that this replacement makes ambiguous (in terms of the MSD-tagset annotation) are more often than not the difficult cases in statistical disambiguation. Very simple contextual rules differentiate the interpretations of the few words that remain ambiguous (in our experiment, less than 10%). Depending on the ambiguity class, these rules inspect the left context, the right context, or both, within a limited distance (in our experiment never exceeding 4 words in one direction), for a disambiguating tag or word-form. The success rate of this second phase was slightly above 98%. Because the contextual rules are required only rarely, the response time penalty is insignificant.
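The second tier described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the tag names, the expansion table `C_TO_MSD`, and the single rule `agree_with_next_noun` are invented for illustration. Only the overall shape (deterministic expansion, then a contextual rule looking at most 4 words in one direction) follows the description.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical C-tag -> MSD-tag expansion table (all tag names are invented).
C_TO_MSD: Dict[str, List[str]] = {
    "DET": ["Det-masc", "Det-fem"],  # becomes ambiguous after expansion
    "NOUN_M": ["Noun-masc"],
    "NOUN_F": ["Noun-fem"],
    "VERB": ["Verb"],
}

# The abstract reports that rules never looked further than 4 words in one direction.
MAX_DISTANCE = 4

Rule = Callable[[List[str], int], Optional[str]]


def agree_with_next_noun(c_tags: List[str], i: int) -> Optional[str]:
    """Hypothetical rule: copy the gender of the nearest noun to the right."""
    stop = min(i + 1 + MAX_DISTANCE, len(c_tags))
    for j in range(i + 1, stop):
        if c_tags[j] == "NOUN_M":
            return "Det-masc"
        if c_tags[j] == "NOUN_F":
            return "Det-fem"
    return None  # no disambiguating tag found within the window


# One contextual rule per ambiguity class (here only one class is ambiguous).
RULES: Dict[str, Rule] = {"DET": agree_with_next_noun}


def expand(c_tags: List[str]) -> List[str]:
    """Replace each C-tag by its MSD-tag, applying a rule where needed."""
    result = []
    for i, c_tag in enumerate(c_tags):
        candidates = C_TO_MSD[c_tag]
        if len(candidates) == 1:
            result.append(candidates[0])  # deterministic replacement
            continue
        rule = RULES.get(c_tag)
        choice = rule(c_tags, i) if rule else None
        # If no rule fires, keep the ambiguity explicit instead of guessing.
        result.append(choice if choice is not None else "|".join(candidates))
    return result


tagged = expand(["DET", "NOUN_F", "VERB"])
unresolved = expand(["DET", "VERB"])
```

In this sketch an unresolved word keeps all of its candidate tags joined by `|`, so later processing can see that the rule did not apply.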
Depending on the accuracy of the contextual rules, the error rate for the final tagged text can be practically the same as for the hidden tagging phase (the additional error is less than 0.5%). Obviously, the reduced tagset and the extended one have to stand in a specific relation (the small tagset should subsume the large one).

In our experiment, we used QTAG and TREE-Tagger (but any other tagger may be used) trained on different kinds of text registers, constructing different language models (LM1, LM2, ...). A new text (unseen, from an unknown register) is independently tagged with the same tagger but using different LMs. The results provided by the different classifiers (tagger + LMi) are interpolated by a classifier we called CREDIBILITY. The combined LM classifier (CLAM) is shown to be consistently more accurate than any individual classifier used in the combination.

As our experiments have shown, when a new text belonged to a specific language register, the classifier based on that register's language model more often than not provided the second best accuracy (after the combined classifier). Therefore, it is reasonable to assume that when a new text is tagged by a multiple-LM classifier and the final result is closest to that of the individual classifier based on the language model LMi, the new text probably belongs to the LMi register, or is close to it. Once there is a general hint about the type of text being processed, stronger identification criteria can be used to validate the hypothesis.

The paper will present a thorough evaluation of the tiered tagging with combined language models (TT&CLAM) methodology, details about its implementation and availability.
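The combination step and the register hint can be sketched as follows. The abstract does not specify how CREDIBILITY interpolates the individual results, so this sketch substitutes plain weighted voting; the weights in `credibility`, the function names, and the toy tag sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def combine(outputs: Dict[str, List[str]], credibility: Dict[str, float]) -> List[str]:
    """Weighted vote per token: a stand-in for the CREDIBILITY classifier."""
    length = len(next(iter(outputs.values())))
    combined = []
    for i in range(length):
        scores: Dict[str, float] = defaultdict(float)
        for name, tags in outputs.items():
            scores[tags[i]] += credibility[name]  # each LM votes with its weight
        combined.append(max(scores, key=lambda tag: scores[tag]))
    return combined


def closest_register(outputs: Dict[str, List[str]], combined: List[str]) -> str:
    """Return the LM whose tagging agrees most with the combined result."""
    agreement = {
        name: sum(a == b for a, b in zip(tags, combined)) / len(combined)
        for name, tags in outputs.items()
    }
    return max(agreement, key=lambda name: agreement[name])


# Toy output of one tagger run with three hypothetical register LMs.
outputs = {
    "LM1": ["N", "V", "A"],
    "LM2": ["N", "N", "A"],
    "LM3": ["V", "V", "R"],
}
# Assumed weights, e.g. each classifier's accuracy on held-out text.
credibility = {"LM1": 0.97, "LM2": 0.95, "LM3": 0.90}

final_tags = combine(outputs, credibility)
register_hint = closest_register(outputs, final_tags)
```

Here `register_hint` is only a first guess, in line with the abstract's remark that stronger identification criteria would be needed to validate it.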
© TELRI, 19.11.1999