TELRI-II

Trans-European Language Resources Infrastructure - II

Hybrid Approaches for Automatic Segmentation and Annotation of Chinese Text Corpus

Feng Zhiwei
Institute of Applied Linguistics
The State Language Commission of China
Beijing, China
e-mail: zwfengde@public.bta.net.cn

In a Chinese text corpus, a sentence is a continuous sequence of Chinese characters: apart from punctuation marks, there are no explicit delimiters between words (such as the spaces used in European languages). Word segmentation is therefore essential to the automatic processing of Chinese text corpora. The main approaches are as follows:

Rule-based matching approach: the Maximum Matching method (MM method), Reverse Maximum Matching method (RMM method), Bi-directional Matching method (BM method), Optimum Matching method (OM method), and Association Backtracking method (AB method).
Rule-based approach for dealing with Ambiguous Segmentation Strings (ASS): overlapping strings and combinative strings.
Hybrid approach (rules + statistics) for dealing with Unregistered Words (URW): personal names, place names, institution names, and new words.
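
The matching methods listed above can be illustrated with a minimal sketch of MM, RMM, and their comparison in the spirit of the BM method. The lexicon, the sample sentence, and the `max_len` parameter are illustrative assumptions and do not come from the abstract; the OM and AB methods are not covered. Where the forward and reverse scans disagree, the string is a candidate overlapping ambiguity.

```python
def forward_max_match(text, lexicon, max_len=4):
    """MM: scan left to right, taking the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; a single character is the fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """RMM: the same greedy idea, scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_match(text, lexicon, max_len=4):
    """Run both scans and flag a disagreement as a possible ambiguity."""
    fwd = forward_max_match(text, lexicon, max_len)
    rev = reverse_max_match(text, lexicon, max_len)
    return fwd, rev, fwd != rev


# Hypothetical toy lexicon and sentence, chosen to show an overlapping string.
LEXICON = {"研究", "研究生", "生命", "起源"}
fwd, rev, ambiguous = bidirectional_match("研究生命起源", LEXICON)
print(fwd, rev, ambiguous)
```

In this toy case the forward scan commits to 研究生 and strands 命, while the reverse scan yields 研究 / 生命 / 起源; the disagreement is what a disambiguation rule would then have to resolve.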

Automatic annotation of a Chinese text corpus means tagging the text with parts of speech (POS). The main approaches are as follows:

Tagging with linguistic rules: POS ambiguity is concentrated mainly in commonly used words (verbs, nouns, adjectives, etc.). Disambiguation is based on linguistic rules (morphology, grammar, semantics, context, etc.). Tagging precision: 77%.
Tagging with an HMM (Hidden Markov Model): ① manual creation of the training set; ② construction of an N-gram statistical model with lexical probabilities (emission probabilities) and contextual probabilities (transition probabilities); ③ tagging of the corpus with the CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm. Tagging precision: 95.16%.
Tagging with TBED (Transform-Based Error-Driven) learning, based on Brill's method: ① definition of the rule templates and generation of the rule space; ② automatic learning of rules according to the rule templates; ③ dynamic tracing of the learning process with the error matrix; ④ optimization of the rules. Tagging precision: 96.87%.
Hybrid (HMM + TBED) approach: ① initialization of the algorithm with the HMM; ② learning of rules with TBED; ③ tagging of the corpus with the learned rules. Tagging precision: 97.86%.
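
To make the roles of the lexical (emission) and contextual (transition) probabilities in the HMM item concrete, here is a minimal sketch of bigram HMM tagging. It uses standard Viterbi decoding, which is swapped in for illustration and is not the CLAWS algorithm named in the abstract. The tag labels (`r`, `v`, `n`), the sentence, and every probability value are invented toy assumptions, not figures from the author's training set.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence under a bigram HMM (log space)."""
    if not words:
        return []

    def lp(p):
        # Unseen events get a small floor probability instead of log(0).
        return math.log(p if p > 0 else floor)

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path


# Toy model (all values are assumptions): 研究 may be a noun or a verb.
TAGS = ["r", "v", "n"]
START_P = {"r": 0.6, "n": 0.3, "v": 0.1}
TRANS_P = {
    "r": {"v": 0.7, "n": 0.2, "r": 0.1},
    "v": {"n": 0.6, "v": 0.2, "r": 0.2},
    "n": {"n": 0.4, "v": 0.4, "r": 0.2},
}
EMIT_P = {
    "r": {"他": 0.5},
    "v": {"研究": 0.3},
    "n": {"研究": 0.2, "语言": 0.4},
}
print(viterbi(["他", "研究", "语言"], TAGS, START_P, TRANS_P, EMIT_P))
```

With these toy numbers the transition from a pronoun to a verb outweighs the alternative, so the ambiguous word 研究 is tagged as a verb.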

Segmentation and annotation are two key tasks in Chinese text corpus processing. Neither is easily solved with a single method. A hybrid approach that combines rule-based and statistical methods can improve the precision of both segmentation and tagging.
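
The hybrid idea can be sketched as error-driven rule learning applied on top of an initial statistical tagging. The sketch below assumes a single hypothetical rule template ("change tag a to b when the preceding tag is c"); the abstract does not specify the templates actually used, and the error-matrix tracing and rule optimization steps are omitted. The initial tags stand in for the output of a statistical tagger, and the tag labels and data are invented for illustration.

```python
from itertools import product


def apply_rule(tags, rule):
    """Apply rule (a, b, c): change tag a to b when the preceding tag is c."""
    a, b, c = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the tags as they were before this pass.
        if tags[i] == a and tags[i - 1] == c:
            out[i] = b
    return out


def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))


def learn_rules(initial, gold, tagset, max_rules=5):
    """Greedily pick the rule that removes the most errors, apply it, repeat."""
    current, learned = list(initial), []
    for _ in range(max_rules):
        base = count_errors(current, gold)
        best_rule, best_gain = None, 0
        for rule in product(tagset, repeat=3):
            if rule[0] == rule[1]:
                continue  # a rule that rewrites a tag to itself does nothing
            gain = base - count_errors(apply_rule(current, rule), gold)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:
            break  # no remaining rule reduces the error count
        learned.append(best_rule)
        current = apply_rule(current, best_rule)
    return learned, current


# Hypothetical data: the initial tagger mislabels verbs after pronouns as nouns.
INITIAL = ["r", "n", "n", "r", "n", "n"]
GOLD = ["r", "v", "n", "r", "v", "n"]
rules, corrected = learn_rules(INITIAL, GOLD, ["n", "r", "v"])
print(rules, corrected)
```

Here a single learned rule, "change n to v after r", repairs both errors left by the initial tagging, which mirrors the division of labor described above: statistics supply the first pass and rules patch its systematic mistakes.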

Keywords: segmentation, tagging, hybrid approach, rule-based approach, HMM (Hidden Markov Model), CLAWS (Constituent-Likelihood Automatic Word-tagging System) algorithm, TBED (Transform-Based Error-Driven), Brill method.

Back to Newsletter no. 9.