Hybrid Approaches for Automatic Segmentation
and Annotation of Chinese Text Corpus
Feng Zhiwei
Institute of Applied Linguistics
The State Language Commission of China
Beijing, China
e-mail: zwfengde@public.bta.net.cn
In a Chinese text corpus, a sentence is a continuous sequence of
Chinese characters: apart from some punctuation marks, there are no explicit
delimiters between words (such as the spaces used in European languages).
Word segmentation is therefore essential in the automatic
processing of Chinese text corpora. The main approaches are as follows:
- Rule-based matching approach: Maximum Matching method (MM
method), Reverse Maximum Matching method (RMM method), Bi-direction
Matching method (BM method), Optimum Matching method (OM
method), Association Backtracking method (AB method).
- Rule-based approach for dealing with the Ambiguous Segmentation
Strings (ASS): overlapping strings and combinative strings.
- Hybrid approach (rule + statistics) for dealing with the Unregistered
Words (URW): personal names, place names, institution names, new words.
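
The matching methods listed above can be illustrated with a minimal sketch. It assumes a tiny hypothetical lexicon and a maximum word length of four characters, and it shows only forward (MM) and reverse (RMM) maximum matching, plus a bidirectional comparison that flags a string as potentially ambiguous when the two passes disagree. The function names, the lexicon, and the `max_len` parameter are illustrative assumptions, not the author's implementation.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right longest match (MM)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left longest match (RMM)."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                words.append(piece)
                j -= size
                break
    words.reverse()
    return words


def bidirectional_match(text, lexicon, max_len=4):
    """Run both passes; a disagreement marks a candidate ambiguous string."""
    fwd = forward_max_match(text, lexicon, max_len)
    bwd = reverse_max_match(text, lexicon, max_len)
    return fwd, bwd, fwd != bwd


# Hypothetical toy lexicon, chosen to produce an overlapping ambiguity.
LEXICON = {"研究", "研究生", "生命", "起源"}
fwd, bwd, ambiguous = bidirectional_match("研究生命起源", LEXICON)
```

With this toy lexicon the forward pass groups 研究生 first, while the reverse pass yields 研究 / 生命 / 起源, so the string is flagged for the ambiguity-resolution rules.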
Automatic annotation of a Chinese text corpus means tagging the
text with POS (part-of-speech) labels. The main approaches are as follows:
- Tagging with linguistic rules: POS ambiguity is concentrated mainly in
commonly used words (verbs, nouns, adjectives, etc.). Disambiguation
is based on linguistic rules (morphology, grammar, semantics, context,
etc.). Tagging precision: 77%.
- Tagging with HMM (Hidden Markov Model): ① manually
creating the training set, ② constructing the N-gram statistical model:
lexical probability (emission probability) and contextual probability
(transition probability), ③ tagging the corpus with the CLAWS
(Constituent-Likelihood Automatic Word-tagging System) algorithm. Tagging
precision: 95.16%.
- Tagging with TBED (Transformation-Based Error-Driven) learning, based on
Brill's method: ① defining the rule templates and generating the rule
space, ② automatically learning rules according to the rule templates,
③ dynamically tracing the learning process with the error matrix,
④ optimizing the rules. Tagging precision: 96.87%.
- Hybrid (HMM + TBED) approach: ① initializing the algorithm
with HMM, ② learning the rules with TBED, ③ tagging the
corpus with the learned rules. Tagging precision: 97.86%.
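
The HMM step in the list above combines lexical (emission) and contextual (transition) probabilities. The following is a minimal sketch of that idea, assuming a bigram model with add-one smoothing and standard Viterbi decoding as a stand-in for the CLAWS algorithm named in the abstract. The three-sentence training set, the tag labels (r, v, n, q, d, a), and all function names are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def train_hmm(tagged_sentences):
    """Count emissions, tag bigrams, and tag totals from a tagged corpus."""
    emit, trans, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"                      # sentence-start pseudo tag
        tag_count[prev] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1        # lexical (emission) counts
            trans[(prev, tag)] += 1       # contextual (transition) counts
            tag_count[tag] += 1
            prev = tag
    return emit, trans, tag_count


def log_prob(count, total, n_outcomes, alpha=1.0):
    """Additively smoothed log probability."""
    return math.log((count + alpha) / (total + alpha * n_outcomes))


def viterbi(words, emit, trans, tag_count, vocab):
    """Return the most probable tag sequence under the bigram HMM."""
    tags = [t for t in tag_count if t != "<s>"]
    n_tags, n_words = len(tags), len(vocab) + 1   # +1 slot for unseen words

    def lp_trans(p, t):
        return log_prob(trans[(p, t)], tag_count[p], n_tags)

    def lp_emit(t, w):
        return log_prob(emit[(t, w)], tag_count[t], n_words)

    best = {t: lp_trans("<s>", t) + lp_emit(t, words[0]) for t in tags}
    back = []
    for w in words[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[p] + lp_trans(p, t))
            new_best[t] = best[prev] + lp_trans(prev, t) + lp_emit(t, w)
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(back):       # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical, hand-made training set; 研究 occurs as both verb and noun.
TRAIN = [
    [("他", "r"), ("研究", "v"), ("语言", "n")],
    [("这", "r"), ("项", "q"), ("研究", "n"), ("很", "d"), ("重要", "a")],
    [("我", "r"), ("学习", "v"), ("语言", "n")],
]
EMIT, TRANS, TAG_COUNT = train_hmm(TRAIN)
VOCAB = {w for sent in TRAIN for w, _ in sent}
tags_a = viterbi(["他", "研究", "语言"], EMIT, TRANS, TAG_COUNT, VOCAB)
tags_b = viterbi(["这", "项", "研究"], EMIT, TRANS, TAG_COUNT, VOCAB)
```

In this toy setting the ambiguous word 研究 is resolved by context: it is tagged as a verb after a pronoun and as a noun after a classifier.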
Segmentation and annotation are two key tasks in
Chinese text corpus processing. They are difficult to solve with
a single method alone. A hybrid approach combining rule-based and
statistical methods can improve the precision of segmentation and tagging.
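
The hybrid tagging idea (statistical initial tags, then learned correction rules) can be sketched in simplified form. The code below assumes a single hypothetical rule template, "change tag A to tag B when the previous tag is C", and a greedy error-driven loop that keeps the rule with the largest net gain in correct tags. The initial tags are hand-written stand-ins for the output of a statistical tagger. This is not the author's rule template set, error matrix, or optimization step.

```python
from collections import Counter


def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) to a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Context is read from the input tags, so changes do not cascade.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out


def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))


def learn_rules(initial, gold, max_rules=10):
    """Greedy error-driven learning over the single rule template."""
    current = [list(s) for s in initial]
    rules = []
    for _ in range(max_rules):
        # Each remaining error proposes one candidate rule.
        candidates = Counter()
        for cur, ref in zip(current, gold):
            for i in range(1, len(cur)):
                if cur[i] != ref[i]:
                    candidates[(cur[i], ref[i], cur[i - 1])] += 1
        # Score candidates by net change in correct tags (fixes minus damage).
        best_rule, best_net = None, 0
        for rule in candidates:
            net = sum(
                count_correct(apply_rule(cur, rule), ref) - count_correct(cur, ref)
                for cur, ref in zip(current, gold)
            )
            if net > best_net:
                best_rule, best_net = rule, net
        if best_rule is None:             # no rule improves the corpus
            break
        rules.append(best_rule)
        current = [apply_rule(s, best_rule) for s in current]
    return rules


# Hypothetical data: initial tags stand in for a statistical tagger's output.
GOLD = [["r", "q", "n"], ["r", "v", "n"], ["d", "v"]]
INITIAL = [["r", "q", "v"], ["r", "v", "n"], ["d", "v"]]
learned = learn_rules(INITIAL, GOLD)
retagged = [INITIAL[0]]
for rule in learned:
    retagged = [apply_rule(s, rule) for s in retagged]
```

Here the only initial error is a noun mistagged as a verb after a classifier, so the learner proposes a single rule that changes v to n when the previous tag is q.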
Keywords: segmentation, tagging, hybrid approach, rule-based
approach, HMM (Hidden Markov Model), CLAWS (Constituent-Likelihood
Automatic Word-tagging System) algorithm, TBED (Transformation-Based
Error-Driven), Brill method.