Statistical Word Segmentation
Tung-Hui Chiang 江东辉; Jing-Shin Chang 张景新; Ming-Yu Lin 林铭裕; Keh-Yih Su 苏克毅

Abstract
A Chinese sentence has no word delimiters, such as white space, between “words”. Therefore, word boundaries must be identified before any further processing can proceed. The same is true of other languages, such as Japanese. To identify words, traditional heuristic approaches rely on dictionary lookup, morphological rules, and heuristics such as matching the longest matchable dictionary entry. Because of the complicated linguistic phenomena involved in Chinese morphology and syntax, such approaches may not be applicable to large systems. In this paper, the various features available in a sentence are used to construct a generalized word segmentation formula, from which various probabilistic models for word segmentation are derived. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly reflects the true ranking of the candidate segmentation patterns. To enhance the baseline models, the model parameters are therefore further adjusted with an adaptive and robust learning algorithm. Simulations show that cost-effective word segmentation can be achieved in a variety of contexts with the proposed models. By incorporating word length information into a simple context-independent word model and applying the robust adaptive learning algorithm to the segmentation problem, a word recognition rate of 99.39% and a sentence recognition rate of 97.65% are achieved on the test corpus. Furthermore, the assumption that all lexical entries can be found in the system dictionary usually does not hold in real applications; such unknown words can severely degrade segmentation accuracy. This “unknown word problem” is therefore examined for each of the word segmentation models used here, and some prospective guidelines for handling it are suggested.
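The probabilistic formulation can be made concrete. Under a context-independent (unigram) word model, one simple instantiation of the generalized segmentation formula, the preferred segmentation of a character string is the dictionary-consistent word sequence with the highest product of word probabilities. The notation below is illustrative rather than the paper's own:

    % Unigram segmentation criterion (illustrative notation, not the paper's):
    % c_1 ... c_n is the input character string; a candidate segmentation is a
    % word sequence w_1 ... w_m whose concatenation equals c_1 ... c_n.
    \hat{w}_1^m
      = \operatorname*{arg\,max}_{w_1^m \,:\, w_1 \cdots w_m \,=\, c_1^n}
        \prod_{i=1}^{m} P(w_i)
      = \operatorname*{arg\,max}_{w_1^m} \sum_{i=1}^{m} \log P(w_i)

Word length information of the kind the abstract mentions can be folded into such a model by weighting each factor with a length-dependent term; the exact form used in the paper is not given in the abstract.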

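To contrast the traditional heuristic with the statistical approach, the following minimal sketch (our own illustration; the toy LEXICON, its probabilities, and the function names are invented for the example and do not come from the paper) implements greedy longest-first matching alongside a dynamic-programming search for the unigram criterion above:

    import math

    # Toy lexicon: word -> estimated probability. In a real system these
    # estimates come from a segmented corpus; the values here are invented.
    LEXICON = {
        "发展": 0.05, "中": 0.02, "中国": 0.04,
        "国": 0.01, "国家": 0.05, "家": 0.01,
    }
    MAX_WORD_LEN = max(len(w) for w in LEXICON)

    def maximum_matching(sentence):
        """Traditional heuristic: greedily take the longest dictionary entry
        matching at the current position (single characters as a last resort)."""
        words, i = [], 0
        while i < len(sentence):
            for j in range(min(len(sentence), i + MAX_WORD_LEN), i, -1):
                if sentence[i:j] in LEXICON or j == i + 1:
                    words.append(sentence[i:j])
                    i = j
                    break
        return words

    def unigram_segment(sentence):
        """Dynamic-programming search for the segmentation that maximizes
        sum(log P(w)) under a context-independent word model."""
        n = len(sentence)
        best = [-math.inf] * (n + 1)  # best[k]: best log-prob of sentence[:k]
        back = [0] * (n + 1)          # back[k]: start index of the last word
        best[0] = 0.0
        for k in range(1, n + 1):
            for i in range(max(0, k - MAX_WORD_LEN), k):
                word = sentence[i:k]
                if word in LEXICON and best[i] + math.log(LEXICON[word]) > best[k]:
                    best[k] = best[i] + math.log(LEXICON[word])
                    back[k] = i
        words, k = [], n              # recover the word sequence via back-pointers
        while k > 0:
            words.append(sentence[back[k]:k])
            k = back[k]
        return list(reversed(words))

    print(maximum_matching("发展中国家"))  # -> ['发展', '中国', '家']
    print(unigram_segment("发展中国家"))   # -> ['发展', '中', '国家']

On this example (“developing countries”) the greedy longest match commits to 中国 too early, while the dynamic program, which implicitly scores every dictionary-consistent segmentation, recovers the higher-likelihood analysis 发展/中/国家.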
