Unknown Chinese composite words tagging using selective back-off smoothing
漢語未登錄複合詞的詞性標注
Samuel W.K. Chan 陳偉光; Mickey W.C. Chong 莊華祥; Tom B.Y. Lai 黎邦洋
Abstract 摘要
The aim of this research is to tag unknown Chinese words with their part-of-speech (POS). Even narrow coverage of unknown words produces explosive ambiguity in natural language processing. At the same time, completely unsupervised, fine-grained POS tagging is impossible without help from lexicographers. In this research, we propose a means of unlocking POS tags based on two important features: word structure and word sequence in raw text. A similarity-based technique is employed to classify an unknown word using its orthographic form and its contextual neighbors, without becoming trapped in a subjective linguistic quagmire. The technique produces a good estimate of the POS tags of Chinese compound words before they are fed into a tagger. A recursive inferential mechanism is also devised to alleviate the ripple effect of changes made to a word's neighbors during tagging. The approach is justified using a compound-word database of more than 53,500 words. Experimental results on 500,000 words show that the approach outperforms its counterparts.
Keywords 關鍵詞
Part-of-speech tagging 詞性標注; Chinese word structures 漢語複合詞內部結構; Morphemes 語素; Machine learning 機器學習