From corpus to grammar: Automatic extraction of grammatical relations from annotated corpus
語料與語法:標記語料中語法關係的自動抽取
Chu-Ren Huang 黃居仁; Jia-Fei Hong 洪嘉馡; Wei-Yun Ma 馬偉雲; Petr Šimon 石穆
Abstract 摘要
Automatic extraction of grammatical knowledge from corpora has been one of the ultimate goals and challenges of corpus linguistics. In this paper we present one approach to this challenge in Chinese corpus linguistics by introducing our recent work using the Sketch Engine (SkE, also known as Word Sketch Engine) platform to automatically extract grammatical relations from PoS-annotated Chinese corpora. The SkE approach requires both gigaword-size corpora and comprehensive lexico-grammatical information about the language in question. On the one hand, corpus size is crucial because the automatic extraction of grammatical relations requires enough instances of each relation pair, which in turn requires an exponential jump in size from the million-word corpora that suffice for observing single lexical items. On the other hand, lexico-grammatical information is crucial to the identification of potential relational pairs based on local context, so the quality of the extraction depends on the quality of the available lexico-grammatical knowledge. We show that a comprehensive lexical grammar, based on Information-based Case Grammar (Chen & Huang 1990) and covering over 40 thousand verbs, greatly helps the accuracy and recall of grammatical relation detection. The paper concludes by underlining the importance of integrating existing grammatical information to meet the challenge of automatic extraction of grammatical knowledge from large corpora.
Keywords 關鍵詞
Mandarin Chinese 漢語; Grammatical knowledge 語法知識; Automatic extraction 自動抽取; Lexical grammar 詞彙語法; Sketch Engine 速描引擎