Evaluating Chinese web-as-corpus: Some methodological considerations
漢語網路語料庫評測方法初探
Shu-Kai Hsieh 謝舒凱
Abstract 摘要
Building corpora from the Web has become one of the most important issues in corpus development, owing to the Web's tremendous size, geographic and social range, up-to-dateness, multimodality, and wide availability at minimal cost. Many Web-as-Corpus (WaC) construction tools are also freely available. In this paper, however, I argue that the intricate orthography of Chinese makes a sound methodology for evaluating newly emerging Chinese WaC resources urgently needed. The Web has been used for corpus construction in a wide range of ways, and various measures have been used to compare traditional corpora with web corpora. The two main approaches are to acquire web content and process it into a static corpus (WaC, Web-as-Corpus), and to access the Web directly as a dynamic corpus (WfC, Web-for-Corpus). I introduce our work on constructing twWaC (Taiwan Web as Corpus) at National Taiwan University and explain the problems we encountered. Two statistical measures, taken from a distributional point of view, are proposed to illustrate the differences between a scaled twWaC and the ASBC (Academia Sinica Balanced Corpus).
Keywords 關鍵詞
Web corpus 網路語料庫; [Modern] Chinese corpus 現代漢語語料庫; Segmentation [in Chinese] 中文分詞; Corpora comparison 語料庫比較