The Hong Kong Cantonese corpus: Design and uses
香港粵語語料庫的設計和用途
Kang Kwong Luke 陸鏡光; May L.Y. Wong 王麗賢
Abstract 摘要
The Hong Kong Cantonese Corpus (HKCC) was built to give researchers and language learners access to a body of naturally occurring talk drawn from everyday conversations between speakers of Cantonese in Hong Kong.1 In this paper, we describe the origin, rationale, design principles and uses of HKCC. In particular, we focus on the following aspects of the corpus: (1) data collection procedures; (2) transcription and orthographic conventions; (3) encoding schemes; (4) segmentation and POS tagging; and (5) potential uses of the corpus and future directions.
Keywords 關鍵詞
Speech corpus 口語語料庫; Conversation 日常會話; Cantonese 粵語; Naturally occurring talk 自然語言材料; Corpus design 語料庫設計