A descriptive text data format for Chinese language corpora
汉语语料库的文本描述
Aiping Fu 傅爱平; Hong Zhang 张弘
Abstract 摘要
This paper proposes a general-purpose text data format for documents in Chinese language corpora. The format describes the archival structure and other attributes of the documents with a set of markup elements defined using XML Schema; it is therefore called the XML Schema for Corpora, or XSC for short. The XSC is intended 1) to carry the basic textual structural information of documents in both raw and annotated corpora; 2) to describe the linguistic features in annotated corpora according to the different annotations; 3) to be open-ended, in the sense that users can define document-specific element types within the hierarchical and nestable framework of the XSC; and 4) to allow the documents to be converted into XML data files and processed with automatic tools such as XML database management systems, indexing software, and other transformation tools. The paper presents the framework and applications of the XSC, with examples taken from the XSC-based Chinese language corpus built by the authors.
Keywords 关键词
Chinese language corpora 汉语语料库; Description of corpus documents 语料文档的描述; XML-based text data structure 基于XML的文本数据结构; Corpus annotation 语料库标注; XML Schema