Cultivating large-scale parallel corpora from comparable patents: From bilingual to trilingual, and beyond
從專利文本建立大型平行語料庫:由雙語到多語
Bin Lu 路斌; Benjamin Tsou 鄒嘉彥; Ka Po Chow 周嘉寶
Abstract 摘要
Parallel corpora are critical resources for many NLP applications, ranging from machine translation (MT) to cross-lingual information retrieval. In this chapter, we explore a new but important domain, patents, by investigating the potential of comparable multilingual patents for building large-scale parallel corpora. We address two major issues: (1) how to build large-scale corpora of comparable patents involving many languages, and (2) how to mine high-quality parallel sentences from these comparable patents. Three bilingual parallel corpora and one trilingual parallel corpus are presented as examples, and some preliminary statistical machine translation (SMT) experiments are reported. Moreover, we show the considerable potential of multilingual patents as a source of large-scale parallel corpora for a wide variety of languages, such as English, Chinese, Japanese, Korean, and German, which would to some extent reduce the parallel data acquisition bottleneck in multilingual information processing.
Keywords 關鍵詞
Multilingual patents 多語專利; PCT patents PCT專利; Parallel corpora 平行語料庫; Machine translation 機器翻譯; Sentence alignment 句對