Bayesian semi-supervised Chinese word segmentation for statistical machine translation

Jia Xu, Jianfeng Gao, Kristina Toutanova, Hermann Ney

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

53 Scopus citations

Abstract

Words in Chinese text are not naturally separated by delimiters, which poses a challenge to standard machine translation (MT) systems. In MT, the widely used approach is to apply a Chinese word segmenter trained from manually annotated data, using a fixed lexicon. Such word segmentation is not necessarily optimal for translation. We propose a Bayesian semi-supervised Chinese word segmentation model which uses both monolingual and bilingual information to derive a segmentation suitable for MT. Experiments show that our method improves a state-of-the-art MT system in a small and a large data environment.

Original language: English
Title of host publication: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Pages: 1017-1024
Number of pages: 8
State: Published - 2008
Event: 22nd International Conference on Computational Linguistics, Coling 2008 - Manchester, United Kingdom
Duration: 18 Aug 2008 to 22 Aug 2008

Publication series

Name: Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
Volume: 1

Conference

Conference: 22nd International Conference on Computational Linguistics, Coling 2008
Country/Territory: United Kingdom
City: Manchester
Period: 18/08/08 to 22/08/08
