TY - GEN
T1 - Bayesian semi-supervised Chinese word segmentation for statistical machine translation
AU - Xu, Jia
AU - Gao, Jianfeng
AU - Toutanova, Kristina
AU - Ney, Hermann
PY - 2008
Y1 - 2008
AB - Words in Chinese text are not naturally separated by delimiters, which poses a challenge to standard machine translation (MT) systems. In MT, the widely used approach is to apply a Chinese word segmenter trained from manually annotated data, using a fixed lexicon. Such word segmentation is not necessarily optimal for translation. We propose a Bayesian semi-supervised Chinese word segmentation model which uses both monolingual and bilingual information to derive a segmentation suitable for MT. Experiments show that our method improves a state-of-the-art MT system in a small and a large data environment.
UR - http://www.scopus.com/inward/record.url?scp=80053424888&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053424888&partnerID=8YFLogxK
U2 - 10.3115/1599081.1599209
DO - 10.3115/1599081.1599209
M3 - Conference contribution
AN - SCOPUS:80053424888
SN - 9781905593446
T3 - Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
SP - 1017
EP - 1024
BT - Coling 2008 - 22nd International Conference on Computational Linguistics, Proceedings of the Conference
T2 - 22nd International Conference on Computational Linguistics, Coling 2008
Y2 - 18 August 2008 through 22 August 2008
ER -