Integrated Chinese Word Segmentation in Statistical Machine Translation

Jia Xu, Evgeny Matusov, Richard Zens, Hermann Ney

Research output: Contribution to conference › Paper › peer-review

25 Scopus citations

Abstract

A Chinese sentence is represented as a sequence of characters, and words are not separated from each other. In statistical machine translation, the conventional approach is to segment the Chinese character sequence into words during pre-processing; training and translation are performed afterwards. However, this method is not optimal for two reasons: (1) the segmentations may be erroneous, and (2) for a given character sequence, the best segmentation depends on its context and translation. To minimize translation errors, we take different segmentation alternatives into account instead of a single segmentation, and we integrate the segmentation process with the search for the best translation. The segmentation decision is taken only during the generation of the translation. With this method we are able to translate Chinese text at the character level. Experiments on the IWSLT 2005 task showed improvements in translation performance for two translation systems: a phrase-based system and a finite-state-transducer-based system. For the phrase-based system, the BLEU score improves by 1.5% absolute.
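The idea of deferring the segmentation decision until translation time can be illustrated with a minimal sketch. It is not the authors' system: the lexicon `TOY_LEXICON`, its probabilities, and the function `translate_chars` are hypothetical, and the search is a simple monotone dynamic program with no reordering or language model. It only shows how every segmentation alternative of a character sequence can compete inside a single search, so that the segmentation falls out of the best-scoring translation.

```python
import math

# Hypothetical toy lexicon: Chinese word -> (English gloss, log-probability).
# The entries and scores are invented for illustration only.
TOY_LEXICON = {
    "北": ("north", math.log(0.2)),
    "京": ("capital", math.log(0.2)),
    "北京": ("Beijing", math.log(0.9)),
    "大": ("big", math.log(0.3)),
    "学": ("study", math.log(0.3)),
    "大学": ("university", math.log(0.8)),
    "北京大学": ("Peking University", math.log(0.95)),
}


def translate_chars(chars, lexicon, max_word_len=4):
    """Jointly pick a segmentation and a monotone word-by-word translation.

    best[i] holds the highest-scoring (score, segmentation, translation)
    covering the character prefix chars[:i], or None if it is unreachable.
    """
    n = len(chars)
    best = [None] * (n + 1)
    best[0] = (0.0, [], [])
    for end in range(1, n + 1):
        # Try every candidate word chars[start:end] that ends at `end`.
        for start in range(max(0, end - max_word_len), end):
            if best[start] is None:
                continue
            word = chars[start:end]
            if word not in lexicon:
                continue
            gloss, logp = lexicon[word]
            score = best[start][0] + logp
            if best[end] is None or score > best[end][0]:
                best[end] = (
                    score,
                    best[start][1] + [word],
                    best[start][2] + [gloss],
                )
    return best[n]


result = translate_chars("北京大学", TOY_LEXICON)
score, segmentation, translation = result
print(segmentation, translation)
```

Because the input is the raw character string, no segmentation is fixed beforehand; the segmentation returned is simply the one used by the best-scoring translation hypothesis under the toy scores.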

Original language: English
Pages: 131-137
Number of pages: 7
State: Published - 2005
Event: 2nd International Workshop on Spoken Language Translation, IWSLT 2005 - Pittsburgh, United States
Duration: 24 Oct 2005 to 25 Oct 2005

Conference

Conference: 2nd International Workshop on Spoken Language Translation, IWSLT 2005
Country/Territory: United States
City: Pittsburgh
Period: 24/10/05 to 25/10/05
