Do we need Chinese word segmentation for statistical machine translation?

Jia Xu, Richard Zens, Hermann Ney

Research output: Contribution to conference › Paper › peer-review

Abstract

In Chinese texts, words are not separated by white space. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: to maximize the final translation quality and to minimize the manual effort required to build a translation system. The common method for obtaining word boundaries relies on a word segmentation tool and a predefined monolingual dictionary. To avoid making the translation system dependent on an external dictionary, we have developed a system that learns a domain-specific dictionary from the parallel training corpus. This method produces results comparable to those obtained with the predefined dictionary. Furthermore, our translation system is able to work without word segmentation, with only a minor loss in translation quality.
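
The abstract only sketches the approach, so the following Python fragment is a minimal illustration, not the authors' implementation. It assumes GIZA++-style word alignments are available as (Chinese character index, English word index) pairs per sentence pair, and shows one plausible way a domain-specific dictionary could be learned from the parallel corpus and then applied with greedy maximal matching. All function names and thresholds (min_count, max_len) are hypothetical.

from collections import Counter

def extract_dictionary(zh_sentences, alignments, min_count=5, max_len=4):
    """Collect Chinese character n-grams whose characters all align to the
    same English word, and keep the frequent ones as dictionary entries.
    Illustrative sketch; thresholds are assumptions, not the paper's values."""
    counts = Counter()
    for chars, links in zip(zh_sentences, alignments):
        # Group aligned Chinese character positions by their English target word.
        by_target = {}
        for zh_i, en_j in links:
            by_target.setdefault(en_j, []).append(zh_i)
        for positions in by_target.values():
            positions.sort()
            # A contiguous run of characters is a multi-character word candidate.
            if positions == list(range(positions[0], positions[-1] + 1)):
                candidate = chars[positions[0]:positions[-1] + 1]
                if 1 < len(candidate) <= max_len:
                    counts[candidate] += 1
    return {word for word, c in counts.items() if c >= min_count}

def segment(chars, dictionary, max_len=4):
    """Greedy left-to-right maximal matching with the learned dictionary;
    unmatched characters are emitted as single-character tokens."""
    tokens, i = [], 0
    while i < len(chars):
        for length in range(min(max_len, len(chars) - i), 1, -1):
            if chars[i:i + length] in dictionary:
                tokens.append(chars[i:i + length])
                i += length
                break
        else:
            tokens.append(chars[i])
            i += 1
    return tokens

Note that translating without word segmentation, as tested in the paper, corresponds in this sketch to the degenerate case of an empty dictionary: segment() then emits every character as its own token.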

Original language: English
Pages: 122-128
Number of pages: 7
State: Published - 2004
Event: 3rd SIGHAN Workshop on Chinese Language Processing, SIGHAN@ACL 2004 - Barcelona, Spain
Duration: 25 Jul 2004 → …

Conference

Conference: 3rd SIGHAN Workshop on Chinese Language Processing, SIGHAN@ACL 2004
Country/Territory: Spain
City: Barcelona
Period: 25/07/04 → …
