Abstract
In Chinese texts, words are not separated by white spaces. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: the first is the maximization of the final translation quality; the second is the minimization of the manual effort for building a translation system. The commonly used method for obtaining word boundaries is based on a word segmentation tool and a predefined monolingual dictionary. To avoid the dependence of the translation system on an external dictionary, we have developed a system that learns a domain-specific dictionary from the parallel training corpus. This method produces results comparable to those obtained with the predefined dictionary. Furthermore, our translation system is able to work without word segmentation with only a minor loss in translation quality.
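The dictionary-based segmentation mentioned above is often implemented with greedy forward maximum matching, where the segmenter repeatedly takes the longest dictionary entry starting at the current position. The paper does not specify which algorithm the external tool uses, so the sketch below is only an illustrative assumption, using Latin letters to stand in for Chinese characters:

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there; fall back to a single
    character when no entry matches."""
    words = []
    i = 0
    while i < len(text):
        # Try candidate spans from longest to shortest.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy example with a hypothetical dictionary.
dictionary = {"ab", "abc", "cd"}
print(max_match("abcd", dictionary))  # ['abc', 'd']
```

Note the greedy choice: `"abc"` is preferred over `"ab"`, so `"cd"` can no longer be matched. This ambiguity is one reason segmentation decisions can affect downstream translation quality.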
| Original language | English |
|---|---|
| Pages | 122-128 |
| Number of pages | 7 |
| State | Published - 2004 |
| Event | 3rd SIGHAN Workshop on Chinese Language Processing, SIGHAN@ACL 2004 - Barcelona, Spain. Duration: 25 Jul 2004 → … |
Conference
| Conference | 3rd SIGHAN Workshop on Chinese Language Processing, SIGHAN@ACL 2004 |
|---|---|
| Country/Territory | Spain |
| City | Barcelona |
| Period | 25/07/04 → … |