Abstract
In statistical machine translation, word alignment models are trained on bilingual corpora. Long sentences pose two severe problems: (1) high computational requirements and (2) poor quality of the resulting word alignment. We present a sentence-segmentation method that solves these problems by splitting long sentence pairs. Our approach uses lexicon information to locate the optimal split point. The method is evaluated on two Chinese-English translation tasks in the news domain. We show that segmenting long sentences before training significantly improves the final translation quality of a state-of-the-art machine translation system. In one of the tasks, we achieve a relative improvement in BLEU score of more than 20%.
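
The abstract does not spell out how the lexicon is used to pick a split point, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a monotone split (the left source part aligns to the left target part), scores each half with an IBM Model 1 style average of lexicon probabilities (without the NULL word, for brevity), and picks the split with the highest combined log-score. The names `segment_score` and `best_split`, the `lexicon` dictionary layout, and the `floor` smoothing value are all hypothetical.

```python
import math


def segment_score(src, tgt, lexicon, floor=1e-6):
    """Log-score of target words given source words, IBM Model 1 style.

    `lexicon` maps (source_word, target_word) -> probability; this layout is
    an assumption. Unseen pairs fall back to `floor` so the log is defined.
    """
    score = 0.0
    for t in tgt:
        # Average the lexicon probability of t over all source words.
        avg = sum(lexicon.get((s, t), floor) for s in src) / len(src)
        score += math.log(avg)
    return score


def best_split(src, tgt, lexicon, floor=1e-6):
    """Return (j, i) such that src[:j]/tgt[:i] and src[j:]/tgt[i:] score best.

    Only monotone splits are considered (an assumption of this sketch).
    Returns None if either sentence has fewer than two tokens.
    """
    best, best_score = None, -math.inf
    for j in range(1, len(src)):
        for i in range(1, len(tgt)):
            score = (segment_score(src[:j], tgt[:i], lexicon, floor)
                     + segment_score(src[j:], tgt[i:], lexicon, floor))
            if score > best_score:
                best, best_score = (j, i), score
    return best
```

Applied recursively to any half that is still too long, such a routine would shorten the sentence pairs passed to alignment training. The exhaustive search over all (j, i) pairs is quadratic in the number of candidate splits and is kept here only for clarity.
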
| Original language | English |
|---|---|
| Pages | 280-287 |
| Number of pages | 8 |
| State | Published - 2005 |
| Event | 10th Annual Conference on European Association for Machine Translation, EAMT 2005 - Budapest, Hungary; Duration: 30 May 2005 → 31 May 2005 |
Conference
| Conference | 10th Annual Conference on European Association for Machine Translation, EAMT 2005 |
|---|---|
| Country/Territory | Hungary |
| City | Budapest |
| Period | 30/05/05 → 31/05/05 |