Partitioning parallel documents using binary segmentation

Jia Xu, Richard Zens, Hermann Ney

Research output: Contribution to conference › Paper › peer-review

6 Scopus citations

Abstract

In statistical machine translation, large numbers of parallel sentences are required to train the model parameters. However, many of the bilingual language resources available on the web are aligned only at the document level. To exploit this data, we have to extract the bilingual sentences from these documents. The common method is to break the documents into segments using predefined anchor words and then align these segments. This approach is not error-free, and incorrect alignments may decrease the translation quality. We present an alternative approach that extracts the parallel sentences by partitioning a bilingual document pair into two sub-pairs. This process is performed recursively until all the sub-pairs are short enough. In experiments on the Chinese-English FBIS data, our method produced translation results comparable to those of a state-of-the-art sentence aligner. Combining the two approaches leads to better translation performance.
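The recursive partitioning described in the abstract can be illustrated with a minimal sketch. It assumes token lists as input and only considers splits that keep the left and right parts in the same order on both sides. The scoring function `length_score` is a hypothetical stand-in based on length proportions, not the model used in the paper. The names `binary_segment` and `max_len` are also illustrative.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Pair = Tuple[Tokens, Tokens]


def length_score(src_left: Tokens, src_right: Tokens,
                 tgt_left: Tokens, tgt_right: Tokens) -> float:
    """Hypothetical stand-in score (an assumption, not the paper's model):
    prefer splits where both sides are cut at similar relative positions."""
    src_total = len(src_left) + len(src_right)
    tgt_total = len(tgt_left) + len(tgt_right)
    return -abs(len(src_left) / src_total - len(tgt_left) / tgt_total)


def binary_segment(src: Tokens, tgt: Tokens, max_len: int,
                   score: Callable[..., float] = length_score) -> List[Pair]:
    """Recursively split a (source, target) pair into two sub-pairs until
    every sub-pair is short enough or cannot be split any further."""
    short_enough = len(src) <= max_len and len(tgt) <= max_len
    if short_enough or len(src) < 2 or len(tgt) < 2:
        return [(src, tgt)]

    # Try every split point on both sides and keep the best-scoring one.
    candidates = ((i, j) for i in range(1, len(src))
                  for j in range(1, len(tgt)))
    i, j = max(candidates,
               key=lambda ij: score(src[:ij[0]], src[ij[0]:],
                                    tgt[:ij[1]], tgt[ij[1]:]))

    # Recurse on the left and right sub-pairs.
    return (binary_segment(src[:i], tgt[:j], max_len, score)
            + binary_segment(src[i:], tgt[j:], max_len, score))


if __name__ == "__main__":
    source = "a b c d e f g h".split()
    target = "A B C D".split()
    for s, t in binary_segment(source, target, max_len=2):
        print(s, "|||", t)
```

Because `score` is a parameter, a lexicon-based or other alignment score could be passed in place of the length heuristic without changing the recursion.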

Original language: English
Pages: 78-85
Number of pages: 8
State: Published - 2006
Event: 2006 Workshop on Statistical Machine Translation, WMT 2006, collocated with the HLT-NAACL 2006 - New York City, United States
Duration: 8 Jun 2006 → 9 Jun 2006

Conference

Conference: 2006 Workshop on Statistical Machine Translation, WMT 2006, collocated with the HLT-NAACL 2006
Country/Territory: United States
City: New York City
Period: 8/06/06 → 9/06/06
