Enhancing Chinese word segmentation using unlabeled data

Weiwei Sun, Jia Xu

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

88 Scopus citations

Abstract

This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution for incorporating features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve system recall for out-of-vocabulary (OOV) words that appear more than once in a document. The novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.
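The abstract does not spell out which string statistics or feature templates are used, so the following is only a minimal sketch of the document-level idea, under stated assumptions: raw character n-gram counts stand in for the "string statistics", and a simple "occurs at least twice in this document" flag stands in for a transductive feature. The names `char_ngram_counts` and `recurrence_features` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngram_counts(text, max_n=4):
    """Count every character n-gram of length 1..max_n in unsegmented text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def recurrence_features(counts, text, i, max_n=4):
    """Hypothetical features for position i: does the n-gram starting at i
    occur at least twice in the text the counts were built from?"""
    feats = {}
    for n in range(2, max_n + 1):
        gram = text[i:i + n]
        if len(gram) == n:  # skip n-grams that run past the end of the text
            feats[f"repeat_{n}"] = counts[gram] >= 2
    return feats


# Illustrative document (an assumption, not data from the paper):
# the name at the start of each sentence appears twice.
doc = "奥巴马访问北京。奥巴马发表讲话。"
doc_counts = char_ngram_counts(doc)
name_feats = recurrence_features(doc_counts, doc, 0, max_n=3)
other_feats = recurrence_features(doc_counts, doc, 3, max_n=3)
```

In this sketch the same counting function could be run over a large corpus instead of a single document; how such counts are turned into features for the character-based segmenter is a design choice that the abstract leaves open.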

Original language: English
Title of host publication: EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 970-979
Number of pages: 10
State: Published - 2011
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2011 - Edinburgh, United Kingdom
Duration: 27 Jul 2011 to 31 Jul 2011

Publication series

Name: EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: Conference on Empirical Methods in Natural Language Processing, EMNLP 2011
Country/Territory: United Kingdom
City: Edinburgh
Period: 27/07/11 to 31/07/11
