TY - GEN
T1 - Enhancing Chinese word segmentation using unlabeled data
AU - Sun, Weiwei
AU - Xu, Jia
PY - 2011
Y1 - 2011
N2 - This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to incorporate features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words that appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.
AB - This paper investigates improving supervised word segmentation accuracy with unlabeled data. Both large-scale in-domain data and small-scale document text are considered. We present a unified solution to incorporate features derived from unlabeled data into a discriminative learning model. For the large-scale data, we derive string statistics from Gigaword to assist a character-based segmenter. In addition, we introduce the idea of transductive, document-level segmentation, which is designed to improve the system recall for out-of-vocabulary (OOV) words that appear more than once inside a document. Novel features result in relative error reductions of 13.8% and 15.4% in terms of F-score and the recall of OOV words, respectively.
UR - http://www.scopus.com/inward/record.url?scp=80053221622&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053221622&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:80053221622
SN - 1937284115
SN - 9781937284114
T3 - EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 970
EP - 979
BT - EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
T2 - Conference on Empirical Methods in Natural Language Processing, EMNLP 2011
Y2 - 27 July 2011 through 31 July 2011
ER -