TY - GEN
T1 - Robust unsupervised segmentation of degraded document images with topic models
AU - Burns, Timothy J.
AU - Corso, Jason J.
PY - 2009
Y1 - 2009
N2 - Segmentation of document images remains a challenging vision problem. Although document images have a structured layout, capturing enough of it for segmentation can be difficult. Most current methods combine text extraction and heuristics for segmentation, but text extraction is prone to failure and measuring accuracy remains a difficult challenge. Furthermore, when presented with significant degradation many common heuristic methods fall apart. In this paper, we propose a Bayesian generative model for document images which seeks to overcome some of these drawbacks. Our model automatically discovers different regions present in a document image in a completely unsupervised fashion. We attempt no text extraction, but rather use discrete patch-based codebook learning to make our probabilistic representation feasible. Each latent region topic is a distribution over these patch indices. We capture rough document layout with an MRF Potts model. We take an analysis-by-synthesis approach to examine the model, and provide quantitative segmentation results on a manually labeled document image data set. We illustrate our model's robustness by providing results on a highly degraded version of our test set.
AB - Segmentation of document images remains a challenging vision problem. Although document images have a structured layout, capturing enough of it for segmentation can be difficult. Most current methods combine text extraction and heuristics for segmentation, but text extraction is prone to failure and measuring accuracy remains a difficult challenge. Furthermore, when presented with significant degradation many common heuristic methods fall apart. In this paper, we propose a Bayesian generative model for document images which seeks to overcome some of these drawbacks. Our model automatically discovers different regions present in a document image in a completely unsupervised fashion. We attempt no text extraction, but rather use discrete patch-based codebook learning to make our probabilistic representation feasible. Each latent region topic is a distribution over these patch indices. We capture rough document layout with an MRF Potts model. We take an analysis-by-synthesis approach to examine the model, and provide quantitative segmentation results on a manually labeled document image data set. We illustrate our model's robustness by providing results on a highly degraded version of our test set.
UR - http://www.scopus.com/inward/record.url?scp=70450191271&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70450191271&partnerID=8YFLogxK
U2 - 10.1109/CVPRW.2009.5206606
DO - 10.1109/CVPRW.2009.5206606
M3 - Conference contribution
AN - SCOPUS:70450191271
SN - 9781424439935
T3 - 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009
SP - 1287
EP - 1294
BT - 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2009
T2 - 2009 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009
Y2 - 20 June 2009 through 25 June 2009
ER -