TY - JOUR
T1 - On wavelet compression and cardinality estimation of enterprise data
AU - Choudur, Lakshminarayan
AU - Dayal, Umeshwar
AU - Gupta, Chetan
AU - Swaminathan, Ram
PY - 2010
Y1 - 2010
N2 - Storing and analyzing large volume of structured or unstructured data at the scale of petabytes in applications such as business intelligence of an enterprise, is a daunting task. It is therefore desirable to store data in a compressed form. Compression often is performed using transform methods that permit capture of details in the data while at the same time representing it efficiently; wavelet transform is one such compression technique. Viewing the data as a signal, wavelet transform involves computing coefficients and selecting a small subset judiciously to approximate the signal. Because we pick a subset of the coefficients and not all of them to synthesize the data, wavelet based compression is inherently lossy. Compression in the wavelet domain is traditionally done using hard and soft thresholding techniques and variants thereof. Based on the notion of "total energy" of a signal, in this paper, we introduce two new thresholding methods called, level-independent energy and level-dependent energy thresholding. In particular, our energybased thresholding techniques exploit the relationship between total energy of the signal and wavelet coefficients to obtain compression to meet pre-specified error tolerances. In addition, level-dependent energy thresholding aims at determining wavelet coefficients that carry most "information" at various resolutions of the wavelet decomposition of the data distribution. Detailed studies and experiments over noise contaminated synthetic and TPCH benchmark data sets indicate that energy based thresholding methods yield approximately 100:1 compression with high reconstruction accuracy. As an application of our compression techniques, we show that energy-based thresholding methods improve accuracy of cardinality estimation in databases substantially over the popular methods based on equi-height and max-diff histograms as well as over hard, soft and probabilistic thresholding techniques from the statistics and database literature.
AB - Storing and analyzing large volume of structured or unstructured data at the scale of petabytes in applications such as business intelligence of an enterprise, is a daunting task. It is therefore desirable to store data in a compressed form. Compression often is performed using transform methods that permit capture of details in the data while at the same time representing it efficiently; wavelet transform is one such compression technique. Viewing the data as a signal, wavelet transform involves computing coefficients and selecting a small subset judiciously to approximate the signal. Because we pick a subset of the coefficients and not all of them to synthesize the data, wavelet based compression is inherently lossy. Compression in the wavelet domain is traditionally done using hard and soft thresholding techniques and variants thereof. Based on the notion of "total energy" of a signal, in this paper, we introduce two new thresholding methods called, level-independent energy and level-dependent energy thresholding. In particular, our energybased thresholding techniques exploit the relationship between total energy of the signal and wavelet coefficients to obtain compression to meet pre-specified error tolerances. In addition, level-dependent energy thresholding aims at determining wavelet coefficients that carry most "information" at various resolutions of the wavelet decomposition of the data distribution. Detailed studies and experiments over noise contaminated synthetic and TPCH benchmark data sets indicate that energy based thresholding methods yield approximately 100:1 compression with high reconstruction accuracy. As an application of our compression techniques, we show that energy-based thresholding methods improve accuracy of cardinality estimation in databases substantially over the popular methods based on equi-height and max-diff histograms as well as over hard, soft and probabilistic thresholding techniques from the statistics and database literature.
KW - Compression
KW - Energy
KW - Thresholding
KW - Wavelets
UR - http://www.scopus.com/inward/record.url?scp=78650690133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=78650690133&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:78650690133
SP - 1
EP - 12
JO - HP Laboratories Technical Report
JF - HP Laboratories Technical Report
IS - 132
ER -