Time-Interval Sampling for Improved Estimations in Data Warehouses



Data warehouses are of crucial strategic importance to decision-making in many competitive organizations. The enormous quantities of data they store pose a challenge for performance and scalability: gigabytes or terabytes of data are loaded into these systems, while users expect instant answers to their queries. During exploration phases, very fast approximate answers to user queries can be returned using pre-computed summaries. For instance, a query that would take two hours can be completed in one minute or less using a summary. Sampling has been proposed as a strategy to produce general-purpose summaries well suited to all types of exploratory analysis. However, its usage is constrained by the fact that each grouping interval must contain a representative number of samples to yield acceptable accuracy. In this paper we propose and evaluate a technique that deals with the representation issue by using time-interval-biased stratified samples (TISS). The technique delivers fast, accurate analysis to the user by taking advantage of the importance of the time dimension in most user analyses. It is designed as a transparent middle layer, which analyzes the query and rewrites it to use a summary instead of the base data warehouse. The technique is presented and evaluated experimentally in a typical TPC-H setup. The estimations and error bounds returned by the technique are compared with those of traditional sampling summaries, showing that it achieves a significant improvement in accuracy.
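
The general idea of a time-biased stratified sample can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how strata are defined, which sampling rates are used, or how estimates are computed. The sketch assumes that each row carries a time period label, that the caller supplies a sampling fraction per period (for example, higher for recent periods), and that a SUM is estimated by scaling each stratum's sample total by its population-to-sample ratio. The names `build_time_biased_sample` and `estimate_sum` are hypothetical.

```python
import random
from collections import defaultdict


def build_time_biased_sample(rows, rates, seed=0):
    """Build a stratified sample with one stratum per time period.

    rows  : iterable of (period, value) pairs (assumed layout).
    rates : dict mapping period -> sampling fraction in (0, 1];
            periods that matter more can be given higher fractions.
    Returns a dict: period -> (stratum size, list of sampled values).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for period, value in rows:
        strata[period].append(value)

    summary = {}
    for period, values in strata.items():
        # Keep at least one row per stratum so every period is represented.
        n = max(1, round(len(values) * rates[period]))
        summary[period] = (len(values), rng.sample(values, n))
    return summary


def estimate_sum(summary, periods):
    """Estimate SUM(value) over the given periods from the summary.

    Each stratum's sample total is scaled by N_h / n_h, the ratio of
    stratum size to sample size.
    """
    total = 0.0
    for period in periods:
        population_size, sample = summary[period]
        total += population_size / len(sample) * sum(sample)
    return total
```

In a sketch like this, a query restricted to heavily sampled periods is answered from more rows per stratum than under a uniform sample of the same overall size, which is the intuition behind biasing the sample along the time dimension.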


Approximate Summaries, OLAP


Data Warehousing


4th International Conference on Data Warehousing and Knowledge Discovery - DaWaK 2002, September 2002
