I'm interested in the application of sampling theory to database queries. The success criteria for a useful method include the following:
- Run time must be significantly faster than computing the exact result directly from all available rows.
- The result must include not only a point estimate but also a confidence interval on that estimate. Ideally, both one- and two-sided intervals are supported.
The first criterion significantly reduces the applicability of pseudo-random sampling at the row level, since pulling individual random rows from a database is a relatively expensive operation. Thus, clustered sampling techniques show more promise. However, adjacent database rows tend to be highly correlated, so the use of clustered techniques widens the resulting confidence interval for a given sample size and introduces complexity into the confidence calculation itself.
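To make that trade-off concrete, here is a small, self-contained Python simulation (not tied to any particular database engine; the page structure and parameters are invented for illustration) in which rows are stored in correlated pages. Reading the same number of rows by sampling whole pages produces a noticeably larger standard error than a true row-level random sample, which is exactly what widens the confidence interval.

```python
import random
from statistics import mean, pstdev

random.seed(0)

# Synthetic "table": rows stored in pages of 100, with a shared per-page
# effect so that physically adjacent rows are correlated, as they often
# are when data is loaded in time or key order.
PAGE_SIZE = 100
pages = []
for _ in range(1_000):
    page_effect = random.gauss(0, 3)   # component shared by every row on the page
    pages.append([page_effect + random.gauss(0, 1) for _ in range(PAGE_SIZE)])
rows = [v for page in pages for v in page]

def srs_mean(n):
    """Mean of a simple random sample of n individual rows (expensive in a real DB)."""
    return mean(random.sample(rows, n))

def cluster_mean(n_pages):
    """Mean over a random sample of whole pages (cheap sequential reads)."""
    chosen = random.sample(pages, n_pages)
    return mean(v for page in chosen for v in page)

# Same number of rows read either way, but the clustered estimate is noisier.
trials = 500
print("row-level std err:", pstdev(srs_mean(2_000) for _ in range(trials)))
print("clustered std err:", pstdev(cluster_mean(20) for _ in range(trials)))
```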
Kerry and Bland (1998) indicate that "The main difficulty in calculating sample size for cluster randomized studies is obtaining an estimate of the between cluster variation or intracluster correlation." This problem has led to multi-stage sampling techniques in which the inter- and intra-cluster variations are estimated at each stage and used to predict the sample size needed at the next layer of refinement. Several epidemiological studies, where attaining a true random sample is prohibitive due to geographical constraints, have demonstrated this technique (e.g., Galway et al., 2012).
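For equal-size clusters, the standard way to quantify this is the design effect, deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intracluster correlation; a clustered sample of n rows is then worth roughly n / deff independent rows. Below is a minimal sketch (my own, using only the Python standard library) of the usual one-way ANOVA estimator of rho and the resulting design effect.

```python
from statistics import mean

def icc_and_deff(clusters):
    """ANOVA estimator of the intracluster correlation (ICC) and the
    resulting design effect. `clusters` is a list of lists, one inner
    list of numeric values per cluster, all assumed the same size here."""
    k = len(clusters)
    m = len(clusters[0])
    grand = mean(v for c in clusters for v in c)
    cluster_means = [mean(c) for c in clusters]

    # Between- and within-cluster mean squares.
    msb = m * sum((cm - grand) ** 2 for cm in cluster_means) / (k - 1)
    msw = sum((v - cm) ** 2
              for c, cm in zip(clusters, cluster_means)
              for v in c) / (k * (m - 1))

    icc = (msb - msw) / (msb + (m - 1) * msw)
    deff = 1 + (m - 1) * icc   # variance inflation relative to simple random sampling
    return icc, deff
```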
Such methods have their detractors (e.g., Luman et al., 2007), who generally claim that clustered methods consistently overstate results when the variability between clusters is significantly higher than the variability within clusters. In response, the World Health Organization has layered various heuristics on top of the cluster-selection method to improve its estimates of vaccination coverage (Burton et al., 2009). They concede, however, that reliable power estimates are difficult, if not impossible, under their methodology.
The goal of this research is to develop methods for extracting unbiased clustered samples from large databases while retaining the ability to perform power calculations. Some areas to explore are:
- Using background processes to continually analyze the data, developing better correlation estimates that subsequent query processing can use to select clusters optimally.
- Similarly, using background processing to maintain a row-wise pseudo-random sample database suitable for exploratory queries.
- Using convergence of the point estimate as an input to a stopping rule. That is, refusing to accept that the target confidence interval has been reached while the point estimate is still drifting from one query iteration to the next.
- Using the convergence of other correlated characteristics of the data as an indicator for a stopping rule (e.g., if one were estimating the cash flows of a group of insurance policies, one could watch the more volatile component of projected claims, since convergence on that item would imply convergence on the far more stable components of premiums and expenses). A rough sketch of such a combined stopping rule follows this list.
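As an illustration of the last two ideas, here is a hypothetical stopping rule (the function name, thresholds, and batch-means framing are my own, not drawn from any reference) that accepts a confidence interval only when its half-width is under a target and the running point estimate has stopped drifting. The same drift test could be applied to the most volatile component alone, such as projected claims in the insurance example, before trusting the more stable components.

```python
from math import sqrt
from statistics import mean, stdev

def should_stop(batch_means, target_half_width, rel_tol=0.01, window=5, z=1.96):
    """Accept the interval only if (a) the half-width from the batch-means
    estimator is under the target and (b) the cumulative point estimate has
    stopped drifting over the last `window` iterations. `batch_means` holds
    the estimate produced by each query iteration (each batch of clusters)."""
    k = len(batch_means)
    if k < window:
        return False
    # Approximate CI half-width, treating batch estimates as roughly independent.
    half_width = z * stdev(batch_means) / sqrt(k)
    # Cumulative point estimate after each iteration, restricted to the recent window.
    running = [mean(batch_means[: i + 1]) for i in range(k)][-window:]
    drift = (max(running) - min(running)) / (abs(running[-1]) or 1.0)
    return half_width <= target_half_width and drift <= rel_tol
```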
Presumably, a more thorough literature review will turn up lots of other avenues to explore as well; these are just a few things that I've thought of so far.