So, here's a first cut at the storyboard. The graphs are generated quickly out of Excel; in the final version they will be prettier versions created with the ggplot2 package in R. In particular, the axes will be labeled properly: in all the performance graphs, the horizontal axis is a logarithmic scale of the number of samples taken and the vertical axis is the number of times (out of 1000 runs) that the 95% confidence interval missed the true value. Ideally, 50 of the 1000 confidence intervals (5%) miss.
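For reference, the final figures will probably look something like this; a minimal ggplot2 sketch, with a hypothetical data frame and column names (miss_rates, n_samples, misses_per_1000, method) standing in for the real simulation output:

    library(ggplot2)

    # One line per variance-estimation method: miss count vs. sample size.
    ggplot(miss_rates, aes(x = n_samples, y = misses_per_1000, colour = method)) +
      geom_line() +
      scale_x_log10() +                                   # log scale of samples taken
      geom_hline(yintercept = 50, linetype = "dashed") +  # ideal: 5% of 1000 intervals miss
      labs(x = "samples taken (log scale)", y = "misses out of 1000 confidence intervals")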
Sequential loading of data in batches causes similar data to be stored in the same physical block. The boundaries of the batches do not typically align with the physical block boundaries. Here, physical block j contains some of batch m, all of batch m+1, and some of batch m+2. Block j+1 consists entirely of observations from batch m+2. Block j+2 contains the remainder of batch m+2 and the beginning of batch m+3. Such co-location of related rows can significantly increase the variance of an estimator when sampling is done by block.
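To make this concrete, here is a small simulation in R (my own toy data, not the data behind the storyboard graphs): rows arrive in batches whose hit rate varies from batch to batch, batches are written sequentially into fixed-size blocks, and the block-to-block variance of the query's contribution is compared against a layout where the same rows are scattered randomly across blocks.

    set.seed(1)
    n_blocks       <- 1000                 # physical blocks
    rows_per_block <- 100
    n_rows         <- n_blocks * rows_per_block

    batch_id <- rep(1:2000, each = 50)     # 50-row batches (hypothetical sizes)
    batch_p  <- rbeta(2000, 0.5, 4.5)      # per-batch hit rate, mean ~0.1
    hit      <- rbinom(n_rows, 1, batch_p[batch_id])
    measure  <- rnorm(n_rows, 100, 20)
    value    <- hit * measure              # each row's contribution to the query sum

    block_id <- rep(1:n_blocks, each = rows_per_block)
    block_sum_batched  <- tapply(value, block_id, sum)          # batches kept intact
    block_sum_shuffled <- tapply(sample(value), block_id, sum)  # rows scattered at random

    c(batched = var(block_sum_batched), shuffled = var(block_sum_shuffled))
    # Block-to-block variance is far higher when batches stay intact, which is what
    # inflates the variance of a block-sampled estimate of the total.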
When rows are not correlated by physical block, the sample sum and sample variance are independent. However, when the rows are correlated by block, the sum and the variance become correlated: the samples that understate the sum also tend to understate the variance, so their confidence intervals are centered too low and are too narrow at the same time. This simultaneous understating of both the sum and the variance leads to a high number of misses, even though the two statistics are each marginally unbiased. The cluster of points in the lower left of this graph results in confidence intervals that miss (I have a graph in R that does a much better job of illustrating this).
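A quick way to see the dependence (again a toy sketch with made-up block sums, not the storyboard data): most blocks contribute nothing to the query and a few contribute a lot, so the estimate of the sum and the sample variance rise and fall together across repeated block samples.

    set.seed(2)
    block_sum <- c(rep(0, 900), rgamma(100, shape = 2, rate = 0.001))  # skewed, hypothetical
    k <- 30                                                            # blocks per sample

    draws <- replicate(10000, {
      s <- sample(block_sum, k)
      c(est = length(block_sum) * mean(s), var = var(s))
    })
    cor(draws["est", ], draws["var", ])
    # Strongly positive: the samples that miss the heavy blocks understate both the
    # total and the variance used to build the interval.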
When the confidence interval is built from the sample block variance, the miss rate is correct in the uncorrelated case. As a result of the correlation, however, this method produces too many misses when applied to small samples of correlated data. This is problematic when the width of the confidence interval is used as a stopping rule for sampling. Arbitrarily setting a lower bound on the number of samples is also unreliable, since the convergence to a proper 5% miss rate depends on the nature of the query and the strength of the correlation. (Uncorr is the data where the rows are randomly scattered across batches. rhoCorr is where the hit rate is correlated by batch. batchCorr is where both the hit rate and the distribution of the measure values are correlated by batch.)
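Here is a sketch of the coverage check, using the same kind of made-up block sums as above (the real Uncorr/rhoCorr/batchCorr data sets are the storyboard's and are not reproduced here): build the usual normal-theory interval from the sample block variance and count how often it misses the true total.

    set.seed(3)
    miss_rate <- function(block_sum, k, reps = 5000) {
      true_total <- sum(block_sum)
      n <- length(block_sum)
      misses <- replicate(reps, {
        s   <- sample(block_sum, k)
        est <- n * mean(s)                 # scaled-up estimate of the total
        se  <- n * sd(s) / sqrt(k)         # standard error from the sample block variance
        abs(est - true_total) > qnorm(0.975) * se
      })
      mean(misses)
    }

    block_sum_corr   <- c(rep(0, 900), rgamma(100, shape = 2, rate = 0.001))  # correlated layout
    block_sum_uncorr <- rnorm(1000, mean(block_sum_corr), sd = 50)            # roughly normal block sums

    c(corr = miss_rate(block_sum_corr, k = 30), uncorr = miss_rate(block_sum_uncorr, k = 30))
    # The uncorrelated case misses close to the nominal 5%; the correlated case
    # misses far more often at this sample size.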
Using the sample distribution of the hit rate to estimate the block variance yields good results when the hit rate is the primary source of variance. However, it makes no adjustment for correlation in the measure distribution. When the data is not correlated, the method significantly overstates the variance (resulting in no misses) because random fluctuations in the hit rate are counted as real variation in the parameter.
Subtracting the variance expected from random fluctuations out of the variance of the kernel distribution corrects for the overstatement in the uncorrelated case. This method is very fast and is a good choice when only the hit rate (and not the distribution of the measures) is correlated by batch.
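A rough sketch of the idea in the previous two paragraphs, based on my reading rather than the storyboard's exact estimator: the observed per-block hit rates spread out partly because of genuine block-to-block differences and partly because of binomial noise from the finite block size, so the correction subtracts an estimate of the binomial part (approximated here as p(1-p)/B), and the block variance is then implied by assuming a homogeneous mean measure per hit.

    set.seed(4)
    B      <- 100                          # rows per block
    m      <- 100                          # assumed homogeneous mean measure of a hit
    p_true <- rep(0.10, 50)                # uncorrelated case: same true hit rate in every block
    hits   <- rbinom(length(p_true), B, p_true)
    p_hat  <- hits / B                     # observed per-block hit rates

    raw_var   <- var(p_hat)                        # treats all spread as real variation
    noise_var <- mean(p_hat * (1 - p_hat)) / B     # expected spread from binomial noise alone
    corrected <- max(raw_var - noise_var, 0)       # the subtraction described above

    (B * m)^2 * c(raw = raw_var, corrected = corrected)   # implied block-sum variance
    # With a constant true hit rate the raw estimate is essentially all noise (an
    # overstated variance and intervals that never miss); the corrected one is at or
    # near zero.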
The simplest method that makes no assumptions about the homogeneity of the measures is the partition bootstrap. However, this method suffers from the same problem as the simple sample variance: the samples that understate the sum also tend to understate the block variance. As a result, the confidence intervals, especially early in the sampling, miss too often.
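A minimal sketch of a partition (block) bootstrap over the sampled blocks, again on made-up block sums; the percentile interval used here is one common choice, not necessarily the storyboard's:

    set.seed(5)
    block_sum <- c(rep(0, 900), rgamma(100, shape = 2, rate = 0.001))
    k <- 30
    s <- sample(block_sum, k)                        # the blocks actually read

    boot_totals <- replicate(2000,
      length(block_sum) * mean(sample(s, k, replace = TRUE)))
    quantile(boot_totals, c(0.025, 0.975))           # percentile bootstrap interval
    # If the original sample happened to miss the heavy blocks, every resample misses
    # them too, so the interval is narrowest in exactly the samples whose estimate is
    # already too low.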
When we figure out what we're going to do with BMH, that will go here. I actually expect BMH to work really well, since it basically does the same thing as the full kernel, just using a Metropolis-Hastings distribution rather than a kernel distribution as the source. The argument for using the kernel will be one of efficiency, since BMH is pretty slow. Finally, the winner is...
The full kernel produces confidence intervals that include the true value roughly 95% of the time at all sample sizes. There is some understatement of variance at very low sample sizes, so it is recommended that a minimum number of samples (or, alternatively, a minimum number of samples that produce hits) be required before applying a stopping rule.
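One way such a guard might look in code (the thresholds are placeholders, not recommendations from the storyboard):

    # Width-based stopping rule that refuses to fire until minimum-sample guards are met.
    stop_sampling <- function(est, half_width, n_blocks_read, n_hits,
                              rel_tol = 0.01, min_blocks = 100, min_hits = 30) {
      enough_data <- n_blocks_read >= min_blocks && n_hits >= min_hits
      enough_data && half_width <= rel_tol * abs(est)
    }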