To that end, I put together a little table of each method being compared and why you would even consider sampling that way.
n: number of raw rows
m: block size
N=n/m: number of blocks
r: blocks read
s2: estimated variance of block sums
B: bound on variance of population sum estimator
Full Sampling
Block selection: sequential
Variance estimator: n/a
Estimator of population sum: actual sum
Stopping rule: all blocks read
Rationale: Certainty. This is what most people do now with finite data sets.
Drawback: Slow if data set is really large.
Random Sampling
Block selection: random
Variance estimator: sample variance of block sums read so far
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * s2 < B
Rationale: Leverage the law of large numbers to get a good approximation with less reading.
Drawback: Variance will be understated for heavy-tailed distributions.
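To make the random sampling row concrete, here is a minimal sketch of the read-until-tight-enough loop. It is an illustration under stated assumptions, not the code used in the comparison: the function name `estimate_total`, the `min_blocks` guard, and the fixed seed are all made up for the example, and the variance expression is the textbook one for N * (sample mean) with a finite population correction, which is not necessarily the exact stopping expression listed in the table.

```python
import random
import statistics


def estimate_total(block_sums, bound, min_blocks=5, seed=0):
    """Read blocks in random order until the estimated variance of the
    population-sum estimator drops below `bound`.

    Returns (estimate of the population sum, number of blocks read).
    `block_sums` stands in for "read a block and sum it" (hypothetical input).
    """
    rng = random.Random(seed)
    N = len(block_sums)
    order = list(range(N))
    rng.shuffle(order)  # random block selection, without replacement

    seen = []
    for idx in order:
        seen.append(block_sums[idx])
        r = len(seen)
        if r < min_blocks:
            continue  # too few blocks for a meaningful variance estimate
        s2 = statistics.variance(seen)  # sample variance of block sums so far
        # Assumption: textbook variance of N * mean under sampling without
        # replacement, including the finite population correction (1 - r/N).
        var_total = N * N * (1 - r / N) * s2 / r
        if var_total < bound:
            break

    return N * statistics.fmean(seen), len(seen)
```

With a very loose bound the loop stops as soon as `min_blocks` blocks are read; with a very tight bound it degenerates into full sampling and returns the actual sum.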
Bag of Little Bootstraps
Block selection: Random, each block selected becomes a "little bootstrap" subsample
Variance estimator: bootstrap estimate of blocksum variance
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s2) / r < B
Rationale: Better estimate of variance in the heavy-tailed case.
Drawback: More computation per block; variance may still be understated due to correlation.
Bootstrap Metropolis-Hastings
Block selection: random
Variance estimator: Metropolis-Hastings
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s2) / r < B
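Rationale: Another stab at getting a better variance estimate.
Drawback: Same as Bag of Little Bootstraps: more computation per block, and variance may still be understated due to correlation.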
CISS
Block selection: Random by strata to minimize estimate of total variance
Variance estimator: Bayesian posterior applied to stratum variance, S
Estimator of population sum: sum of strata estimates
Stopping rule: sum(S) < B
Rationale: Leverage the nature of the desired statistic to actually lower the blocksum variance.
Drawback: Data must be stratified in advance.
This contest is totally rigged to have CISS come out on top. All the other methods just do increasingly good jobs of telling you how hard the problem is. That's because they are general-purpose methods designed to work on any statistic. CISS uses the fact that we know exactly what kind of statistic we are after (a first-order U-statistic) and can therefore rearrange the sampling to minimize the variance of that statistic.
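The reason stratifying helps can be seen with a small calculation. The sketch below is plain stratified sampling, not CISS itself (there is no Bayesian posterior here), and the block sums are hypothetical numbers picked to mimic a few large blocks among many small ones. It compares the exact variance of the population-sum estimator when blocks are drawn from the pooled population against the variance when the same number of blocks is split across two strata.

```python
import statistics


def srs_total_variance(values, r):
    """Exact variance of N * (sample mean) when r of the N values are drawn
    by simple random sampling without replacement."""
    N = len(values)
    return N * N * (1 - r / N) * statistics.variance(values) / r


def stratified_total_variance(strata, reads):
    """Variance of the sum of per-stratum estimates, assuming the strata
    are sampled independently of each other."""
    return sum(srs_total_variance(v, r) for v, r in zip(strata, reads))


# Hypothetical block sums: many small blocks and a few large ones.
small = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0]
large = [100.0, 120.0, 90.0, 110.0]
pooled = small + large

# Six blocks read in both cases.
pooled_var = srs_total_variance(pooled, 6)
stratified_var = stratified_total_variance([small, large], [3, 3])
print(pooled_var, stratified_var)
```

Because each stratum is far more homogeneous than the pooled population, the stratified variance comes out much smaller for the same number of blocks read, which is the effect the CISS row is exploiting.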