Friday, October 27, 2017

Why would you even do that?

I didn't have much to show for this week's efforts. Reading a paper and generating a bunch of random data sets doesn't make for a good demo. I decided to go ahead and keep my regularly scheduled meeting with my adviser, figuring I could at least get him to validate the approach I was taking.

To that end, I put together a little table of each method being compared and why you would even consider sampling that way.

n: number of raw rows
m: block size
N=n/m: number of blocks
r: blocks read
s2: estimated variance of block sums
B: bound on variance of population sum estimator

Full Sampling

Block selection: sequential
Variance estimator: n/a
Estimator of population sum: actual sum
Stopping rule: all blocks read
Rationale: Certainty. This is what most people do now with finite data sets.
Drawback: Slow if data set is really large.

Random Sampling

Block selection: random
Variance estimator: sample variance of block sums read so far
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * s2 < B
Rationale: Leverage the law of large numbers to get a good approximation with less reading.
Drawback: Variance will be understated for heavy-tailed distributions.
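
As a rough sketch of the random sampling row, the loop below reads blocks in random order and stops once the estimated variance of the sum estimator drops under B. Everything named here (`estimate_sum_by_random_blocks`, `bound`, `min_blocks`, `seed`) is hypothetical. The variance formula is an assumption: it is the textbook one for simple random sampling without replacement, N^2 * s2 / r * (1 - r/N), which is more explicit than the shorthand in the table and may not be exactly what the comparison uses.

```python
import random
import statistics


def estimate_sum_by_random_blocks(blocks, bound, seed=0, min_blocks=2):
    """Read blocks in random order until the estimated variance of the
    population-sum estimator falls below `bound`.

    Returns (estimated_sum, estimated_variance, blocks_read).
    """
    if not blocks:
        raise ValueError("need at least one block")

    rng = random.Random(seed)
    N = len(blocks)                      # number of blocks
    order = list(range(N))
    rng.shuffle(order)                   # block selection: random

    block_sums = []
    var_total = float("inf")             # unknown until two blocks are read
    r = 0
    for r, idx in enumerate(order, start=1):
        block_sums.append(sum(blocks[idx]))
        if r < min_blocks:
            continue
        s2 = statistics.variance(block_sums)   # sample variance of block sums
        # Assumed textbook variance of N * (mean block sum) under simple
        # random sampling without replacement; it reaches 0 when r == N.
        var_total = N * N * s2 / r * (1 - r / N)
        if var_total < bound:
            break

    if r == N:
        var_total = 0.0                  # every block read: the sum is exact

    estimate = N * statistics.mean(block_sums)
    return estimate, var_total, r
```

With a bound of 0 the loop never stops early, so it degenerates into the full sampling case and returns the actual sum. Note that the sketch does nothing about the heavy-tail drawback: if the big block sums have not been hit yet, s2 is too small and the loop stops too soon.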

Bag of Little Bootstraps

Block selection: Random, each block selected becomes a "little bootstrap" subsample
Variance estimator: bootstrap estimate of block sum variance
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s2) / r < B
Rationale: Better estimate of variance in heavy-tail case
Drawback: More computation per block, variance may still be understated due to correlation

Bootstrap Metropolis-Hastings

Block selection: random
Variance estimator: Metropolis-Hastings
Estimator of population sum: N * (sample mean block sum)
Stopping rule: N * (sum s2) / r < B
Rationale: Another stab at getting a better variance estimate
Drawback: Same as above

CISS

Block selection: Random by strata to minimize estimate of total variance
Variance estimator: Bayesian posterior applied to stratum variance, S
Estimator of population sum: sum of strata estimates
Stopping rule: sum(S) < B
Rationale: Leverage the nature of the desired statistic to actually lower the block sum variance
Drawback: Data must be stratified in advance
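
To illustrate why stratifying helps, here is a minimal sketch of a plain stratified sum estimate. This is not CISS: there is no Bayesian posterior and no adaptive choice of which stratum to read next. It only shows the "sum of strata estimates" idea with the textbook per-stratum variance, and the function name and input layout are made up for the example.

```python
import statistics


def stratified_sum_estimate(strata):
    """Estimate a population sum from per-stratum samples of block sums.

    `strata` is a list of (N_h, sampled_block_sums) pairs, where N_h is the
    number of blocks in stratum h. Any stratum that is not fully read needs
    at least two sampled block sums so a sample variance exists.

    Returns (estimated_sum, estimated_variance).
    """
    total = 0.0
    total_var = 0.0
    for N_h, sums in strata:
        r_h = len(sums)
        # Stratum estimate: N_h * (sample mean block sum in the stratum).
        total += N_h * statistics.mean(sums)
        if r_h < N_h:
            s2 = statistics.variance(sums)
            # Assumed textbook variance for sampling without replacement.
            total_var += N_h ** 2 * s2 / r_h * (1 - r_h / N_h)
        # A fully read stratum contributes no variance.
    return total, total_var


# Two strata whose block sums differ a lot between strata but barely
# within each stratum (made-up numbers).
estimate, variance = stratified_sum_estimate([
    (10, [1, 1, 1]),      # small block sums
    (5, [100, 100]),      # large block sums
])
```

Because only the within-stratum spread enters the variance, grouping similar blocks together drives the total down in a way that pooling all the block sums into one sample cannot.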


This contest is totally rigged to have CISS come out on top. All the other methods just do increasingly good jobs of telling you how hard the problem is. That's because they are general-purpose methods designed to work on any statistic. CISS uses the fact that we know exactly what kind of statistic we are after (a first-order U-statistic) and can therefore rearrange things to minimize the variance of such a statistic.
