never2old4school: Why would you even do that?

Friday, October 27, 2017

Why would you even do that?

I didn't have much to show for this week's efforts. Reading a paper and generating a bunch of random data sets doesn't make for a good demo. I decided to go ahead and keep my regularly scheduled meeting with my adviser anyway, figuring I could at least get him to validate the approach I was taking.

To that end, I put together a little table of each method being compared and why you would even consider sampling that way.

n: number of raw rows

m: block size

N=n/m: number of blocks

r: blocks read

s2: estimated variance of block sums

B: bound on variance of population sum estimator

Full Sampling

Block selection: sequential

Variance estimator: n/a

Estimator of population sum: actual sum

Stopping rule: all blocks read

Rationale: Certainty. This is what most people do now with finite data sets.

Drawback: Slow if data set is really large.

Random Sampling

Block selection: random

Variance estimator: sample variance of block sums read so far

Estimator of population sum: N * (sample mean block sum)

Stopping rule: N * s2 < B

Rationale: Leverage the law of large numbers to get a good approximation with less reading.

Drawback: Variance will be understated for heavy-tailed distributions.
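
To make the Random Sampling entry concrete, here is a minimal Python sketch of the loop: read blocks in random order, track the sample variance of the block sums, and stop once the estimated variance of the sum estimator drops below B. The function name `random_block_sum` and its parameters are hypothetical, and the variance formula is an assumption on my part: it uses the textbook expression for sampling without replacement, N^2 * s2 / r * (1 - r/N), rather than the shorthand in the table.

```python
import random
import statistics


def random_block_sum(blocks, bound, min_blocks=2, seed=0):
    """Estimate the sum of all rows by reading blocks in random order.

    blocks: list of blocks, each a list of numbers (hypothetical layout)
    bound:  B, the largest acceptable estimated variance of the estimator
    Returns (estimate, estimated_variance, blocks_read).
    """
    rng = random.Random(seed)
    N = len(blocks)
    order = rng.sample(range(N), N)  # random order, without replacement
    block_sums = []
    est_var = float("inf")
    for idx in order:
        block_sums.append(sum(blocks[idx]))
        r = len(block_sums)
        if r < min_blocks:
            continue  # need at least two block sums for a sample variance
        s2 = statistics.variance(block_sums)
        # Assumed textbook variance of N * (sample mean), with the
        # finite population correction (1 - r/N).
        est_var = N * N * s2 / r * (1 - r / N)
        if est_var < bound:
            break
    r = len(block_sums)
    if r == N:
        est_var = 0.0  # everything was read, so the sum is exact
    estimate = N * statistics.fmean(block_sums)
    return estimate, est_var, r
```

With a bound of zero the loop never stops early, so it degenerates into Full Sampling and returns the exact sum.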

Bag of Little Bootstraps

Block selection: random; each block selected becomes a "little bootstrap" subsample

Variance estimator: bootstrap estimate of block sum variance

Estimator of population sum: N * (sample mean block sum)

Stopping rule: N * (sum s2) / r < B

Rationale: Better estimate of variance in the heavy-tailed case

Drawback: More computation per block, variance may still be understated due to correlation
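
As a rough illustration of what a per-block bootstrap variance estimate looks like, the sketch below resamples the rows of one block with replacement and takes the variance of the resampled sums. This is a plain within-block bootstrap, not the full Bag of Little Bootstraps procedure (which resamples each subsample up to the full data size); the function name `bootstrap_block_sum_variance` and the default of 200 resamples are hypothetical choices.

```python
import random
import statistics


def bootstrap_block_sum_variance(block, n_resamples=200, seed=0):
    """Bootstrap estimate of the variance of one block's sum.

    block: list of row values in a single block (hypothetical layout)
    Each resample draws len(block) rows with replacement and sums them;
    the variance of those sums is the estimate.
    """
    rng = random.Random(seed)
    m = len(block)
    sums = [sum(rng.choices(block, k=m)) for _ in range(n_resamples)]
    return statistics.variance(sums)
```

The extra cost per block is visible here: every block read triggers `n_resamples` passes over its rows instead of one.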

Bootstrap Metropolis-Hastings

Block selection: random

Variance estimator: Metropolis-Hastings

Estimator of population sum: N * (sample mean block sum)

Stopping rule: N * (sum s2) / r < B

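Rationale: Another stab at getting a better variance estimate

Drawback: Same as Bag of Little Bootstraps: more computation per block, and variance may still be understated due to correlation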

CISS

Block selection: Random by strata to minimize estimate of total variance

Variance estimator: Bayesian posterior applied to stratum variance, S

Estimator of population sum: sum of strata estimates

Stopping rule: sum(S) < B

Rationale: Leverage the nature of the desired statistic to actually lower the block sum variance

Drawback: Data must be stratified in advance
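
The sketch below shows the general idea of sampling by strata to drive down the total variance. It is not the CISS algorithm itself: the Bayesian posterior on the stratum variance is swapped for a plain sample variance, and the greedy rule (always read next from the stratum with the largest estimated variance contribution) is my assumption. The name `stratified_block_sum` and the input layout are hypothetical.

```python
import random
import statistics


def stratified_block_sum(strata, bound, seed=0):
    """Estimate a population sum from pre-stratified block sums.

    strata: list of strata, each a list of block sums (hypothetical layout)
    bound:  B, the target for the summed stratum variances
    Returns (estimate, total_variance, blocks_read_per_stratum).
    """
    rng = random.Random(seed)
    unread = [rng.sample(s, len(s)) for s in strata]  # shuffled copies
    read = [[] for _ in strata]

    def stratum_var(k):
        n_k, r_k = len(strata[k]), len(read[k])
        if r_k == n_k:
            return 0.0  # stratum fully read: its sum is exact
        if r_k < 2:
            return float("inf")
        s2 = statistics.variance(read[k])
        return n_k * n_k * s2 / r_k * (1 - r_k / n_k)

    # Seed every stratum with two blocks so a variance can be computed.
    for k in range(len(strata)):
        for _ in range(min(2, len(strata[k]))):
            read[k].append(unread[k].pop())

    while True:
        variances = [stratum_var(k) for k in range(len(strata))]
        total_var = sum(variances)
        if total_var < bound:
            break
        # Greedy choice (an assumption): read where the variance is largest.
        k = max(range(len(strata)), key=variances.__getitem__)
        if not unread[k]:
            break
        read[k].append(unread[k].pop())

    estimate = sum(
        len(strata[k]) * statistics.fmean(read[k])
        for k in range(len(strata))
        if read[k]
    )
    return estimate, total_var, [len(r) for r in read]
```

When one stratum is homogeneous, almost all of the reading effort goes to the noisy strata, which is where the savings over unstratified random sampling come from.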

This contest is totally rigged to have CISS come out on top. All the other methods just do increasingly good jobs of telling you how hard the problem is. That's because they are general-purpose methods designed to work on any statistic. CISS uses the fact that we know exactly what kind of statistic we are after (a first-order U-statistic) and can therefore rearrange things to minimize the variance of that statistic.
