never2old4school: Optimal stratification

Tuesday, October 3, 2017

Optimal stratification

Here's the motivation behind CISS in a nutshell. I argued last week that the difference between our estimate of the sample sum based on the subsample and the true sample sum was

Var(Y-Yr|Xr) = σ²(n-r) where σ² = Var(Xi).

Now, take just the case where we trim off the tails. Clearly, the variance of the remaining blocks is reduced without all those tail observations (just how much it's reduced depends on the distribution). Assuming we sample the tails in their entirety, the remaining variance is

Var(Y-Yr|Xr) = σ_T²(n-r) where σ_T² is the variance of the non-tail elements.

This is clearly less than the original value since the only difference is σ_T² being swapped in for the greater quantity σ². So, given that this is better, what's the best possible answer? The optimization problem we're trying to solve is:

minimize r such that

σ_T²(n-r) > B (where B is our upper bound on the estimate variance)
r > number of blocks in the tail strata (we need to sample at least one from the middle to have an estimate of the mean

You can crunch that through a linear programming algorithm if you want, but it's pretty obvious by inspection that the optimal solution is one that puts r-1 blocks in the tail. The problem, of course is that we have no idea what σ_T² is. We could compute it for the entire population easy enough (we're going to have to do a complete pass of the data to stratify it anyway), but it's going to change with every query. Some queries won't have any results in the tail. Others will have lots. Some will have high correlation in the non-tail blocks (increasing the variance), others will be much more random.

So, just tearing off the tails won't really help much. We need more stata. How many more? Ah, now we're getting into rough waters. The short answer is: it depends. Mostly it depends on how volatile the stratum variances are from one query to the next. I don't know that this has a tractable answer mathematically. I could just simulate a bunch of results. Before heading down that road, I think I'll at least try to come up with a theorem that relates the distribution to the optimal partitioning for any one query.

never2old4school

Tuesday, October 3, 2017

Optimal stratification

No comments:

Post a Comment