Tuesday, October 3, 2017

Optimal stratification

Here's the motivation behind CISS in a nutshell. I argued last week that the difference between our estimate of the sample sum based on the subsample and the true sample sum was

Var(Y-Yr|Xr) = σ2(n-r) where σ2 = Var(Xi).

Now, take just the case where we trim off the tails. Clearly, the variance of the remaining blocks is reduced without all those tail observations (just how much it's reduced depends on the distribution). Assuming we sample the tails in their entirety, the remaining variance is

Var(Y-Yr|Xr) = σT2(n-r) where σT2 is the variance of the non-tail elements.

This is clearly less than the original value since the only difference is σT2 being swapped in for the greater quantity σ2. So, given that this is better, what's the best possible answer? The optimization problem we're trying to solve is:

minimize r such that

  1. σT2(n-r) > B (where B is our upper bound on the estimate variance)
  2. r > number of blocks in the tail strata (we need to sample at least one from the middle to have an estimate of the mean

You can crunch that through a linear programming algorithm if you want, but it's pretty obvious by inspection that the optimal solution is one that puts r-1 blocks in the tail. The problem, of course is that we have no idea what σT2 is. We could compute it for the entire population easy enough (we're going to have to do a complete pass of the data to stratify it anyway), but it's going to change with every query. Some queries won't have any results in the tail. Others will have lots. Some will have high correlation in the non-tail blocks (increasing the variance), others will be much more random.

So, just tearing off the tails won't really help much. We need more stata. How many more? Ah, now we're getting into rough waters. The short answer is: it depends. Mostly it depends on how volatile the stratum variances are from one query to the next. I don't know that this has a tractable answer mathematically. I could just simulate a bunch of results. Before heading down that road, I think I'll at least try to come up with a theorem that relates the distribution to the optimal partitioning for any one query.

No comments:

Post a Comment