Tuesday, October 10, 2017

Optimal partitioning

I may have to punt on this one. There are just too many variables to state with certainty that a partitioning strategy for CISS is optimal.

That said, some guidance can be derived. So, I did that. Here's what I've got.

We're trying to optimize the following system:

minimize sum rj
where B ≥ sum (nj-rjj2
and rj ≥ 1

Well, duh, that's easy. Find the critical variance, σ* where you want to sample everything greater and nothing less. Just set rj=nj for all the σj2 > σ* and rj=1 when σj2 < σ*. The only statum that gets a partial sample is the critical one where σj2 = σ*.

That's all well and good but, if we knew the distribution of the block sums in advance, we wouldn't be sampling in the first place. Since we have to estimate σj2, things get a bit more complicated.

First off, note that the bias on the estimate is O(1/h), where h is the number of hits (sampled blocks with a non-zero sum) we've had on the statum. If we assume that the number of hits is proportional to the number of samples (that's a bogus assumption, by the way), we get

E(rj) = q σj2 / σ*

Where q is some constant that will vary by the query. Roughly. And that's just for the strata where σj2 < σ*. If σj2 > σ*, it's reasonable to assume the entire stratum will be sampled since the bias is high.

So, while we have no closed form expressions, there are some principals that are clear. First off, more strata means that the σj2 go down. That's offset by the fact that you have to sample blocks even from strata that have very low variances to get rid of the bias. So, you really want the gaps between σj2 to be large enough that the estimates for the lower ones can slide underneath σ* as quickly as possible.

Second, this only pays off if you can pack lots of blocks into a few strata with very low variance. With heavy-tailed distributions, that means really going after the tails. Recall from the chart from last week that the inner quartiles of Cauchy were every bit as tight as Normal; it was the tails that needed to be chopped up.

So, the guidance I'm going to suggest is go with the σj2 = 1/nj heuristic and tune from there. Not the big-time math result my adviser was hoping for. Sigh...

No comments:

Post a Comment