never2old4school: Optimal partitioning

Tuesday, October 10, 2017

Optimal partitioning

I may have to punt on this one. There are just too many variables to state with certainty that a partitioning strategy for CISS is optimal.

That said, some guidance can be derived. So, I did that. Here's what I've got.

We're trying to optimize the following system:

minimize sum r_j
where B ≥ sum (n_j-r_j)σ_j²
and r_j ≥ 1

Well, duh, that's easy. Find the critical variance, σ* where you want to sample everything greater and nothing less. Just set r_j=n_j for all the σ_j² > σ* and r_j=1 when σ_j² < σ*. The only statum that gets a partial sample is the critical one where σ_j² = σ*.

That's all well and good but, if we knew the distribution of the block sums in advance, we wouldn't be sampling in the first place. Since we have to estimate σ_j², things get a bit more complicated.

First off, note that the bias on the estimate is O(1/h), where h is the number of hits (sampled blocks with a non-zero sum) we've had on the statum. If we assume that the number of hits is proportional to the number of samples (that's a bogus assumption, by the way), we get

E(r_j) = q σ_j² / σ*

Where q is some constant that will vary by the query. Roughly. And that's just for the strata where σ_j² < σ*. If σ_j² > σ*, it's reasonable to assume the entire stratum will be sampled since the bias is high.

So, while we have no closed form expressions, there are some principals that are clear. First off, more strata means that the σ_j² go down. That's offset by the fact that you have to sample blocks even from strata that have very low variances to get rid of the bias. So, you really want the gaps between σ_j² to be large enough that the estimates for the lower ones can slide underneath σ* as quickly as possible.

Second, this only pays off if you can pack lots of blocks into a few strata with very low variance. With heavy-tailed distributions, that means really going after the tails. Recall from the chart from last week that the inner quartiles of Cauchy were every bit as tight as Normal; it was the tails that needed to be chopped up.

So, the guidance I'm going to suggest is go with the σ_j² = 1/n_j heuristic and tune from there. Not the big-time math result my adviser was hoping for. Sigh...

never2old4school

Tuesday, October 10, 2017

Optimal partitioning

No comments:

Post a Comment