Saturday, September 23, 2017

Reframing the problem (again)

I was starting to write up some more explanation of the underpinnings of CISS when it occurred to me that I might be framing this whole thing wrong. It's not that my specific problem isn't valid; I really do have an incredibly unstable distribution. It's just that, even if that wasn't true, stratifying the sample will still get better results when the population is finite. That last bit is really, really important and I haven't been giving it enough ink. So, here's another go at what we're trying to accomplish (again, the notation is a bit rough; I'll typeset it later).

Let {Xi} be a sequence of observations drawn without replacement from a finite set X where ||X||=n. Let Xr be the first r elements of {Xi}. Suppose we wish to estimate Y = sum(Xi) using just Xr. The Minimum Variance Unbiased Estimator (MVUE) is simply the sum so far divided by the percentage of the population sampled: Yr = (n/r)sum(Xi, i=1...r).

Because the population is finite and we are sampling without replacement, the error in this approximation can be estimated either from the data sampled so far, OR (this is the key to the whole line of reasoning) by looking at what's still out there:

Y - Yr | Xr = (1-n/r)sum(Xi, i=1...r) + sum(Xi, i=r+1...n)

When conditioned on Xr, the first term is a constant so the variance of our error is just the sum of the variances of the unsampled data:

Var(Y-Yr|Xr) = σ2(n-r) where σ2 = Var(Xi).

Note that this variance is independent of the observations taken so far. Therefore, there's no upside to doing anything other than random sampling.

Now suppose the set X is partitioned into k>1 stata. Let σj2 be the variance of observations in stratum j and nj indicate how many items of X are in stratum j. Also, let rj indicate how many items of Xr are in each stratum. Assuming that rj>0 for all j, the MVUE is now the sum of the estimates for each strata:

Yr = sum((nj/rj)sum(Xi Ik(Xi), i=1...r), j=1...k) where Ik() is an indicator of whether observation Xi is in stratum k.

The variance now becomes the sum of the variances for each stratum, which is dependent on which strata have been sampled the most:

Var(Y-Yr|Xr) =  sum(σj2(nj-rj), j=1...k).

The rub is that we don't actually know what the variance for each strata is. Since we have to estimate it, a random sampling approach is appropriate where we generally sample from the higher-variance stata, but also go after some of the other stata, because our estimate of their variance might change.

The reason I like this new framing is that there's no mention anywhere here of heavy-tailed distributions, missing higher moments, etc.. Those are still true with respect to my data set, but the algorithm actually has much broader usage. Any time you can partition your data such that the bulk of the data goes into strata that don't add a lot of uncertainty to the outcome, you'll get much faster convergence using this method.

No comments:

Post a Comment