Friday, September 22, 2017

CISS variance

The few who have been reading this for over a year will recall that getting the variance of the CISS estimate right was no small thing. I was reviewing that with my adviser today when it occurred to me that I haven't really made a very good case for why I'm approaching it the way I am.

Normally, the variance of a statistic is based on the data collected. This makes sense when viewing the sample as coming from an infinite population with a finite second moment. However, in the case of CISS, the first assertion is never true and the second is dubious at best. So, instead we consider the fact that, if we kept sampling until the entire population was exhausted, the CISS estimator would be exactly equal to the value we are estimating. That is, the CISS estimator X_n (where n is how many blocks we've read) converges almost surely to the value we're after.

The question, of course, is: how far off is it? As noted, the estimator isn't even unbiased in the early going (though empirical evidence suggests it becomes so quite quickly). To estimate how far off it is, we look at the error Y = |X_n - X|, where X is the true value. Y converges almost surely to zero, so an estimate of the variance of the error, measured about that limit of zero, is just E(Y^2).
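To make that concrete, here is a minimal sketch in Python. It is not CISS itself: it assumes a plain expansion estimator (N/n times the sampled sum) over a made-up finite population of block totals, and the function name and parameters are hypothetical. It only illustrates the two points above: E(Y^2) shrinks as more blocks are read, and it is exactly zero once every block has been read.

```python
import random

def mean_squared_error_of_estimate(population, n, trials=2000, seed=0):
    """Monte Carlo estimate of E(Y^2), where Y = |X_n - X|.

    Assumption: X_n is a simple expansion estimator of the population
    total, standing in for the real CISS estimator.
    """
    rng = random.Random(seed)
    N = len(population)
    true_total = sum(population)
    total_sq_err = 0.0
    for _ in range(trials):
        # Read n blocks without replacement from the finite population.
        sample = rng.sample(population, n)
        estimate = N / n * sum(sample)
        total_sq_err += (estimate - true_total) ** 2
    return total_sq_err / trials

# Hypothetical population: 50 blocks with integer block totals.
rng = random.Random(1)
population = [rng.randint(0, 100) for _ in range(50)]

errors = {n: mean_squared_error_of_estimate(population, n) for n in (5, 25, 50)}
for n, err in errors.items():
    print(f"n={n:2d}  estimated E(Y^2) = {err:.1f}")
```

The last line of output is the interesting one: at n = 50 the sample is the whole population, so the error vanishes no matter what the block values look like. A variance computed only from the data already read would not capture that.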

Computing that directly is problematic, but computing it for each stratum is less so. Then, it's a matter of choosing a good prior and summing the variance for each stratum as I did here.
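The summing step can be sketched as follows. To be clear about what is swapped in: this uses the classical finite-population correction for simple random sampling within each stratum, not the prior-based estimate described above, and it assumes strata are sampled independently so that their variances add. The function names and the example numbers are hypothetical.

```python
from statistics import variance

def stratum_error_variance(sampled, stratum_size):
    """Variance of the error in one stratum's estimated total.

    Assumption: classical result N^2 * (1 - n/N) * s^2 / n for sampling
    without replacement. Needs at least two sampled blocks unless the
    stratum has been read completely.
    """
    n = len(sampled)
    if n >= stratum_size:
        return 0.0  # Nothing left unsampled, so no remaining error.
    fpc = 1 - n / stratum_size  # Fraction of the stratum not yet read.
    return stratum_size ** 2 * fpc * variance(sampled) / n

def total_error_variance(strata):
    """Sum per-stratum variances, assuming independence between strata."""
    return sum(stratum_error_variance(s, size) for s, size in strata)

# Each entry: (block totals read so far, number of blocks in the stratum).
strata = [
    ([4, 6, 5, 7], 10),
    ([40, 55, 38], 3),   # fully read: contributes zero
    ([100, 140], 20),
]

total = total_error_variance(strata)
print(f"estimated variance of the error: {total:.1f}")
```

Note that the fully read stratum contributes nothing, while the barely sampled one dominates the total. That is the sense in which the variance is driven by what has not been sampled.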

Anyway, that's all been done, but I probably should put a paragraph or two in the introduction at least explaining why I'm computing the variance of what I haven't sampled rather than the variance of what I have.
