Monday, May 28, 2018

Actual contestants in the comparo

As I've noted before, BMH and BLB are train wrecks when the data is not independent. While I certainly want to cite those authors, I don't think it's fair to say that their algorithms are really what I'm competing against; I've had to modify them too heavily to get them to work in the blocked sampling case.

So, who's actually in this contest? And, by the way, what is the contest? The second question has an easy answer: find the best estimator of the block sum variance. The actual estimator of the sum is the same for all of these; what we want to know is how good that sum estimate is. The "best" variance estimator isn't necessarily the one that comes closest to the real value (though that's a good start). The correlation with the sum matters, too. If the variance estimator and the block sum are positively correlated (they usually are), then you get too many rejections when both come out low. So, it's entirely possible to have an estimator that's quite accurate but still produces bad confidence intervals too often (the sample block variance fails on this count).
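To see why that correlation matters, here's a toy simulation. It has nothing to do with the project's data or code; the block counts, hit rate, and measure distribution below are all made up, but they produce the kind of skewed block sums where the problem shows up:

```python
# Toy illustration (not the project's code): on skewed block sums, the trials
# where the sample mean comes out low are the same trials where the sample
# variance comes out low, so the nominal 95% interval misses too often.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(42)
n_trials, n_blocks, rows_per_block = 10_000, 10, 1_000
hit_rate, mean_measure = 0.002, 100.0          # made-up numbers

true_block_mean = rows_per_block * hit_rate * mean_measure
crit = t.ppf(0.975, df=n_blocks - 1)           # proper t critical value
misses = 0

for _ in range(n_trials):
    # number of hits per block, then exponential measures on the hits
    hits = rng.binomial(rows_per_block, hit_rate, size=n_blocks)
    block_sums = np.array([rng.exponential(mean_measure, h).sum() for h in hits])
    est_mean = block_sums.mean()
    est_se = block_sums.std(ddof=1) / np.sqrt(n_blocks)
    if abs(est_mean - true_block_mean) > crit * est_se:
        misses += 1

print(f"nominal 5% miss rate, observed {misses / n_trials:.3f}")
```

Even with the correct t critical value, the observed miss rate lands above the nominal 5%, and the misses pile up on the trials where the sum came out low: a low sum drags the variance estimate down with it, which is exactly the failure mode described above.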

  1. Sample Block Variance: You want to know the variance of a population and you have a sample from that population? Well, just use the sample variance. It's easy to compute, unbiased, and has minimum variance. Too bad it sucks so badly in this case. The problem, as just mentioned, is that it is very strongly correlated with the sample sum, especially at small sample sizes. Any stopping rule based on this estimator will stop too early too often.
  2. Rho Variance: Rather than use the block sum sample variance, we derive the block sum variance as a function of the hit rate variance. We then use the sample variance of the hit rates and plug it into this function. Works great when the variance of the hit rate is really what's driving the block sum variance. Not so much otherwise.
  3. Adjusted Rho Variance: Part of the problem with the above method is that there will be variation in the hit rate from one block to the next even if the underlying parameter is constant. So, we compute the variance in the hit rate that we would expect to see if it really were constant and subtract that out (variances are nice that way; you can decompose them into pieces that can be added or subtracted as needed). This adjustment makes the method work well both in the above case and when things really are distributed uniformly. It still doesn't help when the underlying distribution is changing, which is to be expected since the method doesn't even look at that. (There's a sketch of both the plain and adjusted versions after the list.)
  4. Partition Bootstrap: To try to account for partitions that vary in both hit rate and distribution of the measures, we use a bootstrap: resample the existing partitions to create blocks, then look at the variance of the empirical distribution of those blocks (there's a sketch after the list). I'm not entirely sure why this one doesn't work better, other than the usual problem of correlation between the sample sum and the variance estimate. At any rate, it's actually pretty bad across the board.
  5. Metropolis-Hastings: Here, we try to construct the distribution of the entire population by estimating its parameters with an MCMC chain. We then sample points from that distribution and feed them into the exact formula for the block sum variance. I haven't finished validating my work on this one, but it appears to work fine. The only problem is that it's rather slow, because you can't parallelize the MCMC chain (well, you can, but the results aren't as good). This algorithm borrows the idea from BMH that it's OK to run the chain on just a sample rather than the full data set (a very rough sketch follows the list). There's also a variant I'm testing, suggested by the BMH authors, where you use a different subset of the data for each iteration of the chain.
  6. Full Kernel Estimator: Rather than generate an empirical distribution with an MCMC chain, we generate one using kernel techniques (sketched after the list). The resulting distribution is very similar to the MCMC distribution, as are the generally good results. It's still not as fast as the Adjusted Rho Variance, but it handles pretty much any case, and the kernel can easily be biased in the early going to avoid the problems of correlation with the sum estimator at low sample sizes. (I didn't find this to be necessary, but it's a nice feature to have.)
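Before the verdict, here are some rough sketches of how a few of these could look in code. None of this is the project's actual code: the model (blocks of n rows, a per-block hit rate, hits carrying a measure with mean mu and variance sigma2), the function names, and the parameters are all assumptions made purely for illustration. The baseline from item 1 is just the sample variance of the observed block sums, `np.var(block_sums, ddof=1)`; items 2 and 3 instead work from the hit rates:

```python
# Sketch of the Rho Variance idea (item 2) and its adjusted form (item 3),
# under an assumed compound model: each block has n_rows rows, each row is a
# hit with the block's rate, and each hit contributes a measure with mean mu
# and variance sigma2.  Names and model are illustrative, not the real code.
import numpy as np

def rho_variance(hit_rates, n_rows, mu, sigma2, adjusted=False):
    """Estimate the block-sum variance from the observed per-block hit rates."""
    hit_rates = np.asarray(hit_rates, dtype=float)
    rho_bar = hit_rates.mean()
    var_rho_hat = hit_rates.var(ddof=1)              # observed hit-rate variance

    if adjusted:
        # Subtract the sampling noise we'd see even if the rate were constant;
        # variances decompose, so this is a plain subtraction (item 3).
        sampling_noise = np.mean(hit_rates * (1.0 - hit_rates)) / n_rows
        var_rho = max(var_rho_hat - sampling_noise, 0.0)
    else:
        var_rho = var_rho_hat                        # item 2: plug it in as-is

    # Law of total variance for the block sum S:
    #   Var(S) = E[Var(S | rho)] + Var(E[S | rho])
    within = n_rows * rho_bar * sigma2 + n_rows * np.mean(hit_rates * (1.0 - hit_rates)) * mu**2
    between = (n_rows * mu) ** 2 * var_rho
    return within + between

# e.g. rho_variance([0.011, 0.008, 0.013], n_rows=1_000, mu=100.0, sigma2=100.0**2, adjusted=True)
```

The `adjusted` branch is the whole difference between items 2 and 3: take out the hit-rate noise you'd see even with a constant rate, so that only the real block-to-block variation feeds the between-block term.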
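Item 4 is easier to sketch, because the bootstrap doesn't need a model at all. The only assumptions here are the interfaces: a list of per-partition measure arrays and a count of partitions per block, both names I made up:

```python
# Sketch of the partition bootstrap (item 4): resample whole partitions with
# replacement, glue them into synthetic blocks, and take the variance of the
# resulting block sums.  Interface names are illustrative.
import numpy as np

def partition_bootstrap_variance(partitions, partitions_per_block, n_boot=1_000, rng=None):
    """Bootstrap the block-sum variance by resampling whole partitions."""
    rng = np.random.default_rng(rng)
    part_sums = np.array([np.asarray(p).sum() for p in partitions])
    boot_block_sums = np.empty(n_boot)
    for i in range(n_boot):
        picks = rng.integers(0, len(part_sums), size=partitions_per_block)
        boot_block_sums[i] = part_sums[picks].sum()
    return boot_block_sums.var(ddof=1)
```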
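Item 5 is the hardest one to sketch honestly, because the real estimator models the whole population rather than a single parameter. The toy version below collapses everything down to one hit rate, so it ignores exactly the block-to-block variation this whole exercise is about; it is only meant to show the mechanics of running the chain on a subsample (the idea borrowed from BMH) and pushing the posterior draws through a closed-form variance:

```python
# Very rough sketch of item 5: Metropolis-Hastings on a subsample of the 0/1
# hit indicators, flat prior on the hit rate, then each posterior draw is fed
# into a closed-form block-sum variance (the toy compound model from above).
# The real estimator models far more than a single rate; this is illustrative.
import numpy as np

def mh_block_variance(hits, n_rows, mu, sigma2, subsample=2_000, n_iter=5_000, rng=None):
    rng = np.random.default_rng(rng)
    sample = rng.choice(np.asarray(hits), size=min(subsample, len(hits)), replace=False)
    k, m = sample.sum(), len(sample)

    def log_post(rho):
        if not 0.0 < rho < 1.0:
            return -np.inf
        return k * np.log(rho) + (m - k) * np.log(1.0 - rho)   # flat prior

    rho = k / m if 0 < k < m else 0.5
    draws = []
    for _ in range(n_iter):
        prop = rho + rng.normal(0.0, 0.01)                     # random-walk proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(rho):
            rho = prop
        draws.append(rho)
    draws = np.array(draws[n_iter // 2:])                      # drop burn-in

    # within-block variance of the sum for each posterior draw, averaged
    var_s = n_rows * draws * sigma2 + n_rows * draws * (1.0 - draws) * mu**2
    return var_s.mean()
```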
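And a similarly stripped-down take on item 6: fit a kernel density to the observed row measures, resample it to build synthetic blocks, and take the variance of the synthetic block sums. The real estimator feeds the kernel distribution through an exact formula and can bias the kernel in the early going; neither of those refinements appears here:

```python
# Sketch of the kernel idea (item 6): a Gaussian KDE over the observed row
# measures stands in for the population, and synthetic blocks drawn from it
# give an empirical block-sum variance.  Again, names and shapes are mine.
import numpy as np
from scipy.stats import gaussian_kde

def kernel_block_variance(measures, rows_per_block, n_blocks=1_000):
    kde = gaussian_kde(np.asarray(measures, dtype=float))
    draws = kde.resample(rows_per_block * n_blocks)[0]          # shape (1, N) -> (N,)
    block_sums = draws.reshape(n_blocks, rows_per_block).sum(axis=1)
    return block_sums.var(ddof=1)
```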
So, 3 and 6 are the winners, though I'm pretty sure 4 would work fine if I put more effort into it. As 3 and 6 are the most original methods, I'm not really worried that 4 isn't doing great. If somebody else wants to publish an improvement, they are free to do so.
