Saturday, March 31, 2018

Searching for the lost

The lost piece of my variance formula, that is. (Those who know me personally know that I take this whole Jesus died and rose again thing fairly seriously. That doesn't mean, however, that one can never make a lighthearted comment about it. If you disagree with that statement, you might want to skip both today's and tomorrow's posts.)

Anyway, we're told that between dying on Friday and rising sometime early Sunday, Christ hung out in hell. Exactly what he did there is a point that some folks like to debate. Some claim he went there looking for lost souls. Whatever he was up to, it probably sucked worse than my last two days, which I've spent looking for the part of my variance that gets lost when we approximate the vector of hit rates.

But, maybe not by much. Searching for computation errors is pretty much my version of hell.

The formula worked fine when calculated directly and I wasn't seeing anything wrong with the proof.

Substituting in the moments for a random variable is a pretty standard trick, so I was really miffed that the formula was not only off, it was outright wrong. Breaking the blocks into smaller partitions was increasing the computed variance, when the opposite should be true.

So I spent yesterday afternoon and evening going through the derivations really carefully to see what was missed. I did find a reversed sign (as I've noted before, I do that a lot), but nothing that explained the real problem.

So today I did what I only do as a last resort because I'm really bad at computation. I sat down and started working through the whole thing by hand. Well, just in time for Easter, I've found the problem. And, it was pretty insidious.

The formula is actually right, but it contains the fairly innocuous-looking term:



The nasty bit is that that term is almost always zero when you are computing each partition individually. That's because most batches don't span more than two blocks, so the variance of any particular block partition given a batch m will be zero because there's at most one entry. When you blend that, you wind up with



and that guy is almost never zero because now you're looking at the variance of partition sizes across all blocks. So, the more you chop up the data, the bigger the approximation gets even though the true value is going down.
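
To make the difference concrete, here's a toy sketch in Python. The numbers and names are made up for illustration and this is not my actual sampler code; it only assumes that each batch contributes a single partition to a block. The variance within each batch is zero, while the variance of the same sizes pooled across batches is not.

```python
from statistics import pvariance

# Hypothetical partition sizes keyed by batch id (made-up numbers).
# Each batch contributes a single partition, so one entry per batch.
sizes_by_batch = {"m1": [120], "m2": [35], "m3": [410], "m4": [75]}


def mean_within_batch_variance(groups):
    """Average of the per-batch (conditional) variances."""
    variances = [pvariance(values) for values in groups.values()]
    return sum(variances) / len(variances)


def pooled_variance(groups):
    """Variance of all partition sizes with the batch label ignored."""
    pooled = [x for values in groups.values() for x in values]
    return pvariance(pooled)


within = mean_within_batch_variance(sizes_by_batch)
pooled = pooled_variance(sizes_by_batch)

# The first number is zero; the second is strictly positive.
print(within, pooled)
```

Swapping the first quantity for the second is the kind of substitution that looks harmless on paper and quietly adds variance that isn't really there.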

The fix is easy enough - just drop the term. Like I said, it's almost always zero and even when it's not, it's not very big compared to the other terms in the equation. Having done that, my sampler quickly rose from the dead and began producing some decent results. Sorry, no graphs yet, but at least I've got stuff to work with again. And, I got the paper updated so my adviser can continue to review it.

I can now go to Easter Vigil and think about what really matters. I'll pick up my research again tomorrow evening.
