Wednesday, January 17, 2018

When in doubt... (v2)

... simulate the hell out of it.

Machine apocalypse watchers got a big milestone in 2017 when the world Go champion was bested by a computer (and it wasn't a particularly close match). That, in and of itself, isn't super interesting. It's just more evidence that the difference between the computational power of machines and humans has become sufficiently small that a purpose-built machine can beat a general-purpose human at a highly cognitive task.

What is interesting is how it did it. Rather than the usual "min-max" method of game theory where you build these huge trees of "best" moves on both sides, the algorithm plotted out the next few steps and then just played several thousand completely random games from each of the resulting positions. This turns out to be a really good way of appraising a complicated situation. With enough simulations, you get a pretty good distribution of what to expect. This is, not coincidentally, how MCMC samplers work.
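
As a toy illustration of that "just play random games" idea (nothing like AlphaGo's actual machinery; Nim is used here purely as a stand-in game), a few lines of Python are enough to appraise a position by random playouts:

```python
import random

def random_playout(piles, player):
    """Play uniformly random Nim moves until the piles are empty; return the winner."""
    piles = list(piles)
    while True:
        i = random.choice([j for j, p in enumerate(piles) if p > 0])
        piles[i] -= random.randint(1, piles[i])   # take 1..piles[i] objects from pile i
        if not any(piles):
            return player          # whoever just moved took the last object and wins
        player = 1 - player

def appraise(piles, n_games=5000):
    """Rough appraisal: how often does player 0 win if both sides play at random?"""
    wins = sum(random_playout(piles, player=0) == 0 for _ in range(n_games))
    return wins / n_games

print(appraise((3, 4, 5)))   # a crude estimate of player 0's chances under random play
```

The estimate is nowhere near the perfect-play value, of course; the point is only that thousands of random continuations give you a usable distribution over outcomes.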

So, faced with the fact that I haven't been able to round up a paper that simply tells me what the variance of a block sum is when the observations are correlated, I have set about deriving it. Whenever you're deriving something, it's helpful to know the answer ahead of time. That's why textbook questions generally state derivation problems as "prove the following" rather than "what's this equal to?"

So, to get that, I wrote a quick simulator that generates blocks of correlated data. Since it's my general conviction that the underlying distribution doesn't matter much aside from the moments, I just went with the easy one and generated normal random variables with a mean of 2 and variance of 1. If we generate a block of 1000 independent, identically distributed (iid) observations, theory tells us we should expect the mean and variance of the sums to be 1000 times the mean and variance of the individual observations. I generated 1000 such blocks and got almost exactly that. The sample mean was 2001 and the sample variance was 999. Also in line with theory, the histogram looks an awful lot like the familiar normal bell curve.
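
Something along these lines reproduces the iid experiment (a minimal sketch, not the full simulator):

```python
import numpy as np

rng = np.random.default_rng(17)   # arbitrary seed
N_BLOCKS, BLOCK_LEN, MU, SIGMA = 1000, 1000, 2.0, 1.0

# 1000 blocks of 1000 iid N(2,1) observations; one sum per block.
block_sums = rng.normal(MU, SIGMA, size=(N_BLOCKS, BLOCK_LEN)).sum(axis=1)

print(block_sums.mean())        # theory says 1000 * 2 = 2000
print(block_sums.var(ddof=1))   # theory says 1000 * 1 = 1000
```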


If we now say that only a fraction of the rows in the block are selected by the query, the variance rises, since the excluded rows contribute zeros and zero sits out in the tail of the N(2,1) distribution. With 40% of the rows selected (on average), the variance rises to 1302 and the mean drops to 801.
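
In the sketch, selecting rows is just an independent coin flip per row, with unselected rows contributing zero to the sum:

```python
import numpy as np

rng = np.random.default_rng(17)   # arbitrary seed
N_BLOCKS, BLOCK_LEN, MU, SIGMA, P_SELECT = 1000, 1000, 2.0, 1.0, 0.4

values = rng.normal(MU, SIGMA, size=(N_BLOCKS, BLOCK_LEN))
selected = rng.random(size=(N_BLOCKS, BLOCK_LEN)) < P_SELECT   # keep each row w.p. 0.4
block_sums = (values * selected).sum(axis=1)

print(block_sums.mean())        # theory says 1000 * 0.4 * 2 = 800
print(block_sums.var(ddof=1))   # noticeably above the full-inclusion ~1000
```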


Of course, if we go far enough down this path, the zeros become the bulk of the distribution and the selected values are the ones sitting off by themselves. That causes the variance to drop again. At 10% selectivity, the variance is less than half the full-inclusion case.
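
For the iid case we can check those numbers in closed form. Assuming each row's contribution is B*Y, with B an independent Bernoulli(p) selection indicator and Y ~ N(2,1), the per-row variance is p*(sigma^2 + mu^2) - (p*mu)^2, and a 1000-row block just scales that by 1000:

```python
# Var(B*Y) = p*(sigma^2 + mu^2) - (p*mu)^2 when B ~ Bernoulli(p) is independent of Y.
mu, sigma2, n = 2.0, 1.0, 1000
for p in (1.0, 0.4, 0.1):
    print(p, n * (p * (sigma2 + mu**2) - (p * mu) ** 2))
# roughly 1000, 1360, and 460
```

Those values are in line with the simulated numbers above, and at 10% the block variance is indeed less than half the full-inclusion case.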


OK, I didn't need to run a simulation to know all that. Sums and variances of iid sequences are generally pretty easy to derive. But, having convinced ourselves that the simulation matches theory, let's look at what happens when we correlate the variables. There are lots of parameters we could mess with, and I'll get to them all as I derive my results, but for demonstration purposes, let's keep it simple and say that we have two partitions, each equally likely. One of the partitions returns no rows for the query; the other returns 80%. Over the entire data set, this will look just like the iid 40% case. However, we now generate blocks where, if we are in one partition, we tend to stay there. On average, we go half a block length before switching. The mean of the block sums should not be affected by this. However, the variance will get cranked way up, because we'll have a lot of blocks with no rows selected and about the same number of blocks where 80% of the rows are selected. Let's take a peek:
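
Concretely, a generator matching that description can be sketched as a two-state Markov chain with an expected run length of half a block (the real simulator has more parameters to mess with, but this captures the setup):

```python
import numpy as np

rng = np.random.default_rng(17)   # arbitrary seed
N_BLOCKS, BLOCK_LEN, MU, SIGMA = 1000, 1000, 2.0, 1.0
HIT_RATE = (0.0, 0.8)   # partition 0 returns no rows; partition 1 returns 80%

def correlated_block_sum(avg_run_len):
    """Sum one block where the partition follows a two-state Markov chain."""
    switch_p = 1.0 / avg_run_len          # per-row chance of jumping to the other partition
    state = rng.integers(2)               # each partition equally likely at the start
    total = 0.0
    for _ in range(BLOCK_LEN):
        if rng.random() < switch_p:
            state = 1 - state
        if rng.random() < HIT_RATE[state]:   # is this row selected by the query?
            total += rng.normal(MU, SIGMA)
    return total

# Expected partition length = half a block.
sums = np.array([correlated_block_sum(BLOCK_LEN / 2) for _ in range(N_BLOCKS)])
print(sums.mean())        # still ~800 on average
print(sums.var(ddof=1))   # far larger than the iid 40% case
```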


Well, the variance sure did go up! The sample mean is 784, pretty close to the true mean of 800, but the spread is completely different from the iid case. The bell curve is pretty much non-existent, and there's a big spike on the left from blocks with a sum of zero. Another bump on the right comes from blocks where nearly all the rows came from the 80% partition.

And we're just getting warmed up. What happens if the partition length is, on average, as long as or twice as long as the block length?
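
In the sketch above, a longer partition is just a larger expected run length (reusing correlated_block_sum from the previous snippet):

```python
# One block length and two block lengths of expected partition run.
for avg_run_len in (BLOCK_LEN, 2 * BLOCK_LEN):
    sums = np.array([correlated_block_sum(avg_run_len) for _ in range(N_BLOCKS)])
    print(avg_run_len, sums.var(ddof=1))
```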


Pretty much what you might expect. The sums are all clustered around the endpoints. You're either getting blocks with 80% of the rows included or blocks with none included. The middle is all but gone.

This partitioning example is particularly inflationary for two reasons. First, the hit rates are very different and there aren't any in-between partitions to smooth things out. Second, zero is nowhere near the center of the distribution of the measures; it's two standard deviations away from the mean. In reality, the variance won't go up that much due to correlation.
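
To put a rough number on "inflationary": in the limit where every block lands entirely inside one partition, the law of total variance puts the block-sum variance around 640,000, versus roughly 1,360 under independent 40% selection (a quick back-of-the-envelope check, not a result from the simulator):

```python
# Limiting case: each block is wholly in partition A (sum = 0) or partition B,
# each with probability 1/2.
mean_b = 1000 * 0.8 * 2                              # 1600: mean block sum in partition B
var_b = 1000 * (0.8 * (1 + 4) - (0.8 * 2) ** 2)      # 1440: block-sum variance in partition B
within = 0.5 * 0 + 0.5 * var_b                       # expected within-partition variance
between = 0.5 * 0**2 + 0.5 * mean_b**2 - (0.5 * mean_b) ** 2   # variance of the partition means
print(within + between)                              # ~640,720
```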

The point, however, is that I now have a nice easy way to check my calculations as I try to put some closed formulas around my variance estimates.
