Thursday, January 18, 2018

The new normal

The Central Limit Theorem (CLT) is exceedingly well known. Even people who don't know the details are often familiar with the general idea: if you average enough things together, you get a "bell" curve. However, the devil is in the details, and before invoking the theorem in a formal setting, we need to make sure we're not breaking any of its rules.

The most important rule is that the variance of the observations is finite. People often wave their hands on this one because, outside of financial data and some weird stuff that particle physics folks deal with, it's almost always true. We're using financial data and it's definitely heavy-tailed, but the more I look at it, the more I'm convinced that we do have a finite second moment.

Slightly less important is that the observations be identically distributed. As long as they have the same mean and variance, the actual distribution can vary without messing things up too badly. In our case, the individual observations are really coming from a mixture distribution, so the moments aren't equal. But, because we're averaging block sums rather than the individual observations, we get quite a bit of smoothing. Given our block size, it's reasonable to claim that the distributions of the block sums are pretty similar. (Ultimately, we might want to look at distributions by partition, but we're holding off on that for now.)

Averaging block sums also helps with the fact that we are in flagrant violation of the independence assumption. As we saw yesterday, the correlation makes the distribution of the block sums much different from what it would be if the observations were truly pulled at random. However, from one block to the next, the correlation is very weak, especially if we are sampling blocks randomly. So, we're on reasonably stable footing with the assumptions.
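To make the independence point concrete, here's a rough sketch on made-up data (not our actual financial data). The partition lengths, the 50/50 hit rate, and the block size of 200 are all assumptions chosen for illustration. Every row in a partition shares the same value, so neighboring rows are heavily correlated, while block sums taken in random order should show almost none.

```python
import random
from statistics import fmean

def lag1_corr(xs):
    """Lag-1 autocorrelation: how strongly each value tracks the one before it."""
    m = fmean(xs)
    num = sum((a - m) * (b - m) for a, b in zip(xs, xs[1:]))
    den = sum((a - m) ** 2 for a in xs)
    return num / den

rng = random.Random(7)

# Hypothetical data: partitions of random length (mean 20 rows) where
# every row in a partition is either 1.0 (hit) or 0.0 (miss).
n_rows = 400_000
rows = []
while len(rows) < n_rows:
    length = rng.randint(1, 39)
    value = 1.0 if rng.random() < 0.5 else 0.0
    rows.extend([value] * length)
rows = rows[:n_rows]

# Sum fixed-size blocks of consecutive rows (block size is an assumption).
block_size = 200
block_sums = [sum(rows[i:i + block_size]) for i in range(0, n_rows, block_size)]

# Visiting the blocks in random order mimics sampling blocks at random.
shuffled_sums = block_sums[:]
rng.shuffle(shuffled_sums)

row_corr = lag1_corr(rows)
adjacent_block_corr = lag1_corr(block_sums)
random_block_corr = lag1_corr(shuffled_sums)

print("adjacent rows:        ", round(row_corr, 3))
print("adjacent block sums:  ", round(adjacent_block_corr, 3))
print("randomly ordered sums:", round(random_block_corr, 3))
```

In this toy setup the blocks are much longer than the partitions, which is what keeps the adjacent-block correlation small. With blocks closer to the partition length, neighboring blocks share more, and random sampling has to do more of the work.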

That leaves us with the question of how long the convergence takes. When our block size and average partition length are roughly equal, we have the opposite of the normal distribution: nearly all the values are pushed to the endpoints. So, how long do we have to run this thing before we can start applying the CLT to produce a confidence interval?

Happily, not very long. Simply summing two random blocks together gets us to a more or less unimodal distribution (though there's still a disconcerting spike at zero). The distribution predicted by the Central Limit Theorem is overlaid as the solid line.


By five blocks, things are looking much better and by twenty, the fit is just about perfect.
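For anyone who wants to try this kind of check without the real data, here's a minimal sketch on simulated block sums. Everything about the data is an assumption: partition lengths are uniform with mean 100, each partition is all hits or all misses, and the block size is 100 rows, so the single-block distribution piles up at the endpoints. Instead of eyeballing a histogram, it measures the largest gap between the empirical CDF of the k-block sums and the normal CDF the CLT predicts.

```python
import random
from statistics import NormalDist, fmean, pvariance

def make_block_sums(n_blocks, block_size, mean_partition_len, hit_prob, rng):
    """Hypothetical data: each partition is all 1.0 or all 0.0; sum fixed-size blocks."""
    total = n_blocks * block_size
    rows = []
    while len(rows) < total:
        length = rng.randint(1, 2 * mean_partition_len - 1)  # mean = mean_partition_len
        value = 1.0 if rng.random() < hit_prob else 0.0
        rows.extend([value] * length)
    rows = rows[:total]
    return [sum(rows[i:i + block_size]) for i in range(0, total, block_size)]

def cdf_gap(samples, dist):
    """Largest vertical distance between the empirical CDF and dist's CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = dist.cdf(x)
        gap = max(gap, abs(f - i / n), abs((i + 1) / n - f))
    return gap

rng = random.Random(2018)
block_sums = make_block_sums(5000, 100, 100, 0.5, rng)
mu = fmean(block_sums)
var = pvariance(block_sums, mu)

gaps = {}
for k in (1, 2, 5, 20):
    # Sum k blocks drawn at random (with replacement), many times over.
    totals = [sum(rng.choices(block_sums, k=k)) for _ in range(4000)]
    # CLT prediction for a sum of k blocks: mean k*mu, variance k*var.
    clt = NormalDist(k * mu, (k * var) ** 0.5)
    gaps[k] = cdf_gap(totals, clt)
    print(f"k={k:2d}  max CDF gap = {gaps[k]:.3f}")
```

A smaller gap means a better fit. With only 4000 draws per k, sampling noise alone keeps the gap from reaching zero, so this shows the trend rather than proving convergence.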


So, that's one less thing to stress over. We'll always be sampling way more than 20 blocks. If I can get the variance right, the rest of the sampler should work just fine.
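For reference, the textbook CLT interval for the mean block sum looks something like the sketch below. It is not the sampler itself. It assumes the sampled blocks are independent and that the plain sample standard deviation is a good enough variance estimate, which is exactly the part still to be nailed down. The function name and the toy population are made up for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev

def clt_interval(block_sums, confidence=0.95):
    """Normal-approximation interval for the mean block sum.

    Assumes independent blocks and uses the sample standard deviation
    as the variance estimate.
    """
    n = len(block_sums)
    center = fmean(block_sums)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(block_sums) / n ** 0.5
    return center - half_width, center + half_width

rng = random.Random(18)

# Hypothetical population of block sums: most values sit at an endpoint
# (0 or 100), the rest fall somewhere in between.
population = [
    rng.choice((0.0, 100.0)) if rng.random() < 0.6 else rng.uniform(0.0, 100.0)
    for _ in range(10_000)
]

sample = rng.sample(population, 50)  # 50 randomly chosen blocks
low, high = clt_interval(sample)
print(f"sample mean = {fmean(sample):.1f}, 95% interval = ({low:.1f}, {high:.1f})")
```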

