Tuesday, March 22, 2016

Impact of correlation

I glossed over the whole independence problem yesterday, but it will need to be dealt with. Fortunately, independence (which I am sure does NOT hold) is not really the issue. What we care about is correlation. We'll start by claiming that the variance of the sum is, in fact, finite. This seems pretty safe since, even with the heavy tail, financial metrics are bounded by the total amount of money in the world, and the variance of any bounded random variable is finite. In that case, the variance of the sum is:

Var(ΣXi) = ΣΣCov(Xi, Xj) = ΣVar(Xi) + ΣΣCov(Xi, Xj)    where i <> j in the last term

So, if they're uncorrelated, this is all easy; the variance of the sum is just the sum of the variances (any introductory stats book will tell you this). Problem is, they're not.
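As a quick sanity check on the identity above, here is a minimal sketch in Python with NumPy. The 3x3 covariance matrix is made up purely for illustration, and the function names are mine, not part of the project. It splits the variance of the sum into the diagonal (variance) part and the off-diagonal (covariance) part, then confirms the same identity on simulated data.

```python
import numpy as np

def variance_of_sum(cov):
    """Var(sum X_i): the sum of every entry in the covariance matrix."""
    return float(np.asarray(cov, dtype=float).sum())

def split_variance_of_sum(cov):
    """Return (sum of variances, sum of covariances over i != j)."""
    cov = np.asarray(cov, dtype=float)
    diagonal = float(np.trace(cov))
    off_diagonal = float(cov.sum() - np.trace(cov))
    return diagonal, off_diagonal

# Hypothetical covariance matrix for three variables (illustration only).
example_cov = np.array([[ 4.0, -1.0,  0.5],
                        [-1.0,  9.0, -2.0],
                        [ 0.5, -2.0,  1.0]])

total = variance_of_sum(example_cov)
diagonal, off_diagonal = split_variance_of_sum(example_cov)

# The same identity holds for sample quantities: the sample variance of the
# row sums equals the sum of all entries of the sample covariance matrix.
rng = np.random.default_rng(0)
data = rng.multivariate_normal(np.zeros(3), example_cov, size=2000)
sample_var_of_sum = np.var(data.sum(axis=1), ddof=1)
sample_cov_total = np.cov(data, rowvar=False).sum()
```

If the off-diagonal part is zero, the total collapses to the sum of the variances, which is the easy case described above. Here the off-diagonal part is negative, so dropping it would overstate the variance of the sum.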

The value of θk (the proportion of rows included in a query from stratum k) varies from one block to the next, and not just randomly. There is real correlation within a block even after stratification. That's not too terrible to work around, since it really only affects the posterior variance of θk, not the expectation. We can still use the point estimate of θk as input to our posterior distribution for the stratum sum and get an unbiased estimate of the sum variance from that. Looking at the whole stratum, the values are not correlated (or, at least, I have no reason to believe they are). Across strata, however, they most definitely are. Remember that whole reversing entry thing? A bunch of detail rows get offset by one big reversing entry. This means that entries in one stratum tend to be correlated with a smaller number of larger entries in a higher stratum, with the sign reversed.

So, if we assume that we sample enough blocks that we have no correlation issues within a stratum, the variance for the stratum sum is:

Var(ΣXi) = ΣVar(Xi) = nkVar(Xk)     over all i in stratum k, where nk is the number of observations being summed and Var(Xk) is their common variance

and the formula above for the total sum becomes:

Var(ΣXi) = ΣnkVar(Xk) + ΣΣnknjCov(Xk, Xj)    where again k <> j in the last term
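To make the stratum-level formula concrete, here is a rough sketch in Python with NumPy. Everything in it is an assumption for illustration: the function name, the two strata, and all of the numbers are invented, and Cov(Xk, Xj) is treated as a single average covariance between one row in stratum k and one row in stratum j.

```python
import numpy as np

def total_sum_variance(n, var, cov):
    """Split Var(sum) into within-stratum and cross-stratum parts.

    n   : number of observations summed in each stratum (length K)
    var : per-observation variance in each stratum (length K)
    cov : K x K matrix of per-pair covariances between strata;
          the diagonal is ignored because k != j in the cross term
    """
    n = np.asarray(n, dtype=float)
    var = np.asarray(var, dtype=float)
    cov = np.asarray(cov, dtype=float)
    within = float(np.sum(n * var))            # sum over k of n_k * Var(X_k)
    weighted = np.outer(n, n) * cov            # n_k * n_j * Cov(X_k, X_j)
    cross = float(weighted.sum() - np.trace(weighted))  # keep only k != j
    return within, cross

# Made-up example: many small detail rows in a low stratum, and a few large
# reversing entries in a higher stratum that are negatively correlated.
n = [1000, 10]
var = [4.0, 400.0]
cov = [[0.0, -0.3],
       [-0.3, 0.0]]

within, cross = total_sum_variance(n, var, cov)
total = within + cross
```

With these invented numbers, a per-pair covariance that looks tiny next to the variances still produces a cross term of the same order as the within-stratum term, because it gets multiplied by nk times nj. That is the reason "close enough to zero" deserves a second look.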

I hope it's clear that the right hand side is now summing over strata, not individual observations. Putting the summation indices into HTML is more work than it's worth for a blog post.

So, the big question becomes: what is Cov(Xk, Xj)? For the moment, I'm going to say it's close enough to zero that I can leave it out. However, I'm actually a little pleased that's not the right answer. Everything else in this project has been merely the implementation of a somewhat clever algorithm. That's all well and good, but if I'm going to submit this for publication, it will help to have some actual math in there, too.
