Tuesday, March 29, 2016

E Pluribus Unum

Don't panic, I'm not switching this over to be a political blog. It did occur to me, though, that the nature of my data correlation allows for a nice trick that may simplify matters considerably. To review:

  • We want a weak prior on the proportion of values returned.
  • We want a somewhat stronger prior on the magnitude of those values.
  • Unless we adjust for correlation, the posterior on the proportion will be too tight because we'll give too much credence to the first data points.
  • Correlation is strong within a block.
  • Correlation is weak (nonexistent?) between blocks.
  • All we really care about is the sum of values out of a block (actually, we only care about the sum for the whole stratum, but we can only read one block at a time, so we have to put that sum somewhere).


So...

Why not simply ditch the individual rows altogether and just work with block sums? Those are very nicely distributed random variables:

P(X = x) = {1−θ for x = 0;  θ·U(0, nbₖ) for x ≠ 0}
where U(0, nbₖ) is uniform between zero and the stratum upper bound times the block size. (In the case of the negative strata, it will be U(nbₖ, 0), which flips the sign on the expected value but doesn't change the variance.) Thus,

E(X|θ) = μ(θ) = θnbₖ/2
E(X) = μ = ∫μ(θ)P(θ)dθ    where P(θ) is the posterior on θ

and

Var(X|θ) = σ(θ)² = E[(X − μ)²] = (1−θ)μ² + θ[(nbₖ)²/3 − nbₖμ + μ²]
Var(X) = σ² = ∫σ(θ)²P(θ)dθ

If the second term in Var(X|θ) looks weird, that's because we're looking at the variability around the posterior mean, E(X), rather than the mean of the uniform random variable. If you substitute nbₖ/2 for μ, you get the familiar uniform variance of (nbₖ)²/12 multiplied by θ for that term, but that puts us right back where we were last Friday.
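To make that concrete, here's a minimal sketch in Python (the names are mine, not CISS code; n stands in for the block size, b_k for the stratum upper bound) that draws block sums from the mixture above and checks the conditional variance formula against a Monte Carlo estimate:

```python
import numpy as np

def cond_mean(theta, n, b_k):
    """E(X|θ) = θ·nbₖ/2 for the mixture above."""
    return theta * n * b_k / 2.0

def cond_var(theta, n, b_k, mu):
    """Var(X|θ) about a fixed center μ:
    (1−θ)μ² + θ[(nbₖ)²/3 − nbₖ·μ + μ²]."""
    m = n * b_k
    return (1.0 - theta) * mu ** 2 + theta * (m ** 2 / 3.0 - m * mu + mu ** 2)

# Monte Carlo sanity check at a fixed θ, so the center μ is exactly μ(θ).
rng = np.random.default_rng(0)
theta, n, b_k = 0.3, 1000, 5.0
mu = cond_mean(theta, n, b_k)
draws = 500_000
nonzero = rng.random(draws) < theta   # block returns data w.p. θ
# Scaling a unit uniform by n·b_k also handles negative strata (nbₖ < 0).
sums = np.where(nonzero, n * b_k * rng.random(draws), 0.0)
print(np.mean((sums - mu) ** 2))      # empirical E[(X − μ)²|θ]
print(cond_var(theta, n, b_k, mu))    # closed form; should roughly agree
```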

Even with something as computationally convenient as the Beta distribution for θ, those integrals are going to be messy. In the early going, I can get by with a rough approximation. As the confidence interval gets closer to the goal, I'll need to be more accurate. Fortunately, the distribution of θ will tighten up as well, so I may not have to integrate over the entire domain.

Also note that the first integral is just a constant times θ times the density. The second isn't much worse: μ is fixed, so σ(θ)² collapses to an affine function of θ, namely μ² + θ[(nbₖ)²/3 − nbₖμ], and integrating an affine function against the density again comes down to E(θ); it's really not that bad.
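As a quick check on that reduction, here's a sketch assuming a Beta(α, β) posterior on θ with made-up parameters: the affine form gives both integrals in closed form through E(θ) = α/(α+β), and brute-force quadrature with SciPy agrees:

```python
from scipy.integrate import quad
from scipy.stats import beta

# Hypothetical stratum constants and Beta posterior parameters.
n, b_k = 1000, 5.0
m = n * b_k
a_post, b_post = 2.0, 8.0

# Affine shortcut: μ(θ) = θ·m/2 and σ(θ)² = μ² + θ[(m²/3) − mμ]
# are both affine in θ, so the integrals reduce to the Beta mean.
e_theta = a_post / (a_post + b_post)
mu = e_theta * m / 2.0
var = mu ** 2 + e_theta * (m ** 2 / 3.0 - m * mu)

# Brute-force quadrature against the Beta density as a cross-check.
mu_q, _ = quad(lambda t: (t * m / 2.0) * beta.pdf(t, a_post, b_post), 0, 1)
var_q, _ = quad(lambda t: ((1 - t) * mu ** 2
                           + t * (m ** 2 / 3.0 - m * mu + mu ** 2))
                          * beta.pdf(t, a_post, b_post), 0, 1)
print(mu, mu_q)    # should match
print(var, var_q)  # should match
```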

In exchange for the additional computation, this allows me to intentionally increase the correlation within a block, which I certainly intend to do when we get to dynamic partitioning. Knowing that a block can be skipped because its partition contains no data matching the query filter will more than pay for grinding out a numerical integral. And as long as I'm only concerned with the sum, I can still compute a confidence interval without a bunch of correlation adjustments.
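Here's a rough sketch of how I picture the interval computation (the normal approximation and every name here are placeholder assumptions, not settled CISS code): with blocks treated as independent, the stratum total is the observed block sums plus the predictive contribution of the unread blocks:

```python
import math

def stratum_interval(read_block_sums, n_unread, mu, var, z=1.96):
    """Sketch: stratum total = observed block sums + n_unread draws with
    posterior-predictive mean mu and variance var, with a plain
    normal-approximation interval.  (This treats unread blocks as
    uncorrelated; the coupling through shared θ is ignored here.)"""
    # A block skipped via partition metadata contributes an exact 0
    # to read_block_sums at no I/O cost.
    center = sum(read_block_sums) + n_unread * mu
    half = z * math.sqrt(n_unread * var)
    return center - half, center + half

# e.g. 40 blocks read (30 empty, 10 with data), 60 unread,
# using the μ and σ² integrated above:
lo, hi = stratum_interval([0.0] * 30 + [1200.0] * 10, 60,
                          mu=500.0, var=1.42e6)
print(lo, hi)
```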

Next step is to get these ideas into CISS and test them out. Code modifications should be pretty straightforward. Really hoping to have a stable version this week so I can get to writing.
