Sunday, December 17, 2017

Do we really want to go there?

Anybody who's been with the blog from the get-go (a fairly small group, based on my page stats) knows that I originally went from looking at individual entries to block sums because adjusting for the correlation was over-complicating things. Having developed a working algorithm using just block sums, I'm a little loath to revisit that decision. Still, my adviser has made the legitimate point that, if the block sums are not sufficient statistics, then we are giving away information.

We have argued that while the individual observations are not independent, the blocks are. That's not strictly true, but it's pretty close. The last records in one block are correlated with the first records of the adjacent block. However, since we are sampling blocks randomly, that's not really a problem; there's no systematic correlation that will throw off our results. And, if the block sums are independent, then it's a no-brainer that they are sufficient statistics for the total sum.
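
Just to convince myself that the boundary effect really is negligible, here's a throwaway simulation (not the real data, obviously; I'm faking the row-level correlation with an AR(1) process, which is only a stand-in for whatever the physical storage order actually induces) comparing the correlation between adjacent rows to the correlation between adjacent block sums:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake the row-level correlation with an AR(1) process (illustrative only;
# the real dependence comes from the storage order, not literally AR(1)).
n_blocks, block_size, rho = 2000, 100, 0.9
n = n_blocks * block_size
noise = rng.normal(size=n)
x = np.empty(n)
x[0] = noise[0]
for i in range(1, n):
    x[i] = rho * x[i - 1] + noise[i]

block_sums = x.reshape(n_blocks, block_size).sum(axis=1)

# Adjacent rows are heavily correlated...
print(np.corrcoef(x[:-1], x[1:])[0, 1])                     # roughly 0.9
# ...but adjacent block sums barely are, since only the records near the
# shared boundary contribute to the dependence.
print(np.corrcoef(block_sums[:-1], block_sums[1:])[0, 1])   # roughly 0.05
```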

A question that can't be answered from just the block sums, however, is how correlated the data are. This impacts our estimate of the variance of the total sum. Recall that we're assuming a mixture distribution on the block sum, where it is zero with some probability (1-λ) ≥ 0 and distributed according to an arbitrary non-constant distribution otherwise. If the correlation were zero, λ would always be so close to one that we'd have no need for a mixture distribution; it's only when the observations within a block are highly correlated that there's much chance all the rows in that block get excluded. As the variance depends heavily on the value of λ, it would be good to get this value right as soon as possible in the sampling.
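
For reference, the moments of the block sum under that mixture are just the standard mixture identities. Here's a quick sketch (the mu and sigma stand in for whatever the mean and standard deviation of the non-zero part turn out to be; the numbers below are made up):

```python
def mixture_moments(lam, mu, sigma):
    """Mean and variance of a block sum that is zero with probability
    (1 - lam) and drawn from a distribution with mean mu and standard
    deviation sigma otherwise."""
    mean = lam * mu
    var = lam * sigma**2 + lam * (1 - lam) * mu**2
    return mean, var

# The variance swings a lot with lam, which is why getting lambda right
# early in the sampling matters so much.
for lam in (0.05, 0.5, 0.95):
    print(lam, mixture_moments(lam, mu=1000.0, sigma=200.0))
```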

I've been using a prior λ ~ Beta(h, m), where h (hits: the number of non-zero block sums) and m (misses: the number of zero block sums) start at 2 and then get incremented for every sampled hit and miss, respectively. This obviously converges to the right answer, but it takes a while when λ really is very near (or equal to) zero or one. This is why CISS actually generates better confidence intervals for modestly selective queries than for ones where nearly every block produces either a zero or a non-zero sum.
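
To see how slowly that Beta(2, 2) starting point lets go when λ is extreme, here's a toy version of the update (the names are mine, not anything from the actual CISS code):

```python
import numpy as np

rng = np.random.default_rng(1)

def lambda_posterior_mean(block_sums, h0=2, m0=2):
    """Posterior mean of lambda under Beta(h0 + hits, m0 + misses)."""
    hits = int(np.count_nonzero(block_sums))
    misses = len(block_sums) - hits
    return (h0 + hits) / (h0 + hits + m0 + misses)

# With a true lambda of 0.98, the Beta(2, 2) start drags the estimate
# toward 0.5 until a fair number of blocks have been sampled.
true_lam = 0.98
for n in (10, 50, 500):
    sums = np.where(rng.random(n) < true_lam, 1.0, 0.0)
    print(n, round(lambda_posterior_mean(sums), 3))
```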

One recent suggestion from my adviser is to look at the number of rows selected in each block and make some deductions from that. If all the blocks are returning about the same ratio of selected rows, the correlation is probably pretty low. If it's all over the place, or if many blocks return zero rows, then correlation may be a big deal. In the latter case, you'd want to sample further to guard against missing a block that has a lot of rows to return.
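
I haven't worked out what the actual test would be, but the rough shape of the diagnostic might look something like this (the coefficient of variation and the zero-block fraction are placeholders I picked for illustration, not part of the algorithm):

```python
import numpy as np

def selectivity_diagnostic(rows_selected, rows_per_block):
    """Fraction of sampled blocks that returned nothing, plus the spread
    (coefficient of variation) of the per-block selection ratios."""
    ratios = np.asarray(rows_selected, dtype=float) / np.asarray(rows_per_block)
    zero_frac = float(np.mean(ratios == 0))
    spread = float(ratios.std() / ratios.mean()) if ratios.mean() > 0 else float("inf")
    return zero_frac, spread

# Tight ratios: correlation is probably low.
print(selectivity_diagnostic([48, 52, 50, 47, 51], [100] * 5))
# Many empty blocks and a wide spread: keep sampling, since a heavy block
# may still be out there.
print(selectivity_diagnostic([0, 0, 240, 0, 10], [250] * 5))
```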

The question is, do we need to answer this now? Frankly, I'd rather get the first result out there and cite this as an area for future study. I don't think it will fundamentally change the algorithm, but it could complicate the analysis, and I've found that this is not an easy problem to explain to people even in its simplest form (it seems lots of people who use databases and analytic tools all the time haven't given much thought to how they actually work).

