Sunday, November 5, 2017

BLB and correlation

The bootstrap algorithm, in most of its many forms, assumes iid (independent, identically distributed) observations. Some adjustments have been proposed for cases where the observations themselves are correlated (in particular, a lot has been done with time-series data). But very little has been done to adjust for correlation among the attribute values.

In both the BLB (Bag of Little Bootstraps) and BMH (Bootstrap Metropolis-Hastings) algorithms, attribute correlation is something of a killer. It's not hard to see why.

Suppose I have a query that includes only 1 in 100 rows. Since my block size is 100,000 rows, I'd still expect to see about 1,000 qualifying rows in every block. The chance of a block containing none of them is essentially zero.
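To put a number on "essentially zero", here's a quick back-of-the-envelope check using the figures above, assuming qualifying rows land in blocks independently:

```python
import math

p_hit = 1 / 100         # fraction of rows the query includes
block_size = 100_000    # rows per block

# Work in log space; (0.99)**100_000 underflows double precision.
log_p_empty = block_size * math.log(1 - p_hit)
print(log_p_empty)            # about -1005, i.e. P(no hits) ~ e**-1005
print(math.exp(log_p_empty))  # prints 0.0 after underflow -- effectively zero
```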

Unless, of course, the attributes are highly correlated, which they usually are in my data. Let's suppose we're going after data for one particular office and that office did all its loading in a single week. Let's further assume that the period over which all offices loaded data was four weeks and the rate of loading was fairly uniform. Since data is placed into blocks in load sequence, all the data my query cares about is going to be concentrated in the 25% of the blocks that were created during the week that office was loading.

The chance of getting no relevant rows in a block is now 75%, which, you may notice, is a lot higher than 0%. We could sample eight blocks and still have a roughly ten percent chance (0.75^8 ≈ 0.10) of not finding any relevant rows. Obviously, any estimate of the sum from such a sample would be zero. Worse, the bootstrap evaluation of the quality of that estimate will be that it's perfect. No deviations will be found, no matter how many bootstrap runs you perform. Any reasonable stopping rule would kick in and you'd get a very confidently stated wrong answer.
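Here's a toy simulation of that failure mode. It is a deliberately simplified block-sample-plus-bootstrap sketch, not full BLB, and every number in it (400 blocks, scaled-down block size, the Gaussian values) is illustrative rather than taken from the production data:

```python
import random

# Toy version of the scenario above -- all numbers are illustrative and
# scaled down from the post's 100,000-row blocks so the sketch runs fast.
# Rows for the target office appear only in the first quarter of the blocks.
random.seed(7)
N_BLOCKS = 400
BLOCK_SIZE = 1_000          # stand-in for the 100,000-row blocks

def block_rows(block_id):
    """Query-relevant values in one block: zero everywhere except the 25%
    of blocks written while the target office was loading its data."""
    if block_id < N_BLOCKS // 4:
        # 1 in 25 rows here belongs to the office, so ~1 in 100 overall.
        return [random.gauss(100, 10) if random.random() < 0.04 else 0.0
                for _ in range(BLOCK_SIZE)]
    return [0.0] * BLOCK_SIZE

# Simplified block-sample-then-bootstrap step: sample eight blocks,
# estimate the sum, and bootstrap the sampled rows to judge the estimate.
sample = [v for b in random.sample(range(N_BLOCKS), 8) for v in block_rows(b)]
scale = N_BLOCKS / 8

replicates = [scale * sum(random.choices(sample, k=len(sample)))
              for _ in range(200)]

print("point estimate :", scale * sum(sample))
print("bootstrap range:", min(replicates), "to", max(replicates))
# In the ~10% of runs where all eight blocks miss the office, every
# replicate is exactly 0 and the bootstrap reports zero spread.
```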

There are a couple of ways to attack this. One is to try to randomize the data a bit so there's less attribute correlation within blocks. That would work and it's not particularly hard to do, especially on a platform like Hadoop, where background data re-orgs are exactly what the system is good at (a minimal in-memory sketch of the idea follows).
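In practice this would be a background re-org job rather than an in-memory shuffle, and the function name here is mine, not anything from the post:

```python
import random

def reblock_shuffled(rows, block_size):
    """Assign rows to blocks in random order rather than load order, so one
    office's load window gets spread across (roughly) all the blocks."""
    rows = list(rows)
    random.shuffle(rows)
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
```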

But what if we went the other way? Suppose we amped up the correlation so that we were maximizing the chance of finding zero rows. D-BESt, my blocking algorithm from last spring, is really good at that. Recall that for real-life production queries on real-life production data, it yielded a partitioning scheme where upwards of 90% of all blocks could be safely excluded from any search.

Well, if you know up front that they can be excluded, it's not cheating to simply exclude them and limit your search to the blocks that may actually have some information you're looking for. Such a remaining set of blocks would have a hit rate much higher than the population at large. While I developed D-BESt with the idea that it would be used to optimize full scans, I'm increasingly thinking it can make the sampling work a lot better as well.
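D-BESt itself isn't reproduced here; assume it hands you, per query, the set of block ids it could not exclude. The sketch below only shows where that set would plug into the block-sampling step, and the names and numbers (`sample_candidate_blocks`, the 40-block candidate set) are mine for illustration:

```python
import random

def sample_candidate_blocks(all_block_ids, candidate_ids, n_sample):
    """Sample only among the blocks D-BESt could not exclude for this query.
    candidate_ids is assumed to come from the D-BESt partition metadata."""
    pool = [b for b in all_block_ids if b in candidate_ids]
    return random.sample(pool, min(n_sample, len(pool)))

# Illustrative numbers only: if 90% of 400 blocks are excluded, the sampler
# works a 40-block pool where the query's hit rate is ~10x the overall rate.
candidate_ids = set(range(0, 400, 10))
print(sample_candidate_blocks(range(400), candidate_ids, 8))
```

One wrinkle to keep in mind: any sum estimate then has to be scaled by the size of the candidate set rather than the total block count, since the excluded blocks contribute nothing to the query.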
