Wednesday, January 10, 2018

When in doubt...

... restate the problem! Yes, really, I'm going to reframe this thing yet again.

We have n observations where n is finite (important!) These observations are not iid. Instead, they come in batches. Within a batch, they can be thought of as iid. (Actually, even within a batch they aren't really independent, but it's true enough for sampling.) If we were to sample this whole population randomly with replacement, we could say that we are dealing with iid observations where the distribution is simply the entire population with each observation having a probability of 1/n. We then use all the standard techniques for parameter estimation. That's what BLB does.

Or, we could take a sub-sample, with or without replacement, and run a MCMC simulation to get an estimated distribution for whatever parameter we were after. That's what BMH does.

Both of those approaches work. Both of those approaches are hideously inefficient. Why? Because random sampling from large databases is hideously inefficient. Reading an entire physical block of data takes about the same amount of time as reading a single row within that block. Unless you're doing something really nutty, processing the entire block takes several orders of magnitude less time than reading. In other words, once you've read and processed the first row, the rest of the block is basically free.

The rub is correlation. The batches are loaded sequentially into blocks. Maybe there are some transforms that scramble things a bit, but if you look at the physical rows in just about any data store, you will see heavy correlation between adjacent rows because they very likely came from the same batch.

The impact of this is to increase the variance of any estimator we might be calculating. If we pick a block of 100,000 rows, and 90% of those rows came from the same batch, we are really computing an estimator for just that batch, not the entire population. Other batches might be very different. Therefore, if you treat a block like a true random sample, the estimator will seem much more stable than it really is.

OK. I knew that before I even started working on this problem. It is, in fact, the original idea I was trying to work on. I just got sidetracked on this heavy-tailed stuff because it kind of blind-sided me. So, let's take a deep breath and look at the original problem and this time we won't beg out of it by switching to block sums (although, I still think that's a great practical trick).

We start with the definition of a query I used the last time I restated things: Q(X,Y) = Σ Xi IQ(Yi).

Now, let bi=Yi,1 indicate the batch to which observation (Xi, Yi) belongs. The distribution of Xi is a function of Yi. Although Yi isn't usually thought of as a random variable but rather independent inputs, in this context it can be very much modeled as a random vector with a distribution determined by bi. Thus, Xi IQ(Yi) is a random variable with a distribution determined by bi. The query is just the sum of these guys. If we've sampled k < n observations, all we need to know to make a statement about the full population is how much variability remains in the remaining n-k observations.

That's hard to do because the remaining observations are all in unsampled blocks and are therefore correlated. However, their distributions, while unknown, are functions of a single attribute, b. And {bi} just happens to be a Markov Chain! Yes, this is the big revelation of the night. The variables themselves are, at best, a really difficult to model Markov Chain. But, the distributions of those variables is a really easy to model Markov Chain. Just take the number of batches, divide by n and that's the probability that the batch changes at any point on the chain. By estimating the first two moments of Xi IQ(Yi) by batch and the variance of those moments between batches, we then can estimate the first two moments of the sum of any arbitrarily long chain. That gives us both our point estimate and confidence interval (or, HDI, if you prefer Bayesian terms).

Yes, there's some serious math to do. But, that's a good thing. And, Stochastic Processes was actually one of the few classes I took at Cornell where I was one of the students giving rather than receiving assistance from classmates. This stuff is very much hitting my strong points. (Or, at least it was and should be again once I dust off my 32-year-old copy of Karlin & Taylor).

No comments:

Post a Comment