Well, things may not be too bad. First off, my data isn't quite as 1-sided as I had feared. Here's a histogram of the order of magnitude (log2):
I'm not sure what that distribution is, but it sure isn't exponential. If it was, the partitions would get exponentially larger as I approached the lower order values and that would make it very hard to sample without missing a few really important values way out on the right. It starts out that way, but then it levels off. I just needed to dig further into the center of the distribution to see that.
I think we're looking at the union of two distributions. A normal distribution centered around 2^10 ($1000) and a Chi-squared with a mode at just under 2^20 ($1,000,000). I have some theories on what's driving those two distributions, but I'll talk to the actuaries before sharing them as I may be embarrassingly off base.
The next question is how to go through the partitions (keep in mind that this is a tiny slice of the full data set; the real histogram would have five more zeroes tacked on to the counts). A thought on that struck me as we were going through one of the more basic MCMC (Markov Chain Monte Carlo) methods in class today.
While the example is intentionally trivial so as to demonstrate the process, a key aspect of it struck me. The problem under consideration is that of a politician who wants to spend time in each district proportional to the number of people living there. Problem is, this bozo doesn't know how many people are in each district or even how many districts there are. By using a basic MCMC strategy know as the Metropolis algorithm, the politician can localize the decision to one of simply staying where he is or going to an adjacent district, knowing only the populations of his current and potential district. Just that information at each step, a random walk can be executed that converges to spending days in proportion to population.
The salient feature here is that you don't need to know very much about the range, or even the domain, of the random variable coming in. You can figure it out locally as you go. This is a very desirable feature when trying to sample from a 60 Terabyte database.
While somewhat problematic, it would not be terribly difficult to use an MCMC technique directly to estimate a posterior distribution for the result of a query. The problem, as seen when I first tried some sampling against this data is that the heavy observations in the tail are both very significant and very infrequent. By the time the sample is large enough for the distribution to converge, you've read darn near the entire database; you might as well just run a regular query and get the real answer.
For any given query, we don't know how far the tail extends, but we can form a reasonable prior from either the general shape of the data (in this case, it would be good to know if the metric we were after was from the Normal or Chi Squared component) or, we could inform the prior from actuals, which is a much smaller set of data indicating past performance and then sample from the much larger set of projected results to form a posterior. This latter approach is used by the actuaries to validate that the projections are making sense and they do form a pretty reliable starting point.
So far, nothing new. Trouble is, as Financial Services companies are forced to declare in their ads, past performance does not necessarily indicate future results. It is not uncommon for a large block of business to be sold off or some other significant development to change the projections of the cash flows from policies. Even with a good idea of where the tail is, we can't be sure of that without looking at least a few orders of magnitude (log2) beyond. So, we're still looking at fairly slow convergence. To speed that along, I'm proposing that we violate the tenets of a Markov Chain and actually build some memory into the process.
Rather than continue to apply the same prior at each step in the walk, we compute a new posterior based on the data sampled so far. That becomes the prior for the next step. We then compute the next block to sample, using a branching distribution that targets the tail. Once we have a handle on the largest observations, filling in the rest of the distribution is relatively easy and can be done either by continuing the pseudo-Markovian sampling or by switching to more traditional random sampling (I'm favoring the former, but leaving my options open at this point).
The rub, of course, is computing the new posterior. That's usually the output of the entire MCMC process, not the result of one step. It's not that computing it at each step is intractable, just time consuming, and the whole point of sampling is to get to an answer more quickly. That's another good reason to be sampling blocks rather than rows. Each step will at least generate a bunch of observations which will presumably move the posterior along a bit quicker (that's still conjecture given the early divergent results). It may also be that the whole distribution doesn't need to be estimated. Just having a distribution on the extreme value is exceedingly useful.
Lots of other questions to answer, too, but I do believe that there's something here and I do need to get some sleep. Next step is to actually try this and see how well it converges. Project for the weekend, I guess.
No comments:
Post a Comment