Tuesday, February 23, 2016

Moving on

I'm pretty much ready to shelve the Metropolis algorithm. I'll concede that when very little is known about the distribution of the observations, it does have some merit, but that's not really the problem I'm trying to solve. I'm looking for fast convergence, and this is clearly not it. Rather than demonstrate on my data, which is too irregular to be a good illustration, I'll show what I mean on the uniform distribution.


The blue bars show the counts of actual observations from a discrete uniform(1, 100) distribution. The red bars show the Metropolis counts: when the proposed observation x_{i+1} is smaller than the current one x_i, the chain moves to it with probability x_{i+1}/x_i (otherwise it stays where it is); if x_{i+1} is larger, Metropolis always moves to the next observation.
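Here's roughly what that experiment looks like in code. This is just my sketch of the setup described above (standard Metropolis acceptance with the uniform draws as proposals, plotting omitted), not the production code:

```python
import random
from collections import Counter

random.seed(1)  # any seed; the behavior looks the same across runs

N = 10_000
actual = [random.randint(1, 100) for _ in range(N)]

chain = []
current = actual[0]
for proposal in actual[1:]:
    # Move to the proposal with probability min(1, proposal / current):
    # always move up, move down only with probability proposal / current.
    if proposal >= current or random.random() < proposal / current:
        current = proposal
    chain.append(current)

actual_counts = Counter(actual)
metro_counts = Counter(chain)

# Under the chain's stationary distribution, the expected count for a value x
# is proportional to x itself (the "mass"), so compare against that.
total_mass = sum(range(1, 101))
for x in (1, 25, 50, 75, 100):
    expected = len(chain) * x / total_mass
    print(x, actual_counts[x], metro_counts[x], round(expected, 1))
```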

As one would expect, Metropolis correctly shifts the sampling roughly in proportion to the mass of the observations (I'm using the term mass to indicate density times the value). However, the convergence is terrible. This is a sample set of 10,000 observations on the smoothest possible distribution. One can reasonably ask for pretty nice convergence here. Yet, look at the deviation of the counts from their expected values for each possible value of X:



The actual observations have a few spikes, but are in line with expectations (they'd better be, or I need a new random number generator). Metropolis does great in the region of the distribution where there is very little mass, but look at those giant spikes on the right. That's the heavy part of the distribution, where I can't afford to be off by much. This isn't a fluky set of observations, either. I ran it quite a few times with different seeds and the graphs looked pretty much the same each time. If the uniform case is off by this much after 10,000 observations, my data is going to need millions of hits to converge. Given the difficulties of randomly sampling a large database, there's no way that's going to outperform a full scan until the data size is way into the billions. My full data set is that large, but individual queries will often return "only" several million rows. This just isn't looking like a good answer.

So, here's my revised algorithm that simultaneously monitors convergence of the posterior distributions of the extreme value (to avoid missing big observations) and the sum (which is the value we really want):

As before, when loading the data, stratify it by log2 and group it into blocks large enough to get some decent mixing of attributes. I might have to encourage such mixing by applying a hash to the block assignment, but I'm not worried about that right now.
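For what it's worth, here's a rough sketch of that load-time bookkeeping. The names and the md5-based block assignment are just illustrative (and I'm assuming the stratum is log2 of the absolute value of the measure); the real version would happen inside the database load:

```python
import math
import hashlib

blocks = {}   # (stratum, block_id) -> list of (row_key, measure)

def stratum(measure):
    """Stratum = floor(log2(|measure|)); zero measures get their own stratum."""
    return None if measure == 0 else math.floor(math.log2(abs(measure)))

def block_id(row_key, n_blocks):
    """Hash the row key so attribute values get mixed across blocks."""
    digest = hashlib.md5(str(row_key).encode()).hexdigest()
    return int(digest, 16) % n_blocks

def load_row(row_key, measure, n_blocks_per_stratum=100):
    key = (stratum(measure), block_id(row_key, n_blocks_per_stratum))
    blocks.setdefault(key, []).append((row_key, measure))
```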

Then, on query, start with a prior (ideally from actuals) P(|X_s|), which is the density of the sum of the absolute values of the relevant measures from a block in stratum s (note that, despite the somewhat misleading notation, this is the sum of absolute values, not the absolute value of the sum). Set P(X_s) to be some convenient distribution centered at zero, which I think will work fine since these are generally offsetting accounting entries, though I may need to generate a better prior for the actual sum as well. Then (there's a rough sketch in code after the list):
  • Pick a stratum s in proportion to E(|X_s|).
  • Sample a block from that stratum to get both the sum of the absolute values, |D|, and the actual sum, D.
  • Update the posteriors P(|X_s| | D) and P(X_s | D). I may also need to update the nearby posteriors, but, for now, I'm treating them as independent even though they aren't.
  • Keep going until the (1-α) HDI on every posterior P(X_s | D) is within the specified tolerance.
  • Multiply each E(X_s) by the number of blocks in its stratum and sum to get your answer.
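Here's a skeleton of that loop. The posteriors are crudely approximated by running means and variances (a normal interval stands in for the HDI), and every name here is illustrative rather than the eventual implementation:

```python
import math
import random

Z = {0.05: 1.96, 0.01: 2.576}    # normal quantiles for the (1 - alpha) interval

def make_state(prior_abs_mean):
    # Running tallies per stratum; the prior mean of |X_s| seeds the weights.
    return {"n": 0, "prior_abs": prior_abs_mean,
            "abs_sum": 0.0, "sum": 0.0, "sum_sq": 0.0}

def update(state, abs_sum, actual_sum):
    state["n"] += 1
    state["abs_sum"] += abs_sum
    state["sum"] += actual_sum
    state["sum_sq"] += actual_sum ** 2

def e_abs(state):                 # stands in for E(|X_s|)
    return state["abs_sum"] / state["n"] if state["n"] else state["prior_abs"]

def e_sum(state):                 # stands in for E(X_s)
    return state["sum"] / state["n"] if state["n"] else 0.0

def hdi_width(state, alpha):
    # Normal-approximation stand-in for the (1 - alpha) HDI width on E(X_s);
    # treated as infinite until a stratum has at least two sampled blocks.
    n = state["n"]
    if n < 2:
        return math.inf
    var = max(state["sum_sq"] / n - e_sum(state) ** 2, 0.0)
    return 2 * Z[alpha] * math.sqrt(var / n)

def estimate_total(blocks_by_stratum, prior_abs_means, alpha=0.05, tol=1e4):
    # blocks_by_stratum: {s: list of blocks}, each block a list of measure values.
    post = {s: make_state(prior_abs_means[s]) for s in blocks_by_stratum}
    strata = list(post)
    while True:
        # Pick a stratum in proportion to E(|X_s|).
        s = random.choices(strata, [e_abs(post[t]) for t in strata])[0]

        # Sample a block; get the sum of absolute values and the actual sum.
        block = random.choice(blocks_by_stratum[s])
        update(post[s], sum(abs(m) for m in block), sum(block))

        # Stop once every (1 - alpha) interval on X_s is within tolerance.
        if all(hdi_width(post[t], alpha) < tol for t in post):
            break

    # Scale E(X_s) by the number of blocks in each stratum and add them up.
    return sum(e_sum(post[t]) * len(blocks_by_stratum[t]) for t in post)
```

The point of the sketch is the control flow, not the posterior math; the distributional machinery gets swapped in later.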
This is essentially stratified Gibbs sampling with an update on the distribution at each step. That strikes me as obvious enough that I did a cursory review to see if there's already a paper out there somewhere for this and didn't find one (though I did turn up some related papers on importance sampling that I should probably read).

For a first cut, I'm not going to stress too much about computing posteriors; I'll just assume something mathematically convenient with a tail to the right. The goal is to get something working in the next few days so I can then refine it while I'm in Texas.
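For example, one convenient right-tailed placeholder would be a gamma matched to the first two moments of the sampled |X_s| block sums. I'm not committing to this; it's just easy to compute and easy to swap out:

```python
def fit_gamma(samples):
    # Moment-match a gamma to the sampled block sums of absolute values:
    # mean = shape * scale, variance = shape * scale^2.
    # Assumes at least two distinct samples so the variance is positive.
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / max(n - 1, 1)
    shape = mean * mean / var
    scale = var / mean
    return shape, scale
```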
