Thursday, February 18, 2016

First cut at the algorithm

Preparation:
  • Stratify data upon loading, creating blocks of rows with roughly equal order of magnitude (in absolute value, so big offsetting entries will tend to wind up in the same block).
  • Store references to blocks as an array of lists, so it's easy to retrieve all blocks with the same order of magnitude. Minimally, this will be an access pointer and row count for each block, plus a count for each stratum. It may make sense to also store attribute information so one can quickly determine if a block has ANY rows of interest without actually scanning the rows; that's an implementation detail to be dealt with later.
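To make the preparation step concrete, here's a minimal sketch of what stratified loading might look like, assuming in-memory rows and a magnitude-based stratum key. BLOCK_SIZE, Block, and stratify are illustrative names, not settled design.

    import math
    from collections import defaultdict

    BLOCK_SIZE = 1000  # illustrative; the real block size would be tuned to I/O

    class Block:
        """A contiguous chunk of rows from one stratum: an access pointer
        (here just the rows themselves) plus a row count."""
        def __init__(self, rows):
            self.rows = rows
            self.count = len(rows)

    def stratum_of(value):
        """Order of magnitude of the absolute value; zeros get their own stratum."""
        if value == 0:
            return None
        return int(math.floor(math.log10(abs(value))))

    def stratify(rows, value_of):
        """Group rows into blocks of roughly equal order of magnitude.

        Returns a dict mapping stratum -> list of Blocks: the 'array of lists'
        that makes it cheap to pull every block of a given magnitude."""
        buckets = defaultdict(list)
        for row in rows:
            buckets[stratum_of(value_of(row))].append(row)
        return {s: [Block(bucket[i:i + BLOCK_SIZE])
                    for i in range(0, len(bucket), BLOCK_SIZE)]
                for s, bucket in buckets.items()}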
Query (single cell result):
  • Compute initial priors for the result and for the extreme value (the largest value included in the result). I'll use a generic prior for a first go, but using actuals is probably where I'll wind up.
  • Create an (empty) list of blocks already queried, so we can enforce sampling without replacement. (Might make more sense to do the inverse: create a copy of the full list and remove references as we go).
  • Set the initial stratum to the mode of the extreme value prior.
  • While the width, e, of the (1-γ) Highest Density Interval (HDI) of the extreme value prior is greater than E (γ and E are parameters):
    • Choose an unsampled block from the current stratum.
    • Add that block to the list of sampled blocks.
    • Sample that block completely.
    • Calculate posterior distributions on the result and the extreme value (lots of edge cases here; particularly, what to do when no rows are found in the block and when the block is the last one for this stratum).
    • Draw a random value from the extreme value posterior and set the current stratum to match the magnitude of that value.
    • Set the current priors to the posteriors.
At this point, we have a very good distribution on the extreme value and should have a reasonable handle on the result distribution as well (although, since we haven't actually used it yet, it might be best to hold off computing the result distribution until we get here). What we may or may not have is a good handle on the distribution of the individual data elements. At any rate, compute one now from the initial prior and all the data sampled so far, and set that as the prior for the next step.
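Both loops use the same stopping rule: the width of the narrowest interval that holds (1-γ) (or (1-α)) of the posterior mass. Here's a quick sketch of that test, assuming the posteriors are carried around as collections of Monte Carlo draws rather than in closed form.

    import numpy as np

    def hdi_width(draws, mass):
        """Width of the narrowest interval containing `mass` of the draws;
        the loops stop once this drops below E (extreme value) or R (result)."""
        x = np.sort(np.asarray(draws, dtype=float))
        n = len(x)
        k = max(1, int(np.ceil(mass * n)))    # points the interval must cover
        widths = x[k - 1:] - x[:n - k + 1]    # every run of k consecutive points
        return float(widths.min())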
  • While the width, r, of the (1-α) HDI of the result prior is greater than R (α and R are parameters):
    • Draw a random value from the prior distribution of the individual data elements and set the current stratum to match the magnitude of that value.
    • Choose an unsampled block from the current stratum.
    • Add that block to the list of sampled blocks.
    • Sample that block completely.
    • Calculate posterior distributions on the result and the individual data elements.
    • Set the current priors to the posteriors.
And, when that converges, you're done. Yes, it looks like a lot of work, but would you like to see what a keyset search for all matching records looks like in a traditional database system? Actual computation is so fast now that it's almost free; the sticking point is getting data in and out of the computation engine. More calcs on less data will likely carry the day. At least, that's what I'm betting on.
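To put the pieces together, here's a rough skeleton of the query in code. All the names are provisional: update_posteriors(), draw(), and mode_of() stand in for Bayesian machinery I haven't pinned down yet, each distribution is assumed to be represented by a set of draws (as in the hdi_width() sketch above), and strata is the structure built at load time. The edge cases called out earlier (empty or exhausted strata in particular) are glossed over.

    import random

    def query(strata, priors, gamma, E, alpha, R):
        """Two-phase sampling loop; a sketch only."""
        sampled = set()   # ids of blocks already queried (sampling w/o replacement)

        def next_block(stratum):
            """Pick an unsampled block from the given stratum, if any remain."""
            candidates = [b for b in strata.get(stratum, []) if id(b) not in sampled]
            return random.choice(candidates) if candidates else None

        # Phase 1: pin down the extreme value. The stratum to probe next is
        # driven by a draw from the extreme value posterior.
        stratum = mode_of(priors.extreme)
        while hdi_width(priors.extreme, 1 - gamma) > E:
            block = next_block(stratum)
            if block is not None:
                sampled.add(id(block))
                priors = update_posteriors(priors, block.rows)  # sample it completely
            stratum = stratum_of(draw(priors.extreme))

        # Phase 2: pin down the result. Now the stratum choice is driven by a
        # draw from the distribution of the individual data elements.
        while hdi_width(priors.result, 1 - alpha) > R:
            stratum = stratum_of(draw(priors.elements))
            block = next_block(stratum)
            if block is not None:
                sampled.add(id(block))
                priors = update_posteriors(priors, block.rows)

        return priors.result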

Also note that this algorithm parallelizes very easily. Just delegate each step to a separate process. There isn't much downside to doing them out of order; just a little coordination required to maintain the list of sampled blocks.
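For what that delegation might look like, here's a hedged sketch using a process pool; the worker function and the (stratum, index) keys are made up for illustration, and the sampled-block bookkeeping stays with the parent process.

    from concurrent.futures import ProcessPoolExecutor

    def scan_block(rows):
        """Work handed off to a separate process: fully sample one block and
        return a small summary (here, just the sum and count of its values)."""
        return sum(rows), len(rows)

    def scan_in_parallel(pending, sampled_keys):
        """`pending` maps a block key, e.g. (stratum, index), to that block's
        values. Blocks get scanned out of order in worker processes; the only
        coordination needed is updating the sampled-block set in the parent."""
        with ProcessPoolExecutor() as pool:
            summaries = dict(zip(pending, pool.map(scan_block, pending.values())))
        sampled_keys.update(pending)
        return summaries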
