Sunday, March 20, 2016

Convergence plot

Granted, for purposes of demonstration, I got a bit lucky in that my random sampling algorithm missed a big one-sided entry until late in the game. But that's just the point: with random sampling, things like that will happen. CISS, on the other hand, cranked out the big uncertainty items much earlier on. Here's a graph I put together showing the convergence of CISS versus a single random run. Obviously, it would be more rigorous to perform a few thousand random runs and look at the population convergence, but that's not really a big concern right now.

The big concern right now is the uncertainty bounds, which are clearly way too conservative. I'm calling it "uncertainty" rather than a confidence interval because right now it's just a heuristic that will nearly always contain the true value, not something based on an actual probability. That will be the main point of discussion with my adviser tomorrow. I'm pretty sure I can specify the variability as a real Bayesian prior and then work out the posterior, even without knowing the underlying distribution. The reason is that, since the data is stratified, I already know the distribution of the data in each stratum: it's basically uniform, ranging from 2^i to 2^(i+1), where i is the stratum number. What I'm estimating is how many of those rows are included in this query result, and that's just estimating a Bernoulli success rate from repeated samplings. If I have a confidence interval on the success rate, I also have one on the number of hits. Multiply that by the average magnitude of values in the stratum and you've got a pretty good handle on the variability of the estimated sum.
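To make that concrete, here's a minimal sketch of what the per-stratum interval could look like, assuming a flat Beta prior on the Bernoulli hit rate. The function name, the choice of prior, and the numbers in the example are just my illustration, not the actual CISS implementation.

# Minimal sketch, assuming a flat Beta(1, 1) prior on the per-stratum
# inclusion ("hit") rate. Hypothetical helper, not the real CISS code.
from scipy.stats import beta

def stratum_interval(i, n_sampled, n_hits, n_total, cred=0.95):
    """Point estimate and credible interval for one stratum's contribution.

    Stratum i holds n_total rows with magnitudes roughly uniform on
    [2**i, 2**(i+1)), so the average magnitude is about 1.5 * 2**i.
    Sampling n_sampled rows and finding n_hits that satisfy the query is a
    Bernoulli process; the posterior on the hit rate is Beta(1+hits, 1+misses).
    """
    post = beta(1 + n_hits, 1 + (n_sampled - n_hits))
    lo_p = post.ppf((1 - cred) / 2)
    hi_p = post.ppf(1 - (1 - cred) / 2)
    avg_magnitude = 1.5 * 2 ** i          # midpoint of the stratum's range
    scale = n_total * avg_magnitude
    return post.mean() * scale, lo_p * scale, hi_p * scale

# e.g. stratum 10 (values ~1024-2048), 200 sampled, 37 hits, 5000 rows total
print(stratum_interval(10, 200, 37, 5000))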

Side note on this data: if you're wondering why the random sampler appears to show so much less volatility along the path, that goes back to the original motivator of this research: correlation within a block. These are accounting entries. If you process them in blocks where the blocks represent sequential entries, you tend to get the reversing entry along with the detail. So blocks tend to sum to zero, which means the estimate doesn't move much until you hit a case where the entry didn't all go into one block or the reversing part falls outside the scope of the query. (The latter case is how you wind up with profit in a year even when all entries have to balance; the reversal is in the next fiscal year.) When the data is stratified, the reversing entries (which negate hundreds of detail entries) get put into different strata, so stratified sampling will pick up one side of an entry well before the rest of the details are found. That leads to very high volatility at first, but better convergence in the long run.
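Here's a tiny toy simulation (made-up data, not the real ledger) of that "blocks sum to zero" effect: every detail row is eventually cancelled by its reversing row, so a block of consecutive rows that holds whole entries contributes almost nothing to the running sum, while a block that cuts an entry in half swings the estimate.

# Hypothetical illustration only: each journal entry is several detail rows
# followed by one reversing row that negates them.
import random

def make_ledger(n_entries=1000, details_per_entry=10):
    rows = []
    for _ in range(n_entries):
        details = [random.uniform(1, 1000) for _ in range(details_per_entry)]
        rows.extend(details)
        rows.append(-sum(details))   # the reversing entry
    return rows

rows = make_ledger()
print(sum(rows[:11]))                # whole entry in one block: ~0
print(sum(rows[:5]))                 # partial entry: far from 0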
