Tuesday, November 28, 2017

More thoughts on BMH

In any Metropolis-Hastings approach, you need some kind of likelihood function that you can use to decide whether you stay where you are in the chain or move to the new proposed value. I've simply been computing the likelihood of getting the observed block average given the proposed mean and the observed variance.

Here's the kicker, though: these are heavy-tailed distributions. The variance may not even exist. Of course it always exists for a finite sample, but if the underlying data truly is heavy tailed, it's a rather bogus measure.

So, now I'm thinking about how to compute a likelihood using some other measure of spread that can be reasonably estimated. The interquartile range is the obvious one, though it's not particularly helpful in assessing the mean, particularly if the distribution is multi-modal. There might be some real math to be done here.
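Just to make the idea concrete, here is a minimal sketch of the plug-in approach (function and variable names are mine; it still leans on a normal form for the block average, which is exactly what a Cauchy would break, but it shows where a robust scale estimate would slot in):

```python
import numpy as np
from scipy import stats

def iqr_scale(x):
    """Robust scale estimate: IQR rescaled to match sigma for normal data."""
    q75, q25 = np.percentile(x, [75, 25])
    return (q75 - q25) / 1.349   # 1.349 is the IQR of a standard normal

def block_mean_likelihood(block, proposed_mean):
    """Likelihood of the observed block average under the proposed mean,
    using the IQR-based scale in place of the sample variance."""
    n = len(block)
    se = iqr_scale(block) / np.sqrt(n)   # rough standard error of the block mean
    return stats.norm.pdf(np.mean(block), loc=proposed_mean, scale=se)
```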

Monday, November 27, 2017

Size matters

Block size, specifically. And, none of the methods I've been looking at seem to care.

For example, the example in the BMH paper uses k samples of m observations per iteration, with k at 25 and 50 and m at 200, 500, and 1000. I'm sorry, I thought the title of your paper included the phrase "Analysis of Big Data". I don't know of a Big Data platform that uses 1000-row blocks. 1-million-row blocks would be more common.

That matters because if I have, say, 10 billion rows in 1-million-row blocks, that's only 10,000 blocks. If I use up 25 of them on each iteration, I'm not going to have many iterations before I've gone through the bulk of the data set. It's not that I can't see that a larger number of smaller blocks converges with less data; it's that such an arrangement completely defeats the purpose of these algorithms, which is to get the answer as quickly as possible. And simply counting floating point operations is a very poor predictor of how long an algorithm takes when the data is larger than memory.
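The back-of-the-envelope version of that, using the numbers above and assuming 1-million-row blocks:

```python
rows_total      = 10_000_000_000        # 10 billion rows
rows_per_block  = 1_000_000             # a realistic block size, not 1000
blocks_total    = rows_total // rows_per_block     # 10,000 blocks
blocks_per_iter = 25                    # k from the BMH example
max_iterations  = blocks_total // blocks_per_iter  # 400 iterations and the data set is spent
print(blocks_total, max_iterations)     # 10000 400
```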

This, of course, is one of the big differences between academic work and industry. Academics may rightly point out that fewer computations are needed with smaller blocks. What they are glossing over is that computation time doesn't really matter much. It's getting stuff off the disk that takes all the time. Once the block is in memory, running the computation on the full block is basically free (unless you're doing something really nutty).

So, I'm implementing these algorithms using much larger blocks. The authors could accuse me of not giving their algorithms a fair shake, but I'm actually giving them a huge advantage: I'm letting them use all the data in memory instead of just a tiny amount. My metric continues to be how many blocks are read, even though these algorithms do more computation than mine.

At the end of the day, I'll have to justify all this with some true time studies. To do that, I'll need to re-write all this stuff (again) to run on the cluster. It's actually not that much code and I was going to get around to it anyway. But, I'm hoping I can put that off until after we get this paper out the door.

Sunday, November 26, 2017

Rebirth

Between trying to get things tied up at work before leaving for Ithaca and the trip itself, I haven't had much to report the past week. Thus, there have been no reports.

However, one of the advantages of taking a break is that you tend to think about the bigger picture a bit. I'm starting to think this whole bootstrap business has some promise. It's an active area, so it will be viewed as relevant, and nobody seems to be terribly concerned that all the existing methods fail (or, at least, have not been proven correct) when higher moments are absent. So, one might conclude that getting some results in the absence of higher moments would be a good thing.

Here are a few of the questions that might be worth answering (I'll have to check to see if any of them have been):

  1. How can you mix bootstrapping with stratification?
  2. How to account for mixture distributions where the mixture has temporal correlation?
  3. More detailed results using BMH with mixture/hierarchical models. In particular, does a block sampling strategy have the same asymptotic properties, even in the presence of block correlation (I'm sure it does, but the proof could be messy)?
  4. What happens to all these things when the variance is unbounded?
  5. What happens when we mix the correlation-maximizing strategy of D-BESt with methods designed to account for correlation?
  6. What if we replaced the pseudo-random number generator with a chaotic function in these methods?
That last one is a bit off-topic, but it occurred to me that random number generators are really just chaos functions by another name. Maybe it would be simpler/more efficient/more effective to intentionally pick a chaos function that covers the sample space non-uniformly. Sometimes what we want isn't so much randomness as non-periodic coverage. Non-linear differential equations are good at that, particularly when you go to higher dimensions.
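To make that thought concrete, here's a minimal sketch using the logistic map at r = 4, which is chaotic and (for almost all seeds) non-periodic but covers (0, 1) decidedly non-uniformly; whether that's actually useful for sampling is exactly the open question:

```python
def logistic_stream(x0=0.1234567, r=4.0):
    """Deterministic, aperiodic sequence on (0, 1) from the logistic map.
    At r = 4 the invariant density piles up near 0 and 1, so coverage is
    non-periodic but far from uniform."""
    x = x0
    while True:
        x = r * x * (1.0 - x)
        yield x

gen = logistic_stream()
draws = [next(gen) for _ in range(10)]   # could stand in for uniform(0,1) draws
```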

Also, I still have my eye on the Blind Multi-Stage Bootstrap that I wrote about a couple weeks ago. A couple additional thoughts on that one:

  1. Does it make sense to update the BMH distribution with each sample based on what we've observed so far? If so, what sort of prior do you use to control that?
  2. What if it's the entire model that you're not sure about? That is, rather than a few competing models, is there a way to be entirely agnostic about the model? We actually had a colloquium on this last year which looked into Dirichlet models. I think I'd at least like to read up on that. 

Sunday, November 19, 2017

High tide

Once again, work and school have crested simultaneously and, for obvious reasons, work is winning.

The annoying thing is that the last few production issues at work have had nothing to do with our new platform. It's been the legacy stuff that's been acting up. I'll be happy when we can retire all of that, but I'm afraid that's at least a year away. Legacy systems don't die without a fight.

I do expect to be completely free from work concerns during my trip back to Ithaca next week. I may get a chance to do some research at Cornell's math library. I'm not sure how much I can get done in a single day there, but it would certainly make sense to try. Of course, the real point of the trip is to see my mom and sister. I don't think hiding in a library all week is a particularly good way to demonstrate that I care that my dad just died.

None of this has me particularly stressed. I expect that some weeks will be productive and others won't. That said, I really would like to get past my admission to candidacy sooner rather than later. It's a fairly arbitrary hurdle, but it has the effect of clarifying expectations for the rest of the dissertation. Right now, I still feel a bit unsettled as to how much work remains.

Friday, November 17, 2017

Right answer, wrong question

I've found a fair bit of research on using Metropolis-Hastings on time series data. In everything I've read so far, the focus is on identifying the correlation. I don't really need to do that; I already know the correlation boundaries. What I need is to build those boundaries into my model so I can adjust for them.

In the meantime, I think that just by pulling multiple blocks I can smooth out most queries pretty well.

Wednesday, November 15, 2017

The hard work of math

Most people think of math as "hard". By that, they really mean "incomprehensible". There's nothing hard about noting, or even proving, that the cardinality of the integers equals that of the rationals but is less than the cardinality of the reals, even if most people would prefer to simply call them all "infinite". Even most math-literate folks, like engineers and physicists, are generally content to make the distinction between countably and uncountably infinite rather than trying to wrap their heads around the various degrees of uncountable sets. These things aren't "hard" if you trust your axioms and your methods but, intuitively, they are pretty far out there.

On the other hand, rigorous proof and its software analog, formal verification, are objectively hard tasks. They require discipline, patience, and, above all, attention to detail. Any attempt to do them quickly defeats the purpose. They necessarily require effort.

So, having completed the interesting part of comparing the sampling algorithms, I'm now left with the hard task of making sure that everything I did was correct. It's a long night of fairly unrewarding work. But, it is the hard work of math. It needs to be done.

Tuesday, November 14, 2017

BMH preliminary results

Here's a first look at the Bootstrap Metropolis-Hastings results. Vertical scales are the same as before.


Cauchy is nutty as usual but the other two are actually pretty good. Then we try a filtered query...


Oh, wowsers, that's not nearly so good. The confidence interval misses the true value most of the time, even in the normal case. The problem, of course, is the correlation of the attributes. Here's the kicker though. Metropolis-Hastings can handle arbitrarily complex models. Currently, I've got it assuming iid inputs. That's not necessary. A model that assumes correlation in the data set might do much better. I'll work on that next.

Monday, November 13, 2017

Bootstrap Metropolis Hastings

I'm coding up the second algorithm for the comparo. This one is a bit less intuitive. First, a quick overview of the Metropolis-Hastings algorithm:

The idea is to approximate the posterior distribution of some parameter given a set of data. Presumably, this distribution is difficult (or impossible) to derive, so we settle for generating an empirical distribution. It goes like this:

1) Pick a starting value for your parameter. Let's call it Y, since Greek letters are a pain to produce in a normal text editor. This first value is Y0.

2) Pick another value for Y by applying some distribution to Y0. The choice of distribution is pretty arbitrary, but things converge faster if it's somewhat similar to the distribution of Y. Since we're looking at sums, a normal distribution should be close (if not spot on). So, we let Y' = Y0 + N(0,s), where the standard deviation, s, is a reasonable guess at the standard deviation of our sum.

3) Compute the relative likelihood of the two values, L = P(x|Y')/P(x|Y0), where x is our observed data set.

4) If L > 1, then Y1 = Y', that is, we "accept" Y' as our next value in the empirical distribution.

5) If L < 1, then we accept Y1 = Y' with probability L (by comparing L to a pseudo-random uniform(0,1) variable). Otherwise, we keep the original value, Y1 = Y0.

We keep doing this for a long time until the empirical distribution of {Yi} converges. That limiting distribution is our estimate of the true distribution of Y.
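Here's a minimal sketch of those five steps in Python (a random-walk proposal with a flat prior, so the acceptance ratio is just the likelihood ratio; function and variable names are mine):

```python
import numpy as np

def metropolis_hastings(x, log_likelihood, y0, proposal_sd, n_iter=10_000, seed=0):
    """Random-walk Metropolis for a single parameter Y.
    log_likelihood(x, y) returns log P(x | y)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    y, ll = y0, log_likelihood(x, y0)
    for i in range(n_iter):
        y_prop = y + rng.normal(0.0, proposal_sd)        # step 2: Y' = Y + N(0, s)
        ll_prop = log_likelihood(x, y_prop)
        # steps 3-5: accept with probability min(1, L), where L = P(x|Y') / P(x|Y)
        if np.log(rng.uniform()) < ll_prop - ll:
            y, ll = y_prop, ll_prop
        chain[i] = y
    return chain

# Example: posterior of the mean of iid normal data with known sd = 1
def normal_loglik(x, mu):
    return -0.5 * np.sum((x - mu) ** 2)

data = np.random.default_rng(1).normal(3.0, 1.0, size=500)
chain = metropolis_hastings(data, normal_loglik, y0=0.0, proposal_sd=0.2)
```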

The Bootstrap Metropolis-Hastings algorithm works pretty much the same way, except that rather than computing the likelihood of the entire data set, we first select a subset and run it on that. We repeat that for multiple subsets and then average the results.
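As I read it, the acceptance ratio gets replaced by something like this subset-averaged version (scaling each subset's log-ratio up to full-data size is my assumption; the paper's exact weighting may differ):

```python
import numpy as np

def bmh_log_ratio(x, log_likelihood, y_new, y_old, k=25, m=1000, rng=None):
    """Average the log-likelihood ratio over k subsets of size m (m <= len(x)),
    scaled up by n/m to stand in for the full-data ratio. A sketch of the idea only."""
    rng = rng or np.random.default_rng()
    n = len(x)
    ratios = []
    for _ in range(k):
        sub = x[rng.choice(n, size=m, replace=False)]
        ratios.append((n / m) * (log_likelihood(sub, y_new) - log_likelihood(sub, y_old)))
    return float(np.mean(ratios))
```

This would drop straight into the acceptance test of the sketch above, in place of `ll_prop - ll`.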

If this sounds like a lot of work for something as simple as a sample sum, you're right. We're not expecting the result to be any better than the BLB algorithm, which requires significantly less computation. I'll hopefully have confirmation of that in a couple days.

However, because the BMH algorithm returns an arbitrarily complex distribution and not just a point estimate and variance, it opens the possibility of leveraging the hierarchical model that we're using in CISS or even expanding that to something more complex. Once I'm done implementing the naive method, I'm going to try it with the hierarchical model and see if that keeps the variance from collapsing the way it did in the BLB tests on filtered queries.

Sunday, November 12, 2017

Relentless reorganization

INTJ. That's my Myers-Briggs personality type. And, as much as I'm wary about over-simplifying human behavior, I have to concede that the description of that type fits me rather well. One of the attributes is "relentless reorganizer".

And, yes, I'm going to re-frame part of my problem once again. This time, however, it's not of my own volition. My adviser felt that we should clarify what exactly we mean by a "query". That's obviously a fair thing to ask, since the entire method hinges on it. I thought about giving a bunch of examples. I may still do that, but I think what would fly a lot better would be a rigorous mathematical definition. So, here's a first cut at that:

Let X = {Xᵢ} be a set of n financial measures and Y = {Yᵢ} be a set of attribute vectors describing each corresponding measure in X. Let I_Q: Y → {0, 1} be an indicator of whether a vector in Y should be included in a query. A query, Q: ℝⁿ ⨉ Y → ℝ, is a function such that Q(X, Y) = Σᵢ Xᵢ I_Q(Yᵢ).

I_Q is called the query condition. Typically, I_Q defines a rectangular subspace of the full attribute space. That is, each attribute is limited to a range of values independent of the other attributes. However, our results do not rely on that assumption. Any subset of the attribute space is an acceptable query condition.
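A literal transcription of that definition in code, with attribute vectors represented as dicts and the query condition as an arbitrary predicate (the representation is my choice, not part of the definition):

```python
from typing import Callable, Mapping, Sequence

def query(X: Sequence[float], Y: Sequence[Mapping], include: Callable[[Mapping], bool]) -> float:
    """Q(X, Y) = sum of X_i over the rows whose attribute vector Y_i
    satisfies the query condition I_Q."""
    return sum(x for x, y in zip(X, Y) if include(y))

# A rectangular condition: each attribute restricted independently
total = query(
    X=[100.0, -40.0, 250.0],
    Y=[{"office": "NYC", "product": "life"},
       {"office": "NYC", "product": "health"},
       {"office": "LON", "product": "life"}],
    include=lambda y: y["office"] == "NYC" and y["product"] == "life",
)
```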

If using a multi-currency model, then one could further indicate how exchange rates factor in:

Let C be the set of currencies and T be the range of time values. An exchange rate is a function F: C² ⨉ T → ℝ such that F(d, c, t) is the conversion factor from input currency c to display currency d at time t. Let Y_cᵢ be the input currency for Xᵢ (that is, the financial units in which Xᵢ is measured) and Y_tᵢ be the time attribute for Xᵢ. Then, a query, Q: ℝⁿ ⨉ Y ⨉ C → ℝ, is a function such that Q(X, Y, d) = Σᵢ Xᵢ I_Q(Yᵢ) F(d, Y_cᵢ, Y_tᵢ), where the result is expressed in units of d. As we don't specifically deal with multi-currency in this paper, I'll leave that bit out. Might need it later, though.

Saturday, November 11, 2017

Stopping rule revisited

So, my stopping rule got a little contorted through all the various machinations of my variance. Fortunately, I had it right in the code all along; I just transcribed it badly when going from the confidence interval to the stopping rule (basically, I stop when the confidence interval is sufficiently narrow).

For starters, we have:



Stated as an alpha-risk, that's



Since the denominator is always positive, this implies



so, collecting all the stuff that changes each sample on the left, we stop when
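Roughly, the shape of the rule (glossing over the exact spread estimate, written here generically as σ̂ₙ, with ε the target half-width) is:

```latex
% Generic confidence-interval stopping rule; a sketch of the form only --
% the exact spread estimate \hat{\sigma}_n depends on the method in play.
P\!\left(\,|\hat{\mu}_n - \mu| > \varepsilon\,\right) \le \alpha
\;\Longleftarrow\;
z_{1-\alpha/2}\,\frac{\hat{\sigma}_n}{\sqrt{n}} \le \varepsilon
\;\Longleftrightarrow\;
\frac{\hat{\sigma}_n^{2}}{n} \le \left(\frac{\varepsilon}{z_{1-\alpha/2}}\right)^{2}.
```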

Friday, November 10, 2017

More results

I scaled up my data size and also tried some runs where the filter created a mixture distribution (a mixture of observations that are constant zero because they don't meet the criteria and rows from the original distribution that do). Results are what I was expecting but I'm much happier than someone who just had things work out according to expectations. That's because the charts so vividly demonstrate the issue.

First the Bag of Little Bootstraps method with no filtering of rows, that is, basically iid distributions:

As with previous examples, all three distributions have the same inter-quartile range. The difference is the weight on the tails. The horizontal axis is number of blocks read.

Not surprisingly, BLB does great with normal. Then again, just about any algorithm works with iid normal data; that's why so many people assume it even when they shouldn't. The heavier the tail, the worse things get, which is also what you'd expect. That said, even in the case of Cauchy data, the algorithm is holding its own. It appears the iid assumption is enough.

CISS does a bit better with the heavy-tailed distributions, but not much. To make comparisons a bit easier, I've used the same vertical scale from one method to the next. I did have to use a larger scale for Cauchy because it's so much more variable, but it's the same on both graphs.

The error bounds are jacked up on the first two; I haven't had time to look at why. If that were all we had, I'd be sad. CISS is better, but not enough better to give up the flexibility of the bootstrap, which can be used for any estimator, not just means and sums. However, that's not all we got. When we switch to the mixture distribution, things change radically.

That's a mess. In the first two, the variance collapses because there are too many blocks where all the records are excluded (I generated this data set with attribute correlation, which cranks up the probability of missing all records). Oddly, the Cauchy approximation is the best of the three, though I think that gets chalked up to dumb luck.

On the other hand...

Now we're talking. Good convergence and tight confidence bounds to match.

Thursday, November 9, 2017

Base results

Here are some simple charts showing the Bag of Little Bootstraps performance on three distributions with the same inter-quartile range. As one would expect, it works great on the normally-distributed data. OK, but not so great on the heavy-tailed Laplace. A train wreck on Cauchy.


Wednesday, November 8, 2017

... and less good news

Got interrupted right when I was starting to have a productive evening working on my sampling of generated data. Production failed at work. Not a terribly hard fix but it took a big bite out of the evening and now I'm up against a real deadline if I'm going to have anything at all to show to my adviser this week. Tomorrow's looking like a very long day.

Tuesday, November 7, 2017

In other news...

I do have a day job. And, while it's consuming way more of my time than I'd like, at least I can claim some success. Here are some stats from our new reporting platform compared to last year:

Data size:

  • 2016: 3 Billion rows, split among three cubes.
  • 2017: 3.8 Billion rows, all consolidated in one cube.
Median batch load time (for the week before recalc, which is the busiest week):
  • 2016: 47 minutes (review only)
  • 2017: 22 minutes (all)
Last year, the only way we could keep the "official" cube stable was to throttle the builds to once every three hours. So, you might have to wait a long time to see results there. We had a "review" cube, refreshed more often, that let you see what your particular batch was going to look like.

Benchmark query time:
  • 2016: 14 seconds with high variability (it was common for the query to take over a minute)
  • 2017: 5 seconds, reasonably consistent
So, we've hit our marks. But, it's come at a cost, both for RGA and for me. The project wasn't cheap and all the overtime I've put in could have gone to working on my research.

Still, it's an easier pill to swallow when the results are so obvious.

Monday, November 6, 2017

Blind multi-stage bootstrap

The multi-stage bootstrap is a bootstrap method where one first creates a hierarchy of attributes and then selects the bootstrap sample based on that hierarchy. For example, suppose we had cash flows from several offices and we have good reason to believe that cash flows within an office are correlated whereas cash flows between offices are not. Further, we know that the cash flows for life policies are fundamentally different from those for health. If the goal were to get some sense of the variability of the present value of all these cash flows, it would make sense to estimate these things separately and then combine the estimates, rather than pretend that they are all just one big pool.

So, we perform the bootstrap, but rather than selecting randomly from all observations, we first select an office (proportional to the observations for each office), then pick a product type (proportional to the life/health mix for that office) and then select our bootstrap sample from the observations meeting those criteria. As with the usual procedure, we repeat this via a Monte Carlo process until we have convergence on our estimator quality.
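A rough sketch of that selection logic (collapsing the two stages into one draw proportional to cell size, which works out to the same thing; the field names and data layout are made up):

```python
import numpy as np
from collections import defaultdict

def multistage_bootstrap_totals(obs, n_boot=1000, seed=0):
    """obs: iterable of (office, product, cash_flow) tuples.
    Each resampled observation first picks an (office, product) cell
    proportional to its row count -- equivalent to office-then-product
    proportional selection -- then draws a value from that cell, with
    replacement. Returns bootstrap replicates of the total."""
    rng = np.random.default_rng(seed)
    cells = defaultdict(list)
    for office, product, value in obs:
        cells[(office, product)].append(value)
    keys = list(cells)
    probs = np.array([len(cells[k]) for k in keys], dtype=float)
    probs /= probs.sum()
    n = sum(len(v) for v in cells.values())
    totals = np.empty(n_boot)
    for b in range(n_boot):
        total = 0.0
        for _ in range(n):                      # resample the same size as the data
            cell = keys[rng.choice(len(keys), p=probs)]
            total += rng.choice(cells[cell])
        totals[b] = total
    return totals
```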

All good, but what if we don't know the hierarchy? Specifically, what if we know that a query may or may not exclude all rows from a block, but have no way of knowing which it is (without sampling the block beforehand, which defeats the purpose of ad hoc sampling)? We are now in a situation where we are performing a multi-stage bootstrap but, rather than consciously picking the sampling branch, we are merely observing it as we sample.

I'm not sure what (if anything) this does to the estimator. In frequentist statistics, foreknowledge of an experimental design changes the outcome. Bayesian stats are a bit more resilient to foreknowledge because that's baked into the prior. At any rate, I don't need to address it right away because we are just doing a comparison with BLB and BMH, not really trying to expand on those methods. On the other hand, a cursory literature review didn't turn up anything on this topic, so it might be worth investigating sooner rather than later, simply to stake out some theoretical ground.

Sunday, November 5, 2017

BLB and correlation

The bootstrap algorithm, in most of its many forms, assumes iid (independent, identically distributed) observations. Some adjustments have been proposed for when the data elements are correlated with respect to your attributes (in particular, a lot has been done with time-series data). But very little has been done to adjust for correlation of the attribute values themselves.

In both the BLB (Bag of Little Bootstraps) and BMH (Bootstrap Metropolis-Hastings) algorithms, attribute correlation is something of a killer. It's not hard to see why.

Suppose I have a query where only 1 in 100 rows is included. Since my block size is 100,000 rows, I'm still expecting to see 1000 rows included from every block. The chance of getting no rows in a block is essentially zero.

Unless, of course, the attributes are highly correlated, which they usually are in my data. Let's suppose we're going after data for one particular office and that office did all their loading in a single week. Let's further assume that the period over which all offices loaded data was four weeks and the rate of loading was fairly uniform. Since data is placed into blocks in load sequence, all the data my query cares about is going to be concentrated in the 25% of the blocks that were created during the week that office was loading.

The chance of getting no relevant rows in a block is now 75% which, you may notice, is a lot higher than 0%. We could sample eight blocks and still have a ten percent chance of not finding any relevant rows. Obviously, any estimate of the sum from such a sample would be zero. Worse, the bootstrap evaluation of the quality of that estimate will be that it's perfect. No deviations will be found, no matter how many bootstrap runs you perform. Any reasonable stopping rule would kick in and you'd get a very confidently stated wrong answer.
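The arithmetic behind that eight-block figure:

```python
p_miss_one = 0.75              # the query's rows live in only 25% of the blocks
p_miss_all = p_miss_one ** 8   # probability that eight sampled blocks all miss
print(round(p_miss_all, 3))    # 0.1 -- a confidently wrong answer about 10% of the time
```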

There are a couple of ways to attack this. One is to randomize the data a bit so there's less attribute correlation within blocks. That would work, and it's not particularly hard to do, especially on a platform like Hadoop where background data re-orgs are what the system is good at.

But, what if we went the other way? Suppose we amped up the correlation so that we were maximizing the chance of finding zero rows. D-BESt, my blocking algorithm from last spring, is really good at that. Recall that for real-life production queries on real-life production data, it yielded a partitioning scheme where upwards of 90% of all blocks could be safely excluded from any search.

Well, if you know up front that they can be excluded, it's not cheating to simply exclude them and limit your search to the blocks that may actually have some information you're looking for. Such a remaining set of blocks would have a hit rate much higher than the population at large. While I developed D-BESt with the idea that it would be used to optimize full scans, I'm increasingly thinking it can make the sampling work a lot better as well.

Friday, November 3, 2017

Big Data gets it wrong

Some may interpret my meager posts over the last week as lack of progress. Well, they pretty much got that right. We are in the final two weeks of the Plan cycle at work. That's where all the offices submit their projections for the next year. It's not as easy as it might sound and my group is right in the middle of it since they use our stuff to report on the projections. Thus, it's been a pretty busy week and, while I've had time to read quite a bit of stuff, I haven't had a solid block of hours to code the comparo.

So, I'm writing completely off topic tonight.

Yaya turns 16 in 21 months. I'm not going to give her my car right away but, as she's a pretty responsible kid, I doubt I'll hold it back for too long. I'll probably change the license plate since there's pretty much no chance she'll ever run 100 miles.

So, like any self-respecting 54-year-old dude, I've been checking out the current crop of sports cars. Note that I said "sports cars" not "grand touring cars". I have nothing against fast luxury vehicles, but they don't seem like a good use of such a large sum of money.

Anyway, as I was browsing web sites looking at pictures of Miatas, 124s, Minis, and such, it occurred to me that maybe I could afford a used Lotus Elise, which is pretty much my idea of the perfect car: super light, exceedingly nimble, and sufficiently underpowered that you have to drive it right to appreciate just how great it is. So, I poked around for that.

Now, every site I go to has a banner ad for some quarter-million-dollar car. Folks, the Elise only ran $50K new. How does checking out the resale value on that suddenly put you in the market for the likes of Ferrari and Maserati?

Somebody at Google needs to tweak their eigenvectors.