Friday, March 9, 2018

Who cares (part 2)

I think I just figured out who cares: I do. More importantly, my boss does. And several layers of management above that.

As some may recall, I've been a bit concerned that my research, while improving in academic merit, has risked becoming irrelevant. Today, I realized that's not the case (or, at least, doesn't have to be the case).

Up until now, I've been framing this as a sampling problem (uh, oh, there he goes reframing things again). The sampling problem is not without merit and, before my adviser, who reads these posts, has a heart attack, I'm not straying from that for the immediate paper. However, the real power is in what it allows from a data organization standpoint.

A brief anecdote may help here. Back in 2012, the actuaries asked us to build an Analytic Cube based on policy attributes. That sounds like an easy thing and, from a data architecture standpoint, it is. From a data engineering standpoint, not so much. The problem wasn't that we'd have 60 billion fact rows. We had built cubes that size before. The problem was that the policy attribute dimension would have around 4 million rows. That was in addition to all the existing dimensions, several of which have a cardinality close to 1 million.

We spent no small amount of time trying to figure out how to partition the data so that we weren't doing ridiculously large "joins" (multi-dimensional data doesn't technically get joined, but the operation of attaching dimensions to facts is similar in terms of computation). Wouldn't it be nice if we could just indicate the structure and let the database tune itself?

That's sort of what D-BESt does, but it's really focused on just maximizing partition pruning. What if we took a step beyond that? To answer that, we'd have to have some idea of what that step would be. I sort of stumbled onto it while coding my sensitivity analysis stuff this week (yes, I have actually gotten some work done; I've just been too busy to post about it). Sensitivities on the hit rate are pretty easy, but what about the distribution of the measures given a hit? That's a lot more open-ended.

Recall that our basic framework for the block distribution is driven by three parameters: the hit rate, the mean of the included observations, and the variance of the included observations. D-BESt does a great job of handling the hit rate. It tends to be zero or fairly high, and we generally don't need to read the block to know that. But, what about the other two? Having them relatively homogeneous within a block might also be of great value. If we knew their distribution for the whole data set, and we knew how they related to hit rate, we could do a much better job not only of sampling and performing sensitivities but also of recognizing what data should be pre-aggregated and where the anomalies lie (that last bit is really what the actuaries are after).
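
To make that concrete, here's a rough sketch of the per-block summaries in question (this is not D-BESt itself; the array layout and names are mine, purely for illustration):

```python
import numpy as np

def block_summary(values, hits):
    """Summarize one block: hit rate, plus mean and variance of the included rows.

    values : measure values for every row in the block
    hits   : boolean mask, True where the row satisfies the query predicate
    """
    values = np.asarray(values, dtype=float)
    hits = np.asarray(hits, dtype=bool)
    hit_rate = hits.mean()          # fraction of rows the query actually wants
    included = values[hits]
    if included.size == 0:
        return hit_rate, float("nan"), float("nan")
    variance = included.var(ddof=1) if included.size > 1 else 0.0
    return hit_rate, included.mean(), variance
```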

For example, if we already have data somewhat stratified by mean and variance, we can bring sampling algorithms like CISS into play.
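
I won't try to reproduce CISS here. But as a minimal sketch of the general idea under my own simplifying assumptions (a Neyman-style allocation, with per-block sizes and standard deviations already in hand), spending more of the sampling budget on the big, noisy blocks looks roughly like this:

```python
import numpy as np

def allocate_samples(block_sizes, block_stds, total_budget):
    """Split a fixed sampling budget across blocks, favoring large/noisy ones.

    Roughly Neyman allocation: n_h proportional to N_h * sigma_h.
    A simplification for illustration, not the CISS algorithm itself.
    """
    block_sizes = np.asarray(block_sizes, dtype=float)
    block_stds = np.asarray(block_stds, dtype=float)
    weights = block_sizes * block_stds
    if weights.sum() == 0:
        weights = block_sizes  # fall back to proportional allocation
    # round and give every block at least one sample, so the total is approximate
    alloc = np.maximum(1, np.round(total_budget * weights / weights.sum()))
    return alloc.astype(int)
```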

We can also create good sub-samples of the data by taking a slice from each block.
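
For instance (again just a sketch, assuming each block is a numpy array of measure values), a representative sub-sample falls out of drawing the same fraction of rows from every block:

```python
import numpy as np

def slice_sample(blocks, fraction=0.01, seed=42):
    """Build a sub-sample by drawing the same fraction of rows from every block."""
    rng = np.random.default_rng(seed)
    pieces = []
    for block in blocks:
        if len(block) == 0:
            continue
        n = max(1, int(len(block) * fraction))   # at least one row per block
        idx = rng.choice(len(block), size=n, replace=False)
        pieces.append(block[idx])
    return np.concatenate(pieces)
```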

Aggregates can be built based on how often a block gets "split", that is, how often a query produces a high hit rate for some of the block and a low hit rate for the rest.
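
One crude way to spot those candidates (the thresholds and the counting rule here are my own strawman, nothing formal) is to log, per block, how often a query's hit rate lands in the awkward middle:

```python
from collections import Counter

split_counts = Counter()

def record_query(block_hit_rates, low=0.05, high=0.95):
    """Count a 'split' whenever a block is neither skipped nor hit almost entirely."""
    for block_id, rate in block_hit_rates.items():
        if low < rate < high:
            split_counts[block_id] += 1

def aggregation_candidates(top_n=10):
    """Blocks that keep getting split are the first candidates for pre-aggregation."""
    return [block_id for block_id, _ in split_counts.most_common(top_n)]
```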

We can ask why the data gets sorted the way it does. Are there patterns in the organization that are useful to machine learning? If one particular attribute doesn't conform to the pattern, is that a problem?

Most importantly, this leads to pre-computation. If we can anticipate a query, it doesn't matter so much that it takes a while to answer, because we can start doing so before the query is even issued. Sorting the data into pre-computed blocks of like distribution is the first step in that.
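
A toy version of that (the names and structure are mine, purely for illustration): if the blocks are homogeneous enough, the expensive part of an anticipated query reduces to combining per-block partials that were computed ahead of time.

```python
def precompute_partials(blocks):
    """Compute per-block partial aggregates (count, sum, sum of squares) ahead of time."""
    return {
        block_id: (len(values), sum(values), sum(v * v for v in values))
        for block_id, values in blocks.items()
    }

def answer_from_partials(partials, block_ids):
    """Answer count/sum/mean for an anticipated query by combining cached partials."""
    n = sum(partials[b][0] for b in block_ids)
    s = sum(partials[b][1] for b in block_ids)
    return {"count": n, "sum": s, "mean": s / n if n else None}
```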

Doing this within the context of rigorous statistics (rather than the more Data-Sciency approach of just trying stuff and seeing what generally works) is not a small task. But, that's why I'm thinking this line of inquiry may still have merit, even if it has strayed a bit from the original question. It's hard, feasible, and valuable. That's a pretty stout combination.
