Thursday, December 28, 2017

Markov chains

I'm back. Didn't have much chance to publish over break, but I did get a fair bit of work done. One lead I'm looking into is the suggestion from my adviser that we consider the query identity function to be a Markov chain. Markov chains are sequences of random variables where each variable represents a "state". How you got to a state doesn't matter; your next state depends entirely on the state you are in. If you want to model correlation that way, you build your history into the state.

So, for example, if you had a sequence where the variables are correlated 3 deep (that is, the next value is correlated with the previous 3), then your states would actually be 3-tuples of the last three observations. Those alone determine the distribution of the next value. Since the identity function can only take on zero or one (a row is either in the query or it ain't), the state can just be the sum (or possibly a time-weighted average) of the previous n observations, where n is your correlation horizon.
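
To make the state encoding concrete (illustrative code; the names are mine):

    # Illustrative: two ways to encode the state for a horizon of n = 3.
    history = [0, 1, 1, 0, 1]          # last few identity observations

    # General case: the ordered 3-tuple of the last three observations,
    # giving 2**3 = 8 possible states.
    state_tuple = tuple(history[-3:])  # (1, 0, 1)

    # Binary identity case: if the next-value probability depends only on
    # how many of the last three were 1, order can be dropped (4 states).
    state_count = sum(history[-3:])    # 2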

I generated some quick chains using correlation horizons of 0, 1, 2, and 10 observations. I used the same pseudo-random number stream for each so that they were easier to compare. The only difference between runs is the transition matrix. (A sketch of the sort of generator I mean follows the listing.)

 0:0100001001011101101010000000001100001101001001101010010001010101010011000110010001011111010110100011
 1:1110000000001111100011000000000110000111000001101110000000010101111000000110000000011111111110100011
 2:0100000000001111101010000000000100000101000001111110000000010101010000000111010000011111111110100011
2a:0100000000000101111010000000000100000101000001111111110000000101010000000111010000000001010110100011
10:0100001000000000000000000000000100000100000001101010010001010101010011000110010000011111010110100011
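
Here's a minimal sketch of the generator (the transition probabilities below are just illustrative, not the ones from these runs); re-seeding with the same value mimics reusing one pseudo-random stream:

    import random

    def gen_chain(n, p_one, length=100, seed=42):
        """Binary chain whose next value depends on the count of ones
        among the last n observations; n = 0 gives i.i.d. draws."""
        rng = random.Random(seed)            # same stream for every run
        hist, out = [], []
        for _ in range(length):
            state = sum(hist[-n:]) if n else 0
            bit = 1 if rng.random() < p_one[state] else 0
            hist.append(bit)
            out.append(str(bit))
        return "".join(out)

    print(" 0:", gen_chain(0, {0: 0.4}))
    print(" 1:", gen_chain(1, {0: 0.25, 1: 0.625}))  # stationary hit rate 0.4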

Each of these chains has an overall hit probability of 0.4. That is, 40% of the values are 1. More precisely, if you ran the sequence long enough, you would converge to that percentage. However, the correlation structure is quite different.
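
For the horizon-1 case, this is easy to verify: if p01 is the probability that a 0 is followed by a 1, and p10 the probability that a 1 is followed by a 0, the stationary probability of a 1 is p01 / (p01 + p10). With illustrative numbers like p01 = 0.25 and p10 = 0.375, that's 0.25 / 0.625 = 0.4.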

On the uncorrelated row, the value switches between zero and one quite frequently. When the correlation horizon goes to 1, it switches less frequently. At 2, even less so. Row 2a shows that, even with the same horizon, you can greatly increase the correlation by changing the transition probabilities. The last line represents a fairly modest correlation with a horizon of 10. It switches more often than 2a, but also tends to drift further from the underlying probability over any given subsequence.
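
A quick way to put a number on the switching is the fraction of adjacent positions that differ, shown here on the horizon-0 and horizon-10 rows from above:

    chains = {
        " 0": "0100001001011101101010000000001100001101001001101010010001010101010011000110010001011111010110100011",
        "10": "0100001000000000000000000000000100000100000001101010010001010101010011000110010000011111010110100011",
    }

    def switch_rate(chain):
        """Fraction of adjacent positions where the value flips."""
        flips = sum(a != b for a, b in zip(chain, chain[1:]))
        return flips / (len(chain) - 1)

    for label, chain in chains.items():
        print(label, round(switch_rate(chain), 2))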

The point is, you can model pretty much any correlation behavior you want with this sort of thing.

Except, the behavior I'm really after.

The reason the variables appear to be temporally correlated is that batches of related data get loaded all at once. So, if one is running a query that will include a lot of rows from one batch and few from another, that will look like temporal correlation, except at the transitions between one batch and the next. At the actual batch boundaries, the correlation is zero.

What's really happening isn't correlation of the identity function so much as correlation with a confounding variable. The batch ID is generally not in the query, but the values of other attributes that are in the query are strongly dependent on the batch ID.
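
A toy version of that story (all parameters invented) shows the difference: rows within a batch share a hit probability, but the draws on either side of a batch boundary are independent:

    import random

    rng = random.Random(0)

    def batchy_identity(n_batches=5, batch_size=20):
        """Rows in a batch are drawn i.i.d. with a batch-level hit
        probability; this looks temporally correlated within batches
        while adjacent rows from different batches stay independent."""
        bits = []
        for _ in range(n_batches):
            p = rng.random()           # the confounder: batch-level hit rate
            bits += ["1" if rng.random() < p else "0"
                     for _ in range(batch_size)]
        return "".join(bits)

    print(batchy_identity())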

I think we can get away with modeling it this way for now. The fact that we're getting pretty decent results without accounting for correlation at all indicates that this isn't a huge problem from a practical standpoint.

That said, the ultimate goal is to lay this algorithm on top of my dynamic partitioning, which will send attribute correlation through the roof. So, we are going to have to deal with it at some point.
