Thursday, December 28, 2017

Markov chains

I'm back. Didn't have much chance to publish over break, but I did get a fair bit of work done. One lead I'm looking into is my adviser's suggestion that we treat the query identity function as a Markov chain. Markov chains are sequences of random variables that carry a "state": how you got to the state doesn't matter; the distribution of the next state depends entirely on the one you're in. If you want to model correlation that way, you build your history into the state.

So, for example, if you had a sequence where the variables are correlated 3 deep (that is, the next value will be correlated with the previous 3), then your states would actually be 3-tuples of the last three observations. Those alone determine the distribution of the next value. Since the identity function can only take on zero or one (a row is either in the query or it ain't), the state can just be the sum (or, possibly a time-weighted average) of the previous n observations where n is your correlation horizon.
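
Here's a minimal sketch (Python) of how such a chain can be simulated when the state is the count of ones in the last n observations. The blending rule and the `stickiness` knob are stand-ins I made up for illustration, not the actual transition matrices mentioned below:

    import random

    def simulate_chain(length, horizon, p_one, stickiness=0.5, seed=1):
        # 0/1 chain where the next value depends only on how many of the
        # last `horizon` observations were 1.  The blending rule below is
        # a made-up stand-in for a real transition matrix.
        rng = random.Random(seed)          # same pseudo-random stream each run
        window = []                        # the last `horizon` observations
        out = []
        for _ in range(length):
            if horizon == 0:
                p = p_one                  # uncorrelated case
            else:
                # Pull the hit probability toward the recent average.  The
                # long-run fraction of ones stays at p_one; `stickiness`
                # controls how strongly the chain clings to its recent past.
                p = (1 - stickiness) * p_one + stickiness * sum(window) / horizon
            x = 1 if rng.random() < p else 0
            out.append(x)
            window.append(x)
            if len(window) > horizon:
                window.pop(0)
        return "".join(map(str, out))

    for h in (0, 1, 2, 10):
        print(h, simulate_chain(100, h, p_one=0.4))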

I generated some quick chains using correlation horizons of 0, 1, 2, and 10 observations. I used the same pseudo-random number stream for each so that they were easier to compare. The only difference is the transition matrix.

 0:0100001001011101101010000000001100001101001001101010010001010101010011000110010001011111010110100011
 1:1110000000001111100011000000000110000111000001101110000000010101111000000110000000011111111110100011
 2:0100000000001111101010000000000100000101000001111110000000010101010000000111010000011111111110100011
2a:0100000000000101111010000000000100000101000001111111110000000101010000000111010000000001010110100011
10:0100001000000000000000000000000100000100000001101010010001010101010011000110010000011111010110100011

Each of these chains has an overall hit probability of 0.4. That is, 40% of the values are 1. More precisely, if you ran the sequence long enough, you would converge to that percentage. However, the correlation structure is quite different.

On the uncorrelated row, the value switches from zero to one quite frequently. When the correlation horizon goes to 1, it switches less frequently. At 2, even less so. 2a shows that even with the same horizon, you can greatly increase the correlation by changing the transition probabilities. The last line represents a fairly modest correlation with a horizon of 10. It switches more often than 2a, but also tends to drift further from the underlying probability for any given subsequence.

The point is, you can model pretty much any correlation behavior you want with this sort of thing.

Except, the behavior I'm really after.

The reason the variables appear to be temporally correlated is that batches of related data get loaded all at once. So, if one is running a query that will include a lot of rows from one batch and few from another, that looks like temporal correlation everywhere except at the transitions from one batch to the next. At the actual batch boundaries, the correlation is zero.

What's really happening isn't correlation of the identity function so much as correlation with a confounding variable. The batch ID is generally not in the query, but the values of other attributes that are in the query are very dependent on the batch ID.
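
Here's a toy illustration of that (Python; the batch sizes and per-batch selectivities are invented, nothing from real data):

    import random

    rng = random.Random(7)

    # Hypothetical setup: rows are loaded in batches, and the query's
    # selectivity is a property of the batch, not of the row's position.
    batch_selectivity = {"A": 0.9, "B": 0.05, "C": 0.6}

    rows = []
    for batch, sel in batch_selectivity.items():
        # Within a batch, the identity bits are i.i.d. given the batch.
        rows += [1 if rng.random() < sel else 0 for _ in range(40)]

    # Scanning in load order, the bits look temporally correlated -- long
    # runs of mostly-1s then mostly-0s -- but the "correlation" is really
    # the batch ID, and it disappears exactly at the batch boundaries.
    print("".join(map(str, rows)))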

I think we can get away with modeling it this way for now. The fact that we're getting pretty decent results without accounting for correlation at all indicates that this isn't a huge problem from a practical standpoint.

That said, the ultimate goal is to lay this algorithm on top of my dynamic partitioning which will send attribute correlation through the roof. So, we are going to have to deal with it at some point.

Sunday, December 17, 2017

Do we really want to go there?

Anybody who's been with the blog from the get-go (a fairly small group, based on my page stats) knows that I originally went from looking at individual entries to block sums because adjusting for the correlation was over-complicating things. Having developed a working algorithm using just block sums, I'm a little loath to revisit that decision. Still, my adviser has made the legitimate point that, if the block sums are not sufficient statistics, then we are giving away information.

We have argued that while the individual observations are not independent, the blocks are. That's not really true, but it's pretty close to true. The last records in one block are correlated with the first records in the adjacent block. However, since we are sampling blocks randomly, that's not really a problem; there's no systematic correlation that will throw off our results. And, if the block sums are independent, then it's a no-brainer that they are sufficient statistics for the total sum.

A question that can't be answered from the block sums alone, however, is how correlated the data is. This impacts our estimate of the variance of the total sum. Recall that we're assuming a mixture distribution on the block sum: it is zero with some probability 1-λ and distributed according to an arbitrary non-constant distribution otherwise. If the correlation were zero, then 1-λ would always be so close to zero that we'd have no need for a mixture distribution. It's only when the observations are highly correlated that there's much chance that all the rows in a block get excluded. As the variance is very dependent on the value of λ, it would be good to pin this value down as early as possible in the sampling.
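
One way to see why λ matters so much for the variance: if a block sum S is zero with probability 1-λ and otherwise has conditional mean μ_B and conditional variance σ_B² (shorthand of mine, not notation from the paper), the law of total variance gives

    Var(S) = λ·σ_B² + λ(1-λ)·μ_B²

and both terms move around quite a bit as the estimate of λ does.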

I've been using a prior of λ ~ Beta(h, m), where h (hits: the number of non-zero block sums) and m (misses: the number of zero block sums) start at 2 and then get incremented for every sampled hit and miss, respectively. This obviously converges to the right answer, but it takes a while when λ really is very near (or equal to) zero or one. This is why CISS actually generates better confidence intervals for modestly selective queries than for ones where nearly every block produces either a zero or a non-zero sum.
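
In code, that update is nothing more than counting; a minimal sketch (the starting counts of 2 match the prior above):

    def lambda_posterior(block_sums, h0=2, m0=2):
        # Beta(h, m) posterior for lambda: start at the prior counts and
        # add one to h for each non-zero block sum (hit) and one to m for
        # each zero block sum (miss).
        h = h0 + sum(1 for s in block_sums if s != 0)
        m = m0 + sum(1 for s in block_sums if s == 0)
        return h, m, h / (h + m)       # counts plus the posterior mean

    # Seven sampled blocks, three of them hits:
    print(lambda_posterior([0, 3.2, 0, 0, 7.1, 0, 4.4]))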

One recent suggestion from my adviser is to look at the number of rows selected in each block and make some deductions from that. If all the blocks are returning about the same ratio, the correlation is probably pretty low. If it's all over the place, or if many blocks return zero rows, then correlation may be a big deal. In the latter case, you'd want to sample further to guard against missing a block that has a lot of rows to return.
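
A rough version of that check might look something like this (my own sketch of the suggestion, with arbitrary cut-offs, not something we've settled on):

    from statistics import mean, pstdev

    def correlation_warning(rows_selected, rows_per_block,
                            cv_cutoff=0.5, zero_cutoff=0.2):
        # Heuristic only: stable per-block selection ratios suggest mild
        # correlation; a big spread or lots of empty blocks suggests the
        # correlation matters and more sampling is warranted.
        ratios = [r / rows_per_block for r in rows_selected]
        zero_frac = sum(1 for r in rows_selected if r == 0) / len(rows_selected)
        spread = pstdev(ratios) / mean(ratios) if mean(ratios) > 0 else float("inf")
        return spread > cv_cutoff or zero_frac > zero_cutoff

    print(correlation_warning([38, 41, 45, 37], 100))   # False: steady ratios
    print(correlation_warning([0, 95, 0, 88], 100))     # True: feast or famine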

The question is, do we need to answer this now? Frankly, I'd rather get the first result out there and cite it as an area for future study. I don't think it will fundamentally change the algorithm, but it could complicate the analysis, and I've found that this is not an easy problem to explain to people even in its simplest form (it seems lots of people who use databases and analytic tools all the time haven't given much thought to how they actually work).


Saturday, December 16, 2017

Insufficient

One of the things I discussed with my adviser yesterday was whether we were dealing with sufficient statistics (better yet, minimal sufficient statistics). Because the individual rows are correlated, it is more convenient to work with block sums and use the (mostly true) assumption that block sums are independent.

If individual rows are distributed by pdf f(x) with mean μ and they are included in the query with probability λ, then we have two parameters to estimate: (λ, μ). Knowing both of these gives us the distribution of the sum. The problem is, the block sum is not a sufficient statistic for these. If we get a block sum of 50 on a block of 100 rows, there's no way to know whether that came from (λ = 1, μ = 0.5) or (λ = 0.5, μ = 1). Examining the individual rows would give us more insight into that, so the block sum is not sufficient.

Of course, the block sum is sufficient for the value we really want: the product of λ and μ, which is the expected value of an individual row. That can then be multiplied by the total number of rows to estimate the population sum.

However, if we want a confidence interval on that sum, we also need to know the variance, and that very much depends on the individual parameters, λ and μ.

All hope is not lost. Again, we don't really care about λ and μ; we want the variance (which happens to be a function of those two). The variance is also a function of the vector of block sums. In the infinite-population case, the block sums observed so far are a sufficient statistic for the variance of the mean. So, in the finite case, we should have a statistic which is asymptotically sufficient. I think if I can establish that, plus some reasonable comments about the rate of convergence, that should be good enough.
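
For what it's worth, here's the textbook version of what the block sums buy you in the finite case, treating the sampled block sums as a simple random sample of blocks (a sketch with a plain normal-approximation interval, not the CISS machinery itself):

    from math import sqrt
    from statistics import mean, variance

    def estimate_total(sampled_block_sums, total_blocks, z=1.96):
        # Point estimate and rough confidence interval for the population
        # sum using nothing but block sums, under simple random sampling
        # of blocks without replacement.
        n = len(sampled_block_sums)
        xbar = mean(sampled_block_sums)
        s2 = variance(sampled_block_sums)      # sample variance of block sums
        fpc = 1 - n / total_blocks             # finite population correction
        total = total_blocks * xbar
        se = total_blocks * sqrt(fpc * s2 / n)
        return total, (total - z * se, total + z * se)

    print(estimate_total([0, 12.5, 0, 0, 40.1, 3.3, 0, 22.0], total_blocks=1000))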

Friday, December 15, 2017

St. Vincent.

I had a good meeting with my adviser today, but I've got quite a bit of work to do before I'll have anything coherent to post from it. So, instead, I'm going to talk about the run I went on after our meeting.

There's a park south of campus called St. Vincent Park. I don't know what this Vincent character did to get canonized, but I'll give him the benefit of the doubt and say it was probably genuine virtue rather than simply winding up on the wrong end of some Turk's scimitar. Since many of my runs from campus go through this park, I've started naming them with virtues. St. Vincent the Humble is the shortest; it's just a loop around the park. St. Vincent the Goode combines the park loop with most of the Wayne Goode trail that runs through campus. The longest loop is St. Vincent the Patient.

Today, I decided to run a new route. I went through the park and then continued south to Forest Park. I then took the MetroLink back to campus (UMSL students get free Metro passes). While I was running through Wellston, four high schoolers did a comically bad job of impersonating dangerous thugs. Granted, this is a very rough neighborhood and race relations in North County are challenged, to say the least. But, this was also in broad daylight on one of the busiest streets, and they were having a hard time not laughing as they ran towards me. They ran along with me for a bit and heckled me while I just chuckled. Seeing that their "attack" had failed to produce the desired response, they went back to whatever they were doing before (which appeared to be nothing).

I run through the hood all the time and little incidents like these don't faze me. Nobody is seriously going to mug a runner; we don't carry anything worth stealing. Still, I don't know that I'll run this particular route often enough to warrant naming it. However, I can't pass up such an obvious pun. From here on, this route will be St. Vincent the Chased.

Tuesday, December 12, 2017

Upset

I spent the night at work (yes, Dr. Wu, paid employment; I'll get back to this research thing real soon). Well, a good chunk of the night, anyway. I'm packing it in at 11PM, which is actually earlier than I had expected.

Since it was off hours, I didn't feel bad about having FiveThirtyEight's running commentary on the Alabama election going at the same time. The fact that Alabama just elected a Democrat to the United States Senate is more than a little mind-bending. It was the perfect storm for the GOP. Just goes to show that heavy-tailed distributions do exist.

While I'm certainly no fan of Moore, I actually find this whole episode rather sad. Of course Moore should have lost. Even without the whole child molestation thing, he's shown a complete disregard for the limits of power throughout his career as a judge. If anybody should be aware of those limits, it would be a judge. But, this is really unfair to the people of Alabama.

Jones doesn't speak for them. He speaks for a small and vocal minority. He will vote opposite of what the vast majority of the state wants. Whether or not you agree with what Alabama wants, this is not how democracy is supposed to work. Then again, very little of this past year has had much to do with how democracy is supposed to work.

It has occurred to me that this whole data science doctoral thing could make me somewhat valuable to political campaigns. I don't know if I want to go that route or not. On the one hand, I really do believe in democracy and wouldn't mind helping a candidate with whom I agreed. On the other hand, the whole process is just so broken right now that I might get super depressed.

I suppose I should get back to my research before stressing about it.

Sunday, December 10, 2017

Delegation

Another week of very little progress on the paper as we get year end things finished up at work.

One could argue that, since I'm in charge of the group, I could just delegate a few more things. Technically, writing code isn't even in an architect's job description, though I do a fair bit of it. This last bit was a module that I really should have handed off because it was quite a bit of work. However, it was done rather badly the first time and I just didn't want to risk it being done badly again because it's rather central to the whole system.

As with my research, I need a big block of hours to be productive coding and that simply doesn't happen during normal business hours. So, since knocking out a big chunk of code during the day is pretty much fantasy land, I end up doing it at night. I like coding at night, but that doesn't leave any room for research.

Fortunately, it's done and we don't have any big pushes for a few months. I need to start using those evening blocks for research.

Thursday, December 7, 2017

Pareto results

Just so it doesn't look like I've made no progress when I meet with my adviser tomorrow, I ran the algorithms with a Pareto distribution. The significance of using Pareto rather than Laplace is that, with the shape parameter I'm using, Pareto has infinite variance (but still a finite mean). Laplace just has a really big variance. That appears to matter.


CISS is dealing with it fine (stripping off the tails brings us back to finite, but large, variance). The Bag of Little Bootstraps goes completely haywire. Bootstrap-Metropolis-Hastings is somewhere in the middle. Using a query that introduces correlated filtering doesn't change things much one way or the other.



So, I think this is a better illustration if we're only going to use three distributions. I suppose there's no downside to showing all four, other than that all these graphs are making the paper pretty long without adding much real meat.

Tuesday, December 5, 2017

Effort vs Talent? Try Ability

There's an ongoing debate in the area of human performance as to which counts for more, effort or talent. Arthur Lydiard, the famed track coach, was once asked what was the most important thing a person could do to make it to the Olympics. He answered, "choose good parents." On the other hand, the much-misquoted "10,000-hour rule" implies that it's the effort that makes the difference.

My own experience, both in professional athletics and at one of the world's top universities, is that it very much depends on what you count as success. I was good enough to ride for a pro team and good enough to get an engineering degree from Cornell, but both were very hard. The hardest things I've ever done. No additional effort would have made me a really good pro or put me at the top of Cornell's class; I simply didn't have that kind of talent. But, I did have enough to do what I did. Some people aren't even that lucky.

That said, I think a person can find modest success in just about anything unless they are boxed in from birth by truly bad genetics or circumstance. Conversely, no matter how much the stars have aligned in your favor, there comes a point where you have to put out some effort to actually accomplish something.

So, you need both. I call that "ability". It's independent of how you got there; it's just a measure of where you are. And from that point, you can choose to apply more effort to increase it (or, if you're a 54-year-old runner like me, limit its decline) or you can choose to put effort into something else where talent gives you more room to grow.

Or, you can do what so many people do and not even think about it. Simply cruising through life is an option, and I'm not even going to say it's a particularly bad one, especially if you are blessed with a lot of advantages from the outset. It's just not a path I'm interested in.

I started thinking about all this because tonight was Yaya's concert at school. The middle school had two bands (7th and 8th grade) and the high school had two (there's more mixing of years in the two ensembles there, but it can basically be viewed as varsity and JV). The 8th grade band sounded the best, at least to my ear. Sure the Symphonic Band (Varsity) played harder stuff, but they didn't play it particularly well. In their defense, they are all coming right off marching band season and threw this thing together in just a handful of rehearsals. But, that's my point. It was the extra effort of 15 weeks of prep that made the 8th graders better. At the end of the day, people are (generally) rewarded for what they actually do. Ability is the key to doing things, regardless of how you come by it.

Sunday, December 3, 2017

Pareto

I haven't done much (hardly any) work with Pareto random variables. That's actually a little odd since they show up so often in financial analysis. The Pareto distribution is the mathematical basis of the 80/20 rule: 80% of the wealth (or income, or effort, or whatever you're measuring) is held by 20% of the population. Furthermore, because it's a power-law distribution, you can apply this rule as often as you like. So, if you just look at that top 20%, you find the same 80/20 rule applying within it. This pattern is definitely present in my data, as I noted way back when I started down this road two years ago.

It doesn't have to be 80/20; that's just the classic statement of the "Pareto Principle". Pareto found that 80/20 was the most common ratio with respect to wealth distribution both in his home country of Italy and pretty much every other country for which he could get data. That seems to have held up pretty well over the subsequent 125 years not only for wealth but a lot of other things as well. The actual distribution can be tuned to have an inflection point somewhere else (for example 90/10), but 80/20 works for a surprisingly large number of cases.

Believers in math religion (that is, the idea that the world is somehow governed by math) will quickly note that the shape parameter, α, that yields the 80/20 rule is a very pretty (log 5)/(log 4). Yes, the Ancient Greeks would have loved it (except for the bit about it being an irrational number; they weren't down with that idea). However, I find that view dangerously naive. Math is very useful for describing reality, but it certainly doesn't dictate it. 80/20 is just a nice approximation, nothing more.
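
For the record, the arithmetic behind that (a standard Lorenz-curve fact about the Pareto, nothing new here): for shape α > 1, the share of the total held by the top fraction q of the population works out to q^(1 - 1/α), so

    top 20% share = 0.2^(1 - log 4 / log 5) = 0.2^(log 1.25 / log 5) = 0.8

which is the 80/20 split exactly.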

But, we're drifting into Philosophy and right at the moment I need to be focused on Statistics. The reason Pareto is useful to me is that it's a common distribution which, for shape parameters like the one I'm using, has infinite variance. As such, it's a better example than the Laplace distribution I have been using, so I think I'm going to swap it into the analysis.

There is a small problem when dealing with positive and negative numbers. Laplace is a simple symmetric version of the exponential distribution. You can do the same thing with Pareto, but you get a hole in the domain: Pareto random variables are bounded away from zero. So, if your minimum value is xm, then simply folding the distribution over on itself means that you can have observations anywhere on the real line except the interval (-xm, xm). This is not a particularly hard problem to get around: just shift the distribution closer to zero by xm.

The downside of that is that it screws up the 80/20 rule. Another option is just to not worry about values with magnitude less than xm. My model already assumes a mixture of a real-valued random variable and one that is constant zero, so I don't really need to do anything to adjust for that. The drawback here is that I'm really giving up a lot of the center of my distribution. To match the interquartile range of the standard normal with a symmetric 80/20 Pareto, the minimum value is approximately 0.37. So, the "hole" in the distribution is more than half the width of the interquartile range. By contrast, the standard normal puts nearly 30% of its distribution in that same interval, and a Cauchy scaled to the same interquartile range jams in a bit more.

However, it's the tails I'm really interested in, so I'm not going to stress much over the interior. I don't really want to devote a whole paragraph in the paper to how I transformed the pseudo-random data just to do a comparison. I'll just state it as symmetrical Pareto and let the results speak for themselves (unless, of course, my adviser vetoes that approach).

Just for the record (so I don't have to derive it again), the Pareto parameters that give an 80/20 distribution with an interquartile range equal to the standard normal's are xm = 0.3713 (I'm using 4 significant digits in my parameters) and α = 1.161.
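
And a throwaway script to sanity-check those parameters (Python, inverse-transform sampling; the empirical interquartile range should land near the standard normal's 1.349):

    import random

    def symmetric_pareto(n, xm=0.3713, alpha=1.161, seed=42):
        # Inverse-transform sampling: |X| = xm * U^(-1/alpha) for uniform U,
        # with a random sign.  No values ever land inside (-xm, xm).
        rng = random.Random(seed)
        out = []
        for _ in range(n):
            u = 1.0 - rng.random()             # in (0, 1], avoids divide-by-zero
            magnitude = xm * u ** (-1.0 / alpha)
            out.append(magnitude if rng.random() < 0.5 else -magnitude)
        return out

    xs = sorted(symmetric_pareto(200_000))
    iqr = xs[int(0.75 * len(xs))] - xs[int(0.25 * len(xs))]
    print(round(iqr, 3))                       # should be close to 1.349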


Friday, December 1, 2017

New hours

I'm once again feeling like progress is stuck. A few weeks ago, I felt quite differently. I don't think it's any coincidence that I felt like things were going well after I had pulled a couple all-nighters to get some stuff knocked out.

As a pentagenarian, I can't do that on a regular basis. So, I talked to my VP about shifting my work hours to open up some larger blocks of time. I can slide reading a paper in around things, but to actually get my own work done, I need to be able to focus for at least a day.

The plan is that I'll stay a bit later during the early part of the week and then leave before lunch on Friday. That will allow me to meet with my adviser and then stay late at the library. Should be good for 8-10 hours of concentrated effort. I'll also carve out one other night a week (probably Tuesday or Wednesday) where I stay late at work. Oddly enough, my "workspace" at "work" is actually well suited to efficiently getting "work" done.

The remainder of my 20 hours of schoolwork will come from the aforementioned sliding in of paper reading at lunch or in the evening at home.

At least, that's the plan. I tried it out today and it didn't go very well because my adviser wasn't available and I had to get back home to feed Yaya. But, it's a good plan.