Monday, May 28, 2018

Actual contestants in the comparo

As I've noted before, BMH and BLB are train wrecks when the data is not independent. While I certainly want to cite those authors, I don't think it's fair to say that their algorithms are really what I'm competing against. I've modified them too heavily to work in the blocked sampling case.

So, who's actually in this contest? And by the way, what is the contest? The second question has an easy answer: find the best estimator of the block sum variance. The actual estimator of the sum is the same for all of these; what we want to know is how good it is. The "best" estimator isn't necessarily the one that comes the closest to the real value (though that's a good start). The correlation with the sum matters, too. If the variance estimator and the block sum are positively correlated (they usually are), then you get too many rejections when both are low. So, it's entirely possible to have an estimator that's quite accurate but still produces bad confidence intervals too often (the sample block variance fails on this count).

  1. Sample Block Variance: You want to know the variance of a population and you have a sample from that population? Well, just use the sample variance. It's easy to compute, unbiased, and has minimum variance. Too bad it sucks so badly in this case. The problem, as just mentioned, is that it is very strongly correlated with the sample sum, especially at small sample sizes. Any stopping rule based on this estimator will stop too early too often.
  2. Rho Variance: Rather than use the block sum sample variance, we derive the block sum variance as a function of the hit rate variance. We then use the sample variance of the hit rates and plug it into this function. Works great when the variance of the hit rate is really what's driving the block sum variance. Not so much otherwise.
  3. Adjusted Rho Variance: Part of the problem with the above method is that there will be variations in the hit rate from one block to the next even if the actual parameter is constant. So, we compute the variance in the hit rate that we would expect to see if it really was constant and subtract that out (variances are nice that way; you can decompose them into pieces that can be added or subtracted as needed). This adjustment makes the method work well both in the above case and when things really are distributed uniformly. It still doesn't help when the underlying distribution is changing, which is to be expected since the method doesn't even look at that. (A quick code sketch of these first three estimators follows this list.)
  4. Partition Bootstrap: To try to account for partitions that vary both in hit rate and distribution of the measures, we perform a bootstrap method where we resample the existing partitions to create blocks and then look at the variance of the empirical distribution of the blocks. I'm not entirely sure why this one doesn't work better other than the usual problem with correlation between the sample sum and variance estimate. At any rate, it's actually pretty bad across the board.
  5. Metropolis-Hastings: Here, we try to construct the distribution of the entire population by estimating the parameters with an MCMC chain. We then sample these points and feed them into the exact formula for the block sum variance. I haven't finished validating my work on this one, but it appears to work fine. The only problem is that it's rather slow because you can't parallelize the MCMC chain (well, you can, but the results aren't as good). This algorithm borrows the idea from BMH that it's OK to run the chain on just a sample rather than the full data set. There's a variant, suggested by the BMH authors, that I'm also testing, where you use a different subset of the data for each iteration of the chain.
  6. Full Kernel Estimator: Rather than generate an empirical distribution with an MCMC chain, we generate one using kernel techniques. The resulting distribution is very similar to the MCMC distribution, as are the generally good results. It's still not as fast as the Adjusted Rho Variance, but it handles pretty much any case, and the kernel can easily be biased in the early going to avoid the problems of correlation with the sum estimator at low sample sizes. (I didn't find this to be necessary, but it's a nice feature to have.)
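If code makes that clearer, here's a minimal sketch of the first three estimators in R. The variable names are placeholders, and the mapping from hit-rate variance to block-sum variance is a simplified stand-in that ignores within-block variation in the measure, so don't read it as the derivation from the paper.

# Inputs: one entry per sampled block.
#   block_sum - observed sum of the measure within the block
#   rho_hat   - observed hit rate within the block
#   m         - rows per block (assumed constant here)
#   mu_hit    - average measure value among hits, estimated from the sample

# 1. Sample block variance: just the sample variance of the block sums.
sample_block_var <- function(block_sum) var(block_sum)

# 2. Rho variance: push the hit-rate variance through a mapping to
#    block-sum variance. This placeholder mapping ignores within-block
#    variation of the measure; the paper derives the exact form.
rho_var <- function(rho_hat, m, mu_hit) {
  (m * mu_hit)^2 * var(rho_hat)
}

# 3. Adjusted rho variance: first remove the hit-rate variance we'd expect
#    from pure binomial noise even if rho were constant across blocks.
adj_rho_var <- function(rho_hat, m, mu_hit) {
  rho_bar   <- mean(rho_hat)
  noise_var <- rho_bar * (1 - rho_bar) / m       # expected binomial noise
  excess    <- max(var(rho_hat) - noise_var, 0)  # variance decomposition
  (m * mu_hit)^2 * excess
}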
So, 3 and 6 are the winners, though I'm pretty sure 4 would work fine if I put more effort into it. As 3 and 6 are the most original methods, I'm not really worried that 4 isn't doing great. If somebody else wants to publish an improvement, they are free to do so.

Sunday, May 20, 2018

Lamest defense ever

I'm not saying the bible gives Peter a bad rap. He is, after all, called out as the first Pope, which would presumably imply pretty high respect. But some of the anecdotes we read during Easter season are less than complimentary and (at least to me) needlessly so.

First we have Matthew quoting Jesus as calling Peter "Satan" when Peter suggests that maybe Christ dying isn't such a great plan. All four of the gospel writers note his denial during the trial. I've already written about John's little barb about him not being able to keep up running to the tomb. Today, we get as good an explanation as any as to why Paul became the primary defender of the gospel: Peter is just not a very good lawyer.

We tend to read the passage of the Pentecost through the lens of Renaissance paintings rather than constructing the scene in our heads directly from the text. The accusation makes a lot more sense if you take the story at face value. It starts by noting that the apostles, that is, twelve guys in their 20's, were all living in one house. Yeah, what could go wrong there? One morning they are all out front yelling out a bunch of stuff in a dozen different languages. The obvious conclusion is the one the bystanders came to: these dudes are drunk out of their minds.

Maybe Peter had never been to a frat party, but his defense is absurd: "We're not drunk, it's only 9AM." At what point after the discovery of alcohol did 9AM mean that young men are not drunk? Granted, he goes on to say some really important stuff after that, and at least some folks were sold, as the story ends with three thousand people getting baptized. But, seriously, that was your opening argument? That's the lamest defense ever.

Friday, May 18, 2018

A paper in pictures

So, here's a first cut at the storyboard. The graphs are generated quickly out of Excel; in the final version they will be prettier, created with the ggplot2 package in R. In particular, the axes will be labeled properly: in all the performance graphs, the horizontal axis is a logarithmic scale of samples taken and the vertical scale is how many times the 95% confidence interval missed the real value. Ideally, 50 of 1000 (5%) of the confidence intervals miss.
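For reference, a graph like that takes only a few lines in ggplot2. This is just a sketch with made-up column names, not the actual plotting code:

library(ggplot2)

# misses_df: one row per (method, sample size), with the count of CI misses
# out of 1000 trials. Column names are placeholders.
ggplot(misses_df, aes(x = n_sampled, y = misses, colour = method)) +
  geom_line() +
  geom_hline(yintercept = 50, linetype = "dashed") +  # nominal 5% of 1000
  scale_x_log10() +
  labs(x = "Blocks sampled (log scale)",
       y = "Misses per 1000 confidence intervals")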


Sequential loading of data in batches causes similar data to be stored in the same physical block. The boundaries of batches do not typically align with the physical block boundaries. Here, physical block j contains some of batch m, all of batch m+1, and some of m+2. Block j+1 consists entirely of observations from batch m+2. Block j+2 contains the remainder of batch m+2 and the beginning of m+3. Such co-location of related rows can significantly increase the variance of an estimator when sampling is done by block.
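If you want to fake this layout to play with, something like the following works. The batch sizes, block size, and distributions are arbitrary choices for illustration, not the ones used in the paper:

set.seed(1)
n_batches  <- 200
batch_size <- sample(300:700, n_batches, replace = TRUE)  # batches vary in size
block_size <- 500                                         # rows per physical block

# Rows arrive in batch order; hit rate varies by batch.
batch_id <- rep(seq_len(n_batches), times = batch_size)
rho      <- runif(n_batches, 0, 0.2)
hit      <- rbinom(length(batch_id), 1, rho[batch_id])
measure  <- hit * rlnorm(length(batch_id))   # measure only matters for hits

# Physical blocks are filled purely by load order, so batch boundaries
# rarely line up with block boundaries.
block_id   <- (seq_along(batch_id) - 1) %/% block_size + 1
block_sums <- tapply(measure, block_id, sum)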


When rows are not correlated by physical block, the sample sum and sample variance are independent. However, when the rows are correlated by block, the sum and variance also become correlated. This simultaneous understating of both the sum and variance leads to a high number of misses, even though the two statistics are marginally unbiased. The cluster of points in the lower left of this graph results in confidence intervals that miss (I have a graph in R that does a much better job of illustrating this).


When estimating using the sample block variance, the miss rate is correct in the uncorrelated case. As a result of the correlation, this method produces too many misses when applied to small samples of correlated data. This is problematic when the width of the confidence interval is used as a stopping rule for sampling. Arbitrarily setting a lower bound on the number of samples is also unreliable, since the convergence to a proper 5% miss rate depends on the nature of the query and the strength of the correlation. (Uncorr is the data where the rows are randomly scattered in batches. rhoCorr is where the hit rate is correlated by batch. batchCorr is where both the hit rate and the distribution of the measure values are correlated by batch.)


Using the sample distribution of the hit rate to estimate the block variance yields good results in the case where that is the primary source of variance. However, it makes no adjustment for correlation of the measure distribution. When the data is not correlated, the method significantly overstates the variance (resulting in no misses) due to random fluctuations in hit rate being counted as real variation in the parameter.


Subtracting the expected variance due to random fluctuations from the variance of the kernel distribution corrects for the overstatement in the uncorrelated case. This method is very fast and is a good choice when only the hit rate (and not the distribution of the measures) is correlated by batch.


The simplest method that makes no assumptions about the homogeneity of the measures is the partition bootstrap. However, this method suffers from the same problem as the simple sample variance: the samples which understate the sum also tend to understate the block variance. As a result, the confidence intervals, especially early in the sampling, miss too often.
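Roughly, the resampling step looks like this. It's a bare-bones sketch of the idea (resample the partitions seen so far, glue them into synthetic blocks of about the right size, and take the variance of the synthetic block sums), not the actual implementation:

# partition_sums: sum of the measure within each observed partition
# partition_rows: row count of each observed partition
partition_bootstrap_var <- function(partition_sums, partition_rows,
                                    block_size, n_boot = 1000) {
  boot_sums <- replicate(n_boot, {
    rows  <- 0
    total <- 0
    # Stack randomly chosen partitions until we've built a full block.
    while (rows < block_size) {
      i     <- sample(length(partition_sums), 1)
      rows  <- rows + partition_rows[i]
      total <- total + partition_sums[i]
    }
    total
  })
  var(boot_sums)   # empirical variance of the synthetic block sums
}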

When we figure out what we're going to do with BMH, that will go here. I actually expect BMH to work really well since it basically does the same thing as the full kernel, just using a Metropolis-Hastings distribution rather than a kernel distribution as the source. The argument for using the kernel will be one of efficiency, since BMH is pretty slow. Finally, the winner is...


The full kernel produces confidence intervals that include the true value roughly 95% of the time at all sample sizes. There is some understatement of variance at very low sample sizes, so it is recommended that some minimum number of samples be set (or, alternatively, some minimum number of samples which produce hits) prior to applying a stopping rule.
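In code, that guard is trivial. A sketch (the minimum counts and width threshold here are made up, not recommendations from the paper):

# Returns TRUE if sampling should continue.
# est_sum is the running estimate of the total; est_var is the variance
# estimate for that total. min_blocks and min_hits are made-up defaults.
keep_sampling <- function(n_blocks, n_hits, est_sum, est_var,
                          rel_width = 0.05, min_blocks = 30, min_hits = 100) {
  if (n_blocks < min_blocks || n_hits < min_hits) return(TRUE)
  half_width <- 1.96 * sqrt(est_var)       # 95% CI half-width
  half_width > rel_width * abs(est_sum)    # keep going until the CI is tight
}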

Wednesday, May 16, 2018

Storyboard

My adviser wants me to put more text around the figures in the paper. Fair enough: if it's worth a graph, it's worth a concise caption. But then it occurred to me that I might be able to do better than that.

Many of the people I've discussed this problem with are really struggling with what, exactly, the problem is. These aren't poorly informed folks. It's just that multidimensional queries are typically done on fairly small (as in, less than a terabyte) data sets, so very little effort has been put into looking at their asymptotic performance other than to note that it's not uncommon for these things to literally run for days.

So, embracing the idea that a picture is worth a thousand words, I've decided to group all my figures together with sufficient captions that someone with a basic knowledge of data organization could follow the main points of the paper just by looking through the figures. Of course, that would be a very superficial review, but sometimes that's what a person wants. It will basically serve as an illustrated abstract. For those who want the thousands of words, they are, of course, provided.

I've been working on this for the last couple of days and I like where it's going. I should have something tomorrow.

Tuesday, May 15, 2018

Intentionally working past midnight

I don't know why that strikes me as odd. I did it all the time when I was in grad school 30 years ago. Anyway, I got a lot done, but I do need to get some sleep, so the post will have to wait until tomorrow.

Monday, May 14, 2018

Intentional

So, if you've been following the little graphs in the lower right corner of this blog, you can see that the last few weeks have been a bit of a train wreck for getting research done but not so bad as far as getting in physical training. The easy explanation for this would be that I like the physical training more, but that's not really true. As we're really getting close to wrapping this paper, I'm pretty excited about putting effort into it. I'd call my motivation pretty high.

Instead, the problem is one of planning. I've been getting in physical training because, frankly, it's planned. I have to decide at the beginning of the week which days I'll run or ride to work because I need to bring the right clothing in. Once that plan is set, I don't have too much choice but to execute it since I obviously have to get to work somehow.

With the research, it's been more about just fitting it in. And, even if something is important, it can easily get bumped by something that's urgent unless specific plans have been made to protect that time.

So, rather than simply logging my time on research, I'm going to start doing a better job of planning it. After meeting with my adviser, I generally have a pretty clear view of what needs to be done in the next week and how long it will take to do it. I'll follow each meeting by laying out the specific blocks of time when that work will get done. Kate and Yaya are pretty good about respecting my schedule if I make them aware of it in advance. It's that lack of intentional planning that has been causing me to put down the research to take care of other stuff.

Friday, May 11, 2018

I meant for that to happen

Just a quick post tonight because it's almost midnight and I'm still at work.

I have, however, spent the last six hours working on my research. Yay for that. Even with the modifications, the bootstrap algorithm is still off by quite a bit. But, that's a good thing. If it worked perfectly, there wouldn't be much point in deriving the exact distribution of the block variance. As you can see, in the uncorrelated case, it works great (to be expected - bootstraps like truly random data). In the other two cases, the variance is getting badly understated and there are too many misses.


So, that's pretty much the best possible result there. It works, just not very well. Makes for a good baseline.

Thursday, May 10, 2018

He who pays the piper...

... calls the tune.

I hope my adviser will accept that explanation for why I have pretty much nothing to show for this week's work. My day job has been just a bit chaotic this week. Not really a surprise this time of year, but something I need to keep in check.

Getting over this hump is pretty crucial. If I can get the paper to a submittable state this month, using the summer for the literature review makes a lot of sense. Carving out 1 or 2 hours a day to read other people's work is no problem. But, I won't likely have the big blocks of time to work on my own stuff until after we go live with our big update in September.

Wednesday, May 9, 2018

The other shoe drops

It's not like I didn't know it was coming.

I had really hoped that we'd get the paper done before work heated up for the annual push to our big release in August. No such luck. It's come fast and hard this week and I've made very little progress on the paper. I think I'll get a good chunk of time tomorrow so I'll have something for my adviser on Friday, but it's looking like I'm just going to have to do something nutty like work a combined 80-hour week to get this wrapped up.

Friday, May 4, 2018

What's stopping you?

I had a good meeting with my adviser today. We discussed the current work only briefly, then looked at longer term (as in, post-PhD) opportunities. While he concedes that the market for full-time faculty isn't what it once was, he had some other options that I hadn't considered. The details aren't particularly important at this point. The main question to ask is whether this all makes sense.

Specifically, am I OK with a big pay cut?

I suddenly find myself on the opposite side of an argument I've had with a lot of aspiring athletes. It's also a conversation I'll need to have with Yaya if music is really the vocation to be pursued. Basically, my question has always been, "What's stopping you?" Followed by, "Really, that's actually stopping you?"

Most people don't appreciate the difference between a criterion and a preference. A criterion is absolute. If it isn't met, the discussion is over. There are very few legit criteria in the world.

Mostly, what "stops" us from doing things is preferences. Sure, I'd like to be a pro bike racer, but I also want to finish grad school. Sure, I'd like to be a pro bike racer, but I don't have a contract and to race as an independent would mean living out of my car while working a side job. Sure, I'd like to be a pro bike racer, but I'm married to somewhat who doesn't really want that existence.

Those were all preferences that I faced in 1988. They were strong preferences. There were consequences to not honoring them. But, they were preferences. I decided to see what would happen if I set them aside.

Well, I can't point to any huge success story but I can also say that it was a life without excuse. It basically cost me everything. I didn't get my PhD. I had a whopping $900 to my name when I retired. My wife left me just when I was actually getting good. But, I did do it. Those things weren't stopping me. They were just really difficult preferences.

Would I make the same choices now? Of course not. But, that's not the point. The point is that every decision is just that, a decision. If the option isn't feasible, there's no choice. It's only a decision because you are weighing the cost. And, the cost can be rather high.

Which brings us to life after grad school (let's assume for the moment that I do get my PhD this go round). There are three reasonably plausible paths. I could keep working as a consultant. That would certainly be the path of least resistance. I don't know that a PhD would improve my marketability much; I've been doing this consulting thing for a while now and generally get my jobs by referral, so credentials don't count for much. It certainly wouldn't hurt.

I could find a full-time teaching position. As I've noted before, that's becoming a heavy lift due to the increasing reliance on adjuncts. However, a friend of mine who's about my age just took a full-time job at a community college and is loving it. So, there are opportunities; just maybe not the ones I had in mind.

Another option, which was the subject of our meeting today, is a sort of hybrid position which is partially funded by an independent research group and partially funded by a university. I have long known about such positions; it's basically the deal my dad got with Cornell when they "bought" the independent research institute where he was a Principal Investigator. But, I hadn't really considered it as an option for me.

The upsides are obvious. It's full-time work and it constitutes real involvement in the faculty.

The downsides are also obvious. The pay is less than great (though no worse than full-time teaching) and your job is only as secure as your last grant.

I'm debt free, have a marginally OK retirement account and, aside from my predilection for good wine, don't have much in the way of big expenses. I do have a kid going to college in four years, but that's more Yaya's problem than mine. I'll help as best I can, of course, but at the end of the day I don't have a big problem telling Yaya the same thing my parents told me when noting that private school tuition is a pretty big chunk (like half) of an academic's salary: if you want to go to a private engineering school, you're going to have to shoulder the costs. I did and I did. I think I took college a lot more seriously as a result.

So, nothing's really stopping me. Does that mean the preferences are not important? Of course not. But, it's a decision, nothing more.


Thursday, May 3, 2018

The real problem with correlated variances

I've suspected this all along and have had it in the paper for some time, but only recently bothered to prove it. The funky shape of the sample variance distribution isn't the real reason so many confidence intervals miss the true value. It's the correlation between the sample sum and the sample variance.

As you can see in the plots below, the uncorrelated sum and variance are basically independent. When the data is correlated, there's a bigger chance of getting "zero" blocks, which understates both the sum and the variance. This combination is, simply put, bad.

While there's a lot more variation in the correlated case, the average values for both the sample sum and sample variance are fine. But, since the joint distribution pushes both away from the mean together, it results in small confidence intervals when the error is actually the greatest.

And, the problem never really goes away. Even at 250 of the 1000 blocks sampled, the correlation is obvious from the plots:
It's tight enough that the rejection rate is about right, but there's no denying that these statistics are correlated.
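If you want to see the effect without any of the paper's machinery, a toy simulation makes it obvious. All the sizes and rates below are arbitrary, and the measure is just 1 for every hit:

set.seed(42)
n_blocks   <- 1000
block_size <- 500
rho        <- runif(n_blocks, 0, 0.1)            # hit rate varies by block
block_sums <- rbinom(n_blocks, block_size, rho)  # block sum = hit count here

# Draw many samples of 25 blocks and record the sample sum and variance.
trial <- function(k) {
  s <- sample(block_sums, k)
  c(sum = sum(s), var = var(s))
}
draws <- t(replicate(2000, trial(25)))
cor(draws[, "sum"], draws[, "var"])   # strongly positive when rho varies by block

Swap 25 for 250 and you get the weaker, but still visible, version from the last plot.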

Wednesday, May 2, 2018

Well, it looks nice

I'm not suggesting that ggplot2 is a crappy piece of software. Quite the opposite. From working with it over the last couple of days, it seems that it is rightfully the "standard" (as much as there can ever be one in the open source world) for creating graphs in R. What I am suggesting is that, damn, what a giant PITA just to produce some graphs.

Purists will be all over me saying that there's plenty of documentation out there and that everything works just as it should. Furthermore, the adherence to The Grammar of Graphics gives it some academic street cred. They are correct on both counts.

However...

I can create pretty much the same graphs in Excel in about 1/100 the time. If I spent the next few years doing intensive graphics work in ggplot2, I might get to where I could produce graphs in only twice the time it takes to do it in Excel.

Nor am I suggesting that Excel is the world's greatest graphing tool (though, it's pretty darn good). What I am saying is that this is definitely a case where adhering to academic purity is making things take waaaaaaaaaay longer than they need to. Be that as it may, the graphs have been ported to ggplot2 on R.

Yay for that.

Tuesday, May 1, 2018

Targeted solution

So, if you've got data where the query relevance is highly correlated by partition, but it's otherwise uncorrelated, I've got the sampling algorithm for you. The "rho"-kernel algorithm works great in that case.

And not anytime else.


This is the number of times out of 1000 tries that the sample confidence interval didn't cover the real value. Alpha risk was set at 5%, so it should be around 50 misses per 1000.

I expected the full correlated set to generate a lot more misses. The rho-kernel algorithm makes no attempt to adjust for that type of correlation, so it would be alarming if the results were good for that case. However, the fact that the uncorrelated data never produced a miss was a bit of a shock. Why does a correlation coefficient of zero break the algorithm? I had to think on that one for a while.

What's happening is that the kernel will never assign zero variance to rho. But, if the variance really is zero, the only variability you observe in the block hit rates is random sampling noise. That noise is already baked into the variance computation, but it also shows up as the variance of rho. So, it basically gets counted twice.
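A quick way to convince yourself of the double counting (the numbers here are arbitrary): hold rho constant and the observed spread in block hit rates is almost exactly the binomial noise term, which the kernel then treats as real variation in rho.

set.seed(7)
m       <- 500                        # rows per block (arbitrary)
rho     <- 0.05                       # truly constant hit rate
rho_hat <- rbinom(10000, m, rho) / m  # observed hit rate in 10,000 blocks
var(rho_hat)                          # observed variance of the block hit rates
rho * (1 - rho) / m                   # expected binomial noise -- essentially equal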

Unfortunately, there's no good way to know from the data sampled whether you're in a correlated or uncorrelated situation. You could put some heuristics around it, like, if the differences between blocks were above some threshold, but it would have no mathematical basis. I'm trying very hard to avoid heuristics in this paper.

So, I just have to conclude that the rho-sampler sucks. That's not really a bad result. Eliminating inferior solutions is part of finding the good ones. I do have the same problem with the full kernel approach. In that case, though, the effect is smaller and it's a bit easier to adjust for it. More on that to come.