Sunday, April 29, 2018

Heart stopper #2.

Not quite as bad as a couple weeks ago when I thought my entire thesis had been scooped, but still enough to cause some anxiety.

I've been reworking my simulated data so that I can apply all the methods to the same data set (yes, it would have made more sense to set that up from the get-go, but that's a different conversation). Anyway, as I ran the most basic sampler (just use the sample block variance) on the new set, I noted with dismay that it was returning very good results even on the correlated data. I began to worry that all this work had been for nothing and that I had just been working with an anomalous data set.

Turns out, it was the new set that was anomalous. Not terribly so, just that the correlation between hit rate and the mean observation given a hit ended up yielding roughly the same expected value for all the partitions. Obviously, that's not the situation we're trying to deal with here. We state up front that if the partitions all return roughly the same mean, then any sampling method will do.

It was an easy fix to generate data where that wasn't the case. We carry on.
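
For my own future reference, the fix boils down to making sure the product of hit rate and mean-given-hit actually varies across partitions. A minimal sketch of the idea (the names and parameter values here are illustrative, not my actual simulation settings):

```python
import numpy as np

rng = np.random.default_rng(42)

n_partitions = 100
rows_per_partition = 10_000

# Tie hit rate and mean-given-hit together (the correlated case), but make
# sure their product (each partition's expected value) still varies.
hit_rate = rng.uniform(0.05, 0.5, n_partitions)
mean_given_hit = 2.0 + 5.0 * hit_rate + rng.normal(0.0, 0.5, n_partitions)

partitions = []
for m in range(n_partitions):
    hits = rng.random(rows_per_partition) < hit_rate[m]
    obs = np.where(hits, rng.normal(mean_given_hit[m], 1.0, rows_per_partition), 0.0)
    partitions.append(obs)

# Sanity check: the expected value per partition should NOT be roughly constant.
expected = hit_rate * mean_given_hit
print(expected.min(), expected.max())
```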

Friday, April 27, 2018

Running on credit

It's been a long time since I've made the tragic mistake of booking a vacation on a credit card. Still, today was looking like the very best day of the year, so I decided to bolt out of work and take my "sabbath" from 2PM today until 2PM tomorrow. At that point, the bill will come due and I'll need to work my ass off the rest of the weekend but I can't say I have any regrets. It was really nice out.

Thursday, April 26, 2018

Doing the comparos justice

With my new outline, I'm having to change my treatment of BLB and BMH a bit. I had introduced them only in the context of the uncorrelated case. They do great there. But, then again, so does simple sampling, so why go to the trouble? Given that they fail miserably when the data is correlated, one could question why I even bring them up.

Well, they don't fail quite as miserably if you put just a little work into it.

First, there's the fact that BMH specifically says you should resample blocks with each iteration rather than using the same block over and over again. If you actually sample random blocks from the entire population, this will work great, but that's because you will have sampled the entire population by the time you're done. That rather defeats the purpose of sampling.

But, suppose we just sampled from the blocks that we had already selected. So, say we've sampled ten blocks and we want to check three blocks at each step of the chain. No problem, just pick any three from the ten you've got. Since we've already computed summary statistics, this doesn't involve any extra computation; it's just a matter of picking the right value out of an array at each step. This will spread the distribution to at least cover all of the blocks sampled.
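
Here's roughly what I have in mind, assuming the per-block summary statistics are already sitting in an array (the names and numbers are stand-ins, not my actual harness, and this is just the resampling piece, not the full BMH machinery):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the summary statistics of the ten blocks we already sampled;
# in the real harness these come from the data, not a random draw.
block_sums = rng.normal(1_000.0, 50.0, size=10)

def resample_step(block_sums, k=3, rng=rng):
    """One step of the chain: pick k of the already-sampled blocks.

    No new data is read; we just index into the precomputed summaries.
    """
    idx = rng.choice(len(block_sums), size=k, replace=False)
    return block_sums[idx].mean()

# Repeating the step spreads the draws over all ten sampled blocks without
# ever touching an unsampled block.
draws = np.array([resample_step(block_sums) for _ in range(5_000)])
print(draws.mean(), draws.var())
```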

Now, let's look at BLB. This one is a bit trickier because the method explicitly relies on each "bag" being a truly independent random sample, which a block is definitely not. But, the partitions within our sampled blocks are, more or less, independent random samples. So, if instead of looking at individual observations, we select a group of partitions and then construct the bootstrap from them, we should get similar variation in the total sum without having to randomly sample at the row level.
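
And a similar sketch for the BLB idea, treating the partition-level sums harvested from the sampled blocks as the units we bootstrap over (again, the names and sizes are made up, and I'm glossing over the exact scaling the real BLB uses):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in: partition-level sums collected from the blocks we sampled.
partition_sums = rng.gamma(shape=2.0, scale=500.0, size=400)
n = len(partition_sums)

def one_bag(partition_sums, b=80, n_boot=200, rng=rng):
    """One 'bag': b partitions drawn without replacement, then bootstrap
    resamples of the full size n drawn from the bag with replacement."""
    bag = rng.choice(partition_sums, size=b, replace=False)
    totals = [rng.choice(bag, size=n, replace=True).sum() for _ in range(n_boot)]
    return np.var(totals)

# Average the per-bag variance estimates of the total, BLB style.
print(np.mean([one_bag(partition_sums) for _ in range(10)]))
```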

I'll need to test out both these ideas, of course. Fortunately, the sampler harness is pretty generic so it's probably only 1-2 days' work, plus another day writing up the results (unless they come out wacky, in which case it could be several days trying to figure that out). Either way, I think it will be a presentation where I'm less likely to be accused of portraying competing methods in their worst possible light.

Tuesday, April 24, 2018

Big finish

In most things, it's pretty easy to put out a big effort right at the end. In running, many people sprint to the finish line once they can see it. But turning up the intensity with a mile and a half to go pulls back a lot more time and places than that final sprint ever will.

I feel like that's where I am with this paper right now. There's still 1-2 weeks of solid work and probably another week of clean up and submission. It's too far out for doing something crazy like pulling 2 or 3 all-nighters; I'd just be too tired to edit carefully and would make a bunch of mistakes. But, it is time to really push. I think if I get 20-25 hours in each of the next two weeks, I'll be in really good shape on this one.

My reward: that's exactly when I can return to running, as my broken foot should be healed on May 6.

Monday, April 23, 2018

Still stuff to do

Definitely getting warm on this last revision (I'll find out if my adviser agrees tomorrow). The flow of the paper makes much more sense now, as does the inclusion of the BLB and BMH algorithms. There are still a few things left to do:

  • Display consistent results for each of the methods. Right now, the graphs are a little scattered. They all tell good stories, but I need to either come up with a consistent graph that allows comparisons across methods, or come up with some other type of table that puts them on equal footing.
  • There are a few sections of text that my adviser thinks are less than clear and I haven't been able to come up with significant improvements. We may just have to work them together word by word. Fortunately, it's a short list.
  • The conclusion is, well, missing. I need to write up a conclusion of some sort.
  • Run all the samplers on the empirical data set. This is the biggest lift, but at least I have that data set on the HDFS cluster at work now. It's 17 billion fact rows and it's no problem to construct queries against it that take several minutes to complete (they used to take hours before we rehosted to the cluster this year). So, it's a big enough data set to make the case that sampling makes sense. The problem is that I haven't properly parallelized the samplers, so they will probably take just as long as the full query engine (which we have spent the last year making very fast). I guess I don't need to worry about that since we're just demonstrating that things work and not making a bunch of performance claims.
That's probably a solid week's worth of work and I'm sure there's another round of edits, but this is looking like something that really will happen quite soon.

Sunday, April 22, 2018

All that Jazz

This weekend was the UMSL Jazz Festival. As with last year, Yaya's school performed during the day. Then, I took her to both evening performances. Really good stuff all around.

Except, of course, for the fact that the time doesn't come from nowhere. I was so tired from staying up late the last two nights that I didn't get much done today.

But, back to the jazz. Yaya is very much into it. So much so that I was happy to hear Gordon Goodwin speak plainly to the crowd (which, this being a scholastic event, had a lot of middle and high school jazz players in attendance) about the realities of professional jazz. He didn't come across as remotely jaded or bitter; he just acknowledged that if jazz is what you want to do, you might need to find some other way to pay the bills. As he's pretty much top of the heap in the Big Band world, his commentary carries a lot of weight.

Yaya needs to hear stuff like that. Frankly, so do I.

I entered this PhD program not even considering that getting a faculty job might be a problem. After all, I had multiple offers coming out of Cornell, and that was with just a master's. But, things have changed. Schools aren't hiring full-time faculty anymore. At least, not as a general practice. They are relying on adjuncts to carry the load. I'm not really interested in being an adjunct. I've done it and I didn't find I got the same connection with the students that I did when I was full-time faculty.

So, will I ever teach again? I can't honestly say. I still think it's possible, but I don't know that I'd call it likely. I don't regret going back to school and a PhD in Computational Mathematics is a useful thing to have as a consultant. But, it's certainly not the path I thought I was on.

Friday, April 20, 2018

Reorganizing again

The outline of the paper is changing again, but not the content. We're just going to present the topics in a slightly more cohesive manner given that the true subject of the paper has become the block variance, not the sum itself.

  1. Introduction
  2. Sampling framework and notation
  3. Exact distribution of sample sum
    1. Distribution as a function of the block variance
    2. Block variance in the uniform case (my adviser would like to use a different word than "uniform" since it sort of implies a Uniform distribution - I'll think of something).
    3. Block variance in the correlated relevance case
    4. Block variance in the general correlated case
  4. Methods for estimating the block sum
    1. Simple sample variance
    2. Bag of Little Bootstraps
    3. Bootstrap Metropolis Hastings
    4. Kernel estimation of relevance
    5. General kernel estimation
    6. Results and discussion
  5. Implementation details
  6. Future research

Wednesday, April 18, 2018

Implementation details

I doubt anyone will care, as this paper doesn't really espouse a working algorithm as much as a general approach to multidimensional sampling. That said, it is a working algorithm, so I added an implementation details section where I give the computational complexity and offer a framework for parallelization. It's only one more page and it at least tells the reader that we're not completely lost in theory land here.

Meeting with my adviser tomorrow to get some more detailed feedback on Monday's draft. Finishing this weekend is probably optimistic, but we are definitely getting close.

Monday, April 16, 2018

Sometimes it snows in April

Well, Prince being from Minneapolis, one can see how he might think that. But, this is St. Louis. It's snowing. In mid-April. What the heck?

On a related note, boy am I glad I ran Boston last year when it was just kind of warm rather than the sleet and wind they got this year. Slowest winning time in at least 40 years, and the folks I know who are around my speed are having trouble breaking 3:30. Sounds miserable.

Sunday, April 15, 2018

Backwards

As I've admitted before, I'm kind of doing this research thing backwards. I've been developing my own ideas without paying a whole lot of attention to what everybody else has done. My operating premise has been that, if somebody had already cracked this nut, we'd be doing it at work. We have a pretty crack team and so do many of our competitors. There are literally trillions of dollars at stake. And, it's not something that would be kept a secret. Unlike regular insurance companies, which might view such knowledge as a trade secret, reinsurance companies attract clients (regular insurance companies) by showing off how smart we are. People are encouraged to publish.

So, the fact that neither we nor our major competitors have figured out a good way to sample our data was enough for me to consider it an open problem.

Imagine my dismay when I came across a paper today that appeared to have addressed it back in 2008. It's one of those moments when you first aren't sure you read that right and read it again. Then you read it slowly while feeling that urge to be sick build in your gut. The fact that it came from a team at the University of Illinois was less than comforting; very little that comes out of that engineering school is bogus. Then there's the frantic search to find an online copy of the paper (I'm sure UMSL's library has the journal, but I didn't want to spend an hour going over there this evening). I finally found a copy and was relieved to see that their premise is fundamentally different from mine.

Basically, they set about showing how to construct confidence intervals when your OLAP cube is populated with a sample of the full population. That's a valid thing to think about and you could use it to solve our problem if you were OK with truly random sampling. But, we're not. Random sampling is hugely inefficient. Block sampling is the underlying approach of my paper and that wasn't even mentioned in the U of I work. So, we're all good for now, but it sure did have me in cold sweats for a bit.

I really need to take my literature review seriously this summer.

Saturday, April 14, 2018

1

Just a quick proud papa post today. Yaya got a "1" (which is the high score) at the St. Louis Music Educators Exhibition. I'm not a particularly good judge of what makes a trumpet piece hard since I've never been any good at valved instruments, but it seemed like the solo Yaya picked was pretty stout for an 8th grader. The judges seemed to agree.

Thursday, April 12, 2018

On the margins

Quick, spot the miss in the following derivation:

[derivation image]

It's not super obvious. And, truthfully, it doesn't actually make any difference worth noting. However, a PhD dissertation is your chance to get things really right, so being picky is in order.

The problem with this derivation is that we take the probability that a row in a sampled block comes from partition m to be the ratio of the rows in partition m to the total rows. That's not a bad approximation, but the rest of the formula is really specific, so approximating this term isn't appropriate.

It should be the conditional distribution P(m|k) for the sampled block k, rather than just P(m). For a given value of m, that's quite different. In total it washes out a bit, but not completely.
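
In symbols (the row-count notation here is just shorthand for this post, not the paper's):

$$
P(m) \approx \frac{N_m}{N}
\qquad \text{vs.} \qquad
P(m \mid k) = \frac{N_{k,m}}{N_k},
\qquad \text{with} \qquad
P(m) = \sum_k P(k)\,P(m \mid k),
$$

where N_m is the rows in partition m, N_k the rows in block k, and N_{k,m} the rows of partition m that land in block k. Averaged over blocks the two agree, which is why the difference mostly washes out in the total.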

This is the sort of semantic stuff that I'm not super good at. It's not that I don't think it's important; it's just that I tend to miss these things. I hope my adviser is good at catching them.

Anyway, I've updated the sampler to reflect the correct term.

Monday, April 9, 2018

A lotta work for nothing

Alert readers will recall that a week ago I spent the better part of two days sorting through derivations to find a term that almost always evaluated to zero unless you blended the estimators - then it was not anything like zero.

Well, it's back, but in reverse. Now, since I'm not blending estimators in my most general case, I do want the term. It still comes out to pretty much nothing, but I don't want reviewers pinging me for dropping terms just because I don't think they matter. I still do drop terms, but I'm being very careful about making sure they really are insignificant. I've wasted too much time chasing down bad assumptions.

FWIW, here's the term:

[equation image]

I do drop a couple of terms in the third step because the hit rate and mean observation aren't supposed to change from one partition to the next, so their variance should be very close to zero, especially with large row counts. That's an assumption and I could get called on it, but since I call it out myself in the paper, I'm OK with it.
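
Spelling that assumption out (shorthand symbols just for this post): if the per-partition hit rate p_m and mean-given-hit mu_m are essentially the same for every partition, then

$$
p_m \approx p \ \text{and}\ \mu_m \approx \mu \ \text{for all } m
\quad \Longrightarrow \quad
\operatorname{Var}_m(p_m) \approx 0
\quad \text{and} \quad
\operatorname{Var}_m(\mu_m) \approx 0,
$$

so any term in the expansion that carries one of those variances as a factor contributes essentially nothing.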

Saturday, April 7, 2018

My point is...

This is not the progress my adviser was likely hoping for today, as I said I'd have actual results on Monday. But, while the morning has not produced any such results (or even done much to forward that effort), it has provided a significant revelation that should help in wrapping all this up.

People have been asking me what my paper is about. I've been having a lot of trouble answering that because the topic has shifted so many times. Frankly, when I read the first part on the uniform sampling and then look at the second part with the generalizations, it looks a bit scattered. This was brought to my mind when my adviser suggested we just use the empirical distribution of the partition parameters to construct the sum directly and then get the variance of that rather than using the block sums. For reasons I couldn't put my finger on, that seemed horribly inconsistent with the rest of the paper. Not inconsistent in the sense that it contradicted anything; just that it didn't fit.

This morning, while making a few false starts on implementing the general block variance computation, I thought more about why I wasn't keen on using the total sum. There are some technical ramifications that I'd have to work through, but they aren't a big deal. I then realized why it doesn't fit.

This paper really isn't about the sum. It's about the block variance. If we know the block variance, we can always get the variance for an estimator for the population. More importantly, in the finite sampling case, this is true for any estimator. The sum is just a convenient choice to demonstrate the technique.
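
To make that concrete for the sum (this is just the standard finite-population result, in notation I'm making up for this post): if the population is split into N blocks with block sums B_1, ..., B_N and we draw n of them without replacement, then scaling the sampled total up gives

$$
\hat{T} = \frac{N}{n} \sum_{i \in S} B_i,
\qquad
\operatorname{Var}\bigl(\hat{T}\bigr) = N^2 \, \frac{1 - n/N}{n} \, S_B^2,
\qquad
S_B^2 = \frac{1}{N-1} \sum_{i=1}^{N} \bigl(B_i - \bar{B}\bigr)^2,
$$

so knowing the block variance S_B^2 is all it takes to get the variance of the estimator.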

Viewed through this lens, the title becomes a bit more obvious (something like: Estimating Query Values from Finite Multi-dimensional Populations using Blocked Sampling). The comparison with the BMH and BLB algorithms to estimate the block variance is suddenly a sane thing to do. The introduction explaining why one would even use blocks rather than true iid sampling fits. And, though I haven't written it yet, the conclusion can talk about future research where we don't just consider sums, but generic population parameters estimated by repeatedly sampling blocks.

Most importantly, this points the way for proceeding to the real problem I want to solve: self-tuning multi-dimensional data stores. The objective is to restructure the data so that the block variances are minimized. A secondary objective would be to make them easy to compute. I've already got quite a bit of work done on both of those.


Friday, April 6, 2018

Back to earth

It's been a crazy few days at work (actually, normal by the crazy standards of my group this time of year, but crazy by any normal standard). Anyway, it's prevented me from acting on the previous post.

And, that's probably a good thing.

I had another meeting with my adviser today and we both agreed that the whole r×4-dimensional space was just a bit nutty. So, we're going to frame this in less grandiose terms. We'll still create a 4-dimensional kernel distribution, but we won't blow it out for every partition. Instead, we'll just sample from that and estimate the block variance. My adviser would like to go even beyond that and just directly sample the total sum variance, but I'm still lobbying for the block variance. I'm pretty sure I can show that it's equivalent, both theoretically and in terms of computational effort. Sticking with block variance gives us a consistent measure of spread throughout the paper.
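
Something like this is what I have in mind for the kernel piece, using scipy's gaussian_kde (everything here is a stand-in: the parameter layout, the block construction, the sizes; it's just to show the shape of the computation):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(7)

# Stand-in for the observed per-partition parameter vectors
# (say: hit rate, mean given a hit, row count, spread), shape (4, n_partitions).
observed_params = rng.random((4, 250))

kde = gaussian_kde(observed_params)

def block_sum(params):
    """Toy block sum from a (4, partitions_per_block) array of sampled params."""
    hit_rate, mean_hit, rows, _ = params
    return (hit_rate * mean_hit * rows * 1_000).sum()

# Draw synthetic partitions from the kernel distribution, assemble them into
# blocks, and look at the spread of the resulting block sums.
n_blocks, partitions_per_block = 500, 20
block_sums = np.array([
    block_sum(kde.resample(partitions_per_block)) for _ in range(n_blocks)
])
print(block_sums.var(ddof=1))  # the estimated block variance
```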

I've got until Monday to make my case. It's time to wrap this thing up.

Tuesday, April 3, 2018

New method

Oh, brother, here I go again reframing things just when the finish line is in sight.

At my weekly meeting with my adviser, he was understandably skeptical of my plan to use an MCMC chain to generate a distribution across all the vectors for the partition parameters. Given 1000 partitions (which actually isn't very many), it is a 3000-dimensional space. I'll concede that's a heavy lift, though my thought was that even a very sparse approximation would still converge. I, of course, have nothing but gut feel to back that up.

Then, as I was staring at the blackboard, it hit me. There's no reason all these things need to be estimated at once. We've already said that in the general case we are assuming no direct correlation between partitions. Such correlations may exist, but we don't assume as much.

So, that means that we could partition the partition space and estimate the groups of partitions separately. Then, we just need a way to glue all those back together into one empirical distribution of the block variance. My adviser liked that idea and I left the meeting thinking that should be easy enough to do.

Then the revelation hit me: it's already been done! That's essentially what the Bag of Little Bootstraps algorithm does. All the theory has already been worked out. Better yet, I already have a section in my paper adapting the theory to my use case. It's not a slam dunk because I was showing how it operated in the uniform case and explicitly called out how it fails miserably in the correlated case. But, that's because we were measuring the wrong thing. Applied to the partition statistics rather than the individual rows, this should work.

It does mean I have some more proving to do. Simply showing that it works on simulated data sets won't cut it. But, the proof of correctness in the original paper is fairly straightforward. I should be able to follow the same path to show it works in this case.

The downside, of course, is that I just set myself back another week. This is the first time I've felt like it will be worth it.

Monday, April 2, 2018

Theory and practice

They agree! Finished up the empirical results for the correlated hit-rate sampler last night. (My observation of a day off each week is the traditional sundown-to-sundown, so it's OK to work Sunday night, even Easter Sunday.)

[figure: rejection rate vs. sample size for the correlated hit-rate sampler]

As you can see, even for small sample sizes, the rejection rate is very close to the desired 5%. On to the full general case.
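
For anyone wondering what I mean by rejection rate: run the whole sampling exercise many times, build a 95% interval from each run's variance estimate, and count how often the true sum falls outside. A toy version of the check (not my actual harness):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy population of block sums standing in for the simulated data set.
N = 2_000
blocks = rng.lognormal(mean=5.0, sigma=0.8, size=N)
true_total = blocks.sum()

def rejects(n=100):
    """Sample n blocks without replacement, estimate the total and its
    variance, and report whether a 95% interval misses the true total."""
    sample = rng.choice(blocks, size=n, replace=False)
    est_total = N * sample.mean()
    # Finite-population variance of the estimated total.
    var_total = N**2 * (1 - n / N) * sample.var(ddof=1) / n
    return abs(est_total - true_total) > 1.96 * np.sqrt(var_total)

rate = np.mean([rejects() for _ in range(5_000)])
print(f"rejection rate: {rate:.3f}")  # should land near 0.05
```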

Sunday, April 1, 2018

John smokes Peter in the inaugural Resurrection 5K

I may get some angry emails for making light of the resurrection, but I figure everybody who cares about such things has already had time today for serious reflection on the paschal mysteries. Today's lectionary from John would seem to indicate that running is, indeed, a thing:

So Peter and the other disciple [that's John, who has a strange aversion to first-person narrative] went out and came to the tomb. They both ran, but the other disciple ran faster than Peter and arrived at the tomb first; he bent down and saw the burial cloths there, but did not go in. When Simon Peter arrived after him, he went into the tomb and saw the burial cloths there, and the cloth that had covered his head, not with the burial cloths but rolled up in a separate place. Then the other disciple also went in, the one who had arrived first, and he saw and believed.

Those who study scripture know that the tradition is that when a point is really important, it gets stated three different ways. There's a good reason for that. If you tell something once and it's misheard (remember, we're talking oral tradition here, none of these folks had a bible at home), you've given them bad information. If you tell it twice and they mishear once, you've given them conflicting information. If you tell it three times, you've at least given them a preponderance of statements to sort it out. It's basically the ancient equivalent of a checksum bit.

So, it appears that in writing about the most important event in Christianity, John really wanted us all to know that he dusted Peter running to the tomb. John wrote his gospel sometime between AD 85 and 95, so this is basically a 90-year-old giving a race report from his 20's. As the saying goes, the older I get, the better I was.