Monday, October 22, 2018

Chicago Marathon

Sorry for the upfront spoiler, but I'm assuming that most people reading this already know that I had a good run at Chicago. The problem with good runs is that they make for rather dull race reports. So, since my first marathon was also Chicago 25 years ago, I thought I'd write a report on that train wreck with a few lessons learned applied to this year's effort. So, step into the way-back machine; we're going to party like it's 1993.
Even though my identity in my 20's was completely built around being a cyclist, I never took much joy in the usual cyclist practice of bashing runners. The truth is, I envied them. I had just turned 9 years old when Frank Shorter won the Olympic Marathon and I was instantly enamored with it. I tried so hard to be a runner. I was doing 60-mile weeks at age 12. But, genetics had other plans. By the time I was in High School, it was obvious that I simply was not put together like the really good runners. I could be good, but I'd never be really good and greatness was completely out of the question. I looked for sports more suited to my body type and found cycling. I didn't get great at cycling, but I did get really good. While it wasn't enough to live on, I had a license and a contract that said I was a professional athlete. I thought that was cool. Still, I wished I had been a runner. So, when I retired from cycling, I decided to run a marathon. 
I think I can be excused some hubris. Cycling is an endurance sport that uses predominantly legs, heart, and lungs. In the winter, I'd run to stay in shape and 10-15 mile runs were common. Anytime I'd enter a local race, I'd do pretty well. My 5K PR is a solid 16:20. I've heard about "hitting the wall", but I've also ridden dozens of cycling races that were five hours or longer and know how to manage effort. It seems like it won't take more than a few long runs to put a marathon effort together. My longest run prior to this is 15 miles. I extend that with training runs of 18, 20, and 22 miles at around 8:30 per mile. I'm tired at the end, but not crushed. I decide that three and a half hours (8-minute miles) is a reasonable target. In retrospect, it probably was, but the devil is in the details.
In the intervening 25 years, I've observed a lot of details. I've also gotten pretty good at estimating my finish time. I have three workouts I use for that.

The first is the classic "Yasso 800s" named for Bart Yasso who originally published the relationship in Runner's World. It couldn't be simpler. Run 10x800m as fast as you can given that they all have to be roughly the same time (within a second or two). Your average time in minutes:seconds is the hours:minutes for your marathon. Like many others, I've found this predicts just a bit fast, so I always add five minutes. In August, I knock out ten at 2:54. That points to a 2:59 for this year's effort at Chicago.
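
For anyone who prefers code to prose, the conversion fits in a couple lines of R. This is just my own bookkeeping (the function name and the five-minute fudge parameter are mine, nothing official):

  # Rough sketch of the Yasso rule: reinterpret the average 800m time
  # (minutes:seconds) as a marathon time (hours:minutes), then add my fudge.
  yasso_marathon <- function(avg_800_sec, fudge_min = 5) {
    hrs  <- avg_800_sec %/% 60              # minutes of the 800 become hours
    mins <- avg_800_sec %% 60 + fudge_min   # seconds become minutes, plus fudge
    sprintf("%d:%02d", hrs + mins %/% 60, mins %% 60)
  }
  yasso_marathon(174)                       # a 2:54 average -> "2:59"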

The second is a baseline race: some race that I ran before another marathon where both the race and the marathon were good efforts. I then scale the time difference. I run the Big River 10 in September and finish third in 66:20. That's 20 seconds slower than three years ago when I won it outright and then ran a 2:58 at Milwaukee. Scaling that delta to a three hour effort again points to 2:59.
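
The scaling itself is nothing fancy; here's the back-of-the-envelope version in R (linear scaling of the delta is an approximation, of course):

  # Stretch the 20-second loss over 10 miles out to marathon distance and
  # add it to the old marathon time.
  delta_sec <- 20 * (26.2 / 10)             # about 52 seconds
  old_marathon_min <- 2 * 60 + 58           # the 2:58 at Milwaukee
  pred_min <- old_marathon_min + delta_sec / 60
  sprintf("%d:%02d", pred_min %/% 60, round(pred_min %% 60))   # "2:59"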

The final is "The Duck". It's a 3-mile tempo loop around Mallard Lake that I run pretty regularly so I have a long history of Duck times converted to performances in races. It takes a bit more discipline to get this one right because tempo is not "all out". I run this in late September and get a prediction of 3:01. Close enough; it appears I should be trying for a sub-3, but not much more than that.

These are all part of the larger 18-week training cycle leading up to the marathon, which averages 90 miles a week, 35-40 of those going towards two or three quality workouts and the rest base (which, for me these days, is simply running to and from work).
My fiancée, Kate, drives with me to Chicago. She's supportive, but makes no bones about the fact that she's really just happy to spend a night in a downtown Chicago hotel and go to a fancy restaurant. Check and check, though it taxes every bit of restraint in our waiter to not roll his eyes when we order a bottle of White Zinfandel. We walk back to the hotel in the darkness along the riverfront. There are moments so perfect you replay them in your mind for the rest of your life. This is one of them.
My wife, Kate, drives with me to Chicago. She's supportive, but makes no bones about the fact that she's really just happy to spend a night in a downtown Chicago hotel and go to a fancy restaurant. As I've found these things go better if we save the latter for after the race, I've booked two nights and made reservations Sunday at a place that doesn't have a White Zin on the wine list. We get an early dinner at Gino's East and call it a night.
I grew up just outside New York City and have spent almost all my adult life within a day trip of Chicago. It's interesting to observe how Chicago has both embraced and resented the "Second City" label it picked up when it passed Philadelphia in the 1890 census. The good folks of Cook County don't want to be New Yorkers, but they sure don't mind beating them. In 1993, Chicago is locked in a battle with the New York Road Runners Club to host the world's largest marathon and logistics are lagging the hype. This is before multiple waves, start corrals, or chip timing were a thing. You basically show up in the vicinity of the start line and wait for the gun to go off. As such, lining up next to the sign that has 3:30 on it means starting with a bunch of runners who have little hope of finishing under five hours. It takes eight minutes to get to the start line and another twelve to reach mile one. At 20 minutes per mile, this is going to take a really long time.
To get into the "A" corral at Chicago in 2018 as a guy means running a 2:50 (the standard is a bit more lax for women, presumably so the sub-elites have some chance of seeing each other). My 2:59:40 qualifier puts me in the B corral. It turns out that I could have petitioned to be moved up to the A corral at the number pickup the day before, but I'm not sure how anyone was supposed to know that. Anyway, with fewer than a thousand runners in the A corral, I'm still close enough to the front that I plan on gauging my effort by gun time rather than chip. Unfortunately, start procedures are still one of the few things Chicago doesn't do as well as Boston or New York. When the gun fires, we're held up by tape while the elites and then the A-corral head off. I guess this is to relieve congestion on the course, but it also means that there will be a significant difference between my gun and chip time. After about a minute they start releasing us, but the tape gets all wrapped around everybody. Nobody goes down but, in the confusion, I forget to check the time on the clock as I cross the line. I don't wear a watch, so I'll just have to assume that it was around a minute. There's not really much I can do with that information anyway. I get to the first mile at 8:05 and figure I'm probably just on the high end of the 6:50-7:00 I was targeting. Better slow than fast for mile one. I relax into the pace.
Miles two and three are closer to 10 minutes each. The road is jammed from curb to curb, but I've spent the last 15 years learning how to move up inside a pack of cyclists and find the same techniques work pretty well in running. Shortly after mile three, I find I have enough room to run my pace, but there's still the question of what pace that should be. A 3:30 isn't looking very likely, but I'd still like to be "comfortably" under 4 hours. I settle into a 7:00 pace and hold it for the next 10 miles, hitting the half in around 1:50. That's close enough to my original goal that I decide to get off the gas as my legs are starting to feel the effort. This doesn't alarm me in the least as big pushes early in the race are pretty common in cycling. Usually, it just takes a few minutes of soft pedaling to right matters. I'm about to find out that running doesn't work that way.
As I head towards the northern end of the course, two things concern me. One is that I keep forgetting the seconds from my previous mile split (was that 38 seconds from last mile or two miles ago?). I don't look at the minutes because I'm never off by that much. It seems like I'm holding things in the 6:45-6:50 range, but I worry I might be inadvertently crediting myself for an extra 10-15 seconds due to a missed split. Maybe there is some merit to wearing a watch in a marathon after all. The larger concern is that the pace feels just a bit firm. Not enough that I want to give up on it, but enough that it might get really ugly later in the race. At eight miles, the course turns south and suddenly everything feels easy. Apparently, we've been running into the wind.

I hit the half at 89:35 on the clock. ON THE CLOCK! I'm actually ahead of 3-hour gun pace! I look up the road and, sure enough, there's the 3-hour pace group for the A Corral. If I can stay near them the rest of the way, I'm golden.
Aside from spinning easy for a few miles, one thing I'd be doing to help bounce back from a hard start in a bike race would be grabbing something to eat. Unfortunately, I don't have anything on me and the aid stations only offer drinks. My easier pace brings up mile 15 in just under 2:10, but the legs aren't coming back. I'm going to have to dial it back even more.
I've tucked a couple gels in my waistband and pull one out and eat it. I've already thrown in two 6:40 miles just before the half. I had figured I could insert three such surges over the course of the race without destroying my legs for the final 10K. I put in one more to try to catch the back of the 3-hour group. I get to within 10 seconds at mile 15. I could try one more hard mile to close the gap, but I feel like I'm right on the edge and decide to go back to high 6:40's.
Things are not going well. My legs aren't coming back at all. In fact, they are completely going away. My stride is reduced to a shuffle. By 18, I'm barely holding 11 minutes/mile. At 20, I can see the finish line, but there's still a three mile out and back along the lake to go. The entirety of my clothing is a singlet, shorts, socks, and shoes. That was chilly but fine at the 35-degree start; since then, the temperature has dropped rather than warmed. It's now well below freezing and there's a howling wind off the lake. At least it's not sleeting. Oh, now it is.
The race organization describes the conditions for the 2018 race as "moderate". I guess I'd agree with that assessment. The temperature has been steady in the low 60's all run and there's been complete cloud cover. It was actually foggy prior to the start. That much humidity would normally be a problem, but we got a nice light rain for the second hour which mitigated fluid loss. I don't feel like I'm any lower than usual going into the final hour.

Just before 18, I make my biggest mistake of the race when I take an energy bar (I thought it was a gel) from an aid station. There's no getting solid food down at this point in the race so I should have just tossed it off. Instead, I try to jam it in my waistband. It turns out that getting things into the waistband pocket while running is a bit awkward. By the time I'm done, I notice that the people I was running with are about five seconds ahead of me. That doesn't sound like much, but it means that I'm now 15 seconds behind the three hour pace group. Even a fourth 6:40 surge won't catch them now. That means I'll not have the protection of a group when we turn back into the wind at mile 23.

I usually pace myself so that I expect to lose about 10 seconds per mile over the last 10K; I've found that results in a faster overall time than even splits. I hit 20 miles at 2:16:30 gun time. If I'm right about having a full minute in hand, that makes the sub-3 pretty much a done deal short of an injury or a complete meltdown (both are possible, of course). Sevens from here will make hitting the finish before the gun clock rolls over a squeaker. I decide every mile I can hold 6:50 is one more in the bank if I fade more than usual.
One of my favorite running authors, Alan Lawrence, writes that "There are few experiences worse than a marathon gone bad. The symptoms reported are those of terminal illness. Many real deaths are probably easier." As my pace continues to degrade, I find myself in complete agreement. I am determined not to walk. My stride is barely the length of my foot, but I manage a microscopic hop from one step to the next.
Coming back from the turnaround, Lake Shore Drive rises to become an elevated highway. It's not much of a climb, but it results in a pathetic 17-minute mile. It would be faster to walk, but I want to be able to say I ran the whole way, even if it is a rather generous definition of running. 
I don't hold on to 6:50 for long. I'm just under seven through 23, then give up a few more seconds as the course turns into the wind for the grind north to the finish. Most of the people around me have been dropped from the 3-hour group ahead and are struggling so the drafting opportunities are limited. The sense of urgency is palpable. Breaking three is pretty much the biggest prize out there for non-elite runners and we are right on the edge. In one of the things that makes running wonderfully unique among competitive endeavors, we spur each other on as best we can. The clock is the common enemy and we have no weapon to disrupt its advance, but we are united in our defiance.
By mile 24, the sleet has turned to snow and I can't even see the far end of Navy Pier. I tell myself I simply have to find a way to get the last 2.2 miles done in under half an hour. I can't take much more of this.
Mile 24 comes with 2:44:45 on the clock. I tell myself I simply have to find a way to get the last 2.2 miles done in under a quarter hour. A sub-3 on the gun clock is too good to pass up.
Since I'm on the elevated section of the highway, the finish is literally in sight, even though it's going to take a while to get there. It's enough to return a tiny amount of bounce to my step and I do finish out the race in about half an hour. I cross the line four hours and thirty-four minutes after the gun. I didn't notice the seconds and I never bothered to look them up after results were posted.
I push for a few strides which results in a searing pain as one of my abdominals tears. I've never done that in a run before, but figure it can't be good. Surging to the line is not a happening thing today. I back off to 7:10 pace and the pain subsides. The one hill on the course comes right at 26 miles and that adds a few more seconds to my run in.

I hit the line with 3:00:35 on the clock, which I'm pretty sure gives me a sub-3, but I spend the next few minutes trying to get my phone out of my waistband so I can confirm that from the website. The site is obviously getting hammered right now and it takes a while to tell me that I've finished in 2:59:20, good for sixth place in my age group. It's my first (and likely to be only) top-10 in a World Major.
I shuffle through the finish area in a daze. Maybe someone put a medal around my neck; if they did, it's no longer in my possession. At the far end, I meet Kate who spent the morning perusing shops on Michigan Avenue until the weather got nasty. She says her feet are sore. I have no response that wouldn't result in cancelling the wedding, so I let it go.
The finish area is a happy place filled with runners that broke three, many for the first time. I collect my medal along with a rather generous (by big city marathon standards, anyway) amount of free food. I also get a text from Kate, who has been abandoned several miles away by an Uber driver who couldn't figure out how to get around the road closings. She says her feet are sore. This time, I have to laugh.

Tuesday, August 28, 2018

The variable in question

The blog is back for another school year. I did do a fair bit of research this summer, as well as a bunch of other interesting stuff at work (and took a nice vacation, too). However, I'm going to skip all that for the moment and get right to the question of the day because I'm discussing it with my adviser this evening.

We've been proceeding through all this analysis somewhat blindly assuming that it's a given that we should use the normed sample sum as our estimator. I give a brief (and, as it turns out, less than compelling) proof that this is the Uniform Minimum Variance Unbiased Estimator and then move on to the variance of the estimator, which is the real subject of the paper.

Not so fast on that first point. It's true that the normed sample average would be the UMVUE for the mean of an infinite population, but that doesn't translate as naturally as you might think to the population sum when things are finite. In fact, given what we know about the composition of the blocks, the estimator can actually be improved upon.

First, though, let's pretend we don't know anything about the composition of the blocks other than that they may contain partitions of correlated data. In that case, the sampled pairs of observations, that is, both the dependent and independent variables, are a sufficient statistic. We'll call this D and the realization of a sample will be the vector d=(x,y), the individual rows from all the sampled blocks. One important item to note here is that d is considered unordered. That is, we don't care which blocks got sampled first or where a row is placed within a block.

Now, we pick an arbitrary unbiased estimator. The first row in the sample times the population size will do; we call that S = Ny1. We now use the Rao-Blackwellization process to make a better estimator by conditioning this on the observed sample, that is S' = E(S|D=d). I'll spare the algebra, because it's simple and I don't feel like typesetting it here, but it's pretty clear that, if ordering isn't important, the expected value of the first item in a finite sample, given the sample, is the sample mean. So, S' is the sample mean times the total number of rows in the population, which gets us right back to where we started. In fact, it's not particularly hard to prove that any unbiased estimator conditioned on the sample will give this result. So, our estimator is fine in this case.
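
Since I'm not typesetting the algebra, here's a quick numerical sanity check instead (toy numbers, purely for illustration): averaging the crude estimator over random orderings of a fixed, unordered sample just gives back N times the sample mean.

  # Illustration only: Rao-Blackwellize the crude estimator N*y1 by averaging
  # it over random orderings of one fixed (unordered) sample.
  set.seed(1)
  N <- 1000                       # rows in the finite population
  y <- rgamma(N, shape = 2)       # some population of measures
  samp <- sample(y, 40)           # one unordered sample of 40 rows
  rb <- mean(replicate(5000, N * sample(samp, 1)))  # "first" row of a random ordering
  c(rao_blackwellized = rb, N_times_sample_mean = N * mean(samp))
  # the two numbers agree up to simulation noise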

HOWEVER...

That's not really our case. We have a lot more information available to us. When loading the data, we can pre-compute all sorts of things. We know the true sum for each partition (that is the sum when all rows from that partition meet the query criteria). We know how many rows are in each partition. We know which blocks those rows wound up in. We probably know even more than that if this isn't the first query we've ever received. We can track historical hit rates. Hit rates correlated to specific attributes in the query. Which partitions are particularly susceptible to variation based on certain query attributes. Heck, we might even know the actual answer; query processors frequently cache the last few queries they've returned because people have a tendency to hit "refresh" and there's no reason to go through the trouble of recomputing a result you just produced.

Why is this a problem? Well, it's not if you're happy with the naive estimator, but the point of this exercise is to return the tightest confidence intervals in the least time, so if we can produce a better estimator using it, we should.

Here's one quick example: Suppose we know that the data has only two partitions and one of those partitions is in a single block. We could randomly sample blocks, but we might wind up with a sample that excludes the single-block partition. That introduces extra variance to our estimator. On the other hand, we could always sample the single partition block, and then proceed with the rest of the sampling as in the uniformly-distributed case (which yields much tighter bounds). The resulting estimator will have less variance.
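
To make that concrete, here's a toy simulation (all the numbers are made up; the point is only the variance comparison):

  # Toy version of the two-partition case: one partition lives entirely in
  # block 1, the rest of the data is spread evenly over blocks 2 through 20.
  set.seed(2)
  n_blocks <- 20
  block_sums <- c(500, rnorm(n_blocks - 1, mean = 10, sd = 2))
  est_srs <- replicate(10000,               # plain random sample of 5 blocks
    n_blocks * mean(sample(block_sums, 5)))
  est_forced <- replicate(10000, {          # always take block 1, sample 4 more
    block_sums[1] + (n_blocks - 1) * mean(sample(block_sums[-1], 4))
  })
  c(var_srs = var(est_srs), var_forced = var(est_forced))
  # both estimators are unbiased; the forced-inclusion one has far less variance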

Yes, that's contrived, but the point is that the extra information can help. A lot. I'd like to leave that out of this paper because the point of this one was simply to demonstrate that we can create reliable confidence intervals for highly-correlated data. I think we've proven that both mathematically and empirically.

However, this plays really well into the research path I had set out for the rest of the dissertation. I now have a template for showing that we can not only assess the quality of our estimator, we can actually improve on it by leveraging the structure of the data. More importantly, we can change the structure of the data to make that lever even longer. As I argued when creating DBESt (which, by the way, is still a thing and I intend to use it to restructure the data), finding the optimal structure is an NP-complete problem, so there's no point in trying to solve it exactly. But, it can be approximated using either the Evolutionary Strategy of DBESt, or layering some sort of Dirichlet analysis on top of that to try to get rows from like distributions into the same blocks.

Saturday, June 16, 2018

Bogus detection

This post is mainly for my adviser as I told him I'd have the implementation details section done tonight and may not get to prettying up this graph. Under the heading of "How can we tell if this is just plain wrong?", my thought was that a simple heuristic might be to use not a minimum sample size but rather a minimum non-zero sample size. That is, where we get in trouble is when we don't have enough non-zero blocks. So, I modified the simple block variance sampler to keep going until it had the specified number of non-zero blocks instead of stopping at a fixed block count. The results are below:


As you can see, at really small sample sizes, this technique helps a lot (the horizontal scale is logarithmic, the points are 5, 10, 20, 50, 100, 250). That is, waiting for 5 non-zero blocks yields much better confidence intervals than simply stopping at 5 no matter what. However, by 10, that advantage has gone away. Somewhere between 5 and 10 non-zero blocks is enough that the method works.

So, this is a really simple heuristic: run a few queries using whichever method you want and plot the performance when the stopping rule is some number of total blocks versus the same number of non-zero blocks. When the two converge, that's your minimum non-zero count. Don't even test the width of the confidence interval if you don't have that many; it can't be trusted.
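
In code, the modified stopping rule is barely more than a while loop. A sketch in R (draw_block() is a stand-in for however your sampler pulls the next block's summary; it's not part of any actual implementation):

  # Sketch of the minimum non-zero stopping rule. draw_block() is a placeholder
  # for pulling the next random block's sum for the current query.
  sample_until_nonzero <- function(draw_block, min_nonzero = 5, max_blocks = 1000) {
    sums <- numeric(0)
    while (sum(sums != 0) < min_nonzero && length(sums) < max_blocks) {
      sums <- c(sums, draw_block())
    }
    sums    # only now is the width of the confidence interval worth checking
  }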

Wednesday, June 13, 2018

Something to show for it

Going dark for a week was intentional. It's the only way I could make any progress in light of current work demands. The good news is that I think I have made progress. Not on the Metropolis-Hastings stuff; I think that line of inquiry is dead for the moment. But, I found a better way to use the bootstrap that does almost as well (but not quite as well) as the full kernel.

That's a big win on several fronts in my eyes. First, it provides a nice segue from the bootstrap to the kernel. Second, the kernel, which is the most Mathy solution, is still the winner on accuracy. Finally, the full bootstrap, which is the more Data Sciency solution, is the winner once you consider performance (it's about 50% faster than the full kernel). That sets up the next phase of research really well since D-BESt isn't very Mathy at all, but will be a great add to the bootstrap algorithm.

FWIW, here's the rejection graph (as always, 50 is the target).

Saturday, June 2, 2018

One percent

Trying to finish this paper as we get into the busiest part of the year at work has pushed blogging a ways down the list of things I get done in a day. Most of what I've been doing on the paper has been the usual editing stuff that isn't that interesting to write about.

In less mundane news: I hit a few snags with the Metropolis-Hastings stuff and may just jettison that whole line of work if I can't get it fixed. It's really just a transitional step to the full kernel sampler, anyway. But, I haven't given up quite yet.

Looking ahead, I have formed a goal in my head as to what would constitute a useful result (from an applied standpoint, as opposed to just something that may or may not be mathematically interesting). I already know that D-Best is pretty good at cutting 90% of the data out of a query. I also know that CISS did a pretty good job of returning results on just 10% of the data read. So, now having the theoretical framework to justify those results, it seems reasonable to expect that we could produce good results reading just 1% of the data. Two orders of magnitude is no small thing in the real world. It would make our current $1.2-million cluster operate like one that cost $120 million. That's a number that shows up on the bottom line even for a Fortune-500 company like mine.

Granted, we already have other tuning methods that give us one order of magnitude, so it's really more like a $10 million difference. Still, I don't know any VP's that wouldn't take that if you offered it to them. (Though, my VP still grabs his chest and hyperventilates every time I suggest we actually write a production version of this stuff - I guess he's seen big ideas go down in flames before).

Monday, May 28, 2018

Actual contestants in the comparo

As I've noted before, BMH and BLB are train wrecks when the data is not independent. While I certainly want to cite those authors, I don't think it's fair to say that their algorithms are really what I'm competing against. I've modified them too heavily to work in the blocked sampling case.

So, who's actually in this contest? And by the way, what is the contest? The second question has an easy answer: find the best estimator of the block sum variance. The actual estimator of the sum is the same for all of these. But, we want to know how good it is. The "best" estimator isn't necessarily the one that comes the closest to the real value (though, that's a good start). The correlation with the sum matters, too. If the variance estimate and the sum estimate are positively correlated (they usually are), then you get too many rejections when both are low. So, it's quite possible to have an estimator that's quite accurate, but still produces bad confidence intervals too often (the sample block variance fails on this count).

  1. Sample Block Variance: You want to know the variance of a population and you have a sample from that population? Well, just use the sample variance. It's easy to compute, unbiased, and has minimum variance. Too bad it sucks so badly in this case. The problem, as just mentioned, is that it is very strongly correlated with the sample sum, especially at small sample sizes. Any stopping rule based on this estimator will stop too early too often.
  2. Rho Variance: Rather than use the block sum sample variance, we derive the block sum variance as a function of the hit rate variance. We then use the sample variance of the hit rates and plug it into this function. Works great when the variance of the hit rate is really what's driving the block sum variance. Not so much otherwise.
  3. Adjusted Rho Variance: Part of the problem with the above method is that there will be variations in the hit rate from one block to the next even if the actual parameter is constant. So, we compute the variance in the hit rate that we would expect to see if it really was constant and subtract that out (variances are nice that way, you can decompose them into pieces that can be added or subtracted as needed). This adjustment makes the method work well in both the above case and when things really are distributed uniformly. Still doesn't help when the underlying distribution is changing, which is to be expected as the method doesn't even look at that.
  4. Partition Bootstrap: To try to account for partitions that vary both in hit rate and distribution of the measures, we perform a bootstrap method where we resample the existing partitions to create blocks and then look at the variance of the empirical distribution of the blocks. I'm not entirely sure why this one doesn't work better other than the usual problem with correlation between the sample sum and variance estimate. At any rate, it's actually pretty bad across the board.
  5. Metropolis-Hastings: Here, we try to construct the distribution of the entire population by estimating the parameters with an MCMC chain. We then sample these points and feed them into the exact formula for the block sum variance. I haven't finished validating my work on this one, but it appears to work fine. The only problem is that it's rather slow because you can't parallelize the MCMC chain (well, you can, but the results aren't as good). This algorithm borrows the idea from BMH that it's OK to run the chain on just a sample rather than the full data set. There's a variant that I'm also testing that the BMH authors suggested where you actually use a different subset of the data for each iteration of the chain.
  6. Full Kernel estimator: Rather than generate an empirical distribution with an MCMC chain, we generate one using Kernel techniques. The resulting distribution is very similar to the MCMC distribution, as are the generally good results. It's still not as fast as the Adjusted Rho Variance, but it handles pretty much any case and the Kernel can be easily biased in the early going to avoid the problems of correlation with the sum estimator at low sample sizes. (I didn't find this to be necessary, but it's a nice feature to have).
So, 3 and 6 are the winners, though I'm pretty sure 4 would work fine if I put more effort into it. As 3 and 6 are the most original methods, I'm not really worried that 4 isn't doing great. If somebody else wants to publish an improvement, they are free to do so.
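
For reference, the harness behind all of these verdicts is conceptually just a loop. A stripped-down sketch in R (the data generator and the naive variance estimate below are placeholders, not the actual simulation code):

  # Stripped-down miss-rate harness: simulate block sums, sample some blocks,
  # build a 95% CI for the total from an estimated block-sum variance, and
  # count how often the CI misses the true total (target is about 50 of 1000).
  set.seed(3)
  miss_count <- function(make_blocks, n_sample = 20, trials = 1000) {
    misses <- 0
    for (i in 1:trials) {
      blocks <- make_blocks()                     # vector of block sums
      N <- length(blocks)
      s <- sample(blocks, n_sample)
      est <- N * mean(s)                          # estimated total
      se  <- N * sd(s) / sqrt(n_sample)           # naive sample block variance
      ci  <- est + c(-1.96, 1.96) * se
      if (sum(blocks) < ci[1] || sum(blocks) > ci[2]) misses <- misses + 1
    }
    misses
  }
  miss_count(function() rbinom(1000, 1, 0.3) * rpois(1000, 5))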

Sunday, May 20, 2018

Lamest defense ever

I'm not saying the bible gives Peter a bad rap. He is, after all, called out as the first Pope which would presumably imply pretty high respect. But some of the anecdotes we read during Easter season are less than complimentary and (at least to me) needlessly so.

First we have Matthew quoting Jesus as calling Peter "Satan" when Peter suggests that maybe Christ dying isn't such a great plan. All four of the gospel writers note his betrayal during the trial. I've already written about John's little barb about him not being able to keep up running to the tomb. Today, we get as good an explanation as any as to why Paul became the primary defender of the gospel: Peter is just not a very good lawyer.

We tend to read the passage of the Pentecost through the lens of Renaissance paintings rather than constructing the scene in our heads directly from the text. The accusation makes a lot more sense if you take the story at face value. It starts by noting that the apostles, that is, twelve guys in their 20's, were all living in one house. Yeah, what could go wrong there? One morning they are all out front yelling out a bunch of stuff in a dozen different languages. The obvious conclusion is the one the bystanders came to: these dudes are drunk out of their minds.

Maybe Peter had never been to a frat party, but his defense is absurd: "We're not drunk, it's only 9AM." At what point after the discovery of alcohol did 9AM mean that young men are not drunk? Granted, he goes on to say some really important stuff after that and at least some folks were sold as the story ends with three thousand people getting baptized. But, seriously, that was your opening argument? That's the lamest defense ever.

Friday, May 18, 2018

A paper in pictures

So, here's a first cut at the storyboard. The graphs are generated quickly out of Excel; in the final, they will be prettier versions created by the ggplot2 package in R. In particular, the axes will be labeled properly: in all the performance graphs, the horizontal axis is a logarithmic scale of samples taken and the vertical scale is how many times the 95% confidence interval missed the real value. Ideally, 50 of 1000 (5%) of the confidence intervals miss.


Sequential loading of data in batches causes similar data to be stored in the same physical block. The boundaries of batches do not typically align with the physical block boundaries. Here, physical block j contains some of batch m, all of batch m+1, and some of m+2. Block j+1 is entirely comprised of observations from batch m+2. Block j+2 contains the remainder of batch m+2 and the beginning of m+3. Such co-location of related rows can significantly increase the variance of an estimator when sampling is done by block.


When rows are not correlated by physical block, the sample sum and sample variance are independent. However, when the rows are correlated by block, the sum and variance also become correlated. This simultaneous understating of both the sum and variance leads to a high number of misses, even though the two statistics are marginally unbiased. The cluster of points in the lower left of this graph results in confidence intervals that miss (I have a graph in R that does a much better job of illustrating this).


When estimating using the sample block variance, the miss rate is correct in the uncorrelated case. As a result of the correlation, this method produces too many misses when applied to small samples of correlated data. This is problematic when the width of the confidence interval is used as a stopping rule for sampling. Arbitrarily setting a lower bound on the number of samples is also unreliable, since the convergence to a proper 5% miss rate depends on the nature of the query and the strength of the correlation. (Uncorr is the data where the rows are randomly scattered in batches. rhoCorr is where the hit rate is correlated by batch. batchCorr is where both the hit rate and the distribution of the measure values is correlated by batch.)


Using the sample distribution of the hit rate to estimate the block variance yields good results in the case where that is the primary source of variance. However, it makes no adjustment for correlation of the measure distribution. When the data is not correlated, the method significantly overstates the variance (resulting in no misses) due to random fluctuations in hit rate being counted as real variation in the parameter.


Subtracting the variance due to random fluctuations from the variance of the kernel distribution corrects for the overstatement in the uncorrelated case. This method is very fast and is a good choice when only the hit rate (and not the distribution of the measures) is correlated by batch.


The simplest method which makes no assumptions on the homogeneity of the measures is the partition bootstrap. However, this method suffers from problems similar to those of the simple sample variance: the samples which understate the sum also tend to understate the block variance. As a result, the confidence intervals, especially early in the sampling, miss too often.

When we figure out what we're going to do with BMH, that will go here. I actually expect BMH to work really well since it basically does the same thing as the full kernel; just using a Metropolis-Hastings distribution rather than a kernel distribution as a source. The argument for using the kernel will be one of efficiency since BMH is pretty slow. Finally, the winner is...


The full kernel produces confidence intervals that include the true value roughly 95% of the time at all sample sizes. There is some understatement of variance at very low sample sizes, so it is recommended that some minimum number of samples be set (or, alternatively, some minimum number of samples which produce hits) prior to applying a stopping rule.

Wednesday, May 16, 2018

Storyboard

My adviser wants me to put more text around the figures in the paper. Fair enough, if it's worth a graph, it's worth a concise caption. But, then it occurred to me that I might be able to do better than that.

Many of the people I've discussed this problem with are really struggling with what, exactly, the problem is. These aren't poorly informed folks. It's just that multidimensional queries are typically done on fairly small (as in, less than a terabyte) data sets, so very little effort has been put into looking at their asymptotic performance other than to note that it's not uncommon for these things to literally run for days.

So, embracing the idea that a picture is worth a thousand words, I've decided to group all my figures together with sufficient captions that someone with a basic knowledge of data organization could follow the main points of the paper just by looking through the figures. Of course, that would be a very superficial review, but sometimes that's what a person wants. It will basically serve as an illustrated abstract. For those who want the thousands of words, they are, of course, provided.

I've been working on this for the last couple of days and I like where it's going. I should have something tomorrow.

Tuesday, May 15, 2018

Intentionally working past midnight

I don't know why that strikes me as odd. I did it all the time when I was in grad school 30 years ago. Anyway, I got a lot done, but I do need to get some sleep, so the post will have to wait until tomorrow.

Monday, May 14, 2018

Intentional

So, if you've been following the little graphs in the lower right corner of this blog, you can see that the last few weeks have been a bit of a train wreck for getting research done but not so bad as far as getting in physical training. The easy explanation for this would be that I like the physical training more, but that's not really true. As we're really getting close to wrapping this paper, I'm pretty excited about putting effort into it. I'd call my motivation pretty high.

Instead, the problem is one of planning. I've been getting in physical training because, frankly, it's planned. I have to decide at the beginning of the week which days I'll run or ride to work because I need to bring the right clothing in. Once that plan is set, I don't have too much choice but to execute it since I obviously have to get to work somehow.

With the research, it's been more about just fitting it in. And, even if something is important, it can easily get bumped by something that's urgent unless specific plans have been made to protect that time.

So, rather than simply logging my time on research, I'm going to start doing a better job of planning it. After meeting with my adviser, I generally have a pretty clear view of what needs to be done in the next week and how long it will take to do it. I'll follow each meeting with laying out the specific blocks of time when that will get done. Kate and Yaya are pretty good about respecting my schedule if I make them aware of it in advance. It's that lack of intentional planning that has been causing me to put down the research to take care of other stuff.

Friday, May 11, 2018

I meant for that to happen

Just a quick post tonight because it's almost midnight and I'm still at work.

I have, however, spent the last six hours working on my research. Yay for that. Even with the modifications the bootstrap algorithm is still off by quite a bit. But, that's a good thing. If it worked perfectly, there wouldn't be much point in deriving the exact distribution of the block variance. As you can see, in the uncorrelated case, it works great (to be expected - bootstraps like truly random data). In the other two cases, the variance is getting badly understated and there are too many misses.


So, that's pretty much the best possible result there. It works, just not very well. Makes for a good baseline.

Thursday, May 10, 2018

He who pays the piper...

... calls the tune.

I hope my adviser will accept that explanation for why I have pretty much nothing to show for this week's work. My day job has been just a bit chaotic this week. Not really a surprise this time of year, but something I need to keep in check.

Getting over this hump is pretty crucial. If I can get the paper to a submittable state this month, using the summer for the literature review makes a lot of sense. Carving out 1 or 2 hours a day to read other people's work is no problem. But, I won't likely have the big blocks of time to work on my own stuff until after we go live with our big update in September.

Wednesday, May 9, 2018

The other shoe drops

It's not like I didn't know it was coming.

I had really hoped that we'd get the paper done before work heated up for the annual push to our big release in August. No such luck. It's come fast and hard this week and I've made very little progress on the paper. I think I'll get a good chunk of time tomorrow so I'll have something for my adviser on Friday, but it's looking like I'm just going to have to do something nutty like work a combined 80-hour week to get this wrapped up.

Friday, May 4, 2018

What's stopping you?

I had a good meeting with my adviser today. We discussed the current work only briefly, then looked at longer term (as in, post-PhD) opportunities. While he concedes that the market for full-time faculty isn't what it once was, he had some other options that I hadn't considered. The details aren't particularly important at this point. The main question to ask is whether this all makes sense.

Specifically: am I OK with a big pay cut?

I suddenly find myself on the opposite side of an argument I've had with a lot of aspiring athletes. It's also a conversation I'll need to have with Yaya if music is really the avocation to be pursued. Basically, my question has always been, "What's stopping you?" Followed by, "Really, that's actually stopping you?"

Most people don't appreciate the difference between a criterion and a preference. A criterion is absolute. If it isn't met, the discussion is over. There are very few legit criteria in the world.

Mostly, what "stops" us from doing things is preferences. Sure, I'd like to be a pro bike racer, but I also want to finish grad school. Sure, I'd like to be a pro bike racer, but I don't have a contract and to race as an independent would mean living out of my car while working a side job. Sure, I'd like to be a pro bike racer, but I'm married to somewhat who doesn't really want that existence.

Those were all preferences that I faced in 1988. They were strong preferences. There were consequences to not honoring them. But, they were preferences. I decided to see what would happen if I set them aside.

Well, I can't point to any huge success story but I can also say that it was a life without excuse. It basically cost me everything. I didn't get my PhD. I had a whopping $900 to my name when I retired. My wife left me just when I was actually getting good. But, I did do it. Those things weren't stopping me. They were just really difficult preferences.

Would I make the same choices now? Of course not. But, that's not the point. The point is that every decision is just that, a decision. If the option isn't feasible, there's no choice. It's only a decision because you are weighing the cost. And, the cost can be rather high.

Which brings us to life after grad school (let's assume for the moment that I do get my PhD this go round). There are three reasonably plausible paths. I could keep working as a consultant. That would certainly be the path of least resistance. I don't know that a PhD would improve my marketability much; I've been doing this consultant thing for a while now and generally get my jobs by referral, so credentials don't count for much. It certainly wouldn't hurt.

I could find a full-time teaching position. As I've noted before, that's becoming a heavy lift due to the increasing reliance on adjuncts. However, a friend of mine who's about my age just took a full-time job at a community college and is loving it. So, there are opportunities; just maybe not the ones I had in mind.

Another option, which was the subject of our meeting today, is a sort of hybrid position which is partially funded by an independent research group and partially funded by a university. I have long known about such positions; it's basically the deal my dad got with Cornell when they "bought" the independent research institute where he was a Principal Investigator. But, I hadn't really considered it as an option for me.

The upsides are obvious. It's full-time work and it constitutes real involvement in the faculty.

The downsides are also obvious. The pay is less than great (though no worse than full-time teaching) and your job is only as secure as your last grant.

I'm debt free, have a marginally OK retirement account and, aside from my predilection for good wine, don't have much in the way of big expenses. I do have a kid going to college in four years, but that's more Yaya's problem than mine. I'll help as best I can, of course, but at the end of the day I don't have a big problem telling Yaya the same thing my parents told me when noting that private school tuition is a pretty big chunk (like half) of an academic's salary: if you want to go to a private engineering school, you're going to have to shoulder the costs. I did and I did. I think I took college a lot more seriously as a result.

So, nothing's really stopping me. Does that mean the preferences are not important? Of course not. But, it's a decision, nothing more.


Thursday, May 3, 2018

The real problem with correlated variances

I've suspected this all along and have had it in the paper for some time, but only recently bothered to prove it. The funky shape of the sample variance distribution isn't the real reason so many confidence intervals miss the true value. It's the correlation between the sample sum and the sample variance.

As you can see in the plots below, the uncorrelated sum and variance are basically independent. When the data is correlated, there's a bigger chance of getting "zero" blocks which understates both the sum and the variance. This combination is, simply put, bad.

While there's a lot more variation in the correlated case, the average values for both the sample sum and sample variance are fine. But, since the joint distribution pushes both away from the mean together, it results in small confidence intervals when the error is actually the greatest.

And, the problem never really goes away. Even at 250 of the 1000 blocks sampled, the correlation is obvious from the plots:
It's tight enough that the rejection rate is about right, but there's no denying that these statistics are correlated.
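
If you want to reproduce the effect without my data set, a toy version looks like this (zero-inflated block sums standing in for the correlated case; none of this is the actual simulation code):

  # Toy reproduction of the sum/variance correlation. Samples that catch few
  # non-zero blocks understate the sum and the variance at the same time.
  set.seed(4)
  blocks_uncorr <- rnorm(1000, mean = 10, sd = 3)
  blocks_corr   <- rbinom(1000, 1, 0.1) * rnorm(1000, mean = 100, sd = 10)
  stat_pairs <- function(blocks, n = 10, trials = 5000)
    t(replicate(trials, { s <- sample(blocks, n); c(sum(s), var(s)) }))
  cor(stat_pairs(blocks_uncorr))[1, 2]   # essentially zero
  cor(stat_pairs(blocks_corr))[1, 2]     # strongly positive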

Wednesday, May 2, 2018

Well, it looks nice

I'm not suggesting that ggplot2 is a crappy piece of software. Quite the opposite. From working with it over the last couple of days, it seems that it is rightfully the "standard" (as much as there can ever be one in the open source world) for creating graphs in R. What I am suggesting is that, damn, what a giant PITA just to produce some graphs.

Purists will be all over me saying that there's plenty of documentation out there and that everything works just as it should. Furthermore, the adherence to The Grammar of Graphics gives it some academic street cred. They are correct on both of these.

However...

I can create pretty much the same graphs in Excel in about 1/100 the time. If I spent the next few years doing intensive graphics work in ggplot2, I might get to where I could produce graphs in only twice the time it takes to do it in Excel.

Nor am I suggesting that Excel is the world's greatest graphing tool (though, it's pretty darn good). What I am saying is that this is definitely a case where adhering to academic purity is making things take waaaaaaaaaay longer than they need to. Be that as it may, the graphs have been ported to ggplot2 on R.

Yay for that.

Tuesday, May 1, 2018

Targeted solution

So, if you've got data where the query relevance is highly correlated by partition, but it's otherwise uncorrelated, I've got the sampling algorithm for you. The "rho"-kernel algorithm works great in that case.

And not anytime else.


This is the number of times out of 1000 tries that the sample confidence interval didn't cover the real value. Alpha risk was set at 5%, so it should be around 50 misses per 1000.

I expected the full correlated set to generate a lot more misses. The rho-kernel algorithm makes no attempt to adjust for that type of correlation, so it would be alarming if  the results were good for that case. However, the fact that the uncorrelated data was never missing was a bit of a shock. Why does a correlation coefficient of zero break the algorithm? I had to think on that one for a while.

What's happening is that the kernel will never assign zero variance to rho. But, if the variance really is zero, the only variability that will be observed is just random noise. That noise is already baked into the variance computation, but it also shows up as the variance of rho. So, it basically gets double counted.
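
To put toy numbers on it: even when rho is truly constant, the observed per-block hit rates still spread out by about p(1-p)/n, and that same noise is already reflected in the block sums.

  # Toy illustration of the double counting: the true hit rate is constant
  # (rho really has zero variance), but the observed per-block hit rates still
  # vary by roughly p*(1-p)/n. Feeding that spread into a rho-based variance
  # formula counts noise that the block sums already contain.
  set.seed(5)
  p <- 0.2; n_rows <- 500                           # constant hit rate, rows per block
  hits <- rbinom(1000, size = n_rows, prob = p)     # hits per block, 1000 blocks
  observed_rho <- hits / n_rows
  c(sample_var_of_rho = var(observed_rho),
    pure_noise        = p * (1 - p) / n_rows)       # essentially the same number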

Unfortunately, there's no good way to know from the data sampled whether you're in a correlated or uncorrelated situation. You could put some heuristics around it, like, if the differences between blocks were above some threshold, but it would have no mathematical basis. I'm trying very hard to avoid heuristics in this paper.

So, I just have to conclude that the rho-sampler sucks. That's not really a bad result. Eliminating inferior solutions is part of finding the good ones. I do have the same problem with the full kernel approach. In that case though, the effect is smaller and it's a bit easier to adjust for it. More on that to come.


Sunday, April 29, 2018

Heart stopper #2.

Not quite as bad as a couple weeks ago when I thought my entire thesis had been scooped, but still enough to cause some anxiety.

I've been reworking my simulated data so that I can apply all the methods to the same data set (yes, it would have made more sense to set that up from the get-go, but that's a different conversation). Anyway, as I ran the most basic sampler (just use the sample block variance) on the new set, I noted with dismay that it was returning very good results even on the correlated data. I began to worry that all this work has been for nothing and that I had just been working with an anomalous data set.

Turns out, it was the new set that was anomalous. Not terribly so, just that the correlation between hit rate and the mean observation given a hit ended up yielding roughly the same expected value for all the partitions. Obviously, that's not the situation we're trying to deal with here. We state up front that if the partitions all return roughly the same mean, then any sampling method will do.

It was an easy fix to generate data where that wasn't the case. We carry on.

Friday, April 27, 2018

Running on credit

It's been a long time since I've made the tragic mistake of booking a vacation on a credit card. Still, today was looking like the very best day of the year, so I decided to bolt out of work and take my "sabbath" from 2PM today until 2PM tomorrow. At that point, the bill will come due and I'll need to work my ass off the rest of the weekend but I can't say I have any regrets. It was really nice out.

Thursday, April 26, 2018

Doing the comparos justice

With my new outline, I'm having to change my treatment of BLB and BMH a bit. I had introduced them only in the context of the uncorrelated case. They do great there. But, then again, so does simple sampling, so why go to the trouble? Given that they fail miserably when the data is correlated, one could question why I even bring them up.

Well, they don't fail quite as miserably if you put just a little work into it.

First, there's the fact that BMH specifically says you should resample blocks with each iteration rather than using the same block over and over again. If you actually sample random blocks from the entire population, this will work great, but that's because you will have sampled the entire population by the time you're done. That rather defeats the purpose of sampling.

But, suppose we just sampled from the blocks that we had already selected. So, say we've sampled ten blocks and we want to check three blocks at each step of the chain. No problem, just pick any three from the ten you've got. Since we've already computed summary statistics, this doesn't involve any extra computation, it's just a matter of picking the right value out of an array at each step. This will spread the distribution to at least cover all of the blocks sampled.

Now, let's look at BLB. This one is a bit trickier because the method explicitly relies on each "bag" being a truly independent random sample, which a block is definitely not. But, the partitions within our sampled blocks are, more or less, independent random samples. So, if instead of looking at individual observations, we select a group of partitions and then construct the bootstrap from them, we should get similar variation in the total sum without having to randomly sample at the row level.
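
I haven't written it yet, but the partition-level bag construction I have in mind is roughly this (a sketch only; partition_sums is a placeholder for the per-partition summaries we already compute while loading the sampled blocks):

  # Sketch: build bootstrap "bags" by resampling sampled partitions rather
  # than individual rows.
  partition_bags <- function(partition_sums, bag_size, n_bags = 100) {
    replicate(n_bags, sum(sample(partition_sums, bag_size, replace = TRUE)))
  }
  # The spread of the bag totals then plays the role that row-level
  # resampling would play in the original BLB.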

I'll need to test out both these ideas, of course. Fortunately, the sampler harness is pretty generic so it's probably only 1-2 days work, plus another day writing up the results (unless they come out whacky, then it could be several days trying to figure that out). Either way, I think it will be a presentation where I'm less likely to be accused of portraying competing methods in their worst possible light.

Tuesday, April 24, 2018

Big finish

In most things, it's pretty easy to put out a big effort right at the end. In running, many people sprint to the finish line once they can see it. However, turning up the intensity with a mile and a half to go will pull back a lot more time and places.

I feel like that's where I am with this paper right now. There's still 1-2 weeks of solid work and probably another week of clean up and submission. It's too far out for doing something crazy like pulling 2 or 3 all-nighters. I'd just be too tired to edit carefully and would make a bunch of mistakes. But, it is time to really push. I think if I get 20-25 hours in each of the next two weeks, I'll be in really good shape on this one.

My reward is that that's exactly when I can return to running, as my broken foot should be healed on May 6.

Monday, April 23, 2018

Still stuff to do

Definitely getting warm on this last revision (I'll find out if my adviser agrees tomorrow). The flow of the paper makes much more sense now, as does the inclusion of the BLB and BMH algorithms. There are still a few things left to do:

  • Display consistent results for each of the methods. Right now, the graphs are a little scattered. They all tell good stories, but I need to either come up with a consistent graph that allows comparisons across methods, or come up with some other type of table that puts them on equal footing.
  • There are a few sections of text that my adviser thinks are less than clear and I haven't been able to come up with significant improvements. We may just have to work them together word by word. Fortunately, it's a short list.
  • The conclusion is, well, missing. I need to write up a conclusion of some sort.
  • Run all the samplers on the empirical data set. This is the biggest lift, but at least I have that data set on the HDFS cluster at work now. It's 17 billion fact rows and it's no problem to construct queries against it that take several minutes to complete (they used to take hours before we rehosted to the cluster this year). So, it's a big enough data set to make the case that sampling makes sense. The problem is that I haven't properly parallelized the samplers, so they will probably take just as long as the full query engine (which we have spent the last year making very fast). I guess I don't need to worry about that since we're just demonstrating that things work and not making a bunch of performance claims.
That's probably a solid week's worth of work and I'm sure there's another round of edits, but this is looking like something that really will happen quite soon.

Sunday, April 22, 2018

All that Jazz

This weekend was the UMSL Jazz Festival. As with last year, Yaya's school performed during the day. Then, I took her to both evening performances. Really good stuff all around.

Except, of course, for the fact that the time doesn't come from nowhere. I was so tired from staying up late the last two nights that I didn't get much done today.

But, back to the jazz. Yaya is very much into it. So much so that I was happy to hear Gordon Goodwin speak plainly to the crowd (which, this being a scholastic event, had a lot of middle and high school jazz players in attendance) about the realities of professional jazz. He didn't come across as remotely jaded or bitter; he just acknowledged that if jazz is what you want to do, you might need to find some other way to pay the bills. As he's pretty much top of the heap in the Big Band world, his commentary carries a lot of weight.

Yaya needs to hear stuff like that. Frankly, so do I.

I entered this PhD program not even considering that getting a faculty job might be a problem. After all, I had multiple offers coming out of Cornell, and that was with just a masters. But, things have changed. Schools aren't hiring full-time faculty anymore. At least, not as a general practice. They are relying on adjuncts to carry the load. I'm not really interested in being an adjunct. I've done it and I didn't find I got the same connection with the students that I did when I was full-time faculty.

So, will I ever teach again? I can't honestly say. I still think it's possible, but I don't know that I'd call it likely. I don't regret going back to school and a PhD in Computational Mathematics is a useful thing to have as a consultant. But, it's certainly not the path I thought I was on.

Friday, April 20, 2018

Reorganizing again

The outline of the paper is changing again, but not the content. We're just going to present the topics in a slightly more cohesive manner given that the true subject of the paper has become the block variance, not the sum itself.

  1. Introduction
  2. Sampling framework and notation
  3. Exact distribution of sample sum
    1. Distribution as a function of the block variance
    2. Block variance in the uniform case (my adviser would like to use a different word than "uniform" since it sort of implies a Uniform distribution - I'll think of something).
    3. Block variance in the correlated relevance case
    4. Block variance in the general correlated case
  4. Methods for estimating the block variance
    1. Simple sample variance
    2. Bag of Little Bootstraps
    3. Bootstrap Metropolis Hastings
    4. Kernel estimation of relevance
    5. General kernel estimation
    6. Results and discussion
  5. Implementation details
  6. Future research

Wednesday, April 18, 2018

Implementation details

I doubt anyone will care, as this paper doesn't really espouse a working algorithm as much as a general approach to multidimensional sampling. That said, it is a working algorithm, so I added an implementation details section where I give the computational complexity and offer a framework for parallelization. It's only one more page and it at least tells the reader that we're not completely lost in theory land here.

Meeting with my adviser tomorrow to get some more detailed feedback on Monday's draft. Finishing this weekend is probably optimistic, but we are definitely getting close.

Monday, April 16, 2018

Sometimes it snows in April

Well, Prince being from Minneapolis, one can see how he might think that. But, this is St. Louis. It's snowing. In mid-April. What the heck?

On a related note, boy am I glad I ran Boston last year when it was just kind of warm rather than the sleet and wind they got this year. It was the slowest winning time in at least 40 years, and the folks I know who are around my speed are having trouble breaking 3:30. Sounds miserable.

Sunday, April 15, 2018

Backwards

As I've admitted before, I'm kind of doing this research thing backwards. I've been developing my own ideas without paying a whole lot of attention to what everybody else has done. My operating premise has been that, if somebody had already cracked this nut, we'd be doing it at work. We have a pretty crack team and so do many of our competitors. There are literally trillions of dollars at stake. And, it's not something that would be kept a secret. Unlike regular insurance companies, which might view such knowledge as a trade secret, reinsurance companies attract clients (regular insurance companies) by showing off how smart we are. People are encouraged to publish.

So, the fact that neither we nor our major competitors have figured out a good way to sample our data was enough for me to consider it an open problem.

Imagine my dismay when I came across a paper today that appeared to have addressed it back in 2008. It's one of those moments when you first aren't sure you read that right and read it again. Then you read it slowly while feeling that urge to be sick build in your gut. The fact that it came from a team at the University of Illinois was less than comforting; very little that comes out of that engineering school is bogus. Then there's the frantic search to find an online copy of the paper (I'm sure UMSL's library has the journal, but I didn't want to spend an hour going over there this evening). I finally found a copy and was relieved to see that their premise is fundamentally different from mine.

Basically, they set about showing how to construct confidence intervals when your OLAP cube is populated with a sample of the full population. That's a valid thing to think about and you could use it to solve our problem if you were OK with truly random sampling. But, we're not. Random sampling is hugely inefficient. Block sampling is the underlying approach of my paper and that wasn't even mentioned in the U of I work. So, we're all good for now, but it sure did have me in cold sweats for a bit.
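For what it's worth, here's a toy illustration (my numbers, not anything from either paper) of why that distinction matters on disk: a truly random row sample ends up touching almost every block in the store, while a block sample only reads the blocks it draws.

```python
import numpy as np

rng = np.random.default_rng(1)
n_blocks, rows_per_block = 10_000, 50_000   # made-up layout of a fact table
sample_rows = 200_000                       # rows we want in the sample

# iid row sampling: how many blocks get read to collect those rows?
# (Sampling with replacement is close enough for the comparison.)
row_ids = rng.integers(0, n_blocks * rows_per_block, sample_rows)
blocks_touched = np.unique(row_ids // rows_per_block).size

# Block sampling: read whole blocks until we have the same number of rows.
blocks_read = sample_rows // rows_per_block

print(f"iid sample:   {blocks_touched:,} of {n_blocks:,} blocks read")
print(f"block sample: {blocks_read} of {n_blocks:,} blocks read")
```

At our row counts, that's the difference between scanning essentially the whole table and scanning a handful of blocks.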

I really need to take my literature review seriously this summer.

Saturday, April 14, 2018

1

Just a quick proud papa post today. Yaya got a "1" (which is the high score) at the St. Louis Music Educators Exhibition. I'm not a particularly good judge of what makes a trumpet piece hard since I've never been any good at valved instruments, but it seemed like the solo Yaya picked was pretty stout for an 8th grader. The judges seemed to agree.

Thursday, April 12, 2018

On the margins

Quick, spot the miss in the following derivation:




It's not super obvious. And, truthfully, it doesn't actually make any difference worth noting. However, a PhD dissertation is your chance to get things really right, so being picky is in order.

The problem with this derivation is that we take the probability that a block partition comes from partition m to be the ratio of the rows in partition m to the total rows. That's not a bad approximation, but the rest of the formula is really specific, so approximating this term isn't appropriate.

It should be the conditional distribution, P(m|k), rather than just the marginal P(m). For a given value of m, that's quite different. In total, it washes out a bit, but not completely.
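A toy example (made-up counts, purely to show the gap): P(m) weights partition m by its share of all rows, while P(m|k) weights it by its share of the rows in block k, and for a particular block the two can be wildly different.

```python
import numpy as np

# rows[k, m] = rows of partition m that land in block k (made-up counts)
rows = np.array([[900,  50,  50],
                 [100, 800, 100],
                 [  0, 150, 850]])

p_m = rows.sum(axis=0) / rows.sum()                    # P(m): global row shares
p_m_given_k = rows / rows.sum(axis=1, keepdims=True)   # P(m|k): within-block shares

print("P(m)     =", np.round(p_m, 2))             # [0.33 0.33 0.33]
print("P(m|k=0) =", np.round(p_m_given_k[0], 2))  # [0.9  0.05 0.05]
```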

This is the sort of semantic stuff that I'm not super good at. It's not that I don't think it's important; it's just that I tend to miss these things. I hope my adviser is good at catching them.

Anyway, I've updated the sampler to reflect the correct term.

Monday, April 9, 2018

A lotta work for nothing

Alert readers will recall that a week ago I spent the better part of two days sorting through derivations to find a term that almost always evaluated to zero unless you blended the estimators - then it was not anything like zero.

Well, it's back, but in reverse. Now, since I'm not blending estimators in my most general case, I do want the term. It still comes out to pretty much nothing, but I don't want reviewers pinging me for dropping terms just because I don't think they matter. I still do drop terms, but I'm being very careful about making sure they really are insignificant. I've wasted too much time chasing down bad assumptions.

FWIW, here's the term:




I do drop a couple terms in the third step because the hit rate and mean observation aren't supposed to change from one partition to the next so their variance should be very close to zero, especially with large row counts. That's an assumption and I could get called on it, but since I call it out myself in the paper, I'm OK with it.

Saturday, April 7, 2018

My point is...

This is not the progress my adviser was likely hoping for today, as I said I'd have actual results on Monday. But, while the morning has not produced any such results (or even done much to forward that effort), it has provided a significant revelation that should help in wrapping all this up.

People have been asking me what my paper is about. I've been having a lot of trouble answering that because the topic has shifted so many times. Frankly, when I read the first part on the uniform sampling and then look at the second part with the generalizations, it looks a bit scattered. This was brought to my mind when my adviser suggested we just use the empirical distribution of the partition parameters to construct the sum directly and then get the variance of that rather than using the block sums. For reasons I couldn't put my finger on, that seemed horribly inconsistent with the rest of the paper. Not inconsistent in the sense that it contradicted anything; just that it didn't fit.

This morning, while making a few false starts on implementing the general block variance computation, I thought more about why I wasn't keen on using the total sum. There are some technical ramifications that I'd have to work through, but they aren't a big deal. I then realized why it doesn't fit.

This paper really isn't about the sum. It's about the block variance. If we know the block variance, we can always get the variance for an estimator for the population. More importantly, in the finite sampling case, this is true for any estimator. The sum is just a convenient choice to demonstrate the technique.
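To make that concrete with the sum (this is just the textbook finite-population result for drawing r of the B blocks without replacement with equal probability, not the paper's exact setup):

$$
\hat{T} \;=\; \frac{B}{r}\sum_{k \in s} T_k,
\qquad
\operatorname{Var}\bigl(\hat{T}\bigr) \;=\; B^2\left(1-\frac{r}{B}\right)\frac{\sigma_b^2}{r},
$$

where T_k is the sum in sampled block k and the block variance sigma_b^2 is the variance of the block sums (with the usual B-1 denominator). Swap the block sums for some other per-block statistic and the same machinery gives the variance of that estimator, which is the sense in which the block variance is the thing that matters.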

Viewing the paper through this lens, the title becomes a bit more obvious (something like: Estimating Query Values from Finite Multi-dimensional Populations using Blocked Sampling). The comparison with the BMH and BLB algorithms to estimate the block variance is suddenly a sane thing to do. The introduction explaining why one would even use blocks rather than true iid sampling fits. And, though I haven't written it yet, the conclusion can talk about future research where we don't just consider sums, but generic population parameters estimated by repeatedly sampling blocks.

Most importantly, this points the way for proceeding to the real problem I want to solve: self-tuning multi-dimensional data stores. The objective is to restructure the data so that the block variances are minimized. A secondary objective would be to make them easy to compute. I've already got quite a bit of work done on both of those.


Friday, April 6, 2018

Back to earth

It's been a crazy few days at work (actually, normal by the crazy standards of my group this time of year, but crazy by any normal standard). Anyway, it's prevented me from acting on the previous post.

And, that's probably a good thing.

I had another meeting with my adviser today and we both agreed that the whole r×4-dimensional space was just a bit nutty. So, we're going to frame this in less grandiose terms. We'll still create a 4-dimensional kernel distribution, but we won't blow it out for every partition. Instead, we'll just sample from that and estimate the block variance. My adviser would like to go even beyond that and just directly sample the total sum variance, but I'm still lobbying for the block variance. I'm pretty sure I can show that it's equivalent, both theoretically and in terms of computational effort. Sticking with block variance gives us a consistent measure of spread throughout the paper.
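A rough sketch of how I picture the kernel version working, assuming the observed partition parameters sit in a small array and using scipy's gaussian_kde. The parameter values, the clipping, and the 20-partitions-per-block grouping are all mine, purely for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)

# Toy stand-in for the 4 partition parameters (one row per observed partition):
# row count, hit rate, mean observation, observation variance.
params = np.column_stack([
    rng.poisson(5_000, 500).astype(float),
    rng.beta(2, 5, 500),
    rng.lognormal(4, 0.5, 500),
    rng.lognormal(6, 0.7, 500),
])

kde = gaussian_kde(params.T)        # the 4-dimensional kernel distribution

# Draw synthetic partitions, clip the draws back into range, group them into
# blocks of 20 partitions, and look at the spread of the block sums.
draws = kde.resample(10_000).T
rows = np.maximum(draws[:, 0], 0.0)
hit_rate = np.clip(draws[:, 1], 0.0, 1.0)
mean_obs = np.maximum(draws[:, 2], 0.0)
# (The observation variance isn't needed for the sums; it's just along for the
# ride so the kernel stays 4-dimensional.)

partition_sums = rows * hit_rate * mean_obs
block_sums = partition_sums.reshape(-1, 20).sum(axis=1)
print("kernel estimate of block variance:", round(block_sums.var(ddof=1), 1))
```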

I've got until Monday to make my case. It's time to wrap this thing up.

Tuesday, April 3, 2018

New method

Oh, brother, here I go again reframing things just when the finish line is in sight.

At my weekly meeting with my adviser, he was understandably skeptical of my plan to use an MCMC chain to generate a distribution across all the vectors of partition parameters. Given 1000 partitions (which actually isn't very many), that's a 3000-dimensional space. I'll concede that's a heavy lift, though my thought was that even a very sparse approximation would still converge. I, of course, have nothing but gut feel to back that up.

Then, as I was staring at the blackboard, it hit me. There's no reason all these things need to be estimated at once. We've already said that in the general case we are assuming no direct correlation between partitions. They may exist, but we don't assume as much.

So, that means that we could partition the partition space and estimate the groups of partitions separately. Then, we just need a way to glue all those back together into one empirical distribution of the block variance. My adviser liked that idea and I left the meeting thinking that should be easy enough to do.

Then the revelation hit me: it's already been done! That's essentially what the Bag of Little Bootstraps algorithm does. All the theory has already been worked out. Better yet, I already have a section in my paper adapting the theory to my use case. It's not a slam dunk because I was showing how it operated in the uniform case and explicitly called out how it fails miserably in the correlated case. But, that's because we were measuring the wrong thing. Applied to the partition statistics rather than the individual rows, this should work.
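To convince myself the plumbing is sane, here's a toy adaptation (mine, not pseudocode from the BLB paper) of the idea applied to partition statistics: split the partition summaries into little bags, resample each bag back up to the full partition count with multinomial weights, compute the block variance on each resample, and pool everything into one empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(4)

def block_variance(part_sums, part_block, n_blocks):
    """Variance of the block sums implied by a set of partition sums."""
    sums = np.zeros(n_blocks)
    np.add.at(sums, part_block, part_sums)
    return sums.var(ddof=1)

# Toy partition statistics: each partition has a sum and a home block.
n_part, n_blocks = 2_000, 100
part_sums = rng.lognormal(5, 1, n_part)
part_block = rng.integers(0, n_blocks, n_part)

s, resamples = 10, 20          # s little bags, resamples per bag
dist = []                      # pooled empirical distribution of block variance
for bag in np.array_split(rng.permutation(n_part), s):
    weights = np.ones(len(bag)) / len(bag)
    for _ in range(resamples):
        # Resample the bag back up to n_part partitions via multinomial counts.
        counts = rng.multinomial(n_part, weights)
        boot_sums = np.repeat(part_sums[bag], counts)
        boot_blocks = np.repeat(part_block[bag], counts)
        dist.append(block_variance(boot_sums, boot_blocks, n_blocks))

dist = np.array(dist)
print("block variance estimate:", round(dist.mean(), 1))
print("90% interval:", np.percentile(dist, [5, 95]).round(1))
```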

It does mean I have some more proving to do. Simply showing that it works on simulated data sets won't cut it. But, the proof of correctness in the original paper is fairly straightforward. I should be able to follow the same path to show it works in this case.

The downside, of course, is that I just set myself back another week. This is the first time I've felt like it will be worth it.

Monday, April 2, 2018

Theory and practice

Agree! Finished up the empirical results for the correlated hit-rate sampler last night. (My observance of a day off each week is the traditional sundown to sundown, so it's OK to work Sunday night, even Easter Sunday.)



As you can see, even for small sample sizes, the rejection rate is very close to the desired 5%. On to the full general case.

Sunday, April 1, 2018

John smokes Peter in the inaugural Resurrection 5K

I may get some angry emails for making light of the resurrection, but I figure everybody who cares about such things has already had time today for serious reflection on the paschal mysteries. Today's lectionary from John would seem to indicate that running is, indeed, a thing:

So Peter and the other disciple [that's John, who has a strange aversion to first-person narrative] went out and came to the tomb. They both ran, but the other disciple ran faster than Peter and arrived at the tomb first; he bent down and saw the burial cloths there, but did not go in. When Simon Peter arrived after him, he went into the tomb and saw the burial cloths there, and the cloth that had covered his head, not with the burial cloths but rolled up in a separate place. Then the other disciple also went in, the one who had arrived first, and he saw and believed.

Those who study scripture know that the tradition is that when a point is really important, it gets stated three different ways. There's a good reason for that. If you tell something once and it's misheard (remember, we're talking oral tradition here, none of these folks had a bible at home), you've given them bad information. If you tell it twice and they mishear once, you've given them conflicting information. If you tell it three times, you've at least given them a preponderance of statements to sort it out. It's basically the ancient equivalent of a checksum bit.

So, it appears that in writing about the most important event in Christianity, John really wanted us all to know that he dusted Peter running to the tomb. John wrote his gospel sometime between AD 85 and 95, so this is basically a 90-year-old giving a race report from his 20's. As the saying goes, the older I get, the better I was.

Saturday, March 31, 2018

Searching for the lost

The lost piece of my variance formula, that is. (Those who know me personally know that I take this whole Jesus died and rose again thing fairly seriously. That doesn't mean, however, that one can never make a lighthearted comment about it. If you disagree with that statement, you might want to skip both today's and tomorrow's posts).

Anyway, we're told that between dying on Friday and rising sometime early Sunday, Christ hung out in hell. Exactly what he did there is a point that some folks like to debate. Some claim he went there looking for lost souls. Whatever he was up to, it probably sucked worse than my last two days, which have been spent looking for what part of my variance was lost when we approximate the vector of hit rates.

But, maybe not by much. Searching for computation errors is pretty much my version of hell.

The formula worked fine when calculated directly and I wasn't seeing anything wrong with the proof.

Substituting in the moments for a random variable is a pretty standard trick, so I was really miffed that the formula was not only off, it was outright wrong. Breaking the blocks into smaller partitions was increasing the variance. The opposite should be true.

So I spent yesterday afternoon and evening going through the derivations really carefully to see what was missed. I did find a reversed sign (as I've noted before, I do that a lot), but nothing that explained the real problem.

So today I did what I only do as a last resort because I'm really bad at computation. I sat down and started working through the whole thing by hand. Well, just in time for Easter, I've found the problem. And, it was pretty insidious.

The formula is actually right, but it contains the fairly innocuous looking term:



The nasty bit is that that term is almost always zero when you are computing each partition individually. That's because most batches don't span more than two blocks, so the variance of any particular block partition given a batch m will be zero because there's at most one entry. When you blend that, you wind up with



and that guy is almost never zero because now you're looking at the variance of partition sizes across all blocks. So, the more you chop up the data, the bigger the approximation gets even though the true value is going down.
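Here's a toy numeric version of the trap (my own illustration, not the actual derivation): condition on a batch and the variance of its block counts is zero whenever the batch sits in a single block, which is nearly always; pool the counts across all batches and blocks and the variance is anything but zero.

```python
import numpy as np

rng = np.random.default_rng(5)

# counts[m] = row counts of batch m in each block it spans.
# Most batches sit entirely in one block; a few straddle two.
counts = []
for _ in range(1_000):
    n = int(rng.integers(100, 10_000))
    if rng.random() < 0.1:                 # ~10% of batches span two blocks
        split = int(rng.integers(1, n))
        counts.append([split, n - split])
    else:
        counts.append([n])

per_batch_var = np.array([np.var(c) for c in counts])   # Var(count | batch m)
pooled = np.concatenate([np.asarray(c, float) for c in counts])

print("batches with zero conditional variance:",
      round(float((per_batch_var == 0).mean()), 2))      # about 0.9
print("pooled variance of the counts:", round(float(pooled.var()), 1))
```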

The fix is easy enough - just drop the term. Like I said, it's almost always zero and even when it's not, it's not very big compared to the other terms in the equation. Having done that, my sampler quickly rose from the dead and began producing some decent results. Sorry, no graphs yet, but at least I've got stuff to work with again. And, I got the paper updated so my adviser can continue to review it.

I can now go to Easter Vigil and think about what really matters. I'll pick up my research again tomorrow evening.

Thursday, March 29, 2018

Enough theory

Yes, that's a direct quote from my adviser! Let's get some stinking results and put this thing to bed. As I have a 3-day weekend ahead of me (yes, I will take Sunday off, and I'll also go to church tomorrow), I should be able to get the empirical results we are after. Well, at least the first set.

The full-general 3-stage MCMC results that I've proposed for the grand finale might not come together quite so quickly, mainly because I'm not entirely sure how to do it. Still, the target is in sight and, since I can't spend the weekend running, there's no reason not to set expectations high.

Tuesday, March 27, 2018

Oh, snap.

And not in a good way.

It looks like I've got a stress fracture. I went for my normal morning run Sunday. As is my custom, except when the ground is frozen, I ran barefoot. Since we had cold rain all night, my feet were pretty cold, but otherwise there was no discomfort.

When I got done I wiped the mud off my feet and noticed the telltale bruise of a stress fracture above the third metatarsal. If it is a stress fracture (it's pretty hard to know for sure for a few weeks), this will be the third time I've had one. It's been a different metatarsal each time.

The first time, I thought it was just a bruise and kept running until it became an overt fracture. There wasn't much mistaking it when that happened. I was in a race when it snapped and had to hobble the remaining mile or so to the finish. The second time, I recognized the pattern and let it heal before things got really bad.

Hopefully, that will be the case again this go round but, at the very least, it's a few weeks off at a time I was hoping to start ramping up my mileage. Bummer. The amount of stress in my life right now will not be easy to handle without running as a coping mechanism.

Monday, March 26, 2018

About that todo list...

16 hours at work today. Not a whole lot of research going on.

I knew this was coming; we always ramp up in April and May. I was hoping to have the paper done before it hit. The next month will be telling. I'll either get the paper done, despite the growing demands at work, or I won't.

What I'll do in the latter case is an open question that I'm not going to seriously consider unless the predicate becomes a reality.

Sunday, March 25, 2018

Edged out

I think I have a reasonable progression of logic that gets r/(r-1) in front of my sample variance. It's not a formal proof by any means, but I don't think that's necessary since any competent statistician knows you have to do that to remove the bias of the sample moments.
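The progression is nothing more than the standard bias calculation for the plug-in variance: with r sampled values,

$$
E\!\left[\frac{1}{r}\sum_{i=1}^{r}\bigl(X_i-\bar{X}\bigr)^2\right]
\;=\; \frac{r-1}{r}\,\sigma^2,
$$

so multiplying the plug-in variance by r/(r-1) removes the bias.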

Meanwhile, I shored up the arguments quite a bit and streamlined the discussion. I now have a section that is somewhat readable.

There are still some holes:

  • I've pointed out that the edge case where the sampled blocks return no rows is pathological when using the most straightforward estimators. I haven't shown that my enhanced estimators do any better.
  • I don't have any empirical results (or even a data set to get such results) for the most general case. That's my priority for tomorrow.
  • There are still a few gaps in the logic that may or may not need closing - I'll let my adviser make that call.
  • My list of references is still way shorter than I'd like.
So, there's work to do but, dang, this is starting to look like a real paper.

Saturday, March 24, 2018

Edge cases

I haven't been posting because, frankly, working out these distributions (and typesetting them) is consuming pretty much every free moment. I did derive a somewhat interesting result tonight that deserves some mention.

Because I don't want to make any assumptions about the underlying distributions, I'm doing all my variance calculations using method of moments (compute the first two moments from the sample and plug them back into the variance equation). I'm doing this on just about everything: the observation amounts, the hit rate, the number of blocks in a partition... For the most part, it is then just a matter of cranking out the formulas. Of course, it doesn't hurt to check your work by plugging in some numbers you know and seeing that you get what you expect.
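For the record, the plug-in recipe is just the standard raw-moment identity:

$$
\hat{m}_j = \frac{1}{n}\sum_{i=1}^{n} x_i^{\,j},
\qquad
\widehat{\operatorname{Var}}(X) = \hat{m}_2 - \hat{m}_1^{\,2}.
$$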

So, I did that. I set the number of partitions per block exactly equal to 1 and the variance of the block hit rate came out to exactly the variance of the partition hit rate. Except that it didn't. It came out to the population variance. But, if this is a sample, it really should come out to the sample variance. This isn't too terribly surprising. After all, I'm just matching up the moments, so there's no distinction made between a sample or population moment. Still, it's unsettling.

Then, I set the number of partitions to be exactly the row count for the block; basically my uniform case. And, again, it came out just right except that it was matching as a population rather than a sample.

This doesn't much matter when the sample gets large, but at small values it can throw things off quite a bit. I can adjust for it easily enough, but coming up with a mathematically sound way of doing that (as opposed to just messing with priors until it works) might be a bit of a trick. This seems like a good question for my adviser. It's entirely possible there's a standard trick that I just don't know about.

Thursday, March 22, 2018

What are the chances?

Somewhat off-topic post tonight.

I pulled up my usual weather app this morning and noted an interesting anomaly.

Why would the probability of rain be exactly 65% for such an extended period? (I'm guessing it's really 66.7%, or 2/3, but that the app rounds to the nearest 5%.)

The "chance of rain" is actually complete fiction. Pin any meteorologist down on this if you don't believe me. Is it the likelihood that it will be raining at that very instant? Is it the probability that some precipitation will be measured within some fixed time interval? Is it the odds you would take or give if betting? Is it the historical percentage of observed rain when similar pre-conditions are observed?

No, it's none of those. Why? Because most people don't have a clue how probability works. Look at the last presidential election. Nearly all the polls (and certainly the average of the polls) indicated that there was a 20-30% chance that Trump would win. Yet, people still marvel about how "wrong" the pollsters were. They weren't wrong. A 20-30% chance is the same chance a major league baseball player has of getting a hit when they come to the plate. Do people think that getting a hit in baseball is some sort of bizarre rare event? Of course not. It's just less likely than getting out.

But, if you tell someone that the chance of rain is 75% and they cancel their picnic plans, they'll be mad if it turns out to be a nice day. So, the weather folks play this silly game of trying to come up with a "probability" that conveys what they really mean, even though it has nothing to do with actual probability.

I can't know for sure, but my guess is that the algorithm that generates probabilities for the weather app I use has some sort of harness that prevents the chance of rain going beyond 2/3 in certain conditions because people will interpret that as "It's definitely going to rain."

Interestingly, a couple hours later, they had changed it to look a lot more like a real super-accurate prediction (any honest meteorologist will also admit that the difference between 50% rain and 60% is meaningless more than a few hours out). Clearly, the same algorithm is being used as the shape below the bound is identical, but the bound has been removed.