Monday, February 26, 2018

Not what I signed up for

I wasn't quite so arrogant as to think it would be easy, but, I have to say, I'm a bit taken aback by how much difficulty I'm having getting some decent research done. Yes, I'm old. Yes, I'm holding down a fairly stressful day job. Yes, I have a family to tend to. Excuses are not the makers of greatness.

Actually, I'm not even after greatness. If four years ago someone had told me they were looking at my topic in pursuit of a PhD, I would have thought it was credible research, but certainly wouldn't have been overly impressed. Most open problems are open not because they are super hard but because nobody has gotten around to them yet. Mine certainly falls into the latter category.

But, here I am struggling to get anything of merit out of my research. It's a little depressing. I have to say I feel like the dumbest student on campus right now. (Well, OK, the dumbest PhD student. Some of the undergrads are clueless in ways I can't even describe). That said, it's also somewhat refreshing. Sure, making our dates at work is hard. But, there's never any doubt that we can actually do it. Ultrarunning carries a modest risk of failure, but it's not like the stakes are particularly high. I don't like it when I don't run well, but nobody else cares.

Having an activity that comes at great cost and where failure is a very real possibility is not a common thing for me.

Sunday, February 25, 2018

New information

So, I've spent the last few days working on the correlated bootstrap. Yes, I cursed this effort by saying it would be "easy" once I had the kernel distribution worked out. I actually decided to plow ahead without the kernel distribution hoping that might give me some insight on what the kernel should look like. It didn't. But it did uncover something that I'm not having much luck explaining.

First, the method. Sample r blocks and get all the partitions. Then run a bootstrap where you keep re-sampling from the partitions with replacement until you are back to your sample size. (I also tried the simpler version where rather than partitions, you just use blocks - it didn't make much difference either way, though the partition samples were a little smoother). That gives you a distribution on your sample sum. Now, just scale that up to your total size (multiply by N/r) and that gives you the distribution of your total sum estimator. Find the 2.5 and 97.5 percentiles and that's your 95% confidence interval.
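For my own reference, here's a minimal sketch of the block-level version of that loop in Python (the partition version just swaps in partition statistics and tracks row counts; names like block_sums and n_boot are mine, not anything from the simulator):

```python
import numpy as np

def bootstrap_total_ci(block_sums, N, n_boot=2000, alpha=0.05, rng=None):
    """Correlated bootstrap, block-level version.

    block_sums : sums of the r sampled blocks
    N          : total number of blocks in the table
    Returns the (alpha/2, 1 - alpha/2) percentile interval for the total sum.
    """
    rng = np.random.default_rng() if rng is None else rng
    block_sums = np.asarray(block_sums, dtype=float)
    r = len(block_sums)
    # resample r blocks with replacement, then scale the sample sum up by N/r
    idx = rng.integers(0, r, size=(n_boot, r))
    totals = block_sums[idx].sum(axis=1) * (N / r)
    return np.quantile(totals, [alpha / 2, 1 - alpha / 2])
```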

Except it isn't.

If we plot the number of samples where the confidence interval does not contain the true sum, we get this:

Yeah, so the "confidence" interval doesn't seem to have much to do with our confidence. Note that this is an empirical distribution, so it's not like we've got a parameter wrong somewhere. The empirical distribution itself is way too narrow when the sample is small and too wide when it gets larger. To see how much, I scaled each confidence interval by a factor so that 5% of the intervals would miss the mark. I then tried to find a relationship between that factor and the sample size. From the shape of the graph, a reciprocal relationship seemed a good bet and, indeed, it fits fairly well.
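The fit itself is nothing fancy; given the scale factors from the simulation runs, it's just least squares against 1/r (the function and argument names here are placeholders of mine):

```python
import numpy as np

def fit_reciprocal(r_values, scale_factors):
    """Fit scale ~ a/r + c by least squares on x = 1/r.
    r_values and scale_factors come from the simulation runs (not shown here)."""
    x = 1.0 / np.asarray(r_values, dtype=float)
    a, c = np.polyfit(x, np.asarray(scale_factors, dtype=float), 1)
    return a, c
```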


 I could just move on, but I'd really like to understand what's going on here because accurately simulating distributions is a pretty important part of sampling. And, I can't just write in the paper, "Then divide by r for no apparent reason and it works."

So, why would the empirical distribution be squeezed by the reciprocal of the sample size? And, why does the relationship continue to hold, even as the confidence interval goes from being too narrow to too wide (one would expect it to be asymptotic)? Finally, 1/r doesn't do anything different when the number of blocks, N, goes to infinity. I need a formula that converges to 1 when N >> r.

That last bit got me thinking this has something to do with the fact that the relationship between the block variance and the true sum is different in the finite and infinite cases. Recall that the relationship between the block variance and the estimator error in the finite case is (N/r)(N-r). In the infinite case, where you're not really estimating the sum, but rather the population mean scaled up by a factor of N (we're treating block sums as atomic observations for the moment), the relationship between the block variance and the estimator error is N². When you divide those two, you get (1/r)(N-r)/N. When N is fixed, that gives the same reciprocal arrangement as above. When r is fixed and N goes to infinity, this converges to 1. So, it looks good.
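Spelling out that division (with σ² standing for the block-sum variance):

```latex
\frac{\frac{N}{r}(N-r)\,\sigma^2}{N^2\,\sigma^2} \;=\; \frac{1}{r}\cdot\frac{N-r}{N}
```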

I'm not quite ready to give out high fives though. First and foremost, I shouldn't be expanding the confidence interval by the size of the variance, I should be expanding it by the standard deviation. The relationship to the square root of the reciprocal mitigates the annoying kink on the left, but the right hand side is looking like it might not fit. At any rate, the graph is hardly proof of anything. I need to derive the actual bootstrap distribution.


The casualty in all this is that I haven't done any of the work on the Dirichlet stuff, other than reading about it. That's not really the message I wanted to take into tomorrow's meeting with my adviser, but at least I have an algorithm that works (albeit, inexplicably).

Friday, February 23, 2018

Well, it's wider

...and at least it's the right order of magnitude. Still working on the bootstrap sampler. It's definitely NOT coming out the way I thought it would. Lots more to do this weekend.

Wednesday, February 21, 2018

Fixing the methods

So, how do we actually go about "fixing" the three approaches to evaluating our estimator in the correlated case? There are many ways of course, but here's what I'm going to try.

In the case of simple block sampling, the problem is that the distribution of the sample variance doesn't match the χ² distribution. That means the estimator doesn't match the t-distribution. So, we need a way to estimate the actual distribution of the variance so we can then estimate the distribution of the estimator. (Lots of estimation!) To estimate the variance distribution, I am going to use kernel density estimation (KDE). This is a technique where the data points are used as centers of some kernel density. You then average all these together to get an estimate of the real density. It's like a histogram, except it can be made much smoother. (Actually, a histogram is a special case of KDE where the kernel maps the observation to the center of the histogram bucket containing the data point with probability 1). The smoothness is useful if you want to evaluate the probability of an arbitrary interval. If you don't need that, a histogram works pretty well.

Since I already have a derivation of the variance as a function of the hit rate, I'm going to use some variant of the Beta distribution as my kernel to estimate not the variance, but the hit rate. That can be run through the variance formula to get a distribution on the block variance, which in turn gives an estimation of the distribution of Sr, the estimator of the total sum. There's actually not a lot of math involved in all that. It's simpler to just resort to numeric methods for the integrals.
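To fix ideas, here's roughly what a Beta-kernel density estimate on the hit rates could look like (the concentration parameter is a stand-in for whatever bandwidth choice I end up with, and the hit rates in the last line are made up purely for illustration):

```python
import numpy as np
from scipy.stats import beta

def beta_kde(hit_rates, grid, concentration=50.0):
    """Beta-kernel density estimate for hit rates on [0, 1].

    Each observed hit rate p contributes a Beta(p*c, (1-p)*c) kernel, so the
    estimate stays supported on [0, 1]. The concentration c plays the role of
    the bandwidth (bigger = spikier).
    """
    dens = np.zeros_like(grid, dtype=float)
    for p in np.asarray(hit_rates, dtype=float):
        a = p * concentration + 1e-3           # keep both parameters strictly positive
        b = (1 - p) * concentration + 1e-3
        dens += beta.pdf(grid, a, b)
    return dens / len(hit_rates)

grid = np.linspace(0, 1, 201)
density = beta_kde([0.02, 0.05, 0.40, 0.45, 0.50], grid)   # illustrative hit rates only
```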

Next up is bootstrap. This one is fairly easy given the above work. Rather than bother computing the kernel density, just resample using the raw hit rates. This is way more efficient than traditional oversampling because we're only saving the sufficient statistics (hit rate, and the first two moments of the observation) from each partition and how many rows were in each partition. We can then generate a multinomial vector with dimension equal to the number of partitions sampled and generate a random sum for each element of the vector rather than generating millions of individual observations and adding them up.
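A rough sketch of that resampling step (the normal approximation for each partition's contribution is my shortcut here, not something I've settled on, and the dict keys are placeholders):

```python
import numpy as np

def fast_bootstrap_sums(parts, n_rows, n_boot=1000, rng=None):
    """Bootstrap the sample sum from per-partition sufficient statistics.

    parts : list of dicts with keys 'rows', 'hit_rate', 'mean', 'var'
            (row count, hit rate, and first two moments of the relevant values)
    n_rows: total rows in the original sample (what each resample is built back up to)
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.array([d['rows'] for d in parts], dtype=float)
    p /= p.sum()
    sums = np.empty(n_boot)
    for i in range(n_boot):
        counts = rng.multinomial(n_rows, p)          # rows drawn from each partition
        total = 0.0
        for c, d in zip(counts, parts):
            k = rng.binomial(c, d['hit_rate'])       # how many of those rows are relevant
            if k > 0:
                # sum of k relevant values: Normal(k*mean, k*var) approximation (my assumption)
                total += rng.normal(k * d['mean'], np.sqrt(k * d['var']))
        sums[i] = total
    return sums
```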

Finally, we have the MCMC idea, which is basically where the Dirichlet stuff comes in. We gather the same sufficient statistics as above and run them through a Dirichlet process to get a distribution on the distribution for each partition. Yes, a distribution of distributions. One could derive the distribution of the estimator from that, but here's the wrinkle I want to throw in to tie all this stuff together: use that distribution to run the bootstrap from above. So, you have a Dirichlet Process to boil the data down to a non-parametric distribution and then you use a bootstrap process to generate a distribution of the estimator. Meanwhile, the estimator itself is still the leveraged sample sum.
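Just to keep the idea straight in my head, here's the flavor of that last step, using plain Dirichlet weights over the sampled partitions (a Bayesian-bootstrap-style stand-in, not the full Dirichlet process machinery; all the names are mine):

```python
import numpy as np

def dirichlet_bootstrap_totals(part_sums, N_blocks, r_blocks, n_draws=1000, rng=None):
    """Smoothed, Bayesian-bootstrap-style version of the resampling step:
    draw Dirichlet(1,...,1) weights over the sampled partitions instead of
    multinomial resampling, then scale up by N/r as before."""
    rng = np.random.default_rng() if rng is None else rng
    part_sums = np.asarray(part_sums, dtype=float)
    k = len(part_sums)
    w = rng.dirichlet(np.ones(k), size=n_draws)      # one weight vector per draw
    sample_sums = k * (w @ part_sums)                # smooth analogue of resampling k partitions
    return sample_sums * (N_blocks / r_blocks)       # scale up to the full table
```

The real version would swap those flat Dirichlet weights for whatever comes out of the DP posterior over the partition distributions.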

Tomorrow, I'm going to modify the simulator to produce all this stuff. If it works, I'm going to have A LOT of writing to do this weekend.

Tuesday, February 20, 2018

More thoughts on part 4

So, I got a little myopic in my treatment of what we're calling the "uniform" case of the sampling (this is the case where the relevant rows of the query are scattered about the data with no discernible correlation). In those cases, simple block sampling, BLB, and BMH are all comparable in terms of results, so you'd probably just go with the easy one. It's also fairly easy to show that BLB and BMH are substantially worse than simple block sampling once correlation is introduced. That's not really a surprise as both methods state up front that they are designed for iid observations.

It then occurred to me that I wasn't really doing either of those methods in the uniform case. I had to modify them to work with a finite population. I have to modify them a whole lot more to work with a correlated finite population. So, maybe I'm not really using those methods at all, but rather using them as inspiration for methods that work in the finite case.

Ah, patterns! Now we're getting somewhere. There are lots of ways to build estimators, but not nearly as many ways to evaluate the quality of those estimators. That's what we're really after. The simple block estimator is always the best estimator. But, we need some way to know how good it is so we can create a stopping rule for our sampling. So, how do we do that? There are three options: derivation, bootstrapping, and simulation.

Derivation is fairly easy in the uniform case. Not so much in the correlated case. There's just too much stuff we don't know.

Bootstrapping works in both cases if you can make some assumptions about the kernel distribution (which may or may not be valid).

Simulation, particularly Markov chain Monte Carlo (MCMC), is the most flexible, in that you can let the data drive the model rather than the other way around, but it also incurs the most cost.

So, simple block sampling using the sample variance is a good example of the derivation method. If the data is iid, the exact distribution of the estimator can be stated. And, it's easy to show that it breaks down in the correlated case. It's also easy to show that if you can make some fairly restrictive assumptions about the data, you can save it. But, it's really a stretch.

I had been using Bag of Little Bootstraps (BLB) as the starting point for the bootstrap method, and that is a good entry. However, by the time we've accounted for correlation, we have something that's quite different from BLB. It's still very much bootstrapping, though, so the flow of the paper works just fine if we cite BLB as an inspiration rather than trying to claim that it's what we're actually doing. FWIW, the modified bootstrap works OK with somewhat less restrictive assumptions.

Bootstrap Metropolis-Hastings is also better viewed as an entry point rather than the actual method we are using. In the correlated case, we have a lot more flexibility and that is helpful.

All this makes for a fairly cohesive part 4 (which I'm trying to get written this week, but it's increasingly looking like it will spill into the weekend).

What I'm really excited about is that it also sets up a really obvious use of the Dirichlet process. The upshot of this is an estimator that is derived to be the MVUE and assessed with a bootstrap process driven off results from an MCMC process. It ties it all together in part 5. I think this will make for a really cohesive paper.

Monday, February 19, 2018

"Final" outline

Well, my adviser says he's OK with it, and I'm certainly not looking to add anything.

1. Introduction
2. Sampling Framework and Notation
2.1 Data organization (brief description of storage schemas)
2.2 Formal problem statement
3. Estimating the Sum in the Uniform Case
3.1 Simple block sums
3.2 BLB
3.3 BMH
4. Estimating the Sum in the Correlated Case
4.1 Impact of correlation on block sum variance
4.2 Difficulties with BLB and BMH
5. Constructing an estimator distribution using a Dirichlet Process
(might have some sub-sections here; this part isn't close to done)
6. Results and discussion


Saturday, February 17, 2018

Rocky Raccoon 100

When I bookmarked Rocky Raccoon on my training plan last year, I added the comment, "Get back on the horse." I don't like to let failures linger. Leadville did not go well and I didn't want to lose confidence that I could run 100's. Of course, Southern Texas doesn't have anything on the order of Hope Pass, but 100 miles is still a long way and the real obstacle to finishing a 100 is not the physical task, but what goes on in your head. At some point in the race, you will question whether you can finish. If you feel that way at the start, the answer is almost certainly no. So, throwing a confidence builder in here and there is a sound strategy.

Of course, the soundness of a strategy greatly depends on the circumstances. In December, I fell on a trail and badly damaged my shin. Three weeks later, it felt OK, so I ran Little Woods. My shin didn't bother me at all during that event but the following week it became obvious that this was not a superficial injury. Even walking was a problem. I decided to stop running altogether during the four weeks remaining before Rocky and hope for the best when I got to Texas.

St. Louis being what it is, I don't have to take the time completely off. As the injury is only aggravated by impact rather than effort, I am able to take advantage of some mild days and get in some bike riding. I also do a few workouts on the elliptical trainer. But, basically, it's a major layoff. I've never tried running an ultra coming directly off an injury but, as Kate and I board the plane for Texas, I feel like I'm sufficiently healed to make this a somewhat less than insane endeavor. Still, I don't kid myself that this will be a confidence builder; it will be tough.


We stay at the home of Kelly and Adam Brown, friends of ours who moved from St. Louis to Texas a few years back. Their house is less than an hour from the race, so I don't camp on site. I drive up on Friday afternoon to pick up my number (one less thing to do in the wee hours before the start) and also get a look at the trails. They are gently rolling, picturesque, and really, really hazardous. An army of volunteers has spent thousands of hours trying to rehab the trail network after last year's hurricane and they've done their work well. But, there's just no getting around the fact that the combination of fine soil, shallow roots, and 60" of rain has turned the paths into minefields.

Some of the trails are still completely impassable, so the organizers have re-routed the course. Rather than five 20-mile loops, we'll be doing four 25-mile "loops". It's not really a loop, more of a badly mangled "T" with the start/finish at the base and then out-and-backs to each end of the T and finally back to the base. I generally don't like out-and-backs on trail because passing can be an issue. However, the trails here are a bit wider than true singletrack and passing shouldn't be a problem.

 Map of course

I arrive race morning at 4:45AM and stress about where to leave my main drop "bag" (it's actually the Browns' cooler, filled with food, drink, lights, clothing, and spare shoes). The start/finish is the easy answer, but I wonder if the Nature Center aid station, four miles in and before the top of the "T", might be a better spot. It would give me access 8 times during the race rather than 4. The downside is that if I drop my stuff here, I can't make any changes. I finally decide that the safe bet is the start/finish. That way I can change my mind about gear right before the start if I want.

It's still completely dark as we take the start at 6AM. Within the first few hundred meters it's obvious that going with the little headlamp instead of the high-power handheld light was a mistake. I can see where I'm going, but making out the individual roots is difficult. I'm not the only one having trouble with it; a runner just in front of me goes down not once, but twice in the first two miles. He's in for a really long day at that rate. Since the trail is slightly wider than singletrack, I adopt a strategy of running just off the shoulder of whatever nearby runner has the best light. I also switch to my "technical" stride which I use when blasting through short sections of really rough trail or when I'm running straight through the woods in orienteering meets. This stride has me up on my toes and it's pretty effective at avoiding trips. The drawback is that, for a devout heel striker like me, it puts the strain on muscles that don't get as much endurance training; I can only do it for an hour or so.

Indulge me a brief diversion which I promise has a point. I often get a lot of comments on these race reports about how detailed they are. Some folks wonder how I can remember all that. Well, I'm sure I can't and I get details wrong all the time. But, as for what I do remember, I keep it in my head by writing the race report during the race. This only works in ultras where the pace doesn't demand concentration right from the start and I often change it quite a bit once I know how the story ends, but composing a first draft is a nice way to pass the early, easy miles when my brain is still looking for something to do.

The point of this is that I "wrote" those words about switching to my technical stride, including the bit about only being able to do it for an hour at the time I was doing it. You might ask why I thought I could get away with a 1-hour effort in a race that lasts twenty times that long. Yeah, you might ask that.

After about an hour on the trails, we hit the dirt road for the short part of the "T". This is just a little over a mile out and back to the Gate aid station. Even though the road is easy running, I take a walk break to give my legs a rest from prancing over the roots. The sky brightens shortly before reaching the aid station. Given that making it to 57 miles before dark was a pretty sure thing, this would have been a great place to drop the big light had I started with it. Oh, well, if that's the biggest mistake I make all day, it's been a good race. (Again, I wrote those words at the time knowing full well I had just trashed my legs).

Leaving the station, a light rain begins to fall. This, I'm actually prepared for. It's not particularly cold (high 40's), but I don't want to be wearing a soaked jersey all day. I pull out my rain shell, an act I will repeat a dozen more times as spotty showers continue for the rest of the race.

Just after passing the junction onto the long part of the "T", we again get back on trail. The roots aren't quite as bad here (or, maybe, they are just a lot easier to navigate in the daylight). At any rate, this 3-mile section between the Gate and Damnation aid stations is pretty easy. That changes after Damnation. The out and back to the Far Side turnaround is the most technical part of the course. I don't make any attempt to run it fast, dropping into an easy rhythm just fast enough to keep my stride from becoming sloppy.

I strike up a conversation with a fellow named Jack. He's very erudite and engaging. He's eager both to share his own thoughts on running and to hear mine. He properly uses terms like meta-cognitive and experiential disassociation. I'm very much a hard sciences guy, but I always enjoy wandering over to the social sciences building to strike up conversations just because the folks there are so good at framing vague but important ideas. We don't talk backgrounds, just running, but I conclude he's got at least one degree in Psychology. Maybe several.

Alas, Jack is less inclined to walk breaks than I and, given my pre-dawn indiscretion, I can't be skipping those. So, after we pass back through Damnation, I have to let him go. I strike up another conversation with a guy named Brian who is also quite interesting, though for a much different reason. Whereas Jack was all about why we like to run, Brian talks about what it's like to run. He's from Houston, so he's particularly well informed about the local ultra scene. We stay together all the way back to the start/finish while he describes in detail the unique perils of running ultras in Southern Texas. Basically, it comes down to this: the footing sucks and it can be really hot. I'm not even to marathon distance in my first Lone Star State ultra and I've already figured that out, but it's fun to hear the anecdotes he offers to back up the claim.

I finish the lap at 10:24AM. That's about the lap time I was expecting, in line with a finish of around 20 hours, but it's come at a cost. The feeling in my quads and hamstrings is similar to the tightness I felt at the same point in Potawatomi. While the second half of that race was three hours longer than the first, I generally regard it as one of my best races because it was such a good save of a race that threatened to go very badly. So, I try the same strategy that worked there: rather than push through the sections that are trashing my legs, I'll use them as my walk breaks, even though they are objectively fast sections of trail. This means I'll probably run a lot more of the uphills, but the biggest hill here is only around 100 feet vertical and none of the terrain is steep.

Implementing that strategy is not as straightforward as one might think. Between each aid station, there is one significant section of contiguous runnable trail (or dirt road). That leaves about 75% of the distance littered with roots, some presenting bigger hazards than others. Determining when to walk and when to run is not always obvious. I err on the conservative side, resulting in a lap that's a full hour slower than the first. On the upside, my quads don't feel any worse than they did at the start of the lap, I haven't fallen even once, and my injured leg is doing fine.

I stop just long enough to grab my big light and a bunch more food. I'll definitely be needing both. Even though I haven't been running hard, the cold, wet conditions require the body to burn more just to stay warm. I press just a bit for the first six miles, hoping to make it to the dirt road to Gate before needing my light. The rain stops for a while and the clouds get a bit thinner, resulting in just a bit more twilight. I end up getting through Gate and halfway to Damnation before switching on the lamp. I then take an extended walk break the rest of the way to the aid station.

One of the fun things about out and back courses is that you get to see what the leaders are up to. On lap courses, you might get to see them come by in the same direction as well. I've managed to avoid that in previous 100's by staying on the lead lap. Not today. Ronnie Delzer comes by still looking pretty spry for a guy with 85 miles in his legs. This isn't really a surprise, I knew the winning time would be under 15 hours and I was pretty sure that 20 was a lower bound on my own finish. I had hoped to at least be on my way back from the Far Side turn when it happened, though.

Shortly before getting to the turn, I'm passed again. This time it's Sabrina Little and, if anything, she seems to be in better form than Ronnie (though she's too far behind to catch him). This is also not much of a surprise. The advantages of testosterone are increased muscle mass and quicker recovery from hard efforts. Neither of those count for much in a 100, so the top women are often in the mix for the overall win. I manage to finish the lap before the third runner catches me.

Things are not going well, however. My legs are still OK, but I'm really starting to feel bad inside. I'm still eating, but I can feel my blood sugar crashing. This usually happens around the 100K mark. Tonight, probably because I backed off so much, it's hitting closer to 70 miles, but it's still hitting pretty hard. I walk the entire four miles from the Nature Center aid station back to the finish. I try a caffeine pill. A gel. Nothing seems to help. My mind has checked out and my body just wants to stop.

Objectively, I've DNF'd three running races in my life. All of them 100's and each time, it happened in the third quarter (which is the toughest part of any race in my opinion). One was due to a legitimate injury. Another was missing the time cut at Leadville last summer. I don't really count those two as DNF's because continuing wasn't really an option. The one that gets me is the completely voluntary withdrawal at Mark Twain while leading the race. Ok, technically, I had dropped to third by the time I actually made it to the next aid station and quit, but I was winning when I decided that I simply couldn't go on. Granted, I had just emptied my insides on the trail, was too dizzy to walk straight, and the remaining miles would have been a nightmare given that I had clearly overcooked it to stay on the lead, but, still, dude: you quit while leading?  Who does that?

My decision making gets pretty fuzzy around the 10th hour of activity. From then on, I really operate off of habit. So, after that decision to quit, I made a conscious effort to squash that thought any and every time it occurs to me. I can't say that I've been entirely successful in that but, for the most part, the internal discussion of whether to continue gets cut very short. Fighting it back today, when what was supposed to be an "easy hundred" is turning into an epic, requires more than habit. I need some hook to hang this thought on and I find a familiar one: vanity.

Do I really want to go back to the Browns' and meet a bunch of people I don't know at their Super Bowl party and introduce myself as a guy who just failed at an activity they already think is crazy? No, that would be much worse than grinding this thing out. I carry on.

At the start/finish, the med staff is concerned. Apparently I look even worse than I feel. The lap time was a whopping seven hours, so even a sub-24 is pretty unlikely. Therefore, I don't make a fuss when they suggest I step into the medical tent for a bit and try to get myself composed.

I'm not sure what the actual credentials of the med staff are, but they certainly seem to know what they are doing. Across from me, another runner is getting some very good care for some awkwardly placed foot blisters. Another is getting a massage to relieve some cramps. All I need is rest, food, and drink, so I'm a low-priority case, but they eagerly bring me what I need. I'm very cognizant of the perils of sitting down for too long, especially at the start/finish where I could simply walk the 100 meters to my car and be done with the whole thing. Still, I'm really sleepy, so I ask them to wake me in 20 minutes and nod off.

I awake with a jolt. Something inside me registers that I'm not supposed to be sleeping right now. Maybe it's the memory of the time I literally fell asleep on trail at Kettle Moraine (fortunately while walking, so the ensuing fall was more humorous than injurious). The aid station workers get a laugh from my startled expression and tell me I still had two minutes left. Alert again, I take my time getting back on my feet and moving. The EMT who's been paying the most attention to me is worried that I might break down on the trail and be susceptible to hypothermia. I note that the first part of the course is never more than half a mile from a road, so I have a few miles to sort things out. I promise that if I'm not moving OK by the Nature Center, I'll at least take another break. I head out shortly after midnight. Walking. Slowly.

By the Nature Center aid station, I'm feeling like I could try easy running again, but I might trip. So, I continue walking until I hit the dirt road to Gate. I run that both ways and then walk the rest of the way to Damnation. There, I take another short break. Partly because I could use it, but also because I'm not entirely sure my light can make it all the way to sunrise. I was expecting to be done by now so I figured I could get by on the 3-cell and 6-cell batteries I own. When new, they easily combined for around 15 hours of light, but they're getting old and may have faded. I have the backup light I used this morning, but I already know how useless that is on this terrain.

I walk most of the Far Side section, taking another short break at the turnaround. The sky brightens shortly before I get back to Damnation. A competitive finish vanished long ago and, with only seven miles to go, I'm in no danger of missing the time cutoffs. So, there's no particular rush. Several runners come scrambling through the other direction knowing they have to be on their way to Far Side before 7:30AM or they will be pulled. I hang at the aid station for a while offering encouragement while enjoying some pancakes. Shortly after they close the trail to Far Side, I'm on my way back to the finish.

I'm through my troubles now and, with the last shower passing and sky clearing, I feel like running again. I run the rest of the way in and finish a few minutes after 9AM. It's my slowest 100 by over four hours (though I was obviously on a slower pace when I missed the cut at Leadville). I don't really feel bad about that. The point of this one was not to break any records, but simply to get in a finish after getting yanked at Leadville. The med staff seems particularly happy to see me return. Obviously, their priority is treating the people who are too jacked up to continue, but I'm sure they get more joy in seeing their care result in a finish that otherwise might not have happened.

Then, I take my shoes off. I knew the wet conditions and extended walking were giving me blisters, but I am pretty surprised by just how bad they are. A shoe change (or two) was probably in order between laps. Ah, well, I wasn't planning on running for the next week or so anyway. Those who are squeamish might not want to look at the final photo below.

It's always dangerous to make future plans in the weeks immediately following an event like this. On the one hand, one seeks redemption. This really isn't a particularly difficult course, even with the cold rain and hurricane damage. I think I could run well here if I was a bit smarter about the early miles and brought some better fitness. Then again, there are so many good races out there. I don't want to do a bunch of repeats. Finally, there's the reality that 100 miles is just not a good distance for me. I'm much better at 50's.

So, I don't know if I'll ever come back, but my gut feel is probably not. I came up short, but I didn't quit. The question has been answered in the affirmative. Getting back on the horse is no guarantee that you won't get thrown again. It's just what you do.




Friday, February 16, 2018

Camel

Yesterday, I claimed that the shape of the sample variance distribution was "wrong". A reasonable question to ask then is, "how so?" Thanks to my handy-dandy simulator, getting an answer to that is relatively easy.

First, we'll state the theoretical "right" shape. If the block sums were independent, identically distributed according to a normal distribution (none of those things are true for any sample, but let's move on), the normed sample variance is chi-squared. (By normed, I mean that we're looking at the sample variance as a proportion of the population variance and adjusting for the sample size).
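Concretely, for r i.i.d. normal block sums with true variance σ² and sample variance s², the standard result is:

```latex
\frac{(r-1)\,s^2}{\sigma^2} \;\sim\; \chi^2_{r-1}
```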

A shocking amount of public policy gets built around research where those assumptions go unquestioned. True iid samples are about as common as hen's teeth. But, fortunately for the world's sloppy researchers, the violations of those assumptions generally don't get you into too much trouble, especially when you are dealing with averages and sums, both of which converge to normal if you make your sample large enough.

But, that's the rub: generally doesn't mean always. And, in the case of correlated query sums, it just ain't so. I ran the simulator for r=5 (that is, I pulled 1000 samples of 5 blocks each) from both the uniform data and the correlated batch data. In the case of the uniform data, things line up really nicely, even for such a small sample size. The correlation between blocks isn't enough to mess up the iid assumption and the sums themselves are, in fact, normally distributed.


The "theory" line shows the Chi-square density with r-1 = 4 degrees of freedom. When we look at the correlated data, well, it's something quite different.

Yikes! It's not even uni-modal, much less properly skewed. Granted, this is for a very small sample size, but the effect doesn't really go away until many blocks have been sampled. Note that the mean is 4 (well, actually 3.85, but that's just sampling error), just like the uniform case. The right-side mode doesn't change things too much, as in that case the variance estimate is very high, so the confidence interval will be wide and we'll keep on sampling. It's the left-side mode that's the killer. This yields an artificially narrow confidence interval. Any stopping rule based on that will be prone to shutting down too soon.
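For reference, the flavor of that check (the batch-correlated population here is a toy stand-in, not the simulator's actual generator):

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, n_sims = 1000, 5, 1000

# Toy batch-correlated population: most blocks hold no relevant rows, one
# contiguous run of blocks holds them all (a stand-in for the real generator).
block_sums = np.zeros(N)
block_sums[300:400] = rng.normal(50.0, 10.0, size=100)
sigma2 = block_sums.var()                               # population variance of block sums

normed = np.empty(n_sims)
for i in range(n_sims):
    sample = rng.choice(block_sums, size=r, replace=False)
    normed[i] = (r - 1) * sample.var(ddof=1) / sigma2   # compare against chi-square, r-1 df
print(normed.mean())   # lands near r-1, but the histogram looks nothing like chi-square
```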

Of course, that can happen in the uniform case as well, but the t-statistic accounts for that and keeps the overall chance of shutting down early at 5% (or whatever Type-I risk you want to accept). In the correlated case, even with the adjustment for sample size, you're still going to think you're good when you're not, far too often.

Another reasonable question to ask is: "Since it's clearly not chi-squared, what is it?" That's a tougher question to answer. I'm sure I could grind out the exact distribution, but there really isn't much point. We have to come up with another way to prevent early termination of the sampling, and that's what the more sophisticated methods will be used for.

Thursday, February 15, 2018

Out of shape

Not me. Well, maybe I am a bit, but I'm talking about the shape of the distribution of the sample variance of block sums when observations are correlated. I ran a simulation today to illustrate how simple block sampling doesn't work when the results are correlated. You'd think it would. Sure, the variance is higher, but you're still sampling sums from a finite population of blocks. As long as you do that, the total sum will converge to normal and the sample variance will be an unbiased estimator of the true variance. Yes, it's a sample variance, not a true variance, so we have to use a t-distribution instead of a Normal for our confidence interval (not that there's much difference after we've taken a decent-sized sample). Even so, why do we get this?


The horizontal axis is number of blocks sampled out of a population of 1000. Note the uniform sample correctly creates a 95% confidence interval that includes the true sum about 95% of the time. Not so when they are correlated. We have to sample many more blocks before our confidence interval correctly reflects the variability of our estimate.
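The coverage check itself is simple enough; roughly this, with the textbook finite population correction (the function and variable names are mine):

```python
import numpy as np
from scipy import stats

def coverage(block_sums, r, n_sims=2000, rng=None):
    """Fraction of simulated samples whose 95% t-interval for the total sum
    actually contains the true total (standard finite population correction)."""
    rng = np.random.default_rng() if rng is None else rng
    block_sums = np.asarray(block_sums, dtype=float)
    N = len(block_sums)
    true_total = block_sums.sum()
    t = stats.t.ppf(0.975, df=r - 1)
    hits = 0
    for _ in range(n_sims):
        s = rng.choice(block_sums, size=r, replace=False)
        est = N * s.mean()                                          # scaled block-sum estimator
        se = N * s.std(ddof=1) / np.sqrt(r) * np.sqrt((N - r) / N)  # standard error with fpc
        hits += abs(true_total - est) <= t * se
    return hits / n_sims
```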

It's not that the variance estimate is biased - the mean of the simulated estimates was pretty close at all sample sizes. The problem is the distribution of the sample variance. When a sample gets a block with no (or hardly any) rows relevant to the query, the variance estimate is understated. In the more common case where you don't get a zero block, the variance is slightly overstated, so the mean comes out fine. But those understated cases give you a false sense of security that you have a reasonably tight estimate of the sum when, in fact, you don't. Any stopping rule based on that would be very prone to stopping early.

The distinction is really important. As I posted a few weeks ago, the estimator of the total sum based on the block sums converges to normal very quickly, even when the underlying distribution is crazy skewed. It's the distribution of the sample variance that's messing us up. Why is that important? Because the thing random block sampling doesn't do is look at the underlying distribution. This helps motivate our study of methods that dig deeper into the shape of the data than just recording the first two moments.

Tuesday, February 13, 2018

Putting the re back in research

As in, search the searching that other people have done.

I may have got a bit ahead of myself. When I started working on CISS two years ago, the path forward seemed very clear. On that particular item, it still does. It will make a great white paper published on some trade site. It probably wouldn't make it through peer review, and I doubt I'll bother adding a bunch of unnecessary theory to fix that. I can insert it as a chapter in the dissertation or just let it stand on its own in the professional literature.

The problem is that I was spending all my directed readings classes working on that instead of doing what I should have been doing: directed readings. Now that I'm knee deep in theory, I'm coming up short. So, some actual research is in order. I think I have enough (with a bit of cramming) to get this paper out the door. After that, I'm going to need to put my own ideas down for a few months and really dig into the academic literature.

Monday, February 12, 2018

Scope limits

Another good meeting with my adviser today, including lots of new avenues for exploration. Fortunately, none of those need to be included in the current iteration of "the paper" (I don't know what else to call it since the original title no longer applies). We basically put some limits around what we're going to put in. I need to work fast so I can hit them before the boundaries get moved.

Some of the new topics are interesting. They include:


  • Defining a multinomial distribution to describe the block/batch correlation. This might just be a re-parameterization of an existing distribution that I discovered in the literature last week.
  • Using the Dirichlet process to drive/assess D-BESt.
  • Finding more efficient ways of running the Dirichlet process (or, more promising, finding ways to pre-compute it).
All fun stuff, but stuff I'm not going to think about until we get this paper out the door.

Sunday, February 11, 2018

Finite comprehension

Last week, my adviser pointed out an error in my variance calculation when using the Bag of Little Bootstraps (BLB) method. That didn't bother me much. As I noted yesterday, I get details wrong all the time. Fix 'em and get on with life.

What bothered me was that, if he was right, my straightforward estimate from random block sampling would have a higher variance than BLB. I was sure that the scaled block average was a minimum variance unbiased estimator (MVUE). Granted, bootstrap estimators are a bit whackadoodle, being non-deterministic and all, but they are still estimators based on the data and if this one was coming in lower, then the scaled block sum couldn't possibly be the MVUE.

So, back to a very careful pass through the derivation followed by actually simulating the results to make sure they were matching theory. The problem, it turns out, is the whole confusion you get with bootstrap estimators over what it is you're actually estimating.

Normally, the sample is assumed to be a finite sample from an infinite population. You fold the sample over on itself to generate a bunch of samples and that lets you get some idea of what the variance should be on any sample from that finite population. BLB obfuscates that even more because now we have all these little sub-samples which themselves get bootstrapped. Under that model, my variance calc was wrong.

But, that's not what we're doing. The "sample" is actually the population. The population sum is not a random variable; it's a population parameter. Even in the Bayesian world, one has to concede that there is no actual randomness in the sum. It is what it is. So, what we're really looking at is bootstrapping the block sums to create an estimator and then looking at the variance of that estimator around its mean plus the variance of the mean of that estimator. The first part is the block variance scaled up to the size of the total population.

I simulated 1000 blocks and then ran the BLB on each block. The variance of the block sums was 457 and the average variance of the estimators was 464934, or 1017 times as much. So far, things are checking out just fine. But, we're not done. That's just how much variance we expect in the empirical distribution of the bootstrap samples. Since the bootstrap samples are all taken from the same block, the expected variance is (n/b)σ² = Nσ², where n is the total number of rows, b is the block size, and N is the total number of blocks. So, if I'm wanting to estimate the block variance, I just divide my empirical variance by the number of blocks. Great, that was the whole point: use BLB to get the block variance.

Now, let's look at the variance of our estimator, Uj. Uj is the sum of n observations sampled with replacement from block j. Pretty clearly, E(Uj) = NSj, where Sj is the block sum. We also know that NSj is an unbiased estimator of the total sum S (the MVUE based on Sj, at that). Let's take the variance of the difference:

Var(S - NSj) = Var(S) + N²Var(Sj) = 0 + N²σ²

Wait a minute, that's literally a thousand times greater (since N=1000 in my simulation)! Yup, and the sim came back with just that number. The variance across the BLB estimators from all blocks was 456,998,006, almost exactly a million times the block variance. So, while I haven't made as much progress on the new stuff my adviser wanted me to look at, at least I can sleep a bit better knowing that my base case is using the MVUE and all this other stuff is just to help figure out what that variance is.

Incidentally, if you grind out the covariance terms (which I'm starting to get good at), it turns out that the estimator created by simply averaging the BLB estimates for however many blocks you've sampled is as good as using the simple block estimator. Really! The variances come out to be exactly the same: N²σ²(N-r)/[r(N-1)]. That's fine. There's no rule that MVUE's are unique. I just wanted to be sure I had one.
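A quick simulation check of that variance formula (with synthetic block sums standing in for the real data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, r, n_sims = 1000, 20, 50_000
block_sums = rng.gamma(2.0, 10.0, size=N)       # synthetic block sums, any shape will do
sigma2 = block_sums.var()                        # population variance of block sums (ddof=0)

est = np.array([N * rng.choice(block_sums, size=r, replace=False).mean()
                for _ in range(n_sims)])
print(est.var())                                 # empirical variance of the scaled block estimator
print(N**2 * sigma2 * (N - r) / (r * (N - 1)))   # N^2 * sigma^2 * (N-r) / [r(N-1)]
```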

Saturday, February 10, 2018

Accuracy

Among the many competitive activities that I pursued just enough to get sort of good before deciding it wasn't for me was Chess. I like playing Chess. I like studying Chess. In 1995, I was literally the median Tournament Player in the US (along with a bunch of other folks who had the same rating as I). But, I knew I'd never be an expert, much less a master.

It wasn't that I didn't think I could put in the effort or that the concepts were too hard for me. Those were both open questions, but I was willing to give myself the benefit of the doubt. What I knew I couldn't do was play accurately.

In Chess, accurate play basically means not missing the obvious stuff. Leaving a piece unprotected, making moves in the wrong order, failing to see appropriate counter-moves. These errors will kill the best laid plans. The reason I knew I'd never get beyond such mistakes is because such mistakes characterize my efforts in everything I do.

One of the reasons I'm fairly good at computer programming is that the computer is fairly good at telling me when I've messed something up. Modern development environments are pretty good at flagging not only syntax errors, but also uninitialized variables, name clashes, parameter mismatches, and other semantic errors. I can pretty much charge ahead and let the editor tell me when I've just done something stupid and fix it.

It wasn't always like that. I started programming in 1977 punching FORTRAN statements onto Hollerith cards. Every miskey meant leafing through the card deck to find the offending card, retyping it, inserting the new card into the deck and resubmitting the program. It took me a long time to get stuff done. The next year, my dad asked me to write some code for his labs and I got to use a terminal directly attached to a PDP-8. Correcting errors became a process of seconds rather than minutes. Suddenly, I could get things to work quickly (at least by the standards of the day).

I could blame it on being dyslexic or left handed, but the truth is that I just don't particularly like being careful. Sure, I mess up the details a lot. It doesn't seem to matter. I just fix 'em and everybody's happy.

However, there are some things, like deriving the covariance of block sums from a finite sample, that require accurate computation. The fact that I have to do it about 10 times before getting it right is really slowing this paper down. As with Chess, I doubt I'll be able to fix that. I just need to expect every one of these formulas to take a lot of time. And, of course, I need to do a LOT of re-checking my work.

Friday, February 9, 2018

The bridge

Having slept on it for a night, I'm still not wild about adding the additional scope to this paper, but I did come to see some use for the line of research in my dissertation, which is really the goal; the paper is just a step along that path.

The idea is to use a Dirichlet Process to pull the partition distributions out of the data. That's an interesting exercise, but there's no way that's faster than just reading all the data and getting the real answer. However, it does offer an interesting bridge to D-BESt, which is still a piece of research I very much like.

The train of thought goes like this:

  • When the data is correlated, "random" sampling doesn't work very well.
  • Instead of trying to un-correlate the data, we use D-BESt to crank the correlation up even higher and use that correlation to limit the number of blocks we have to look at.
  • That works, but with the results from the Dirichlet distribution, we can do even better on two fronts. First, the distribution tells us how well D-BESt is working. Second, and this is the great part in my mind, it allows us to identify groups of blocks that have similar distributions and consider a sample from one of them to be representative of the entire group and leverage that in our estimate. It basically takes care of the problem noted before of irrelevant data. Somebody can add a trillion new rows we don't care about and we still won't care. The "map" of distributions to blocks will allow us to steer away from that and only consider the mass of data that supports non-zero sums.
Of course, that's a whole bunch of work and won't make it into this paper. But, laying the groundwork for it might have some merit (though I'd still rather just get the basic results out first).

Thursday, February 8, 2018

Scope creep

Bah! Today's meeting didn't go how I would have hoped. It's not that we had a bad discussion. On the contrary, it was probably the best conversation we've had to date. However, it wasn't to the end of getting this paper wrapped up. Instead, it opened up a whole new line of research.

Oh well, as a Literature candidate described PhD work to me when I was at Cornell, "You like Hershey bars? Here! Eat 5,000 of them!"

Wednesday, February 7, 2018

Reining it in

You may recall that the original thought (well, original for this iteration, which actually makes it about the five hundredth thought) was to have the theory section build starting with iid, then going to correlated on hit rate, then go to correlated on general distribution of the observations. Given what we've got now, I think I'm going to suggest that we focus just on the hit rate problem. It still gives us three basic steps.

The first is, again, iid. There's nothing remotely interesting about sampling iid observations, but it does allow us to establish the framework and methods without all the complexities that follow.

The next is sequential correlation, typically caused by the fact that things that are loaded together tend to be related and also tend to get stuck next to each other in the storage system. This was also part of the plan.

The third step ends up flowing pretty simply from that. Suppose things aren't strictly sequential, but there's still a lot of correlation. This happens on massively parallel systems where writes get sent to multiple nodes according to some hash algorithm. It has the effect of breaking up batches, but the little chunks are still sequential and highly correlated. This case is actually a little cleaner mathematically because you don't have the annoying boundary conditions for the first or last batch in a block (which typically spill into adjacent blocks).

The actual methods work the same in all three cases (though you'd be a fool to use the complicated methods if you knew the data was iid). The results vary quite a bit, though. Plus, in the context of a full dissertation, the third case sets me up nicely to follow with D-BESt.

My meeting with my adviser got pushed back to tomorrow, so I should know by tomorrow's post if that's the way we're going.

Tuesday, February 6, 2018

I'm back

... from Texas. The race was a Texas-sized train wreck which I will write about this weekend or next. There was a little time for visiting and celebrating our anniversary. Mostly, I ground out a bunch of equations which I'm not sure have a lot of merit other than demonstrating that the exact distribution of our statistic is basically intractable. You can do it, but there isn't much point. What I was able to establish was an upper and lower bound on the variance that is fairly easy to compute. That will be useful for sure. I did all this on paper and I'm not sure I'll get it all typeset in time for my meeting with my adviser tomorrow. It's already close to midnight and I've got quite a ways to go.

Thursday, February 1, 2018

Found some more math

I think I found where all the good math went. I worked this out on the very bumpy flight to Houston today and haven't really dug into it, so I could be wrong. Actually, that last part is pretty much true all the times I'm not on a bumpy flight, too, but I digress.

Anyway, when I introduced the correlation, I went to the simple case where each block comes from a single partition. This is perfectly adequate for demonstrating the issue, but it also oversimplifies the problem. I tried injecting the more realistic constraint that a block may contain any number of partitions. There are three good reasons to do this: (1) the variance computation is more interesting from a theoretical point of view, (2) it's a better model of the data (OK, this should be #1, but we're trying to get something published here), and (3) it allows us to show that the 1-partition case is easier to work with computationally.

That last one may sound like a bad thing, but it's not. It actually sets up a mathematically sound motivation for intentionally increasing the correlation within a block. That is, it leads very naturally to a theoretical, rather than completely pragmatic, argument for D-BESt, which is the next step in this research. Up until now, I was worried the transition to that would be an awkward break in the thesis. Now it looks like it will flow quite naturally.

Of course, I still have to do the derivations and I may still be wrong.