Wednesday, January 31, 2018

Looking for more math

So the theory section of the latest iteration of my paper was shaping up really nicely but has suddenly devolved into hand waving. The correlation case should present all sorts of opportunities for interesting results, but I'm not finding them. It's not that I can't prove them; I'm just not finding good things to prove.

Saturday's race will give me upwards of 20 hours to think about it. Maybe I'll come up with something.

Tuesday, January 30, 2018

Who cares?

All through this sampling project, there's been an underlying concern: who cares? Sure, sampling large databases by block is a great thing. So do it. Why all the fuss? Just sample by block, use the block sum sample variance to estimate the spread and you're done. What's more to say?

We've been trying to show that there is more to say and, while I knew that to be true, I was not finding a good way to articulate it. Showing a bunch of convergence graphs is helpful, but not a substitute for a concise argument.

My mind wandered back to the statement that put me onto this line of research before I even went back to school. It was made by one of our senior actuaries in response to my suggestion of using sampling: "We tried that and we kept missing important stuff."

Why? The actuaries all got A's in statistics. They know how to construct a confidence interval from a sample. Why was it not working? My original thought (and still my belief) was that the sampling wasn't adequately accounting for correlation in the data. But, the more I worked with BLB and BMH, the more they started looking like really expensive ways to arrive at the sample block variance and we were going to wind up right back where we started. Why did I think any of this was going to be better? Why should anybody care?

Then, while grinding out the exact formula for the block sum variance this morning, I realized why it wasn't working. And, it's so simple, it's a little embarrassing. The sample block sums are not sufficient statistics for the total sum (they are in the iid case, but not in general). When observations are correlated, rolling things up to blocks throws away information that turns out to be really important. We need some other statistics that summarize the data without losing information.

Such statistics are readily available; the trick is using them to reverse engineer the overall distribution. Straight random sampling is not well suited to this task. BLB and BMH are (particularly BMH, which accommodates arbitrarily complex models). So, now I have an easily stated justification for breaking out methods that at first glance appear to be overkill.

This gives a cognitive flow to the theory section that was completely missing up until now. There's still a fair bit of writing to go, but the path is clear and that's usually half the battle.

Monday, January 29, 2018

FizzBuzz

Today I got sent to a training class at work. I'll be there for the next couple days as well. The topic is Test Driven Design (TDD), though I prefer to call it Test Driven Development because I still think there's a place in the world for mapping out your thoughts before you start cranking code. The head honcho architect at RGA wants all the architects and leads to know this stuff so, even though I've picked up most of the concepts on my own, I was enrolled. I'm fine with that, it's well taught and it never hurts to formalize your knowledge.

It's a hands-on class and the first exercise was to code the silly icebreaker game "FizzBuzz". If you've never heard of it, you go around the room counting off numbers except that if a number is divisible by 3, you say "Fizz". If it's divisible by 5, you say "Buzz". If it's divisible by both, you say "FizzBuzz". The correct answers for the first 16 are:

1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, 16

It's not as easy as it sounds because it's hard to keep track of the numbers when the Fizz's and Buzz's overlap. As I've stated before, whatever you think of TDD in practice, it really shines when you have super-well-defined problems like this one. We did this one using the ping-pong approach where two developers share a platform and one writes a test and then the other modifies the code to pass the test. You try to do this as quickly as possible to prevent any extra gold-plating seeping in (you can always refactor it later).

Again, when the problem is this simple, you often get a pretty good solution on the first go. Here's the final test (the preceding ones built up to this by first getting the single value right and then gluing them together):


@Test
public void speakSequence_16returns() {
  assertEquals("1, 2, Fizz, 4, Buzz, Fizz, 7, 8, Fizz, Buzz, 11, Fizz, 13, 14, FizzBuzz, 16",
  testObject.speakSequence(16));
}

And, here's the solution:

import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FizzBuzz {

  // Speak a single number: "Fizz" for multiples of 3, "Buzz" for multiples of 5,
  // "FizzBuzz" for multiples of both, otherwise the number itself.
  public String speak(Integer number) {
    return number % 15 == 0 ? "FizzBuzz" :
      number % 5 == 0 ? "Buzz" :
      number % 3 == 0 ? "Fizz" :
      number.toString();
  }

  // Speak the whole sequence from 1 to maxNumber as a comma-separated string.
  public String speakSequence(Integer maxNumber) {
    if (maxNumber < 1) throw new IllegalArgumentException();
    return IntStream
        .rangeClosed(1, maxNumber)
        .mapToObj(i -> speak(i))
        .collect(Collectors.joining(", "));
  }
}

Ok, the IntStream is probably overkill, but it's certainly elegant. And it took the two of us less than 10 minutes to produce this fully-tested class.

Sunday, January 28, 2018

Not much to show for it

I actually got in a good amount of research time this week. Not by normal grad school standards, mind you, but, considering that I do have a day job, 25 hours on research is a reasonable effort.

I just wish I had more to show for it.

Part of the problem is that I spent most of the week just reframing everything. That didn't leave a lot of time for new results. Looking for some breakthroughs next week.

Saturday, January 27, 2018

Two outta three

Not a bad average if you're a Major League baseball player. Maybe a little less great for life.

While taking a break from writing today, I scrolled through some old blog entries. I came across my goals for 2017. I didn't expect to hit them all, but I'd have to say I came up a bit short.

Nailed it:

  • Passed the qualifying exam
  • 4.0 GPA (that one's pretty much locked in now since I'm done with coursework)
  • Converted all initiatives at work to scalable platform (2200 billable hours to make that happen; might be the reason I failed some of the others)
  • Still married
  • Still debt free (though my credit card got hacked 5 times!)
  • Supported Yaya's music
Not so much:
  • Didn't even take, much less pass, the admission to candidacy. That stings a bit. There was a time I thought a PhD in 2018 might be possible. Now, 2019 is starting to look optimistic. Holding life together for two more years with this schedule will not be easy.
  • No papers accepted for publication. I'm downright despondent over this one.
  • No buckle from Leadville (not even the small one). Mountain ultras just don't seem to be my thing. That won't stop me from doing them, though.
So, what are we hoping for this year?

Well, for starters, if I don't get a paper published and get through the A exam this year, it's game over. I can't do this work/school thing forever. I might even put a shorter fuse on that. It really needs to happen this spring.

I don't really have any big goals at work. There's plenty to do, but nothing like last year's highly-visible milestones. Maybe I can get my hours down to around 1800. Four hundred extra hours at school would make a pretty big difference.

I'd leave the family goals the same as last year. We've got the normal teenage crap to deal with and a few not so normal challenges, but generally the status quo is fine.

It's a bit hard to think about running goals when I've been injured for the last six weeks. Just getting back to running healthy seems like it would be a super great thing right now. That said, I'm running US100 champs next weekend and a World Major marathon in the fall, so I can't pretend I don't care at all.

Wednesday, January 24, 2018

Way over

My weight, that is. For those who don't follow my running training log, I've been nursing a leg injury for the last five weeks. I fell during a run on trail and smashed the shin pretty badly. I thought it was better before running Little Woods, and it felt OK during the race, but in the days that followed it became obvious that a 50-mile run was not what the leg needed to finish healing up. I hobbled along for a few days before deciding to take a week off.

That's been extended to 10 days, though I did get in an hour on the bike and another hour on the elliptical trainer at UMSL last Sunday and Monday. Meanwhile, my weight has gone up precipitously. I usually gain a few pounds in January, but this year it's a full 10. That's probably not what I want to bring down to Texas for US100 champs next weekend, but I guess bringing a few extra pounds is better than bringing an injured leg.

Tuesday, January 23, 2018

Grinding it out

You may recall I said I'd derive the exact formulas for the variance of the block sums. There's nothing particularly hard (or enlightening) about doing this, it's just straightforward symbol manipulation. It's also long and tedious symbol manipulation. For example, the covariance term, which I had previously claimed we can safely ignore:



Well, that last line means we can safely ignore it (since the other variance terms grow with the square of n rather than linearly). Not particularly rewarding work. But, it does preempt any accusations of hand-waving.

Monday, January 22, 2018

Semi-annual outline update

Don't bother saying you've heard this one before. It seems that reorganizing the theory section twice a year is the way this thing is going to roll. Here's my outline. I stress my. This is not a bipartisan effort just yet and I expect my adviser will want to change it (mainly because it doesn't include the very thing we said it should when we met on Thursday, sounds a lot like the Senate right now). Nonetheless, for those of us trying to actually reach solutions rather than grandstanding for the public, having a draft in hand is better than starting from scratch.

Introduction doesn't change much. My adviser suggested a slight rearrangement of the ideas, which is easy to accommodate.

Theory section.
1. Formulation assuming iid observations - Lay out the basic notation, introduce the query function, some quick comments about how the sum is obviously Normal, and the exact distribution of the estimator from a random block sample. Might throw in a graph from simulated results. Argue that using physical blocks in this case as a selection procedure for BMH and BLB sub-samples yields the same results as true random sampling.

2. Formulation assuming a correlated mixture distribution on IQ - Add the notation for partitioning and note how partitions tend to be grouped in contiguous physical storage. Derive the blocksum variance. Demonstrate quick convergence to normality of the estimator even in heavily skewed blocksum distributions. Derive the exact distribution of the estimator from a semi-random block sample (using radix sampling to avoid correlation between adjacent blocks). Show how radix sampling can also save the BMH and BLB methods.

3. Formulation assuming a true heterogeneous mixture (both the distribution of Xi and IQ depend on pi). This is where I believe my adviser wants to insert some methods based on Dirichlet priors. I'm only somewhat familiar with the techniques, so I'm going to have to study up on this. Again, get the exact distribution of the semi-random block sample. Also demonstrate the impact on BMH and BLB using radix sampling.

The claims in the theory sections will be backed by simulated results.

Application section.
We bust out the big data set and run it against the third formulation.

Discussion and future research.
Probably best to actually get the results before writing too much discussion, but I expect we're going to find that BLB with radix sampling does pretty well. BMH should also do fine, but it's computationally expensive; enough so that we have to factor that in (generally, we are claiming that I/O dominates run time). Random block sampling won't work particularly well without stratification, but it's still worth showing, if for no other reason than to point out why we're not just pursuing the simple path. Future research is focused on imposing organizational constraints on the physical data so that the methods converge faster. The two obvious choices (both of which I have conveniently already coded) are dynamic partitioning, which ratchets the correlation up to the point where each block contains exactly one partition, and stratification. Both of these allow us to ignore a lot of blocks because we know from the outset they won't have much impact on the estimator.

This isn't the total re-write that it appears to be. It's more a shift in focus. Before, we had BLB and BMH along just for the sake of comparison. Now, they are actually the subjects of our research. We're showing how the correlated data messes them up, but also showing how to adjust for that. Most of the code is written and I've been pretty good about writing it in a way that lets me make the adjustments quickly. As for the paper itself, good writing always takes time but, fortunately, I actually have some time right now. With all our initiatives at work finally in production (the last one being the source of the 40-billion-row data set that I'm going to use), work life is moving back to reasonably predictable 40-hour weeks.

Saturday, January 20, 2018

New abstract

I'm jumping the gun on this a bit as there are a few results that I don't have backing for just yet. However, it is helpful to spell out what you intend to show, even if you haven't got there yet. So, here's a cut at the new abstract:

When sampling from a finite population, the statistic of interest is often the sum (rather than the mean) of observations from a sub-population. When the finite population is large (we will consider n > 10^9 as being "large"), estimators for this statistic can be derived from the asymptotic convergence of the mean of a random sample. However, the quality of such estimators must be evaluated in light of the finiteness of the population. Further, the realities of physical storage of large data sets introduce correlations in the data that can significantly impact the quality of the estimate when a truly random sample is replaced by sampling by physical block (the latter being less efficient in the statistical sense, but far more efficient in terms of computational resources).

In this paper we examine the convergence of estimators using block sampling rather than true random sampling for three different methods: random block, Bag of Little Bootstraps (BLB), and Bootstrap Metropolis-Hastings (BMH), and derive adjustments to the estimator quality based on the organization of the data in the physical database. We then demonstrate the correctness of these adjustments by applying the methods to a 40-billion-row data set of cash flow projections from life insurance policies.

The gist is this:

  • In infinite populations, unless the mean is zero, the sum drifts off to infinity, so there's no point in estimating it. In finite populations, the sum is a real thing and it can be estimated.
  • Sure, you can estimate the mean and then just multiply that by the number of observations to get an estimator of the sum. That's actually a good strategy.
  • Problem is, scaling up the variance of your estimate of the mean does NOT give you the right variance for the sum, because that variance was computed as if the population were infinite. If you're trying to assess the estimate of the sum of a finite population, you have to account for how much of the population is unsampled; the textbook formula for that is sketched just after this list. (Simplest example: a population of n=1. You sample that one observation and call it your estimator of the mean. If your population were infinite, you'd have no way of knowing how good your estimator is because there's no variance from a sample of 1. However, in this case, you know exactly how good your estimator is: it's perfect. You sampled the entire population.)
  • Furthermore, since we're not living in some fantasy world where observations just drop out of the sky, we need to think about how the data is stored. On any computer system capable of handling a large data set, the data is spread across multiple physical storage blocks. We could ignore that reality and just randomly grab rows using a random number generator, but that would be super duper inefficient. So much so that we'd actually be better off just reading the entire data set and getting the real number (I can prove this). So, we grab a whole block at a time and build an estimator from that.
  • All well and good if our observations are iid. Guess what? They're not. Not even close. So, we need to adjust for that.
  • Even furthermore, we're not after the entire population sum. We want the sum from a sub-population. This sub-population may or may not be spread evenly among the blocks. That drives up the variance of block sampling even further.
  • So far, all of this is pretty self-evident and amounts to hand-wringing. Now comes the big question: what are you going to do about it? We need to roll up our sleeves and include these factors in any assessment of our estimator.
  • We pick three fairly-well accepted estimators and do exactly that.
  • Then we check our work by applying what we've done to a moderately large data set. We'll run it a bunch of times and show that our estimator is within our predicted confidence interval about as often as we'd expect.
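To put a formula behind the finite-population point (this is just the textbook simple-random-sampling result for row-level samples, before any of the block and correlation complications the paper is actually about): draw n rows without replacement from a population of N rows with population variance S^2, estimate the total as N times the sample mean, and you get

  \hat{T} = N \bar{x},    Var(\hat{T}) = N^2 (1 - n/N) S^2 / n.

The (1 - n/N) factor is the finite population correction. At n = N it drives the variance to zero, which is exactly the n = 1 example above, and which the infinite-population formula will never admit.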


Friday, January 19, 2018

Form over function

After my retirement from cycling and before I got really serious about running, I raced cars for a few years (I'm not sure I'm capable of an existence where I'm not racing something). Just amateur SCCA stuff but real-live car racing nonetheless. After driving a couple years in the stock classes, which don't allow you to change much beyond tires, I decided to dump some money into the car to make it go faster. The hot class for my car was called Street Touring. It was basically the "Rice Boy" class. You could lower the car, stiffen the shocks and springs, make some modest intake and exhaust adjustments. As my car was a Honda product, that's where everybody expected me to go. When I chose a different class where I would have to spend my money a bit differently and wind up with a less competitive car, one of the other drivers asked me, "Why didn't you go to Street Touring?" My response, "I didn't want to spend four grand to make my car slower."

You see, unlike stock, Street Touring didn't let you run real race tires. You were stuck with the typical low profile, three-season traction stuff you see on hopped up street rods. It looks cool, but the grip is way worse than a soft-compound tire with a race profile and low-void tread. And, tires are by far the most important part of a race car. My stock car with race tires was faster than anything I could build for Street Touring. So, I went for a class that allowed race tires. I didn't do as well nationally, but I did win a bunch of local races. And, it was fun to be turning in times faster than all the Street Touring cars (as well as a lot of much more expensive sports cars from the stock classes).

I bring this up because I'm at a similar point with the dissertation research. There are better opportunities for "good looking" (read, publishable) results by ditching the stratification and focusing on the correlation. And, I guess that's the way we're going to go. But, I'll still glance longingly at those graphs from time to time knowing that all the theorems in the world won't make a block sampler converge as fast as CISS.


Yeah, that's what I was talking about.

Thursday, January 18, 2018

The new normal

The Central Limit Theorem (CLT) is exceedingly well known. Even people who don't know the details are often familiar with the general idea: if you average enough things together, you get a "bell" curve. However, the devil is in the details and before invoking it in a formal setting, we need to make sure we're not breaking some rules.

The most important rule is that the variance of the observations is finite. People often wave their hands on this one because, outside of financial data and some weird stuff that particle physics folks deal with, it's almost always true. We're using financial data and it's definitely heavy tailed, but the more I've been looking at it the more I'm convinced that we do have a second moment.

Slightly less important is that the observations be identically distributed. As long as they have the same mean and variance, the actual distribution can vary without messing things up too badly. In our case, the individual observations are really coming from a mixture distribution, so the moments aren't equal. But, because we're averaging block sums rather than the individual observations, we get quite a bit of smoothing. Given our block size, it's reasonable to claim that the block sums are pretty similar. (Ultimately, we might want to look at distributions by partition, but we're holding off on that for now).
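For the record, the theorem that makes this relaxation legitimate is the Lindeberg-Feller version of the CLT: independence and finite variances are enough, provided no single term dominates the total variance. If the block sums S_1, ..., S_m are independent with means \mu_j, variances \sigma_j^2, and s_m^2 = \sum_j \sigma_j^2, then the condition

  (1/s_m^2) \sum_{j=1}^m E[ (S_j - \mu_j)^2 1{ |S_j - \mu_j| > \epsilon s_m } ] -> 0  for every \epsilon > 0

is enough for \sum_{j=1}^m (S_j - \mu_j) / s_m to converge in distribution to N(0, 1). Independence of the block sums is the remaining sticking point, which is the subject of the next paragraph.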

Averaging blocksums also helps with the fact that we are in flagrant violation of the independence assumption. As we saw yesterday, the correlation makes the distribution of the blocksums much different from what it would be if the observations were truly pulled at random. However, from one block to the next, the correlation is very weak, especially if we are sampling blocks randomly. So, we're on reasonably stable footing with the assumptions.

That leaves us with the question of how long the convergence takes. When our block length and average partition length are roughly equal, we have the opposite of the normal distribution - nearly all the values are pushed to the endpoints. So, how long do we have to run this thing before we can start applying the CLT to produce a confidence interval?

Happily, not very long. Simply summing two random blocks together gets us to a more or less unimodal distribution (though there's still a disconcerting spike at zero). The distribution predicted by the Central Limit Theorem is overlaid with the solid line.


By five blocks, things are looking much better and by twenty, the fit is just about perfect.


So, that's one less thing to stress over. We'll always be sampling way more than 20 blocks. If I can get the variance right, the rest of the sampler should work just fine.


Wednesday, January 17, 2018

When in doubt... (v2)

... simulate the hell out of it.

Machine apocalypse watchers got a big milestone in 2017 when the world Go champion was bested by a computer (and it wasn't a particularly close match). That, in and of itself, isn't super interesting. It's just more evidence that the difference between the computational power of machines and humans has become sufficiently small that a purpose built machine can beat a general purpose human at a high-cognitive task.

What is interesting is how it did it. Rather than the usual "min-max" method of game theory where you build these huge trees of "best" moves on both sides, the algorithm plotted out the next few steps and then just played several thousand completely random games from each of the resulting positions. This turns out to be a really good way of appraising a complicated situation. With enough simulations, you get a pretty good distribution of what to expect. This is, not coincidentally, how MCMC samplers work.

So, faced with the fact that I haven't been able to round up a paper that simply tells me what the variance of a block sum is when the observations are correlated, I have set to deriving it. Whenever you're deriving something, it's helpful to know the answer ahead of time. That's why textbook questions generally state derivation problems as "prove the following" rather than "what's this equal to?"

So, to get that, I wrote a quick simulator that generates blocks of correlated data. Since it's my general conviction that the underlying distribution doesn't matter much aside from the moments, I just went with the easy one and generated normal random variables with a mean of 2 and variance of 1. If we generate a block of 1000 independent, identically distributed (iid) observations, theory tells us we should expect the mean and variance of the sums to be 1000 times the mean and variance of the individual observations. I generated 1000 such blocks and got almost exactly that. The sample mean was 2001 and the sample variance was 999. Also in line with theory, the histogram looks an awful lot like the familiar normal bell curve.
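A minimal sketch of that sanity check (Java here just for concreteness; this is not the actual simulator, but the block size, block count, and N(2,1) observations match the description above):

import java.util.Random;

public class IidBlockSumCheck {
  public static void main(String[] args) {
    Random rng = new Random(42);
    int blocks = 1000;
    int blockSize = 1000;
    double[] sums = new double[blocks];
    for (int b = 0; b < blocks; b++) {
      double s = 0.0;
      for (int i = 0; i < blockSize; i++) {
        s += 2.0 + rng.nextGaussian();   // one N(2, 1) observation
      }
      sums[b] = s;
    }
    // Sample mean and variance of the block sums; theory says roughly 2000 and 1000.
    double mean = 0.0;
    for (double s : sums) mean += s;
    mean /= blocks;
    double var = 0.0;
    for (double s : sums) var += (s - mean) * (s - mean);
    var /= (blocks - 1);
    System.out.printf("mean of block sums = %.1f, variance = %.1f%n", mean, var);
  }
}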


If we add that only a percentage of the rows from the block are selected by the query, we see the variance rise since values are now being forced to zero, which is in the tail of the N(2,1) distribution. With 40% of the rows selected (on average), the variance rises to 1302 and the mean drops to 801.


Of course, if we go far enough down this path, the zero observations become the center of the distribution and the actual selected values are sitting off by themselves. This causes the variance to drop again. At 10%, the variance is less than half the full-inclusion case.


OK, I didn't need to run a simulation to know all that. Sums and variances of iid sequences are generally pretty easy to derive. But, having convinced ourselves that the simulation matches theory, let's look at what happens when we correlate the variables. There are lots of parameters we could mess with, and I'll get to them all as I derive my results, but for demonstration purposes, let's keep it simple and say that we have two partitions, each equally likely. One of the partitions returns no rows for the query, the other returns 80%. Over the entire data set, this will look just like the iid 40% case. However, we now generate blocks where if we are in one partition, we tend to stay there. On average, we go half a block length before switching. The mean of the block sums should not be affected by this. However, the variance will get cranked way up because we'll have a lot of blocks with no rows selected and about the same number of blocks where 80% of the rows are selected. Let's take a peek:


Well, the variance sure did go up! The sample mean is 784, pretty close to the true mean of 800, but the spread is completely different from the iid case. The bell curve is pretty much non-existent and there's a big spike over on the left of blocks with a sum of zero. Another bump on the right is due to blocks where nearly all the rows came from the 80% partition.
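The generator behind those blocks needs only one more ingredient than the iid sketch: a two-state partition label with a per-row switching probability of 1/500, which is my reading of "half a block length on average" for 1000-row blocks. Again, a sketch rather than the actual simulator:

import java.util.Random;

public class CorrelatedBlockSumCheck {
  public static void main(String[] args) {
    Random rng = new Random(42);
    int blocks = 1000;
    int blockSize = 1000;
    double pSwitch = 1.0 / 500.0;                // average run length = half a block
    boolean inHotPartition = rng.nextBoolean();  // two partitions, equally likely
    double[] sums = new double[blocks];
    for (int b = 0; b < blocks; b++) {
      double s = 0.0;
      for (int i = 0; i < blockSize; i++) {
        // Partition runs carry across block boundaries, as they would in contiguous storage.
        if (rng.nextDouble() < pSwitch) inHotPartition = !inHotPartition;
        // "Hot" partition: 80% of rows satisfy the query; the other partition returns none.
        if (inHotPartition && rng.nextDouble() < 0.8) {
          s += 2.0 + rng.nextGaussian();         // included N(2, 1) measure
        }
      }
      sums[b] = s;
    }
    double mean = 0.0;
    for (double s : sums) mean += s;
    mean /= blocks;
    double var = 0.0;
    for (double s : sums) var += (s - mean) * (s - mean);
    var /= (blocks - 1);
    System.out.printf("mean of block sums = %.0f (true mean 800), variance = %.0f%n", mean, var);
  }
}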

And, we're just getting warmed up. What happens if the partition length is (on average) the same or twice as long as the block length?


Pretty much what you might expect. The sums are all clustered around the endpoints. You're either getting blocks with 80% of the rows included or blocks with none included. The middle is all but gone.

This partitioning example is particularly inflationary for two reasons. First, the hit rates are very different and there aren't any in-between partitions to smooth things out. Second, zero is nowhere near the center of the distribution of the measures; it's two standard deviations away from the mean. In reality, the variance won't go up that much due to correlation.

The point, however, is that I now have a nice easy way to check my calculations as I try to put some closed formulas around my variance estimates.

Tuesday, January 16, 2018

What do we mean by mean?

Another point raised by my adviser is that it is intuitively easier to bound a mean than a sum. Unless the mean is zero, the sum will tend to infinity, so that makes it rather hard to bound. The sample mean, on the other hand, will converge as long as the true mean is finite.

The problem with that is determining what we even mean by a mean. Obviously, the sum over the number of observations. But, what do we consider observations?

The total size of the data set? OK, that's the obvious answer, but what that means is that if somebody loads a bunch of completely irrelevant data, the mean of my query just changed while the sum didn't. Seems bogus.

The number of rows actually included by the query? Sure, except we don't know what that number is. It would be yet another estimated parameter which would contribute more uncertainty to the convergence.

How about the number of blocks? Then, we're getting the mean of the blocksums. This is still just a tad bogus for the same reason as the first, but it seems less so. Since I already have the whole theory section lined out around blocksums and I don't expect the Markov/martingale treatment to change that, I think that's where I'm headed.

But, at the end of the day it's the sum I'm after.

Monday, January 15, 2018

Implementation details

I went through the notes from my adviser today. Most were straightforward corrections or clarifications. A few require some more thought. One, in particular, cuts directly to our version of "truth" when issuing a query.

Are we viewing the query sum as a true value or the realization of a random variable? Stated another way, is this truly a finite population with a fixed query sum, or is it just a really large sample from a larger population? I have been treating it as a finite population, but I can see instances where the other treatment makes at least as much sense.

For the purposes of projecting the capital needs of the company, it is definitely a finite population. Our portfolio of policies is known and finite. Sure, we may get more, but the question at hand is how to cover our current inventory.

When we look to experience studies, however, that stance becomes harder to defend. Experience studies look at actual cash flows to see how well they match the projections. If the two are diverging, it may indicate a problem with our models and we would want to update pricing of new business to reflect that. In this case, we are using our own inventory simply because that's the best data we have. If we had cash flows from a competitor, we'd happily look at that as well. In that framing, our own data set is really a sample of all the possible policies in the world.

I'd rather stick with the finite model, simply because I think it's been less explored so our method has more novelty. In particular, while it's hardly a profound result, I like my formula for the distribution of our statistic relative to the true sum. It converges to the distribution suggested by the Central Limit Theorem as the number of observations gets very large, but it is distinct in interesting ways. However, if we really run into something intractable, we could switch and still have something useful.

Sunday, January 14, 2018

Little Woods Ultra 2018

I'm not taking a day off from research this weekend but I am doing a non-math blog post. For those few who were really craving some math content today, I'll just say that I'm tightening the model a bit. Rather than just specifying that the observations are a Markov Chain, I'm running that through a transform to produce a martingale from the partial sums. Martingales have really useful convergence properties. Before you shout with outrage that you can't do that when sampling without replacement from a finite population, let me just say that we're talking a REALLY BIG finite population here and that sampling with replacement really isn't materially different when you're talking about a single block sum out of thousands. I'll elaborate tomorrow when I have more results.
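One standard way to get a martingale out of partial sums (roughly what I have in mind, though the details may shift) is to write Z_i for the i-th observation's contribution to the sum and center each term by its conditional expectation given the history:

  M_k = \sum_{i=1}^k ( Z_i - E[ Z_i | Z_1, ..., Z_{i-1} ] ).

Then {M_k} is a martingale with respect to the natural filtration, which is where the convergence properties come from.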

That out of the way, several have asked for a race report from Little Woods last week. I'm not sure I would have called Little Woods a "race" even back when I took these things seriously, but we'll go with it.

The Little Woods Ultra was the brainchild of Travis Redden. There are several events like it around the country. The format is referred to as Last Man Standing. In an apparent nod to political correctness, Little Woods lists it as Last Man/Woman Standing. Whatever. The point is, you run until you can't. Last person who can, wins.

Without a little structure around that, the event would quickly devolve into chaos. So, there are some additional rules that turn out to be fairly important (transparent foreshadowing here!). Most of all, one has to define what is meant by "can't" in the above statement. Even the elites take walk breaks in ultras, so rather than insist everybody run continuously, the rule is that you have to complete the loop and be ready to start the next in an hour. Finish early, yay for you, but you still have to wait until the top of the hour to start the next. True to the ultra tradition, the loop distance for Little Woods is a point on which reasonable people disagree (if one could imply that ultrarunners are reasonable people). "Around" four miles is as close as one can get to consensus. The singletrack trail winds through a small parcel of forest on the Southern Illinois University at Edwardsville campus. (There is a larger parcel of forest on another part of the campus; you can probably guess what it's called.)

Travis has passed the administration of the event to Metro Tri Club so he can run it himself. Last year, he and I were the only two to complete 12 laps at which point I decided that running through the night in January didn't sound like much fun and conceded the win to him. Some were disappointed that we didn't make a fight of it, so this year I announce up front that I am not going past 50 for fear of screwing up my run at Rocky Racoon in four weeks. Everybody buys that as a plausible excuse. Truth is, I just don't want to run through the night in January.

As long as I'm out here, I might as well get in some pace training for Rocky. Four miles an hour is a stout 100 pace in the mountains but there are no mountains (or even hills) in Edwardsville, IL. Rocky is also pretty flat, so I'm planning on running 10-11 minute miles; at least for the first 50. I decide I'll run the loop today in around 40 minutes and then walk an extra mile between laps. That will be close enough to race pace and also give me some practice eating while walking rather than standing around between laps like I did last year.

No matter how cold the start, there's always one nut in shorts.
The normal morning temperature for early January in Edwardsville is around 20F, so there was good reason to believe that this year's start would be warmer than last year's 3F. Nope, we're right at zero as we head into the woods at 8AM. In an effort to generate some body heat, I head to the front and lay down a pace that's a bit firmer than my plan. A few stay with me and we finish the loop in 38 minutes. The prize for "winning" the opening round is warm hands and first crack at the aid station, which is basically a pot luck and the fifty or so runners that actually came out in these conditions have brought a wide selection of tasty treats. I grab a few and get to walking my mile.

In addition to pace training, I had hoped to experiment with how much I eat during these things. I've always had a big low section in 100's, usually coming around 10 hours in and lasting for 2-3 hours. I had always got through it OK until last summer at Leadville when it coincided with the return trip over Hope Pass. In that context it was devastating and by the time I had collected myself, I was late for the time cutoff at Twin Lakes. So, I've decided to try eating more early on to see if that helps.

I developed my guideline for how much to eat in long events when I was doing 24-hour orienteering races and 20-30 hour adventure races. In those contexts, I calculated that I needed 100-150 carb calories per hour. This was based on a burn rate of around 500 calories an hour, a maximum of 300 calories per hour being pulled from fat, and pre-race glycogen stores of around 2500 calories. I adopted that as an ultrarunning guideline as well without really thinking much about it. After Leadville, it occurred to me that I was probably burning fuel significantly faster in 100's than in the other two activities; probably closer to 600 per hour.

Therefore, upping the intake to 200-300 carb calories per hour probably makes sense. Of course, ingesting just carbs will send your GI tract into a tizzy pretty quickly, so I have to eat around 400 calories each lap to make that work. While it's no problem to do that, it does leave my stomach feeling a bit full starting each lap.

Hanging with Travis, Jim Donahue, and Jeff Sona after lap 4
The next three laps are repeats of the first. I'm still very cold at the start of each (walk breaks in an open field don't do much to keep you warm) so I continue with the pace that is just a shade too fast. The same group of three follows me and we get some conversation going.

By the end of lap 4, the temperature has risen into the low teens and I feel like I can hang at the start finish for a bit rather than walking a full mile. After all, the social aspect of these things is a big part of the draw and, by the time I'm done, only a few folks will still be around.

On lap 5, I take a short walk break halfway through. The three I've been running with push on. It's the first time all race I haven't been in the lead. I'm not sweating, but I'm not cold either so I finish out the lap at an easier pace. I still get back in plenty of time to get in my extra mile. Lap 6 goes pretty much the same but, as I'm walking my mile I realize something has changed: I'm not enjoying this.

That's a bigger deal than it sounds. I love running long distances. A lot. To not be enjoying an easy pace run when I've not even hit 50K is usually an indication that something isn't right. There are plenty of candidates: the cold, the big fluctuations in pace (I normally take much shorter walk breaks), jamming down so much food each lap, a leg injury from two weeks ago that hasn't fully healed. But, as I mentally check those off, none of them seem to be the problem. It seems this just isn't the day.

Well, it looks like I'll get to train something else today: resilience. This is quite different from "toughness". Tough is when you willingly hurt yourself because that's what the competitive situation calls for. Resilience is quite the opposite: it's figuring out how to keep going without messing yourself up. It's what allows you to be in a position at the end where toughness will matter.

The ultra crowd
I've found the best way to do that is not to slow down my run pace, but mix in more walking. On lap 7, I run the first mile, walk the short incline at the beginning of mile 2 (about 60 seconds), take my usual mid-lap break on the small incline at the end of mile 2 and then walk all of mile four. This pacing has me finishing in a bit over 50 minutes, so I don't have time to walk between laps.

At 28 official miles (with my inter-lap walking, I've gone five or six more than that), we are now into ultra distance. Ten line up for the start of lap 8. I repeat my lap-7 pacing and finish the lap feeling quite a bit better about the run.

I bring my headlamp on lap 9. The sun won't set until 4:53 and I should be back at the start/finish by then, but one of my few hard and fast safety rules is to never go into the woods within an hour of sunset without a light. You just never know and any problem is made worse if you can't see what you're doing.

Sun's going down; sure you won't stay?
I don't walk all of mile 4, so I get back with a bit more time to spare. I walk a bit, but the wind has picked up and the field is really cold. The organizers are offering their usual enticements for us to stop. I hold off on the whisky, but am pretty sure I can down a piece of pizza and still run. Just before the start of 10, I get my handheld light out of the car. The trail isn't particularly technical, but it does have lots of trip hazards in the form of roots. The handheld does a much better job of exposing those.

I run most of the lap with Travis. He's debating whether to fight James Baca (3rd last year) into the night. James has never won this event, so he might be pretty motivated. With my total mileage now in the mid-40's, I decide I'll probably pack it in after this lap. I've gotten in some excellent pace training; the increased caloric intake, while somewhat uncomfortable, hasn't caused problems; I got in a night lap; and, most importantly, I haven't hurt myself for Rocky.

Two women and two men toe the line for lap 11. Well, a podium place is worth a little effort, even if it is in an informal event like this. I hop in just as the countdown to 6PM completes. Once again I run most of the lap with Travis. Since this will be my last lap, I don't walk it in and get back with about fifteen minutes to spare.

I confirm that only five people took the start and then pack up my stuff. I hang around the finish to say goodbye to those still in the race. Travis comes in just a few minutes after me. James and Rosemary LaRocca (who will go on to win with 13 and 12 laps respectively) come in just before the cutoff.

Then, just seconds before the cutoff, two more emerge from the woods. The others go off, but they head over to their supplies, refill bottles, get something to eat, and head back out. "They're not still in the race, are they?" I ask. I'm told that they are. How is that since they haven't made at least the last three starts? "That's not the rule." says the Race Director.

Well, actually, it is the rule. It's rather plainly stated on the event web site: "If you can't start the next lap on time you will not be allowed to continue." That's pretty unambiguous. But even if it wasn't stated explicitly, it's ALWAYS the rule. Intermediate time cutoffs always refer to the time you have to leave. Otherwise, you could hang out at an aid station for hours and then decide to get back into the race when all the support had been yanked. It's one of those rules that is so ubiquitous it doesn't need stating, like "you're not allowed to use a bicycle" or "no throwing rocks at other competitors."

I can be a bit of an obnoxious ass at the end of long efforts, but even I can see there's no upside in pitching a fit about finish order at an event like this. So, I settle for a loudly spoken, "That's bullshit!" and head on my way. It gnaws at me for most of the way home and then I let it go.

For what it's worth, several Metro Tri members have talked with me since and agree that the rule was not properly applied. They say that will be cleared up in the future. I hope that's true (and have no reason to believe it isn't) because, while it is just a fun local thing, it's not really very much fun to get beat by people you didn't even know were in the race. Meanwhile, I still had a great day on the trail and had promised myself I'd stop at 50 no matter what, which is exactly what I did. So, I can't say I was actually injured beyond being peeved for a half hour or so.

One of these years I might actually try to win this one. It's very much my kind of race. On the other hand, tonight's vexation with the rules notwithstanding, I've really been enjoying this sport much more since I quit competing locally. It might be a good idea to stay that course, at least until I've derived the variance of a finite-length martingale of partial sums from a series of correlated random variables. More on that tomorrow.

Saturday, January 13, 2018

Reluctant Bayesian

I had a professor at Cornell once say "the only good Bayesian is a dead Bayesian." She was joking, of course, but the sentiment had basis in reality. At that time, Cornell's stats folks were solidly in the frequentist camp.

My adviser, who did some research at Cornell after I left, tells me they (like just about everybody else) have softened their stance quite a bit and have come to at least accept it as a legitimate approach. That said, he also cautioned me about going all in on using Bayesian priors for our work because you end up having to defend the prior rather than the result you are presenting.

Fair enough but, while looking at drawing inferences from the Markov model today, it occurred to me that it might be a lot easier to analyze from a Bayesian point of view. There are several parameters that we don't know going in. The three most important are the mean and variance of the measures included in the query and the proportion of observations excluded. The sample mean, sample variance, and observed proportion are Maximum Likelihood Estimators (MLE) for those and will serve just fine. In an iid setting, that would be enough to get a point estimate and confidence interval on the sum.

In our case, it's not good enough. We need to know how the correlation is affecting those estimators and adjust the variance accordingly. That's not super easy to do. However, we can turn the situation around and ask: suppose the transition matrix for the Markov Chain is T; what's the likelihood that we see the data we're looking at? What happens when we move off of our estimates for mean and variance?
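To make "what's the likelihood we see the data" concrete for the partition-label part of the data (the measures would contribute their conditional densities on top of this; q and K below are just my shorthand for the stay probability and the number of partitions): for a homogeneous chain with transition matrix T and initial distribution \pi, an observed label sequence p_1, ..., p_m has likelihood

  L(T) = \pi(p_1) \prod_{i=2}^m T_{p_{i-1}, p_i},

and for the simple stay/switch chain this collapses to

  L(q) = \pi(p_1) q^{#stays} ( (1-q)/(K-1) )^{#switches}.

Likelihoods don't get much simpler than that.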

A full-blown MCMC simulation could give us a nice, robust, empirical distribution for the true sum based on messing with those parameters. Of course, in the time it takes to run that, we could just sample all the data, so that doesn't buy us anything. But, I'm thinking we might be able to leverage the fact that these likelihood functions are really simple to replace the simulation with some quick searches of the estimator space using traditional optimization techniques. If nothing else, it's a novel approach.

Friday, January 12, 2018

The Markov model for block sampling

Remember, I said this process is easy to model. I never said it was easy to analyze. So, I'm going to start with the easy part: the model.

We have a series of observations {(Xi, Yi)}, i = 1...n.  The distribution of Xi is dependent on Yi. More importantly, Yi is also used to determine if we should even be adding the value of Xi to our query.

We also have a partitioning function which divides the observations into groups where any two observations in the same partition are generally closer in distribution and generally about as likely to be included in a specific query. This matters because, rather than being randomly scattered throughout the data set, observations in the same partition tend to be co-located in the same physical block of storage. In fact, they are likely to be adjacent. This last bit is what allows us to model the observations as a Markov Chain; it means that if our current observation is from partition p, then that's all we need to know to have a probability that the next one is in p as well. If we knew they were just in the same general vicinity, then we'd need to either keep a longer history, or sort the block by partition (an expensive operation) to get them adjacent. If the next observation isn't in p, then all bets are off; it could be in any of the others.

So, the simplest form of the model is just the sequence of partitions {pi}, i = 1...n. The transition matrix is pretty simple. The main diagonal is the probability that the partition doesn't change. All the other entries are the probability that it does change, divided by the remaining number of partitions.
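Written out, with q for the probability of staying in the current partition and K for the number of partitions (placeholder notation, nothing official):

  T_{pp} = q,    T_{pp'} = (1-q)/(K-1) for p' ≠ p,

or, equivalently, T = q I + ((1-q)/(K-1)) (J - I), where J is the all-ones matrix. Each row sums to one, as it should.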

The problem with this model is that it only tells us about the sequence of partitions. We don't really care about partitions; we want to know things about the sequence {Xi IQ(Yi)}. Specifically, we want to estimate the first and second moments so we have a point estimate and a confidence interval of the sum. This is where the model can get very complex.

For starters, we're going to assume that the distribution of Xi doesn't change that much from one partition to the next. We're also going to assume that the conditional distribution of Xi given IQ(Yi)=1 is relatively constant. I don't think either of those is true, but I'm willing to go with them until the empirical evidence tells me that violating those assumptions is a big problem.

That leaves us with simply estimating, by partition, how many rows get kicked out of the query. Then, we look at estimating the variance in light of the fact that the hit rate for our block may or may not be representative of the overall hit rate. Once the hit rate is estimated, we can combine it with a simple sample variance on the included rows to assess the quality of our estimate of the total sum.

That's still not particularly simple, but at least it's reasonably well defined. I've got a three day weekend to make something of it.

Thursday, January 11, 2018

Confounding variables

A confounding variable is one that impacts both the independent and dependent variable in a relationship. If the confounding variable is not accounted for, a false causal relationship may be deduced. A common example would be to note that prison populations are disproportionately minority and conclude that minorities must be inherently criminal. The relationship is irrefutable, but the causation may have nothing to do with race. Minorities also suffer from lower economic status, less access to good education and general racial bias in the criminal justice system. All these things make it more likely for a person to turn to and be convicted of a crime. Studies that account for such things have shown that it is the environmental variables, not race itself, that are the better predictors.

We have something of an inverse problem with our current formulation. We know that rows within a block have correlation in the attributes. Those attributes are the independent variables that determine the distribution for the measures we are after. More importantly, those attributes may exclude the measure from the query altogether if they don't meet the query criteria. Thus, the variance of any block sum is higher than it would be if the rows were scattered randomly throughout the database.

The attributes are correlated because people tend to batch up data by like attributes. The batch doesn't cause the attributes to be correlated. Rather, it's the correlation that causes the data to be put in the same batch.

So, we're turning the confounding thing upside down as well and using batch id as a surrogate for "these things go together." This is generally valid, but it does leave open the possibility of having some sort of background analysis performed on the data to really determine which things go together. I have some vague notions on how one might do that based on analyzing the results of query patterns (similar to what D-BESt does, but with an eye to the measures as well as the attributes). However, that is not today's problem to solve.

But I don't want to box myself in by deferring it out of hand. So, I'm going to change my notion of a "batch" slightly and instead talk of a "partition". A partition is simply a map from the attribute vector to the positive integers. As batch id is part of the attribute vector, using it as a partitioning variable is simply a special case of the more general framework. By using partitions rather than batches, I make it easier to generalize the results.

I am giving up a little bit with this. With batches, there are assumptions about contiguous rows that won't be generally true with partitions (unless the data is re-blocked by partition as with D-BESt). I talked with my adviser today about how the contiguity of batches might be leveraged and neither of us came up with anything. Therefore, I'm not considering that a loss.

Wednesday, January 10, 2018

When in doubt...

... restate the problem! Yes, really, I'm going to reframe this thing yet again.

We have n observations where n is finite (important!). These observations are not iid. Instead, they come in batches. Within a batch, they can be thought of as iid. (Actually, even within a batch they aren't really independent, but it's true enough for sampling.) If we were to sample this whole population randomly with replacement, we could say that we are dealing with iid observations where the distribution is simply the entire population with each observation having a probability of 1/n. We then use all the standard techniques for parameter estimation. That's what BLB does.

Or, we could take a sub-sample, with or without replacement, and run a MCMC simulation to get an estimated distribution for whatever parameter we were after. That's what BMH does.

Both of those approaches work. Both of those approaches are hideously inefficient. Why? Because random sampling from large databases is hideously inefficient. Reading an entire physical block of data takes about the same amount of time as reading a single row within that block. Unless you're doing something really nutty, processing the entire block takes several orders of magnitude less time than reading. In other words, once you've read and processed the first row, the rest of the block is basically free.

The rub is correlation. The batches are loaded sequentially into blocks. Maybe there are some transforms that scramble things a bit, but if you look at the physical rows in just about any data store, you will see heavy correlation between adjacent rows because they very likely came from the same batch.

The impact of this is to increase the variance of any estimator we might be calculating. If we pick a block of 100,000 rows, and 90% of those rows came from the same batch, we are really computing an estimator for just that batch, not the entire population. Other batches might be very different. Therefore, if you treat a block like a true random sample, the estimator will seem much more stable than it really is.

OK. I knew that before I even started working on this problem. It is, in fact, the original idea I was trying to work on. I just got sidetracked on this heavy-tailed stuff because it kind of blind-sided me. So, let's take a deep breath and look at the original problem and this time we won't beg out of it by switching to block sums (although, I still think that's a great practical trick).

We start with the definition of a query I used the last time I restated things: Q(X,Y) = Σ Xi IQ(Yi).

Now, let bi = Yi,1 (the first component of Yi) indicate the batch to which observation (Xi, Yi) belongs. The distribution of Xi is a function of Yi. Although Yi isn't usually thought of as a random variable but rather as a set of independent inputs, in this context it can very much be modeled as a random vector with a distribution determined by bi. Thus, Xi IQ(Yi) is a random variable with a distribution determined by bi. The query is just the sum of these guys. If we've sampled k < n observations, all we need to know to make a statement about the full population is how much variability remains in the other n-k observations.

That's hard to do because the remaining observations are all in unsampled blocks and are therefore correlated. However, their distributions, while unknown, are functions of a single attribute, b. And {bi} just happens to be a Markov Chain! Yes, this is the big revelation of the night. The variables themselves are, at best, a really difficult-to-model Markov Chain. But, the distributions of those variables form a really easy-to-model Markov Chain. Just take the number of batches, divide by n, and that's the probability that the batch changes at any point on the chain. By estimating the first two moments of Xi IQ(Yi) by batch and the variance of those moments between batches, we can then estimate the first two moments of the sum of any arbitrarily long chain. That gives us both our point estimate and confidence interval (or HDI, if you prefer Bayesian terms).
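Spelling out the easy pieces (with Z_i = Xi IQ(Yi), B the number of batches, and \mu_b, \sigma_b^2 the within-batch mean and variance of Z — notation just for this sketch):

  P(b_{i+1} ≠ b_i) ≈ B/n, so the expected run length is n/B, the average batch size;

  E[Z_i] = E_b[ \mu_b ],    Var(Z_i) = E_b[ \sigma_b^2 ] + Var_b( \mu_b ).

The covariances between nearby Z's, which depend on those run lengths, are the part that still needs real work.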

Yes, there's some serious math to do. But, that's a good thing. And, Stochastic Processes was actually one of the few classes I took at Cornell where I was one of the students giving rather than receiving assistance from classmates. This stuff is very much hitting my strong points. (Or, at least it was and should be again once I dust off my 32-year-old copy of Karlin & Taylor).

Competing goals

I'm pretty much stalled on my research again. Some of this has been due to a year end deadline at work, but mostly it's been that I've been having a real hard time modifying the thrust of the paper. Simply put, my adviser and I have competing goals. He wants something with academic merit; I want something that works. The intersection of those two is turning out to be a lot smaller than I had realized.

The truth is that, once you've stratified the data, you're done. The sampler works, period. Sure, we can refine the variance estimate a bit, but that's really splitting hairs. Nobody would use this or any other sampler for financial planning and reporting; there are no generally accepted regulatory practices that allow that. The use of this algorithm is in exploratory analysis where "good enough" is just that.

Simply stratifying the data and then performing a method of moments estimation of the variance is not the stuff of academic journals. It's more like something you'd publish in an industry white paper. Or, perhaps, as a chapter in a dissertation. So, while CISS is a huge win on the practical front, it appears it's just not worthy of a stand-alone publication in a peer-reviewed journal. Trying to make it such is just complicating the algorithm which, in turn, makes it less useful.

All that said, we have turned up some interesting stuff along this road. The original line of inquiry was how to deal with the correlation of the individual observations. Stratifying and blocking the data obviated that discussion for the sampler, but that doesn't mean the discussion is pointless. The BMH and BLB algorithms both have real challenges with correlated data, even when it's not heavy tailed. Shifting the focus back to that problem and showing how we can sample by block rather than by row even in correlated data has both academic and practical merit.

So, I think I'm going to punt on the practical for now and go where my adviser has been trying to push me for the last few months. We've got some interesting direction with the Markov approach and most of the theory section from the CISS paper is still applicable. A pivot will require a rewrite of a few sections, but it doesn't look like more than a few days of work to me. Meanwhile, the code is written in such a way that you can set the strata = 1 and it all runs fine, so I don't have to redo the implementation right now (at some point, I am going to re-host it as a fully-parallel algorithm on our HDFS cluster, but I think that can wait for the moment).