Thursday, March 31, 2016

Log Gamma approximation

By the standard I set out yesterday, C#'s math library is less than decent. I was a bit surprised that even the stats library didn't have a function for the log of gamma. Fortunately, there's no secret to the approximation. Most implementations use a variant of the Stirling Approximation, which can be super accurate if you want it to be. I don't really need much more than a few decimal places, so I coded up the Nemes version, which only uses a few terms and, according to my tests, gives six digits of accuracy when z is greater than 1.5. The CISS confidence intervals are pretty meaningless without at least one block sampled, so the lower bound on the range I care about is 2.0. You can read the wiki page (or the source article, if you know German) for the mathematical details. I'll simply post the code for my C# brethren:

        // log gamma function using simplified Stirling's formula
        // Gergő Nemes: New asymptotic expansion for the Gamma function
        // Archiv der Mathematik, August 2010, Volume 95, Issue 2, pp 161-169
        public static double LogGamma(double z)
        {
            return (Math.Log(2.0 * Math.PI) - Math.Log(z)) / 2.0 +
                z * (Math.Log(z + 1.0 / (12.0 * z - 1.0 / (10.0 * z))) - 1.0);
        }
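As a quick sanity check (my own example, not from the original post), the approximation can be compared against a point where the exact answer is known, since Γ(5) = 4! = 24:

        // Γ(5) = 4! = 24, so LogGamma(5.0) should land very close to ln 24.
        Console.WriteLine(LogGamma(5.0));    // ≈ 3.17805
        Console.WriteLine(Math.Log(24.0));   // exact: 3.1780538...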


Wednesday, March 30, 2016

Closed form estimates

There's a reason Bayesians have been using the Beta distribution for Binomial and Bernoulli priors for so long: it does tie up into a closed form rather nicely. It may turn out to be not the best choice, but if it is workable, I don't have to do any numerical estimation to get the variance on CISS results.

Recall from yesterday that, if X is a random variable representing the query sum from a block:

E(X|θ) = μ(θ) = θnbk/2

E(X) = μ = ∫μ(θ)P(θ)dθ    where P(θ) is the posterior on θ

Factoring out the constant from μ(θ), this simply becomes:

μ = (nbk/2)∫θP(θ)dθ = E(X|block contains data)E(θ)

Well, duh. As long as E(X|block contains data) is independent of θ, that's always going to be true. Unfortunately, that's a flimsy assumption. If the query attributes are highly correlated within a block, θ drops and E(X|block contains data) rises. But, we're deferring that for the moment. We will have to revisit it when we get to dynamic partitioning because then we'll be intentionally increasing the negative correlation between the two. (We may have to look at it sooner; I don't trust any assumptions until I run real data past them).

At any rate, as long as we're using a distribution with a closed-form mean (Beta certainly qualifies), we can move on to the variance:

Var(X|θ) = σ(θ)² = E(|X - μ|²) = (1-θ)μ² + θ[(nbk)²/3 - nbk·μ + μ²]

Var(X) = σ² = ∫σ(θ)²P(θ)dθ

Before we go plowing ahead with the integration, let's simplify our symbols a bit by letting p = μ² and q = (nbk)²/3 - nbk·μ + μ². This helps us see that these terms are just constant coefficients and can be moved in and out of the integral, giving:

σ(θ)² = (1-θ)p + θq

σ² = ∫σ(θ)²P(θ) dθ = ∫[(1-θ)p + θq]P(θ) dθ

Substituting in the density for θ ~ Beta(a,b), which is θ^(a-1)(1 - θ)^(b-1)/B(a,b), gives

σ² = (1/B(a,b)) ∫[(1-θ)p + θq]θ^(a-1)(1 - θ)^(b-1) dθ

   = (1/B(a,b)) ∫ p·θ^(a-1)(1-θ)^b + q·θ^a(1 - θ)^(b-1) dθ

   = [p∫θ^(a-1)(1-θ)^b dθ + q∫θ^a(1 - θ)^(b-1) dθ] / B(a,b)

   = [pB(a,b+1) + qB(a+1,b)] / B(a,b)   where B(a,b) is the Beta function.

Now, before you accuse me of cheating and simply burying the integrals inside a predefined function, recall that:

B(a,b) = Γ(a)Γ(b) / Γ(a+b)

and that the Γ function is simply an extension of factorial to the real number line:

Γ(i) = (i-1)!

So, if a and b are integers:

B(a,b) = (a-1)!(b-1)! / (a+b-1)!

and

σ² = {p[(a-1)!b! / (a+b)!] + q[a!(b-1)! / (a+b)!]} / B(a,b)

OK, factorial isn't technically closed form, but since we're dealing with block counts rather than row counts, we're not talking about massive looping to get answers. In fact, while it's no problem to set the prior so I only get integer parameters a and b on the posterior, any decent math software library has a closed form approximation of the Gamma function, so I could just call that.
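To make that concrete, here's a minimal sketch (my own, not the actual CISS code) that evaluates σ² = [pB(a,b+1) + qB(a+1,b)]/B(a,b) on the log scale, using the LogGamma approximation from the March 31 post above:

        // Block-sum variance given the Beta(a,b) posterior on θ and the
        // constant coefficients p and q defined above.
        public static double BlockSumVariance(double p, double q, double a, double b)
        {
            // log B(x,y) = log Γ(x) + log Γ(y) - log Γ(x+y)
            double logBab  = LogGamma(a) + LogGamma(b) - LogGamma(a + b);
            double logBab1 = LogGamma(a) + LogGamma(b + 1.0) - LogGamma(a + b + 1.0);
            double logBa1b = LogGamma(a + 1.0) + LogGamma(b) - LogGamma(a + b + 1.0);
            return p * Math.Exp(logBab1 - logBab) + q * Math.Exp(logBa1b - logBab);
        }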

Tuesday, March 29, 2016

E Pluribus Unum

Don't panic, I'm not switching this over to be a political blog. It did occur to me though, that the nature of my data correlation allows for a nice trick which may simplify matters considerably. To review:

  • We want a weak prior on the proportion of values returned.
  • We want a somewhat stronger prior on the magnitude of those values.
  • Unless we adjust for correlation, the posterior on the proportion will be too tight because we'll give too much credence to the first data points.
  • Correlation is strong within a block.
  • Correlation is weak (nonexistent?) between blocks.
  • All we really care about is the sum of values out of a block (actually, we only care about the sum for the whole stratum, but we can only read one block at a time so we have to put that sum somewhere).


So...

Why not simply ditch the individual rows altogether and just work with block sums? Those are very nicely distributed random variables:

P(X = 0) = 1-θ, and with probability θ, X ~ U(0, nbk)
where U(0, nbk) is uniform between zero and the stratum upper bound times the block size. (In the case of the negative strata, it will be U(-nbk, 0), which flips the sign on the expected value but doesn't change the variance.) Thus,

E(X|θ) = μ(θ) = θnbk/2
E(X) = μ = ∫μ(θ)P(θ)dθ    where P(θ) is the posterior on θ

and

Var(X|θ) = σ(θ)² = E(|X - μ|²) = (1-θ)μ² + θ[(nbk)²/3 - nbk·μ + μ²]
Var(X) = σ² = ∫σ(θ)²P(θ)dθ

If the second term in Var(X|θ) looks weird, that's because we're looking at the variability around the posterior mean, E(X), rather than the mean of the uniform random variable. If you substitute nbk/2 for μ, you get the familiar uniform variance of (nbk)²/12 multiplied by θ for that term, but that puts us right back where we were last Friday.
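As a minimal sketch (my own helper names; `upper` stands in for the nbk product above), the conditional moments are just a direct transcription:

        // Conditional mean and variance of a block sum given θ.
        // upper = the stratum upper bound times the block size (nbk above);
        // mu    = the posterior mean E(X) we are measuring variability around.
        public static double BlockMeanGivenTheta(double theta, double upper)
        {
            return theta * upper / 2.0;
        }

        public static double BlockVarianceGivenTheta(double theta, double upper, double mu)
        {
            // (1-θ)μ² + θ[upper²/3 - upper·μ + μ²]
            return (1.0 - theta) * mu * mu
                 + theta * (upper * upper / 3.0 - upper * mu + mu * mu);
        }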

Even with something as computationally convenient as the Beta distribution for θ, those integrals are going to be messy. In the early going, I can get by with a rough approximation. As the confidence interval gets closer to the goal, I'll need to be more accurate. Fortunately, the distribution of θ will tighten up as well, so I may not have to integrate over the entire domain.

Also note that the first integral is just a constant times the variable times the density. The second isn't much worse: μ is fixed, so the integrand is just an affine function of θ with constant coefficients; it's really not that bad.

In exchange for the additional computation, this allows me to intentionally increase the correlation within a block, which I certainly do intend to do when we get to dynamic partitioning. Being able to know that a block can be skipped because the partition contains no data in the query filter will more than pay back grinding out a numerical integral. As long as I'm just concerned with the sum, I can still compute a confidence interval without a bunch of correlation adjustments.

Next step is to get these ideas into CISS and test them out. Code modifications should be pretty straightforward. Really hoping to have a stable version this week so I can get to writing.

Monday, March 28, 2016

Prior resurrected

It was nice to take a couple days off for Easter. Classes are off all week for Spring Break, but I'm back on it.

I know why my confidence intervals are too tight. This should have been obvious from the get-go, but sometimes you need to take a step back to see the obvious. The confidence intervals are based on the distribution of the values given θ and λ. I just took the point estimates for those two. However, they aren't constants; they're random variables themselves. So, I really need to compute the distribution given θ and λ and then integrate that over the joint distribution of the inputs. That's not terribly hard to do, but it does bring us back to the problem that I dismissed too quickly: I don't even have good marginal distributions for θ and λ. I don't have a clue what the joint distribution would be.

So, the first thing I'm going to do is take lambda out of the equation. Rather than stratify on absolute value, I'm going to stratify on value. Thus, within a stratum, values will be considered uniform between the stratum bounds and the big question is how many get picked up by the query (θ).

Because this doubles the number of strata, I'm going to back off on pinning the magnitude to 2^k and instead allow for configurable bounds on each stratum. I've got some ideas on the optimal way to do that, but for now, I'll just set them manually.

That leaves the prior on θ. If I'm really going to use this as a distribution for purposes of confidence intervals, I need to play by the rules and compute a real posterior based on likelihood. So, the fast-converging point estimate isn't going to cut it. I think I can still get around the problem by letting observations in adjacent strata inform the posterior. So, we start with a fairly neutral prior that reflects the variability due to the fact that we don't know the value of θ and the posterior gets updated anytime we sample a block from strata k-1, k, or k+1. This will speed the convergence, while still preserving a lot of variability, at least in the early going.

Although mathematical convenience makes it tempting to use the Beta distribution as the prior, I think it converges much too quickly, especially since we have correlation within blocks. I'm going to have to work something else out. This seems like a good application of Gibbs sampling. I can try a bunch of different distributions and see which ones behave right.

Friday, March 25, 2016

Overshot

Looks like I tightened up the confidence interval just a bit too much. Here's the convergence plot for the same query posted earlier in the week. This time the uncertainty bounds are "real" confidence intervals (except that they're wrong). Ideally, the true value would be between the blue lines 95% of the time.

The convergence path is just a bit slower due to the fact that the algorithm thinks it's doing better than it is, so it's not picking strata as well. More importantly, the stopping rule is the width of the confidence interval, so the algorithm terminates before the desired confidence has been reached. I had the stopping width set to $1000 to ensure it ran out nearly to the end (note, the scale on the left is in billions). For an all-expense query like this, the actuaries would probably want things to the nearest $50 million or so, which would end it at block 282, but it was still off by $225 million at that point.


Not sure which of my assumptions is being violated in a bad way, but I'm going to leave it at this for the week. Definitely good enough progress to turn my attention to far greater matters for a few days. Happy Easter; I won't be posting again until Monday.

Thursday, March 24, 2016

Prior on hit rate

Despite a couple drawbacks, the Beta distribution is the obvious choice for my prior on the proportion of rows returned by a query. For those who haven't done much with Bayesian stats, the Beta is a favorite for proportions because it's a "conjugate prior", meaning that the posterior distribution is of the same form. This makes iterative application pretty easy.

It's more than just a mathematical convenience; it really does do a nice job of distributing uncertainty. Don't have a clue? Use Beta(1,1) and your posterior is simply the proportion in your data with confidence intervals that match "traditional" frequentist models. Sort of sure, but willing to have your mind changed? Use Beta(a, b) where a/(a+b) is your believed proportion and a+b indicates how sure you are of that. If your data size is greater than a+b, the posterior will be biased towards the data. If your data size is smaller, the prior will show through more strongly. The graph below illustrates:

Here, the prior indicates a belief that we should get 16 hits for every 36 misses. We sample the data and get 57 hits and 43 misses. The posterior now predicts (16+57) hits for every (36+43) misses. The confidence interval has been tightened appropriately. We could now apply more data using the posterior as the new prior and get the same final result as if we had applied all the data at once. Convenient, intuitive, and flexible.

But, the real world is often messier than that. Because these aren't truly random samples, but samples of blocks of correlated rows, I need the posterior to reflect a pessimistic view of the uncertainty. Specifically, I want it to assume that there's a lot of relevant data out there that we simply haven't gotten to yet (until the data we have gotten to overwhelmingly suggests otherwise). That would suggest using Beta(a, b) where a is very large and b is very small, even though I don't really believe that. Specifically, I'd like to assume that EVERY row is included. The problem with setting a prior to 0% or 100% is that it's no longer a distribution at all, simply a statement of fact. As such, any data that doesn't match that fact has a likelihood of zero and the posterior degenerates back to the prior. (There's a Bayesian "proof" of God's existence that counts on people missing this detail).

There is an out. I don't actually need a prior on θ for the first iteration. I only need the point estimate to compute the variance of the observations. I can just set it to 1 and compute the confidence interval for the stratum (which will obviously be quite wide). Then, after a block has been sampled, I can set the posterior to Beta(N+h, m) where h and m are the number of hits and misses from the first block and N is the total number of rows.

The rub is if the first block has no misses. Then I'm back to a degenerate prior. That's not really terrible. If I just continue to add hits and misses to the distribution, the function will work; it just doesn't make a lot of intuitive sense.

The real problem is that this isn't going to collapse fast enough. Remember, I don't really believe this prior. I'm just setting it that way to keep the algorithm sampling. After sampling all the data, I'll have a distribution on θ of Beta(N+H, M) where H and M are the total hits and misses for the stratum. That's clearly nonsense because everything has been sampled, so there's no uncertainty left. It should be just θ = H/(M + H). So, I'm going to use a little sleight of hand. Instead of treating the sampling as cumulative, I'm going to treat it as replacing. That is, a sampled block will now replace an unsampled block in the prior, not the posterior. Under that scheme, the prior becomes Beta(N-(h+m), 0) and the posterior is Beta(N-m, m). Repeating this over all subsequent samplings will result in a final state of Beta(H, M), which is exactly what we're looking for.

Mathematically, this is a bit bogus and I'd not feel good about it if I was giving a confidence interval on θ. However, I'm not. I'm just using a point estimate of θ to compute a confidence interval on the stratum sum. So, at each step, I'll just use θ = (N-M)/N where M is the cumulative misses so far. This gives a well-defined point estimate at each step and converges to the real answer if we end up sampling the entire stratum. Given that, I don't need to stress over the actual distribution.
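A minimal sketch of that bookkeeping (my own names, not from the CISS code; `totalRows` is N and the miss count is cumulative for the stratum):

        // Replacement-style update described above: a sampled block replaces an
        // unsampled block in the prior, so the running posterior is Beta(N - M, M),
        // where N is the total row count and M is the cumulative miss count.
        public static void ReplacementPosterior(long totalRows, long cumulativeMisses,
                                                out double a, out double b)
        {
            a = totalRows - cumulativeMisses;
            b = cumulativeMisses;
        }

        // Point estimate actually used for the stratum-sum confidence interval.
        public static double ThetaPointEstimate(long totalRows, long cumulativeMisses)
        {
            return (double)(totalRows - cumulativeMisses) / totalRows;
        }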


Wednesday, March 23, 2016

Variance by strata

It's just symbol manipulation, but there's enough of it to be worth writing down. So, I'll write it here. This is the derivation of the variance for an individual observation from stratum k.

P(X = 2^k) = θλ
P(X = -2^k) = θ(1-λ)
P(X = 0) = 1-θ

stratum mean μ = ΣxP(x) = 2^k·θλ - 2^k·θ(1-λ) = 2^k·θ(2λ - 1)

stratum variance σ² = Σ(μ-x)²P(x)

   = θλ(μ - 2^k)² + θ(1 - λ)(μ + 2^k)² + (1 - θ)μ²

   = θλμ² - 2θλμ·2^k + θλ·2^(2k) + θ(1-λ)μ² + 2θ(1-λ)μ·2^k + θ(1-λ)·2^(2k) + (1-θ)μ²

   = μ² + 2θ(1-2λ)μ·2^k + θ·2^(2k)

Substituting in the formula for the mean to get the variance just in terms of θ and λ is not enlightening and doesn't help the computational stability, so I'll leave it as an exercise to any reader that just really likes moving letters around.
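For reference, a direct transcription of those two formulas (my own helper names, with the magnitude of a stratum-k value taken as 2^k per the standing assumption):

        // Mean and variance of a single observation from stratum k,
        // given the hit rate θ and the positive/negative split λ.
        public static double StratumObservationMean(double theta, double lambda, int k)
        {
            double mag = Math.Pow(2.0, k);    // assumed magnitude of a stratum-k value
            return mag * theta * (2.0 * lambda - 1.0);
        }

        public static double StratumObservationVariance(double theta, double lambda, int k)
        {
            double mag = Math.Pow(2.0, k);
            double mu = mag * theta * (2.0 * lambda - 1.0);
            return mu * mu + 2.0 * theta * (1.0 - 2.0 * lambda) * mu * mag + theta * mag * mag;
        }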

Tuesday, March 22, 2016

Impact of correlation

I glossed over the whole independence problem yesterday, but it will need to be dealt with. Fortunately, independence (which I am sure is NOT true) is not really the issue. What we care about is correlation. We'll start by claiming that the variance of the sum is, in fact, finite. This seems pretty safe since, even with the heavy tail, financial metrics are bounded by the total amount of money in the world and the variance of any bounded random variable is finite. In that case, the variance of the sum is:

Var(ΣX_i) = ΣΣCov(X_i, X_j) = ΣVar(X_i) + ΣΣCov(X_i, X_j)    where i ≠ j in the last term

So, if they're uncorrelated, this is all easy; the variance of the sum is just the sum of the variances (any introductory stats book will tell you this). Problem is, they're not.

The value of θ_k (the proportion of rows included in a query from stratum k) varies from one block to the next, and not just randomly. There is real correlation within a block even after stratification. That's not too terrible to work around since that's really just affecting the posterior variance of θ_k, not the expectation. We can still use the point estimate of θ_k as input to our posterior distribution for the stratum sum and get an unbiased estimate of the sum variance from that. Looking at the whole stratum, the values are not correlated (or, at least, I have no reason to believe they are). However, across strata, they are most definitely correlated. Remember that whole reversing entry thing? A bunch of detail rows get offset by one big reversing entry. This means that entries in one stratum tend to correlate to a smaller number of larger entries in a higher stratum with the sign reversed.

So, if we assume that we sample enough blocks that we have no correlation issues within a stratum, the variance for the stratum sum is:

Var(ΣX_i) = ΣVar(X_i) = n_k·Var(X_k)     over all i in stratum k

and the formula above for the total sum becomes:

Var(ΣX_i) = Σn_k·Var(X_k) + ΣΣn_k·n_j·Cov(X_k, X_j)    where again k ≠ j in the last term

I hope it's clear that the right hand side is now summing over strata, not individual observations. Putting the summation indices into HTML is more work than it's worth for a blog post.

So, the big question becomes: what is Cov(X_k, X_j)? For the moment, I'm going to say it's close enough to zero that I can leave it out. However, I'm actually a little pleased that's not the right answer. Everything else in this project has been merely implementation of a somewhat clever algorithm. That's all well and good, but if I'm going to submit this for publication, it will help to have some actual math in there, too.
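Under that zero cross-strata covariance assumption, the estimator variance collapses to a simple weighted sum. A minimal sketch (my own names):

        // Variance of the total sum assuming Cov(X_k, X_j) ≈ 0 across strata.
        // perObservationVariance[k] is Var(X_k); counts[k] is n_k, the number of
        // observations in stratum k.
        public static double TotalSumVariance(double[] perObservationVariance, long[] counts)
        {
            double total = 0.0;
            for (int k = 0; k < perObservationVariance.Length; k++)
                total += counts[k] * perObservationVariance[k];
            return total;
        }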

Monday, March 21, 2016

Confidence interval for strata variances

I'm not particularly worried about the distribution of the query results from each of the strata. Simply using the data so far divided by the proportion of rows sampled gives an unbiased point estimate. However, I can't very well call this algorithm "Confidence-based" if I don't compute real confidence intervals. So, that's where I need to go.

I'm not going to do the derivations right now, but here's the general plan:

First off, note that the actual magnitude of the measurement in each stratum is fairly tightly constrained. As such, that's not a big source of variance. So, we'll just assume that the absolute value of all measures in stratum k is roughly 2^k.

That leaves just the number of measures that meet the query criteria as the big source of variance. We know the total row count for the stratum going in. What we don't know is how many of those rows are included in the query. Nor do we know the sign. So, what we really have is a sum of identically distributed random variables which can take on values of 0, -2^k, and 2^k. They aren't really independent, but I don't think that assumption is violated in a way that matters.

Knocking out the zeroes is easy. Let θ be the proportion of rows in the query. We start with a prior that nearly all the rows are included in the query result (we can't set it all the way to 100% or the posterior becomes degenerate). As blocks get sampled, we update that with the posterior given the rate that we've been including rows. There are several possible distributions that could be used for the prior on θ. The Beta distribution is the obvious choice, since it forms a conjugate prior, but it may not be the best one.

The positive/negative split, which we'll call λ, is more problematic. I think the most defensible position is to set the prior to the overall +/- proportion of the stratum. Again, there are a number of prior distributions to choose from and only some empirical testing will indicate which is best.

So, each value in stratum k is now a ternary random variable X_ki where P(X_ki = 0) = (1-θ), P(X_ki = -2^k) = θ(1-λ), and P(X_ki = 2^k) = θλ. We then just compute the variance for the sum of all the unsampled values across all strata (obviously there's no variability in the estimate for the rows we have already sampled).

Sunday, March 20, 2016

Convergence plot

Granted, for purposes of demonstration, I got a bit lucky in that my random sampling algorithm missed a big 1-sided entry until late in the game. But, that's just the point, random means things like that will happen. CISS, on the other hand, cranked out the big uncertainty items much earlier on. Here's a graph I put together showing the convergence of CISS versus a single random run. Obviously, it would be more rigorous to perform a few thousand random runs and look at the population convergence, but that's not really a big concern right now.

What is a big concern right now is the uncertainty bounds, which are clearly way too conservative. I'm calling it "uncertainty" rather than a confidence interval because right now it's just a heuristic that will nearly always contain the true value, not something based on an actual probability. That will be the main point of discussion with my adviser tomorrow. I'm pretty sure I can specify the variability as a real Bayesian prior and then work out the posterior, even without knowing the underlying distribution. The reason for this is that, since the data is stratified, I already know the distribution of the data in each stratum: it's basically uniform with a range from 2^i to 2^(i+1), where i is the stratum number. What I'm estimating is how many of those rows are included in this query result. That's just estimating a Bernoulli success rate using repeated samplings. If I have a confidence interval on the success rate, I also have a confidence interval on the variability of the number of hits. Multiply that by the average magnitude for values in that stratum and you've got a pretty good handle on the variability of the estimated sum.

Side note on this data: if you're wondering why the random sampler appears to show so much less volatility in the path, that goes back to the original motivator of this research: correlation within a block. These are accounting entries. If you process them in blocks where the blocks represent sequential entries, you tend to get the reversing entry along with the detail. So, blocks tend to sum to zero, which means the estimate doesn't move much until you get to something where the entire entry didn't all go into one block or the reversing part is excluded from the scope of the query. (This latter case is how you wind up with profit in a year even when all entries have to balance; the reversal is in the next fiscal year). When the data is stratified, the reversing entries (which negate hundreds of detail entries) get put into different strata, so stratified sampling will show just half the entry until the rest of the details are found. That leads to very high volatility at first, but better convergence in the long run.

Saturday, March 19, 2016

2005 Missouri Road Championships

Yes, really, I used to race bikes! This week's off-day throwback race report is from State Champs.

Raced May 21, 2005

I traveled north to Clarksville, MO (population 490) for Missouri Road Cycling Championships. I entered the 40-mile old guy race since my form isn't really where it needs to be to go 100 miles with the open field. I only have one Big Shark teammate (Joe Walsh) with me in the M40+ field.

Other than being in the middle of nowhere, the Clarksville course is without fault. The course opens with two (some would say three, but the third is much smaller) climbs followed by about 10 miles of rolling terrain. After that is a flat section that always seems to have a cross wind. The lap finishes with the feed hill which can be done in the big ring if you're serious about getting away. From the top of the feed hill it's about 800 meters, mostly downhill, to the finish. Each lap is roughly 20 miles although no claims have been made as to the accuracy of that measurement.

The field moves easily over the first climbs. Nobody wants to mix things up so early. About halfway through the rolling section the attacks begin. None gets very far away, but the sorting has begun. Everyone takes a break on the flat section and we end the first lap all together.

On the first climb of the second lap, the pace is considerably higher, but no serious attempts to escape are made. On the descent, I speculate to Joe that the field may well stay together the whole way. He agrees.

I go to the front at the base of the second climb, hoping to at least shed the back half of the field so we have a more manageable group for the sprint. About 400m from the top comes a violent reaction as the top riders are sensing that the field may be ready to crack. It's all I can do to match the acceleration, but I do get on the back of the group. By the top we have 10 riders clear.

As we are getting organized on the descent, Joe finds himself adrift off the front. Although it's a long way to go, he seizes the opportunity and quickly opens up a decent gap on the remaining nine. Three riders bridge across. I would like to join them, but I'm afraid the gap isn't large enough yet and if two Big Shark riders are in the break, the whole group might come up. I sit on the back of the line, waiting to see who will chase.

The answer turns out to be nobody and while the lead group makes their exit, we are joined by three from behind. I'm none too pleased by this turn of events, but that's the way things go in road racing. At least, with a teammate in the break, I can sit on the back and rest.

The pace stays high enough that we aren't caught by any more stragglers, but  the gap to the leaders holds. Approaching the feed hill, I get back up to the front just in case someone tries to go clear. I set a stiff tempo up the first part of the hill and that's enough to dissuade any attacks.

I drop back to fifth in line as we start heading down the hill to the finish. The windup develops slowly and we are on the flat section before the real sprinting begins. At 200m, I move to get on the front, but have to tag the brakes hard as the rider ahead of me goes for the same hole. I get extricated with about 100m left, but don't have quite enough distance to get around the last two riders. Waiting so long to go was a dumb mistake given the relatively slow windup, but overall I'm pretty happy with third in the sprint. Physically, my form is further along than I thought.

Joe ended up getting third overall. It's nice to know that my patience at least helped a teammate get a medal. He deserved it - it was a gutsy move to drop the hammer solo so far from the finish.

Friday, March 18, 2016

Convergence!

Don't have much time to write today, so I'll just mention that CISS seems to be converging quite nicely on the sample data. Actually, just stratifying the results greatly accelerated convergence using more conventional sampling, but when we get to the full data set (which, at 50 billion rows, is five orders of magnitude larger than the sample), simply stratifying won't do any better than the index lookups we're currently using. Hopefully CISS will, though it's going to be a while before I get this hooked to the full set.

Meanwhile, I still have the non-trivial task of deriving real prior and posterior distributions for my result. Without a mathematically defensible confidence interval, this thing is useless to the actuaries.

Full discussion and pretty graphs of convergence will also follow in the next few days.

Thursday, March 17, 2016

Stratified Sampling background

I did a quick Google search to see if anybody already had an algorithm named Confidence-based Incremental Stratified Sampler. I didn't get any hits, but I did turn up a really relevant paper from Ernst & Young of all places. The reason it's so relevant is that it's stratified sampling of financial data and they indicate that they see the Gamma distribution all the time. (I'm still not sure if my data is Gamma or Chi-squared, but both have the problem of heavy tail).

Rotz, Wendy, et al. "PPS vs stratified sampling in modern audits." Joint Statistical Meetings, 2002.

Anyway, it's a relevant hook into the financial literature on sampling, so that's a good thing in and of itself. I'll add it to my citation tree.

Wednesday, March 16, 2016

For lack of a better term

Or post, for that matter. Realized that I didn't post anything yesterday. There's no law that says I should, but I like to get one out every day. So, I'm double posting today. And, since I haven't been able to come up with anything better, I am now naming my algorithm "Confidence-based Incremental Stratified Sampler" (CISS).

I may change that at some subsequent date, but it is a somewhat catchy acronym and a faithful description of what it actually does, so it's probably a keeper. Now I just need to get it working and then get to writing. Time's running out on this semester.

Things that still need to be done:
  • Compute prior. For now, I'm just going to start with the overall distribution of the data as the prior. I'll need that as a fallback anyway since not all data sets will have easy to compute priors for subsets of the data. The prior will actually sit on each stratum; that is, I'll have a parameter S_i which is the absolute sum of the values for stratum i. So, for the first step, this is just the row count for the stratum times 2^i (the average magnitude of values in the stratum).
  • Compute the posterior. Again, we'll start with the simple one. We'll just weight the prior against the actual data we've collected so far. An important caveat is that when we have sampled the entire stratum, we crank the confidence interval down to a point.
  • Compute the next stratum to sample. This is easy to do; simply assign a probability to the strata. The only question is whether it will converge faster if we assign the probability proportional to the variability (width of the confidence interval) or the mass of the distribution. I'm hoping for the first because that presents fewer edge cases (a fully sampled stratum will get zero probability of further sampling); a quick sketch of that option is below. But, I suppose I really should check both techniques.
None of this is particularly hard, so I'm still hoping to get it ready to present to my adviser tomorrow afternoon. Might be a late night, though.
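Here's the sketch of the selection step, weighting strata by confidence-interval width (my own names, not the actual CISS code):

        // Pick the next stratum to sample, with probability proportional to the
        // current width of each stratum's confidence interval. A fully sampled
        // stratum has width 0 and therefore never gets picked again.
        public static int PickNextStratum(double[] ciWidths, Random rng)
        {
            double total = 0.0;
            foreach (double w in ciWidths) total += w;
            double roll = rng.NextDouble() * total;
            for (int i = 0; i < ciWidths.Length; i++)
            {
                roll -= ciWidths[i];
                if (roll <= 0.0) return i;
            }
            return ciWidths.Length - 1;   // guard against floating-point round-off
        }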

Stats HW3

More fairly straightforward applications using R and BUGS.

Final version can be viewed here.

Monday, March 14, 2016

Thoughts for summer research

I need to figure out what I'm going to do for a summer research project. I don't want to take a formal class because I'll be prepping for the Qualifying Exam. I don't want to squander the summer, either. So far, two applied projects are appealing to me:

  1. Implement my heavy tail estimation algorithm on a scalable platform (most likely Hadoop). Having a truly working implementation would greatly increase the chance of getting this spring's work published.
  2. Use machine learning to create a mapping engine that takes LIDAR and orthorectified photos and creates orienteering maps. Pretty much zero market for this (though it might get published just because it's an oddball app). It's just something that I think would be cool. Having spent 400 hours mapping Cuivre River state park a few years ago, the idea of an algorithm that would do even half of that is pretty appealing.

Sunday, March 13, 2016

Description versus explanation

Most people believe that Galileo's big contribution to science was noting that the Earth revolved around the sun. That's not really true. Astronomers from various cultures had worked that out long before and, even in Europe at the time, the scientific community wasn't exactly shocked by the idea (Kepler was expounding an even better model at the same time).

However, Galileo is rightly considered the "father of science" because he was the one that really moved science away from trying to explain the universe and instead settle for describing it. It took a while for that concept to really take hold, but once it did, Europe immediately vaulted to the forefront of innovation and, as innovators tend to do, took over the world.

Statistics teachers seem to have missed the memo.

This is particularly ironic since if there was ever a field that focused on description over explanation, it would be statistics. Still, stats textbooks are littered with these problems that supposedly debunk commonly held beliefs simply because they can provide a descriptive model that does not require them. Note, I didn't say they can provide a descriptive model that disproves them. Even statisticians are generally careful not to go that far. They just figure that if the random model covers it, the phenomenon probably doesn't exist. QED.

Well, no.

What the hell is "random" anyway? Try pinning a statistician down on that one. It can be amusing. At Cornell, I took a course called "Frequentist Theories of Statistics". When I walked into class on the first day, there were seven other students in the room. Four of them were faculty. One of them had been awarded the Nobel Prize. Really! This is a deep question and even the world's best minds struggle with it. I got an A in that class, not because I had any clue but because, as the professor told me when I defended my term paper, I "wasn't afraid to face the subject head on." He then added, "That's pretty impressive for a guy coming from a school with a T in the name."

Then he farted. Really loud. He was super smart about math but rather ignorant as to social graces.

I went on to explain that my choice of a tech school over a university for undergrad was based solely on the fact that my dyslexia would kill me in a traditional literature course; I was quite willing to ponder philosophical questions as long as I didn't have to read 100,000 words a week to do so.

So, what's my point?

We had a problem on the homework for Bayesian Stats that talked about the "hot hand" in basketball; the idea that a player would get on a roll where they seem to be sinking everything. They then presented some stats and we had to show whether there was any evidence of non-random activity. Obviously, there wasn't, or I wouldn't be writing this.

I've seen this problem in various forms in at least half the statistics classes I've ever taken (and I've taken quite a few). I get that they are just using it as a convenient metaphor but, as someone who knows more than a bit about both stats and sports I can assure you that it's the statisticians that are full of shit.

First off, the set of sports data is so vast that you can easily find a slice of it to prove or disprove just about anything. Even without nefarious intent, one can get sucked in by something that appears to be a trend when it's really just noise.

More importantly, it violates Galileo's specification of the scientific method. Experiments only prove things by showing that the data is completely incongruous with competing theories, leaving only one theory to be accepted. Simply showing that the same thing can happen under multiple competing theories (in this case, random versus real variation in success) proves absolutely nothing.

More importantly still, description does not trump truth. Every elite athlete in the world will attest that there are times when your game simply gets elevated. Where everything suddenly seems easy. The standard term is "in the zone." People who have never experienced this have no clue. People who have will defend it with the vehemence of a TV Evangelist. It is a TRUTH. Not a fact.

OK, I've been up way too long this weekend working. But it was fun to rant.

Saturday, March 12, 2016

2007 Conquer Castlewood

I guess I'm just a little nostalgic for the days when I was plugged into the adventure sports scene around here. This week's off-day throwback race report is to another one where I paired up with David Frei.

Run June 3, 2007

The call from David was amusing. "Alpine Shop is sick of Big Shark winning their race every year so they want to pay the entry fee for me and my partner." The race in question was the Conquer Castlewood Adventure Race which is sponsored by Alpine Shop. The reason this is amusing is that the Big Shark team he was referring to was none other than David and I. No problem, I'll "join" his team rather than the other way around if it makes his sponsor happy. I do plenty of other events wearing a Big Shark jersey each year.

Despite our track record, Conquer Castlewood is not an automatic win for us. It used to be, but the quality of the field has really come up in the last few years. Three years ago we got bit in the butt when we were just a bit too casual about our preparation. We finished third, only 12 seconds out of first. That hurt so the next year we got our act together again and won, but it was hardly a rout.

The team we're most concerned about is the one that finished between us and the lead in 2004. We affectionately refer to Nick & Trey Robinson as "the college kids" even though they have graduated since acquiring the moniker. Normally, we don't pay much attention to teams in their young 20's because they have trouble with the distance. Conquer Castlewood is short enough (each of the three disciplines is roughly 30 minutes) that there's more of a balance between speed and endurance and they've got plenty of speed. With David out of the country for last year's race, they won easily.

The week before the race, David and I scope out the river. We paddle the Meramec all the time, so we have a pretty good idea of where the current runs, but we've never actually timed the best lines. The key decision is when to cross to the south bank. Getting through the current in the middle of the river quickly can save a lot of time. We try two obvious points and they come out exactly the same. I suggest we try one more line that has less shelter along the bank, but goes behind a large log stuck in the middle of the river. It turns out to be 10 seconds faster. Not much, but remembering 2004, we decide that will be our play.

I get to the park right when the gate is supposed to open at 6:30AM. There are already a few teams there waiting to get in. Snagging a good parking spot and transition area is worth getting up early. The race director is there and since he doesn't seem to have anything else to do right at the moment, I pester him about the start. In my most diplomatic tone, I suggest that he might consider having us run to the boats rather than the chaotic on-the-water start that's traditionally been used. He's skeptical, but the more we talk about it the more he seems to like the idea.

After snatching both the best parking spot and the best transition area, I set about getting my things ready. David shows up a bit after seven and we go for an easy jog to warm up. When we get back, we're told that some of the canoes came off the trailer when being transported to the race. As a result, they can't start all the teams at once and we'll be in the second heat. At first I'm none too happy about this because I think we might run into slow teams on the single track descent during the bike. David's pretty sure we won't catch them that quick and points out that having sixty riders go over the trails ahead of us might squeeze off some of the surface mud from the previous day's rain. At any rate, it doesn't appear to matter as all the fast teams that we know about are in the second heat.

After the boats are back from the first heat, we all line up for the run down to the water. With the field broken into heats, this is less of an issue since 30 boats can fit across the river. I still prefer the format, and we are able to get underway without any serious altercations with other boats. After a minute of paddling, we're in a lead group of three boats. Up front are the college kids. We don't recognize the next team, but they've obviously held a paddle before. We alternate between drafting them and staying in closer to the shore.

We're a bit surprised that none of the lead teams go for the early crossing. In past years, most of the field has preferred to get over to the south bank as soon as the river stops bending to the north. We figure that means they'll be taking the late crossing; we stay in third so as to not tip our hand. We cut over just below the log we identified in practice. The river is down just a few inches from our test runs which has things flowing faster in the middle. What was a minor shelter before is now a game changer. We surge past the other boats and hit the far bank ahead. They scramble to cross, but with no shelter in the channel, they are a good 15 seconds back when they get across. This is our chance to get clear and we make the most of it. We hammer back downstream to take out with a minute lead.

At the transition, I take a few seconds to squirt the sand out of my cleats. That puts me on the bike behind David who sets a stern pace. While teams are allowed to split after the paddle (team time is the sum of the individual finish times), I'd like to at least stay with David until the descent. I'm quite sure I won't be able to match his speed there, especially in these slick conditions.

The first mile is flat, followed by the long climb up the ridge. Shortly after the climbing starts, I spin my rear wheel on a slick rock and go down. I get up quick and try to hammer back and do it all over again. Stop, breathe, compose. I get back on and manage to get the rest of the way up without incident. A bit of hard charging along the ridge top gets me to within about 10 seconds by the start of the descent.

Descending Love trail is normally one of those activities that reminds you why you do these things. It's fast, twisty, with just enough bumps to make it exciting, but nothing to really knock you off your line. Having already hit the dirt twice, I just don't have the confidence to really go after it. I get down with no problems, but by the bottom I've been caught by another rider. The rest of the course is flat, so I figure I'm still in good shape. I get back on the gas and, whoops! The front wheel washes out on a tight turn and I go down hard. I'm not hurt, but it does cost me the spot.

Back at the transition, I'm told that David is about 90 seconds up the road. A quick shoe change gets me back into second, but what I'm really wondering is where the college kids are. A mile into the run I get word that Nick is a minute behind me while Trey is over 2. Assuming they're on form, that means I'll likely get caught by Nick. I don't think he'll get all the way up to David. If I can hold off Trey, we're in good shape.

Much to my surprise, I get all the way through the trail running before Nick shows up. I make a feeble attempt to match him on the road back in, but I'm already pretty close to redline and drop back after a few hundred meters. I lose 30 seconds to him over the last mile, but he doesn't catch David. I'm still well ahead of Trey, so we've pulled off the win.

Racing with David is always a blast. Afterwards we agree that we really ought to do it more often (although his regular commitments with the Alpine Shop team don't give him a whole lot of empty space on the schedule). Oh well, if it's going to be just once a year, at least we put out a great performance when it happened.

Friday, March 11, 2016

Willing to fail

I saw a documentary today on the Barkley Marathons, generally regarded as the toughest race in the world, though debating that claim here misses the point: it's really hard.

The RD had an interesting take on the race's appeal. Basically, we go through life trying to avoid failure. To sign up for Barkley is to invite it. Just the signup process is no easy thing to get right. There are no instructions on how to register. No website to ask questions. No entry deadlines or lottery procedures. Basically, figuring it out is part of the race; if you're not tenacious enough to get past that hurdle, you don't have a chance.

Come race day, you're faced with five nominally 20-mile laps (the actual distance is considerably longer) through some of the toughest wilderness in the world. The time limit is 60 hours. Very few even finish the third lap. In the 25 years it's been held, only 14 people have managed all 5.

Of course, people accept much longer odds all the time when they buy lottery tickets. But, failing to win the lottery is much different than failing to finish an ultra. In the former, you've invested very little and have absolutely no control over the outcome. In the latter, you've invested thousands of hours of training and the outcome is almost entirely up to you.

Any activity carries some risk of failure. The Barkley Marathons just move that needle way over to one side. It is at the very limit of what is possible, yet most of the 14 finishers are not truly elite athletes. They are just sufficiently fit and extremely determined.

Perseverance is the one thing that nearly all successful people have in common (regardless of how you choose to define success). It is not a sufficient condition, but it is necessary. Like all abilities, it is unevenly allocated by birth, but anybody can increase what they have through practice. And, the way to practice perseverance is to put yourself in a position where lack of it leads to failure.

I don't particularly want to run Barkley; it just doesn't sound like much fun. But, I do get it. While I like the risk/reward equation to be a bit more in my favor, I run 100's largely because they do force you to stare failure in the face and decide to either fight on or quit. Fighting on is hard. But it's also exhilarating. It becomes a habit, then a lifestyle and, ultimately, a part of our character. It is good.

Thursday, March 10, 2016

Data Mining Tan ch 8 notes

Rather last minute, but I'll throw together a few notes for the exam from the supplemental chapter in Tan regarding clustering (literally, this is very late in the game, but production issues at work diverted my efforts last night, so this will have to do).

Reasons to cluster:

  • Summarization
  • Compression (use of cluster prototype points to represent all points)
  • Nearest neighbor algorithms can search same cluster first
Hierarchical - nesting of clusters
Partitional - non-overlapping

Exclusive - each point is assigned to at most one cluster
Overlapping - point may be in more than one
Fuzzy - point is allocated partially to several clusters

Prototype-based: object assigned to cluster with most similar prototype
Graph-based: clusters represent connected objects
Density-based: clusters represent regions of heightened density
Shared-property (conceptual): objects share some property or set of properties

K-means: prototype-based partitional clustering

Dependent on the distance metric. Euclidean distance results in clusters with a centroid at the mean of each dimension. Minimizes the Sum of Squared Error (SSE).
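As a quick illustration (my own sketch, not from the text), the SSE objective under Euclidean distance is just:

        // Sum of Squared Error for a clustering: squared Euclidean distance from
        // each point to its assigned centroid, summed over all points.
        public static double SumSquaredError(double[][] points, double[][] centroids, int[] assignment)
        {
            double sse = 0.0;
            for (int i = 0; i < points.Length; i++)
            {
                double[] c = centroids[assignment[i]];
                for (int d = 0; d < points[i].Length; d++)
                {
                    double diff = points[i][d] - c[d];
                    sse += diff * diff;
                }
            }
            return sse;
        }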

Alternative metrics include the cosine similarity measure, which is useful for document data. (Cohesion measure).

Complexity: Time - O(I * K * m * n) where I is the iteration bound, K is the number of clusters, and the data is m points with n attributes. Space - O((m + K)n).

Techniques to reduce SSE after convergence on a (possibly) non-global optimum: split a cluster, introduce a new centroid, disperse a cluster, merge two clusters.

Bisecting K-means: Start with K-means(2) and then continue to split a cluster at each iteration. Cluster to split may be chosen by several criteria (largest, largest SSE, greatest diameter, etc.)

Agglomerative Hierarchical: starts with singletons and combines nearest sets

Distance metrics to find nearest sets: single link (closest pair), complete link (most distant pair), group average.

Complexity: Time - O(m² log m) assuming an efficient heap or list. Space - O(m²)


DBSCAN: density-based; finds high-density areas and treats the rest as noise.

Points are classified as Core, Border, or Noise based on how many other points are within a set distance eps. Core points within eps of each other are put in the same cluster. Border points are put in a cluster of a core point they border. Noise points are ignored.

Cluster evaluation

Cohesion vs separation (cohesion within clusters, separation between clusters).

See p538 for some specific formulas for separation and cohesion.

Silhouette Coefficient: a = average distance from the object to all other objects in the same cluster. b = the smallest average distance from the object to the objects of any other cluster. s = (b - a)/max(a, b) => [-1, 1]. Closer to 1 is better.
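A minimal sketch of the per-point computation (mine, assuming a precomputed pairwise distance matrix and cluster labels):

        // Silhouette coefficient for point i: a = mean distance to its own cluster,
        // b = smallest mean distance to any other cluster, s = (b - a) / max(a, b).
        public static double Silhouette(int i, double[][] dist, int[] labels, int numClusters)
        {
            var sums = new double[numClusters];
            var counts = new int[numClusters];
            for (int j = 0; j < labels.Length; j++)
            {
                if (j == i) continue;
                sums[labels[j]] += dist[i][j];
                counts[labels[j]]++;
            }
            double a = counts[labels[i]] > 0 ? sums[labels[i]] / counts[labels[i]] : 0.0;
            double b = double.MaxValue;
            for (int c = 0; c < numClusters; c++)
                if (c != labels[i] && counts[c] > 0)
                    b = Math.Min(b, sums[c] / counts[c]);
            return (b - a) / Math.Max(a, b);
        }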



Wednesday, March 9, 2016

Data Mining ch3 notes

Read this chapter in Aggarwal a few weeks ago, but forgot to write up my notes. I may need them tomorrow for the exam (if nothing else, writing them up is a good review). This chapter is more than just terms; there are several important formulas to remember.

Quantitative distance measures:
L_p norms: Dist(X,Y) = [sum(|x_i - y_i|^p)]^(1/p)
Special cases:
  • L_0 - Doesn't really work mathematically, but intuitively it's the number of dimensions that differ between X and Y, so that's how we define it.
  • L_1 - grid (AKA: taxicab or Manhattan) distance: sum of all the absolute differences along each dimension.
  • L_2 - standard Euclidean distance.
  • L_∞ - greatest single dimension distance (i.e., max(|x_i - y_i|)).
Issues with higher dimensionality - loss of contrast ((max - min dist from origin)/mean dist from origin). Lower values of p (possibly between 0 and 1) help with higher dimensions, but all L_p norms degrade as p gets large, with distance dominated by white noise rather than relevant feature distinctions.

Proximity thresholding: divide data into discretized buckets for each dimension. The proximity set S(X,Y) is the set of dimensions where X and Y map to the same bucket. Then apply the distance formula only to dimensions in S(X,Y):

PSelect(X,Y) = [sum_{i∈S(X,Y)} (1 - |x_i - y_i|/(range of bucket))^p]^(1/p)

This gives a similarity metric rather than a distance metric ([0,1] with 1 being similar).

Mahalanobis distance: Maha(X,Y) = sqrt[(X - Y)Σ^(-1)(X - Y)^T] where Σ is the covariance matrix. Basically Euclidean distance after the data has been rotated to its principal components and each component scaled to unit variance.

ISOMAP and Geodesic distance: shortest path along edges that connect only k-nearest neighbors. Once the all-pairs shortest path graph is constructed using k-nearest-neighbor hops, multidimensional scaling (MDS) can be used to embed the data into a lower-dimensional space where a Euclidean distance metric will work.

Shared Nearest-Neighbor Similarity: number of common k-dist neighbors between two points.

Generic Methods: Partition the data into local regions (typically via a clustering algorithm). For each pair, determine the appropriate region and apply the distance metric using the statistics for that region.

Categorical distance measures:
Overlap measure: S(x_i, y_i) = {1 if x_i = y_i; 0 otherwise}. Similarity S(X,Y) = sum(S(x_i, y_i)).

Inverse frequency similarity (reduces noise from common attributes): Let p_k(x) = proportion of records where attribute k = x. S(x_i, y_i) = {1/p_k(x_i)² if x_i = y_i; 0 otherwise}.

Goodall measure: S(x_i, y_i) = {1 - p_k(x_i)² if x_i = y_i; 0 otherwise}.

Cosine distance:

cos(X,Y) = X · Y / (||X|| ||Y||) = sum(x_i y_i) / [sum(x_i²) sum(y_i²)]^(1/2)

Inverse document frequency id_i: normalize each term by id_i = log(n/n_i) where n is the total number of documents and n_i is the number of documents containing the word.

Normalized frequency: h(x_i) = f(x_i) id_i. Then the cosine measure is simply applied to the normalized vector X' = h(X) = (h(x_i)).

Jaccard coefficient: J(X,Y) = sum(h(x_i) h(y_i)) / [sum(h(x_i)²) + sum(h(y_i)²) - sum(h(x_i) h(y_i))]
More commonly used for sparse binary data than text. In the binary domain, this reduces to:
J(X, Y) = |S_X ∩ S_Y| / |S_X ∪ S_Y|
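A tiny sketch of the binary/set case (mine; assumes System.Collections.Generic for HashSet):

        // Jaccard coefficient for two sets of attribute indices.
        public static double Jaccard(HashSet<int> x, HashSet<int> y)
        {
            int intersection = 0;
            foreach (int item in x)
                if (y.Contains(item)) intersection++;
            int union = x.Count + y.Count - intersection;
            return union == 0 ? 0.0 : (double)intersection / union;
        }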

Time series transformations
Behavioral attribute scaling (variance normalization) and translation (mean shifting)
Temporal (contextual) attribute translation (shifting to align series)
Temporal (contextual) attribute scaling (time warping). May be fixed throughout series or vary over time (dynamic time warping).
Noncontiguity matching (deleting/ignoring noise segments)

Discrete sequence
Edit distance (AKA Levenshtein distance) - minimum cost to transform one string into another, given a cost function for each type of edit.
Dynamic programming objective function:

Edit(i,j) = min{Edit(i-1, j) + Cost(delete), Edit(i, j-1) + Cost(insert), Edit(i-1, j-1) + I_ij Cost(replace)}

where I_ij is 1 iff x_i ≠ y_j
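A compact sketch of that dynamic program (mine, with unit costs for all three edit types):

        // Edit (Levenshtein) distance via the recurrence above, unit costs.
        public static int EditDistance(string x, string y)
        {
            var edit = new int[x.Length + 1, y.Length + 1];
            for (int i = 0; i <= x.Length; i++) edit[i, 0] = i;   // i deletes
            for (int j = 0; j <= y.Length; j++) edit[0, j] = j;   // j inserts
            for (int i = 1; i <= x.Length; i++)
            {
                for (int j = 1; j <= y.Length; j++)
                {
                    int replace = edit[i - 1, j - 1] + (x[i - 1] == y[j - 1] ? 0 : 1);
                    edit[i, j] = Math.Min(replace,
                                 Math.Min(edit[i - 1, j] + 1, edit[i, j - 1] + 1));
                }
            }
            return edit[x.Length, y.Length];
        }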

Longest Common Subsequence: also solved via dynamic programming.

Graphs
Homophily measures: short path and/or many path measures.
Random walk-based similarity
Maximum common subgraph distance (extension of longest common subsequence)
Substructure-based similarity: counts matching substructures
Graph edit distance
Graph kernels



Tuesday, March 8, 2016

More intervals!

No, not track intervals, I'm done with serious running training (though I may hit the track this summer just for a change of pace). Brain intervals! I've let my brain get out of shape again and need to address it pronto.

Today, the prof went through a few of the problems suggested for test prep. As it's always better to solve them yourself, I tried to get through them faster than he was covering them. I was successful in a few cases when he got slowed down by questions but, in general, I was too slow. Correct, but too slow. That's what killed me on the Algorithms midterm last year.

I know the brain intervals work because I totally nailed the final in Algorithms and it was every bit as difficult a test. Not sure how long it takes to reset the pace of your response. Hopefully less than 48 hours, because that's all I've got. Did quite a few of them tonight and will do more tomorrow.

Monday, March 7, 2016

Midterm prep

I felt like I was going into the midterm in Data Mining (this Thursday) a bit blind. I don't feel like I'm behind, I've done all the reading and homework, I just didn't have a good feel for what to expect on the test. I guess I wasn't alone. Today, the prof posted this. Going through all of them between now and Thursday will be a lot, but well worth it I'm sure.

CS 5342 / 4342 Practice Questions
In addition to reviewing all the questions in the homework assignments and the ones worked out in class, attempt the following:
1.      Questions on time complexity of k-means, DBSCAN
2.      Questions on convergence (guaranteed or not) properties of k-means, DBSCAN
3.      Questions on local-vs.-global optimum finding, on the effect of the choice of the initial centroids, on the effect of the order of the points, of k-means
The book by Aggarwal:
Chap 2:  Questions  1, 13
Chap 3:  Questions  1, 5, 15
Tan et al.:

Chap 2: Questions 16, 17, 23 (see below):

16. Consider a document-term matrix, where tf_ij is the frequency of the ith word (term) in the jth document and m is the number of documents. Consider the variable transformation that is defined by

tf′_ij = tf_ij log (m/df_i)

where df_i is the number of documents in which the ith term appears and is known as the document frequency of the term. This transformation is known as the inverse document frequency transformation.

 (a) What is the effect of this transformation if a term occurs in one document? In every document?
 (b) What might be the purpose of this transformation?

17. Assume that we apply a square root transformation to a ratio attribute x to obtain the new attribute x*. As part of your analysis, you identify an interval (a, b) in which x* has a linear relationship to another attribute y.

(a) What is the corresponding interval in terms of x?
(b) Give an equation that relates y to x.

23. Given a similarity measure with values in the interval [0,1], describe two ways to transform this similarity value into a dissimilarity value in the interval [0,∞].



Chap 8: Questions 5, 6, 9, 16 (see the pdf file for Chap 8)

Sunday, March 6, 2016

Data Mining HW2

It's not due for a while, but I'm not worried about anybody copying. For one thing, open source implementations are already all over the place. So, if any of my classmates want to see how I approached it, they are free to do so.

HW2

Saturday, March 5, 2016

The Woodlands Marathon 2016

I was originally planning just a short write-up, but since I'm just hanging out this afternoon, I decided to get the full report done now and then get on with the serious business of performing at work and school without trashing my family tomorrow. So, you're getting a rare same-day race report.

Run March 5, 2016.

Funny how feasible things can sound when you've had a few glasses of wine. I was in such a state when Kate's best friend, Kelli Brown, mentioned that her husband, Adam, was thinking about running The Woodlands again and my presence would likely be the clincher. This was said less out of concern for Adam than to present an excuse to get Kate flown down to Texas so they could spend some time together. I wasn't sloshed, so I immediately picked up on the ulterior motive, but my rational mind was just enough inhibited that the idea of a four-day marathon weekend in the middle of the semester seemed like something that could be pulled off. The little matter of training for the thing was a detail not even considered.

As I finished in the money last year (literally, there's a modest cash prize for the 50+ win), the Elite Coordinator blithely places me in the elite field again. He doesn't ask if I've got time to train and I don't volunteer the answer. But, really, the 3:02 standard for 50+ elite seems pretty doable even on the reduced regime. Of course, I have no empirical evidence to support that view; I've never even been under 3:05 without 100-mile weeks.

I actually do manage a couple 100-mile weeks during the break between semesters and the 12 weeks preceding the race each contain two reasonably stout workouts, but it's not enough. I run the Frostbite half in January, finishing nearly three minutes behind last year's time. Subsequent workouts provide more evidence that a sub-3 is very unlikely and even the 3:02 is in serious question.

The race-day forecast doesn't provide much cause for optimism, either. The dew point is expected to follow the temp from mid-50's at the start to mid-60's by 1 hour in. It will stay flat from there, but the temp will continue on to high 70's by the finish. Oh, yeah, sunny, too. Well, it is Texas. Nothing to do but show up and give it my best shot. It's not like I'll get in trouble if I don't run the standard; it's just a little embarrassing.

I decide my best shot is to hang with the 3-hour pace group as long as I can and then try to gut out a decent finish. At number pickup, the pace team coordinator assures me that the pair handling the 3-hour group have an excellent track record of nailing the pace.

Race day weather isn't quite as dire as the forecast, but there is plenty of fog along the river. As an elite, I get to drop my stuff right at the start and don't have to worry about getting into the corral before it closes, so I wish Adam luck and get in a brief warmup. At five to seven the wheelchair racers are sent off and the elites are called to the line. As with last year, there are 20 in the field, 13 men and 7 women. I'm the only 50+, but there are four other masters (40+). The hordes are brought up behind us and off we go.

I had the foresight to seek out the 3-hour pacers before the start. They cross the line just a few seconds behind me and within a few hundred meters, they've pulled up alongside me with a dozen sub-3 hopefuls in tow. They take us through the first mile in 6:44, 6 seconds fast, but that's entirely appropriate since the conditions at the finish will be a lot worse than they are now. Their names are Carey and Chris and, in addition to accurate pacing, they lead the group in affable chatter. The message is clear: "Relax and leave the driving to us." I'm more than happy to oblige.

We get to 10K still just a handful of seconds ahead of pace. The opening miles are slightly net downhill and, despite the fog, conditions are still pretty good, so I appraise my condition skeptically. However, even in that light, I'm surprised at how well I'm doing. It seems like I might just be able to hang with these guys the whole way.

The fog burns off in the next few miles and I notice that several in our group are completely soaked in sweat. I've been drinking a bit at each aid station and don't seem to be sweating much, but I make a note to check my condition each mile. By the half (in 89:54), several in our group have succumbed to the conditions and we are reeling in quite a few from ahead. Chris and Carey continue to lay down nearly perfect splits and, while I'm feeling the effort, I'm not sensing any impending doom.

During mile 16, we do a short out and back to get the course to exact distance. The turnaround is considerably closer in than it was last year. We chalk it up to the fact that the start was moved. However, when we get to the flag at 16, Chris and Carey both comment that the mile split is way under expectations. It could be a misplaced mile marker, but when 17 comes up 6:50 later, we conclude that it was the turnaround cone that was misplaced.

Well, you run the course they give you. While it might make the time bogus for those who care, I'm pretty sure I'm leading 50+, so it's about getting to the line first, regardless of the distance. We soldier on.

By 18, our group is down to just five. I stride out a few meters ahead to provide the photo op for Kate, who has journeyed about 26.2 meters from Kelli's living room to the side of the course. While the effort does no damage, it does reveal how close I'm getting to my limits. I hang on for another two miles and at 20, I have to relent. My legs still feel OK, but my stride is shortening (as it always does just after 2 hours, even in training runs and ultras; it seems to be a function of duration, not pace) and I'm overheating. At first, the loss is small, but by 23, my pace is really flagging; Chris and Carey are nearly a minute ahead.

It's time to dig. I knew I was likely to run into trouble holding this pace but, to use local parlance, this ain't my first rodeo. I've been in worse spots. I alternate between pumping my arms harder and switching to 2-step breathing (out on the left foot, in on the right; I normally breathe out 2, in 1 near the end of races). This gets me through 24 without much further loss, but leaves me reeling from the effort. I try quickening my turnover; no dice. I take a few minutes a bit easier, trying to summon some reserves. There aren't many.

At 25, I check my watch: 2:50:40. Sure, it's a dubious time, but a sub-3 on paper is enough to motivate one last push. I'm running through the back of the half marathon field now, but it's not too hard to get around them. I hit the timing mats at 2:59:10.

I turn around to see that my final push has not been mere vanity; John Robertson crosses just 15 seconds behind me. Staying ahead of him scored me the third masters spot along with a repeat as grand master winner. It also means that, while I didn't embarrass any of the true elites, I did finish exactly midway in the masters elite field, so I didn't embarrass the elite coordinator, either.

I get a quick meal and change in the elite tent, then get back to the line in time to see Adam finish. He really got hammered by the heat at the end, but his 3:48 was still a PR by a fair margin.

Now, about that time. As I mentioned, I wasn't keeping splits, so I really don't know just how short mile 16 was. Given that Chris and Carey were running nearly perfect pace, I expect the official 2:59:10 equates to something more like a 3:01 or 3:02. If it were a PR, I'd stress a bit over how to list it. As it's not, I'm just going to let the official time stand as my 7th fastest marathon without an asterisk.

And now, I'm going out to dinner with Kate, Kelli, and Adam. We will have wine. If anybody mentions that I should accept the elite invite to next year's event that will undoubtedly follow from the repeat win, I will laugh in their face.





Friday, March 4, 2016

Obsessive Compulsive Overkill

So, I'm sure there were better ways to pass the time, but to keep my mind off tomorrow's race, I spent a whole lot of time this afternoon and evening working on my homework for Data Mining. Oh, that's right, I was done with that already. Yeah, but nothing's ever so done that you can't mess with it for a few hours. It was fun; probably more fun than tomorrow is going to be.

There will not be a throwback race report tomorrow, but I will post a brief update on how the race goes with a real race report to follow next week.

Thursday, March 3, 2016

Texas toast

Off-topic post today even though I did get some schoolwork done (finished the DBSCAN stuff, which is a good thing).

Flew to Texas midday for the Woodlands Marathon on Saturday. Yes, we've had some warm weather recently but, wow, not 85F. It was a shock to say the least. Of course, it won't be that hot at 7AM when the race starts. In fact, the current forecast isn't that bad: 60's for most of the race with the dew point in the 50's. That second number is the more important one and it's higher than I'd like, but not terrible.

This will almost certainly be the last time I'm ever invited to run in an elite field. While it's tempting to just run it easy and regard it as a reward for past results, I'm just not wired that way. I'll give it my best shot, even though that will be considerably slower than last year.

Wednesday, March 2, 2016

Overkill

The Data Mining assignment clearly stated that a simple implementation was sufficient. Still, just throwing together the DBSCAN algorithm from the pseudocode in the paper didn't seem like a really great learning experience. So, I've gone ahead and done the things you do when you're serious about programming: full unit test coverage, proper segmentation of the problem into model, data, and controller classes (no view, since it's a command line app), and interfaces so you could hook up a high-performance spatial database to the data layer without modifying the algorithm. I'll give it a nice writeup, too, along with some pretty graphs out of R. I'd say I've probably tripled the work necessary.
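
For what it's worth, the seam between the algorithm and the data layer boils down to a single neighborhood query. The sketch below uses made-up names (my actual classes are messier), but the shape is roughly this:

        // Rough shape of the data-layer seam (names are illustrative only).
        // DBSCAN needs exactly two things from the data layer: an enumeration of
        // the points and an epsilon-neighborhood query.
        using System.Collections.Generic;

        public interface IPointSource
        {
            // Every point id in the data set.
            IEnumerable<int> AllPointIds();

            // Ids of all points within distance eps of the given point,
            // including the point itself.
            IList<int> RegionQuery(int pointId, double eps);
        }

        // The naive in-memory implementation is a linear scan; an R-tree, k-d tree,
        // or a full spatial database could be dropped in behind the same interface
        // without touching the clustering code at all.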

I'm doing fine and know I could get by with less, but looking for the minimum successful path is what got me in trouble at Cornell. This go round I'm focusing on maximum output. It may be sub-optimal, but it's the surest road to success. And, if this try doesn't work out, there won't be a third.