Saturday, April 30, 2016

Switching the sabbath

Just to prove I'm not being mindlessly legalistic about it, I switched up the sabbath this week. I worked today and will take tomorrow off. So, no throwback race report. No post of any other merit, either. It's late and I'm tired.

Friday, April 29, 2016

Summer plans

I've pretty much firmed up my plans for the next three months. As I mentioned before, the overriding goal is to be completely prepared for the Q. I'm going to allocate 5 hours a week to that starting May 15. The Q date is TBD, but it will be sometime in September, so that's a little under the 100 hours I targeted, but that number is pretty arbitrary. I should know by July if I need to adjust.

The remaining available time in May will be devoted to finishing the CISS paper.

Starting in June, I'll be digging into Cluster Analysis in general and working on my partitioning algorithm in particular. The general research is more important at this point; I need to get current on the literature or I'm going to end up reinventing the wheel. While I'm at it, I'll start putting together my literature review for my dissertation.

I'm not really too concerned if I don't get a lot done on the partitioning algorithm. I don't expect I'd be able to write it up and publish it before next spring, anyway. If I hold off until then to do the heavy lifting, I can pack it into a CS5880 Independent Study course. That, along with this summer's research and my fall courses, would be enough to complete all my coursework.


Thursday, April 28, 2016

Data Mining HW3

is here. I'm posting it before it's due because it would be pretty tough to copy an original algorithm and claim otherwise.

Wednesday, April 27, 2016

R code for SA Clustering project

Here's the actual code file:

#plot the clusters
#
PlotClusters = function(obs, clust, iter, t)
{
  plot(
    xlim = c(0, 100),
    ylim = c(0, 100),
    obs[,'X'], obs[,'Y'],
    asp = 1.0,
    pch = 20, #small points
    cex = .5,
    main = sprintf("Iteration: %d; t = %6.3f", iter, t),
    frame.plot = TRUE,
    xlab = "",
    ylab = ""
  )
  rect(clust[,'Xl'], clust[,'Yl'], clust[,'Xh'], clust[,'Yh'])
}
#
# initialize data frames for points and clusters
#
InitPoints = function(x, y)
{
  data.frame(X=x, Y=y, Dist=rep(1,length(x)), Cluster=rep(0,length(x)))
}

InitClusters = function()
{
  # an empty data frame holding the cluster boundaries; rows get rbind-ed in later
  data.frame(
    Xl = double(),
    Xh = double(),
    Yl = double(),
    Yh = double()
  )
}
#
# generate seeds for clusters from points currently in default bucket
#  (note: relies on the global temperature t; higher t means more seeds)
#
Seeds = function(obs)
{
  subset(obs, obs$Cluster == 0 & runif(length(obs$Cluster)) < t/10)
}
#
# turn seeds into clusters
#
NewClusters = function(seeds, border)
{
  data.frame(Xl=seeds$X-border, Xh=seeds$X+border, Yl=seeds$Y-border, Yh=seeds$Y+border)
}
#
# compute rectilinear distance from a fixed point to a single cluster's bounding box
#
ClusterDistance = function(x, y, cluster)
{
  max(0, x-cluster['Xh'], cluster['Xl']-x) + max(0, y-cluster['Yh'], cluster['Yl']-y)
}
#
# find best fit for point
#
BestFit = function(obs, clusters)
{
  costs = apply(clusters, 1, function(cluster) ClusterDistance(obs['X'], obs['Y'], cluster))
  return( c(min(costs), which.min(costs)))
}
#
# vectorize best fit across all points
#
AssignClusters = function(observations, clusters)
{
  t(apply(observations, 1, function(obs) BestFit(obs, clusters)))
}
#
# determine points to be dropped because they are too far away
#
DropVector = function(obs, t)
{
  obs$Dist > (rexp(length(obs$Dist), 1/t) * 10)
}
#
# reshape clusters around points
# on first pass (t>0), trim the boundaries so the border points drop off
#  (allows big reductions in partition area if just one point is holding it)
#  also check that cluster is viable (n>2, prior to dropping border)
# on second pass (t=0), don't trim so any border points added back stay
# always drop the default cluster, since it's not really a cluster
#
ReformClusters = function(obs, t)
{
  xl = tapply(obs$X, obs$Cluster, min) + (.01*t)
  xh = tapply(obs$X, obs$Cluster, max) - (.01*t)
  yl = tapply(obs$Y, obs$Cluster, min) + (.01*t)
  yh = tapply(obs$Y, obs$Cluster, max) - (.01*t)
  keep =
    (tapply(obs$Cluster, obs$Cluster, min)!= 0) & #not default
    ( (t > 0) | (table(obs$Cluster) > 2) )        #size check only applies when t = 0
  data.frame(Xl = xl[keep], Xh = xh[keep], Yl = yl[keep], Yh = yh[keep], row.names = NULL)
}

And, here's the interactive session that actually runs the algorithm:

#
# running algorithm in interactive session to make it easier to
#  generate intermediate plots
#

#
# data setup for uniform random in (0,100)x(0,100)
#
set.seed(100)
x = runif(100, max=100)
y = runif(100, max=100)

#
# with some actual clusters
#
set.seed(200)
x = c(
  runif(20, min=5, max=10),
  runif(30, min=50, max=80),
  runif(30, min=40, max=45),
  runif(20, max=100) #noise
)
y = c(
  runif(20, min=90, max=99),
  runif(30, min=20, max=30),
  runif(30, min=60, max=70),
  runif(20, max=100) #noise
)

#
# clustering along a single dimension
#
set.seed(300)
x = c(
  runif(80, min=10, max=15),
  runif(80, min=60, max=70),
  runif(40, max=100) #noise
)
y = runif(200, max=100)

#
# after choosing the data scenario, run from here
# choose while condition to suit needs
#
#initialize
observations = InitPoints(x, y)
clusters = InitClusters()
t = .99
i = 0

#loop
j=2 #number of intermediate steps
while (j > 0) #for intermediate steps
#while (t > 0.25) #to run until desired temperature
{
  #add in new seeds
  clusters=rbind(clusters,NewClusters(Seeds(observations), 1.0))

  #find nearest cluster
  observations[,c("Dist","Cluster")] = AssignClusters(observations, clusters)

  #drop observations that are too far away into default bucket
  observations$Cluster[DropVector(observations, t)] = 0

  #reform boundaries to be just smaller than current to kick out edge cases
  clusters = ReformClusters(observations, t)

  #reassign to reformed clusters
  observations[,c("Dist","Cluster")] = AssignClusters(observations, clusters)

  #kick out anybody not actually in the reformed cluster
  observations$Cluster[observations$Dist > 0] = 0

  #and now reset the boundaries for only the points still in there
  clusters = ReformClusters(observations, 0)

  #cool it down and go again
  t = t*.99
  i = i+1
  j = j-1
}

PlotClusters(observations, clusters, i, t)

Tuesday, April 26, 2016

Partitioning using SA Clustering

Well, it needs a lot of tuning, but my algorithm for Data Mining basically works. As this is a homework assignment and not something for actual publication, I'll probably leave it at that. Here's the plot on a uniform random set of points:

The rectangles are the boundaries of the partitions after two iterations. The annealing temperature starts at .99 and works down to .25.



OK, that's not exactly thrilling. But, there weren't any real clusters so what do you want? Let's try it with a clustered data set:


Sorry, it's hard to read. I'll need to fix that before I turn it in. Basically, it's found all three clusters in just two iterations, though, if you look closely, you can see that the bottom cluster is split into two partitions. Letting it cool to an annealing temperature of .25 (which is absolutely arbitrary; it just seemed to work for the test data I was using) fixes things up a bit.


Still not perfect, but there's a partition on each cluster and the boundaries aren't off by much. Furthermore, the code is pretty slick in R; every operation is vectorized. I'll publish that tomorrow after I turn it in.

Monday, April 25, 2016

Behind again

Seems I'm not as fast a learner as I might hope. Oh, I'm picking up the course material just fine; it's the larger lessons that don't seem to be sinking in. Like, maybe I shouldn't wait until two days before a test to start working problems. Sigh. It's been a long evening of cranking through stuff. None of it is particularly hard, it just needs to be practiced.

Sunday, April 24, 2016

Flying blind

I was a bit worried that I didn't really know what would be on our next exam in Data Mining (coming this Tuesday). We've covered stuff, but it's not at all clear how it translates to testable material. With the prof out sick on Thursday, I'm left with this review sheet:
  1. What, if any, is the connection between classification and clustering?
  2. Explain the Naïve Bayesian classification method. How are continuous attributes handled in this method?
  3. Study the Bayesian classification numerical examples from Chap 5 slides (Tan et al.)
  4. Problem 11, page 344, Aggarwal
  5. Problem 12, page 320, Tan et al.
  6. Consider three variables A, E, C in the following Bayesian belief network: A→E→C with P(A) = 0.3, P(E|A) = 0.8, P(E|Ā) = 0.6, P(C|E) = 0.2, P(C|Ē) = 0.3. Find P(A|C).
  7. In stochastic heuristics for optimum clustering (such as simulated annealing or genetic algorithm), a worse move is often accepted as a step (among many steps) in the search for an optimum cluster. What is the justification?
  8. Describe (i) one method of representing (encoding) the solution, and (ii) one method of creating the “next” point from the “current” point in simulated annealing-based clustering.
  9. Is the simulated annealing-based clustering method guaranteed to find the optimal clustering solution?
Ok, I can answer all those, but it still seems like I might be missing something. I guess we'll see in two days.
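For what it's worth, here's my scratch work on number 6 in R (just my own check on the arithmetic, not an answer key). Since the network is the chain A→E→C, C is conditionally independent of A given E:

pA <- 0.3; pE_A <- 0.8; pE_nA <- 0.6; pC_E <- 0.2; pC_nE <- 0.3
pE   <- pE_A * pA + pE_nA * (1 - pA)       # P(E)   = 0.66
pC   <- pC_E * pE + pC_nE * (1 - pE)       # P(C)   = 0.234
pC_A <- pC_E * pE_A + pC_nE * (1 - pE_A)   # P(C|A) = 0.22
pC_A * pA / pC                             # Bayes: P(A|C), roughly 0.282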

Friday, April 22, 2016

Double Chubb

Run April 16, 2016.

I figure anybody who reads my race reports has already figured out that I'm arrogant, so I'm going to open with a statement that will come off even more arrogant than I intend: there's not a lot about competition I don't know. There's plenty of stuff about various competitive activities that I don't know, but the essence of competition, I get that. So, when I retired from competition late last year (which then got postponed to this spring after the elite invite from The Woodlands), I felt like there was more I could have done, but not much more I could learn. That softened the blow of walking away, since much of my enjoyment of an activity comes from learning through the experience.

So here's a statement that's a little less in character: I was wrong. There are actually some very important things about competition I don't understand. But, to understand them, I need to stop competing.

Which brings us to the 18th edition of the Double Chubb. While I've only run it three times before, I've hit all three configurations of the course: the 1-lap 25K, the standard "Double" for 50K, and the infamously brutal "Quad Bypass" 50K, which is 4 laps of just the hilly section (used when the river section is flooded). As competitive efforts, they've ranged from good to great (the 2013 race on the quad course was quite possibly my best race ever). This year, however, I am very much entering it as an "event" rather than "race". Of course I'll run it firm because that's what you do, but I will not be stressing over my result, either by the watch or the finish order.

Easier said than done for someone like me, but the presence of my college roommate, Kevin Robertson, makes it a bit easier. He's running the single which will be his longest trail run ever. He's run full marathons, so I'm not worried about him finishing, but it's always better to run conservatively when you're going into new territory. Stressing that to him helps keep my own competitive urges under control.

While the start conditions are on the cool side of pleasant, the humidity is already high and the bright sky is a harbinger of warmer temps. Foliage is late this year so, even in the woods, there won't be much shade. The run up the road to the trailhead has nobody too interested in laying down a hard pace. A few firm strides would have me leading onto the singletrack. As I see no upside in that, I drop in behind the first five runners.

While there is some shuffling and spreading of the field over the next couple miles, everybody seems pretty content to take a wait and see approach with respect to conditions. We get to the high point of the course (known simply as "the picnic table" because there's a picnic table there) with a pretty good sized group strung out in a long line. There's a gap from me to the next runner ahead, so I have a clear line on the long, rocky descent to the river. Technique counts for a lot on this section and I open some distance without really putting out much effort. I get through the aid station quickly, grabbing a quick drink and some grapes while leaving one of my bottles to be refilled.

The flood plain woods
The flood plain section could not be more different than the ridges we've just run over. The fiercely steep and rocky terrain gives way to smooth dirt singletrack that winds around and through the tiny contour features created by the eddies of hundreds of floods. The forest floor is no longer stark limestone, but lush green. And, it's getting warm. With the sun now beating down on the river and meadows, the humidity isn't dropping, either.

Halfway through this section, David Pokorny and Joel Lammers both pass me. I've raced against both of them several times and, while it's always been reasonably close, they've beat me more than I've beat them. I'm not surprised that I'd be ahead through the opening ridges of West Tyson Park as that terrain plays to my strengths. But if they're just catching me now, I'm probably going just a bit harder than I should. David pushes ahead, but Joel is content to ease up for a few minutes for some conversation. He notes that he's more than a little concerned about the heat. Being from Wisconsin, he hasn't had many opportunities to acclimate this year.

The last mile through the flood plain is on a dirt road. It's the only part of the course I don't like. It's hemmed in between the railroad tracks and the high perimeter fence for Lone Elk Park (a rather necessary fence, as the Elk and Bison in the park would not mix well with passing freight trains). It's only ugly in comparison to the rest of the course but, by that standard, it's pretty bleak. The fact that it's also the only muddy portion of the course this year doesn't improve my appraisal. Then, it's back on singletrack for the steep climb up to the turnaround.

A few observations from the turnaround:
  1. I'm getting my ass kicked. There are quite a few runners ahead of me and only one of them is in the 25K.
  2. While that doesn't bother me, the fact that I'm deeper in the field means that the singletrack in and out of the turn is much more crowded than what I've dealt with in past years when I've made it back to the gravel road before meeting the bulk of the field head-on. I don't lose much time, but it definitely requires more attention.
  3. SLUG aid stations are really good. Well stocked and staffed. I leave a bottle to pick up on lap two.
  4. Kevin is doing remarkably well; nearly matching my pace for the first half of his race.
The Chinkapin
The return trip is uneventful. I run most of it alone, though I do go back and forth with Brent Haefner, who's second overall in the 25K. The only new part of trail is the Chinkapin Loop which is tacked on to the end of the lap to get the distance right. Heading straight up the ridge and then right back down, it's a very brutal way to add distance.

I get to the end of the loop in just under 2:10. The second loop is about 2 minutes longer (you have to run the park road a quarter mile from the finish back to the start) and the heat will add a few minutes more, so this is going to be a really slow time. Again, I'm fine with that (I'm retired, right?) but I do stop to consider if maybe I should just call it a day and hang with Kevin. After some cajoling from the finish line crew, I decide to head back out for the double.

Shortly into the lap, I meet Kevin coming in. He's slowed, but still seems to be running OK. I encourage him to hammer the final climb on the Chinkapin. The field is more strung out and passing on the Tyson side is easier since the woods are more open, so I make good time back to the flood plain. The aid station is surprisingly busy, but some of the volunteers recognize me as going out rather than coming in and I get priority treatment. In this heat, I don't dare eat much, so it's a quick stop.

Speaking of the heat, an interesting transformation is taking place. In a contest of fitness, I am badly outmatched, but as the emphasis shifts to fortitude, I'm holding up pretty well. My pace is off from lap one, but not by much. I've already passed a few folks and I start to wonder how many ahead are crumbling. It's certainly not death-march conditions, but it's warm enough for April that this thing might be in play. I dutifully stick to my plan of just running my own race, but I also resolve to stay on pace just in case things break my way.

At the turn, the situation is revealed as I get to see the leaders coming back. Hugo Lee is out front, about 10 minutes ahead of me. That's where he was at the end of lap one, so I'm matching him, but no better. Anything short of a complete collapse will have him in first. David is next, about five minutes ahead. If it was anybody else, I might think I have a shot, but I know he's a great pacer and very unlikely to fold. Then comes Joel, another minute back followed immediately by Ryan Winter. I'm certainly not betting against either of those two, but they're close enough that I decide to keep the pace going. It's about this time that I admit to myself that I'm racing again whether I want to or not.

I could play this one of two ways: firm it up for the entire remaining hour, or keep my pace steady along the river and then unload on the Tyson section. While the latter plays to my strengths, it seems unlikely that I'd take a minute per mile out of such good runners, even if it is my best terrain. With nobody pressuring me from behind, there's really no reason to play it safe; I start pressing.

When I get to the aid station at the base of the ridge, I still can't see anybody ahead. However, I'm not on the climb for long before I spot Joel up ahead. By the picnic table, I'm only a few seconds behind. The heat has done him in and he makes no attempt to hold me off. As I pass, he offers encouragement, indicating that Ryan isn't very far ahead. Indeed, shortly before the next climb, I spot him. I push hard while passing him, hoping that he'll decide not to fight. No such luck. He hangs tough over the final mile to the trailhead. As I make the turn onto the Chinkapin, I can see that he's only about fifteen seconds back. Rats, these final few minutes are really going to hurt.

Done.
They do. It's all I can do to keep a running stride going. It would probably be faster to power walk the climb, but I don't want Ryan thinking I'm falling apart. Finally at the top, I look back and see that Ryan has, in fact, decided that chasing me is no longer worth the trouble. I've got at least half a minute. My technique is completely gone, so I back off on the descent and hit the line at 2:27 for third overall and 1st 50+. My second-half split is second only to David who finished second.

So, what were these big revelations learned from "not competing?" First and foremost, as they say in the mob, this ain't a job you're free to quit. I was quite enjoying my run up through halfway and I'm sure I would have enjoyed a leisurely second half had the conditions not changed. But, once pressed into the mix, there really wasn't much chance of not joining the battle. If you're wired for this sort of thing, it's just what you do. Fortunately, ultras are long enough that you can play them by ear. Biding your time for the first half and then deciding whether or not to go is actually pretty sound strategy. So, I think I can continue to enter and enjoy these things as events and just amp it up on the occasions when things are breaking my way. We'll see.

The second is more of a technical revelation, but an important one none the less. I've long claimed that marathon is the "hardest" distance. All races are hard if you run them right, but the marathon has always struck me as particularly insidious. I've always felt this was because the marathon is the longest race where you are really fighting for seconds each mile. I still think that's a true statement, but I now believe it's a symptom, not a cause. In short races, fitness is dominant. If I was to run a 10K against national-level age-group competition right now, I'd get demolished. In ultras, it's all about keeping your head in the game. Even without adverse conditions, I could probably run a competitive 100 right now simply because I know how to do it. The body doesn't really have to cooperate if the brain is willing to put up with the discomfort. Even at today's distance, adverse conditions (and they weren't even that bad) were enough to make fitness a secondary concern.

The marathon is where these meet. For all but the very elite, it is just past the distance where fitness and form carry you. Whether you're an olympian or weekend warrior, a little over two hours is about as far as you can go on evenly measured maximal effort (meaning, constant effort that gets you to the line with nothing to spare). Only the very elite finish faster than that. And yet, so much of the race is in the space where fitness counts for everything that you absolutely have to have run it like you would a shorter race. The only way to run your best marathon is to pace it like you can actually carry the distance and try not to think about how much the last 10K is going to suck. Willingly running yourself into that situation is a very difficult thing to do.

And, if you're wondering which road marathon I'm going to sign up for to test these ideas, I've got two words: I'm retired.

Three Minute Thesis

UMSL held a 3-Minute Thesis competition today. You get one static slide and 3 minutes to pitch your topic. After giving the talk in a preliminary round, I was invited to the finals. I didn't place in the top three. Here is my slide and what I said.

Hello, my name is Eric Buckley and I am pursuing a PhD in Math.

As it’s an election year, I’ve chosen to introduce my subject with one of the most famous pictures ever taken in St. Louis. Contrary to popular belief, I was not alive at the time, but I’m told the 1948 election was almost as crazy as what we have going on now. For me, the most interesting fact is that EVERY SINGLE POLL got it wrong. Why? Technology bias.

I am old enough to recall when working class people would have “party lines”, a single phone number shared by several families. Pollsters would sample the number once and count it once, but it actually represented several votes. As a result, Truman’s blue collar base was under-represented.

Cell phones and caller ID have given pollsters plenty of new problems to wrestle with, but I am more interested in what happens when the source isn’t a person at all, but a database. Or, many databases combined into what’s known as a “data lake”.

In such cases, one needs to consider not only source bias but also the fact that records are highly correlated. So, it may appear you’re converging on an answer when in fact your sampling has missed a whole lot of contradictory information all tucked away in a tiny corner of the lake. Finally, in the case of financial data, which is the focus of my research, the distributions are not your typical Normal (or “bell”) curve. They are “heavy-tailed,” which means they are spread out with more observations very far from the mean. Random samples from such distributions converge very slowly, if at all.

My dissertation takes a three-pronged approach to these issues. First, we stratify the data to ensure we get adequate representation of the distribution tails. This part of the work has been very productive and will be submitted for publication next month.

Next, we adjust for correlation in the data so we can accurately estimate the variability of our results. More correlation generally means higher variability. Therefore, we have to dig deeper into the data to get a reliable sample that doesn’t miss pockets of significant information.

Finally, we put these two together to produce not just an estimate, but a distribution for the estimate. We can then give a range known as a “Highest Density Interval”. This is the narrowest interval that contains the real value with a given certainty. The more we sample, the narrower the interval becomes. We control the sampling to stop when the width reaches acceptable limits.

This rigorous quantification of results has been sorely lacking in the emerging field of Data Science. I am excited to be among those addressing this need.
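Since the Highest Density Interval is the one piece of jargon I couldn't unpack in three minutes, here's a rough R sketch of the idea (a toy illustration of the concept, not the estimator from the paper): sort the posterior draws and take the narrowest window that covers the desired mass.

# toy HDI: narrowest interval containing credMass of the posterior draws
hdi <- function(draws, credMass = 0.95) {
  sorted <- sort(draws)
  n <- length(sorted)
  m <- ceiling(credMass * n)                      # points the interval must cover
  widths <- sorted[m:n] - sorted[1:(n - m + 1)]   # width of each candidate window
  i <- which.min(widths)                          # narrowest one wins
  c(lower = sorted[i], upper = sorted[i + m - 1])
}

set.seed(42)
hdi(rgamma(10000, shape = 2, rate = 1))  # skewed posterior, so the HDI isn't symmetric about the mean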

Thursday, April 21, 2016

Functional algorithms

I've written plenty of algorithms and I've written plenty of functional code. Oddly enough, I've never had cause to specify an algorithm in functional terms (at least, not a significant one; I'm sure I had to write some trivial stuff for a class at some point).

Functional pseudocode is both beautiful and a bit unsettling. Add in the vectorization that comes from R and it's not at all clear at first glance what the time complexity for an algorithm is. Nor is it terribly obvious that the algorithm actually works. As my "algorithm" for the SA Clustering is actually a heuristic (meaning, it's not guaranteed to work), the whole thing just starts looking a bit shaky.

That said, I'm rather enjoying coding this thing up in R. I expect to have something running tomorrow. It will either work or it won't, but it's been good to give it a try.

Wednesday, April 20, 2016

Wag the dog

Writing this clustering assignment in R has had some unintended, but interesting, consequences. Namely, the "need" to vectorize everything. I put that in quotes because you can do things iteratively in R, but why would you want to? The vector capabilities are what make the language powerful.

That actually changes the algorithm more than I would have thought. Normally, I'd not be a big fan of changing functionality to accommodate an implementation. If an implementation can't handle it, use something else. However, in this case it's been both instructive and quite possibly beneficial (though I have no intention of putting the time in to verify the latter assertion - this is just a homework assignment).

Things that have to change to get the vectorized version to work:

  • I can't start with an empty set of clusters. The first pass will try to map all points simultaneously so I have to have something to map them to. We'll need to start with a seed set of clusters and they may overlap. The next item addresses that.
  • Don't create new clusters during the assignment phase. Just tag everything to the closest one. Since there will be ties in the case of overlaps, always choose the earlier cluster in the list. This way, redundant clusters will get killed off due to lack of points.
  • Pick a few points that are far from their assigned clusters to create new clusters. The first step is a special case of this where every point is considered far at initialization.
  • Rather than shrinking the boundaries to jettison points, do the assignments based on a probabilistic check of how far they are outside the boundary. Points inside will also have to meet a probabilistic threshold based on the temperature (as it cools, points inside become more likely to stay). Then recompute the boundary to cover just the points that got assigned.
So, the updated algorithm looks something like this:

set all points' cluster distance to a non-zero initial value
t = starting temperature (between 0 and 1)
while t > final temperature
   select new cluster points (function of t, and point's cluster distance)
   recompute cluster distances and nearest cluster for all points
   assign cluster if probability threshold met (function of t, and cluster distance)
   drop clusters with too few points
   t = t * cooling constant
   
If all this looks like a lot more fuss than simply coding up a known SA Clustering algorithm, it is. Sometimes you just have to indulge your curiosity. Fortunately, it's still not that big of a deal. The hardest part is that I'm still not really fluent in R so coding takes longer than it should. Then again, that's the point. I won't ever get fluent if I don't write some stuff.


Tuesday, April 19, 2016

Clustering using Simulated Annealing

For our next Data Mining project, we're supposed to implement a clustering algorithm that uses simulated annealing. I could, of course, just look one up, but I decided it might be more interesting to come up with my own. Since my interest in clustering is actually partitioning, I'm going to look at cutting clusters along dimensional boundaries. Therefore, we'll use a distance metric that simply looks at whether a point is inside or outside of the dimensional bounds for a cluster. If it's in, the distance is obviously zero. If it's out, the distance is how much needs to be added to the rectangular (or hyper-rectangular, in the case of more than 2 dimensions) perimeter of the cluster. The idea is to minimize the total perimeter, while capping the number of rows that can be in any one cluster.
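In R terms, the metric against a single cluster works out to something like this (just a sketch with a made-up box; the growth is measured per dimension, which is half the added perimeter and all that matters for the minimization):

# how far the box's x-range and y-range must each stretch to absorb a point
# (zero if the point is already inside)
BoxGrowth <- function(x, y, box) {
  max(0, x - box['Xh'], box['Xl'] - x) +
  max(0, y - box['Yh'], box['Yl'] - y)
}

box <- c(Xl = 10, Xh = 20, Yl = 30, Yh = 40)
BoxGrowth(25, 35, box)  # 5: the x-range has to grow from [10,20] to [10,25]
BoxGrowth(15, 35, box)  # 0: already inside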

Here's the basic algorithm:

clusters = null list
pool = all points
t = starting temperature (parameter strictly between 0 and 1)
while t > final temperature
   select all but n * t points for insertion into clusters
   assign points to (possibly new) clusters that minimizes total perimeter growth
   shrink perimeter of each cluster to force out edge points
   drop clusters with too few points
   t = t * cooling constant

I'm also considering coding it entirely in R, though I may change my mind on that if it gets unwieldy. While it's an interesting exercise, I don't want to spend too much time on it.
   

Monday, April 18, 2016

The second question

Somewhat off topic, but I told you I almost did my dissertation on development practices.

So, we have a group at work that is having more than a bit of trouble actually delivering stuff. It's not a problem of capability; the developers are plenty capable. Nor is it a problem of vague and/or ephemeral requirements; the users are quite sure of what they want.

When I came into that group four years ago, it had the same problem. I fixed it. I moved on. Now they are right back to where they were before.

Am I some kind of miracle worker? Hardly. But I do know how to ask the second question.

The first question is the question that even the most incompetent lead will ask: "Is it done yet?" If the answer to that one is "Yes" then there isn't a whole lot more to be done (except, maybe, checking to make sure that's the right answer; developers have a habit of calling things done when they aren't). However, that's not management, that's statusing. Any idiot can do that. Management is the second question.

The second question is not "When will it be done?" It was supposed to be done already, so obviously this is a task that defies estimation. The second question is not "Why isn't it done?" This just invites excuses. The second question is, "What do you need to get it done?"

This is a much more difficult question to wriggle out of, especially if the developer knows you are serious about clearing obstacles.  Rather than asking for the unknown, you're asking for the known. That means you expect a real, verifiable, answer. Rather than asking for the excuse, you're asking for the solution. That means it better really be the solution because it's going to happen. Most importantly, rather than providing an out, you're insisting on performance. Sure, I'll take care of whatever's stopping you. Now, stop being stopped.

It ain't rocket science, but it does require two things that a lot of managers seem to lack:

  1. A true belief that the job of a manager is to make others better.
  2. A clue as to what the developers are actually doing.
I know plenty of managers that have one or the other. Managers that have both are surprisingly hard to find.

Sunday, April 17, 2016

Distance metric and cluster shape

Looking ahead to dynamic partitioning...

The distance metric very much impacts the shape of the clusters. In particular, the standard Euclidean distance metric will tend to find clusters of spherical shape. For dynamic partitioning, I'm more interested in rectangular clusters. Specifically, clusters with boundaries along dimension values (e.g., age between 50 and 60, account type = expenses, year = 2016, etc.).

To find clusters of this shape, one should use the rectilinear distance (more commonly referred to as the Manhattan distance since it gives the distance one would travel driving on a rectangular street grid, like in Manhattan). This fact is well known, but rarely applied (since most clusters are, in fact, fairly spherical, or at least elliptical). I'm not exactly sure how it changes the application of BIRCH to the problem of partitioning, but it's something I'll need to work out this summer.
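As a trivial illustration (my own toy example, nothing specific to BIRCH), R's dist() function already supports both metrics:

pts <- rbind(c(0, 0), c(3, 4))
dist(pts, method = "euclidean")  # 5: straight-line distance
dist(pts, method = "manhattan")  # 7: sum of the per-dimension differences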

Saturday, April 16, 2016

Another big finish

Whad'ya know? I can still finish a race. Decided to go ahead and run the full 50K Double Chubb (which is what I signed up for, but I had considered dropping down to the single on account of the fact that I've been running so little). The race was going pretty much as I expected at halfway (as in, I was getting my ass kicked). Then, it got hot. That turned out to be a game changer. I ended up with the fastest second lap split to snag third overall and first 50+.

Just goes to show that ultrarunning is more about your brain than your body. I can't recall too many times when I was so focused in a long race. Full writeup next Saturday.

Friday, April 15, 2016

Big finish

Astute readers may have already deduced from this week's posts and related study entries that I haven't been getting much done lately. Looks like I'm going to finish the week with a whopping 10 hours of schoolwork, which is about half what I'd expect at this point in the semester. There have been good reasons for the lull and I don't think I'm in trouble, but I'd better not slip any further.

Well, maybe two days further. My college roommate is coming to visit this evening and will run Double Chubb with me tomorrow. Even though we're just doing the 25K "Single" rather than the ultra, it will still be his longest trail race ever. He's completed a few marathons on roads and is generally a good technical runner, so I don't expect him to have trouble finishing, but it's still a big deal for him and I'm glad I'll be part of it.

He has to leave early Sunday morning, so I won't be using our time together to write a blog entry tomorrow. I'll post something about the race on Sunday.

Then, I really need to get on the gas to finish out this semester. I have a test in Data Mining next Thursday, assignments for both classes the following week and I really want to have the CISS paper pretty close before semester end (May 12).


Thursday, April 14, 2016

Genius hour

Mildly off-topic post tonight.

My daughter's grade did their "Genius Hour" presentations this evening. They basically get an hour a week in school to research whatever subject interests them and then they put together a display of their results for perusal by the parents. Many of them were quite good. Yaya's presentation was on LGBT rights in general and gender identity specifically. It's a subject she's been wrestling with and she did a pretty fair job of framing it. (No, I'm not outing my daughter; she may be conflicted, but she's sure not shy about it).

Anyway, one of the other kids had a presentation on why learning to write code is such a great thing. His argument basically boiled down to "you're guaranteed a job."

Well, I certainly wasn't going to debate him on that point because it seemed that he really did think programming was fun and, if that's true, you pretty much are guaranteed a job. Unfortunately, there still are way too many young folks picking majors who believe that the skill without passion is all that's required.

I do know a few uninspired programmers who manage to get enough done that their employers find it easier to keep them on than go through the hassle of firing them. Most leave of their own volition (or switch to non-programming roles within the same organization). However, employers are increasingly using right-to-hire contracts to make sure that programmers are able to cut it before making a perm offer. I wouldn't think you'd have to bomb out on too many of those before contracting firms would get sick of placing you. Simply put, if you don't love this stuff, you probably won't last. And, even if you do hold onto a job, you won't be well paid. Junior programmers really don't make that much and you won't get promoted unless you show some initiative.

In 1990, I had a student at Mount Union who seemed to be of average intelligence, but was doing terribly in a programming class I was teaching. I asked her why she was majoring in Computer Science since it didn't appear to interest her much. She said her dad had pushed her into it so she could get a good job. That's just sad and it's still going on today.

It turns out, she did have a passion for a subject; it just had nothing to do with Computer Science. She switched her major to Art and, upon graduating, got a job as a graphic designer. I don't know if that worked out or not, but I do know that she left school eager to get at it and such individuals tend to succeed.

I wish Yaya's classmate success. But mostly I wish for him what I wish for all kids (and adults, for that matter): that he finds what he loves.

Wednesday, April 13, 2016

Goin' it alone

I'm pretty much out of people to ask for help this summer. Seems like all the faculty who would be qualified to help with my research are otherwise disposed. That may not be a terrible thing. I'll have my life back for a couple months. It does lessen the chance I'm done in three years.

I won't be completely dead in the water. Aside from the Q preparation, much of the planned research is actual implementation and I certainly don't need guidance on how to do that. Getting projects from whiteboard to production is what I do for a living.

Monday, April 11, 2016

Outline for BIRCH presentation

Recalling, of course, that this isn't really a presentation on BIRCH, but rather the concepts from BIRCH that will be relevant to my work this summer (and/or possibly fall).

Introduction: Partitioning as a form of clustering
  • Show a simple example of how partitioning cuts down on query time
  • Optimal partitioning has blocks where all or no rows are included
    • Also need a way to know without reading the block - attribute tagging
Building the initial data set partitions
  • Since we have no real query data, use surrogate partitioning
    • Weight attributes by expected query use
  • Insert entries into CF tree as per BIRCH except that we use hard partition lines rather than closeness when splitting a node. That is, an attribute is chosen to yield the best split and then the values are divided according to that attribute only.
  • Label each block with appropriate attribute tags
When querying, keep track of key metrics:
  • Actual attribute slices.
  • Percentage of rows actually used when a block is read.
  • Percentage of queries where the attribute tags allow the block to be skipped entirely.
Dynamic repartitioning
  • Rank the blocks based on the effectiveness of their attribute tags
    • Good blocks are either skipped entirely or return a high percentage of rows.
  • Remove underperforming block(s)
  • Re-insert data, using attribute partition weights updated from actual use
Why that won't work
  • Even with updated weights, most of the re-inserted data is going to flow to the same spot on the CF tree.
  • Insert a step between the deletion of the bad blocks and re-insertion of data where a new CF Tree is created using only blocks (not the individual rows in the blocks).
  • This will sufficiently re-shuffle the CF tree so that the re-inserted data goes to new (presumably better) locations.
Implementation details to be sorted out through empirical testing
  • Exact distance metric given query history
  • Block ranking metric
  • Fixed or dynamic number of blocks to delete on each iteration
  • Maintaining stability during the reshuffle operation (system needs to be continuously available for query and new data loads).

Sunday, April 10, 2016

Next big thing

Tons to do before the current paper is ready for publication, but it's not too early to think about what's next. My main priority this summer will be to make sure I'm ready for the Q. I'm not worried about Analysis; that's always been a really strong subject for me. I've already gone through Linear Algebra, so that's "just" practice (I use quotes because I think it will be quite a bit of practice to get that really sharp). That leaves the two stats courses. I don't know the exact content of those courses, but it appears to be fairly standard upper-undergraduate level stuff. As with Algebra, it's just a matter of doing enough problems to bring it all back to the surface.

I've got about 20 weeks to work with and I'd think 100 hours of working problems should be enough, so that's 5 hours a week. That leaves at least 10 hours a week for getting real research done. If I can tie it in with our re-platforming efforts at work, I could probably double that to 20. Subtract out this coming month which needs to be dedicated to finishing off this semester (and my CISS paper) and I've got somewhere between 200 and 300 hours. A lot can be done in that amount of time.

There are two possible paths with respect to dissertation work. One would be to continue on the implementation road by working on adding dynamic partitioning to the stratification. The other would be to dig into understanding the mathematical properties of heavy tailed distributions and the implications for processing financial data. The latter is a fairly active line of research right now, which means there's probably a lot less low hanging fruit for publication. However, it also means that cracking a tough nut could lead to a dissertation with some real significance.

My gut is telling me to take the first route since that will be much easier to fit in with activities at work. We're currently spinning up scalable environments for our larger analytics databases and dynamic partitioning would play right into that.

To that end, I spoke with my Data Mining prof about the presentation that's due this Tuesday. The assignment is to present the BIRCH algorithm. Well, with all six grad students presenting the same paper, the undergrads are going to be pretty bored by the end of class (this is a mixed grad/undergrad course). However, my thoughts on dynamic partitioning are pretty close to how BIRCH works. Optimal partitioning is really just a clustering problem where the distance metric is whether or not two rows show up in the same query. Furthermore, I was already going down the b+tree path for managing the splits, which is how BIRCH does it as well. So, I convinced him that I should present my forward looking view, that is, what ideas I'm going to use from BIRCH rather than the algorithm itself. It was actually a rather easy sell; I think he was somewhat dreading hearing six more or less identical presentations on a topic he already understands.

Now there's just the trivial matter of getting all my thoughts together and trimming it to a 10-minute presentation in two days.


Saturday, April 9, 2016

2011 "Double" Chubb

Next week I'll be running the Double Chubb Single Edition so I figured this week's off-day throwback race report should be to my previous run of the 1-lap course.

Run April 16, 2011

Perhaps it's a case of familiarity breeds contempt. I run the Chubb trail once or twice a month. It's a great trail. The half-marathon distance makes it a good intermediate run and it's easy to add distance using connecting trails if you want to go long. The hilly and technical portion through Tyson drains well and can be run in almost any conditions. When it's dry and the Meramec River is below flood stage, the bottoms section is fun, fast singletrack. The gravel road through Lone Elk is less than inspiring, but it's only a mile long.

And yet, I've never run Double Chubb.

Now in its 13th edition, the Double Chubb is arguably the most anticipated trail race in Missouri. No other trail event fills faster (10 hours this year). Some of this is a consequence of logistics. Most of the trail is singletrack and it's an out-and-back course. Therefore, the SLUGs (St. Louis Ultrarunners Group) have wisely held the field limit low to keep head-on traffic manageable. Also contributing is the fact that there is a non-ultra distance offered (the 25K "single" Chubb) which expands the appeal to a broader range of runners. But, mostly, it's a great race on a great trail.

Falling just two weeks before I'm scheduled to run the Illinois Marathon, I decide to enter the single. Even at a training pace, a 50K would be too much so close to a marathon. My plan is to run the race at marathon effort (that is, the pace I can hold for three hours), which should have me done in just under 2 without destroying my legs. As that's a reasonably competitive master's time, I leave open the option of kicking it a bit in the last few miles if there's an age group win on the line.

Due to a minor informational snafu on the SLUG website, I end up missing packet pickup the night before. While that's not a problem itself, it leaves me not knowing what time to show up the next morning. I decide to arrive at 6, which turns out to be a bit on the early side. Only Race Directors David and Victoria White are there. I offer some assistance in getting the registration tent set up and then take a nap in my car. By 6:45, the registration area is hopping. While standing around chatting with my friends sounds appealing, the damp, chilly morning has me shivering a bit so I get started on my warmup. I don't want to start with my feet all gooped, so I jog easily on the road for half an hour, then change into my orienteering shoes for the race. While just a bit heavy for trail racing, my hope is that the open tread on the O-shoes will shed the mud we'll be running through along the river.

At the gun, Andy Koziatek and Ben Creehan bolt off up the steep road to the trailhead at a pace I reserve for track work. Ben's in the 50 and I've never come close to beating Andy at any distance, so I don't concern myself with them. Behind, a pack of six quickly forms including Chad Silker, Joel Lammers, David Frei, Jeff Sona, Mitch Faddis (all in the 50K) and me. While it's nice to be leading the masters in the 25K, it's still way too early to assume I won't get a push from behind.

David sets the pace for about a mile at which point I decide to up the tempo a bit. Only Chad follows, but David comes blasting by when we go down the first descent. "Your quads will remind you of that in about three hours," I tell him. "It's free speed!" he responds. Yeah, free just like charging things on a credit card is free. He'll pay, but that's his problem. As we start the big climb to the picnic table (the high point on the course), I resume the lead and, again, only Chad matches the effort. By the time we're heading back down the other side, we are clear of the rest of the field.

Despite the heavy rains of the preceding week, the footing on the rocky descent to the river is fine. We get to the aid station at the train tracks in 24:27. Though I have run the opposite direction (which is slightly tougher) a minute faster during tempo workouts, I'm a bit surprised we're that quick. The pace feels right, but just to be sure, I strike up a conversation with Chad to keep myself from overcooking it.

Turns out, Chad doesn't need much encouragement. He talks nearly continuously all through the river section. I don't get in too many comments myself, but it's enough to keep the pace in check. The mud isn't nearly as bad as feared (even the big culvert crossing can be taken without resorting to hands and knees), but it seems reasonable to expect that to change once the entire field has splashed through. Along the gravel road, I intentionally run through some of the puddles to clean my shoes for the climb up to the turnaround.

As we start the climb, Chad decides to back off a bit. After all, he does have lap two to think about. At the top of the hill, I meet Andy coming back the other way and still looking quite strong. Ben is a couple minutes behind him having obviously thought the better of trying to match Andy's pace when he has twice the ground to cover. I arrive at the turnaround in 55:22. I don't really have a context for that split since the turnaround is beyond the trailhead, but it seems like I'm still moving at about the right speed.

One of the nice things (perhaps the only nice thing) about out-and-backs is that you get to see what your competition is up to. Clearly, Andy is going to win, but the Master's prize is still up for grabs as Bob Fuerst is less than a minute behind me. I may have to push that last stretch through Tyson after all. The top two 25K women, Mary White and Laura Scherff, have a tight battle going and are putting away nearly all the men in the process. Joel, David, and Mitch are still running well in the 50, but Jeff looks like he might be off his game today. Travis Liles has worked his way into the mix and looks quite fresh. I make it back to the gravel road before hitting the bulk of the field, which makes passing easy.

The mud has certainly gotten worse on the flood plain, but it's still not as bad as I was anticipating. About half way through it, Chad catches back up to me. I worry that maybe I'm slacking off the pace and letting Bob back into the race, but we hit the midway aid station at 1:25:46 which is within a minute of our split time heading out. I can't see Bob behind me, but just in case he is closing, I decide to firm up the pace a bit. Chad takes some time to get food at the aid station so I'm left on my own to push up the ridge away from the river.

I've run the Tyson section from the tracks to the trailhead at tempo pace so many times, the added pace feels quite natural. I don't quite match my normal tempo effort, but I'm not slow by much, hitting the trailhead at 1:49:56 for a 24:10 split. Having found my groove, I keep it going through the final loop on the Chinkapin trail to finish at 1:55:13. Andy's been done for over ten minutes, but as he's a young pup, I get the 40+ win. Chad comes through a couple minutes later looking relaxed for lap 2. Bob finishes at 1:58:43. While he may have missed the Master's prize, he does win Senior as he turned 50 last year. I put my finisher's medal and winner's plaque (both quite nice) in my car and then head back out for a few easy miles. Joel and Mitch have already come through. David is just starting his second lap, so I run with him for a while. He admits he took it out too fast, but he's still moving well enough that after about half a mile I let him go. He says he really wants to try to beat Mitch who is less than a minute up the trail. David always saves his best efforts for beating his friends. I consider it something of an honor that many of his best performances have come in the process of handing me a loss.

I plod along easily, internally debating the wisdom of going all the way to the aid station. Travis comes by at a pace indicating he's saved himself for lap two. At the picnic table, I rest for a few minutes and then decide to head down to the aid station to say hi to the volunteers. I'll walk back if I have to. At the base of the descent, Jeff catches me and we jog into the aid station together. He's not feeling great, but doesn't expect to have any trouble finishing.

I hang at the aid station for a few minutes, chatting it up with the volunteers and passing runners. All I ate during the race was a single packet of Gu, so I take this opportunity to down some of the refreshments. As usual, the SLUGs have a fine spread - their aid stations are among the best.

It's turning into a longer run than planned and I don't want to put myself too far in a hole for Illinois, so I go back to the start straight through the woods, which is not only a mile shorter, but also a softer surface. I've done enough navigation training at Tyson that I have no trouble doing this without map or compass.

I get back barely in time to clean up and change clothes before Ben comes in, winning the Double in 3:56:19. He looks pretty fresh for having just finished an ultra. Given that his second lap was nearly a minute per mile slower than his first, I have to assume he got off the gas when he realized he wasn't going to be challenged for the win. Chad takes second in just over 4 hours, then there's a pretty good gap back to Joel in third (winning Masters). Travis follows just a couple minutes ahead of David who admits that the thought of catching and then staying ahead of Mitch had a lot to do with holding onto a decent lap 2 for a 50K PR of 4:21:47.

While the cool, misty conditions may have been a bit uncomfortable for the volunteers, the runners all seem to agree it was a great day to be out on trail, even with the mud along the river. Personally, I would have loved to run a second lap hard. It was definitely a day to shoot for breaking 4 hours (my PR for a trail 50K is 4:06). My intention is to run Boston next year, so I won't likely try the double until 2013, but I'm already looking forward to it.

Friday, April 8, 2016

Green light

Reviewed results with my adviser yesterday and the verdict is "go". Biggest task is framing the result in the context of current research. To that end, here's my bibliography so far:

Why heavy tailed distributions show up and why they are a problem:

Mandelbrot, Benoit. “The Pareto-Lévy Law and the Distribution of Income”. International Economic Review 1.2 (1960): 79–106. Web.

Nolan, J.P. Stable Distributions — Models for Heavy Tailed Data. Birkhauser, Boston, 2015. Note: the book is still in progress; Chapter 1 is online at academic.2.american.edu/~jpnolan and is good background on its own. The author also has a good page on Stable Distributions.

No Stable Distributions in Finance, Please! I'm not sure if this one's been published or not. It seems to raise some valid points about the overuse of Stable distributions. Worth a read just because the English is comically bad (certainly better than my Czech, but it's still funny)

L. B. Klebanov, G. M. Maniya, and I. A. Melamed, A Problem of Zolotarev and Analogs of Infinitely Divisible and Stable Distributions in a Scheme for Summing a Random Number of Random Variables Theory Probab. Appl., 29(4), 791–794. I couldn't find the full text online at any site I have access to so I'll have to track down a hard copy in a library. If I can't get this one in St. Louis, I know I can read it at Cornell when I visit my folks in June because we used to have this journal in the OR department.

Stratified sampling:

Anderson, Paul H.. “Distributions in Stratified Sampling”. The Annals of Mathematical Statistics 13.1 (1942): 42–52. Classic background on stratified sampling.

Shahrokh Esfahani, Mohammad; Dougherty, Edward R. (2014). "Effect of separate sampling on classification accuracy". Bioinformatics 30 (2): 242–250. Discusses bias issues arising from stratified sampling when the ratio of data between strata is not known.

Two texts I should probably read; hopefully I can find them in a library cuz they're REALLY EXPENSIVE. Fortunately, my family is in Ithaca and Kate's is in Champaign, so I'm regularly making trips that land me near two of the world's best mathematics libraries.

Heavy-Tailed Distributions in VaR Calculations. Chapter in Springer Handbooks of Computational Statistics.

Svetlozar T. Rachev & Stefan Mittnik, Stable Paretian Models in Finance

Thursday, April 7, 2016

Variance 1.01

OK, let's try this variance derivation again, this time remembering that probabilities need to sum to 1.

Recall that

μ = (nbk/2)∫θP(θ)dθ = E(X|block contains data)E(θ)

and that θ ~ Beta(a, b) where a is the number of blocks sampled that contained rows for the query and b is the number of blocks sampled that didn't.

Var(X|θ) = σ(θ)^2 = E(|X - μ|^2) = (1-θ)μ^2 + θ[(nbk)^2/3 - nbk·μ + μ^2] = (1-θ)p + θq

Var(X) = σ^2 = ∫σ(θ)^2 P(θ)dθ

OK, we've been here before. Let's do the integrals right this time:

σ^2 = ∫σ(θ)^2 P(θ) dθ = ∫[(1-θ)p + θq]P(θ) dθ

   = ∫[(1-θ)p + θq] θ^(a-1)(1-θ)^(b-1) dθ / B(a,b)    (forgot about that denominator last time - it's kinda important, especially when a and b start getting big)

   = ∫[p θ^(a-1)(1-θ)^b + q θ^a(1-θ)^(b-1)] dθ / B(a,b)

   = [p ∫θ^(a-1)(1-θ)^b dθ + q ∫θ^a(1-θ)^(b-1) dθ] / B(a,b)

   = [pB(a,b+1) + qB(a+1,b)] / B(a,b)

Here's the best part. Look what happens when you sub in B(a,b) = Γ(a)Γ(b) / Γ(a+b) and remember that Γ(z+1) / Γ(z) = z for all z ≥ 1:

   =  [pΓ(a)Γ(b+1) / Γ(a+b+1) + qΓ(a+1)Γ(b) / Γ(a+b+1)] / [Γ(a)Γ(b) / Γ(a+b)]

   = (pb + qa) / (a + b)

The real world never comes out that cleanly. But, hey, you can check out the convergence graph yourself. Notice how the confidence bounds wander right up to the true value, but nearly always contain it. It really is that slick. There's much more testing to be done. My guess is that certain types of queries will obey these limits better than others. But, I don't think many will be too far off. The theory is sound and the implementation is correct. At any rate, it's coded and it's working on my test data. That's good enough for me to write this thing up.
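
If you don't trust the algebra, here's a quick sanity check in R. The values of p, q, a, and b are just made up for illustration; since the integrand is linear in θ, the integral has to come out to p·E(1-θ) + q·E(θ):

#
# numerical check of the closed form for Var(X); p, q, a, b are arbitrary
#
p = 2.5; q = 7.0; a = 4; b = 9

integrand = function(theta) ((1 - theta) * p + theta * q) * dbeta(theta, a, b)

numericVal = integrate(integrand, 0, 1)$value   # ∫[(1-θ)p + θq]P(θ) dθ
closedForm = (p * b + q * a) / (a + b)          # (pb + qa)/(a + b)

c(numeric = numericVal, closed = closedForm)    # both ≈ 3.885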

Wednesday, April 6, 2016

All done!!!

No April Fools this time; I found my mistake. I left the norming constant off the Beta distribution when I was integrating my variance over θ. That's the sort of mistake I often make. It actually simplifies things even more. I'll post the updated derivations tomorrow. Meanwhile, here it is, convergent with true confidence bounds:

A lot of writing still to do, but you could say I'm pretty happy about this.

Tuesday, April 5, 2016

Math as religion

Well, it's not like it's a new idea. While the ancient accounts are conflicting in the details, there is no question that the discovery of irrational numbers was a fairly big deal to the Pythagoreans of the fifth century BC and that it resulted in more than a few debates where dogma was valued over evidence.

As an ardent Christian, I'm not bashing religion (though I distinguish it from the higher calling of faith, which I would define as belief that transcends catechism). And, to be fair, Kruschke's diatribe against frequentist hypothesis testing in my Bayesian stats book (which goes on for two full chapters) could be framed as a legitimate philosophical position rather than a statement of faith. But, that's not how he worded it.

I really have no problem with the fact that frequentists and Bayesians disagree on methods. Nor do I have a problem that the disagreement is not one of technique, but of the fundamental understanding of what constitutes something as central as a random variable. What I don't particularly care for is that the arguments are always framed as "the other side is just nutty for not seeing how obvious this is." We're not talking about disenfranchised primary voters here. These are some of the greatest minds on the planet. The fact that there is disagreement should be cause for joy, not alarm. It means we've got a meaty topic on our hands. How is that not a good thing? Debate away, but let's show a little respect.

Monday, April 4, 2016

Abstract

Here's a first draft of the abstract for this term's paper. Google's dictionary didn't recognize "heteroscedasticity" and wanted to change it to "heterosexuality". I don't need a spell checker making that mistake! I added it to my dictionary.

Financial services companies are now seeing data warehouse volumes running into the trillions of rows. At such volumes, the performance of queries using conventional “full search” technology (e.g., SQL, OLAP cubes) can degrade to the point where the user experience of immediate feedback is compromised. Simple random sampling methods do not yield reliable results due to the typically heavy-tailed distributions of financial data and their heteroscedasticity. In this paper we present an algorithm that adjusts for both factors and returns both a point estimate and a confidence interval for the queried result. Empirical tests on a sample of projected cash flows demonstrate the convergence and correctness of the results.
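
For anyone who would rather see the problem than read about it, here's a toy in R. This is not the CISS algorithm or the paper's data; it's a made-up lognormal column standing in for cash flows, a simple random sample, and the naive normal-theory interval that the abstract says doesn't hold up:

set.seed(42)
cashFlows = rlnorm(1e6, meanlog = 5, sdlog = 2)    # heavy right tail
n = 5000
s = sample(cashFlows, n)

pointEst = mean(s) * length(cashFlows)             # estimated total
se       = sd(s) / sqrt(n) * length(cashFlows)     # naive standard error of the total

c(lower = pointEst - 1.96 * se,
  point = pointEst,
  upper = pointEst + 1.96 * se,
  truth = sum(cashFlows))

Re-run it with a few different seeds and watch how much the interval jumps around; that instability is the thing the algorithm is meant to fix.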

Sunday, April 3, 2016

Cart before the horse

Well, I've got a more or less working algorithm. That's a good thing. Now it's time to put the RE back in REsearch. I really need to do a more thorough search of the literature to make sure this thing is publication worthy. I've only turned up a few relevant papers so far, and nothing that would prevent CISS from getting published, but I certainly can't claim that I've looked very hard. So, lots of library work over the next couple weeks. Tedious, yes, but necessary.

One would normally do all this BEFORE developing the result.

Saturday, April 2, 2016

2012 Kettle Moraine 100

I've already posted the train wreck that was my first attempt at Kettle. This week's off-day throwback race report details round 2.

Run June 2-3, 2012

Oddly enough, I found it quite enjoyable. Not so much the part that involved staggering through the fields in the midday sun and then being iced down at each of the next three aid stations. But, what followed was without question my most remarkable recovery in any event, ever. The trip through the night was slow, but I could feel my strength coming back. By morning I was running again and even managed to salvage a top-10 finish. I've never pulled myself out of such a deep hole.

But, enough about last year. I decided to return to the Kettle Moraine 100 this year with the goal of actually getting it right. I've spent a lot of time reliving the nightmare of last year's return trip from the 50K turnaround and developed a strategy designed to avoid a repeat:
  1. Don't hang out, but make sure to get enough at each aid station.
  2. Absolutely no sustained efforts until past the fields on the return trip.
  3. Don't hang out, but make sure to get enough at each aid station.
  4. For that matter, no short efforts until past the fields, either.
  5. Don't hang out, but make sure to get enough at each aid station.
  6. Once past the fields, go ahead and get on it with the idea of getting as much technical trail done in the daylight as possible.
  7. Don't hang out, but make sure to get enough at each aid station.
  8. Once night falls, use the remaining technical stuff as recovery.
  9. Don't hang out, but make sure to get enough at each aid station.
  10. Get back on it for the last 10 miles (which are pretty easy trail).

I still expect the night section, which is hillier and more technical, to be significantly slower going than the day, but I'm hoping to actually be running it rather than walking like last year. In the process, I'd also like to move myself a few places up the finish order. A review of the start list casts some doubt on that secondary goal. Kettle is now part of the newly formed Midwest Grand Slam and, the notable absence of course record holder Zach Gingerich notwithstanding, both the quality and quantity of entrants are at record highs. Two hundred forty-three folks have signed up to tackle the 100 miles of glaciated trail, with another eighty-eight planning on stopping at 100K and seventeen teams in the relay.

Like last year, I pick up my number Friday afternoon and then go for a short jog on the Ice Age Trail to get a read on conditions. Simply put, they couldn't be better. The trail has just enough moisture to give a nice, springy landing without robbing any power and there isn't a trace of mud. I decide I'll run the first 100K in my road racing flats and then switch to my trail shoes for the rockier section at night. I then head an hour west to Madison to spend a very fine evening with my college cycling teammate Tom Rickner and his wife, Rebekah, who put me up for the night.

The first evidence of the larger field comes when I arrive at the start at 5:30AM to get one of the very last parking spots in the grass. I'm not sure where they are going to put everybody else, but I don't have time to worry about that. With all the extra commotion, the 30 minutes before the gun are barely enough to get my drop bags sent to the proper stations, pick up my timing chip, fill my bottles, and change into my racing flats. Seasoned racers will note the lack of a restroom trip in the preceding list. That would bother me in a shorter race, but I figure stopping at the first aid station in a 100 isn't a significant hit to the time.

Once underway, a lead group forms containing the usual nutballs who seem to have forgotten that the course distance has three digits. This is exacerbated by the fact that the conditions are very nearly perfect: 60F, low humidity, clear skies, excellent footing. Despite intentionally holding back, I still go through the first mile in 9 minutes, a minute faster than planned. I slow to what seems a truly absurd pace, and manage to stay there. Steve Pollihan runs with me a bit, but it's clear he wants to go faster, so I tell him to set his own pace.

I arrive at the Tamarack aid station at 5 miles in just under 50 minutes. I barely break stride as I've still got enough in my bottles to get me to Bluff at 7.5. Bluff is a longer stop as I make my requisite bathroom stop. It's 8 miles to the next manned station with just one water drop in between, so I top off my bottles and grab a little extra food.

This is the first section of true singletrack, the opening miles being on a grassy ski trail. I keep the "effort" (still feels ridiculously easy) constant, which lets the pace slip a bit. Still, the increased technical demands of the trail have me more in my element and I pass a dozen or so runners. Fortunately, trail etiquette is observed and I'm never stuck behind anybody for more than a few seconds. Unfortunately, the stop at Bluff didn't get the job done and I need another restroom break by the time I arrive at Emma Carlin. That, and getting my extra water bottles out of my drop bag, makes it a seven-minute stop. I had resolved this year to not rush through aid stations (see odd-numbered strategy items above), but 15 minutes of down time in the first 16 miles is not what I had in mind, either.

It's nearly another 8 miles across the notorious fields and marsh to the next manned station at Route 67. While not nearly as hot as last year, I'm still glad I've got the extra bottles. The running is easy and I firm up the pace just a bit so I get out of the sun sooner. I leave my extra bottles at the Route 67 station and head back into the woods for the remaining distance to the turnaround at 50K. The lead relay team is tearing up the course; I meet their runner coming back while I still have nearly five miles to the turn. The first 100-mile solo runners start showing up while I still have over two miles to go. Not surprisingly, Tommy Doias is near the front. Before the start, he had indicated some concern about his fitness as he's coming back from injury, but he certainly appears to be running well now.

I hit 50K at 11:28 in 35th place; three minutes and 23 places behind last year. I hope the fine conditions have people taking this out too hard because, while the opening pace was very easy, I'm starting to feel the effort. Steve is at the aid station and asks how I'm doing. I reply that I'm a little concerned that I feel this tired given my pace. He echoes the sentiment.

The trail is quite busy on the return. Fortunately, I pass most of the field while still on the wide ski trail that leads in and out of the turnaround. By the time I hit the singletrack, the runners still heading out are moving pretty slow and happy to yield the trail. It's too early to be fighting for positions, but I do note that I am catching a few runners. At Route 67, I grab my extra bottles and head back out into the sun. It's nothing like last year, but it is getting warm. I'm considerably slower than the morning passage through this section. Still, it feels good to come striding into Emma Carlin with a bit of form as opposed to last year's drunken stagger. My legs are telling me I should forget about the big surge to sunset, but at least they are still willing to run.

I put my extra bottles back in my drop bag and also leave my shirt. It's soaked through with sweat which is causing some chafing and I'm sure I'll be back to the start before it gets cool. I decide I need to run easy for a few miles to recover from the fields. Leaving the station, I'm passed by a runner who I decide to name Jose Cuervo since he looks like he should be playing volleyball on a beach in Mexico rather than running through the kettles of Wisconsin (turns out his real name is Rolando Cruz). He's pretty much the opposite of me: young, broad shoulders, deep tan, and a full head of dark hair. His stride is smooth and powerful. I'm surprised anybody carrying that much muscle is running so well at nearly 50 miles. Less than a quarter mile into the trail, we pass a couple girls coming the other way. They wave happily to Jose; perhaps members of his support crew. As soon as they are out of sight, he decides running fast isn't such a happening thing. He smiles and offers a word of support as I pass, no doubt thankful that his little show of bravado was enhanced by my pasty-white, concave chest as a basis for comparison. I get a chuckle out of it myself and return the encouragement.

The rest of the way in is uneventful, though I'm never able to get back on pace. I'm a bit surprised that I don't see the first 100-mile solo heading out on the night section until I'm two miles from the start. Seems that the leaders have had to give up some pace as well. More surprising is the fact that I finish the 100K in 9th place (at 5:33PM). I certainly didn't pass 25 people coming in, so they must have been camping at aid stations or dropped out altogether. As I'm changing my shoes, Tommy comes over and informs me that I'm about to be 8th; he's not going out for the night section. He says his heel is OK, but he just doesn't have the miles in to run 100 today. Tommy was a big part of me finishing last year, pushing me out of aid stations and telling me I'd recover if I just kept moving. I think about returning the impetus, but decide to respect his decision. It's one thing to have difficulties, quite another to realize you aren't prepared.

The night section starts the same as the day, following the ski trails to Bluff. This is some of the easiest running on the course, but it's not feeling easy to me. In fact, I'm increasingly having difficulty running at all. I was half an hour slower coming back on the day section which is a pretty good indication the first 50K were too fast, regardless of how effortless they felt. Now it's looking like the wheels may be falling off entirely. It takes a full hour to get to the Tamarack station, 2 minutes per mile off the morning pace. Two runners pass me, and I usually take that seriously at this point in the race, but there's nothing I can do about it. As much as I want to put ground under me before sunset, it's obvious that I need to take a break and gather myself or risk having to walk the entire night. I walk for a mile eating and drinking along the way and then pick up a very easy jog into Bluff.

After Bluff, it's back onto the Ice Age Trail again, but this time the out and back is in the opposite direction, heading southwest to Rice Lake. I'm less than a mile into the trail when I hear footsteps coming up behind me. "Solo or Relay?" I ask, without looking back. "Fun Run!" comes a cheerful reply. Oh, no! Not that again!

The "fun run" (which, at 38 nighttime miles on technical trail, is itself a pretty stout ultra) is used to keep the trail warm since the 100 field gets very spread out. It's a good idea to have more people available to render assistance if needed, but I am really not in the mood to have dozens of fresh runners blowing by me over the next few miles. I ask my new companion why they didn't wait until 8PM to start and she says they had so many that wanted to run, they decided to start in waves to keep the trail from being clogged. That's something of a relief and it is nice to have some company. She runs with me for a couple miles and then moves ahead. Within minutes, another fun runner has latched on and when I offer to move over so he can pass, he says he'd rather just hang back for a bit as it seems I'm negotiating the trail pretty well.

And that is actually true. Sometimes the best way out of difficulties is simply to tend to something else. During the fifteen minutes of conversation, my form has returned. And, none too soon as we're into fairly technical terrain. My hope was to make it to the Route 12 aid station (77 miles) prior to sunset where I would get my headlamp out of my drop bag. I have my backup light in my pocket, but this trail has enough roots and rocks that the big light is definitely worth the extra weight. The last mile to the aid station is in fields, so we press the pace to get through the woods as the light dims. We get to the edge of the field just after sunset and give each other a quick high-5 for beating the dark. I tell him to go ahead as the surge has been felt and I'm going to need to run slow again for a bit. In the open field, there is plenty of twilight to get me into the station.

I take my time at the aid station getting a cup of chicken soup (which is about all my increasingly fragile digestive system can handle right now) while I hook up my light and battery. I'll need everything I've got left for what's to come. The four and a half miles from Route 12 to Rice Lake are the toughest of the course. There is some easy trail in the middle, but the rest is steep, uneven, and littered with rocks and roots. As expected, I have to walk most of the really steep stuff, both going up and down, but I'm pleased that I still have enough form to run the rest, albeit slowly. The round trip takes well over two hours, but it's done. I know I can run the rest and, while any hopes of a sub-20 are gone, this is shaping up to be a reasonably competitive result (it's impossible to tell who's in what event at night, but it seems like I'm still around 10th).

I make the stop as short as I can, opting not to revisit my drop bag. I seem to be teetering on the precipice of disaster and my body is interpreting even the briefest rest as a signal to start shutting down. The snap is completely gone from my legs, but I still have enough motor control to stay upright on the trail. My stomach is churning, but I'm able to get enough down and nothing is coming back up. My joints (particularly my hips, which are always a problem because of all the bike wrecks in my 20's) are stiff, but functional. In short, I feel like I've been running all day and as the clock rolls past midnight, that is pretty much the case.

I pass through Bluff for the fourth and final time and again find the ski trails to be tough going. I'm sure I'm moving faster than when I was on singletrack, but since they are much wider, it feels slower. I pass the Tamarack station getting nothing more than a cup of water. I'm pretty sure nothing else will stay down and there are only five miles to go. I shuffle along for another mile or so and then manage to find just a bit of form. It's not much, but with the finish so close, I decide to hang onto it and cover the rest of the distance without losing the pace.

I cross the line at 2:35 AM, for an official finish of 9th in 20:35:19. As is typically the case in ultras, the top 10 is jammed with folks in their 40's, so I don't get any age-group hardware. It's a little disappointing to lose a place in the last third of the race. That doesn't happen to me very often and it's never happened in a 100 before. I tend to view the late stages of an ultra as an annuity where I get to cash in on my patience earlier in the race. If I had been able to execute the last part of my plan and get back on it with 10 to go rather than 3, it might have worked out that way; 6th place was only 15 minutes ahead. It's still hard for me to believe that the opening pace was too firm, but I guess it was.

At any rate, that's a rather minor blemish on what was really a pretty solid run on a fantastic trail in great conditions. It's not odd at all that I found it quite enjoyable.

Friday, April 1, 2016

All done!

April Fools. Certainly very encouraging progress this week, but it's still not quite right. I ran it with several random seeds and this plot is pretty representative:
It seems to work great for driving out the big variability, but then it wanders around a bit. That doesn't bother me, but the fact that the confidence interval doesn't indicate as much bothers me a lot. In just about every trial, the confidence bounds exclude the real answer for an extended period between 300 and 400 blocks sampled. They generally do a pretty good job everywhere else. I could just arbitrarily widen the confidence interval, but I'd like to know why I have to do that.

My first thought was that modeling the block sum given a hit as Uniform(0, max possible) was bogus. Turns out, it's not. I checked the data against that assumption and it came back looking pretty darn uniform.
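
The check itself was nothing fancy; roughly this, with placeholder names rather than my actual variables:

#
# eyeball and KS check of the Uniform(0, max possible) assumption
#
CheckUniform = function(blockSums, maxPossible)
{
  hits = blockSums[blockSums > 0]                        # blocks that actually contained query rows
  hist(hits, breaks = 20, main = "Non-zero block sums")  # should look flat
  ks.test(hits, "punif", min = 0, max = maxPossible)     # formal test against Uniform(0, max)
}

A flat-looking histogram and an unremarkable KS statistic are all I need here.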

My guess is that my test data set is simply too small. Since the total sum is just that, a sum, I'm counting on the central limit theorem to give me a normally distributed error. Between the fact that the sums in different strata have significantly different variances and the relatively small number of non-zero entries, things are still a bit more spread out. Also, we're sampling without replacement. Once more than half the data has been sampled, that's going to start cramping the whole "the block sum is an independent random variable" thing. This algorithm was conceived as one that would sample a sufficiently small percentage of the total to not worry about that.
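
The without-replacement effect is easy to see in a toy simulation. This has nothing to do with my actual test set; it's just N made-up block sums, most of them zero, sampled both ways:

#
# spread of the estimated total, with vs. without replacement, at an 80% sample
#
set.seed(1)
N = 500
blocks = rexp(N, rate = 0.1) * rbinom(N, 1, 0.2)   # mostly-zero, skewed block sums

EstimateTotal = function(n, replace)
{
  mean(sample(blocks, n, replace = replace)) * N
}

n = 400
sdWith    = sd(replicate(2000, EstimateTotal(n, TRUE)))
sdWithout = sd(replicate(2000, EstimateTotal(n, FALSE)))
c(sdWith = sdWith, sdWithout = sdWithout, fpc = sqrt(1 - n/N))

Once 80% of the blocks have been drawn, the without-replacement spread is smaller by roughly the finite population correction sqrt(1 - n/N), which the independence assumption above knows nothing about.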

Hooking this thing to the real data set will require a new data layer. I've architected it so that the data layer can be swapped out fairly easily, but there's still the non-trivial matter of writing a high-performance replacement and hooking it to a 4TB data set that isn't really set up to be read this way.
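
Roughly what I mean by a swappable data layer; the names here are hypothetical and the real thing is more involved, but anything exposing these two functions can sit under the sampler:

#
# in-memory data layer for test runs; a production version would keep the same
# two functions but read blocks straight off the warehouse's storage instead
#
MakeTestDataLayer = function(df, blockSize)
{
  list(
    NumBlocks = function() ceiling(nrow(df) / blockSize),
    GetBlock  = function(i)
    {
      rows = ((i - 1) * blockSize + 1):min(i * blockSize, nrow(df))
      df[rows, ]
    }
  )
}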

In the meantime, I think I have enough to start writing. The confidence intervals aren't way off; this could be presented as a first cut.