Monday, February 29, 2016

Epidemiology

So, what if you read yesterday's post and thought to yourself, "that's great, but I don't know anything about a field called Epidemiology"? Well, you could look it up on Wikipedia, of course, but, as long as you're going to trust some amateur hack to give you an encyclopedia entry, it might as well be me.

It started with shit. Seriously.

London in the 19th century was being overrun with the stuff. The flush toilet has been around in some form since ancient times. Thomas Crapper contributed to its widespread development in the late 1800s, but did not actually invent the thing. What happened between ancient times and Crapper's fortune was the birth of Epidemiology.

In 1854, London was thrown into a panic as an epidemic of cholera hit the city. All sorts of things were blamed for it including the simple observation that the destitute were being hit the hardest, so it must just be that god doesn't like the lower class. Talk to Mr. Trump about Mexicans if you think we've progressed since then.

John Snow (no, not the dude from Game of Thrones) was a doctor in London who surmised that it might have something to do with the fact that people were literally drinking their own waste; the standard way to get rid of the stuff was to dump it out your window into the gutter where it could contaminate the water supply.

The powers that be were not impressed with his argument, so he did what any self-respecting man of the Age of Reason would do: he gathered more data. He plotted where each victim lived on a map. He then looked for the nearest well. No luck; it appeared several wells must be contaminated, and that would be quite a coincidence if the root cause was a breach allowing waste water into the fresh supply.

Then comes the brilliant part. Like most brilliancies, it's obvious in hindsight. People don't take helicopters to fetch water; they walk. So, he redrew his map, this time using minimum walking distance as his distance metric. He accounted for shortcuts like back doors and alleyways. Blam. A single well was isolated. In the face of the newly presented evidence, officials issued an order to close the well in question. The epidemic ended as quickly as it had started and a new field of study was born. (And a new industry - London became the pioneering city for modern sanitation).

Like every field, Epidemiology has come a long way in the subsequent 150 years, but it's still basically about the same thing: extracting information embedded in disease and mortality data to suggest (not prove) a root cause. My master's work looked at methods to prioritize public health investigations based on the likelihood (or, unlikelihood) of the spatial distribution of disease cases. It was a minor result, for sure, but some very practical good came out of it. The New York State Department of Health used the results to streamline their investigation process and they did, in fact, turn up some nasty toxins.

Epidemiologists have been prying pearls of knowledge out of data for over a century, but the Data Science crowd seems to think they're inventing all this stuff. Read a typical Data Science paper and you'll see few, if any, references to journals like Biometrika (looks Russian, but it's actually an Oxford publication) or The American Journal of Epidemiology (the latter being where I was published). It's a gold mine. Just read all the significant results from the last 40 years in Epidemiology and you'll blow the doors off these newbies.

Please note that I'm not suggesting that the Data Science crowd is a bunch of idiots. They're just not using a lot of work that's already been done because it's been published in places they don't generally care about. In fairness, the vast majority of Epidemiology literature is very specific to particular applications, so unless you're in the trenches trying to sort out why diabetes rates are inversely correlated with income or some other such problem, you wouldn't bother reading it. However, if you want to get in on this Data Science thing (and, why not? It's a booming field), a review of the milestone articles is an easy way to get a leg up.

Sunday, February 28, 2016

What's old is new again

No great pearls of wisdom today. Spent most of the evening finishing off my homework.

I will note, however, that if one were seeking a PhD in Data Science, clustering would be the quickest route, provided one knew about a field called Epidemiology. I'm reading all this stuff for my Data Mining class, the citations are all between 1990 and 2010, and I can't help thinking, "we knew all that in the 80's." It's almost enough to make me want to pivot my thesis topic but, for now, I'll carry on with the problem at hand.

Saturday, February 27, 2016

2005 Chicago Snowgaine

Plenty warm out today so this week's off-day throwback race report is to a run in more seasonable conditions. And, while she doesn't read this blog, I'll still give a "Happy Birthday" shout out to my wife Kate.

Run February 19, 2005

I'm not quite sure why it took David and me 8 years to try a team orienteering event together, but it obviously falls in the "really good idea" bucket. We've done navigation sections together as part of adventure races, but since you only get one set of maps, it wasn't so obvious how well we work together.

We arrived separately since I was staying in Chicago for the weekend. When I get there, David has already sketched out a preliminary route. It looks ugly - lots of doglegs - but I'm only able to find a few small improvements. The route ends with a road run of nearly two miles. To make that pay off, we'll have to hammer it pretty hard. One thing is for sure: the winning time will be well under the 6-hour limit.

Competition map: North, South.

Just before the start, meet director Kathy Bullard calls us all together for some instructions. As she counts down to the start I realize David is nowhere around. David always seems to be doing something right before the start of these things, even today when he was the first person to register. As she says "go" David pops out from behind his van and we head off.

On the way to our first control we are passed by Michael Eglin and Dimitry Perchonok. I expect that they will be our primary competition. Michael is as fast as we are and a better navigator on a true orienteering map. This map is a touched up version of USGS so our experience on rough maps evens the odds. I don't know whether Dimitry will help or slow him down.

Although the temperature is in the high 20's, the ground is frozen so solid it is like running on concrete. Make that concrete covered with ice, sticks, and leaves. Footing is bad in the valleys, worse on ridges, and nearly impossible on hillsides.

We get to the fourth control at the same time as Eglin/Perchonok and open up a slight lead heading up the hill. On the other side of the ridge, we drop into a steep reentrant. We can hardly go three steps without slipping and bruising something. Our adversaries take the power line cut down to the road and beat us to the next control by about 30 seconds. A few things are becoming clear:
  • We really should have worn spikes.
  • We are going to need to pick it up if we're going to get rid of Eglin/Perchonok (who are wearing spikes).
  • Spikes would be a good thing.
  • We need to give excessive preference to routes that follow trails and fields so we can run fast without spikes.
  • It sure would be nice to have some spikes.
Speaking of spikes (now in the navigation sense), another fact is becoming apparent: when David and I are navigating together on ridge and valley terrain (which is 90% of our terrain in St. Louis), we basically spike everything regardless of how fast we go. This is both good and bad. Good because it means we can go really fast. Bad because we now have no excuse not to go really fast.

After the fifth control (25), we head to 19 while the others go to 23. We crank up the pace on the route from 19 to 23 (mostly trail and field) and continue to push to 20. Just as we are getting there, we see Michael and Dimitry leaving the control. Encouraged by the fact that we are nearly a control ahead, we push even more.

Since the out and back strategy on trails and fields is working so well, we switch the order of 13 and 14, making 15-14-13 one big dogleg. 13-12-11 also becomes a dogleg as we stay in the valley rather than going over the ridge between 12 and 11.

We're running about as fast as I can go for this distance. Although we are covering a lot of extra ground to avoid the hillsides, we're still doing 8:00/Km against the straight line. We have one little boom at 8 as we both drift right of the control. We recover quickly and the error is less than two minutes.

Our most radical departure from the red line comes at control 4 where we decide to take the road all the way around to 3. This would have been great if the road hadn't been covered in ice. We have to take tiny steps as we half run/half slide down the hill.

When we get to 2 (our last control that requires any real navigation) we spend a few seconds taking in the view off the cliffs. Part of this is because it really is a nice view and part is because we know what comes next - the road run to the finish.

David scrambles down to the road ahead of me and I tell him to get moving since he has the punch card. I follow about 30 meters behind. If any further evidence was needed as to how evenly matched we are, we run the entire leg up to 10 spaced exactly like that. As David punches 10, I catch up and we run into the finish together.

Our strategy has paid off huge. We are the only team to use the long road run. Avoiding the hillsides near the end put us way out in front. Michael and Dimitry show up 43 minutes later and the third place team of Maricel Olaru and Natalia Babeti is nearly an hour back. Five teams and one solo competitor manage to sweep the course in under six hours.

Kathy is a little disappointed that we got done in just over 3 hours. I don't know that you could make the course much longer without adding a lot of controls. The area just isn't that big. I suggest that if she really wants to increase the distance, she could do a map exchange.

While I would have enjoyed a longer course, I also found it great fun to hammer full-on for three hours. We certainly wouldn't have been able to hold that pace for six. I felt like David could have gone a bit faster without me, although he might have had to slow down to navigate. At any rate, it was fun to put the Buckley-Frei Death Match aside for a day and work together. So much fun that I think we should do it more often than once every 8 years.

Friday, February 26, 2016

About that class thing...

Yes, I am taking classes this semester, not just writing algorithms of my own choosing. Spent a couple hours working the chapter 5 problems for Bayesian Stats this evening (homework is due Monday; this kind of crept up on me). A bit frustrating to spend two hours writing R programs to solve problems that I could do by hand in 15 minutes but I do get it. No point in programming stuff you don't understand until you're really sure you can validate the stuff you do understand.

Counting the cost

Missed yesterday's post due to yet another packed day. Straight to work upon waking so I could get out of there in time to meet with my adviser at 3PM. Then my Data Mining class at 4. Followed that with a short track workout (4x800m@3:00 plus a warm up and warm down). Then, straight to choir practice (I've pretty much quit choir, but a long-time associate pastor of our church passed away last week and I felt joining them for his funeral this Saturday was a fairly high calling). Finally, a short bit of research and off to bed. So, I'll enter a somewhat off-topic post this morning and may have some academic meat later on today.

It's only somewhat off topic because, as I've mentioned before, the big casualty in all this has been my running. That's a good thing, really. Something had to give and better my running performances than something that really mattered. That said, running matters to me.

I'll begin by acknowledging that there is absolutely no reason why somebody needs to be their absolute best at anything, much less a recreational pursuit. However, if one is trying to do something well, looking for shortcuts is not the way to go. My impending performance at The Woodlands serves as an illustrative case. Accepting the elite invite to The Woodlands Marathon last fall seemed like a good idea at the time. I was coming off a sub-3 and age group win at Milwaukee. With winter break providing six weeks of decent training time, it seemed like I could hold some decent form together.

We'll find out in 8 days just how far we need to stretch the definition of "decent form" to make that statement true but, for the purposes of argument, let's assume that all my pre-race predictors are right and I run somewhere around 3:03.

Taken in isolation, Runner's World would use that as a case study for their typical "Here's a super busy guy who does great on less training" story. Only 12 weeks of prep totaling barely 800 miles, but those two quality workouts per week had him beating his Boston Qualifying time by half an hour. They might even go further to suggest that I could qualify for Boston on even less if I just turned up the intense workouts a bit more.

Ah, but that's the rub. I probably could. But not because 40 miles a week and some track work is enough. It's because I have this huge base from when I had time to train right. My best marathons have been preceded by around 1700 miles in 18 weeks. The only reason I'll still be way under the BQ standard is that fitness doesn't leave you overnight (it just feels that way). I know, as does just about every credible coach, that running marathons well means running twice a day every day for years.

While it's a surprising fact of biology that one can get reasonably close to optimal performance on much less, the running press has interpreted this as meaning that the extra effort doesn't really pay off. Nothing could be further from the truth. Five minutes may not sound like much, but the winning time for 50+ at nearly every mid-sized city marathon is between 2:55 and 3:00. That five minutes has demoted me from someone who had a legitimate shot of winning my age group at 90% of the world's marathons to someone who doesn't have a chance. It's not coincidental that the winning times are so uniform. It's a diminishing returns game which compresses the results into a very small range. Those tight times indicate that the top runners are engaged in some pretty serious marathon prep. You either do it or you lose to someone who did.

Again, I'm not super depressed about this. I knew that grad school would be the end of my competitive athletics. I'm merely making a point that so many people seem to miss. There simply are no shortcuts. If you want to be good at something, do so. But count the cost, because it is high.

Wednesday, February 24, 2016

Sampling harness complete

In a perfect world, I'd be further along, but I did get to a tangible milestone today and sleep is important, too. My sampling harness is complete. The architecture is pretty straightforward. As it's basically a command-line app, there's no presentation layer; just a model sitting on top of a data layer.

I've only implemented csv files in the data layer as that's all I'm working with right now. When I start using the full dataset, I'll swap that out for HDFS and host it in the cloud. Aside from basic read/write, the data layer also does the stratification of the data before presenting it to the model. This effectively re-blocks the data in the csv case. When I go to Hadoop, I'll do the stratification as part of the partitioning of the Parquet files so the data layer can leverage the parallelism inherent in the file system.

The model currently implements two sampling methods: a sequential scan of all blocks and a random sampling of blocks. Both run through the whole data set right now. I probably won't bother putting a stopping rule on either since I'm only using them as a baseline for convergence. If I do put a user interface on this thing, I'll factor the sampling methods out into a controller layer. For now, I'm just being careful not to build sampling rules into the true model classes. That separation will likely be important when I go to parallelize things.
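
For the record, here's roughly what those two baselines look like, sketched in R rather than in the actual C# harness; the block layout and function names are just illustrative.

# A 'block' here is just a contiguous slice of a numeric vector, standing in
# for a block of rows in the real data layer.
make_blocks <- function(x, block_size = 1000) {
  split(x, ceiling(seq_along(x) / block_size))
}

# Running estimate of the total after each block: mean block sum so far,
# scaled up by the total number of blocks.
running_estimate <- function(blocks) {
  block_sums <- sapply(blocks, sum)
  cumsum(block_sums) / seq_along(block_sums) * length(block_sums)
}

# Baseline 1: sequential scan of all blocks, in storage order.
scan_sequential <- function(blocks) running_estimate(blocks)

# Baseline 2: random sampling of blocks (same estimate, random visit order).
scan_random <- function(blocks) running_estimate(sample(blocks))

# e.g., using the amounts loaded in the 'Data in R' post below:
# blocks <- make_blocks(partPlan$AMOUNT)
# plot(scan_sequential(blocks), type = 'l'); lines(scan_random(blocks), col = 'red')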

It's just a start, but it works fine and the whole thing has unit test coverage. That's actually pretty important because once I start running large data sets through this thing, it's going to be very difficult to validate that each step is working right. Having those test cases to fall back on for regression when making changes will likely save a ton of time.

Next up, obviously, is to code my algorithm (need to come up with a better name than "my algorithm"). Most of it is pretty simple, the only difficulty is in computing the posteriors. I'll probably use a heuristic for starters just to get it working and then work through the math with my adviser when I get back from Texas.

Tuesday, February 23, 2016

Moving on

I'm pretty much ready to shelve the Metropolis algorithm. I'll concede that when very little is known about the distribution of the observations, it does have some merit, but that's not really the problem I'm trying to solve. I'm looking for fast convergence and this is clearly not it. Rather than demonstrate on my data, which is too irregular to be a good illustration, I'll show what I mean on the uniform distribution.


The blue bars show the counts of actual observations from a discrete uniform(1,100) distribution. The red bars show the Metropolis counts, where the walk moves from the current observation, x_i, to the next, x_(i+1), with probability x_(i+1)/x_i when x_(i+1) is smaller; if x_(i+1) is larger, Metropolis always moves to the next observation.
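
Here's a small R sketch of that experiment (the seed and plotting details are arbitrary; this is how I'd reproduce the red bars, not the exact code behind the graphs):

set.seed(1)
n   <- 10000
obs <- sample(1:100, n, replace = TRUE)    # the actual observations (blue bars)

# Metropolis walk over the stream: target mass proportional to the value, so a
# smaller next observation is accepted with probability x_(i+1)/x_i.
kept    <- numeric(n)
kept[1] <- obs[1]
cur     <- obs[1]
for (i in 2:n) {
  nxt <- obs[i]
  if (nxt >= cur || runif(1) < nxt / cur) cur <- nxt
  kept[i] <- cur
}

rawCounts <- tabulate(obs,  nbins = 100)
metCounts <- tabulate(kept, nbins = 100)   # the red bars

# deviation from the expected count for each value of X (the second pair of graphs)
par(mfrow = c(1, 2))
plot(1:100, rawCounts - n / 100, type = 'h',
     xlab = 'X', ylab = 'actual minus expected')
plot(1:100, metCounts - n * (1:100) / sum(1:100), type = 'h',
     xlab = 'X', ylab = 'Metropolis minus expected')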

As one would expect, Metropolis correctly shifts the sampling roughly in proportion to the mass of the observations (I'm using the term mass to indicate density times the value). However, the convergence is terrible. This is a sample set of 10000 observations on the smoothest possible distribution. One can reasonably ask for pretty nice convergence here. Yet, look at the deviation from the expected value for the counts for each possible value of X:



The actual observations have a few spikes, but are in line with expectations (they'd better be, or I need a new random number generator). Metropolis does great in the area of the distribution where there is very little mass, but look at those giant spikes on the right. That's in the heavy part of the distribution where I can't afford to be off by much. This isn't a fluky set of observations, either. I ran it quite a few times with different seeds and the graphs looked pretty much the same each time. If Uniform is off by this much after 10000 observations, my data is going to need millions of hits to converge. Given the difficulties of random sampling a large database, there's no way that's going to outperform a full scan until the data size is way into the billions. My full data set is that large, but individual queries will often return "only" several million rows. This just isn't looking like a good answer.

So, here's my revised algorithm that simultaneously monitors convergence of the posterior distributions of the extreme value (to avoid missing big observations) and the sum (which is the value we really want):

As before, when loading the data, stratify it by log2 and group into blocks large enough to get some decent mixing of attributes. Might have to encourage such mixing by applying a hash to the block assignment, but I'm not worried about that right now.

Then, on query, start with a prior (ideally from actuals) P(|Xs|) which is the density of the sum of the absolute values of the relevant measures from a block from stratum s (note that, despite the somewhat misleading notation, this is the sum of absolute values, not the absolute value of the sum). Set P(Xs) to be some convenient distribution centered at zero, which I think will work fine as these are generally offsetting accounting entries, though I may need to generate a better prior for the actual sum as well. Then:
  • Pick a stratum s in proportion to E(|Xs|).
  • Sample a block from there to get both the sum of the absolute values |D| and the actual sum D.
  • Update the posteriors P(|Xs| | D) and P(Xs | D). May also need to update the nearby posteriors, but, for now, I'm treating them as independent even though they aren't.
  • Keep going until the (1-α) HDI on all the posteriors P(Xs | D) is within the specified tolerance.
  • Multiply E(Xs) by the number of blocks in each stratum and sum to get your answer.
This is essentially stratified Gibbs sampling with an update on the distribution at each step. That strikes me as obvious enough that I did a cursory review to see if there's already a paper out there somewhere for this and didn't find one (though I did turn up some related papers from Importance Sampling that I should probably read).
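
To make that a little more concrete, here's a stripped-down R sketch of the loop. It cheats in two places: a normal-theory interval on each stratum's total stands in for the (1-α) HDI on P(Xs | D), and the E(|Xs|) update is just a running average rather than a real posterior update. The block layout and names are made up for illustration.

# blocks:  list of numeric vectors (the rows of one block each) -- made-up layout
# stratum: integer stratum label for each block (e.g. its log2 bucket)
sample_total <- function(blocks, stratum, tol, conf = 0.95) {
  S       <- sort(unique(stratum))
  perStr  <- lapply(S, function(s) which(stratum == s))   # block ids per stratum
  unseen  <- perStr
  sums    <- lapply(S, function(s) numeric(0))            # sampled block sums per stratum
  eAbs    <- rep(1, length(S))                            # stand-in for E(|Xs|), flat start

  repeat {
    open <- which(lengths(unseen) > 0)
    if (length(open) == 0) break                          # everything sampled: full scan

    # pick a stratum in proportion to its (estimated) absolute mass
    s <- if (length(open) == 1) open else sample(open, 1, prob = eAbs[open])
    b <- unseen[[s]][1]
    unseen[[s]] <- unseen[[s]][-1]

    x         <- blocks[[b]]
    sums[[s]] <- c(sums[[s]], sum(x))
    eAbs[s]   <- mean(c(eAbs[s], sum(abs(x))))            # crude running update

    # stop when every stratum's interval on its total is tight enough
    z     <- qnorm(1 - (1 - conf) / 2)
    width <- sapply(seq_along(S), function(i) {
      d <- sums[[i]]
      if (length(d) < 2) return(Inf)
      2 * z * sd(d) / sqrt(length(d)) * length(perStr[[i]])
    })
    if (all(width < tol)) break
  }

  # scale each stratum's mean block sum by its block count and add them up
  sum(sapply(seq_along(S), function(i) mean(sums[[i]]) * length(perStr[[i]])),
      na.rm = TRUE)
}

# e.g. total <- sample_total(blocks, stratum, tol = 1e6)

The real version replaces those two stand-ins with proper priors and posteriors; the control flow is the same.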

For a first cut, I'm not going to stress too much about computing posteriors; I'll just assume something mathematically convenient with a tail to the right. The goal is to get something working in the next few days so I can then refine it while I'm in Texas.

Monday, February 22, 2016

Refining the algorithm

I'm still not sold on Metropolis, but at least I did get my questions cleared up. In the meantime, it occurred to me that my algorithm is more complicated than it needs to be. While having a good handle on the extreme value is critical, I can build that into the sampling process by stratifying not only the data, but the prior and posterior distributions as well. In short, I'll treat each stratum as a more or less independent sample space and continue to sample from it until the distribution for that stratum is sufficiently tight. Bouncing between strata will still be handled a la Gibbs sampling (by picking a value from the overall distribution and then selecting a block from the corresponding stratum).

The hangup is the "more or less independent" part. These distributions are absolutely not independent. So, while I will sample from them, I need to be able to adjust posteriors based on results in other strata. I'm banking on only having to look at adjacent strata (or, at least, only near neighbors) as opposed to the joint distribution across all of them.

Still lots of work to do.

Sunday, February 21, 2016

Tapping the brakes

I've got some issues with the Metropolis implementation that I want to clear up before finishing the coding. In the meantime, I've got a bunch of new articles to read. Might as well get a start on my literature review. On the upside, it turns out that I can read a lot of the UMSL library material online by proxying in through my student account. That's going to make the research a lot less wasteful (no spending 15 minutes tracking down a journal just to find the article is irrelevant) and a lot more comfortable (though there was a certain charm to having my own tiny desk in the stacks at Cornell's grad library).

Links may or may not work depending on what sort of research access you have:

Gilks, W. R., and P. Wild. “Adaptive Rejection Sampling for Gibbs Sampling”. Journal of the Royal Statistical Society. Series C (Applied Statistics) 41.2 (1992): 337–348.

Roberts, G. O., A. Gelman, and W. R. Gilks. “Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms”. The Annals of Applied Probability 7.1 (1997): 110–120.

Kotz, Samuel, et al. "Hitchcock, D. B. (2003), 'A History of the Metropolis-Hastings Algorithm,' The American Statistician, 57, 254-257: Comment by Kotz, Johnson, Read, and Banks and Reply". The American Statistician 58.1 (2004): 90.

Ritter, Christian, and Martin A. Tanner. "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy-Gibbs Sampler". Journal of the American Statistical Association 87.419 (1992): 861–868.

Saturday, February 20, 2016

Meramec Orienteering 2016

So, after throwing way back last Saturday, this week's off day throwback race report is to... last Saturday.

Run February 13, 2016.

There are maps you compete on and maps you own. I own Meramec. I'm not entirely sure why, but I do. The confluence of very physical running, large-feature navigation, and just enough penalty for error seems to line up pretty much perfectly with what I bring to an orienteering competition. I've been beat here, but not often.

So, even though the temperature was in the teens and my fitness was off and I hadn't done much navigation training lately, I still happily drove the hour out to Meramec to see what sort of course Gary Thompson had put together for this year's meet.

Turns out, a pretty good one (not that I was surprised by that). Meramec doesn't offer much in the way of detailed navigation, so what you want to test here is the ability to simplify legs and make good route choices (and, having done that, run your ass off). Gary has fully embraced that and put together an itinerary that offers many over-versus-around decisions.

It's super frosty at the start, but a short jog on the road up the hill parallel to #1 does a good job of getting me warmed up. I decide to go with my crappy plate compass because I'm afraid my nice thumb compass will constrict blood flow to the thumb and turn my left hand into a block of ice by the end of the meet. I figure that Meramec has enough big-feature detail that the compass won't get much use. (Astute readers will have already detected the foreshadowing).

The rest of the report will probably make more sense if you pull up a copy of the map in another window.

1: It's pretty obvious that one wants to run the road to the day use parking area. After that, the choice is high or low. I choose high, cutting through the field heading straight for the control. This turns out to be a big win as there is a deer track through the high grass and the woods have recently been burned so they are full speed. I spot the control from a good 50m away. I usually carry my punchcard loose, but I knew I'd drop it with my fingers numb in these conditions. So, I put it inside my map case. Despite the warmup, my hands aren't working too well. I have to put the punch card on my thigh and hit the punch with my fist to get the pins to penetrate my map case.

2: This is my first opportunity to get myself warm and I charge up the hill hard. I then drop down to the trail, using the bend after the cave as an easy attack point to 2. My hands are doing better, but I still resort to the fist punch method.

3: I pick up the trail again to where it crosses the stream and then again jam it up the hill to try to generate some heat. The reentrant midway looks a lot bigger than I expected, but I'm sure it's right. I nick the very top thinking how nice it is to know exactly where you are half way through a long leg taken at full speed. Dropping over the ridge gives lots of opportunities for parallel errors, so I take care to drift right and check off the top of the reentrant east of the control (A). From there, it's an easy attack to the control.

4: I didn't even see the route to the left staying high. I wouldn't have taken it even if I had. The redline route has a great handrail with the big reentrant leading you right up to the saddle, so the only reason to go around is that you don't have the legs. For the moment, my legs are doing fine so I charge right at it.

5: I don't think my route was terrible on this one but, in retrospect, going one spur to the west (B) and then descending was the better choice. My route crosses a lot of little reentrants which, like just about everything at Meramec, are strewn with small rocks. Once low, finding the right spot to turn up is trivial since the road intersection points right at it. I am a little put out by how high the control is placed. It's not wrong, I just misread the number of contours. It's a bit of a grind back up to it and I'm plenty warm enough now that I don't need to be finding ways to generate more heat.

6: I'd call this little leg trivial except that I passed two people here who obviously felt differently. Not sure what was hard about it. Run the stream until you see the rocks.

7: Not much to this one either. Aside from being just a hair left (which meant I had to climb a couple extra lines crossing the spur midway), it was a pretty straightforward leg.

8: Well, all good things come to an end. Seven spikes in a row is about as long a streak as I can keep going. The exit from 7 is easy and there's no problem picking up the reentrant to the hilltop after crossing the road. I did entertain the idea of taking the road around. If I had, I would have finished about 5 minutes quicker because at the top I mistake the saddle to the south for the one I really want (C). It's almost impossible to do that if approaching from the north. Right turn into the wrong reentrant system, which doesn't look at all right, but I've been misreading the size of reentrants all day, so it's not until I'm sure I've gone too far that I consider what might have gone wrong. I check my compass, which clearly indicates that the whole system is pointed the wrong way, but it's a crappy compass, so I look around some more to make sure and then decide that I have, in fact, blown it. Back to the saddle and straight into the control, but it's a pretty big boom.

9: Fortunately, the opening 500m of the next leg are easy running down a very broad spur. That lets me regain my composure and get my head back in the game without giving away any more time. I consider taking the trail along the stream and then attacking straight up the ridge. Safe, but slow. I'll never be able to run up that pitch this late in the race. By taking the more direct route up the spur, I can keep moving quickly. I stay just a little low so I can use the top of the reentrant as an attack point.

10: Even if the entire rest of the course was there merely to set up this leg, it would have been worth it. This is a truly great route-choice leg. Down to the stream and take the trail? Up to the ridgetop? Or contour just below the ridgetop? I choose the third and I'm pretty sure it was right, though I'd have to time the other two to be positive. The problem with all three choices is that there's no good attack point. This is where knowing your mapper can help. One could forgive a mapper for deciding either that the entire Meramec map should be mapped as rocky ground or that it's not worth bothering because rocky is the default condition. However, Plamen has a very well-defined standard for what rock features he maps. And, since I've run on many of his maps, I know what that standard is (basically, rocky ground is mapped when it goes from something that can sprain your ankle to something that can break your ankle). Sure enough, the little patch of rocky ground 200m west of the control (D) is easy to identify and I contour in. I even score some ego points (like I need those) when another orienteer wandering the other direction asks if I've just punched 10 and I say, "No, 75 more meters" and it turns out to be almost exactly that. (Yes, I know, you shouldn't answer that question, but this is a SLOC meet, not National Championships).

11: I contour over towards 11 frantically trying to find a decent attack point. None presents itself. I decide to stay on the same contour as 10, figuring it will be easier to spot the bag looking up the hill. Indeed, with great relief, I spot it from about 50m away and turn up the hill.

12: The technical part is done (I'm not likely to miss a highway overpass), but there's still the issue of getting there. I take a pretty straight line, which goes through some fairly nasty woods and crazy loose footing, not to mention the rockface which, while correctly mapped as crossable, does require scrambling. I don't think the powerline to the right or the fields to the left were any better.

I sprint in for a finish time of 79 minutes. Ten minutes per K is a bit slower than I'd like at Meramec, but I'm really pretty pleased with the effort. All but one leg was run just as I had planned and the damage on the boom was pretty well controlled. It nets first overall, which is never a bad thing. The conditions and my current training would probably not have produced a good result anywhere else, but I own this map.



Friday, February 19, 2016

Baseline

Well, my adviser wasn't quite as enamored with my algorithm as I had hoped. He wants me to apply Metropolis directly first. I'm not wild about that because a) I know Metropolis works and b) so does everybody else. It's just not particularly efficient.

I'm trying to look at the bright side. It will be a good baseline. If my algorithm can beat Metropolis in a straight-up fight, I'm onto something. If not, I'll need to look for improvements. Meanwhile, much of grad school is sucking it up and formalizing results you already know on the off chance you may be wrong, so coding Metropolis it is.

I've decided to write it in C# simply because it will take me half the time of writing it in Java. I'll regret that if we port the data layer to Hadoop, but we're six weeks into the semester; I need some tangible results real soon. I should have it working over the weekend.

Thursday, February 18, 2016

First cut at the algorithm

Preparation:
  • Stratify data upon loading, creating blocks of rows with roughly equal order of magnitude (in absolute value, so big offsetting entries will tend to wind up in the same block).
  • Store references to blocks as an array of lists, so it's easy to retrieve all blocks with the same order of magnitude. Minimally, this will be an access pointer and row count for each block, plus a count for each stratum. It may make sense to also store attribute information so one can quickly determine if a block has ANY rows of interest without actually scanning the rows; that's an implementation detail to be dealt with later.
Query (single cell result):
  • Compute initial prior for the result and the extreme value (largest value included in the result). I'll use a generic prior for a first go, but using actuals is probably where I'll wind up.
  • Create an (empty) list of blocks already queried, so we can enforce sampling without replacement. (Might make more sense to do the inverse: create a copy of the full list and remove references as we go).
  • Set the initial stratum to the mode of the extreme value prior.
  • While the width, e, of the (1-γ) Highest Density Interval (HDI) of the extreme value prior is greater than E (γ and E are parameters):
    • Choose an unsampled block from the current stratum.
    • Add that block to the list of sampled blocks.
    • Sample that block completely.
    • Calculate posterior distributions on the result and the extreme value (lots of edge cases here; particularly, what to do when no rows are found in the block and when the block is the last one for this stratum).
    • Generate a random variable from the extreme value posterior and set the current stratum to match the magnitude of that value.
    • Set the current priors to the posteriors.
At this point, we have a very good distribution on the extreme value and should also have a reasonable handle on the result distribution as well (although, since we haven't actually used it yet, it might be best to hold off on the result distribution until we get here). What we may or may not have is a good handle on the distribution of the individual data elements. At any rate, compute one now from the initial prior and all the data sampled so far and set that as the prior for this step.
  • While the width, r, of the (1-α) HDI of the result prior is greater than R (α and R are parameters):
    • Generate a random variable from the prior distribution of the individual data elements and set the current stratum to match the magnitude of that value.
    • Choose an unsampled block from the current stratum.
    • Add that block to the list of sampled blocks.
    • Sample that block completely.
    • Calculate posterior distributions on the result and the individual data elements.
    • Set the current priors to the posteriors.
And, when that converges, you're done. Yes, it looks like a lot of work, but would you like to see what a keyset search for all matching records by a traditional database system looks like? Actual computation is so fast now that it's almost free. The sticking point is getting stuff in and out of the computation engine. More calcs on less data will likely carry the day. At least, that's what I'm betting on.
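
Since both loops stop on the width of an HDI, here's one way to compute that width in R, assuming the posterior at each step is represented by a set of Monte Carlo draws (take the narrowest interval that covers a 1-α fraction of them):

# Width of the narrowest interval containing a (1 - alpha) fraction of the draws.
hdi_width <- function(draws, alpha = 0.05) {
  d <- sort(draws)
  n <- length(d)
  k <- ceiling((1 - alpha) * n)           # how many draws the interval must cover
  if (k < 1 || k > n) return(NA_real_)
  min(d[k:n] - d[1:(n - k + 1)])          # slide a window of k draws; keep the narrowest
}

# e.g., keep sampling blocks until hdi_width(result_draws, alpha) < R
# and hdi_width(extreme_draws, gamma) < E (names here are hypothetical).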

Also note that this algorithm parallelizes very easily. Just delegate each step to a separate process. There isn't much downside to doing them out of order; just a little coordination required to maintain the list of sampled blocks.

Wednesday, February 17, 2016

New algorithm

Well, things may not be too bad. First off, my data isn't quite as 1-sided as I had feared. Here's a histogram of the order of magnitude (log2):

I'm not sure what that distribution is, but it sure isn't exponential. If it was, the partitions would get exponentially larger as I approached the lower order values and that would make it very hard to sample without missing a few really important values way out on the right. It starts out that way, but then it levels off. I just needed to dig further into the center of the distribution to see that.
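
For reference, a histogram like that can be pulled straight from the data frame in the 'Data in R' post below; something along these lines (zeros dropped since log2(0) is undefined):

amt <- partPlan$AMOUNT[partPlan$AMOUNT != 0]    # drop exact zeros
mag <- floor(log2(abs(amt)))                    # order of magnitude (log2)
hist(mag, breaks = seq(min(mag) - 0.5, max(mag) + 0.5, by = 1),
     xlab = 'order of magnitude (log2)', main = '')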

I think we're looking at a mixture of two distributions: a normal distribution centered around 2^10 ($1000) and a Chi-squared with a mode at just under 2^20 ($1,000,000). I have some theories on what's driving those two distributions, but I'll talk to the actuaries before sharing them as I may be embarrassingly off base.

The next question is how to go through the partitions (keep in mind that this is a tiny slice of the full data set; the real histogram would have five more zeroes tacked on to the counts). A thought on that struck me as we were going through one of the more basic MCMC (Markov Chain Monte Carlo) methods in class today.

While the example is intentionally trivial so as to demonstrate the process, a key aspect of it struck me. The problem under consideration is that of a politician who wants to spend time in each district proportional to the number of people living there. Problem is, this bozo doesn't know how many people are in each district or even how many districts there are. By using a basic MCMC strategy known as the Metropolis algorithm, the politician can localize the decision to one of simply staying where he is or going to an adjacent district, knowing only the populations of his current and potential district. With just that information at each step, a random walk can be executed that converges to spending days in proportion to population.
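
In R, the toy version looks something like this (populations made up; the point is just that each step only needs the current and proposed district):

districts <- c(12, 45, 8, 30, 5)          # populations; unknown to the politician
days      <- 50000
visits    <- integer(length(districts))
cur       <- 1
for (d in seq_len(days)) {
  prop <- cur + sample(c(-1, 1), 1)       # propose moving to an adjacent district
  # accept if the proposal is on the map and passes the population-ratio coin flip
  if (prop >= 1 && prop <= length(districts) &&
      runif(1) < districts[prop] / districts[cur]) cur <- prop
  visits[cur] <- visits[cur] + 1
}
round(visits / days, 3)                   # fraction of days spent in each district
round(districts / sum(districts), 3)      # target: population share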

The salient feature here is that you don't need to know very much about the range, or even the domain, of the random variable coming in. You can figure it out locally as you go. This is a very desirable feature when trying to sample from a 60 Terabyte database.

While somewhat problematic, it would not be terribly difficult to use an MCMC technique directly to estimate a posterior distribution for the result of a query. The problem, as I saw when I first tried some sampling against this data, is that the heavy observations in the tail are both very significant and very infrequent. By the time the sample is large enough for the distribution to converge, you've read darn near the entire database; you might as well just run a regular query and get the real answer.

For any given query, we don't know how far the tail extends, but we can form a reasonable prior either from the general shape of the data (in this case, it would be good to know whether the metric we're after comes from the Normal or the Chi-squared component) or from actuals, a much smaller set of data describing past performance, and then sample from the much larger set of projected results to form a posterior. The latter approach is used by the actuaries to validate that the projections make sense, and actuals do form a pretty reliable starting point.

So far, nothing new. Trouble is, as Financial Services companies are forced to declare in their ads, past performance does not necessarily indicate future results. It is not uncommon for a large block of business to be sold off or some other significant development to change the projections of the cash flows from policies. Even with a good idea of where the tail is, we can't be sure of that without looking at least a few orders of magnitude (log2) beyond. So, we're still looking at fairly slow convergence. To speed that along, I'm proposing that we violate the tenets of a Markov Chain and actually build some memory into the process.

Rather than continue to apply the same prior at each step in the walk, we compute a new posterior based on the data sampled so far. That becomes the prior for the next step. We then compute the next block to sample, using a branching distribution that targets the tail. Once we have a handle on the largest observations, filling in the rest of the distribution is relatively easy and can be done either by continuing the pseudo-Markovian sampling or by switching to more traditional random sampling (I'm favoring the former, but leaving my options open at this point).

The rub, of course, is computing the new posterior. That's usually the output of the entire MCMC process, not the result of one step. It's not that computing it at each step is intractable, just time consuming, and the whole point of sampling is to get to an answer more quickly. That's another good reason to be sampling blocks rather than rows. Each step will at least generate a bunch of observations which will presumably move the posterior along a bit quicker (that's still conjecture given the early divergent results). It may also be that the whole distribution doesn't need to be estimated. Just having a distribution on the extreme value is exceedingly useful.

Lots of other questions to answer, too, but I do believe that there's something here and I do need to get some sleep. Next step is to actually try this and see how well it converges. Project for the weekend, I guess.

Tuesday, February 16, 2016

Blindsided

I got to class a couple minutes late today and the prof is writing some problems on the board. OK, first problem is that I didn't bring anything to write with to class. I usually just try to pay attention and maybe make a note or two on my tablet. Second, and more serious, problem is that I am way behind after last week's crazy work hours. So, I ask if these are homework or if we're supposed to do them now, hoping for the former but getting the latter.

However, it wasn't a test, just an opportunity to work through some problems and get some feedback. It turns out, I'm not quite as far behind as I thought. I was able to do most of them. Still, it's another wake up call. I need to get on this studying thing. The semester is in full swing now and I can't afford to fall behind.

Monday, February 15, 2016

Data in R

So, I'm working with the data in R now. Haven't done a whole lot yet, but here are some graphs that make the problem obvious enough (no attempt at making them pretty right now):

partPlan = read.csv('data.csv')
amount = sort(partPlan$AMOUNT)
index = 1:length(amount) #length is approx 700,000 observations
plot(index, amount, type='l')
OK, obviously we have some sort of exponential growth going on at the tails. Let's take a closer look at the left side (the right isn't much different).

smallest = 1:1000
plot(smallest, amount[smallest], type='l')
That's ugly, but it's only a few observations at the tail. Let's trim that off and see how it looks.

nearlySmallest = 1000:10000
plot(nearlySmallest, amount[nearlySmallest], type='l')
Uh, oh, magnitude is less, but the shape hasn't changed. Wonder if this keeps up?

notSoSmallest = 10000:100000
plot(notSoSmallest, amount[notSoSmallest], type='l')
Hell's bells. We're now 1/3 of the way to the median. I had hoped things would get more linear as we moved into the middle of the data. If anything, the bend is more pronounced. What this means is that even if we stratify on order of magnitude, we're still going to run into the same problem at each stratum and if the maximal observation for the slice we're looking at is in one of the lower strata, we may be sifting through darn near the entire data set.

That's not to say I think the general idea won't work. Just that it's going to have to be a bit more sophisticated in trimming the data.

Sunday, February 14, 2016

Grunt work

I don't have a whole lot of time to work on school stuff today (well, OK, I chose not to devote a whole lot of time to school stuff today). Suffice it to say that all dramatic revelations are blocked on the fact that my data is not in a usable form. Next major task is the truly exciting work of getting the dataset into R. I haven't done any loading of large datasets into R. On the surface, it doesn't look very hard and it probably isn't. Just one of those things ya gotta do. Hopefully, I'll have something more interesting to say about it tomorrow.

Saturday, February 13, 2016

2004 Adventure Racing Nationals

Apropos of nothing, we're hopping in the way back machine for this week's off-day race report to my only entry at USARA nationals.

Run November 5, 2004.

David Frei, Yvonne Deyo, and I traveled to French Lick, Indiana for the US Adventure Racing National Championships on November 5. Although we later qualified on our own merits at Raid The Rock, we are competing as the Gateway Adventure Team, taking the slot that David earned while racing with Jason Coleman and the Sonas at Berryman.

The trip gets off to an inauspicious start when I get tied up at work Thursday afternoon forcing David and Yvonne to head off without me. I arrive later that evening, missing the race meeting. Fortunately, they were able to check in all my gear, so no harm done. We spend the night in the opulent French Lick Spa, which is also the race headquarters, start, finish, and primary transition area.

The next morning, we wake at 4AM to setup our transition area. We can't help but feel a little small as we lay out a lame blanket on the grass amidst the tents and canopies of the big sponsored teams. Fortunately, there is no rain in the forecast, so the disadvantage is only psychological. At 5AM we are given the race maps, instructions, and checkpoint coordinates. We spend the next two hours plotting points and deciding what gear to bring on each leg.

Just before 7AM we assemble with the 47 other teams in front of the hotel for the start. Standing in the frigid air in a crowd of competitors, each of us holding a 7-foot carbon fiber paddle by our side, one feels like part of the Gallic horde at Alesia preparing the assault on Caesar's ramparts as day breaks. We obviously hope for better success, or at least fewer casualties. At just after 7, the horn sounds and we all squeeze through the start gate and begin our run to the boats.

I'm carrying Yvonne's paddle as well as my own because she's being towed by David. Since she has both hands free, we have Yvonne carrying the punch card. This works for about 200 meters at which point she drops it (in her defense, Yvonne always uses a punch card case, so she's not used to carrying one free). I turn around to get the card while Yvonne and David stop short and clothesline the team behind us with the tow. By the time we've sorted it out we're in last place. At least we can look forward to improving our position.

I take over punch card duties and by the end of the 2-mile run, we've moved back into the top half of the field. The checkpoint at the end of the run is inside the ostentatious West Baden Springs Hotel. This massive structure was the largest free-standing dome in the world until the Houston Astros decided playing baseball indoors was a good idea. The scale is so overwhelming that running inside it is somewhat disorienting. Adding to the surrealism is the fact that when we get to the center of the dome to punch, we appear to be standing in some sort of bicycle museum. It's all pretty weird.

We grab a boat and put into the river where we will spend the next few hours. After just a few minutes of paddling, we hit a submerged branch and roll the boat. If you've never been dumped into near-freezing water, don't feel too bad; there are better things to do with your time. The banks are too steep to climb out, so we scramble up onto a log to right the canoe. We're back on our way without losing too much time, but we've lost a dozen positions and, as we soon find out, that has some serious consequences downstream.

Along the river are six logjams. None are big enough to be particularly dangerous, but they do form choke points where only one or two teams can get through at a time. At the first, we patiently wait in line for our turn to hop onto the logs and then drag the boat over. At the second and third, we try portaging around, but don't really get through any faster (getting the canoe up and down the steep banks is a slow and difficult process.) At the fourth, David devises a new tactic that works quite well. Yvonne and I hop out and take the packs. With the boat lightened, David is able to paddle right up onto the logjam and pull the boat over (passing the teams waiting in line along the bank). Yvonne and I run along the bank and get back in further downstream. This method is a bit less effective at the fifth jam because the bank is so steep that getting back in is difficult. The sixth jam is quite large so the organizers have mandated a portage. By this point the teams are spread out enough that we don't lose any time waiting.

The ambient temperature rises throughout the paddle, but the steep banks trap the colder air over the water. Most of the time we are in the shade, so our clothes remain wet. By the takeout we are all fighting hypothermia. Fortunately, the next leg is a run, and that is always the quickest way to warm up. Still, the time spent packing up the paddles and PFD's (which the race organizers will bring to the next paddle section) is misery as I can't stop shaking.

We start the run in around 30th place. The navigation is pretty easy, but even at nationals, any navigation slows down the back half of the field. I don't know why adventure racers don't take navigation training more seriously. Of course, they probably wonder why we don't take paddling more seriously.

We have two maps for this section. With both David and me navigating, we blow through the run at full speed (well, full speed for a race of this distance) with no errors. At several checkpoints we see teams coming back the other way, having overrun the control. The run ends with a fairly long slog down a rock-strewn reentrant. This is where having all three members good at the same thing really pays off. We zoom past a bunch of teams who are delayed by one or more members not skilled at uneven terrain. Even though the run is only an hour long, we get back to the French Lick Spa in the top 20.

At the transition, we shed any clothes that are still wet and change into our cycling shoes. It's almost noon and the weather has taken a big turn for the better. Blue skies and warm autumn air buoy our moods as we set out on the bikes. On the roads, I ride lead and navigate with David pushing Yvonne on hills. On trails, David takes over the lead duties. The navigation is still easy and we get to the end of the 90-minute leg in 15th place, having made no mistakes.

We're now faced with the second paddling leg. After the disastrous trip down the river, we are determined to redeem ourselves. This time, we are on a large lake that requires some non-trivial navigation. I'm in the front of the boat as usual, so I get the maps. Much of the lake is filled with rotting trees. While it's easy to paddle around the ones still standing, the stumps below the surface are harder to spot. There are several tense moments when we feel a solid bump on the bottom of the boat and wonder if we are going for another swim. The paddle takes 2 hours and, although we don't pass any teams, we consolidate our position and finish up feeling pretty good about our effort.

Sunset is still a few hours off, but the temperature is dropping again. As we put our packs back on and remount our bikes I am once again shivering badly. Yvonne and David are both concerned, but I assure them that I'll be OK once we get moving. The climb up away from the lake helps and after about half an hour I'm feeling fine again.

About this time we turn off roads onto trails. Unlike the trails we rode on the way to the lake (which were steep in places but generally fast), these trails are muddy and technical in spots. We have no trouble riding them, but the going is slow and by the end we are all running out of food. The sun sets just before we get back on roads. David's disappointed that the trail network was so simple. True, there weren't too many places to go wrong, but a few teams sure blew it. By the end of the trails we're up to eleventh even though we rode conservatively.

David has to ride lead now because I don't have a light on my helmet to read the map by. Yvonne is drafting off David and I'm sitting on the back of the train. After a few miles I find myself drifting off Yvonne's wheel. Hmmm, that's weird, focus. I push a bit and get back on. Then it happens again. And again, but this time I can't get back on. I call up ahead and they slow down a bit. I stay with them for a while, but the next time the road tips upward, I'm off again. I can't believe it. I'm bonking. On the bike. On the road. This isn't supposed to happen. My sole purpose on this team is to punch big holes in the air for David and Yvonne to slip through. My only salvation is that we only have another 20 minutes to get back to French Lick.

Other than a serious blow to the ego, my collapse has not had much effect on our standing. We're still in eleventh and are told that fifth place is less than half an hour in front of us. I'm usually the one snapping at the others to hustle in the transition, but this time I tell them that we're going to have to hang out for a bit while I get my blood sugar back up. After 20 minutes, 7 ounces of Hammer Gel, 2 Ensures, and some Fig Newtons, we're back on the road.

David still does the bulk of the work on the front riding out to the orienteering section. We pass one other team (finally in the top 10!) and by the time we get to the end of the riding, I'm feeling OK.

The orienteering section is laid out as a Farsta. This format is named for a Swedish town where it was supposedly invented (the Norwegians have a competing claim). The Scandinavians can squabble about the origins while the rest of the world simply enjoys it as a great way to run a race. Basically, you do two (or more) laps of a course. The course has forks and whichever fork you take on the first lap, you go the other way on the second. Therefore, everybody covers the same legs, but in a different order depending on which fork you get on which lap. It gives all the excitement of mass start racing, without letting people blindly follow around the course.

This format is particularly good for us since we hope to pass some teams and don't want them latching onto us as we go by. David carries the maps and I take the punch card (which also tells us which forks to take). Although we have a few small bobbles, we move well and make no big mistakes on the first lap. As I'm about to punch the final control, I look at the number on the card: 8g. Hmmm, I could have sworn we were going to 8h (the other fork). I ask David,
"Which control are we at?"
"8h."
"We should be at 8g."
"You said 8h."
"I know, but we should be at 8g."
"That's really bad."
"How bad?"
"Like the exact wrong way bad."
I go over to look at the map. Sure enough, of all the forks to screw up, this one is clearly the worst. We will have to go almost all the way back to 7 and then do the leg to 8g. At this point I would excuse any reaction from uncontrollable sobbing to homicidal rage, but my teammates offer neither. Instead, we turn around and head back the way we came. Nobody talks for a long time.

The error costs us about 15 minutes, which, in the context of 4 hours of technical night orienteering, is not that terrible, especially considering that we are still the fastest team through the section by a good margin. It does hurt just a bit to spoil some really clean night navigation on David's part with such a dumb mistake. If it bothered him, he didn't let on and by the middle of the second loop he's got bigger problems: he can't keep anything down. Near the end of the loop I offer to take the map so he can empty his stomach in peace. Not much comes up, but with nothing going down it's only a matter of time before he's toast.

The orienteering section moves us all the way up to third, but the fourth place team is only a few minutes behind us. They come into the transition while we are still sorting out our gear for the trip back and get back out ahead of us. The final leg is called a triad. We can take one bike and one (non-motorized) scooter. We have to stay together, but we can swap equipment as much as we want. Our original plan was to have David and me trading the bike while Yvonne rode the scooter. Now we decide that David had better be on the scooter most of the way and we'll just go as fast as we can with Yvonne towing me from the bike. It works pretty well and David does hand me the scooter for a few short rests. We get close enough to the team ahead that we start thinking it just might be possible to nip them at the line. They clearly have more in the tank, though, and lay down a brutal pace for a few miles that we simply can't match. With two miles to go they are a minute up the road. We look over our shoulders and, seeing no one, jog it in for a fourth place finish.

It's 2:07AM. While we're happy to be done, we are confronted with an unexpected dilemma. We assumed the race would be considerably longer so we didn't rent a room for the night. Now we are cold, dirty, and tired and can't do much about it. To the rescue comes Matt Luetje who offers us the Race St. Louis Team room for the remainder of the night. He even carries our bags up to the room. There are some small favors that will never be forgotten and this is one of them.

It's tempting in such cases to torture oneself with what-if scenarios. What if we'd seen the branch and kept the boat upright? Being a few places further up the field would have saved us some serious time at the logjams. What if we'd accounted for the fact that the cold had us burning fuel faster than usual? David and I might have avoided the bonk (a rare condition for either of us). What if I hadn't misread the punch card on the Farsta? That would have been 15 minutes for free. But the truth is that all teams have problems. Last year's champs couldn't even finish after dumping their boat and running into severe hypothermia. Getting past the obstacles is what matters. In that respect, we did pretty well and we never turned on each other. So I'd say fourth in this field is probably right about where we deserved to be, and I've managed to sleep through the subsequent nights without dreaming up a thousand ways to save five minutes and get third. Certainly, finding the 40 minutes it would have taken for second is quite a stretch, and no combination of factors would have had us in at 12:05 when the winning team, Hooked on the Outdoors, was finishing.

The course was a bit lean, both technically and in distance, for a championship event. The organizers had originally planned for a caving section, which would have been cool, but still wouldn't have put the winning time anywhere near 24 hours. However, the race organization was nearly flawless, and the course layout was well thought out. On the whole it was a very satisfying weekend and a great reminder of how much I enjoy racing with David and Yvonne.

Friday, February 12, 2016

Redirect on the sampling problem

Not sure if my adviser will go for this, but I think I'd like to redirect the focus of this semester's research a bit. The reason is that I've figured out why the data diverges so badly. I'll put together some graphs to illustrate all this, but for now, you'll have to take my word for it.

Basically, the data is distributed with a very heavy tail. What I mean by that is that most of the observations are pretty tight to the mean, but there are some that are waaaaaaaaay far away. Making it worse is the fact that the mean is pretty close to zero. So, if you are trying to estimate the sum, those outlying observations are crucial. If your true sum is, say, 100 million, and you miss a single 10 billion observation, your estimate flips sign and is off by two orders of magnitude.

That's a pretty big miss.

So, my focus for the moment isn't correlation, but accounting for the big misses. (Those actuaries are pretty sharp; they've been telling me this all along.) However, just because a problem is hard doesn't make it impossible (or so I keep telling my daughter when she doesn't want to do her homework). Here's the plan in a nutshell (a rough code sketch follows the list):

  1. Partition the data not by attributes, but by the order of magnitude (probably log2, but that's an implementation detail) of the measure.
  2. When running a query, start with the partition that has the biggest values. Read them all; you can't afford to miss anything in this group.
  3. Repeat through all the partitions, changing the proportion of rows sampled based on how much data you've already hit. For example, if the big partition only gave you a single row, you should probably sample everything in the next partition, too. But, if the big partition gave you a bunch, you could start being more selective because you know that even ALL the rows in a lower partition can't offset a bunch of rows in a higher one.
  4. When your sampling threshold gets to 0% (meaning you don't need to sample any rows from that partition), break. This is your optimal stopping rule.
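
To make this concrete, here's a rough Python sketch of the idea. Everything in it is illustrative: the data is just an array of signed values, the log2 bucketing matches step 1, and the taper rule for the sampling fraction is a placeholder for the Bayesian rule discussed in the questions below.

    import numpy as np

    def partition_by_magnitude(values, base=2.0):
        """Group values into partitions keyed by the order of magnitude of |value|."""
        mags = np.floor(np.log(np.abs(values) + 1e-12) / np.log(base)).astype(int)
        return {m: values[mags == m] for m in sorted(set(mags), reverse=True)}

    def estimate_sum(values, seed=0):
        """Read the biggest-magnitude partitions in full, then sample the smaller
        ones ever more sparsely as the number of rows already read grows."""
        rng = np.random.default_rng(seed)
        estimate, rows_seen = 0.0, 0
        for mag, part in partition_by_magnitude(values).items():
            # placeholder rule: full reads until a few hundred rows are in hand,
            # then taper; the real rule should come from the posterior (see below)
            frac = 1.0 if rows_seen < 500 else 500.0 / rows_seen
            if frac < 0.01:
                break  # stopping rule: the sampling fraction has effectively hit zero
            k = max(1, int(round(frac * len(part))))
            sample = rng.choice(part, size=k, replace=False)
            estimate += sample.sum() * len(part) / k  # scale the sample up to the partition
            rows_seen += k
        return estimate

The point of the structure is just that no row from a big-magnitude partition can ever be skipped, while the long tail of small rows gets progressively cheaper to estimate.
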
Questions:
  • How do you determine what percentage to sample? This is where it ties back to Bayesian stats (a rather important tie-in, since this is a project linked to a Bayesian stats course). We need some sort of prior on the distribution, and the sample size would then be set by the width of the highest density interval of the posterior at the desired confidence (a rough sketch of this follows the list).
  • Where do we get the prior? In the data I'm working with, we have the advantage of actuals. This is a much smaller data set that has a very similar distribution to projected future cash flows.
  • This is all well and good if the data is static, but are you really going to re-partition all your data when new stuff comes in? I don't think that's necessary. My thought is to fall back on the block sampling that I've been wanting to use all along. Within a block, data would all fall into the same partition. When new data comes in, blocks are added to partitions, but existing data doesn't get moved around. The problem then becomes one of picking how many blocks to sample rather than how many rows. Once you have a block, you read everything.
  • Doesn't that get you right back where you started? What if all the important data goes into a single block? That's the rub, of course. And, it gets worse as you increase the grain of the query, since a query with many attributes specified will be going after fewer rows, which may all reside in a single block. However, I think that setting the sample percentage based on the posterior from the previous steps saves us. If things aren't converging, we'll sample a lot more stuff, maybe all of it.
  • This isn't very complicated; are you sure nobody's thought of it already? They may well have. Fortunately, this is a project for a class, not a dissertation. If it turns out that the work's already been done, I'll still have enough work adapting it to my data set to justify it as graduate work. If not, there should be at least a paper in it; maybe a real start on the dissertation itself.
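
Here's one way the "what percentage" question might play out in code. This is a sketch under big assumptions: a normal likelihood for the per-row values within a partition, a normal prior on their mean built from the actuals, and a 95% interval standing in for the highest density interval (for a normal posterior they coincide). The names, parameters, and stopping tolerance are all mine.

    import numpy as np
    from scipy import stats

    def sample_until_tight(partition, prior_mean, prior_sd, noise_sd,
                           rel_width=0.05, batch=100, seed=0):
        """Draw rows from one partition until the 95% interval on the estimated
        partition total is narrower than rel_width of its magnitude."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(partition))
        taken, n, post_mean = [], 0, prior_mean
        for start in range(0, len(order), batch):
            taken.extend(partition[order[start:start + batch]])
            n = len(taken)
            # conjugate normal-normal update for the per-row mean
            post_var = 1.0 / (1.0 / prior_sd**2 + n / noise_sd**2)
            post_mean = post_var * (prior_mean / prior_sd**2 + np.sum(taken) / noise_sd**2)
            lo, hi = stats.norm.interval(0.95, loc=post_mean, scale=np.sqrt(post_var))
            total_lo, total_hi = lo * len(partition), hi * len(partition)
            if (total_hi - total_lo) < rel_width * max(abs(total_lo), abs(total_hi), 1.0):
                break  # the interval on the total is tight enough; stop reading
        return post_mean * len(partition), n

The prior_mean and prior_sd would come from the actuals; noise_sd is the assumed spread of rows within the partition, which the magnitude bucketing conveniently bounds.
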


Thursday, February 11, 2016

All dayer

Another very long day at work digging out from a failed migration. I managed to beg off until 11AM since I was up all night. Then, it was war room until, well, I'm still in here (7PM), so who knows?

There is a part of me that has always wanted to do my PhD on IT development processes. To say there is room for improvement is a colossal understatement. The piece that went bad on this one should never have made it through testing. Too many assumptions were built into the acceptance criteria. Given the number of people who stayed up last night, spent the day in the war room, and will re-deploy tomorrow, this was probably around 50 grand spent for lack of clear specification of the problem.

Anyway, I skipped class and have had no time to do any school work. Fortunately, I'm currently caught up. Or, I was until recently.

Wednesday, February 10, 2016

All nighter

I worked through the night a few times in undergrad. I don't recall ever doing it while at Cornell, but I might have. I do it at work all the time. Tonight is one such time. We're installing a new version of our tools and have to convert the data. Most of it is automated work, but there's enough hand-holding that it ends up being all night with just a few naps.

I suppose there are many who would think that such efforts are a bit nutty. Maybe so. It's always been a reality, so I don't really give it too much thought. I would certainly not want it to be a regular thing, but I don't mind it every now and then. That said, with classes ramping up, this one hurts a bit. I'll probably take tomorrow off (from work, I'll still go to class). Overtime is nice, but so is sleep.

Tuesday, February 9, 2016

Added value

So, it just hasn't come up in the past few years. I've written tons of web services, but it's been several years since I've written a web app. I have to write a little internal site at work so, rather than re-invent the wheel, I took a look at the latest iteration of Microsoft's Entity Framework for ASP.net.

I have to say, if I were a 25-year-old web programmer, I might be a bit concerned that they've made it this easy to get something up and running. The functionality would have to be fairly complicated before any real programming skills were called into play. It makes you wonder how long web programming will be a viable profession. As I'm a 52-year-old Data Architect who doesn't really want to blow a week writing a silly little dashboard showing the status of our reporting cubes, I was pretty happy about how effortless it was.

I'm sure the anti-MS crowd has equally compelling tools. I'm not a zealot; if I can get it done without installing 10 new open-source tools on my PC, I'm great with it. Even the stuff I am good at, generating the schema for the underlying performance data in SQL Server, was handled pretty much automatically and the results were about as good as I would have done by hand.

I'm not seriously suggesting that solid technical skills are becoming worthless. Actually, quite the opposite.  So much of the fluffy stuff is now handled for you that the only reason to involve a programmer is to deal with the really hard stuff. If you're not up to that, well, you might want to find another line of work.

Saturday, February 6, 2016

Minimum Description Length Principle (MDLP)

This is a new one for me and I have to say I think it's very elegant. It's very late (I've been working at my "day" job all evening, doing a little reading during lulls) and this was supposed to be an off day so I won't write anything more than an overview right now, but it's worth coming back for.

Basically, the principle is a formalization of Occam's Razor. Occam claimed that if there were competing plausible explanations, the simpler one was the one to go with. That's fine if you're trying to decide whether a wet sidewalk on a sunny day was caused by a sprinkler system or an alien flyover. However, it gets a bit more dicey if you're wondering if the warm temperature on that same day is just random fluctuation or a symptom of global warming. Both are plausible. Both are simple. And it's not at all obvious which is more so.

The MDLP asks, if you were to encode your hypotheses and your data, how much storage would you need? Most concise answer wins.

What this means is that a simple model that doesn't fit the data very well will lose to a more complicated model only if the improvement in fit offsets the complexity of the model. The second model takes up more space, but you don't have to store as much data, because the deviations from the model are smaller, and smaller numbers can be stored in fewer bits.

In the above example, both hypotheses require very little coding. The first takes none (it's random; there's not much more to say). The second requires only indicating some sort of trend function. If you have enough data to actually derive such a trend, that is, you can fit your data to a trend and reduce the variance from the model, then global warming wins. However, with only one data point, you can't do that. You either have to store the one data point on its own and call it random, or build that data point into your model by adding a trend you made up, which just makes the total description longer. So, in the absence of more data, random gets it. I think even Al Gore would concede that one hot day doesn't prove anything.
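
To see the bookkeeping in action, here's a toy version in Python. The encoding is entirely made up (a flat 32 bits per fitted parameter, residuals stored to a fixed precision), so treat it as an illustration of the trade-off, not a proper MDL code.

    import numpy as np

    def data_bits(residuals, precision=0.1):
        """Bits to store the residuals to a fixed precision: smaller deviations, fewer bits."""
        levels = np.maximum(np.abs(residuals) / precision, 1.0)
        return np.sum(np.log2(2 * levels + 1))

    def model_bits(n_params, bits_per_param=32):
        """Crude cost of the model itself: a flat number of bits per fitted parameter."""
        return n_params * bits_per_param

    # made-up daily temperature anomalies
    y = np.array([0.3, -0.1, 0.4, 0.2, 0.6, 0.1, 0.5, 0.7, 0.4, 0.8])
    x = np.arange(len(y))

    # hypothesis 1: random fluctuation around a constant
    dl_constant = model_bits(1) + data_bits(y - y.mean())

    # hypothesis 2: a linear trend
    slope, intercept = np.polyfit(x, y, 1)
    dl_trend = model_bits(2) + data_bits(y - (slope * x + intercept))

    print(f"constant: {dl_constant:.1f} bits, trend: {dl_trend:.1f} bits")

With a long enough series showing a real trend, the extra 32 bits of model buy a bigger saving in residual bits; with only a point or two, the constant model wins, which is exactly the argument above.
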

More, including references and why this is relevant to what I'm doing, to follow.

Friday, February 5, 2016

Another publication opportunity

Looks like we're going to present our "big data" results from work with an accompanying white paper. And, it looks like I'm going to get to write it. Technically not a peer-reviewed pub, but it's still a good thing.

Thursday, February 4, 2016

Research!

Well, what do ya know? Upon seeing the convergence difficulties I was having, my adviser recommended I actually look some stuff up rather than trying to figure it out on my own. Things I'll be reading up on over the next few days:


  • Metropolis-Hastings MCMC - I knew about this one, but I've never read it. Not terribly surprised that it's published in Biometrika. I knew the epidemiology literature was going to be my fertile field. (A bare-bones sketch of the algorithm follows this list.)
  • Gibbs sampling - my Bayesian text has a chapter on this, so I guess I'll be reading it sooner rather than later.
  • Mixing theory - don't have any really good references here. The basic concept is straightforward (assuming you already know what a Markov process is), but I suspect there's more to it than the Wikipedia article lets on.
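
Since I haven't read the original papers yet, here's just the bare-bones random-walk version of Metropolis-Hastings as I understand it, in Python. With a symmetric proposal the Hastings correction cancels, so this is really plain Metropolis; the Cauchy target is only there to show the kind of heavy-tailed, slow-mixing behavior I'm worried about.

    import numpy as np

    def metropolis(log_target, x0, n_steps=10_000, step_sd=1.0, seed=0):
        """Random-walk Metropolis: propose x' ~ N(x, step_sd^2) and accept with
        probability min(1, target(x') / target(x))."""
        rng = np.random.default_rng(seed)
        samples = np.empty(n_steps)
        x, logp = x0, log_target(x0)
        for i in range(n_steps):
            proposal = x + rng.normal(scale=step_sd)
            logp_prop = log_target(proposal)
            if np.log(rng.uniform()) < logp_prop - logp:
                x, logp = proposal, logp_prop  # accept; otherwise stay put
            samples[i] = x
        return samples

    # e.g. sample a standard Cauchy (heavy tails) and watch how slowly it mixes
    draws = metropolis(lambda x: -np.log(1 + x**2), x0=0.0)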


Also have a few ideas of my own to try:

  • Use the seriatim nature of the data to enhance the adjacent time-series cells in the estimate.
  • Use actuals as a prior and compute a posterior based on the sample.
  • Combine the "block of rows at a time" concept with Metropolis-Hastings to come up with a rule for when to stay in a block versus jumping to a new block of rows.
That should keep me plenty busy until we meet again in a week.


Wednesday, February 3, 2016

Divergent

So, for today's epic failure, I pulled a subset of data from work and tried building a cube from samples using a few different strategies. Most of the cube cells were aggregates of over 10,000 rows, so I was expecting them to converge pretty quickly. Nope. I have no idea what to make of this data. Missing even a single row can throw it way off. That is, of course, what the actuaries have been trying to tell me and why they don't really like the sampling approach.

Well, if it was an easy problem, they wouldn't give me a PhD for solving it. I'm meeting with my adviser tomorrow. Maybe he's got some ideas on it.

Tuesday, February 2, 2016

Bayesian HW1

Google continues to make a mess of Word documents so I'm just posting a link.

HW1

Even with that, it dropped the stuff I did in the equation editor. I guess I'll switch back to LaTeX for assignments that require that.

Monday, February 1, 2016

Principal Component Analysis

Well, I said I'd give it a shot a week ago, so here it is. I'm going to take a stab at explaining Principal Component Analysis in my own words. The concept is not brand new to me, but the method suggested in Aggarwal is more elegant than decomposing sums of squares, which is how I've seen it presented before.

For starters, we have a data matrix D with n rows (observations) and d columns (attributes). For the moment, we're assuming real-valued attributes, though there are ways around that assumption. As our goal is to transform this data into a lower-dimensional linear subspace, we start by mean centering all the attributes, that is, we subtract μ, the d-dimensional mean vector, from each row. This moves the data from an affine subspace to a linear one, so we can create a basis for the rows. That sets up the next trick: projecting that basis onto an orthonormal basis of lesser dimension while preserving as much variability in the data as possible.

That last statement is often a sticking point for folks. Don't we really want to reduce variability? No. We want to reduce residual variability. That is, we want to have as little variance between the data and the model as possible. The total variability is what it is, so to reduce residual variability, we have to maximize the variability accounted for by the model itself. Thus we are looking for a basis that spreads the data out as much as possible, while reducing the number of dimensions.

OK, back to the technique. Let C be the covariance matrix for D. That is, cᵢⱼ is the covariance between the ith and jth dimensions. Plugging in the formula for the covariance and coming up with C = DᵀD/n − μᵀμ will be left as the proverbial exercise for the reader. Now, here's the cool part. C is symmetric and positive semidefinite. Therefore, we can diagonalize it into C = PΛPᵀ.

What have we done here? We've busted the covariance into two parts: an orthonormal set of directions (the columns of P, the eigenvectors) and the variance along each of those directions (the diagonal elements of Λ, which are also the eigenvalues). Now re-rank the eigenvectors by the magnitude of the eigenvalues and pick the top however many you want as a basis. Blam! You're done. You've now picked the optimal k < d vectors to span your space, minimizing the residual variance (which is just the sum of the remaining eigenvalues).
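
Here's the recipe as a few lines of numpy, run on made-up data just to show the mechanics; the dimensions and variable names are arbitrary.

    import numpy as np

    def pca(D, k):
        """PCA via the covariance matrix: mean-center, build C, diagonalize,
        keep the top-k eigenvectors."""
        n = D.shape[0]
        X = D - D.mean(axis=0)                # mean-center each attribute
        C = X.T @ X / n                       # same as DᵀD/n − μᵀμ
        eigvals, eigvecs = np.linalg.eigh(C)  # eigh because C is symmetric
        order = np.argsort(eigvals)[::-1]     # re-rank by eigenvalue, largest first
        P_k = eigvecs[:, order[:k]]           # top-k orthonormal basis
        residual_var = eigvals[order[k:]].sum()  # variance left on the table
        return X @ P_k, P_k, residual_var

    # toy run: 500 observations, 6 attributes, projected down to 2 dimensions
    rng = np.random.default_rng(0)
    D = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
    scores, basis, leftover = pca(D, k=2)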

I'm going to have to try this out on some of our data at work. What I'd really like to see is if the reduced basis is pretty constant across all the non-numeric attributes, or if very different spaces emerge for different slices of business. My guess is the latter, which would imply that rolling a bunch of the non-numeric data into the analysis could further improve the model.

Where I'm trying to go with all this is to build a model for a reasonably solid prior using actuals that can then be used to improve the convergence of the posterior distribution when the projection data is sampled.