Monday, May 30, 2016

Memorial Day

As the son of immigrants who were not in the military, I don't have any direct connection to anyone who lost their life while fighting for the United States. Frankly, I don't have any real connection with fallen soldiers from anywhere else, either. I'm a typical mix of Irish, Scots, and English. They have been hacking at each other for centuries, so I'm sure a few of my ancestors died in battle, but I know nothing of them. My grandfather had a brother killed in WWI; that's the most recent casualty that comes to mind.

That's why I was glad we spent a good chunk of today at the Elks. Kate is a member of BPOE #9, here in St. Louis. Aside from being a good time (the Elks make no apologies for mixing socializing with their mission), it's nice to be involved with an organization that supports veterans and their families. A number of the folks there are veterans and several have children on active duty. I made a point of expressing my appreciation. I'm very much against war, but I surely do understand that, if it comes to that, you want the folks doing the fighting to know you're behind them.




Sunday, May 29, 2016

Back at it

If you've surmised from the missed posts in this blog and entries in the study log that I'm taking a breather, you're mostly right. I say mostly because we're back to working overtime at work (which is normal for this time of year). Still, it's time to get back on it. Tomorrow will be the last easy day. After that, I'm planning on the following schedule to prepare for the Q:

Monday: 1 hour working on the CISS Paper and 1 hour on Algebra
Tuesday: 1 hour working on the CISS Paper and 1 hour on Calculus
Wednesday: 1 hour working on the CISS Paper and 1 hour on Real Analysis
Thursday: 2 hours on Statistics
Friday: 1 hour working on the CISS Paper and 1 hour on Statistics
Saturday: 1 hour addressing whatever didn't go well in the preceding week.
Sunday: Off

I'm going to be on vacation from June 13-22, which is actually a good opportunity to get in additional study. Kate and Yaya don't exactly roll out of bed at the crack of dawn.

Thursday, May 26, 2016

Traffic cop

Well, here's one way to kill traffic: act like a cop. Ever since I called out the Russians for potentially stealing my ideas, my hits on this blog have gone in the tank. And, almost none from Russia in the last few days. I think that's pretty damn funny.

Wednesday, May 25, 2016

Big Data Geography

I met with Walt Maguire, Big Data Chief Field Technologist for HP Enterprise (that's a truly great job title) today. Sharp guy, as you would expect. Lots of good technical insights, as you would also expect. I could write about that, but most of it is already on HP's website in the form of various white papers. One quasi-technical observation that I thought was both funny and instructive concerned the geographical differences in the approach to technology adoption. This one probably isn't on HP's website, though I'm sure he doesn't mind me sharing it.

On the West Coast, particularly Silicon Valley, companies just want the ideas and the enabling technology. They'll write the actual implementations themselves. In a few cases, this makes sense. Facebook and Google really are breaking new ground in terms of data volume, so no off-the-shelf solution is going to work for them. But, he sees it more as a pervasive culture thing than necessity. They just think that they'll do a better job. And, since they do pay the highest IT salaries, they're often right.

On the East Coast, what they want is certification and compliance. They aren't so concerned about whether it's the latest and greatest. They want to know that it will work and that it won't get them in trouble with regulators.

Midwestern companies tend to be later adopters, looking for both proven technologies and the big price breaks that come from being a version behind. In short, they're looking for how to get the job done without putting a lot of capital at risk.

He didn't mention the South. Having several industry friends in Texas and having just worked with a consulting group from Atlanta on another project, I could speculate that they're biased towards open source platforms since they are also trying to control costs, but are less risk averse because they generally have pretty strong technical talent available to fill in the gaps.

He wasn't advocating one philosophy over any of the others; just noting that there are many ways to get this stuff done and the best solution for one organization isn't necessarily the best for another. Of course, anybody who knows anything about business already knows that, but putting it in geographical terms was an interesting way to frame it.

Tuesday, May 24, 2016

What's impossible is what's required

We get fruit delivered at work twice a week. This morning, I took an apple and noted that it wasn't a particularly good apple. Not terrible, just not great. I'm not sure how long apples have been around. Certainly, they've been cultivated for several thousand years. One would guess they existed wild for a lot longer than that. The presence of fresh apples in the northern hemisphere in May is an extremely recent development. So, I didn't bother complaining about the less-than-great apple to those who supply the fruit.

Technology users (myself included) are generally less understanding. New features and higher levels of performance are not only expected, they are required. Just because it's impossible doesn't take it off the table because impossible is a temporary state.

Two years ago we put together an analytic cube for a group at work that holds 50 billion fact rows tied to about 300 dimension attributes. When we first turned it on, queries were coming back in around 20 minutes. That seemed like a pretty big improvement over the 2-3 weeks they had been spending to get the same information out of the data warehouse, but the users wanted performance comparable to other cubes they used. The fact that this one was 20 times larger didn't impress them.

Through some very aggressive tuning, we were able to get most queries coming back in just a few minutes. This month, we gave a demo of their data on a true scalable platform (Hadoop/Impala/atScale) and showed that we can, in fact, return their results in seconds; there's just the little matter of paying for the hardware. And, if they really decide money is no object, we also showed how adding Vertica into the mix gave them another order of magnitude in speed.

It will be interesting to see how this unfolds. I'm pretty sure they'll go for the top-of-the-line solution. There's already talk of initiatives with trillion-row data sets. And, when those come about, they'll want those answers just as fast.

Monday, May 23, 2016

Plagiarism

So, I was reading an article on Slate today about dissertation plagiarism in Russia. (Don't read too much into the liberal source; like most techie types, I'm a staunch Libertarian. I just find it more interesting to read the opinions of those who disagree with me.) Anyway, apparently there are A LOT of fake PhD's in Russia.

I wouldn't say I'm particularly worried about somebody ripping off my research. First off, Math/Stats/Data Science aren't the fields one generally goes into if looking for a quick win on the PhD front. Even if you do manage to dupe a committee into granting you a degree, you won't last too long in the actual field if you don't have at least something to offer. I'll grant, the bar is lower than one might hope, but you can't get by on smoke and mirrors forever. At the end of the day, your clients are going to want an actual number.

Secondly, I haven't really put much out on this blog that couldn't have been done by any competent programmer with a math background. The reason it's a fertile field is that most people have been attacking the problem with hardware rather than playing the long game where the quantity of data will outstrip anything you can build (Or will it? Time will tell).

That said, I did notice that nearly half the hits on this blog are from Russia. Does make one wonder. I guess as long as I'm just posting intermediate results and not actual thesis text, the risk is pretty low. No point in plagiarizing a rough draft when there are thousands of finished copies to rip off. Still, I suppose a little caution is in order.

Sunday, May 22, 2016

2014 Berryman 50 mile

The Berryman races were this weekend. I have worked aid station #1 at this race almost every year (I didn't this year because my sister-in-law was in town). In 2014, I didn't because I was going after the EMUS series.

Run May 17, 2014

As I'm running the EMUS (Eastern Missouri Ultra Series) this year, I actually entered the Berryman rather than manning my usual post at aid station #1. I don't completely abandon my contribution; I bring Yaya along to work the start/finish aid station while I run. We camp out the night before just a hundred feet from the line.

As usual, when camping right at the start, the main pre-race challenge is figuring out what to do; even with the 6:30AM gun, I've got well over an hour to kill. After a breakfast of coffee and oatmeal, I mill about catching up with a few friends I haven't seen recently. One such individual is Paul Schoenlaub, who insists that he is not going to run fast. He's the current course record holder in my age group, so I'll believe that when I see it.

There's less than 30 seconds of running from the start line to the trailhead, but the marathoners start 90 minutes after the 50-milers, so there's no significant congestion on the trail. Paul slots in behind me and we chat for a few miles before he makes good on his promise and drops back. I get to aid station #1, where Paul's wife greets me with the expected barb about taking a holiday rather than helping her hand out water. She's surprised I'm not even in the top 10. I remind her of our many conversations about how different people look on lap 2 of this course. It's very comfortable now, but in the dozen or so times I've worked this race, conditions have always punished those who go out too fast.

About halfway through the lap, I decide it's time to up the effort a bit. The increased pace feels good. I pass several runners and cruise into the start/finish at 10:44AM (4:14 lap time). Another lap like that and I'll be taking the age group record from Paul.

Yaya is working the aid station and quickly fills my bottles while I stuff some gu's into my pockets. The leg to aid station #1 (about 5 miles) is the longest distance between stations and I don't want to get depleted now. The stop is quick and I'm back on the singletrack in under a minute.

Everything still feels like it's working well, but the watch tells a different story. I had arrived at #1 in just under 45 minutes on lap 1, and that was taking it easy. Now the pace is feeling forced and it still takes 48 minutes. It's time to get tough as the heat is filling in and my body is obviously in worse shape than I realized. I stay on the gas as best I can for the next leg which is quite short. The seven miles that follow are the fastest running on the course as the trail does a lot more contouring rather than bouncing straight over the steep ridges. I'm able to hold a decent pace but, as I hit the big climb out of Brazil Creek at mile 39, it's obvious that I'm not going to be able to keep this up.

I take a few short walk breaks, then they start getting longer. It's not that hot, probably low 80s, but it's very humid and the hills are relentless. Coming into the last aid station with 3 miles to go, I'm caught by Andy Emerson. He's in his 40s, so it's not a place I need to stress over, but I still hate getting passed in the late going. There's not much I can do about it, though. He's still running well and I'm not.

I stagger into the finish at 3:24PM (total time 9:54) for 8th overall, winning 50+. The second lap was a full minute per mile slower than the first. Not my proudest day of pacing, to be sure. Still, I had the 7th fastest time on both laps, so I didn't fade any worse than most others. And, in the context of EMUS, it's pretty much a perfect day: maximum points for both distance (you get a point for every mile covered) and age group placing.

So, it's all good, but I have to concede that this race is much harder than it appears when you're watching it from the sidelines. The hills are just short enough that you feel compelled to run them all on lap 1 when it's cool. As I found out, you pay dearly for that strategy when things heat up on lap 2. While I never thought that breaking Paul's mark of 8:36 would be easy, I have newfound respect for just what a stellar time that is.

Friday, May 20, 2016

Life

... has usurped my plans. All of this could have been predicted; I was just way too optimistic about how much I could get done in light of Yaya finishing her school year and Kate's sister and family coming to visit for the weekend. We'll give this heads-down, let's-get-a-paper-written thing another try next week.

Thursday, May 19, 2016

Quick thoughts on Bayesian Stats

Full disclosure: my adviser reads this blog and was also the prof for the course. I don't think that's biasing my appraisal of the course, but it probably is, at least at some sub-conscious level.

This was really two courses. One was an upper undergrad course in applied Bayesian methods. The other was some directed readings and research so I could get credit for the class as a true graduate course. Since the thesis research part of it is unique to my case, I'll only discuss the part of the course that everybody took.

I wasn't sure coming in how applied the focus was going to be. It turned out to be pretty applied. Given the class composition, that was probably the way to go. I got the impression that most of the students were not math majors interested in the underlying theory. Instead, they were from other disciplines where they might have to actually use this stuff.

That would normally disappoint me, but the underlying theory of Bayesian stats really isn't that deep. It's the frequentist stuff that's all contorted to make up for the fact that so many problems have intractable Bayesian solutions.

Or, used to.

Now that almost any level of complexity in the model can be simulated, there really isn't much downside to just plowing ahead and letting the software packages do the work. Spending a semester learning how to set up and run arbitrarily complex models was time well spent. It was interesting to see that the applied part of the field was moving so fast that in the space of the semester, a new GLM (generalized linear model) package was released for R that pretty much obviated the last few chapters of the text.

The reliance on programming did make tests somewhat problematic. We got around that by simply not having any. As regular readers of this blog already know, I'm fine with that. I produce much better responses when I have time to think a problem through. My only complaint on that front was that we weren't given more to do. Not that I had tons of time on my hands this semester, as I was really hammering on the research stuff, but the regular class assignments seemed a bit light.

Overall, I'm pretty happy with it. I really didn't realize how much computer modeling had revolutionized things. Thirty years ago, the Bayesian crowd was pretty marginalized because they had to make so many damaging assumptions to get their posterior distributions to converge. Freed from those limitations, the power of the method is obvious, and it's considerable.


Wednesday, May 18, 2016

Bayesian Stats Final Assignment

is here.

FWIW, this was good enough to seal an A for the course. Transcripts have been updated, so I'm officially a 4.0 student for another 7 months.

Tuesday, May 17, 2016

Variance 2.0

The derivation of the variance with the second parameter doesn't change too much, though the result is a bit messier. First we recall that we had carried a couple of constants p and q through the integration. Well, q is still a constant with respect to θ, but not ν. So, we re-write it as:



This yields:



This is going to get messy, so let's jettison the constant terms and focus on just the part dependent on ν:



It's not quite that bad. The middle term is just μE(ν) and the bottom term is just μ^2. (Both those facts could be derived simply by looking closely at q(ν) rather than grinding out the integration.) The first term is the one of interest and it's the one that's going to drive down our variance estimate. But, it will do it in a controlled manner as the distribution pushes more mass towards the maximum observed block sum. Further, we can throttle it by tuning the value of c.

The rest of the variance is just symbol manipulation which I won't bother reproducing. Here's the final result (where the first term in the above result is renamed ν*):



Yes, I'm burying some computation in that formula, but not complexity. The terms all make intuitive sense; some of them just require a little number crunching. And, I do mean just a little. Twenty floating-point operations seem like a lot until you compare them to reading several million bytes off a disk. Getting this variance right counts for a lot.
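For reference, the ν pieces above all come from moments of the power-law posterior in yesterday's post. In my own shorthand (with M = max{x_i}; the p, q, and μ bookkeeping is not reproduced here, and the degenerate exponents that turn the power integrals into logs are glossed over), the k-th posterior moment is:

E(ν^k | X) = [(c - m + 1)/(c - m + k + 1)] · [nbk^(c-m+k+1) - M^(c-m+k+1)] / [nbk^(c-m+1) - M^(c-m+1)]

Setting k = 1 gives the E(ν) in the middle term; the higher moments fall out the same way.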

Monday, May 16, 2016

"nu" prior

Sorry, that's not even a particularly good pun, but I couldn't resist. I'm going to use nu (ν) rather than "U" as the upper bound of the distribution of block sums. It just seems more consistent to use a Greek letter for a distribution parameter.

As mentioned on Friday, the standard non-informative priors don't work well in this case. While we don't actually have any information, we want to assert the idea that ν is to be treated as being high until the data proves otherwise. The simplest prior that accomplishes that is p(ν) = ν. That's an improper prior, but we'll get to normalizing constants in a moment; it doesn't matter for now.

Given a series of block sums X = (x_1, ..., x_m), P(X|ν) = Π P(x_i|ν) = 1/ν^m for max{x_i} ≤ ν. Thus, the un-normalized posterior is g(ν|X) = ν · ν^(-m) = ν^(1-m) for ν ∈ (max{x_i}, nbk). Integrating this to get the normalizing constant gives:

∫ ν^(1-m) dν over (max{x_i}, nbk) = [nbk^(2-m) - max{x_i}^(2-m)] / (2 - m)   (for m ≠ 2, where the integral picks up a log instead)

so

g(ν|X) = (2 - m) ν^(1-m) / [nbk^(2-m) - max{x_i}^(2-m)],   ν ∈ (max{x_i}, nbk)

That looks messier than it is. Plug in m = 1 and you see that the prior goes from being linearly increasing to a flat posterior running from the block sum to the maximum possible block sum. That seems reasonable. If we've only sampled one block, all we really know is that the distribution of block sums goes at least as high as what we just saw. Of course, if we sample lots of block sums and the distribution really is uniform, then the posterior on ν should converge to the maximum observed value as the number of observations gets large. Let's check on that:

E(ν|X) = [(2 - m)/(3 - m)] · [(nbk/max{x_i})^(3-m) - 1] · max{x_i} / [(nbk/max{x_i})^(2-m) - 1]

The ratio on the left clearly goes to 1. The first term in the parens goes to zero because nbk > max{x_i} and its exponent goes to -infinity as m grows. The denominator of the rightmost term goes to -1, so the entire thing converges to max{x_i}. Yay for that.

Here's the rub: suppose the first couple of observations are particularly low. That's not unusual; with at least 16 strata, we'd expect at least one to have the first two in the bottom quartile. With two observations, the posterior is already biasing towards max{x_i}, but that's going to chop off a lot of our distribution (and variance) and cheat this stratum. So, we need slower convergence.

Suppose we were to change our prior to p(ν) = ν^c where c is some real number > 1. g(ν|X) becomes ν^(c-m) and the c just propagates through everything (simply replace 1-m with c-m). Now, you can dial c up as high as needed to keep the posterior from collapsing too quickly, but still get the same asymptotic convergence to the proper mean. As c really is arbitrary, I'll need to run a bunch of tests and derive a heuristic for picking a good value, but I'm pretty optimistic that this is going to work.
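To sanity-check that behavior, here's a quick R sketch (not the CISS implementation, just a toy) that evaluates the closed-form posterior mean of ν under the prior p(ν) = ν^c. The nbk value and the block sums are invented for illustration.

```r
# Posterior mean of nu when the prior is nu^cpow and the likelihood is 1/nu^m,
# truncated to (max(x), nbk). Both integrals are simple power functions.
# (The degenerate cases cpow = m - 1 and cpow = m - 2, which need logs, are skipped.)
post_mean_nu <- function(x, nbk, cpow) {
  m <- length(x)
  M <- max(x)
  a <- cpow - m + 1
  num <- (nbk^(a + 1) - M^(a + 1)) / (a + 1)  # integral of nu * posterior kernel
  den <- (nbk^a - M^a) / a                    # normalizing constant
  num / den
}

set.seed(1)
nbk <- 1000                    # hypothetical maximum possible block sum
x   <- runif(5, 0, 600)        # five low-ish observed block sums
sapply(c(1, 5, 20), function(k) post_mean_nu(x, nbk, k))
# Larger c holds the posterior mean up near nbk; with c fixed, piling on more
# observations still pulls the mean down toward max(x), as derived above.
```

The heuristic for c will presumably come out of running exactly this kind of comparison against real block-sum distributions.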

Sunday, May 15, 2016

2005 SLOC Laumeier

With no particular reason to favor one report over another this week, I just went with confluence of dates. This week's off-day throwback race report is from exactly 11 years ago.

Run May 15, 2005.

The SLOC picnic has been a bit of a wash the last few years. Two years ago, it was literally washed out by an absurdly strong thunderstorm. Last year, the publicity went out too late and not many people brought food. This year, everything went well and we had a really nice picnic on a beautiful day in a tiny, but fun park.

Laumeier is a sculpture park. The park has a dense trail network connecting several dozen sculptures. Most of the sculptures are quite large and many are huge. The combination of the unusual features and contrasting vegetation (half fields, half thick woods) makes for a pretty unique orienteering experience.

The course was a 22-control Score-O set by Mark Geldemier. Although the only overall route decision was which direction to run the loop and which order to take 16 and 18, Mark did an excellent job of providing many route choice legs. Such legs are particularly hard to evaluate when you're punching a control every minute.



I won with a time of 22:19, with David Frei coming in second at 23:18. Jeff Sona got all but one control [probably coming in fourth, as my map notes have Rick Armstrong listed as third, but the original report just listed the Carol's Team participants]. Yvonne Deyo ran with her husband. After eating more chips and cookies than we probably should have, we all went over to the Meramec River and paddled for a couple hours.

Saturday, May 14, 2016

Quick thoughts on Data Mining

Following my format from last semester, here's some quick hits on Data Mining with a more thorough recap to follow.

I went into the class not sure how much I would learn. After all, it's not that far off from what I do every day at work. Turns out I learned a lot. Probably more than any other course I've taken since Cornell. So, why am I not happy?

Honestly, I don't know. Maybe I'm just being a jerk. I won't dismiss that, but let's shelve it for the moment.

The lectures were super dumbed-down. Basic probability and linear algebra were prerequisites. While I certainly get that a prof doesn't want to dust half the class, I don't think there's anything wrong with telling the students that if they don't remember a certain fact from a prereq course, they can just dig out their old text and, if they still don't get it, bring it to office hours. The students who do remember the stuff (or, as in my case, the students who went to considerable effort over winter break to dust that stuff off) would rather the class time be spent on the actual subject matter at hand.

We got almost no feedback on our assignments and tests until after the final. That's totally bogus. Even after the final, all we got were numeric scores. That's not particularly useful even before the final. It's completely useless after.

As with the other two CS courses I've taken at UMSL, the focus was way too applied for graduate level work. I'm beginning to wonder how anybody writes a credible dissertation in this department. It's certainly not inspired by coursework.

OK, that's all true, but the fact remains that I really did learn a lot (albeit mainly from the text and assigned papers). And, while I don't think that getting an A in this course constitutes a particularly strong academic achievement, I certainly suffered no injustice in the grading. So, I should probably stop being pissed about it.

I will. I'm pretty good at just moving on. But, while tuition at UMSL is ridiculously cheap for in-state students such as myself, these courses do represent a significant investment of time and there's nothing cheap about that. I think they could do better and I think they should.

That said, I'm basically done with traditional coursework at this point. Pretty much everything going forward will be directed readings or dissertation research. So, it's certainly not worth getting worked up over.

And, yes, maybe I'm just being a jerk.

Friday, May 13, 2016

Uninformed

I had intended to use a non-informative prior on the upper bound of the Uniform distribution that the block sum is drawn from. A few problems immediately present themselves.

The best non-informative prior for Uniform is the Jeffreys prior (or reference prior; they're the same in the univariate case) p(U) = 1/U, where U is the upper bound of the distribution.

The first obvious issue is that this is an improper prior. Not only does it not integrate to 1, it doesn't integrate to anything. That means that, until I have some data, I can't estimate a mean, which means I can't estimate a variance, which means I can't determine whether I should be sampling this stratum.

That's not terribly difficult to work around. Just set U = nbk until we have some data, or cap it at nbk (since it can't possibly be larger than that), which makes the integral finite.

The bigger problem is what happens after the data arrives. Given a block sum of X, the posterior is p(U|X) = X/U^2, U > X. That's a perfectly good density function, but it has rather atrocious consequences. Namely, if that first block sum is small, it's going to drive the estimate for all remaining block sums way down and crush the estimate of the variance in the process. As such, we won't return to the stratum to sample more blocks and find that the sums are generally much higher.

So, while there is no way to know what the distribution of U is when starting a query, the non-informative approach is going to kill the algorithm. Therefore, I have to inject a fake belief that the sums are higher and bake that into the prior. This is essentially setting the prior consistent with an "assume the worst" attitude.

I think that's defensible in principle but leaves me without any mathematical precedent on which to pick a prior. So, I guess I'll just run a bunch of empirical tests and try to find some sort of consistent shape. Or, at the very least, some starting point that results in posteriors that can represent the family of shapes observed.
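To put a number on how badly the non-informative approach behaves, here's a small R sketch of the posterior mean of U using the capped version of the Jeffreys prior described above (so the posterior kernel is 1/U^2 on (X, nbk)). The value of nbk and the block sums are made up.

```r
# Posterior mean of U under the Jeffreys prior 1/U, capped at nbk,
# after observing a single block sum X. The posterior kernel is 1/U^2 on (X, nbk).
post_mean_U <- function(X, nbk) {
  Z <- 1 / X - 1 / nbk   # normalizing constant: integral of U^-2 over (X, nbk)
  log(nbk / X) / Z       # integral of U * U^-2 over (X, nbk), divided by Z
}

nbk <- 1000              # hypothetical maximum possible block sum
post_mean_U(10, nbk)     # ~47: one unlucky small block sum crushes the estimate
post_mean_U(600, nbk)    # ~766: a big first block sum behaves much more sensibly
```

Once the estimate of U (and with it the variance) has been driven that low by a single block, the sampler has no reason to come back to the stratum, which is exactly the failure mode described above.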

Thursday, May 12, 2016

All's well that ends well

Not sure if it was generous grading or a big curve (probably both), but I wound up with an A on the Data Mining final. Grades for all the other work in the class were also published. As I expected, they were also A's, so the 4.0 stays intact for now. Seems like if all that stuff could be graded in the past two days, it could have been graded last week as well, when it would have served a purpose. Ah, well, that's all I'll say on that.

I've also turned in my final project for Bayesian Stats even though it's not due until tomorrow. I'm going to spend the rest of this week catching up at work and then really go heads down on getting a decent draft of my CISS paper done. There's actually a bit more research I want to conduct on that front. Hopefully it will only take a few days. Basically, what I've found from messing around with it is that it works great until the percentage of rows hit within each block gets really low. Then the prior of Uniform(0, max possible) on the block total starts to mess up the estimate of the variance. So, I need to change that to a hierarchical model where the block sum is distributed U(0,X) and X is distributed via something else.

Not sure what that something else should be. Obviously, I don't want to pull it away from max possible too quickly or it will mess up the convergence of the bigger queries which are currently behaving quite nicely. And, of course, since we can't be running MCMC chains during query processing to get posteriors, it needs to be something with a tractable conjugate distribution.

Wednesday, May 11, 2016

Train wreck!

That didn't go particularly well. As I have no feedback from other assignments, I'm left to guess whether my average in Data Mining coming into the final was good enough to survive that mess. I'm pretty sure it is, but it's still really a bummer. I spent nearly half the time on a convoluted conditional probability problem that had absolutely nothing to do with data mining. Overall, it was a good class, but these last few weeks have been exceedingly depressing. Just when we got to the interesting stuff, the course nose-dived into rudimentary calculation (which is definitely NOT what I'm any good at, nor do I care to be; that's what computers are for).

I'll wait a couple days before doing the course eval because I know I wouldn't give it a fair shake right now.

Tuesday, May 10, 2016

Last call

Later today, I'll be turning in my final project for Bayesian Stats and sitting for my exam in Data Mining. After that, I'm going out drinking with my wine club, so this is all the post you're going to get today.

Monday, May 9, 2016

Feedback loop

So, tomorrow is the final for Data Mining. I'm not terribly worried about it, but if I was even a little worried about it, I'd be a lot worried about it. Simply put, we've had no substantive feedback.

We've taken two exams, submitted three assignments, and presented a paper. We've gotten back only the first exam and the first assignment. That is outrageous. The whole point of grading is to give feedback so students can adjust. There simply has been no opportunity to do that in this course.

Unfortunately, this appears to be a social norm at UMSL. I generally don't take pot shots at a school to which I'm attaching myself, but this really needs to be called out. I've now taken four classes at UMSL and the average time from when an assignment or test is collected to when feedback is given is around 3-4 weeks. That's pretty useless. By the time you're four weeks behind, you're dead.

I'm not really sure how to lodge the complaint. The course evaluation is the obvious place but, again, this appears to be a problem of culture, not just one or two profs being delinquent. I never put it to the test, but I'm quite sure that if I had sat on grading for four weeks at Mount Union, I would have had a personal and not particularly pleasant conversation with the dean. Of course, Mount Union also charges ten times as much for tuition, so the students have a legitimate gripe if they're getting anything less than stellar service.

However, just because the courses are state subsidized doesn't mean that grading isn't important. I think this point is particularly salient in a Data Mining class where all the machine learning algorithms we're studying are predicated on fast and accurate feedback.

Grading sucks. It's by far the least fun part of teaching. It's also vitally important. Every job has things that you simply have to do whether you like it or not. This is one of them.

Sunday, May 8, 2016

Modest schedule change

When I decided to practice the sabbath, I put it on the traditional sabbath (Saturday) because it seemed like I might be stressed if I had to take Sunday off and had something due on Monday. As I've been in school for a year now and I've never had anything due on a Monday, I'm going to move it to line up with the more normal Christian observance of sundown Saturday to sundown Sunday.

This is a bigger deal than it may at first seem. Not which day it falls on; that's completely arbitrary. But the sabbath itself is no small thing to me. Even if you don't buy any of the traditional Judeo-Christian canon, there's plenty of current research indicating that taking a day off a week is a really, really good idea.

As with all disciplines, one needs to use some judgement. There are going to be cases where observing the discipline is worse in every way than breaking it. However, the point of discipline is rooted in the word itself. It comes from the Latin disciplina, which is a noun that broadly encompasses the idea that in order to learn something, you have to put your mind to it. So, it's not quite a law, but it's definitely more than a suggestion.

I've found that taking a day off each week this year has been a great way to make sure that other aspects of my life, most notably the spiritual side, are not obliterated by the single-minded pursuit of a worldly goal, laudable though it may be.

Saturday, May 7, 2016

Overkill

Didn't even get a post in yesterday and today I'm up until midnight again. Some of this is in response to some annoying run-time errors we're seeing on our Oracle platform at work, but mostly it's spending a lot of time messing around with the final project for Bayesian Stats.

I finished it this evening. I'll proof it on Monday before turning it in on Tuesday, but the work is done. I won't post it until the due date (next Friday), but when I do, there won't be much to explain why I spent so long on it.

In short, while I'm not trying to make this harder than it needs to be, I am pretty interested in actually learning this stuff. So, I don't always (ever?) go for the quickest path to an answer. I spend time thinking about ways to frame the problem. I try different solutions. I do a lot of editing.

From a purely cost/benefit standpoint, this approach doesn't make a whole lot of sense. I could get A's with a lot less effort. But I really don't care that much about my GPA. A meaningful dissertation is as much about the journey as it is the destination. I need to keep myself in the mindset of learning rather than answering.

Obviously, if I was in a real crunch this semester, I'd cut corners where appropriate. As that's not the case, I don't mind a few extra hours running down some side avenues. I've actually learned quite a bit in the last week simply because I bothered to follow up on a few questions that weren't necessary to producing an answer, but presented themselves nonetheless.

Thursday, May 5, 2016

Numerical stability

The calcs we perform at work tend to be additive rather than multiplicative, so I generally don't think much about numerical stability. I got bitten by it fairly hard tonight while implementing my Metropolis algorithm for the final project in Bayesian Stats. The problem is to fit an arbitrary data set to a Poisson predictor model. No problem in theory, but when you compute the joint probability density, it underflows to zero pretty quickly.

Fortunately, there are ways around these problems. Namely, convert to logarithms. Then you're back to adding, which is nice and stable. So, that's what I did and it seems to work OK.
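Here's a tiny R illustration of the underflow, completely separate from the actual assignment; the data and the candidate rate are fabricated just to show the effect.

```r
# The joint Poisson likelihood (a product of densities) underflows with even a
# modest amount of data, while the log-likelihood (a sum) stays well behaved.
set.seed(42)
y      <- rpois(500, lambda = 7)   # fake observations
lambda <- 7.2                      # a candidate parameter value

prod(dpois(y, lambda))             # underflows to exactly 0
sum(dpois(y, lambda, log = TRUE))  # finite, and sums stay numerically stable
```

In the Metropolis acceptance step you then work with differences of log posteriors and only exponentiate at the end (or compare against the log of a uniform draw), so nothing ever gets multiplied back together.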

I'm a little surprised this wasn't mentioned when we were covering the problems in class. I'd expect a math major to figure it out pretty quickly, but many of the students are from other disciplines and are taking the class more for the practical methods than the underlying theory. They may have a tougher time. Well, I don't know if any of them read my blog, but I'm happy to give away that tidbit to any of them that do. The actual implementation in R, I'll keep under wraps until after the assignment is due.

Wednesday, May 4, 2016

Non-parametric priors

We had a colloquium today with a professor from Purdue. It was quite good. Rather than try to summarize it myself, I'll just repost the announcement:

SPEAKER: Prof. Vinayak Rao (Department of Statistics, Purdue University)
TITLE: Non-Parametric Bayes and Random Probability Measures
ABSTRACT: With large and complex datasets becoming commonplace, traditional probabilistic models with a fixed and finite number of parameters can be too inflexible, suffering from issues of underfitting or overfitting. The Bayesian nonparametric approach is an increasingly popular alternative to parametric models, where a model with unbounded complexity is used to represent an infinitely complex reality. Of course, any finite dataset has finite complexity, and the Bayesian approach of maintaining a posterior probability over the latent structure is used to mitigate possible overfitting. In this talk I will review the philosophy of nonparametric Bayes, and some methodology, including its workhorse, the Dirichlet process. I will also cover some work on constructing dependent random probability measures (RPMs) which possess both flexible marginal distributions and rich correlation structure. If time permits, I will show how computation via Markov chain Monte Carlo is straightforward and discuss some applications of these models to clustering and topic modeling.

Afterwards, I talked with him a bit about the work I'm doing. I think I'd like to learn more about the Dirichlet process. As with Metropolis, it's one of those things where it just doesn't seem like it will work. But, it does. I like counter-intuitive stuff like that.

Tuesday, May 3, 2016

Crunch time

Yep. One week to go and I'm definitely feeling it. It doesn't help that we've started in with the overtime at work again (not my idea, I assure you). Anyway, I got a good chunk of my last Stats assignment done tonight and I think I'm in pretty good shape for finals (though, I'll certainly put in some practice to be sure). So, it's a bit less frantic than last semester, but still a crunch.

I'll be putting down the research for a week while I get the rest of this stuff knocked out.

Monday, May 2, 2016

Building a committee

I met with the department head today, who is also a Stats guy. We talked for a while on my research and what I was trying to accomplish this summer. He's interested. As with my primary adviser, he only has limited availability this summer, but he agreed to look over my work as time permits. He also gave me a good starting reference for clustering. So, while what I really wanted was for him to agree to a directed readings course this summer, I didn't come away empty handed. If nothing else, he'll be familiar with my work when I inevitably ask him to be on my dissertation committee next year.

Sunday, May 1, 2016

Roller Coaster Race

Run May 1, 2016.

It's good to give back from time to time; running races don't just happen by themselves. I try to volunteer for several local events each year. While I rarely feel put out by this, the deal offered to "volunteers" for this year's Roller Coaster Race at Six Flags St. Louis erased any chance of claiming altruistic points. In exchange for manning a 5K water stop, I got breakfast and park entry. That's a pretty sweet trade. Yaya's younger than the volunteer age limit, but as she's got more race support experience than many Race Directors, I have no qualms about bringing her along so we can enjoy the rides together. It helps that she looks a lot older than she is. I don't have to lie because nobody asks.

We have to get up mighty early to get to the park by 6:30AM. It's a bit chilly and I make the mistake of wearing my Milwaukee Marathon shirt. As it's long-sleeved, it's perfect for the conditions, but it also tips off that I've got a clue, and I'm immediately promoted from course marshal to aid station captain. There is one station that is hit twice, shortly after the 1- and 2-mile marks. This means there will be some time when we've got runners coming from both directions.

In addition to Yaya, I'm given three other teenage girls as assistants. Fortunately, they are all game and we get the station set up fairly quickly. I'm told the race attendance is a bit over 400. Given the conditions, I decide we should pour 300 cups for the group going out and 200 for those coming back. We'll then continue filling as needed when we see how many people actually take water.

The first few runners aren't terribly interested in taking anything so early. However, by the time the bulk of the field hits, it's clear that we're going to need a lot more than 300 cups. This isn't just joggers being anal about hydration. Standing around, we failed to notice how humid it was (hard to believe we missed this as the entire area was shrouded in fog an hour before the start). The runners really are sweating profusely, even though temps are still in the low 60's.

We quickly scramble assignments with two handing out water, two filling, and one manning the other side which is now seeing the leaders coming back. After about five minutes of this, it also becomes clear that the field size is considerably larger than what we had been told (post-race investigation reveals it to be 650). A quick inventory shows that even if we can keep up with the pouring, we're going to run out of cups.

A quick text to the RD (How did anybody pull off races before text messaging?) prompts reinforcements bearing more cups. With the extra help, we are able to keep up with demand and after another frantic ten minutes or so, we are down to just the stragglers coming back in. We get things cleaned up while they trickle through and pack up the station less than two hours after beginning setup.

Despite the crunch, this was pretty much a cakewalk compared to what goes into running an ultra aid station. After some bagels and cookies, Yaya and I proceed to ride every roller coaster in the park (there are nine of them). More than adequate compensation for a couple hours supporting the sport.