Wednesday, November 30, 2016

Link functions

Sometimes I forget just how new statistics is. Informal measurement and inference predate history. Measure theory, which laid the mathematical foundations for random variables and distributions, came about in response to the push to put the Calculus of the 17th century on rigorous footing. But, actual mathematically-based statistics? That really didn't get going until the work of Fisher. He died the year before I was born. On the timeline of human knowledge, this is really new stuff.

I mention that because I attended a colloquium today on Dimension Reduction by Dr. Wen of MS&T (the engineering school formerly known as Rolla) and she threw out the term "Link Function". It wasn't the first time I'd heard it; I knew it had something to do with tying the mean to the predictors in a Generalized Linear Model (GLM), but I decided that I'd better file it under the heading of "Named Results" and actually learn what it was.

Turns out, there's a very good reason I didn't know what it was: the term didn't get widespread use until after I got my MS. I checked my text from Cornell (the rightly acclaimed Mathematical Statistics by Bickel and Doksum) and they mention the function in the context of the General Linear Model, but it's just given the anonymous designation g. The "generalization" of the linear model into the Generalized Linear Model was published in 1972, but it didn't really catch on until after B&D's publication in 1977, when computing power became cheap enough that anybody could afford to run their data through a package like GLIM. The function is still generally written as g, but it has also picked up the moniker of the Link Function.

So, what is the link function? It's actually rather well named - it provides a "link" from the mean to the linear predictor. So, if μi is the mean of our observed dependent variable, the link function maps it onto the real number line: g(μi) = ηi, the linear predictor. The linear predictor is then modeled as a linear combination of the independent variables; the randomness comes from the response distribution itself (some member of the exponential family), not from an additive normal error term. It's sort of like a transformation, but not quite. With a transformation, you transform the whole response, noise and all. Here, the link function only transforms the mean.
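To make that concrete, here's a minimal numpy sketch of a log link for Poisson-style counts. The covariate, coefficients, and sample size are all made up for illustration; the point is just that the linear predictor can be any real number while the mean it links to stays positive.

import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-2, 2, size=100)])  # intercept + one covariate
beta = np.array([0.5, 1.2])                                        # made-up coefficients

eta = X @ beta        # linear predictor: unbounded, lives on the whole real line
mu = np.exp(eta)      # inverse link: g(mu) = log(mu) = eta, so mu = exp(eta) > 0
y = rng.poisson(mu)   # the randomness is Poisson scatter around mu,
                      # not an additive normal error tacked onto eta

Fitting beta is a separate story (iteratively reweighted least squares); the sketch is only about where the link sits in the model.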

Tuesday, November 29, 2016

Box-Muller random normal generation

This topic might be considered a bit archaic since any decent stats package comes with functions that will generate random numbers from any of the major distributions. That said, somebody has to actually write those routines and knowing how it's done is helpful for knowing the limitations. More to the point in my case: it might come up on the test.

We start with the assumption that we can generate a reasonably random sequence of U(0,1) random variables. This is, of course, a whole field of research in and of itself. However, for now, we'll just assume that we can do that. There are several ways to proceed from there.

If you want uniform random variables, well, you're pretty much done. Just scale and shift it to the interval you want.

When the cdf is closed form and invertible, plugging the uniform random variable (U) into the inverse gets the job done. So, for example, if you want exponential random variables, just invert the exponential cdf, Y = F⁻¹(U) = -λ log(1-U), and you're done.
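As a quick sketch (scale parameter chosen arbitrarily, same parameterization as above):

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0                         # scale (mean) of the exponential
u = rng.uniform(size=100_000)     # the U(0,1) draws we assume we can get
y = -lam * np.log(1.0 - u)        # Y = F^(-1)(U) for F(y) = 1 - exp(-y/lam)

print(y.mean())                   # should land near lam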

Of course a lot of distributions don't have nice, closed-form, invertible cdf's. In that case, you may be able to use a transform of one that does. "May" being the operative word. Some transforms are simpler than others.

Of particular note is the normal distribution. Since it comes up so frequently, one would obviously want to handle this case. However, the normal cdf has no closed form and there's no good transformation from a variable that does. In an odd and fortuitous quirk, there is a transformation from two random variables, one uniform, one exponential (the exponential, as shown above, would also likely come from a uniform). This method goes by both the Box-Muller algorithm (for its creators) and the Polar method (which describes how you actually pull it off).

Without going into the formal proof, here's how it works. Start with the observations that, if Y is exponential with mean 2, then Y ~ χ² with 2 degrees of freedom, and if X ~ N(0,1), then X² ~ χ² with 1 degree of freedom. Therefore, the sum of two squared standard normals is exponential. There's not much one can immediately do with that, but it should at least suggest a relationship. In particular, what if we were to consider the normals as the legs of a right triangle and the exponential the squared hypotenuse? How many ways are there for that to happen? Well, you basically have the whole unit circle times the exponential length. So, let's randomly generate polar coordinates as uniform and exponential and then take the cartesian coordinates as normal.

Crunching the actual proof is not as daunting as it may seem. Once you see that switching to polar coordinates is your answer, it falls out pretty quickly. So, (without proof), if we have two uniform(0,1) random variables U1, U2 and R = sqrt(-2 log U1) and θ = 2πU2, then X1 = R cos θ and X2 = R sin θ are independent N(0,1) random variables.
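Here's that recipe as a minimal sketch (variable names are mine):

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
u1 = rng.uniform(size=n)
u2 = rng.uniform(size=n)

r = np.sqrt(-2.0 * np.log(u1))   # radius: r^2 is exponential with mean 2 (chi-squared, 2 df)
theta = 2.0 * np.pi * u2         # angle: uniform on the circle

x1 = r * np.cos(theta)
x2 = r * np.sin(theta)

# Both margins should look N(0,1) and be uncorrelated.
print(x1.mean(), x1.std(), np.corrcoef(x1, x2)[0, 1])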

Why stress over generating one when you can get two for cheap?

Monday, November 28, 2016

More deltas

Before getting down to business, here's today's weird coincidence. For the CISS algorithm, I was deriving the point at which I should jump from the conservative uniform distribution on blocksums to the more aggressive exponential model. I decided to model the decision as a Null Hypothesis test, with a p-value of 5% (meaning you'd mistakenly switch to the exponential 1 in 20 times that you shouldn't). Due to the quirks of the inverse gamma distribution on the exponential prior, the first point at which this decision can be made is when three blocks have been sampled. So, how far away from the null mean does the observed mean need to be? The pdf of the Bates distribution is a big mess, but it simplifies if you are only looking at the lower tail:

f(x) = (27/2) x²
assuming three iid U(0,1) observations and x < 1/3. (Obviously, that gets scaled to reflect the upper bound being the maximum possible block sum for a stratum, but it's easier to compute the critical value first and then scale it). Thus, the cdf is:

F(x) = (9/2) x³
OK, nothing interesting so far, but here's the weird part: set that equal to 0.05 and solve for a. You get 2/9. Really!

(9/2) a³ = 0.05, which gives a ≈ 0.222 ≈ 2/9
There's no significance to the 2/9, the p-value is arbitrary, and it's an approximation (to 3 decimal places) not a real equality. It just turns out that 1/20 is roughly 2/9 squared. Still, it kinda leaps off the page at you.
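For anyone who wants to see the arithmetic, here's a two-line check against the cdf as written above:

# Solve (9/2) a^3 = 0.05 for the critical value a and compare to 2/9.
a = (2 * 0.05 / 9) ** (1.0 / 3.0)
print(a, 2 / 9)        # about 0.223 vs 0.222
print((2 / 9) ** 2)    # about 0.0494, i.e. roughly 1/20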

OK, enough of that nonsense. Let's continue looking at the delta method. In the first-order case, the problem was that when g'(θ) = 0, there's no way to extrapolate a distribution because the approximating line is flat. The obvious step is to go to the second-order polynomial and hope for some variation there. So,

g(Yn) ≈ g(θ) + g'(θ)(Yn - θ) + [g''(θ)/2](Yn - θ)²
which implies that

g(Yn) - g(θ) ≈ [g''(θ)/2](Yn - θ)²
since g'(θ) is zero. Since the square of a standard normal is chi-squared with 1 degree of freedom, we see that

n[g(Yn) - g(θ)] converges in distribution to σ²[g''(θ)/2] χ²(1)
Of course, one does well to confirm the second derivative exists and is also not zero. Finally, since Taylor series work just fine in multidimensional space, it should be no surprise that there's a multivariate form of this method as well:

Let X1, ..., Xn be a random sample of p-dimensional vectors with E(Xij) = μi and Cov(Xik, Xjk) = σij. For a given function g with continuous first partial derivatives and a specific value of μ = (μ1, ..., μp) for which

τ² = Σi Σj σij [∂g(μ)/∂μi][∂g(μ)/∂μj] > 0, we have sqrt(n)[g(X̄1, ..., X̄p) - g(μ1, ..., μp)] converging in distribution to N(0, τ²).
You can pull the second-order trick on this one, too, if you need to, but that rarely happens since all the partials would have to be zero to make the method fail.
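Here's a small simulation sketch of the second-order, univariate case with g(x) = x² and θ = 0 (σ and the sample sizes are arbitrary picks). Since g''(0)/2 = 1, n times the transformed mean should behave like σ² times a chi-squared with 1 degree of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps, sigma = 200, 50_000, 2.0

xbar = rng.normal(0.0, sigma, size=(reps, n)).mean(axis=1)  # sample means, true theta = 0
stat = n * xbar**2                                          # n * (g(xbar) - g(theta)) with g(x) = x^2

# Compare simulated quantiles to sigma^2 * chi-squared_1 quantiles.
print(np.quantile(stat, [0.5, 0.9, 0.95]))
print(sigma**2 * stats.chi2.ppf([0.5, 0.9, 0.95], df=1))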

Sunday, November 27, 2016

SLOC 3-Hour

The St. Louis Orienteering Club has been running a 3-hour score event the Saturday after Thanksgiving since, well, since long before I ran my first one in 1997. At least the last 40 years. Probably closer to 50. While it's a local meet, it typically attracts some serious out-of-town talent. The first time I won it (1998), I remember feeling like I had just won a national championship.

In recent years, the event has changed a bit; catering to the Adventure Racer side of the club. It's not easy to articulate the difference between AR "Trekking" and true "Orienteering", but anybody with more than superficial experience in both understands the distinction. It's not that one is harder or more worthy than the other, it's more a matter of structure. Orienteering has very specific rules regarding technical correctness. Adventure Racing, you take what comes. If you approach the latter with the mindset of the former, you will NOT enjoy the event. If you take a chill pill first, it's all good.

This year's version starts and finishes at Raging Rivers. It's a lidar-trail map in Illinois, upriver from Alton. By lidar-trail, I'm referring to the genre of maps which grab publicly available lidar elevation data, add trails and other super-obvious features, and call that good enough. While such mapping shortcuts throw the orienteering crowd into apoplectic fits, they are actually quite sufficient for navigation. What they are not good enough for is making good route choices, since vegetation density is not indicated. Fortunately, I've run these nasty, overgrown woods before and know that ANY trail route is better than ANY woods route.

The field is a bit lean this year. The nice weather has all the locals out, but there's only one big entry from out of town: Andrei Karpoff, who is going as a team with local Scott Erlandson (who goes by Erl). These two are the real deal, especially on an adventure-race-type map. Erl recently took home bronze in US Adventure Race National Championships (teamed with Emily Korsch, who is running solo today, and Justin Bakken). Andrei is also an excellent adventure racer and has handed me a defeat in some longer orienteering events, such as the Possum Trot. So, this will be no cakewalk.

The format is a mark-on-the-clock three hour score-O. At the gun, I run a firm pace over to the map boards so I have the luxury of sitting down with a map right in front of me rather than having to mark peering over someone else's shoulder. Past experience has shown that this is a bad time to rush. I take my time, making sure all the circles are in the right spot before heading out.


Controls 1-9 are 10 points each, 10-18 are 20, 20-24 are 30. For reasons not given, there's no control 19. While I expect to sweep the course, I follow the herd heading east from the start to make sure I get all the high-point controls.  Many go straight up the ridge to 8. I decide to hit 4 and 7 first so I'll have a straight shot back to 6 on the return.

At 7, I get to the control location, which is pretty open by the standards of these woods, but still can't see the control anywhere. I'm almost to the point of giving up on it when I spot it lying on the ground in the reentrant. Orienteering: protest, course gets thrown out. Adventure racing: I put it on top of the log it was hung from so others will be able to see it.

I pass Erl and Andrei on the way to 12. We meet up again at 13 and stay together through the long, very slow descent off the ridge. The woods are crazy thick here and I'm glad I'm with them or I'd be stressing over how much time I was losing. Since I'm keeping up with two of the best bushwackers in the country, I conclude that the woods are just crazy thick and there's nothing wrong with my pace.

The eastern loop is fun with its mix of, well, I can't call it "urban" but let's go with "small town" navigation and woods. Erl and Andrei skip 15, so they get a bit ahead of me. Emily, who has skipped 7, is also slightly up the road. We blast through 17, 16, 18, and 22, then start taking navigation a bit more seriously. The rockface between 23 and 24 is crossable, but not without risk. I take it cautiously and am starting to lose contact with the lead three who are apparently still young enough to believe they are impervious to injury.

I push to 21, missing the trail in the process. Apparently, my route through the woods isn't much worse because, once I do get on the trail, Emily is right in front of me. We hit 21 together. Leaving 21, I comment to her that there's no good attackpoint for 20. She agrees.

As I have no confidence in my attack from the top of the ridge, I figure I'll just take my best guess at which of the many spurs to descend and sort it out at the bottom if I'm wrong. At the bottom, it's obvious from the other reentrants coming together that I had actually descended the correct spur, but the control was not there. Emily is still with me. We look to the north where the visibility is better and see no control, so we start hunting south. Two spurs later, I'm starting to think that there's no way it could be off this much. I'm within seconds of turning around to head north when I spot the flag two more spurs over (the arrow on the map indicates the actual location). Another common ethos in Adventure Racing is that when a control is misplaced, competitors become allies and work together finding it. I call out that I've got it, but that turns out to be unnecessary. Emily has also spotted it, as have Andrei and Erl, who had been searching further up the hill. We all converge on the bag at about the same time.

The other three stay low, but I decide to get to the trail as quickly as possible to get out of the thick woods. This doesn't save any time as the woods get thicker near the top of the ridge. The last 20 meters to the trail are literally crawling on hands and knees. Once on the trail, it's very quick running down to 14. Erl and Andrei are about 30 seconds ahead heading south to get 15 and setting up a return on the road along the river. Emily is not in sight so I assume she's behind me. I head north for the long trail run to 6. The split has been made and the win will come down to executing these last few controls solo.

As I climb the ridge, my legs are reminding me that I ran 100 miles just two weeks ago. Still, they respond to prodding and, once on top of the ridge, I make great time to 6. Running the trail out of 6, I get in my obligatory fall as I get tripped up trying to jump over some storm debris. I go down pretty hard, but only my hand sustains an injury (and a minor one at that). I take a conservative route to 5 and hit it cleanly.

Leaving 5, I notice that the trail I was intending to take to 10 is actually marked as out of bounds. That's not a particularly pleasant revelation as it means I need to traverse the 500 meters over to 10. As this is the last stretch of thick woods I have to deal with, I'm able to stay focused and make pretty good time though it probably would have been faster to go back through 2 and run the field. At 10, I meet Erl and Andrei again, coming the other way. They head downhill towards 3. If they only have 3, 2, and 5 to go, this is going to be very, very close.

I'm too tired to push through thick woods and any navigation error now would be disastrous. Fortunately, the trail routes are fast, simple, and not much extra distance so I can run them full speed and still hit 11 and 3 cleanly. Heading towards 2, I pass Emily who's on her way up to 3. Erl and Andrei will have to cross in front of me to get to the finish. I don't see them pass and I can't believe they are more than a minute or two ahead of me, so it looks like I'm leading. Problem is, I squander a few seconds locating 2 and, once I've found it, I have to crawl through some vegetation to get to it. I'm in a complete panic as I crawl back out; having them come by at this point would be too much to bear. I emerge from the woods and see nobody in front, so I hit the gas for the line.

Turns out, they hadn't got 6 yet, so I actually beat them by a few minutes to take the overall win in two hours, four minutes (somebody probably wrote seconds down somewhere but, again, this is Adventure Racing and we don't need to concern ourselves with digits that don't matter). Emily comes in a few minutes after them to take the women's prize. The course setter, Jerry Young, apologizes that the course was a bit short. I respond that my legs were more than happy to stop.

The misplaced control and the unmapped thick woods would have ruined a "true" orienteering event. However, this race was just a whole lot of fun. I went into it with a relaxed attitude and really enjoyed not only the fast sections, but dealing with the vegetation as it came. In truth, I think Jerry did a pretty good job of keeping us out of the worst of it. And, oscillating between pushing through the woods and blasting the trails was pretty much the only way I could have run competitively on just two weeks recovery from Tunnel Hill. On a more open course (or one closer to the full three hours), I would simply not have been able to maintain a decent pace.

The event has definitely changed over the past 20 years, but not for the worse. Local turnout was excellent and the mood at the finish was buoyant. Veterans and newbies alike returned with smiles. There was a time, not long ago, when I would have decried the fact that having fun was taking precedence over rigorous competition. I'm pretty much over that one.

Saturday, November 26, 2016

Delta method

Another biggie. I'm going to state it right up front without introduction because I, personally, am less interested in why it works than when it doesn't:
Let {Yn} be a sequence of random variables such that sqrt(n)(Yn - θ) converges in distribution to N(0, σ²). For a given function g and a specific value of θ, suppose g'(θ) exists and is not 0. Then: sqrt(n)[g(Yn) - g(θ)] converges in distribution to N(0, σ²[g'(θ)]²).
Yes, that's a mouthful. All those caveats come from the fact that this result is derived using first-order Taylor series approximations. If any of them don't hold, the Taylor series doesn't work. It's true that these conditions do hold quite often and the Delta method is a very useful way to get a distribution on the transform of a sequence of random variables (typically, a transform of the mean). Still, let's take a closer look.

The first condition, that sqrt(n)(Yn - θ) converges in distribution to N(0, σ2) is lifted straight from the Central Limit Theorem. You could derive similar results for any statistic that converged to some other distribution. The Delta method is targeted specifically at transforms of the mean. Fair enough.

The transform g has to be differentiable, at least at the point of interest. This point of interest, of course, is typically the sample mean. Also not a big deal.

The not zero condition, though, is a little troubling. Why would it matter that g'(θ) is zero? Working backwards from the result, it's obvious that the result is meaningless when g'(θ) is zero (the variance of the limiting distribution becomes zero). Still, what is it about that transform that causes such a problem?

This isn't some cooked-up theoretical case. Suppose the transform is something as simple as g(x) = x2 and the point of interest is 0. Why shouldn't that work?

The problem is that first-order Taylor polynomials are kinda dumb. They only know a point and a slope. If the slope is zero, they degenerate to constants which doesn't make for a very interesting random variable.

So, here's the rub: what if g'(θ) is really close to, but not equal to zero? Technically, the Delta method works. In reality, not so much. Sure, it will converge if you give it a big enough sample size, but it will have to be a really, really, big sample. So, as with all of these things, applying the rule blindly is a quick route to some pretty bogus results. All these assumptions need to be checked.
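To make that concrete, here's a small simulation sketch with my own toy numbers: g(x) = x², θ = 0.05 (close to, but not equal to, zero) and σ = 1. If the first-order approximation were doing its job, the standardized quantity below would look standard normal, with roughly 2.5% of its mass below -1.96.

import numpy as np

rng = np.random.default_rng(4)
theta, sigma, reps = 0.05, 1.0, 50_000
g = lambda x: x**2
gprime = 2 * theta                  # g'(theta), small but nonzero

for n in (50, 500, 50_000):
    xbar = rng.normal(theta, sigma, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (g(xbar) - g(theta)) / (sigma * abs(gprime))
    print(n, round(float(z.mean()), 3), round(float((z < -1.96).mean()), 4))

# At n = 50 and 500 the lower tail is simply missing (the quadratic term dominates);
# only the absurdly large sample starts to look like the N(0,1) the method promises.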

Friday, November 25, 2016

Central Limit Theorem and Slutsky's Theorem

Today I'll look at two results stemming from stochastic convergence. The first is quite possibly the most significant result in all of statistics. Odd that nobody seems to know who first proved it. There are many variants extending it to special cases, but the original proof of the Central Limit Theorem is not easy to track down. It might be one of those things that evolved; first conjectured by example, then proven in a bunch of special cases and then incrementally extended to its current robust form:
Let {Xi} be a sequence of iid random variables with finite variance σ² and E(Xi) = μ. Then sqrt(n)(X̄n - μ) converges in distribution to N(0, σ²); equivalently, for large n, the sample mean is approximately normal with mean μ and variance σ²/n.
It should be pretty obvious that this is an important result. It basically says that as long as our sample sizes are large and variances are finite, we wind up with normally-distributed means. Stated another way, it's a bridge between a finite sample and the strong law of large numbers. Yes, the sample mean converges, and here's how it converges. The problem, of course, is that variances aren't always finite and "large" is a pretty subjective term. Many researchers are far too quick to assume normality without actually verifying it.

Slutsky's Theorem is a lesser-known, but in many ways more practically useful result:
If Xn converges in distribution to X and Yn converges in probability to a constant, a, then
  • YnXn converge in distribution to aX.
  • Xn + Yn converge in distribution to X + a.
You're still stuck with the vagueness of how big n needs to be, but taking linear combinations of things is a really common operation. It's nice to know that you're not completely invalidating your results by doing it. In particular, we can use this to plug the sample variance back into the Central Limit Theorem:

sqrt(n)(X̄n - μ)/σ converges in distribution to N(0, 1)
That is actually the typical way to state the CLT. We know that the sample variance, Sn², converges in probability to σ². So, by Slutsky, we can swap it in to get:

sqrt(n)(X̄n - μ)/Sn converges in distribution to N(0, 1)
This is a much more useful form of the result because it allows us to make inferences on the mean without knowing the true variance. Again, adequate sample size is very dependent on the underlying distribution. But, assuming you perform your proper normality checks, you're good to go.
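A quick simulation sketch of that swap (the exponential and the sample size are my own picks): standardize the sample mean of a decidedly non-normal sample with the true σ and then with Sn, and both should behave like a standard normal.

import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 50_000
mu, sigma = 1.0, 1.0                               # exponential(1) has mean 1 and sd 1

x = rng.exponential(scale=1.0, size=(reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)                          # sample standard deviation S_n

z_known = np.sqrt(n) * (xbar - mu) / sigma         # the CLT as stated above
z_plugin = np.sqrt(n) * (xbar - mu) / s            # Slutsky: S_n swapped in for sigma

# Both should put roughly 95% of their mass inside +/- 1.96.
print((np.abs(z_known) < 1.96).mean(), (np.abs(z_plugin) < 1.96).mean())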

Thursday, November 24, 2016

Sick day

Well, here's something that doesn't happen very often: I got sick. Not really sure what the problem was. Definitely wasn't a normal cold or flu. Woke up feeling a bit hung over (no surprise there as my wine club's big annual party is always Thanksgiving Eve). But, rather than feeling better once I got up, I very quickly started feeling a lot worse. After about fifteen minutes, I had to go back to bed and slept until noon. That's pretty unheard of with me.

At noon I got the dinner rolls rising (we do Thanksgiving at our church with a bunch of other families who have no relatives in town). Then I had to lie down again. Fortunately, by dinner I was feeling up to going over to church to bake the rolls and was able to enjoy the dinner. Still don't feel right, though.

My guess is that I'm just a lot more drained than I'm willing to admit. This has been a tough month. Physically, I've done a 100 and also given blood. Mentally, I'm at it 60-70 hours a week. Spiritually, it's just been flat out depressing to see so much hatred. I think when I poured a bunch of wine on that last night, my body just threw in the towel and forced me to take a day off.

Wednesday, November 23, 2016

Median

Statisticians are addicted to the mean. Sure, it has nice mathematical properties, assuming you're using a Euclidean distance metric, but that's about all it has going for it. Consider the following scenarios:

  • Ten people are in a room. Donald Trump walks in. What just happened to the "average" wealth of the people in the room? If you use the mean, it went up by several hundred million dollars. That's obviously nonsense. Only one person is affected by all that money. The "average" person in the room hasn't seen any change. The median (which, may move ever so slightly) and mode (which won't move at all) both reflect that reality.
  • The Powerball lottery payout gets so high the payout for a drawing is greater than the total ticket sales. That means your expected gain on a ticket is positive. You should buy as many as possible, right? No, of course not. Even if you bought millions of tickets your most likely outcome is to lose everything. The median and mode both indicate an expected result of total loss. Only the mean hints at the fairy tale.
  • Exercise is generally good for you, but I don't know any serious athletes that haven't injured themselves. Most have suffered rather serious injuries. Some have died. That might be enough to dissuade some from participating, but most people accept that a few nasty incidents aren't enough to outweigh a small, but real, improvement in general well being.
The "correct" average to use is entirely dependent on your cost function. Squaring the error generalizes nicely to higher dimensions, but it's not really the way we tend to operate in our daily lives. If someone consistently arrives on time, I consider them punctual. I don't change that opinion on the day they get stuck in traffic and are an hour late. We dismiss outliers all the time without even thinking about it. That's because, internally, we're thinking more about medians and modes than means. Means are very sensitive to outliers, medians and modes are not.

The mode is probably the most intuitively obvious "average". It's also the correct one if you're using an all-or-nothing distance metric. That is, if a miss is a miss, regardless of how close it is, the mode is the natural center of the distribution. There aren't too many cases of that; close counts in more than just horseshoes and hand grenades. But, the mode is still a very easy concept: it's simply the most likely outcome. The problem is that there are many distributions, such as exponential waiting times, where the most likely outcome is way over on one side of the distribution. Also, if a distribution has a relatively flat top, estimating the mode from a sample is dicey as the most frequent result could be observed anywhere along the flat upper portion.

So, the median is pretty much the way to go for everyday life. It's also easy to understand: half the time you're above it, half the time you're below it. And, it's very easy to estimate; just sort your sample and grab the point in the middle. It's also very stable. You can be pretty sloppy in your sampling and still get the median right.

Tuesday, November 22, 2016

Convergence

It's late, so I'm just going to post a few named results without a lot of commentary. Actually, most are definitions, but they set up two very important results:

Convergence in distribution (definition; also known as weak convergence or convergence in law): The CDF's of a sequence of random variables converge to a single limiting CDF at every value at which the limiting CDF is continuous.

Convergence in probability (definition): A sequence of random variables {Xi} converges in probability to a random variable X if, for every ε > 0, P(|Xn - X| < ε) converges to 1.

Weak Law of Large Numbers (first important result): The sample mean of a sequence of iid random variables converges in probability to the true mean provided it is finite.

Almost sure convergence (another definition): Stronger than convergence in probability, here we move the limit inside the probability. That is, it's not just that the probability of the sequence and limit being close goes to one; the set of points on which the sequence actually converges to the limit has probability 1. Or, stated in terms of the complement, the portion of the sample space where they don't converge is a set of measure zero. You have to construct some goofy cases to show these aren't saying the same thing, but failing to deal with such counterexamples is what got mathematics into so much trouble in the 16th and 17th century, so modern mathematicians are rightly careful to consider them.

Strong Law of Large Numbers (second, and even more important result): A sequence of iid random variables {Xi} with E(Xi) = μ (finite) and Var(Xi) = σ2 has a sample mean that converges almost surely to μ.

Again, in the vast majority of cases, the two results are a distinction without a difference, but there are cases where the second is stronger. The first simply says that the probability of the sample mean being off by more than some arbitrarily small value goes to zero. The second says that, except on a set of no significance, the sample mean actually converges.
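As a small illustration of why that "provided it is finite" caveat matters (all numbers here are my own toy choices), compare running sample means from a finite-mean distribution with those from a Cauchy, which has no mean at all:

import numpy as np

rng = np.random.default_rng(12)
n = 100_000

expo = rng.exponential(scale=1.0, size=n)   # finite mean (1) and variance
cauchy = rng.standard_cauchy(size=n)        # no mean, so the law's hypothesis fails

checkpoints = np.array([100, 1_000, 10_000, 100_000])
print(np.cumsum(expo)[checkpoints - 1] / checkpoints)    # hugs 1.0 ever tighter
print(np.cumsum(cauchy)[checkpoints - 1] / checkpoints)  # keeps wandering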

Monday, November 21, 2016

F

No, no, I didn't fail a class. Today, I'm writing about the F-distribution. In another odd naming scenario, the F is in honor of Ronald Fisher but it was actually developed by George Snedecor. Snedecor had no problem with publishing under his own name, he just felt that Fisher had done so much of the work on this particular problem that he should get some credit, too. Some texts call it Snedecor's F distribution.

OK, fine, get on with it.

Just as the t-distribution is used to substitute in the sample variance for the true variance to get a distribution on the mean, the F-distribution substitutes in two sample variances to get a distribution on the ratio of the variances. Formally, if Xi are N(μX, σX²) and Yi are N(μY, σY²), then:

[SX²/σX²] / [SY²/σY²] ~ F(n-1, m-1)
Also as with the t-distribution, there are degrees of freedom. In this case there are two and they are what you'd expect from the t: n - 1 for n observations of X and m - 1 for m observations of Y. You might wonder why the degrees of freedom are always one less than the number of observations. Why not just arbitrarily re-label them so they match? The reason is that these are coming from the underlying chi-squared distributions. Recall that (n-1)SX²/σX² is chi-squared with n-1 degrees of freedom, so the F is really just the ratio of two independent chi-squared random variables, each divided by its degrees of freedom.

It was this whole chi-squared, degrees of freedom bit that Fisher had worked out with respect to normal samples. Snedecor and Gosset just did the wrapping and formalizing of the distributions. Both were admirers of Fisher's work, as is obvious from their deferment of credit. (Though, in the case of Gosset, he pretty much had to publish on the down low because one of his colleagues at Guinness had given away some trade secrets in a publication and the company subsequently took a dim view of anybody publishing anything.)

And, here's three fun transformation facts before I call it a night:

  1. If X ~ F(p, q) then 1/X ~ F(q, p). That should be pretty obvious from the formulation as the ratio of two random variables.
  2. If X ~ t(q) then X² ~ F(1, q). Nothing obvious about that one. At least not to me.
  3. If X ~ F(p, q) then (p/q)X/(1+(p/q)X) ~ beta(p/2, q/2). Wow, really? OK.
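None of the three are hard to sanity-check by simulation. Here's a quick sketch (degrees of freedom chosen arbitrarily) comparing simulated draws against each claimed distribution with a Kolmogorov-Smirnov statistic; small values mean a good fit.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
p, q, reps = 4, 9, 200_000

x = rng.f(p, q, size=reps)

# Fact 1: 1/X should look like F(q, p).
print(stats.kstest(1.0 / x, 'f', args=(q, p)).statistic)

# Fact 2: the square of a t with q degrees of freedom should look like F(1, q).
t = rng.standard_t(q, size=reps)
print(stats.kstest(t**2, 'f', args=(1, q)).statistic)

# Fact 3: (p/q)X / (1 + (p/q)X) should look like beta(p/2, q/2).
w = (p / q) * x / (1.0 + (p / q) * x)
print(stats.kstest(w, 'beta', args=(p / 2, q / 2)).statistic)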

t and Cauchy

On Saturday, I said I'd skip the derivation of the t-distribution but, I'm going to come back to it because I've been thinking more about it and it kind of freaks me out. Recall that the t-distribution is the ratio of a normal and independent chi-squared random variable which we'll denote U and V, respectively. To get the pdf, we consider the transformation of U and V to T and W:

T = U / sqrt(V/p),    W = V
We don't really care about W, but we can only use the Jacobian transform trick if we have an equal number of variables on both sides of the transform, so we just pick the easiest possible transformation for the second variable. Inverting those transformations and taking the Jacobian gives

fT,W(t, w) = [1/sqrt(2π)] e^(-t²w/(2p)) · [1/(Γ(p/2) 2^(p/2))] w^(p/2 - 1) e^(-w/2) · sqrt(w/p)
Now we compute the marginal:
fT(t) = ∫ fT,W(t, w) dw, integrating w from 0 to ∞
Here, I will skip some steps since it's messy and not particularly enlightening. Suffice it to say, you collect terms and notice that you can factor out the kernel of the Gamma distribution to get:

fT(t) = [Γ((p+1)/2) / (Γ(p/2) sqrt(pπ))] (1 + t²/p)^(-(p+1)/2)
And you were wondering why people just look the numbers up in a table. Seriously, though, forget about all the crazy norming constants and just look at the part of the term that involves t. What happens when p = 1?

fT(t) = 1 / [π(1 + t²)]
It's the freaking Cauchy distribution! Why does this seem crazy? Because, consider what this represents. This is the distribution of the studentized mean (the t statistic) when all we have is a sample of two from a normal population of unknown mean and variance. Basically, we're saying that, if we don't know the mean or variance, the sample mean comes from a nice normal distribution (we just don't know the parameters), but our studentized estimate of the mean is so crazy unbounded that we have no moments on the distribution at all.

Maybe it's just me, but that seems nutty. One thing's for sure: don't ever believe a sample of 2.
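Seeing is believing, so here's a quick simulation sketch (the true mean and standard deviation are arbitrary picks) of the t statistic built from samples of size two:

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
reps = 200_000

x = rng.normal(loc=10.0, scale=3.0, size=(reps, 2))          # samples of size two
t_stat = (x.mean(axis=1) - 10.0) / (x.std(axis=1, ddof=1) / np.sqrt(2))

# t with 1 degree of freedom is standard Cauchy: no mean, absurdly heavy tails.
print(stats.kstest(t_stat, 'cauchy').statistic)    # should be tiny
print(np.quantile(np.abs(t_stat), 0.99))           # huge; ~63.7 is the Cauchy value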

Sunday, November 20, 2016

Tunnel Hill 100

Run November 12, 2016.

OK, it technically wasn't a DNF, it was a drop down. Still, the reason you enter a 100 mile race is to put yourself in a situation where continuing is hard. Responding to that by taking the easy way out at 50 miles rather misses the point. So, while September's lame result in the Mark Twain 100 still had some sting, I signed up for Tunnel Hill. Not so much for redemption as much as to remind myself that I do know how to keep going.

The "trail" leaving the campground
Tunnel Hill bills itself as an "easy" 100. It's certainly true that it requires a lot less work (as in, force times distance) to cover 100 miles on an old railway grade than technical singletrack. If I still had my pre-return-to-grad-school fitness, I'd be going for a PR (I don't, so I'm not). Still, it's a long way to run and the lack of terrain variation means that you don't get natural opportunities to vary your stride. That can lead to some serious tightening or even cramping in the second half. The monotony of a long, flat, straight trail can also be a drain on mental fortitude (though, as rails-to-trails courses go, Tunnel Hill offers more scenery than most).

Camping is allowed at the start, which is almost always my preferred way to do it. It makes the morning less stressful and provides an immediate place to lie down after finishing. More importantly, it offers the opportunity to share some time with like-minded souls. Nobody I know is camping, so I'm forced to make some new acquaintances (engineering types like me often need to be forced). As is always the case in the ultra community, outsiders are considered friends until proven otherwise and my neighbors immediately treat me as kin.

Prior to dinner, I go for a short jog on the trail. It's exactly as I expected. Six feet wide, fine gravel, woods on both sides. Most of the leaves are down, so there won't be a lot of shade. Given the cool forecast, that's not a concern. In a shorter race, the loose surface would make spikes advisable, but I don't expect to be running fast enough to worry about traction. My regular road shoes should be fine. Many of the competitors are opting for gaiters to keep the cinder out of their shoes. That would probably be a good idea, but I never use them and don't want to try something new in competition.

Bill & I at the start
Sleep is a little less than hoped for. I think I'm getting too old for my lightweight backpacking bed roll; a true air mattress is in order. At 4:30AM, I give up on the activity even though the 7:30 start afforded me more time if I wanted it. After my usual pre-race breakfast of oatmeal and coffee, I busy myself with getting my gear in order.

The course heads south for 13 miles and comes back. Then it goes north for 12 miles and comes back. Fifty-mile runners will stop there; the 100 crowd does the whole thing again. That means that we go through the start finish area roughly every 25 miles. That seems plenty often enough, so I don't send any drop bags to the remote aid stations. I pack my bin with a full change of clothes, plus extra layers for the night (temps are predicted to get near freezing). I'm content to rely on the aid stations for food, but do pack some electrolyte and caffeine tablets. With over 600 in the race, getting bottles refilled at aid stations might be a pain, so I give some spare bottles to Laura Langton who is crewing for her husband Bill in the 50. I don't expect congestion to be an issue in the second 50.

The southern section had a few pretty spots
The race starts on time with a short lap around the campground to string things out a bit before hitting the "trail". Bill and I run together fairly near the front. After a couple miles it's obvious that the third mug of coffee was one too many. Bill and I both stop for a quick pee. A couple minutes later, I look down and notice my number is missing. I don't pin my number to my clothes in ultras because I often end up changing at some point. Instead I use a number belt, popular with triathletes who are also changing mid-race. It must have unclipped when I stopped.

I head back, getting a lot of bewildered looks from the hundreds of oncoming runners. When I get to where we stopped, there's no sign of my number. The race is chip timed, so not having my number is probably grounds for disqualification. Still, that's not a good reason to quit. I'll know whether or not I ran regardless of what the results page says. I take one last look and then realize that I actually had the number all the time, but when I did my shorts back up, it had wound up on the inside. In the context of a 20-hour race, losing five minutes to a boneheaded mistake is not a big deal, but it's still not the start I was hoping for.

Rock cut at the start of the climb
Trying to get time back is pretty much a losing strategy. Setbacks occur and that time is gone for good. I make a point of jogging very easy the rest of the way to the first aid station. As I'm moving through the thick part of the pack, I don't really have much choice. I chat with a few folks I know as I slowly work my way through the field. Coming up on the aid station at 5 miles, the watch of a woman running next to me emits a chirp. "Hey!" she exclaims, "I've got my 10,000 steps in today!" Only 190,000 to go.

By the second aid station at mile 11, I'm back in the first 50 or so runners and things are stringing out. Laura hands me a bottle and tells me that Bill is about five minutes up the path. Not long after leaving, I meet the lead runners coming back. I see Bill coming back about a quarter mile from the turnaround. After the turn, there's A LOT more oncoming traffic. Passing isn't a big problem on a path this wide; it's just unfamiliar to see so many people in an ultra. Most of the events I run have considerably smaller fields.

I get back to the start just before noon (4:28 elapsed). While I'm not running this one for time (I'm not even wearing a watch), it's in line with expectations. All I have to do is drop my long-sleeved shirt and change bottles whereas Bill is taking a longer stop, so we head back out together.

Trestle near the top of the climb
Running with Bill has a number of advantages. First and foremost, it's just more fun to run with company. Additionally, it removes any thoughts I have of trying to push the pace. Pushing from 30-50 to chase down the leaders was what did me in at Mark Twain. I now focus on Bill's goal of getting 50 done in under 9 hours. That means keeping the pace right where it's been, with short walk breaks thrown in every 1-2 miles.

The southern part of the course was less than inspiring, but this section is more interesting. After the aid station at mile 30, we climb steadily towards the tunnel for which the race is named. As it's an old railroad grade, it never gets steeper than 2%, but even that amount of climb is enough to bring relief to my glutes and hamstrings which were getting mighty tired of flat terrain. As we climb up the ridge, we're also treated to a number of interesting rock cuts and trestles and then, finally, the tunnel.

Laura is at the aid station at the far side of the tunnel. We get through it quickly and do the short out and back to the turnaround. Back at the aid station, I check the time. We're 7:11 in, so we have to cover the remaining 10 miles in just under 110 minutes. That seems pretty doable to me, but I have no idea how Bill is really feeling. We talk a lot on the way back down, making sure he's not overcooking things. He decides to blow through the aid station at 47 so he doesn't stiffen up. I get a bottle refilled for him and that's enough to get him home for a sub-9 PR with six minutes to spare.

The tunnel
I cheer him across the line and then collect my stuff for the next quarter of the race. This is always the toughest section for me. It's not just that physical fatigue becomes very real at this point, there's also a mental letdown exacerbated by the fact that the daylight is vanishing. As I head out, I'm still feeling fairly good so I try to stay somewhere near the pace I've been running.

That pace fades over the next few hours and I'm now taking two walk breaks between each aid station rather than one. I'm also spending more time at the aid stations because I can no longer keep solid food down. That limits me to soup broth. Leaving the station at 55 miles, I try to carry a cup with me and end up spilling most of it. Aside from defeating the purpose, that makes my hands really cold as my gloves aren't waterproof. At subsequent stops, I take an extra couple minutes to finish the broth before heading back out.

As usual at this point in the race, I'm alone. Often, that can be depressing. Tonight, it is not. Sure, I'm tired (sleepy more than exhausted) and my stride is getting pretty stiff, but it is a truly beautiful night to be out. Approaching the southern turn, the path leaves the woods and crosses some misty fields. A full moon has risen and bathes everything in eerie blue made all the more surreal by the forlorn howl of a lone wolf off to the west.

While the southern loop takes nearly an hour longer than it did in the morning, there's never a point at which I have to wrestle with quitting. In fact, I get back feeling good enough to start caring about breaking 20 hours.

That goal isn't helped by spending a whopping 15 minutes in the transition. I'm really not sure why it takes so long. Yes, I have two cups of soup, and I'm messing around a bit with clothing (basically putting on everything I have as it's obviously going to get much colder than freezing before I finish), but that should still only be a 5-7 minute stop. I guess my brain is just not up for doing these things quickly. At any rate, I leave with 14:40 on the clock, which means I've got time, but can't do too much walking.

Heading up the grade to the tunnel, I take walk breaks every mile. At the aid station, I linger a bit and make sure I get enough broth down to hold me the rest of the way. I push a bit on the out and back and return at 1:33AM. I am exactly on 20-hour pace and it's ten mostly downhill miles to the finish. I decide I'll run the descent easy and then put everything I've got left into the final four miles.

The LED bulb and the lithium batteries in my headlamp are having more trouble with the cold than I. The trail is devoid of trip hazards, but it's still preferable to have some idea where you're putting your feet. Fortunately, it's a clear sky and the moon is still pretty close to the zenith so, except for a few shaded patches, I can see the trail fine even with my light significantly dimmed.

I plan to skip the final aid station. Food won't help me at this point and I've got enough water. Unfortunately, as I go for a drink I find that it's no longer in liquid form; my bottle has frozen solid. I stop at the final aid station just long enough to grab a gulp of water and then start pushing for the finish.

The finish push in a 100 is probably pretty comical when viewed by an impartial observer. It certainly feels like I'm running fast. In reality, I cover the last three miles in around 26 minutes which won't likely take home age-group hardware in a local 5K, but it does get me to the line in 19:51. There's no particular significance to finishing under 20 hours; it just sounds better. Also under the heading of results that don't matter: I pass three people right near the end to move into the top 20 (19th) overall.

I haven't been sweating since mid-afternoon, so I don't bother cleaning up before getting into my tent and getting some sleep. I don't sleep for long. Partly because I'm too sore to sleep well and partly because the sun comes up only three hours after I tucked in. Fortunately, one of my newfound friends from Friday night needs a ride back to the St. Louis airport, so I have company to keep me awake for the drive.

The goal coming in was simply to cover the distance so I wasn't going out to Leadville next summer wondering if I even had the discipline to finish. Granted, this course is considerably easier than Leadville, but the physical challenges of a 100 are secondary to keeping your head in the game. In that respect, this was probably the best 100 I've ever run. I backed off when I needed to, pushed when I needed to, and never had a period where I just stopped caring or trying altogether.

Empirical evidence would indicate that the course was only easy if you made it hard. Basically, your options were run or freeze. 23 finished under 20 hours, which is a lot given 209 starters. But, if instead of telling you that, I just said that a quarter of the field was under 24 and less than half covered the full distance, that would indicate a moderately tough course. Runners typically regard heat as the ultimate enemy but, in ultras, cold can be far more devastating. It's hard to stay warm when your body is depleted.

As I mentioned earlier, if I had the form of several years ago, this is a course where I would have been trying for a PR. The fact that I ran well and was still two hours off my best is an indication that my fitness is probably gone for good. However, the fact that I ran that much slower and never felt like quitting is an indication that I'm finally getting to the far side of the bridge from competition to participation. I hope so, because my glory days are clearly behind, but I still love to run.



Saturday, November 19, 2016

Student's t-distribution

Let's get right to the obvious: "Student" is, in fact, a real guy named William Sealy Gosset. He had some issues. Not only did he not like his name (he regularly went by W.S. Gosset rather than William, or Bill), he felt even that lacked sufficient anonymity, so he published under the name of "Student". Dude, what's your deal? What's even nuttier is that he worked for Guinness. Yes, the brewery in Dublin. You'd think a guy who got free pints of stout would be a bit more chill.

Anyway, the t distribution is sufficiently important that if ole W.S. had been a bit less bashful, we'd all be writing Gosset distribution all over the place. Basically, it addresses the fact that we typically don't know the variance of the population we're sampling from. We can estimate it, but that's not the same as knowing it.

If Xi are iid N(μ, σ2) random variables, then

(X̄ - μ) / (σ/sqrt(n)) ~ N(0, 1)
If we knew σ, then we could make all sorts of statements about μ based on the sample mean. But we don't. So what we need is to know the distribution of

T = (X̄ - μ) / (S/sqrt(n))
where S is the sample standard deviation (so S² is the sample variance). While somewhat more complicated on the surface, the right side is actually just the ratio of a N(0,1) and a mildly transformed chi-squared. That is:

T = [(X̄ - μ)/(σ/sqrt(n))] / sqrt(S²/σ²) = U / sqrt(V/(n-1)), where U = (X̄ - μ)/(σ/sqrt(n)) ~ N(0,1) and V = (n-1)S²/σ² ~ χ² with n-1 degrees of freedom
Importantly, U and V are independent. The distribution of this ratio is defined as t with p degrees of freedom (where p is determined by the sample size). Grinding out the pdf is left as an exercise to the reader (or you could just look it up). In reality, everybody just uses the tables or a stats package. Tying this all back to the original question, if X1, ..., Xn are iid N(μ, σ²) random variables, then

(X̄ - μ) / (S/sqrt(n))
which we call a t distribution with n-1 degrees of freedom.
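Here's a small simulation sketch (sample size and parameters are my own picks) of why this matters for small samples: the studentized mean has noticeably fatter tails than the normal, and the t critical value accounts for that.

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, reps = 5, 200_000
mu, sigma = 0.0, 2.0

x = rng.normal(mu, sigma, size=(reps, n))
t_stat = (x.mean(axis=1) - mu) / (x.std(axis=1, ddof=1) / np.sqrt(n))

print(np.quantile(t_stat, 0.975))     # empirical 97.5th percentile, about 2.78
print(stats.t.ppf(0.975, df=n - 1))   # the t with 4 df critical value, 2.776...
print(stats.norm.ppf(0.975))          # 1.96 -- too small for a sample of five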

Friday, November 18, 2016

More quick facts while I do real work


  1. The square of a standard normal random variable is chi-squared with 1 degree of freedom.
  2. The sum of independent chi-squared random variables is chi-squared with the sum of their degrees of freedom.
Don't you feel enlightened?

OK, I'll throw out one more:

chi-squared with p degrees of freedom is really Gamma(p/2, 2) in disguise.
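A one-line check of that last fact (degrees of freedom chosen arbitrarily; the Gamma here is parameterized by shape p/2 and scale 2):

import numpy as np
from scipy import stats

p = 7
x = np.linspace(0.5, 25, 50)
print(np.allclose(stats.chi2.cdf(x, df=p), stats.gamma.cdf(x, a=p / 2, scale=2)))  # True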



Thursday, November 17, 2016

iid normals

Another quick post today because I really, really, really, need to finish off my CISS paper soon.

Let X1, ..., Xn be a random sample from a N(μ, σ2) distribution. Then
  • The sample mean and variance are independent random variables.
  • The sample mean is distributed N(μ, σ2/n).
  • (n - 1)S2 / σ2 is distributed chi-squared with n - 1 degrees of freedom.
That last one is exactly the type of thing that would show up on the Q.
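Here's a quick simulation sketch of all three facts (parameters are my own picks). The correlation check is, of course, only consistent with independence rather than proof of it:

import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, reps, mu, sigma = 10, 100_000, 5.0, 2.0

x = rng.normal(mu, sigma, size=(reps, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(np.corrcoef(xbar, s2)[0, 1])              # near 0, consistent with independence
print(xbar.std(), sigma / np.sqrt(n))           # sd of the sample mean vs sigma/sqrt(n)
print(stats.kstest((n - 1) * s2 / sigma**2, 'chi2', args=(n - 1,)).statistic)  # tiny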

Wednesday, November 16, 2016

Sample variance

Well, last weekend was great, but I guess I need to get back to studying. I'm now moving from Probability Theory to Statistics. One stat that always seemed weird to me was the sample variance. Sure, I get the whole degrees of freedom thing, but it still seems odd that the sample variance gets the adjustment for the sample mean.

S² = [1/(n-1)] Σ (Xi - X̄)²
Again, I get it. Because the sample mean is centered on the sample, not the distribution from which the sample is pulled, dividing by n will bias the variance low. What's always been odd to me is that dividing by n - 1 fixes it. It's one of those results that I understand but still gnaws at me a bit. So, as a public service to anybody else who might be in the same boat, here's the proof that S² is, in fact, an unbiased estimator of the true variance, σ².

E(S²) = [1/(n-1)] E[Σ(Xi - X̄)²] = [1/(n-1)][Σ E(Xi²) - n E(X̄²)]
      = [1/(n-1)][n(σ² + μ²) - n(σ²/n + μ²)] = [1/(n-1)](n - 1)σ² = σ²
Feel better? Neither do I. Just learn it and get on with life.
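If a numerical sanity check helps more than the algebra (sample size and variance are arbitrary picks):

import numpy as np

rng = np.random.default_rng(10)
n, reps, sigma2 = 5, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
print(x.var(axis=1, ddof=0).mean())   # divide by n: biased low, near sigma2 * (n-1)/n = 3.2
print(x.var(axis=1, ddof=1).mean())   # divide by n-1: near 4.0, unbiased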

Sunday, November 13, 2016

The bees knees

I have no idea what that expression is supposed to mean, but here are two quick takeaways from Tunnel Hill:
  1. My knees are destroyed! Everybody else was complaining about it, too. The surface is your typical rails-to-trails cinder on top of gravel which is reasonably soft. Maybe it's the lack of climb/descent. That is, because you're on flat or close to it the whole way, there isn't enough variation in your stride. That accelerates overuse issues. Still, dang, I don't ever recall my knees being this sore after an event.
  2. Honey Stinger gels don't seem to upset my stomach the way just about every other gel does. Curative power of honey?
Full report next weekend. For now, let's just say it went pretty well. Good race. Great weather (although it got mighty cold at night). Finished in under 20 hours (barely). No injuries. No significant mishaps.

Friday, November 11, 2016

Results?

I'm heading out today to run the Tunnel Hill 100. Given the rather spotty results I've had this year (no surprise - grad school has that effect on athletic performance), I am just planning on covering the distance. It is, however, a fast course and conditions are looking nearly perfect. I'll have to be disciplined and not fall into the trap I did at Mark Twain where I switched to thinking of it as a competition and found myself unable to revert to my original plan.

The goal is to cover the distance. Period.

No posts until Sunday evening.

Thursday, November 10, 2016

Results

Before getting on to today's topic, let me just say that, if you liked yesterday's post (or, if you thought I might have been on to something but didn't say it well enough), Nate Silver posted pretty much the same thing over on 538. Except that, since blogging about polling is his full-time job, he did it better than me.

Anyway, life goes on. This isn't the first time in history a bigoted narcissist has risen to power and we're still here.

I'm writing up the results section for my CISS paper. It's true that I really need to just finish this thing, but all this talk about polling errors got me thinking about whether I should run it against a much larger data set. Basically, I've been testing it against a reasonably consistent slice of our business. I took the projections from a few business units; roughly 5% of our total business. Since I was pulling the entirety of the data for each BU, I wasn't in danger of eliminating the heavy tail or correlation concerns that motivated the algorithm in the first place. However, it is a small sample of the total. A poll, so to speak, not the election.

So, why not run the election? Sure, the results aren't likely to differ much, but they will almost certainly be different in ways that are noticeable. Last spring, I decided against this simply because I wanted to keep the data in memory, which would make everything run a lot faster and make results analysis easier. However, it's really not that much work to write a real data layer. And, now that I've got a Hadoop cluster at my disposal, it will run plenty fast even on the "full" set. I use quotes there because I have no intention of running it on the 65,000,000,000-row data set that was the original motivation for this work; I'll use the initiative that has just over a billion rows.

Wednesday, November 9, 2016

Polling error

Well, I assume I don't need to tell you that the polls missed it yesterday. While that is a fact, what people don't seem to realize is that they really didn't miss by that much. In fact, you'd have to say that if a miss like this didn't happen every 20 years or so, then the polls are being way too conservative.

That's what a confidence interval means. There's a probability that the result falls in a given range. There is a complementary probability that the result does not fall in that range. If you never miss, your confidence interval is off.

It's true that Sam Wang's 99% prediction over at the Princeton Election Consortium was a bit nutty and several other estimators called him out on it. (To his credit, he discovered his error and not only admitted it, but published the details on election night). However, most had Hillary at around 80%, which means they would expect to be wrong once every five elections (or, 20 years). This was the year.

Monday, November 7, 2016

Diagonalization

My linear algebra text doesn't name the theorems relating eigenvectors to diagonalization. That surprises me a bit. Not because they are super deep theorems, but the results are so incredibly important. I guess it's an example of stuff that was proved long before it was needed. Either that, or the author (writing in 1980) had no clue how important diagonalization was going to become to data analysis.

Anyway, as a practitioner in 2016, not to mention someone about to take the Q, these results are pretty much the most important takeaway of the whole course.

First, distinct eigenvalues yield independent eigenvectors:

If λ1, ..., λk are distinct eigenvalues of A with corresponding eigenvectors x1, ..., xk, then x1, ..., xk are linearly independent.

From this immediately follows the result that makes Principal Component Analysis possible:

An n×n matrix A is diagonalizable if and only if A has n linearly independent eigenvectors. Furthermore, if A = SDS⁻¹ where D is diagonal, then S can be taken to have the eigenvectors as its columns and the diagonal entries of D are the corresponding eigenvalues.

As mentioned above, the proof is a fairly easy exercise in construction:

Suppose A has n linearly independent eigenvectors. Let S be the matrix formed by taking them as column vectors. For each associated eigenvalue, we know that Axj = λjxj so:

AS = A[x1 x2 ... xn] = [λ1x1 λ2x2 ... λnxn] = SD
Since the eigenvectors are independent, S is non-singular. So, multiplying both sides on the right by its inverse gives A = SDS⁻¹. The converse is basically walking the same thread back the other way.
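A quick numpy sketch of the construction (a random matrix, so the eigenvalues are almost surely distinct; they may be complex, which doesn't hurt the check):

import numpy as np

rng = np.random.default_rng(11)
A = rng.normal(size=(4, 4))       # generic matrix, eigenvalues almost surely distinct

vals, vecs = np.linalg.eig(A)     # columns of vecs are the eigenvectors
S, D = vecs, np.diag(vals)

print(np.allclose(A @ S, S @ D))                  # A S = S D
print(np.allclose(A, S @ D @ np.linalg.inv(S)))   # hence A = S D S^(-1)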

Sunday, November 6, 2016

Gala


Last night Kate & I attended the annual fundraising gala for the Catholic Student Center at Washington University. We go every year as guests of Beth Schnettler (second from left in photo) who recently retired as Dean of Admissions and is still very active in her support of the students. It's always great fun and it's also a great cause.

Reasonable people disagree on the virtue of these things. There is no question that getting a bunch of rich folks in one spot and having them bid on things donated by the same group of rich folks will certainly incent some to give for the wrong reasons. Who cares? They give. A lot. And, for those who haven't tried to run a non-profit, it's a really hard thing to do without funding.

Jesus went to lots of parties. Seriously. The gospels are loaded with stories of him showing up at the celebrations of the well to do. Sure, those stories also include him throwing out some spiritual truth for them to chew on, but there's no indication that by doing so he buzz-killed everybody's evening. The fact that he kept getting asked back (OK, sometimes he invited himself) would seem strong evidence otherwise.

Likewise, when Father Gary addressed the crowd last night, he had everyone's attention and said some really challenging things. Then, we went back to eating, drinking, and dancing.

Oh, and raising several hundred thousand dollars for the student center. That part was kinda important, too. Kate and I always bid on a few of the lesser items (the big-ticket stuff is way out of our price range; I'm not kidding when I say some of these folks are really rich). For example, there's a  really great wine tasting that we always get tickets for. Next year, it's going to be on my birthday so we bought a whole table and that will be my party.

The programmed part of the evening finishes with the direct appeal. Nothing in return, just hold up your auction paddle if you want to kick some more in. This is the most conflicted part of the evening for me. After all, Christ said we were to give in secret. But, if you want your contribution to get the matching funds (which were 2:1 this year), you have to do it at the auction. Fortunately, it's actually pretty anonymous for someone who doesn't have thousands of dollars burning a hole in their pocket. The "bidding" started at 20 grand, so by the time they got down to numbers we could consider, nobody but the auctioneer was paying a lot of attention to who was bidding.


Saturday, November 5, 2016

Java

I'm not sure why I didn't originally write my CISS algorithm in Java. I guess it was just expediency. I needed to do it quickly and I can program faster in C#. More properly, I can program faster in Visual Studio. As I mentioned last year when I was taking my languages class, the programming environment has a lot more to do with productivity than the language. It's not that IntelliJ is bad; I just don't know it very well (OK, I hardly know it at all). So, while I'm happy that Java now has a development environment to rival Visual Studio, until I actually use it enough that I know all the shortcut keys and navigation, it's still going to slow me down.

Anyway, that's my problem and one I need to fix since we're going to be doing a lot more development in Java at work. The stuff I'm writing for school is sufficiently trivial that I can poke my way through it. The code base at work is substantially more complex.

Incidentally, if you're a Java zealot who is reading this and celebrating that another Fortune 500 has left the evil empire and embraced Java, you couldn't be more off-base. My particular group needs to start using Java because that's the native language of Hadoop. I expect this is a temporary situation; there's no reason the .net framework can't run efficiently on Hadoop. However, since the vast majority of Hadoop developers have Java, it makes sense to stick with that for now if only to make staffing easier. Meanwhile, the organization as a whole is still predominantly a .net shop.

Which, of course, brings us to the crux of the matter: it doesn't matter. In the world where people evaluate your performance by how well your code works and how much they had to pay you to write it, there is no place for religious crusades for language purity. Almost all large organizations support a variety of development tools. You use what works. If that means you have to learn something new that isn't implemented quite the way you would like or (even less relevant) isn't implemented by a company you like, tough.

So, this weekend, I'm re-writing CISS in Java so I can run it on Hadoop (and, yes, to make the source code more palatable to academics who would regard the previous paragraph as heresy). That's not really a big deal; the only remotely complicated part of the program is the data layer and even that is pretty straightforward. I haven't yet decided if I want to write a parallel version of it. Obviously, that would be the way to go if this algorithm was ever going to be implemented in production, but it might be time to move on to the next thing rather than optimizing an algorithm that was really only created to demonstrate the larger point that random sampling of financial data is problematic.

Thursday, November 3, 2016

Eigenvalues and differential equations

In my view, the most interesting application of eigenvalues and vectors is Principal Component Analysis, a method for determining the most significant dimensions in a data set. However, since I already wrote about that, I'll take a quick look at some more traditional applications.

One of the classic uses is in solving differential equations when the value at some point is known (these are generally referred to as initial value problems). Here, the formulation is:

yi' = ai1y1 + ai2y2 + ... + ainyn,   i = 1, ..., n

where each yi = fi(t) is a differentiable function over the relevant domain of t.

In vector form, this becomes Y' = AY, and solutions of the form Y = eλtx exist. If λ is an eigenvalue of A with eigenvector x, then AY = eλtAx = λeλtx = λY = Y', so Y = eλtx does satisfy the system. When A has n independent eigenvectors, these functions form a basis for the solution space. To force a unique solution, an additional constraint must be added: setting Y(0) = Y0 pins down the coefficients exactly. This is called the initial value, though it doesn't technically have to occur at time 0, since one can shift the time variable to use a value at any known point.

This holds whether the eigenvalues are real or complex, and it can be extended to higher-order equations by stacking the derivatives into a larger first-order system, which gives A a partitioned (block) structure.
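
As a sketch of how that plays out in code (again assuming the Apache Commons Math library; the matrix and initial value are arbitrary choices of mine), expand Y0 in the eigenbasis and scale each coordinate by eλt:

import org.apache.commons.math3.linear.ArrayRealVector;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.RealVector;

public class LinearOdeByEigen {
    public static void main(String[] args) {
        // Y' = AY with Y(0) = y0; A chosen to have real, distinct eigenvalues.
        RealMatrix A = MatrixUtils.createRealMatrix(new double[][] {
                { 0.0, 1.0 },
                { 6.0, 1.0 }   // eigenvalues 3 and -2
        });
        RealVector y0 = MatrixUtils.createRealVector(new double[] { 1.0, 0.0 });

        EigenDecomposition eig = new EigenDecomposition(A);
        RealMatrix S = eig.getV();                          // eigenvectors as columns
        RealVector c = MatrixUtils.inverse(S).operate(y0);  // y0 in the eigenbasis

        // Y(t) = sum over j of c_j * exp(lambda_j * t) * x_j
        double t = 0.5;
        RealVector y = new ArrayRealVector(A.getRowDimension());
        for (int j = 0; j < A.getRowDimension(); j++) {
            double lambda = eig.getRealEigenvalue(j);
            RealVector xj = eig.getEigenvector(j);
            y = y.add(xj.mapMultiply(c.getEntry(j) * Math.exp(lambda * t)));
        }
        System.out.println("Y(" + t + ") = " + y);
    }
}

This only works as written when A has real, distinct eigenvalues; complex or repeated eigenvalues take more care.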


Wednesday, November 2, 2016

Eigenvalues and Eigenvectors

Back to linear algebra for a bit. Eigenvalues and eigenvectors are linchpins for much of big data analysis, so I really should know this stuff absolutely cold, not just for the Q and possibly research, but for my day job as well. Of course, since so much of the actual work is now done by off-the-shelf packages, one can be pretty naive about how it all works and still get results. However, I don't want to be that guy.

Today, I'll just post the basics.

If A is an n×n matrix, then a scalar λ is an eigenvalue of A if there exists a nonzero vector x such that Ax = λx. In that case, x is an eigenvector corresponding to λ.

Such a λ and x exist only if (A - λI)x = 0 has a nonzero solution, which means A - λI is singular, i.e., |A - λI| = 0. Expanding that determinant gives the characteristic polynomial of A: p(λ) = |A - λI|. The roots of that polynomial are the eigenvalues of A.

OK, finding roots of a polynomial. We did that in 9th grade. What's the catch? Well, mainly that it doesn't work. Finding the roots of higher-order polynomials is A LOT harder than High School Algebra texts would have you believe: there's no general closed-form solution above degree four, and computing the characteristic polynomial and then chasing its roots numerically is unstable unless you specifically constructed the polynomial to have an easy solution. Real-world polynomials don't have easy solutions. So, we're off to the realm of numerical approximations (which explains why so many people just fire up the software package from the get-go and don't bother asking questions).

And, that really is the way to get actual solutions. However, understanding what's going on is a big plus when you're trying to figure out what problem you want solved. We'll look at a few applications over the next few days.
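
For a taste of what those numerical approximations look like, here's a bare-bones power iteration in Java. It's a toy example of my own, not what any real package actually does (production solvers use sturdier methods such as the QR algorithm), and it only approximates the dominant eigenvalue/eigenvector pair by repeatedly multiplying by A and re-normalizing:

public class PowerIteration {
    // Power iteration: approximates the dominant eigenvalue and eigenvector
    // of a square matrix by repeated multiplication and re-normalization.
    public static void main(String[] args) {
        double[][] A = { { 2.0, 1.0 }, { 1.0, 3.0 } };  // eigenvalues (5 ± sqrt(5))/2
        double[] x = { 1.0, 1.0 };                      // arbitrary nonzero start
        double lambda = 0.0;

        for (int iter = 0; iter < 100; iter++) {
            double[] y = multiply(A, x);
            double norm = Math.sqrt(dot(y, y));
            for (int i = 0; i < y.length; i++) y[i] /= norm;  // re-normalize
            lambda = dot(y, multiply(A, y));                  // Rayleigh quotient
            x = y;
        }
        System.out.println("dominant eigenvalue ~ " + lambda);  // ~3.618
    }

    static double[] multiply(double[][] A, double[] v) {
        double[] out = new double[A.length];
        for (int i = 0; i < A.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += A[i][j] * v[j];
        return out;
    }

    static double dot(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}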


Tuesday, November 1, 2016

More Inequalities

Finishing up probability theory a day late. Here are some named inequalities that I'll have to memorize:

Holder's inequality: |E(XY)| ≤ E|XY| ≤ (E|X|^(1/p))^p (E|Y|^(1/(1-p)))^(1-p), for p ∈ (0,1).

When p = 1/2, this is the Cauchy-Schwarz inequality, which I've already written about. Centering X and Y about their respective means and keeping p = 1/2 yields:

Covariance inequality: (Cov(X, Y))^2 ≤ σX^2 σY^2

Another sometimes useful case which can be fairly easily derived from Holder is:

Liapounov's inequality: (E|X|^r)^(1/r) ≤ (E|X|^s)^(1/s), 1 < r < s

Similar looking, but proven differently is:

Minkowski's inequality: (E|X+Y|^p)^(1/p) ≤ (E|X|^p)^(1/p) + (E|Y|^p)^(1/p), p ≥ 1.

Jensen's inequality: Eg(X) ≥ g(EX) for any convex function g.
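
These don't really need code, but a quick Monte Carlo sanity check is a handy way to convince yourself you've memorized the direction of each inequality. Here's a sketch of my own (simulated normals, arbitrary sample size) that spot-checks Cauchy-Schwarz and Jensen:

import java.util.Random;

public class InequalityCheck {
    // Quick Monte Carlo sanity check (not a proof) of Cauchy-Schwarz and
    // Jensen, using simulated normal draws.
    public static void main(String[] args) {
        Random rng = new Random(42);
        int n = 1_000_000;
        double sumX = 0, sumX2 = 0, sumY2 = 0, sumXY = 0, sumExpX = 0;

        for (int i = 0; i < n; i++) {
            double x = rng.nextGaussian();
            double y = 0.5 * x + rng.nextGaussian();  // correlated with x
            sumX += x;
            sumX2 += x * x;
            sumY2 += y * y;
            sumXY += x * y;
            sumExpX += Math.exp(x);                   // g(x) = e^x is convex
        }

        // Cauchy-Schwarz: (E[XY])^2 <= E[X^2] E[Y^2]
        double lhsCS = Math.pow(sumXY / n, 2);
        double rhsCS = (sumX2 / n) * (sumY2 / n);
        System.out.println("Cauchy-Schwarz: " + lhsCS + " <= " + rhsCS);

        // Jensen: E[g(X)] >= g(E[X]) for convex g
        double lhsJ = sumExpX / n;
        double rhsJ = Math.exp(sumX / n);
        System.out.println("Jensen: " + lhsJ + " >= " + rhsJ);
    }
}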