Saturday, December 24, 2016

Fisher Information

It may start to look like all of statistics was invented by Ronald Fisher. If one is talking about frequentist statistics, that would not be too far from the truth. Just as Newton did with Calculus, Fisher took a field that wasn't going anywhere for reasons of intractability and figured out how to make it useful by focusing on a tighter problem and leaving the nasty edge cases for others. Now that the computational resources exist to get around a lot of the thorny problems that lack closed-form solutions, the work of Bayes and Laplace, who preceded Fisher, is gaining traction. But, back to Fisher.

As with most experimenters, Fisher was less interested in "truth" and more interested in making good decisions based on data. Others could work out why genetics (his primary field) operated the way it did; he was more interested in how to make it happen. As is also common among experimenters, Fisher had to deal with budget constraints. So, the obvious question: how do I get the most usable information for my money?

The idea that some estimators are better than others is obvious enough. Even little kids quickly figure out that just because a baseball player strikes out once, it doesn't mean they can't hit a ball. You have to see at least a few at bats to know that. But, once you're past that, you still have all sorts of estimators of ability. Should you look at batting average, slugging percentage, on base percentage, runs batted in, home run total, or (yes, people actually track these things) how often the opposing manager calls for a pitching change when this batter comes up?

Well, a big part of the problem with baseball stats is that nobody can agree on what they are actually trying to estimate. "Ability" is a pretty vague term. However, if you can actually quantify a parameter, then Fisher does have a pretty good technique for determining how well a statistic estimates that parameter. And, it's named after him: Fisher Information.

$$\mathcal{I}(\theta) \;=\; \operatorname{E}\!\left[\left(\frac{\partial}{\partial\theta}\,\log f(X;\theta)\right)^{2}\,\middle|\;\theta\right]$$

It's probably not intuitively obvious what that quantity is telling us, so let's break it up into pieces.

First, we have f(X; θ). That's the probability density of X for a given θ. Taking the log gives two nice properties. The first is that the information becomes additive. That is, the information from two independent, identically distributed random variables is twice the information from one. The second is that, under reasonable regularity conditions (there he goes again, just chucking the theoretically messy stuff to get something that works for every real-world case you can imagine), the expected value of the first derivative of the log density (also called the score) is zero.
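
To make those two properties concrete, here's a quick numerical sketch. The Bernoulli coin-flip model is my own choice of example, not anything from the original derivation; the point is just that the average score at the true θ sits near zero, and that the information from two independent flips is twice the information from one.

```python
import numpy as np

# Toy check (assumed example): Bernoulli(theta) coin flips.
# log f(x; theta) = x*log(theta) + (1 - x)*log(1 - theta)
# score           = d/dtheta log f = x/theta - (1 - x)/(1 - theta)

rng = np.random.default_rng(0)
theta = 0.3
x1 = rng.binomial(1, theta, size=200_000)
x2 = rng.binomial(1, theta, size=200_000)

def score(x, theta):
    return x / theta - (1 - x) / (1 - theta)

s_one = score(x1, theta)                      # score of a single flip
s_two = score(x1, theta) + score(x2, theta)   # log of a joint density adds, so the score adds

print("E[score], one flip:", s_one.mean())        # ~ 0
print("info, one flip    :", np.mean(s_one**2))   # ~ 1/(theta*(1-theta)) = 4.76
print("info, two flips   :", np.mean(s_two**2))   # ~ twice the single-flip value
```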

A consequence of the latter property is that the second moment of the score (that is, the Fisher Information) is equal to the variance of the score. So, what the Fisher Information is really measuring is how much the score, and with it the shape of the likelihood, varies with the observation. If that variance is high, the value of the observation moves the likelihood of θ around quite a bit. If the variance is low, the observation doesn't do much to inform our belief about θ. Well, OK, Fisher was a frequentist, so I'll put it in those terms: the variance tells us how much we should adjust our estimate based on the data.
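
As a sanity check on that reading, here's another small sketch, this time with a Normal(θ, σ²) model where σ is known (again, my example, not the post's). The score works out to (x − θ)/σ², so its variance is 1/σ²: a sharply peaked density (small σ) gives a high-variance score, which is exactly the "each observation tells us a lot" case.

```python
import numpy as np

# Toy check (assumed example): Normal(theta, sigma^2) with sigma known.
# score = d/dtheta log f(x; theta) = (x - theta) / sigma^2
# Var(score) should match the closed-form Fisher Information, 1/sigma^2.

rng = np.random.default_rng(1)
theta = 2.0
for sigma in (0.5, 2.0):
    x = rng.normal(theta, sigma, size=200_000)
    s = (x - theta) / sigma**2
    print(f"sigma={sigma}: Var(score)={s.var():.3f}  vs  1/sigma^2={1/sigma**2:.3f}")
```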

If the density is twice differentiable, this becomes more obvious. In that case, you can decompose the second moment to get:

$$\mathcal{I}(\theta) \;=\; -\operatorname{E}\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\,\log f(X;\theta)\,\middle|\;\theta\right]$$

That is, the Fisher Information is the negative expectation of the second derivative of the support curve at the maximum likelihood estimate of θ. A steep negative second derivative means that the support drops off very quickly on both sides of the maximum likelihood value. That is, the maximum likelihood is a lot higher than any other likelihood.
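
The equivalence of the two forms is easy to check numerically. Here's a sketch using a Poisson(θ) model (my choice of example): the second moment of the score and the negative expected second derivative should both land on the closed-form information, 1/θ.

```python
import numpy as np

# Toy check (assumed example): Poisson(theta).
# log f(x; theta)   = x*log(theta) - theta - log(x!)
# score             = x/theta - 1
# second derivative = -x/theta**2
# Both E[score^2] and -E[second derivative] estimate the Fisher
# Information, whose closed form here is 1/theta.

rng = np.random.default_rng(2)
theta = 4.0
x = rng.poisson(theta, size=200_000)

score = x / theta - 1
d2 = -x / theta**2

print("E[score^2]          :", np.mean(score**2))
print("-E[d2 log f]        :", -np.mean(d2))
print("closed form, 1/theta:", 1 / theta)
```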

Applied to a single random variable, this is little more than a mathematical curiosity. However, because of the additive property, we can also apply it to samples of independent observations. That is helpful: it tells us how much marginal information we get when we increase the sample size or use an alternative statistic to estimate a parameter (for example, the median rather than the mean).
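
For instance, here's a quick simulation (my own toy example) of that last comparison: with Normal data, the sample mean captures essentially all of the available information (variance about σ²/n), while the sample median throws some away (variance about πσ²/(2n)), so the ratio of their variances should come out near π/2 ≈ 1.57.

```python
import numpy as np

# Toy comparison (assumed example): efficiency of the sample mean vs. the
# sample median as estimators of the center of a Normal distribution.
# Asymptotically, Var(mean) ~ sigma^2/n and Var(median) ~ pi*sigma^2/(2n).

rng = np.random.default_rng(3)
theta, sigma, n, reps = 0.0, 1.0, 101, 20_000

samples = rng.normal(theta, sigma, size=(reps, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print("Var(mean)  :", means.var())
print("Var(median):", medians.var())
print("ratio      :", medians.var() / means.var())   # ~ pi/2 ~ 1.57
```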
