Wednesday, November 23, 2016

Median

Statisticians are addicted to the mean. Sure, it has nice mathematical properties, assuming you're using a Euclidean distance metric, but that's about all it has going for it. Consider the following scenarios:

  • Ten people are in a room. Donald Trump walks in. What just happened to the "average" wealth of the people in the room? If you use the mean, it went up by several hundred million dollars. That's obviously nonsense. Only one person is affected by all that money. The "average" person in the room hasn't seen any change. The median (which may move ever so slightly) and the mode (which won't move at all) both reflect that reality.
  • The Powerball jackpot gets so high that the payout for a drawing is greater than the total ticket sales. That means your expected gain on a ticket is positive. You should buy as many as possible, right? No, of course not. Even if you bought millions of tickets, your most likely outcome is to lose nearly everything you spent. For any single ticket, the median and mode both point to a total loss. Only the mean hints at the fairy tale.
  • Exercise is generally good for you, but I don't know any serious athletes who haven't injured themselves. Most have suffered rather serious injuries. Some have died. That might be enough to dissuade some from participating, but most people accept that a few nasty incidents aren't enough to outweigh a small, but real, improvement in general well-being.
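
The first scenario is easy to check numerically. Here is a minimal sketch in Python using the standard `statistics` module. The ten net worths and the newcomer's fortune are made-up figures chosen for illustration, not real data.

```python
from statistics import mean, median, mode

# Hypothetical net worths in dollars (invented for this illustration).
room = [40_000, 55_000, 55_000, 60_000, 72_000,
        80_000, 95_000, 110_000, 150_000, 300_000]
newcomer = 3_000_000_000  # assumed billionaire fortune, not a sourced figure

def summarize(values):
    """Return the three common 'averages' of a list of numbers."""
    return {"mean": mean(values), "median": median(values), "mode": mode(values)}

before = summarize(room)
after = summarize(room + [newcomer])

# Print each average before and after the newcomer walks in.
for name in ("mean", "median", "mode"):
    print(f"{name:>6}: {before[name]:>14,.0f} -> {after[name]:>14,.0f}")
```

With these invented numbers the mean jumps by roughly 270 million dollars, the median shifts from 76,000 to 80,000, and the mode stays at 55,000.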

The "correct" average to use depends entirely on your cost function. The mean minimizes squared error; the median minimizes absolute error. Squaring the error generalizes nicely to higher dimensions, but it's not really the way we tend to operate in our daily lives. If someone consistently arrives on time, I consider them punctual. I don't change that opinion on the day they get stuck in traffic and are an hour late. We dismiss outliers all the time without even thinking about it. That's because, internally, we're thinking more about medians and modes than means. Means are very sensitive to outliers; medians and modes are not.
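
To make the link between cost function and average concrete, here is a rough sketch that brute-forces the best "center" under three different costs. The delay data is hypothetical (a punctual person with one traffic day), and the helper names `squared_loss`, `absolute_loss`, and `miss_loss` are mine, not standard functions. A grid search is only for demonstration; it is not how you would compute these in practice.

```python
from statistics import mean, median, mode

# Hypothetical arrival delays in minutes: usually on time, one hour late once.
delays = [0, 0, 0, 1, 2, 7, 60]

def squared_loss(c, xs):
    """Sum of squared distances from the candidate center c."""
    return sum((x - c) ** 2 for x in xs)

def absolute_loss(c, xs):
    """Sum of absolute distances from the candidate center c."""
    return sum(abs(x - c) for x in xs)

def miss_loss(c, xs):
    """All-or-nothing cost: every value other than c is one full miss."""
    return sum(1 for x in xs if x != c)

# Try every whole-minute candidate between 0 and 60 and keep the cheapest.
candidates = range(0, 61)
best_squared = min(candidates, key=lambda c: squared_loss(c, delays))
best_absolute = min(candidates, key=lambda c: absolute_loss(c, delays))
best_miss = min(candidates, key=lambda c: miss_loss(c, delays))

print(best_squared, mean(delays))     # squared error picks the mean
print(best_absolute, median(delays))  # absolute error picks the median
print(best_miss, mode(delays))        # all-or-nothing picks the mode
```

For this sample the squared-error winner is 10 minutes, which describes none of the actual arrivals, while the absolute-error and all-or-nothing winners are 1 and 0 minutes.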

The mode is probably the most intuitively obvious "average". It's also the correct one if you're using an all-or-nothing distance metric. That is, if a miss is a miss, regardless of how close it is, the mode is the natural center of the distribution. There aren't too many cases of that; close counts in more than just horseshoes and hand grenades. But the mode is still a very easy concept: it's simply the most likely outcome. The problem is that there are many distributions, such as exponential waiting times, where the most likely outcome is way over on one side of the distribution. Also, if a distribution has a relatively flat top, estimating the mode from a sample is dicey, as the most frequent result could be observed anywhere along the flat upper portion.

So, the median is pretty much the way to go for everyday life. It's also easy to understand: half the time you're above it, half the time you're below it. And it's very easy to estimate; just sort your sample and grab the point in the middle. It's also very stable. You can be pretty sloppy in your sampling and still get the median right.
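
The "sort and grab the middle" recipe can be sketched in a few lines of Python. The function name `sample_median` and the data are hypothetical; for an even-sized sample this version takes the midpoint of the two middle values, which is one common convention.

```python
def sample_median(values):
    """Median by sorting: the middle item, or the midpoint of the two middle items."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical sample: readings 1 through 99, then the same sample
# with one wildly wrong measurement mixed in.
clean = list(range(1, 100))
sloppy = clean + [1_000_000]

print(sample_median(clean), sample_median(sloppy))
print(sum(clean) / len(clean), sum(sloppy) / len(sloppy))
```

One bad reading moves the median from 50 to 50.5, while the mean goes from 50.0 to 10049.5.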
