Books have been written about this distribution. But there are really just a few things you have to know:
- It's called "normal" because it really does come up a lot. Why? Because of the Central Limit Theorem. That's a sufficiently important result to warrant it's own post, but the gist is that if you average together enough random variables, the result is normally distributed. There are some very important caveats, but it holds in the vast majority of cases.
- Even when you're not averaging things, it's often a good fit: symmetric, unimodal, smooth, with long but very light tails. That last bit means you can often model quantities with a finite range as normal, because the part of the tails that gets chopped off carries almost no probability.
- Even though the cumulative distribution function has no closed form, it's relatively easy to work with. And when you do need an actual number rather than a formula, pretty much every stats book has tables in the appendix for looking up points on the CDF, or you can let software do the lookup (see the second sketch below).
- Not only does it have all its moments, the first two are actually the parameters: mean μ and variance σ².
- "Normality" is preserved under linear combinations of normal random variables, and the mean and variance are pretty much what you would expect them to be (the means and standard deviations are simply the same linear combinations of the parameters μ and σ).
It all adds up to an incredibly versatile distribution. There's just one catch: sometimes it's completely wrong, and wrong in ways that aren't easy to see. Violating the normality assumption can have catastrophic effects on the validity of an analysis. Much of my research is motivated by the fact that there really are data sets out there that do not converge to normal, no matter how much you sample them. While such atrocities will not likely surface on the Q, they show up in my work data all the time. I'll write more about them tomorrow.