Sunday, January 31, 2016

Centered

Well, my intent was to write up the eigenvalue stuff for Data Mining today but the production support work yesterday messed up my schedule a bit. I ended up taking today pretty much off rather than stress myself out this early in the semester. It's now 9:30PM and I should get to sleep so I'll just drop a little pearl that anybody who's taken a stats course has probably heard, but the application of it doesn't become clear until you start working with large data sets.

Data can always be centered. That center may be very artificial, but you can always transform it so that it does have a center. In the vast majority of cases, the sample mean is used. Of course, one could just as easily use the median or the mode. Why the mean? It comes down to how we measure variability.

The variability of data is usually measured by using a sum of squares. There are good reasons to do this when you get to higher dimensions. If you're projecting a set of points in, say, 3-space, onto a 2-dimensional plane, the projection lies in the plane and the error is a vector perpendicular to it. As even the scarecrow can tell you at the end of The Wizard of Oz, perpendicular projections mean you square the sides to get the magnitudes to add up.
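Here's a quick sketch of that squared-sides bookkeeping for a single point. The point, the plane's normal vector, and the helper names (`dot`, `project_onto_plane`) are all made up for illustration, and the plane is assumed to pass through the origin.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_plane(p, normal):
    """Split p into a part lying in the plane and a part perpendicular to it.

    The plane is assumed to pass through the origin and is described by
    its normal vector (which need not be unit length).
    """
    scale = dot(p, normal) / dot(normal, normal)
    err = tuple(scale * n for n in normal)         # perpendicular error vector
    proj = tuple(a - e for a, e in zip(p, err))    # projection lying in the plane
    return proj, err

# Hypothetical point and plane normal, chosen only for illustration.
p = (3.0, 4.0, 12.0)
normal = (1.0, 2.0, 2.0)

proj, err = project_onto_plane(p, normal)

# The two pieces are perpendicular, so their squared lengths add up
# to the squared length of the original point.
print("proj . err        =", dot(proj, err))
print("|p|^2             =", dot(p, p))
print("|proj|^2+|err|^2  =", dot(proj, proj) + dot(err, err))
```

The lengths themselves don't add; only their squares do, which is why sums of squares are the natural thing to carry around.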

So, since the point of a center is to minimize the variability between the data and the center, you solve for the minimum sum of squares and that turns out to be the sample mean. If, instead, you just used absolute (linear) distance rather than squared distance and minimized that, you'd get the median. And, if all you cared about was whether things were equal or not (that is, your distance from the center is zero if you are at the center and 1 if you're not) then your minimum variability is around the mode.
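If you'd rather see it than solve for it, a brute-force check works on a small sample. This is just a sketch: the data set and the grid of candidate centers are made up, and the grid search only finds the minimizer because the true answers happen to fall on it.

```python
import statistics

# Made-up sample where mean, median, and mode all differ.
data = [1, 2, 2, 3, 4, 7, 9]

def squared_loss(center, xs):
    """Sum of squared distances from the center."""
    return sum((x - center) ** 2 for x in xs)

def absolute_loss(center, xs):
    """Sum of absolute (linear) distances from the center."""
    return sum(abs(x - center) for x in xs)

def zero_one_loss(center, xs):
    """Count of points that are not exactly at the center."""
    return sum(0 if x == center else 1 for x in xs)

# Candidate centers: 0.0, 0.25, 0.5, ..., 10.0 (exact in floating point).
candidates = [i / 4 for i in range(41)]

best_squared = min(candidates, key=lambda c: squared_loss(c, data))
best_absolute = min(candidates, key=lambda c: absolute_loss(c, data))
best_zero_one = min(candidates, key=lambda c: zero_one_loss(c, data))

print("squared loss  ->", best_squared, "| mean  :", statistics.mean(data))
print("absolute loss ->", best_absolute, "| median:", statistics.median(data))
print("0/1 loss      ->", best_zero_one, "| mode  :", statistics.mode(data))
```

Same data, three different "centers"; the only thing that changed was the distance measure.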

It's hard to see why this matters in a stats class. Everybody knows we're going to use the mean. But, when analyzing big data sets, your distance measure may be all sorts of weird things. Certainly, the 0/1 measure that yields the mode is a very common thing to care about and finding the most frequent value (i.e., the mode) is a very common database operation. Medians also get used a lot; knowing the point that cuts your data in half is useful in many situations. It's helpful to keep in mind that the answers aren't random - they result from the conscious choice of how to measure things.
