Well, I said I'd give it a shot a week ago, so here it is. I'm going to take a stab at explaining Principal Component Analysis in my own words. The concept is not brand new to me, but the method suggested in Aggarwal is more elegant than decomposing sums of squares, which is how I've seen it presented before.
For starters, we have a data matrix D with n rows (observations) and d columns (attributes). For the moment, we're assuming real-valued attributes, though there are ways around that assumption. Since our goal is to transform this data into a lower-dimensional linear subspace, we start by mean-centering all the attributes; that is, we subtract from each row μ, the d-dimensional mean vector. This translates the data from an affine subspace to a linear subspace, so we can build a basis for the rows. That sets up the next trick: projecting the data onto an orthonormal basis of lower dimension while preserving as much variability in the data as possible.
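Here's a rough sketch of that centering step in NumPy. The data, sizes, and variable names are all just illustrative assumptions on my part, not anything from Aggarwal:

import numpy as np

# Synthetic stand-in for the data matrix D: n = 100 observations, d = 5 attributes.
rng = np.random.default_rng(0)
D = rng.normal(size=(100, 5))

mu = D.mean(axis=0)   # the d-dimensional mean vector
Dc = D - mu           # mean-center: subtract mu from every row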
That last statement is often a sticking point for folks. Don't we really want to reduce variability? No. We want to reduce residual variability. That is, we want to have as little variance between the data and the model as possible. The total variability is what it is, so to reduce residual variability, we have to maximize the variability accounted for by the model itself. Thus we are looking for a basis that spreads the data out as much as possible, while reducing the number of dimensions.
OK, back to the technique. Let C be the covariance matrix for D. That is, c_ij is the covariance between the ith and jth dimensions. Plugging in the formula for the covariance and coming up with C = DᵀD/n − μᵀμ (treating μ as a row vector, so μᵀμ is its d×d outer product; if D has already been mean-centered, that second term vanishes) will be left as the proverbial exercise for the reader. Now, here's the cool part. C is symmetric and positive semidefinite. Therefore, we can diagonalize it as C = PΛPᵀ, where P is orthogonal and Λ is diagonal.
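Continuing the sketch above (same assumed variables), this is roughly what the covariance and its diagonalization look like in NumPy:

n = D.shape[0]

# Covariance from the raw data via C = D'D/n - mu'mu (outer product)...
C = D.T @ D / n - np.outer(mu, mu)

# ...which matches the simpler form on the centered data, C = Dc'Dc/n.
assert np.allclose(C, Dc.T @ Dc / n)

# C is symmetric positive semidefinite, so eigh diagonalizes it: C = P Lambda P'.
eigenvalues, P = np.linalg.eigh(C)   # eigh returns eigenvalues in ascending order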
What have we done here? We've busted the covariance into two parts: an orthonormal set of directions (the eigenvectors, which are the columns of P) and the variance along each of those directions (the diagonal elements of Λ, which are also the eigenvalues). Now rank the eigenvectors by the magnitude of their eigenvalues and pick the top however many you want as a basis. Blam! You're done. You've now picked the optimal k < d vectors to span your space while minimizing the residual variance (which is just the sum of the remaining eigenvalues).
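And the final ranking-and-projection step, again just a sketch with an arbitrary choice of k:

# Rank eigenvectors by eigenvalue (descending) and keep the top k as the basis.
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]
P = P[:, order]

k = 2                  # hypothetical choice of reduced dimension
basis = P[:, :k]       # d x k orthonormal basis

reduced = Dc @ basis   # project the centered data down to n x k

# Residual variance is the sum of the discarded eigenvalues;
# the retained fraction is the complement.
residual = lam[k:].sum()
retained = lam[:k].sum() / lam.sum()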
I'm going to have to try this out on some of our data at work. What I'd really like to see is if the reduced basis is pretty constant across all the non-numeric attributes, or if very different spaces emerge for different slices of business. My guess is the latter, which would imply that rolling a bunch of the non-numeric data into the analysis could further improve the model.
Where I'm trying to go with all this is to build a model from actuals that gives a reasonably solid prior, which can then be used to improve the convergence of the posterior distribution when the projection data is sampled.