Here's a somewhat related take on sampling. The emphasis here is on how to actually perform the least squares computation efficiently. In stats class, this is done with QR factorization. That's an O(mn^2) algorithm, where m is the number of rows (one data point, i.e. one constraint, per row) and n is the number of unknowns. When the number of rows goes into the trillions, even that single factorization turns into a really, really big computation.
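For concreteness, here's what that textbook QR route looks like in a few lines of NumPy. The sizes are made up for illustration and nothing here is from the paper; it's just the standard dense solve that breaks down once m gets huge.

```python
# Minimal sketch of least squares via QR, the textbook approach.
# Matrix sizes are illustrative only.
import numpy as np

m, n = 10_000, 50                      # m rows (data points) >> n columns (unknowns)
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Thin QR costs O(m n^2): fine here, painful when m is in the trillions.
Q, R = np.linalg.qr(A, mode="reduced")
x_qr = np.linalg.solve(R, Q.T @ b)     # back-substitute R x = Q^T b

# Sanity check against NumPy's built-in least squares solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_ls))
```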
Reference: Iyer et al., "A randomized least squares solver for terabyte-sized dense overdetermined systems," Journal of Computational Science (2016), doi: 10.1016/j.jocs.2016.09.007.
Iyer et al. look at ways to sample the data and then perform the regression on the sample. This is part of a whole field that goes by RandNLA (Randomized Numerical Linear Algebra). The basic idea is to transform the data into a much smaller problem whose solution provably approximates the solution of the original problem with high probability.
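As a toy illustration of that idea, here's the generic sketch-and-solve recipe with a plain Gaussian sketch. This is not the paper's algorithm (they use a fancier transform and a preconditioned iterative solver); it just shows the "compress the rows, solve the small problem" pattern. All names and sizes are mine.

```python
# Toy sketch-and-solve: compress the rows with a random sketch,
# solve the small problem, and compare against the exact solution.
import numpy as np

rng = np.random.default_rng(1)
m, n = 20_000, 40
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.01 * rng.standard_normal(m)

# Gaussian sketch S with s << m rows; S A and S b define a much smaller problem.
s = 10 * n
S = rng.standard_normal((s, m)) / np.sqrt(s)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# How much worse is the sketched residual than the exact one? (close to 1.0)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_sketch - b) / np.linalg.norm(A @ x_exact - b))
```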
Transforms that have been tried include the randomized Hadamard and randomized Fourier transforms; Blendenpik is a solver built on top of transforms like these, and it's the approach this paper scales up. The algorithm is implemented on a 5210-node supercomputer and it appears to work pretty well.
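My rough understanding of the Blendenpik-style recipe: sketch the matrix, QR-factor the sketch, and use the resulting R as a right preconditioner for LSQR on the full problem. The sketch below uses a Gaussian sketch as a cheap stand-in for the randomized Hadamard/Fourier mixing and SciPy's LSQR; it's an illustration of the idea under those assumptions, not the paper's implementation.

```python
# Blendenpik-style preconditioning (rough sketch, not the paper's code):
# sketch A, QR the sketch, run LSQR on the preconditioned operator A R^{-1}.
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(2)
m, n = 20_000, 40
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Gaussian sketch as a stand-in for the randomized Hadamard/Fourier transform.
s = 4 * n
SA = (rng.standard_normal((s, m)) / np.sqrt(s)) @ A
_, R = np.linalg.qr(SA, mode="reduced")

# A R^{-1} is close to orthonormal, so LSQR converges in few iterations.
op = LinearOperator(
    (m, n),
    matvec=lambda v: A @ solve_triangular(R, v, lower=False),
    rmatvec=lambda u: solve_triangular(R, A.T @ u, trans="T", lower=False),
    dtype=np.float64,
)
y = lsqr(op, b, atol=1e-10, btol=1e-10)[0]
x = solve_triangular(R, y, lower=False)   # undo the preconditioner
print(np.linalg.norm(A.T @ (A @ x - b)))  # normal-equations residual, ~0
```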
Most of the paper is straight computational considerations, primarily efficiency and preservation of accuracy. Both are extremely important when doing large matrix operations, though not so much in my problem domain. The authors spend a little time on the randomized transform, but not enough to use the paper as a primary reference. I'll need to find better references on Blendenpik as well as the other transforms.