Wednesday, August 2, 2017

Implementing Randomized Matrix Algorithms

This is purely background, but important background for my sub-sampling work.

Reference: Yang, Meng, and Mahoney, "Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments," arXiv:1502.03032v2 [cs.DC].

The paper looks at how one implements RandNLA (Randomized Numerical Linear Algebra) on distributed systems (such as Hadoop, which is my implementation target, though I'm keeping the theory general enough to be ported to any horizontally scalable platform).

The paper has a good presentation of the basic concepts of RandNLA, as well as covering the eigenvector stuff used for axis rotation and dimension reduction.
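To make the basic RandNLA idea concrete, here's a minimal sketch (my own illustration, not code from the paper) of the canonical sub-sampling estimator: approximate a matrix product by importance-sampling a handful of column/row outer products and rescaling so the estimate is unbiased.

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_matmul(A, B, c):
    """Estimate A @ B from c sampled column/row outer products.

    Pairs (A[:, j], B[j, :]) are drawn with probability proportional to
    the product of their norms, then each sampled term is rescaled by
    1 / (c * p_j) so the estimator is unbiased.
    """
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=c, p=p)
    return (A[:, idx] / (c * p[idx])) @ B[idx, :]

# Sampling only pays off when the terms contribute unevenly, so give
# the column/row pairs geometrically decaying scales.
w = 0.9 ** np.arange(200)
A = rng.standard_normal((50, 200)) * w
B = rng.standard_normal((200, 40)) * w[:, None]

exact = A @ B
approx = approx_matmul(A, B, c=100)
# Frobenius-norm relative error of the sampled estimate
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

Note the caveat baked into the example: on perfectly incoherent data (uniform norms), sampling half the terms buys you very little; the estimator shines when a minority of rows/columns carry most of the mass, which is exactly the situation the importance-sampling probabilities are designed to exploit.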

They also draw a distinction that I hadn't really thought much about: the difference between Monte Carlo and Las Vegas algorithms. In a Monte Carlo algorithm, you fix the computational cost and accept a certain risk that your answer will fall outside the confidence bounds you are looking for. Las Vegas algorithms do the opposite: they guarantee an error bound, but you may pay for it with an arbitrarily long run time.

This distinction is important in my application space because, most of the time, a randomized sampling algorithm would be used for exploratory research. Here, a bogus result is acceptable, since any significant decision arising from it would be confirmed against at least one other sample, if not the entire data set. On the other hand, if these queries were to be used for regulatory purposes, you would need to guarantee the accuracy. Currently, sampling is not an accepted practice for regulatory reporting, but that could change.

The authors also present solution techniques for both the low-precision and high-precision regimes.

Unfortunately, the bulk of the paper is devoted to regression problems, which aren't really what I'm after. It's still a good reference, though.

