Reference: Ping Ma & Xiaoxiao Sun: Leveraging for big data regression. WIREs Comput Stat 2014. doi: 10.1002/wics.1324
Don't let the Chinese names fool you. These guys are at the University of Georgia and they write quite well in English. The "paper" (I have no idea if Wiley peer reviews this publication, but it seems legit) focuses on applying linear models to very large data sets. While I have no particular interest in that, if you can fit a linear model, you can obviously estimate a mean (or, at least, a median), and that I do want to do. In fact, this paper is basically the theoretical validation of CISS that I was looking for.
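(My gloss, not something the paper spells out:) the mean/median claim is just the intercept-only special case of a linear model. In LaTeX shorthand,

\[
\hat{\beta}_0^{\mathrm{OLS}} = \arg\min_{\beta_0}\sum_i (y_i - \beta_0)^2 = \bar{y},
\qquad
\hat{\beta}_0^{\mathrm{LAD}} = \arg\min_{\beta_0}\sum_i |y_i - \beta_0| = \mathrm{median}(y_i),
\]

so any subsampling scheme that can fit a linear model can be pointed at a mean (squared loss) or a median (absolute loss) with no extra machinery.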
Ordinary Least Squares is really unstable when the data come from heavy-tailed distributions. By intentionally biasing the subsample toward the tails (and, of course, correcting for that bias via Weighted Least Squares), you get better convergence. Hooray, I already knew that. Some of their proofs are useful, though; in particular, they show the resulting estimators are unbiased. Since showing that the estimator of a mean is unbiased is pretty obvious, I had done that a little less formally in my paper. Rather than beef up those arguments, I can leave them as is and just cite this as backup.
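Here's a minimal sketch of the weighting idea as I read it: sample points with probability proportional to their distance from a rough center (i.e., deliberately over-sample the tails), then undo that bias with inverse-probability weights. The sampling rule and every name below are my own illustrative choices, not the paper's algorithm and not CISS itself:

import numpy as np

rng = np.random.default_rng(0)

# Heavy-tailed "population": Student's t with 2 degrees of freedom.
y = rng.standard_t(df=2, size=100_000)

def weighted_subsample_mean(y, n, rng):
    # Bias sampling toward the tails: probability proportional to the
    # distance from the median (plus a tiny floor so no point gets
    # probability exactly zero).
    scores = np.abs(y - np.median(y)) + 1e-6
    p = scores / scores.sum()
    idx = rng.choice(len(y), size=n, replace=True, p=p)
    # Inverse-probability weights undo the deliberate bias.
    w = 1.0 / (len(y) * p[idx])
    return np.mean(w * y[idx])

estimates = [weighted_subsample_mean(y, n=500, rng=rng) for _ in range(200)]
print("population mean:         ", y.mean())
print("average of 200 estimates:", np.mean(estimates))  # agrees closely

The same 1/p weights are what turn ordinary least squares on a biased subsample into the weighted least squares the paper analyzes: the estimator stays unbiased, and only its variance depends on how cleverly the sampling probabilities are chosen. (This toy samples with replacement, which is the easy case; CISS samples without replacement, hence the extra bookkeeping mentioned below.)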
More importantly, they reference two other lines of inquiry that will get put on the must-read list, plus another that is probably only marginally relevant but might be good to know about anyway:
Priority papers:
Drineas & Mahoney: Looking at the computational efficiency of sub-sampled regression. Two papers cited: Sampling algorithms for l2 regression and applications. Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, Miami, 2006; and Faster least squares approximation. Numer Math 2010.
Ma & Mahoney: Generalizing the above results. Again, two papers: A statistical perspective on algorithmic leveraging. arXiv e-prints, 6/2013; and the same title (same paper?) in Proceedings of the 31st International Conference on Machine Learning, Beijing, 2014.
Background:
Hansen & Hurwitz: Defines an estimator based on a weighted sample where the weight is the inverse of the probability of the point being selected (this is sort of what CISS uses, though since CISS samples without replacement, there's a little more to it; see the formula after this list). The reference is old, which means it's probably one of those results that you simply have to cite in a complete literature review: On the theory of sampling from a finite population. Ann Math Stat 1943, 14:333-362.
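For my own notes, the estimator in its standard with-replacement form (paraphrased, not quoted from the 1943 paper): draw n points with selection probabilities p_i; then

\[
\hat{\tau}_{HH} \;=\; \frac{1}{n}\sum_{j=1}^{n}\frac{y_{i_j}}{p_{i_j}},
\qquad
E\!\left[\hat{\tau}_{HH}\right] \;=\; \frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{N} p_i\,\frac{y_i}{p_i} \;=\; \sum_{i=1}^{N} y_i \;=\; \tau,
\]

so it is unbiased for the population total (divide by N for the mean) no matter how lopsided the p_i are. Sampling without replacement changes the variance and the bookkeeping, but the inverse-probability idea carries over.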
So, there's one paper reviewed. Everybody sing now:
99 bottles of beer on the wall...