Thursday, December 8, 2016

Full data set

I've run the CISS algorithm on a full-sized production data set. The results aren't much different from what I was expecting, but I'm still glad I took the time to do it. Actually, "full-sized" is probably a stretch. This is a 480-million-row data set reflecting 5-year cash flow projections. Any decent BI tool can handle that. However, it is a "complete" set, meaning that the distribution of measures should closely match that of the 60-billion-row data set that motivated this problem. With 10% of the data sampled, the estimate is pretty close. By 20%, you basically have your answer.



If you think that using a complete data set made the heavy-tail problem go away, guess again. Here's the estimate convergence where blocks are selected randomly:


Even with 90% of the data read, it's way off. Note that I have matched the axes on these graphs, so there's no cheating going on. Also, in case it's not clear (which it probably isn't), the x-axis is the number of blocks read. I used 100,000-row blocks.
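
For anyone who wants to see the random-selection behavior for themselves, here is a minimal sketch. To be clear, this is not CISS and it is not the production data: it assumes a synthetic Pareto-distributed measure and a plain scale-up estimator (running sum times total rows over rows read). The function names, block counts, and the `alpha` tail parameter are all made up for illustration.

```python
import random


def random_block_estimates(blocks, seed=0):
    """Read blocks in uniformly random order. After each block, estimate the
    full-population total by scaling the running sum up by rows seen."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)

    total_rows = sum(len(b) for b in blocks)
    rows_read = 0
    running_sum = 0.0
    estimates = []
    for i in order:
        running_sum += sum(blocks[i])
        rows_read += len(blocks[i])
        # Simple scale-up estimator (an assumption, not the CISS estimator).
        estimates.append(running_sum * total_rows / rows_read)
    return estimates


def make_heavy_tailed_blocks(n_blocks=200, block_size=1000, alpha=1.1, seed=1):
    """Synthetic stand-in for the real data: Pareto draws with a heavy tail."""
    rng = random.Random(seed)
    return [
        [rng.paretovariate(alpha) for _ in range(block_size)]
        for _ in range(n_blocks)
    ]


if __name__ == "__main__":
    blocks = make_heavy_tailed_blocks()
    true_total = sum(sum(b) for b in blocks)
    estimates = random_block_estimates(blocks)
    for frac in (0.1, 0.2, 0.5, 0.9, 1.0):
        k = int(frac * len(blocks))
        ratio = estimates[k - 1] / true_total
        print(f"{frac:4.0%} read: estimate / true = {ratio:.3f}")
```

Try a few seeds and values of `alpha`. With a heavy enough tail, a handful of blocks carry much of the total, so the estimate tends to jump whenever one of them happens to be read.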
