Monday, November 27, 2017

Size matters

Block size, specifically. And, none of the methods I've been looking at seem to care.

For example, in the BMH paper, the example uses k samples of m observations per iteration. Values of k are 25 and 50, m is 200, 500, and 1000. I'm sorry, I thought the title of your paper included the phrase "Analysis of Big Data". I don't know of a Big Data platform that uses 1000-row blocks. 1-million-row blocks would be more common.

That matters because if I have, say, 10 billion rows, that's only 10,000 blocks. If I use up 25 of them on each iteration, I'm not going to have many iterations before I've gone through the bulk of the data set. It's not that I can't see that a larger number of smaller blocks converges with less data, it's just that such an arrangement completely defeats the purpose of these algorithms; that is, to get the answer as quickly as possible. And, simply counting the number of floating point operations is a very poor predictor of how long an algorithm takes when the size of the data is larger than the size of memory.

This, of course, is one of the big differences between academic work and industry. Academics my rightly point out that fewer computations are needed with smaller blocks. What they are glossing over is the fact that computation time doesn't really matter much. It's getting stuff off the disk that takes all the time. Once the block is in memory, running the computation on the full block is basically free (unless you're doing something really nutty).

So, I'm implementing these algorithms using much larger blocks. The authors could rightly accuse me of not giving their algorithms a fair shake. I'm not. I'm actually giving them a huge advantage: I'm letting them use all the data in memory instead of just a tiny amount. My metric continues to be how many blocks are read, even though these algorithms have more computation than mine.

At the end of the day, I'll have to justify all this with some true time studies. To do that, I'll need to re-write all this stuff (again) to run on the cluster. It's actually not that much code and I was going to get around to it anyway. But, I'm hoping I can put that off until after we get this paper out the door.

No comments:

Post a Comment