Monday, August 31, 2015

Asymptotic Convergence vs. Bounded Domain Performance

Much of what we look at in Analysis of Algorithms is Asymptotic Convergence. This is the traditional way to assess an algorithm's performance: essentially, looking at the shape of the bounding curve as the amount of input data goes to infinity. That's all well and good, but in reality the input data doesn't go to infinity. There's some point at which, either due to the problem itself or to limitations of the system, the data size reaches a bound of its own.

At work we've exploited this fact and designed systems to be performant within a bounded domain. For example, the system I spend most of my time on is a big ETL engine that takes submissions ranging from a few thousand to several hundred million input rows. The Translate step of the ETL does pretty much the same thing for each row, so the process is O(n). However, we scale out onto a 32-node grid and chunk the work into groups of roughly 200,000 rows. So, for data sets between 200,000 and 6.4 million rows (200K * 32), every chunk runs in parallel and the wall-clock time is effectively constant. Anything smaller than 200K rows processes so fast that nobody cares about performance, so the only submissions affected by the O(n) growth are the relatively rare ones larger than 6.4 million rows.
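
To make that concrete, here's a minimal sketch of the chunking idea in Python. It isn't our grid code; names like translate_row, translate_chunk, CHUNK_SIZE, and GRID_NODES are illustrative assumptions, and a local process pool stands in for the 32-node grid.

    # Sketch only: a process pool stands in for the grid.
    from concurrent.futures import ProcessPoolExecutor

    CHUNK_SIZE = 200_000   # rows per chunk (assumed)
    GRID_NODES = 32        # parallel workers (assumed)

    def translate_row(row):
        # Placeholder for the per-row Translate step; O(1) per row.
        return row

    def translate_chunk(chunk):
        # Every chunk is roughly the same size, so each task takes
        # roughly the same amount of time.
        return [translate_row(row) for row in chunk]

    def translate(rows):
        # Split the submission into ~200K-row chunks.
        chunks = [rows[i:i + CHUNK_SIZE]
                  for i in range(0, len(rows), CHUNK_SIZE)]
        # While len(chunks) <= GRID_NODES (i.e. up to ~6.4M rows), all
        # chunks run in parallel and wall-clock time stays roughly
        # constant. Beyond that, chunks queue up and O(n) growth returns.
        with ProcessPoolExecutor(max_workers=GRID_NODES) as pool:
            results = pool.map(translate_chunk, chunks)
        return [row for chunk in results for row in chunk]

    if __name__ == "__main__":
        print(len(translate(list(range(1_000_000)))))

The point is just that, as long as the number of chunks doesn't exceed the number of workers, a bigger submission adds chunks that run concurrently rather than sequentially.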

It's not that we don't care about the asymptotic performance; we get submissions as large as 60 billion rows, so we do need to be cognizant of what that does to the system. But by setting a bound on the data we wish to optimize for, and then tuning the system to that bound, we can deliver much better performance than we could if we had to tune for the worst case.
