While the term assignment doesn't require that the question be answered (or even acknowledged), any application of an Evolutionary Algorithm should be preceded by at least considering whether it wouldn't be better to solve the problem directly.
The question is particularly relevant in my blocking problem because, right now, direct solution is pretty much the way it gets done. Non-optimal, for sure, but direct. Partitioning is either determined up front by the database designer or created after the fact in the form of aggregations. The former is almost never done by rigorous analysis; it's more the database designer's intuition. The latter may also be specified by the designer. If not, dynamic aggregations are built in response to query patterns. This last bit is the closest to what I'm proposing, but I'm trying to do it without losing the granularity of the data.
At any rate, none of the above "direct" methods constitute algorithms, because the resulting solutions are just a good crack at the best solution. Nobody claims they are actually optimal. I'm not making any claim of optimality, either, so that state of affairs needs to be justified. The simple answer is that it's hard. In the world of optimization theory, that means NP-hard. NP-hard is generally accepted to mean that you can forget about solving the problem exactly on large data sets (that assertion is unproven, but even if it turns out that such problems can be solved in polynomial time, it's a polynomial that very quickly gets bigger than anything an existing machine can handle).
The usual strategy for showing that an optimization problem is NP-hard is to show that the corresponding decision problem is NP-Complete. That is, every other NP-Complete decision problem could be decided by an algorithm that solves this one (if such an algorithm existed). Fortunately, because polynomial-time reductions are transitive, you only have to reduce one known NP-Complete problem to yours and the rest fall into line.
So, which one? And, more fundamentally, what is the corresponding decision problem?
Starting with the second question: the optimal solution is the one that minimizes the number of blocks that have to be read to satisfy a finite set of queries. The corresponding decision problem is, "Does there exist a blocking that answers those queries with at most k block reads?"
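Just to pin that down (this is only a sketch, with made-up names and a deliberately simplified model in which a query has to read every block that contains at least one matching row), here's what evaluating one candidate blocking against a workload looks like in Python. Note that this only plays the verifier's role; the decision problem asks whether any layout fits the budget, and that's the part that looks hard.

```python
# Toy model (my assumptions, not a real storage engine): a "layout" assigns each
# row to a block, and a query must read every block containing a matching row.

def blocks_read(layout, rows, queries):
    """Total block reads needed to answer every query in the workload.

    layout  : dict, row id -> block id
    rows    : dict, row id -> row (whatever the predicates accept)
    queries : list of predicates, each taking a row and returning True/False
    """
    total = 0
    for matches in queries:
        touched = {layout[rid] for rid, row in rows.items() if matches(row)}
        total += len(touched)
    return total

def is_yes_instance(layout, rows, queries, k):
    """Decision version, for one fixed layout: at most k block reads?"""
    return blocks_read(layout, rows, queries) <= k

# Four rows, two attributes, two single-attribute queries.
rows = {i: {"region": r, "year": y}
        for i, (r, y) in enumerate([("east", 2020), ("east", 2021),
                                    ("west", 2020), ("west", 2021)])}
queries = [lambda row: row["region"] == "east",
           lambda row: row["year"] == 2020]

partition_by_region = {0: "A", 1: "A", 2: "B", 3: "B"}
print(blocks_read(partition_by_region, rows, queries))        # 1 + 2 = 3
print(is_yes_instance(partition_by_region, rows, queries, 3))  # True
```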
Given that, what known NP-Complete problem can be beaten into a form such that solving my problem would also solve it? That's going to take some thinking. The one that looks most promising to me at first glance is Steiner Trees, one of the original 21 NP-Complete problems proposed by Karp in 1972.
The Steiner Tree problem asks whether the terminal vertices of a weighted graph can be connected by a tree of at most a given total edge weight. It seems there might be a way to frame my problem that way, with each tree node being a partitioning attribute, the terminal nodes being the blocks, and the edge weights reflecting how often the partitioning attribute(s) don't let you quit traversing the tree.
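The reduction itself is a problem for another day, but to keep the target in view, here's the Steiner Tree decision problem stated as code. It's a brute-force checker on a made-up toy graph (exponential in the number of edges, which is rather the point):

```python
# Steiner Tree, decision version: is there a set of edges of total weight at most
# max_weight that connects all of the terminal vertices? (With nonnegative weights,
# any minimal such set is a tree, so checking connected edge subsets is enough.)
from itertools import combinations

def connects_terminals(edge_subset, terminals):
    """True if every terminal ends up in the same connected component."""
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for u, v, _ in edge_subset:
        parent[find(u)] = find(v)
    return len({find(t) for t in terminals}) == 1

def steiner_tree_exists(edges, terminals, max_weight):
    """Brute force: try every edge subset, smallest first."""
    for size in range(len(edges) + 1):
        for subset in combinations(edges, size):
            if (sum(w for _, _, w in subset) <= max_weight
                    and connects_terminals(subset, terminals)):
                return True
    return False

# Toy instance: three terminals around a cheap "Steiner point" m.
edges = [("a", "m", 1), ("b", "m", 1), ("c", "m", 1),
         ("a", "b", 3), ("b", "c", 3)]
print(steiner_tree_exists(edges, ["a", "b", "c"], max_weight=3))  # True
print(steiner_tree_exists(edges, ["a", "b", "c"], max_weight=2))  # False
```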
Another possibility would be to frame it as an Integer Linear Programming problem, since 0-1 Integer Programming is also on Karp's list. (Plain Linear Programming won't do; that one is solvable in polynomial time.) You have to be careful with that, though. It's fairly clear that you could solve this using Integer Programming, but you can solve lots of easy problems with Integer Programming. That just proves the problem is NP-easy (as in, no harder than NP). To show it's NP-hard, you have to go the other direction: show that any Integer Programming instance can be solved by solving your problem.
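For what it's worth, the easy direction is straightforward to sketch. Below is a rough 0-1 formulation of the blocking problem using the PuLP library; the model is simplified and all the names and data are mine: rows go into a fixed set of capacity-limited candidate blocks, and a query reads a block whenever the block holds a matching row. Again, this only shows the problem is no harder than Integer Programming; it says nothing about hardness.

```python
# Sketch of the "NP-easy" direction: pose the blocking problem as a 0-1 integer
# program and hand it to a solver. Simplified model, hypothetical data.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

rows = range(4)
blocks = range(2)
capacity = 2                                    # rows per block
queries = {"q_east": [0, 1], "q_2020": [0, 2]}  # query -> matching row ids

prob = LpProblem("blocking", LpMinimize)

# x[r, b] = 1 if row r is stored in block b
x = {(r, b): LpVariable(f"x_{r}_{b}", cat=LpBinary) for r in rows for b in blocks}
# y[q, b] = 1 if query q has to read block b
y = {(q, b): LpVariable(f"y_{q}_{b}", cat=LpBinary) for q in queries for b in blocks}

prob += lpSum(y.values())                       # minimize total block reads

for r in rows:                                  # every row lives in exactly one block
    prob += lpSum(x[r, b] for b in blocks) == 1
for b in blocks:                                # blocks have finite capacity
    prob += lpSum(x[r, b] for r in rows) <= capacity
for q, matching in queries.items():             # a matching row forces a read
    for r in matching:
        for b in blocks:
            prob += y[q, b] >= x[r, b]

prob.solve(PULP_CBC_CMD(msg=False))
print("minimum block reads:", sum(v.value() for v in y.values()))  # 3.0 for this toy data
```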
Again, not needed for the class project, but definitely worth pondering if it's going to become part of my thesis.