Thursday, April 6, 2017

Evolution

With the data layer and query generator done, it's time to get serious about the actual algorithm I'm going to employ for Blocker, my term project in Evolutionary Programming. Here are some thoughts.

The general flow is as follows: run a set of queries on the data and score the blocks. High-scoring blocks get carried on without modification. Low-scoring blocks are dissolved and their contents are copied back into the database via the block assigner. The evolutionary part is that the block assigner changes with each iteration as more information is gained about the distribution of the data and the queries.
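That flow can be sketched as a single generation step. This is a minimal sketch, not the real implementation: `Block`, `score_block`, the 0.5 threshold, and the `assign` callback are all hypothetical stand-ins for Blocker's actual interfaces.

```python
class Block:
    def __init__(self, records):
        self.records = records

def score_block(block, queries):
    # Stand-in scoring: fraction of the block's records that any query
    # in the workload wanted. The real scorer is more involved.
    wanted = set().union(*queries) if queries else set()
    if not block.records:
        return 0.0
    return sum(1 for r in block.records if r in wanted) / len(block.records)

def run_generation(blocks, queries, assign, threshold=0.5):
    survivors, loose = [], []
    for b in blocks:
        if score_block(b, queries) >= threshold:
            survivors.append(b)      # high scorers carried on unmodified
        else:
            loose.extend(b.records)  # low scorers dissolved; records freed
    # The (evolving) block assigner repacks the freed records into new blocks.
    survivors.extend(assign(loose))
    return survivors
```

The assigner is passed in as a callback precisely because it is the part that changes between iterations.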

The novelty (it's not much of one, but the term paper doesn't require original work, just an implementation) is in the way the block assigner evolves. In a normal Evolution Strategy setting, the evolution comes from perturbing the candidate solution randomly and then comparing the fitness of the new generation to the old. That's too inefficient for this application because reblocking the entire database takes quite some time.

Instead, I'll have a prior distribution on blocking preferences. From the query results, a posterior will be computed and that will be used to set the distribution for preferences in the next round. For this to really work well, it has to take into account the correlations between attributes. I'm not sure if I can go beyond pairwise correlations for this effort, but even that should help quite a bit.
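The prior-to-posterior step could look something like a conjugate update per attribute. This is a deliberately simplified sketch with hypothetical names: it tracks only a marginal Beta distribution over each attribute's usefulness for blocking, whereas the post calls for at least pairwise correlations between attributes.

```python
class AttributePreference:
    """Beta prior over how useful one attribute is for blocking.

    Hypothetical sketch: the real posterior would also capture
    (at least pairwise) correlations between attributes, not just
    per-attribute marginals like this.
    """
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # pseudo-count of query hits
        self.beta = beta    # pseudo-count of query misses

    def update(self, hits, misses):
        # Conjugate Beta-Binomial update from one round of query results.
        self.alpha += hits
        self.beta += misses

    def mean(self):
        # Posterior mean preference, used to parameterize the next
        # round's block assigner.
        return self.alpha / (self.alpha + self.beta)
```

The appeal of a conjugate update here is that each generation's query results fold into the distribution in constant time, with no reblocking needed just to evaluate fitness.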

A couple other quick thoughts on how I can get generations through faster:

  • Time box the operation. That is, keep generating and running queries until a fixed amount of time has passed, rather than stopping after a fixed number of queries. This lets early generations, which tend to see big gains, finish quickly. Later generations, where the improvements are much smaller, will get a larger query base because query times will drop as a result of the optimization so far.
  • Change the sample threshold over time. Start with sparse sampling, maybe only actually reading 10% of the selected blocks. Increase the percentage of blocks actually read as the posterior distribution gets better and can benefit from more data.
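Both ideas fit naturally into one loop: run queries against a deadline, and sample only a fraction of each query's candidate blocks. Another sketch with hypothetical interfaces; `query_gen` yields each query's candidate block list, and `sample_frac` is the knob that starts around 0.1 and grows across generations.

```python
import random
import time

def timeboxed_round(run_query, query_gen, budget_s, sample_frac):
    """Run queries until a fixed time budget expires, reading only a
    sampled fraction of each query's candidate blocks.
    Hypothetical interface names."""
    deadline = time.monotonic() + budget_s
    results = []
    for candidate_blocks in query_gen:
        if time.monotonic() >= deadline:
            break  # time box: stop on the clock, not on a query count
        # Sparse sampling early on; raise sample_frac in later rounds
        # once the posterior can benefit from more data.
        k = max(1, int(len(candidate_blocks) * sample_frac))
        results.append(run_query(random.sample(candidate_blocks, k)))
    return results
```

Because the budget is wall-clock time, faster queries in later generations automatically translate into more samples per round, which is exactly when the posterior wants them.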
