Problem statement
Queries against very large fact tables are slow. There are three common approaches to this problem: indexing, partitioning, and aggregation. Indexing is effective when the number of attributes attached to a row is small; when the number of attributes runs into the thousands, indexing each attribute becomes infeasible. Partitioning is effective when the query criteria allow the search engine to exclude partitions, but if the query crosses partitions, little is gained. Likewise, aggregations are only beneficial when the query groups by the attributes used to build the aggregation.

Furthermore, all three approaches are static. The addition of new attributes or a change in query patterns can significantly curtail the effectiveness of the initial solution.
The goal of this research is to explore dynamic partitioning that adapts to changing query patterns. There are several possible implementations. The first avenue will be to construct a genetic algorithm that creates an ecosystem of partitions competing for correlated records. Over time, low-performing partitions are pruned and replaced with partitions bred from higher-performing ones.
The overall objective is to minimize the number of blocks that must be read to satisfy a query. Since query patterns change, minimizing this directly is problematic. Instead, a surrogate fitness function will be used for each block: maximizing the query correlation of the rows within the block. That is, a fit block either returns many rows for a query or can be excluded entirely; an unfit block is searched often but returns few rows.
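As a concrete illustration, here is a minimal sketch of one way such a surrogate fitness could be computed from per-block statistics. The statistic names and the specific weighting (an exclusion scoring as a perfect outcome, a scan scoring by its hit ratio) are assumptions, not fixed parts of the design.

```python
def block_fitness(times_scanned: int, times_excluded: int,
                  rows_returned: int, row_count: int) -> float:
    """Score a block by the query correlation of its rows.

    An exclusion is the ideal outcome (no I/O), so it scores 1.0.
    A scan scores by its hit ratio: average rows returned per row stored.
    Fitness is the mean outcome over all queries that considered the block.
    """
    queries = times_scanned + times_excluded
    if queries == 0 or row_count == 0:
        return 0.0  # no evidence yet; treat as unfit
    hit_ratio = (rows_returned / times_scanned / row_count) if times_scanned else 0.0
    return (times_excluded + times_scanned * hit_ratio) / queries
```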
An outline of this approach follows.
Self-tuning partitioning ecosystem for large fact tables
Data Layout
Data is organized in a traditional star schema. Dimensions contain collections of attributes; an attribute can appear in only one dimension, and the set of (attribute, attribute value) pairs in a dimension row must be unique within that dimension.

Fact rows contain one or more measures and foreign keys to dimensions. Facts are partitioned into blocks. The criteria for inclusion in a block are the output of this process at each generation.
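A minimal sketch of these structures follows, assuming block criteria take the form of attribute-to-allowed-value predicates (one plausible representation; the layout above does not fix one).

```python
from dataclasses import dataclass, field

@dataclass
class DimensionRow:
    key: int
    attributes: dict[str, str]       # (attribute, value) pairs, unique per dimension

@dataclass
class FactRow:
    measures: dict[str, float]       # one or more measures
    dimension_keys: dict[str, int]   # one foreign key per dimension

@dataclass
class Block:
    criteria: dict[str, set[str]]    # attribute -> allowed values
    rows: list[FactRow] = field(default_factory=list)

    def matches(self, attrs: dict[str, str]) -> bool:
        # A row belongs here if every criterion admits the row's value.
        return all(attrs.get(a) in vals for a, vals in self.criteria.items())
```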
Generation zero
The initial blocking of data is arbitrary. Data blocks can be built sequentially as data comes in or, if some information about query patterns is known, some attributes can be selected as candidates for splitting data into blocks.

As fact rows are submitted, the loading program assigns each row to a block whose criteria it matches. Blocks are examined from smallest to largest, and the first block with matching criteria gets the row. If that block is full, it is split in two based on either a random criterion or the next available candidate attribute, as sketched below.
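The following sketch shows that assignment scheme. For brevity a row is represented directly as a dict of attribute values rather than through dimension foreign keys; the block capacity and the split policy are assumptions.

```python
import random

BLOCK_CAPACITY = 10_000   # tunable; the text does not fix a block size

class Block:
    def __init__(self, criteria):
        self.criteria = criteria   # attribute -> set of allowed values
        self.rows = []

    def matches(self, row):
        return all(row.get(a) in vals for a, vals in self.criteria.items())

def split(block, candidates):
    """Split a full block on the next candidate attribute, or on a random
    attribute seen in its rows (assumed to have >1 distinct value)."""
    attr = candidates.pop(0) if candidates else random.choice(
        sorted({a for r in block.rows for a in r}))
    values = {r.get(attr) for r in block.rows}
    half = set(sorted(values, key=repr)[: len(values) // 2 or 1])
    left = Block({**block.criteria, attr: half})
    right = Block({**block.criteria, attr: values - half})
    for r in block.rows:
        (left if left.matches(r) else right).rows.append(r)
    return left, right

def assign(blocks, row, candidates):
    """Smallest matching block wins; a full block is split and retried."""
    for block in sorted(blocks, key=lambda b: len(b.rows)):
        if block.matches(row):
            if len(block.rows) < BLOCK_CAPACITY:
                block.rows.append(row)
                return
            left, right = split(block, candidates)
            blocks.remove(block)
            blocks += [left, right]
            return assign(blocks, row, candidates)
    raise LookupError("no matching block; seed generation zero with a catch-all Block({})")
```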
Query processing
As queries come in, statistics are kept on how often a block is excluded from a query without its contents being examined (that is, the block's inclusion criteria prove that no rows would match the query) and on how many rows are returned each time the block is read. The query criteria are also saved so they can be used for subsequent fitness testing. (In testing, this query load will be simulated.)
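A sketch of the bookkeeping this implies, using assumed names; a query is modeled here as a simple attribute-to-required-value map.

```python
from dataclasses import dataclass

@dataclass
class BlockStats:
    times_excluded: int = 0   # pruned from the query via criteria alone
    times_scanned: int = 0    # contents actually read
    rows_returned: int = 0    # matching rows across all scans

query_log: list[dict] = []    # saved criteria, replayed during fitness testing

def row_matches(row: dict, query: dict) -> bool:
    return all(row.get(a) == v for a, v in query.items())

def can_exclude(block, query: dict) -> bool:
    # The criteria prove no row matches if some queried value is disallowed.
    return any(a in block.criteria and v not in block.criteria[a]
               for a, v in query.items())

def record(block, stats: BlockStats, query: dict) -> None:
    query_log.append(query)
    if can_exclude(block, query):
        stats.times_excluded += 1
    else:
        stats.times_scanned += 1
        stats.rows_returned += sum(1 for r in block.rows if row_matches(r, query))
```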
Block propagation and fitness testing

During times when the system has available resources, a background task creates candidate blocks for the next generation. This is done by crossing and mutating the criteria of more successful blocks and then populating the new blocks with rows drawn from less successful blocks. The query history is then replayed against the candidate blocks to assess their fitness.
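One plausible pair of genetic operators over criteria maps is sketched below; the exact operators are left open above, so these are purely illustrative.

```python
import random

def crossover(criteria_a: dict, criteria_b: dict) -> dict:
    """Take each parent's predicate for roughly half the attributes."""
    child = {}
    for attr in set(criteria_a) | set(criteria_b):
        parent = criteria_a if random.random() < 0.5 else criteria_b
        if attr in parent:
            child[attr] = set(parent[attr])
    return child

def mutate(criteria: dict, all_values: dict, rate: float = 0.1) -> dict:
    """Occasionally widen or narrow one predicate's allowed-value set."""
    mutated = {a: set(v) for a, v in criteria.items()}
    for attr, allowed in mutated.items():
        if random.random() < rate:
            extra = all_values[attr] - allowed
            if extra and (not allowed or random.random() < 0.5):
                allowed.add(random.choice(sorted(extra)))        # widen
            elif len(allowed) > 1:
                allowed.remove(random.choice(sorted(allowed)))   # narrow
    return mutated
```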
Generation n+1

When enough generation candidates have been tested (a tunable parameter, expected to be roughly 10 times the desired block count), a new set of block criteria is selected for inclusion in the next generation by random selection weighted by fitness. A percentage of the existing blocks (another tunable parameter, probably around half in early generations and decreasing thereafter) is randomly selected for end of life, weighted by lack of fitness.

The remaining blocks are carried into the next generation unmodified, while the rows from the terminated blocks are redistributed across the new generation using the same scheme as in generation zero (smallest matching block; split when full).
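A sketch of the two weighted draws this step requires, assuming fitness scores are available for both pools; Python's random.choices performs the proportional sampling.

```python
import random

def pick_weighted(items: list, weights: list, k: int) -> list:
    """Sample up to k distinct items with probability proportional to weight."""
    chosen, pool, w = [], list(items), list(weights)
    while pool and len(chosen) < k and sum(w) > 0:
        i = random.choices(range(len(pool)), weights=w)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return chosen

def next_generation(candidates, cand_fitness, current, cur_fitness,
                    n_new: int, cull_fraction: float = 0.5):
    # Candidates enter weighted by fitness...
    admitted = pick_weighted(candidates, cand_fitness, n_new)
    # ...and existing blocks exit weighted by lack of fitness.
    top = max(cur_fitness)
    doomed = pick_weighted(current, [top - f + 1e-9 for f in cur_fitness],
                           int(len(current) * cull_fraction))
    survivors = [b for b in current if b not in doomed]
    return survivors + admitted, doomed   # doomed blocks' rows get re-assigned
```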
The generation counter is then incremented and all new queries are applied to the new set of blocks. When all queries running against the old set are complete, the terminated blocks are deleted and the storage freed. Propagation and fitness testing of the next generation of blocks resumes immediately as a background task.
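One way to implement this cutover safely is to tag each query with the generation it started against and free terminated blocks only once the old generation's in-flight count drains to zero. The counter scheme below is an assumption about the runtime, not specified above.

```python
import threading

class GenerationGate:
    def __init__(self):
        self.generation = 0
        self.in_flight = {0: 0}
        self.lock = threading.Lock()

    def begin_query(self) -> int:
        with self.lock:
            gen = self.generation
            self.in_flight[gen] += 1
            return gen

    def end_query(self, gen: int) -> bool:
        """Return True if an old generation just fully drained,
        meaning its terminated blocks can now be deleted."""
        with self.lock:
            self.in_flight[gen] -= 1
            return gen < self.generation and self.in_flight[gen] == 0

    def advance(self) -> int:
        """Increment the generation counter; new queries see the new blocks."""
        with self.lock:
            self.generation += 1
            self.in_flight[self.generation] = 0
            return self.generation
```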