Introduction: Partitioning as a form of clustering
- Show a simple example of how partitioning cuts down on query time
- Optimal partitioning has blocks where all or no rows are included
- Also need a way to know without reading the block - attribute tagging
Building the initial data set partitions
- Since we have no real query data, use surrogate partitioning
- Weight attributes by expected query use
- Insert entries into CF tree as per BIRCH except that we use hard partition lines rather than closeness when splitting a node. That is, an attribute is chosen to yield the best split and then the values are divided according the that attribute only.
- Label each block with appropriate attribute tags
When querying keep track of key metrics:
- Actual attribute slices.
- Percentage of rows are actually used when a block is read.
- Percentage of queries where the attribute tags allow the block to be skipped entirely.
Dynamic repartitioning
- Rank the blocks based on the effectiveness of their attribute tags
- Good blocks are either skipped entirely or return a high percentage of rows.
- Remove underperforming block(s)
- Re-insert data, using attribute partition weights updated from actual use
Why that won't work
- Even with updated weights, most of the re-inserted data is going to flow to the same spot on the CF tree.
- Insert a step between the deletion of the bad blocks and re-insertion of data where a new CF Tree is created using only blocks (not the individual rows in the blocks).
- This will sufficiently re-shuffle the CF tree so that the re-inserted data goes to new (presumably better) locations.
Implementation details to be sorted out through empirical testing
- Exact distance metric given query history
- Block ranking metric
- Fixed or dynamic number of blocks to delete on each iteration
- Maintaining stability during the reshuffle operation (system needs to be continuously available for query and new data loads).
No comments:
Post a Comment