Monday, April 11, 2016

Outline for BIRCH presentation

Recalling, of course, that this isn't really a presentation on BIRCH, but rather the concepts from BIRCH that will be relevant to my work this summer (and/or possibly fall).

Introduction: Partitioning as a form of clustering
  • Show a simple example of how partitioning cuts down on query time
  • Optimal partitioning has blocks where all or no rows are included
    • Also need a way to know without reading the block - attribute tagging
Building the initial data set partitions
  • Since we have no real query data, use surrogate partitioning
    • Weight attributes by expected query use
  • Insert entries into CF tree as per BIRCH except that we use hard partition lines rather than closeness when splitting a node. That is, an attribute is chosen to yield the best split and then the values are divided according the that attribute only.
  • Label each block with appropriate attribute tags
When querying keep track of key metrics:
  • Actual attribute slices.
  • Percentage of rows are actually used when a block is read.
  • Percentage of queries where the attribute tags allow the block to be skipped entirely.
Dynamic repartitioning
  • Rank the blocks based on the effectiveness of their attribute tags
    • Good blocks are either skipped entirely or return a high percentage of rows.
  • Remove underperforming block(s)
  • Re-insert data, using attribute partition weights updated from actual use
Why that won't work
  • Even with updated weights, most of the re-inserted data is going to flow to the same spot on the CF tree.
  • Insert a step between the deletion of the bad blocks and re-insertion of data where a new CF Tree is created using only blocks (not the individual rows in the blocks).
  • This will sufficiently re-shuffle the CF tree so that the re-inserted data goes to new (presumably better) locations.
Implementation details to be sorted out through empirical testing
  • Exact distance metric given query history
  • Block ranking metric
  • Fixed or dynamic number of blocks to delete on each iteration
  • Maintaining stability during the reshuffle operation (system needs to be continuously available for query and new data loads).

No comments:

Post a Comment