never2old4school: Outline for BIRCH presentation

Monday, April 11, 2016

Outline for BIRCH presentation

Recalling, of course, that this isn't really a presentation on BIRCH, but rather the concepts from BIRCH that will be relevant to my work this summer (and/or possibly fall).

Introduction: Partitioning as a form of clustering

Show a simple example of how partitioning cuts down on query time
Optimal partitioning has blocks where all or no rows are included

Also need a way to know without reading the block - attribute tagging

Building the initial data set partitions

Since we have no real query data, use surrogate partitioning

Weight attributes by expected query use

Insert entries into CF tree as per BIRCH except that we use hard partition lines rather than closeness when splitting a node. That is, an attribute is chosen to yield the best split and then the values are divided according the that attribute only.
Label each block with appropriate attribute tags

When querying keep track of key metrics:

Actual attribute slices.
Percentage of rows are actually used when a block is read.
Percentage of queries where the attribute tags allow the block to be skipped entirely.

Dynamic repartitioning

Rank the blocks based on the effectiveness of their attribute tags

Good blocks are either skipped entirely or return a high percentage of rows.

Remove underperforming block(s)
Re-insert data, using attribute partition weights updated from actual use

Why that won't work

Even with updated weights, most of the re-inserted data is going to flow to the same spot on the CF tree.
Insert a step between the deletion of the bad blocks and re-insertion of data where a new CF Tree is created using only blocks (not the individual rows in the blocks).
This will sufficiently re-shuffle the CF tree so that the re-inserted data goes to new (presumably better) locations.

Implementation details to be sorted out through empirical testing

Exact distance metric given query history
Block ranking metric
Fixed or dynamic number of blocks to delete on each iteration
Maintaining stability during the reshuffle operation (system needs to be continuously available for query and new data loads).

never2old4school

Monday, April 11, 2016

Outline for BIRCH presentation

No comments:

Post a Comment