Looking ahead to dynamic partitioning...
The distance metric very much impacts the shape of the clusters. In particular, the standard Euclidean distance metric will tend to find clusters of spherical shape. For dynamic partitioning, I'm more interested in rectangular clusters. Specifically, clusters with boundaries along dimension values (e.g., age between 50 and 60, account type = expenses, year = 2016, etc.).
To find clusters of this shape, one should use the rectilinear distance (more commonly referred to as the Manhattan distance since it gives the distance one use if driving on a rectangular street grid, like in Manhattan). This fact is well known, but rarely applied (since most clusters are, in fact, fairly spherical, or at least elliptical). I'm not exactly sure how it changes the application of BIRCH to the problem of partitioning, but it's something I'll need to work out this summer.
No comments:
Post a Comment