Reasons to cluster:
- Summarization
- Compression (use of cluster prototype points to represent all points)
- Nearest neighbor algorithms can search same cluster first
Hierarchial - nesting of clusters
Partitional - non overlapping
Exclusive - each point gets 1 or zero clusters
Overlapping - point may be in more than one
Fuzzy - point is allocated partially to several clusters
Prototype-based: object assigned to cluster with most similar prototype
Graph-based: clusters represent connected objects
Density-based: clusters represents regions of heightened density
Shared-property (conceptual): objects share some property or set of properties
K-means: prototype-based partitional clustering
Dependent on distance metric. Euclidean distance results in clusters with a centoid at the mean of the dimensions. Minimizes Sum of Squared Error (SSE).
Alternative metrics include the cosine similarity measure, which is useful for document data. (Cohesion measure).
Complexity: Time - O(I * K * m * n) where I is the iteration bound, K is the number of clusters, and the data is a m points with n attributes. Space - O((m + K)n).
Techniques to reduce SSE after convergence on a (possibly) non-global optimum: split a cluster, introduce a new centriod, disperse a cluster, merge two clusters.
Bisecting K-means: Start with K-means(2) and then continue to split a cluster at each iteration. Cluster to split may be chosen by several criteria (largest, largest SSE, greatest diameter, etc.)
Agglomerative Hierarchial: starts with singletons and combines nearest sets
Distance metrics to find nearest sets: single link (closest pair), complete link (most distant pair), group average.
Complexity: Time - O((m2 log m) assuming an efficient heap or list. Space - O((m2)
DBSCAN: density-based finds high density areas and treats the rest as noise.
Points classified as Core, Border, or Noise based on how many other points are with in a set distance eps. Core points within eps are put in the same cluster. Border points are put in a cluster of a core point they border. Noise points are ignored.
Cluster evaluation
Cohesion vs separation (cohesion with clusters, separation between clusters).
See p538 for some specific formulas for separation and cohesion.
Silhouette Coefficient: a = average distance from object to all objects in same cluster. b = average distance between object and all objects in a different cluster. s = (b - a)/max(a, b) => [-1, 1]. closer to 1 is better.
Cluster evaluation
Cohesion vs separation (cohesion with clusters, separation between clusters).
See p538 for some specific formulas for separation and cohesion.
Silhouette Coefficient: a = average distance from object to all objects in same cluster. b = average distance between object and all objects in a different cluster. s = (b - a)/max(a, b) => [-1, 1]. closer to 1 is better.
No comments:
Post a Comment