Thursday, March 10, 2016

Data Mining Tan ch 8 notes

Rather last minute, but I'll throw together a few notes for the exam from the supplemental chapter in Tan regarding clustering. (Literally, this is very late in the game; production issues at work diverted my efforts last night, so this will have to do.)

Reasons to cluster:

  • Summarization
  • Compression (use of cluster prototype points to represent all points)
  • Nearest neighbor algorithms can search same cluster first

Hierarchical - nesting of clusters
Partitional - non-overlapping clusters

Exclusive - each point is assigned to at most one cluster
Overlapping - point may be in more than one
Fuzzy - point is allocated partially to several clusters

Prototype-based: object assigned to cluster with most similar prototype
Graph-based: clusters represent connected objects
Density-based: clusters represent regions of heightened density
Shared-property (conceptual): objects share some property or set of properties

K-means: prototype-based partitional clustering

Dependent on the distance metric. Euclidean distance results in clusters whose centroid is the mean of the member points, and minimizes the Sum of Squared Error (SSE).

Alternative metrics include the cosine similarity measure, which is useful for document data. (Cohesion measure).
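Cosine similarity is just a normalized dot product; a minimal pure-Python sketch (function name is mine):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Because it depends only on direction, two documents with proportional term counts score 1.0 regardless of length.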

Complexity: Time - O(I * K * m * n) where I is the iteration bound, K is the number of clusters, and the data consists of m points with n attributes. Space - O((m + K)n).
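The basic loop (assign to nearest centroid, recompute centroids, repeat) can be sketched in pure Python; this is an illustrative sketch, not the book's pseudocode, and the names are mine:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means with Euclidean distance (illustrative sketch)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct points
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new.append(centroids[j])  # keep an empty cluster's centroid
        if new == centroids:
            break  # converged
        centroids = new
    # SSE: sum of squared distances from each point to its centroid.
    sse = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    return labels, centroids, sse
```

Note the result depends on the random initialization, which is exactly why the SSE-reduction tricks below exist.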

Techniques to reduce SSE after convergence on a (possibly) non-global optimum: split a cluster, introduce a new centroid, disperse a cluster, merge two clusters.

Bisecting K-means: Start with K-means(2) and then continue to split a cluster at each iteration. Cluster to split may be chosen by several criteria (largest, largest SSE, greatest diameter, etc.)
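A rough sketch of bisecting K-means using largest SSE as the split criterion; the inline 2-means helper and all names are my own simplifications, not the book's pseudocode:

```python
import math
import random

def two_means(points, iters=50, rng=None):
    """Split a set of points into two groups with a basic 2-means pass."""
    rng = rng or random.Random(0)
    cents = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            near = 0 if math.dist(p, cents[0]) <= math.dist(p, cents[1]) else 1
            groups[near].append(p)
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else cents[i]
               for i, g in enumerate(groups)]
        if new == cents:
            break
        cents = new
    return groups

def cluster_sse(points):
    """SSE of a cluster about its own centroid."""
    cent = tuple(sum(c) / len(points) for c in zip(*points))
    return sum(math.dist(p, cent) ** 2 for p in points)

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until k exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        # Split criterion: largest SSE (only clusters with >= 2 points split).
        victim = max((c for c in clusters if len(c) >= 2), key=cluster_sse)
        clusters.remove(victim)
        left, right = two_means(victim, rng=rng)
        clusters.extend([left, right])
    return clusters
```

Swapping `cluster_sse` for `len` in the `max` call gives the "largest cluster" criterion instead.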

Agglomerative Hierarchical: starts with singletons and combines the nearest sets

Distance metrics to find nearest sets: single link (closest pair), complete link (most distant pair), group average.

Complexity: Time - O(m^2 log m) assuming an efficient heap or list. Space - O(m^2)
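A naive agglomerative pass, recomputing pair distances each round (so O(m^3), without the heap that gets the O(m^2 log m) bound), might look like this sketch (names mine):

```python
import math
from itertools import combinations

def agglomerative(points, k, linkage="single"):
    """Merge the two closest clusters until only k remain (naive sketch)."""
    clusters = [[p] for p in points]  # start from singletons

    def dist(a, b):
        ds = [math.dist(p, q) for p in a for q in b]
        if linkage == "single":
            return min(ds)       # closest pair
        if linkage == "complete":
            return max(ds)       # most distant pair
        return sum(ds) / len(ds)  # group average

    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so i stays valid
    return clusters
```

Stopping at k clusters flattens the hierarchy; keeping the full merge sequence instead would give the usual dendrogram.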


DBSCAN: density-based clustering that finds high-density areas and treats the rest as noise.

Points are classified as Core, Border, or Noise based on how many other points are within a set distance eps. Core points within eps of one another are put in the same cluster. Border points are put in the cluster of a core point they border. Noise points are ignored.
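The core/border/noise classification can be sketched as follows (function name is mine; whether a point counts toward its own eps-neighborhood varies by formulation, and here it does):

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point core, border, or noise per DBSCAN's definitions."""
    def neighborhood(p):
        # All points within eps of p, including p itself.
        return [q for q in points if math.dist(p, q) <= eps]

    labels = []
    for p in points:
        neighbors = neighborhood(p)
        if len(neighbors) >= min_pts:
            labels.append("core")
        elif any(len(neighborhood(q)) >= min_pts for q in neighbors):
            labels.append("border")  # within eps of some core point
        else:
            labels.append("noise")
    return labels
```

A full DBSCAN would then connect core points within eps of each other into clusters; this sketch only does the labeling step described above.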

Cluster evaluation

Cohesion vs separation (cohesion within clusters, separation between clusters).

See p538 for some specific formulas for separation and cohesion.

Silhouette Coefficient: a = average distance from the object to all other objects in its own cluster. b = the minimum, over the other clusters, of the average distance from the object to that cluster's objects. s = (b - a)/max(a, b) => [-1, 1]; closer to 1 is better.
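A quick sketch of the per-object computation, averaged over all objects (assumes every cluster has at least two points; names mine):

```python
import math

def silhouette(points, labels):
    """Average silhouette coefficient over all points."""
    scores = []
    for p, l in zip(points, labels):
        # a: average distance to the other members of p's own cluster.
        own = [q for q, m in zip(points, labels) if m == l and q != p]
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b: minimum over other clusters of the average distance to them.
        others = {m for m in labels if m != l}
        b = min(
            sum(math.dist(p, q) for q, m in zip(points, labels) if m == o)
            / sum(1 for m in labels if m == o)
            for o in others
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Tight, well-separated clusters push every per-object score toward 1, so the average is a handy single-number summary of a clustering.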


