Wednesday, March 9, 2016

Data Mining ch3 notes

Read this chapter in Aggarwal a few weeks ago, but forgot to write up my notes. I may need them tomorrow for the exam (if nothing else, writing them up is a good review). This chapter is more than just terms to memorize; there are several important formulas to remember.

Quantitative distance measures:
Lp norms: Dist(X, Y) = [Σ_i |x_i - y_i|^p]^(1/p)
Special cases:
  • L0 - not a true norm mathematically, but intuitively it's the number of dimensions in which X and Y differ, so that's how it's defined.
  • L1 - grid (AKA: taxicab or Manhattan) distance: sum of all the absolute differences along each dimension.
  • L2 - standard Euclidean distance.
  • L∞ - greatest distance along any single dimension (i.e., max_i |x_i - y_i|).
Issues with higher dimensionality: loss of contrast, measured as (max dist from origin - min dist from origin) / mean dist from origin. Lower values of p (possibly fractional values between 0 and 1) help in higher dimensions, but contrast degrades as p gets large, with the distance dominated by white noise rather than by the relevant feature differences.
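
A minimal sketch of these special cases with NumPy (the helper name lp_distance is mine, not from the book):

import numpy as np

def lp_distance(x, y, p):
    # Dist(X, Y) = [sum_i |x_i - y_i|^p]^(1/p), for p > 0
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.0, 5.0])

print(lp_distance(x, y, 1))        # L1 (Manhattan): 3.0
print(lp_distance(x, y, 2))        # L2 (Euclidean): ~2.236
print(np.max(np.abs(x - y)))       # L_infinity: 2.0
print(np.count_nonzero(x != y))    # "L0": number of differing dimensions: 2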

Proximity thresholding: divide data into discretized buckets for each dimension. The proximity set S(X,Y) is the set of dimensions where X and Y map to the same bucket. Then apply the distance formula only to dimensions in S(X,Y):

PSelect(X, Y) = [Σ_{i∈S(X,Y)} (1 - |x_i - y_i| / (range of bucket i))^p]^(1/p)

This gives a similarity measure rather than a distance (higher values indicate greater similarity).
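
A rough sketch of PSelect, assuming simple equi-width buckets per dimension (the book leaves the discretization scheme open; the function and parameter names here are mine):

import numpy as np

def pselect(x, y, lo, hi, n_buckets=4, p=2):
    # Only dimensions where x and y fall into the same bucket contribute.
    x, y, lo, hi = (np.asarray(a, dtype=float) for a in (x, y, lo, hi))
    width = (hi - lo) / n_buckets                       # range of one bucket per dimension
    bx = np.clip(((x - lo) // width).astype(int), 0, n_buckets - 1)
    by = np.clip(((y - lo) // width).astype(int), 0, n_buckets - 1)
    shared = bx == by                                   # the proximity set S(X, Y)
    if not shared.any():
        return 0.0
    terms = (1.0 - np.abs(x - y)[shared] / width[shared]) ** p
    return float(np.sum(terms) ** (1.0 / p))

# lo and hi give the per-dimension data range used to build the buckets
print(pselect([0.1, 0.9, 0.5], [0.15, 0.2, 0.55], lo=[0, 0, 0], hi=[1, 1, 1]))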

Mahalanobis distance: Maha(X, Y) = sqrt[(X - Y) Σ^(-1) (X - Y)^T], where Σ is the covariance matrix. Effectively Euclidean distance after the data have been rotated to their principal components and each component scaled to unit variance.
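
A minimal sketch, estimating Σ from the data itself (scipy.spatial.distance.mahalanobis does the same job if the inverse covariance is already in hand):

import numpy as np

def mahalanobis(x, y, data):
    # Maha(X, Y) = sqrt[(X - Y) Sigma^-1 (X - Y)^T], Sigma estimated from the data set
    sigma_inv = np.linalg.inv(np.cov(np.asarray(data, dtype=float), rowvar=False))
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ sigma_inv @ d))

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
print(mahalanobis(data[0], data[1], data))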

ISOMAP and geodesic distance: the distance between two points is the length of the shortest path in the k-nearest-neighbor graph (each hop is allowed only between a point and one of its k nearest neighbors). Once the all-pairs shortest-path distances have been computed on this graph, multidimensional scaling (MDS) can be used to embed the data into a lower-dimensional space where a Euclidean distance metric works.
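
A quick sketch using scikit-learn's Isomap class (assuming scikit-learn is available), which chains the k-NN graph, the all-pairs shortest paths, and the embedding in one call; the toy spiral data is mine:

import numpy as np
from sklearn.manifold import Isomap

# a noisy spiral where straight-line (Euclidean) distance is misleading
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 300))
X = np.column_stack([t * np.cos(t), t * np.sin(t), rng.normal(scale=0.1, size=300)])

# k-NN graph -> all-pairs shortest (geodesic) paths -> MDS-style embedding
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(embedding.shape)   # (300, 2); Euclidean distance here approximates the geodesic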

Shared Nearest-Neighbor Similarity: the number of k-nearest neighbors that two points have in common.
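
A brute-force sketch (my own helper, using plain Euclidean distance to find the k nearest neighbors):

import numpy as np

def snn_similarity(data, i, j, k=5):
    # number of points that appear in the k-NN lists of both point i and point j
    data = np.asarray(data, dtype=float)
    nn_i = set(np.argsort(np.linalg.norm(data - data[i], axis=1))[1:k + 1])  # skip i itself
    nn_j = set(np.argsort(np.linalg.norm(data - data[j], axis=1))[1:k + 1])  # skip j itself
    return len(nn_i & nn_j)

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 2))
print(snn_similarity(data, 0, 1, k=5))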

Generic Methods: Partition the data into local regions (typically via a clustering algorithm). For each pair, determine the appropriate region and apply the distance metric using the statistics for that region.
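
One possible realization, assuming k-means for the partitioning and a per-cluster Mahalanobis distance; both choices and all names are mine, since the book only describes the general recipe:

import numpy as np
from sklearn.cluster import KMeans

def local_distance(x, y, data, n_clusters=3):
    # Partition the data with k-means, pick the cluster whose centroid is closest
    # to the pair's midpoint, and use that cluster's covariance statistics.
    data = np.asarray(data, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(data)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    c = int(np.argmin(np.linalg.norm(km.cluster_centers_ - (x + y) / 2.0, axis=1)))
    sigma_inv = np.linalg.inv(np.cov(data[km.labels_ == c], rowvar=False))
    return float(np.sqrt((x - y) @ sigma_inv @ (x - y)))

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 3, (100, 2))])
print(local_distance(data[0], data[1], data))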

Categorical distance measures:
Overlap measure: S(x_i, y_i) = 1 if x_i = y_i, 0 otherwise. Overall similarity: S(X, Y) = Σ_i S(x_i, y_i).

Inverse frequency similarity (reduces noise from common attribute values): Let p_k(x) = proportion of records in which attribute k takes the value x. S(x_i, y_i) = 1/p_k(x_i)^2 if x_i = y_i, 0 otherwise.

Goodall measure: S(x_i, y_i) = 1 - p_k(x_i)^2 if x_i = y_i, 0 otherwise.
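
A small sketch covering all three categorical measures (the mode argument and the example data are mine):

from collections import Counter

def categorical_similarity(X, Y, data, mode="overlap"):
    # Per-attribute similarity summed over attributes.  `data` is the list of
    # records used to estimate p_k(x), the fraction of records with value x
    # in attribute k.
    total = 0.0
    for k, (xi, yi) in enumerate(zip(X, Y)):
        if xi != yi:
            continue
        p = Counter(rec[k] for rec in data)[xi] / len(data)
        if mode == "overlap":
            total += 1.0
        elif mode == "inverse_frequency":
            total += 1.0 / p ** 2
        elif mode == "goodall":
            total += 1.0 - p ** 2
    return total

data = [("red", "S"), ("red", "M"), ("blue", "M"), ("red", "L")]
print(categorical_similarity(("red", "M"), ("red", "S"), data, mode="goodall"))  # 1 - (3/4)^2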

Cosine distance:

cos(X, Y) = (X · Y) / (||X|| ||Y||) = Σ_i x_i y_i / [Σ_i x_i^2 · Σ_i y_i^2]^(1/2)

Inverse document frequency: weight each term i by id_i = log(n / n_i), where n is the total number of documents and n_i is the number of documents containing term i.

Normalized frequency: h(x_i) = f(x_i) · id_i, where f(x_i) is the (possibly damped) frequency of term i. The cosine measure is then simply applied to the normalized vector (h(x_1), ..., h(x_d)).
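
A sketch of idf-weighted cosine similarity on raw term counts, assuming the damping function f is just the identity (the book also allows damped frequencies such as square roots):

import numpy as np

def idf_cosine(counts, i, j):
    # Cosine similarity between documents i and j after idf weighting.
    # `counts` is an (n_docs x n_terms) matrix of raw term frequencies.
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    n_i = np.count_nonzero(counts > 0, axis=0)      # documents containing each term
    idf = np.log(n / n_i)                           # id_i = log(n / n_i)
    hx, hy = counts[i] * idf, counts[j] * idf       # h(x_i) = f(x_i) * id_i
    return float(hx @ hy / (np.linalg.norm(hx) * np.linalg.norm(hy)))

counts = [[2, 0, 1, 0],
          [1, 1, 0, 0],
          [0, 2, 0, 3]]
print(idf_cosine(counts, 0, 1))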

Jaccard coefficient: J(X, Y) = Σ_i h(x_i) h(y_i) / [Σ_i h(x_i)^2 + Σ_i h(y_i)^2 - Σ_i h(x_i) h(y_i)]
More commonly used for sparse binary data than text. In the binary domain, this reduces to:
J(X, Y) = |S_X ∩ S_Y| / |S_X ∪ S_Y|
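
In the binary/set form this is a one-liner (the example items are mine):

def jaccard(a, b):
    # Jaccard coefficient for sparse binary data represented as sets
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard({"milk", "bread", "eggs"}, {"milk", "eggs", "beer"}))  # 2/4 = 0.5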

Time series transformations
Behavioral attribute scaling (variance normalization) and translation (mean shifting)
Temporal (contextual) attribute translation (shifting to align series)
Temporal (contextual) attribute scaling (time warping). The warping may be fixed throughout the series or vary over time (dynamic time warping); see the sketch after this list.
Noncontiguity matching (deleting/ignoring noise segments)
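
A sketch of the two most mechanical pieces: behavioral-attribute normalization (mean shifting plus variance scaling) and a basic dynamic time warping distance using the standard DP recurrence, with no window constraint:

import numpy as np

def z_normalize(series):
    # behavioral attribute translation + scaling: zero mean, unit variance
    s = np.asarray(series, dtype=float)
    return (s - s.mean()) / s.std()

def dtw(a, b):
    # dynamic time warping distance between two (normalized) series
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = z_normalize([1, 2, 3, 4, 3, 2])
b = z_normalize([1, 1, 2, 3, 4, 3, 2, 2])
print(dtw(a, b))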

Discrete sequences
Edit distance (AKA Levenshtein distance): the minimum total cost to transform one string into another, given a cost for each type of edit operation.
Dynamic programming objective function:

Edit(i, j) = min{ Edit(i-1, j) + Cost(delete), Edit(i, j-1) + Cost(insert), Edit(i-1, j-1) + I_ij · Cost(replace) }

where I_ij = 1 if x_i ≠ y_j and 0 otherwise
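
The recurrence translates directly into a DP table; a sketch with unit costs as defaults (the cost parameters are my own):

def edit_distance(x, y, cost_ins=1, cost_del=1, cost_rep=1):
    # Levenshtein/edit distance via the recurrence above
    n, m = len(x), len(y)
    E = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        E[i][0] = i * cost_del
    for j in range(m + 1):
        E[0][j] = j * cost_ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            I_ij = 0 if x[i - 1] == y[j - 1] else 1
            E[i][j] = min(E[i - 1][j] + cost_del,
                          E[i][j - 1] + cost_ins,
                          E[i - 1][j - 1] + I_ij * cost_rep)
    return E[n][m]

print(edit_distance("kitten", "sitting"))  # 3 with unit costs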

Longest Common Subsequence: also solved via dynamic programming.
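
A matching sketch for the LCS length:

def lcs_length(x, y):
    # length of the longest common subsequence, via dynamic programming
    n, m = len(x), len(y)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m]

print(lcs_length("ABCBDAB", "BDCABA"))  # 4 (e.g. "BCBA")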

Graphs
Homophily measures: similarity based on short paths and/or many paths between nodes (a short-path sketch appears after this list)
Random walk-based similarity
Maximum common subgraph distance (extension of longest common subsequence)
Substructure-based similarity: counts matching substructures
Graph edit distance
Graph kernels
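
A toy sketch of just the short-path flavor of homophily similarity, assuming networkx is available and using 1/(1 + shortest-path length) as the similarity (that normalization is my choice, not from the book):

import networkx as nx

def short_path_similarity(G, u, v):
    # structural similarity that decays with shortest-path distance
    try:
        return 1.0 / (1.0 + nx.shortest_path_length(G, u, v))
    except nx.NetworkXNoPath:
        return 0.0

G = nx.karate_club_graph()
print(short_path_similarity(G, 0, 33))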


