Monday, January 25, 2016

Terms from Data Mining ch2

More poorly organized notes, primarily definitions from Data Mining.

Data type portability - ease with which a data type can be transformed to another type required for analysis.

Data type porting - the process of transforming data from one type to another. Can be as simple as casting or may involve significant ETL.

Imputation - assigning a best guess for a missing value based on other values in the observation and correlations in the data set.
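A minimal sketch of one way to impute, assuming a single missing column that is roughly linear in another column. The function name and the toy data are made up for illustration; real imputation would usually draw on more than one correlated attribute.

```python
import numpy as np

def impute_by_regression(x, y):
    """Fill NaNs in y using a least-squares line fitted on rows where y is known."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    known = ~np.isnan(y)
    # Fit y = slope * x + intercept on the complete observations only.
    slope, intercept = np.polyfit(x[known], y[known], deg=1)
    filled = y.copy()
    filled[~known] = slope * x[~known] + intercept
    return filled

# Toy data (hypothetical): y is exactly 2 * x, with one value missing.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, np.nan, 8.0, 10.0]
filled = impute_by_regression(x, y)
```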

Latent Semantic Analysis (LSA) - Transforming text to a nonsparse space of lesser dimension through use of word counts and covariances. Leverages the following two properties of text:

  • Synonymy - multiple values mean the same thing
  • Polysemy - a value can mean more than one thing based on context
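A rough sketch of the idea, assuming LSA is done with a truncated SVD of a document-by-term count matrix. The vocabulary and counts are a toy example, not from the book. Documents 0 and 1 use different words for the same thing ("car" vs "automobile") and only share "engine", yet they end up pointing the same way in the reduced space.

```python
import numpy as np

def lsa(doc_term_counts, k):
    """Project documents onto the top-k latent dimensions of the count matrix."""
    A = np.asarray(doc_term_counts, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]  # one k-dimensional row per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Columns: car, automobile, engine, bank, river (toy vocabulary).
counts = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1],
])
latent = lsa(counts, k=2)

# Similarity of documents 0 and 1 before and after the transformation.
raw_sim = cosine(counts[0].astype(float), counts[1].astype(float))
latent_sim = cosine(latent[0], latent[1])
```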

Symbolic Aggregate Approximation (SAX) - Method for converting a time series to a discrete sequence. Aggregates values over time windows and assigns each window a discrete symbol that approximates it.
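A small sketch of SAX under some simplifying assumptions: the series length is divisible by the number of windows, the series is z-normalized first, and the symbol breakpoints are equal-probability cut points of a standard normal distribution. The function name and defaults are my own.

```python
import numpy as np
from statistics import NormalDist

def sax(series, n_windows, alphabet="abcd"):
    """Convert a numeric series to a string with one symbol per window."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()  # z-normalize
    # Aggregate: mean of each equal-length window.
    window_means = x.reshape(n_windows, -1).mean(axis=1)
    # Breakpoints that split a standard normal into equally likely bins.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # Discretize: index of the bin each window mean falls into.
    bins = np.searchsorted(breakpoints, window_means)
    return "".join(alphabet[i] for i in bins)

word = sax([1, 1, 2, 2, 5, 5, 9, 9], n_windows=4)
```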

Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT) - methods for converting time series to series of real-valued coefficients.
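A quick sketch of the DFT half of this using NumPy's FFT routines, assuming the goal is to keep only the first few (low-frequency) coefficients as a compact representation. Note that `np.fft.rfft` returns complex numbers; each one can be stored as a pair of real values (real and imaginary parts). The helper names are hypothetical.

```python
import numpy as np

def dft_features(series, k):
    """Keep the first k DFT coefficients of a real-valued series."""
    return np.fft.rfft(series)[:k]

def reconstruct(coeffs, n):
    """Approximate the original length-n series from truncated coefficients."""
    # irfft treats the missing high-frequency coefficients as zero.
    return np.fft.irfft(coeffs, n=n)

t = np.arange(64)
series = 3.0 + 2.0 * np.sin(2 * np.pi * t / 64)  # constant plus one slow sine
coeffs = dft_features(series, k=4)
approx = reconstruct(coeffs, n=len(series))
```

For a smooth series like this one, a handful of coefficients is enough to rebuild it almost exactly; noisier series lose detail as k shrinks.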

Multidimensional Scaling (MDS) - Method for converting graph information (pairwise distances between nodes) to multidimensional vectors, so that distances between the vectors approximate the original distances. Typically uses Euclidean measure on resulting vectors; nonmetric MDS instead tries to preserve only the rank order of the distances.
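A sketch of classical (metric) MDS, assuming the input is a symmetric matrix of pairwise distances. This is the textbook double-centering construction, not necessarily the exact variant the book describes.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n objects in k dimensions from an n x n distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-center the squared distances to get an inner-product matrix.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]
    # Clip tiny negative eigenvalues caused by rounding.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

# Three objects whose distances form a 3-4-5 right triangle.
D = np.array([
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 5.0],
    [4.0, 5.0, 0.0],
])
points = classical_mds(D, k=2)
recovered = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
```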

Neighborhood graph - used to represent similarity along dimension of any type.

Sampling techniques:

  • Biased - intentionally shifting sampling distribution to get more "interesting" data.
  • Stratified - form of biased sampling that assures representation in each of several segments of data.
  • Reservoir - used to maintain a representative, fixed-size random sample from stream data as new items arrive.
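Of the sampling techniques above, reservoir sampling is the least obvious, so here is a minimal sketch of the standard approach (often called Algorithm R), assuming a uniform sample of size k is wanted from a stream of unknown length. The `rng` parameter is only there to make runs repeatable.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```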

Principal Component Analysis (PCA) - rotating the multidimensional basis to reduce the number of independent variables (or, more precisely, leveraging correlations in the data to capture the information using fewer variables).
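A bare-bones sketch of PCA via an eigendecomposition of the covariance matrix, assuming rows are observations and columns are variables. The toy data lie exactly on the line y = 2x, so a single component captures everything.

```python
import numpy as np

def pca(X, k):
    """Return (projected data, components, variances) for the top-k components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # mean-center each column
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, top]  # columns are the new axes
    return Xc @ components, components, eigvals[top]

# Two perfectly correlated variables (toy data).
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
projected, components, variances = pca(X, k=1)
total_variance = X.var(axis=0, ddof=1).sum()
```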

Singular Value Decomposition (SVD) - More general case of PCA.

Mean centering - transforming scalar values so they have a mean of zero. PCA and SVD yield the same results when values are mean centered.
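A quick numerical check of that equivalence, sketched with random data (the seed and shapes are arbitrary). After mean centering, the squared singular values from SVD match the eigenvalues of the scatter matrix that PCA diagonalizes.

```python
import numpy as np

def spectra(X):
    """Return (PCA eigenvalues, squared SVD singular values) of mean-centered X."""
    Xc = X - X.mean(axis=0)
    # PCA route: eigenvalues of the scatter matrix Xc^T Xc, largest first.
    eigvals = np.linalg.eigh(Xc.T @ Xc)[0][::-1]
    # SVD route: singular values of the centered data matrix.
    s = np.linalg.svd(Xc, compute_uv=False)
    return eigvals, s ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) * np.array([2.0, 1.0, 0.2])
pca_eigvals, svd_squared = spectra(X)
```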

Energy of a transformation - sum of the squared distances from the origin of the original observations projected onto an orthonormal subspace basis. Among all subspaces of a given dimension, the one chosen by SVD maximizes this energy.
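A small sketch to make the energy definition concrete, using random data and an arbitrary random orthonormal basis for comparison (both are assumptions for illustration). The top-k right singular vectors should never retain less energy than any other k-dimensional orthonormal basis.

```python
import numpy as np

def energy(X, basis):
    """Sum of squared lengths of the rows of X after projection onto basis.

    basis must have orthonormal columns.
    """
    return float(np.sum((X @ basis) ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
k = 2

# Basis chosen by SVD: the top-k right singular vectors.
_, s, Vt = np.linalg.svd(X, full_matrices=False)
svd_basis = Vt[:k].T

# Some other orthonormal basis of the same dimension, for comparison.
random_basis, _ = np.linalg.qr(rng.normal(size=(4, k)))

svd_energy = energy(X, svd_basis)
random_energy = energy(X, random_basis)
```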

Spectral decomposition (also, Schur decomposition) - Diagonalization of a symmetric matrix used by SVD.

Homophily - tendency (preference) of individuals to repeat joint ventures with those with whom they have prior experience. Leads to highly correlated data.




