Four major problems: association pattern mining, clustering, classification, outlier detection.
Data stream: continuous data that must be processed in a single pass because the raw data isn't retained (at least, not for long).
Major processing steps:
- Data Collection
- Preprocessing
  - Feature extraction - organizing raw data into collections of attributes
  - Data cleaning
  - Feature selection and transformation
  - May include Integration - combining with other data sources
- Analytic processing and algorithms.
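The preprocessing steps above can be sketched in a few lines. Everything here is hypothetical (the records, field names, and the rule for what counts as "clean" are made up for illustration), but it shows feature extraction (typing raw strings) followed by data cleaning (dropping bad records):

```python
# Hypothetical raw records; fields and values are made up.
raw = [
    {"temp_f": "70.1", "city": "Austin"},
    {"temp_f": "bad", "city": "Austin"},   # unparsable reading
    {"temp_f": "68.5", "city": None},      # missing context
]

def extract(rec):
    """Feature extraction: organize raw strings into typed attributes."""
    try:
        return {"temp_f": float(rec["temp_f"]), "city": rec["city"]}
    except (TypeError, ValueError):
        return None

def clean(recs):
    """Data cleaning: drop records with unparsable or missing values."""
    return [r for r in recs if r is not None and r["city"] is not None]

cleaned = clean([extract(r) for r in raw])
print(cleaned)  # only the first record survives
```

Feature selection/transformation and integration would be further passes over `cleaned` before any analytic processing.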
Basic data types (this is where there's a LOT of disparity among authors)
- Nondependency - data items that do not rely on neighboring fields or observations.
- Dependency - data items that are correlated either with other fields within an observation or with neighboring observations.
  - Implicit dependencies are correlations that are generally expected in the data (e.g., it was 70 degrees a minute ago, so it's probably around 70 now).
  - Explicit dependencies are built into the data model; the relationships are often represented as graphs.
Multi-dimensional or multivariate data - set of (records, instances, data points, vectors, entities, tuples, objects) consisting of (fields, attributes, dimensions, features).
Atomic data types
- Quantitative (measures): continuous, discrete, finite
  - Ratio: can be compared as ratios - i.e., quantitative with a true zero point
  - Interval (this term is from the other text): quantitative, but not valid as ratios.
- Categorical: discrete
  - Ordinal: can be ordered, but no arithmetic operations apply.
  - Nominal: merely a label.
- Binary and set data - special cases of categorical and quantitative.
Text - can be just about anything. I'd also put blob data in this group: even though text is technically more parsable, image and sound files can also be parsed and mined.
Contextual attributes: attributes that define the context of the data. We just call these "attributes" at work.
Behavioral attributes: attributes that represent values measured in the context. At work, we call these "measures". I don't know if we are just behind the times or if these are both accepted usages.
Multivariate time series. Multivariate data with a contextual time stamp. Duh.
Multivariate discrete sequences. Like multivariate time series, but with categorical measures, and only the order of observations matters (not the time between them).
Univariate discrete sequences. Fancy phrase for strings.
Spatial and spatiotemporal data. Extends time series to more contextual dimensions. Typically stops at 3-space x time, but it could be any number of dimensions as long as ordering and distance along each dimension are meaningful.
Another form of spatiotemporal data is when the spatial component is behavioral rather than contextual, i.e., a trajectory.
Networks and graphs. Alternative representations of dependency data.
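A graph representation of explicit dependencies can be as simple as an adjacency list. A minimal sketch (the node names are hypothetical):

```python
# Explicit dependencies between data items, stored as undirected edges.
edges = [("sensor_a", "sensor_b"), ("sensor_b", "sensor_c")]

# Build an adjacency list: each node maps to the nodes it depends on.
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)  # undirected, so add both directions

print(adj["sensor_b"])  # ['sensor_a', 'sensor_c']
```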
General 4 problems framed as operations on a data matrix.
- Classification: looking for relationships between columns in which one (or more) columns is predicted from others. Also known as supervised classification.
- Clustering: looking for similarities between rows. Can also be thought of as unsupervised classification.
- Association Pattern Mining: like classification, but looking at correlation between columns rather than predicting one from a set of others. Strictly, columns are (0,1) variables and we are trying to find subsets of the columns where all values are 1 at least s% (known as support) of the time.
- Anomaly Detection: looking for rows that don't cluster.
Association rule confidence: the confidence of A => B is the fraction of transactions supporting A that also support B. A => B is valid at support s and confidence c if the support of A ∪ B is at least s and the confidence of A => B is at least c.
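The support and confidence definitions above are easy to compute directly. A minimal sketch over a made-up set of transactions (the items and data are hypothetical; real mining would enumerate itemsets rather than check one rule):

```python
# Hypothetical transactions: each row is the set of items it contains.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B):
    """Fraction of transactions supporting A that also support B."""
    return support(A | B) / support(A)

print(support({"bread", "milk"}))       # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}))  # 0.5 / 0.75 = 2/3
```

So bread => milk holds at, say, s = 0.4 and c = 0.6, since support({bread, milk}) = 0.5 ≥ s and confidence = 2/3 ≥ c.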