Curse of Dimensionality
Data is represented in an $l$-dimensional feature space, where each sample is described by a vector of $l$ features (measurements).
The Curse of Dimensionality
In high-dimensional spaces ($l \gg 1$), several counterintuitive phenomena arise:
- Distance Concentration: In high dimensions, pairwise distances concentrate, so points tend to be nearly equidistant from one another.
- Convex Hull: In high dimensions, almost all samples lie on the convex hull.
Consequently, for a fixed sample size $N$, the feature space becomes increasingly sparse as $l$ grows: adding features beyond a certain point degrades, rather than improves, classifier performance (the peaking phenomenon). The sketch below illustrates distance concentration numerically.
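The following is a minimal sketch, assuming NumPy, that gives a rough numerical feel for distance concentration: as the dimension grows, the relative gap between the nearest and farthest point from a reference sample shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

for dim in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, dim))      # points in the unit hypercube
    d = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    contrast = (d.max() - d.min()) / d.min()   # relative spread of distances
    print(f"dim={dim:5d}  (d_max - d_min) / d_min = {contrast:.3f}")
```

The printed contrast shrinks as `dim` increases, which is exactly why nearest-neighbour reasoning degrades in high dimensions.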
Assessing Separability
Scatter Matrices
To assess the separability of classes, we decompose the total variance into specific scatter matrices.
Within-class scatter matrix ($S_w$): $S_w = \sum_{i=1}^{M} P_i \Sigma_i$
Here, $\Sigma_i$ is the covariance matrix of class $\omega_i$ and $P_i \approx n_i / N$ is its prior probability.
Between-class scatter matrix ($S_b$): $S_b = \sum_{i=1}^{M} P_i (\mu_i - \mu_0)(\mu_i - \mu_0)^T$
Where $\mu_0 = \sum_{i=1}^{M} P_i \mu_i$ is the global mean vector.
Total scatter matrix ($S_t$): $S_t = S_w + S_b$, i.e., the covariance of the data with respect to the global mean $\mu_0$.
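As a concrete illustration, here is a minimal NumPy sketch of the scatter matrices defined above; `X` is an $N \times l$ data matrix, `y` a vector of integer class labels, and the function name is illustrative, not a fixed API.

```python
import numpy as np

def scatter_matrices(X, y):
    N, l = X.shape
    mu0 = X.mean(axis=0)                                  # global mean
    Sw = np.zeros((l, l))
    Sb = np.zeros((l, l))
    for c in np.unique(y):
        Xc = X[y == c]
        Pc = len(Xc) / N                                  # class prior estimate
        mu_c = Xc.mean(axis=0)
        Sw += Pc * np.cov(Xc, rowvar=False, bias=True)    # within-class scatter
        diff = (mu_c - mu0).reshape(-1, 1)
        Sb += Pc * diff @ diff.T                          # between-class scatter
    St = Sw + Sb                                          # total scatter
    return Sw, Sb, St
```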
Mahalanobis Distance
To measure the distance between two classes, we extend the Mahalanobis distance. Assuming Gaussian distributions with equal covariance matrices, the distance is defined using the within-class scatter matrix: $d_m^2 = (\mu_1 - \mu_2)^T S_w^{-1} (\mu_1 - \mu_2)$.
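A short sketch of this class-to-class distance, assuming NumPy and the $S_w$ and class means computed as above:

```python
import numpy as np

def mahalanobis_between_classes(mu1, mu2, Sw):
    diff = mu1 - mu2
    # d = sqrt(diff^T Sw^{-1} diff), solving the linear system instead of inverting Sw
    return float(np.sqrt(diff @ np.linalg.solve(Sw, diff)))
```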
Fisher Discriminant Ratio (FDR)
To quantify how well individual features separate the classes, we combine the within-class and between-class scatter into the Fisher Discriminant Ratio.
For a 1-D, two-class problem, the criterion is defined as the ratio of the squared distance between the class means to the sum of their variances: $\text{FDR} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$.
This concept generalizes to finding a projection vector $w$ that maximizes the ratio of between-class to within-class scatter, $J(w) = \frac{w^T S_b w}{w^T S_w w}$ (the generalized Rayleigh quotient).
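For illustration, a minimal NumPy sketch of the 1-D FDR evaluated for every feature of a two-class problem (variable names are assumptions, not a fixed API):

```python
import numpy as np

def fdr_per_feature(X1, X2):
    """Fisher Discriminant Ratio of each feature for a two-class problem."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    var1, var2 = X1.var(axis=0), X2.var(axis=0)
    # larger FDR => the feature separates the two classes better
    return (mu1 - mu2) ** 2 / (var1 + var2)
```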
Dimensionality Reduction Strategies
To address the curse of dimensionality, we reduce dimensions through two main methods:
- Feature Selection: Selecting a subset of $m$ of the original $l$ features. The original features are retained.
- Feature Extraction: Mapping the $l$ measurements to $m$ new measurements. The original features are transformed.
Feature Selection Strategy
Selecting a subset of $m$ features out of the original $l$ so that a chosen class-separability criterion (e.g., the FDR) is maximized.
Exhaustive Feature Selection: finding the optimal subset requires checking all $\binom{l}{m}$ possible feature combinations. This becomes computationally infeasible as $l$ and $m$ grow, as the small illustration below shows.
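A small illustration of this combinatorial growth using Python's `math.comb` (the example sizes are arbitrary):

```python
from math import comb

# number of size-m subsets that an exhaustive search would have to evaluate
for l, m in [(10, 5), (50, 10), (100, 20)]:
    print(f"l={l:3d}, m={m:2d}: {comb(l, m):,} candidate subsets")
```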
To avoid the cost of exhaustive search, heuristic methods are used.
- Forward Selection (FS): Start with an empty set. Iteratively add the feature that maximizes the chosen criterion when combined with the already selected set (a minimal sketch follows this list).
- Backward Selection (BS): Start with the full set of features. Iteratively remove the feature whose removal affects the criterion value the least, until the desired number of features remains.
- The Nesting Effect: A major limitation of FS and BS. Once a feature has been added (FS) or removed (BS), that decision cannot be undone, even if it later proves suboptimal.
- Bidirectional Selection: Run FS and BS simultaneously. Ensures convergence by preventing FS from selecting features BS has removed and vice-versa.
- Floating Selection: Addresses the nesting effect by allowing "backtracking": backward steps are permitted after a forward step (or vice versa) whenever they improve the criterion.
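Below is a minimal sketch of greedy forward selection, assuming a user-supplied `criterion(X_subset, y)` that scores a candidate feature subset (e.g., a separability measure or cross-validated accuracy); the function and variable names are illustrative.

```python
import numpy as np

def forward_selection(X, y, criterion, n_features):
    """Greedily add the feature that most improves the criterion."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        # score each candidate feature together with the already selected ones
        scores = {f: criterion(X[:, selected + [f]], y) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```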
Feature Extraction Methods
Transforming the original $l$ measurements into $m$ new measurements through a (typically linear) mapping.
Principal Component Analysis (PCA) is an unsupervised feature extraction method and is considered the most classical approach to dimensionality reduction. It transforms the data into a new coordinate system where the axes (principal components) are ordered by the amount of variance they capture.
- Goal: retain as much variance as possible, which is equivalent to minimizing the reconstruction error.
- Process:
- The data $X$ is mean-centered: $\bar{X} = X - \mu$.
- The covariance matrix is computed as $\Sigma = \frac{1}{N} \bar{X}^T \bar{X}$.
- Eigen-decomposition is performed on the covariance matrix: $\Sigma = U \Lambda U^T$, where the columns of $U$ are the eigenvectors and the diagonal of $\Lambda$ holds the eigenvalues.
- The eigenvectors are ordered by their eigenvalues from largest to smallest.
- The projection matrix $W$ is constructed by taking the first $m$ columns of the ordered eigenvector matrix $U$.
- Finally, the mean-centered data is projected onto this new subspace to obtain the lower-dimensional data $Y = \bar{X} W$.
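Putting the steps together, a minimal NumPy sketch of PCA (the function name and the covariance normalization are assumptions):

```python
import numpy as np

def pca(X, m):
    X_centered = X - X.mean(axis=0)            # 1. mean-center the data
    cov = np.cov(X_centered, rowvar=False)     # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # 3. eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]          # 4. sort eigenvalues largest to smallest
    W = eigvecs[:, order[:m]]                  # 5. projection matrix: first m eigenvectors
    return X_centered @ W                      # 6. project onto the m-dimensional subspace
```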
Linear Discriminant Analysis (LDA) is a supervised feature extraction method. The goal is to find the projection vector $w$ that maximizes the Fisher criterion $J(w) = \frac{w^T S_b w}{w^T S_w w}$, i.e., that spreads the class means apart while keeping each class compact; the solution is given by the leading eigenvectors of $S_w^{-1} S_b$.
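A corresponding minimal LDA sketch, assuming NumPy and reusing the scatter-matrix definitions from above (names are illustrative):

```python
import numpy as np

def lda(X, y, m):
    # within- and between-class scatter (same definitions as above)
    N, l = X.shape
    mu0 = X.mean(axis=0)
    Sw, Sb = np.zeros((l, l)), np.zeros((l, l))
    for c in np.unique(y):
        Xc = X[y == c]
        Pc = len(Xc) / N
        mu_c = Xc.mean(axis=0)
        Sw += Pc * np.cov(Xc, rowvar=False, bias=True)
        diff = (mu_c - mu0).reshape(-1, 1)
        Sb += Pc * diff @ diff.T
    # leading eigenvectors of Sw^{-1} Sb span the discriminant subspace
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:m]].real   # at most (number of classes - 1) useful directions
    return X @ W
```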
Standard PCA and LDA are linear methods, so they may perform poorly when the data are not linearly separable in the original space. One remedy is to project the data into a higher-dimensional space in which they become linearly separable; to avoid the resulting increase in computational complexity, we use the Kernel Trick, which evaluates inner products in that space implicitly through a kernel function, as in the sketch below.
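As an illustration of the idea (a minimal NumPy sketch, not a specific library API), here is kernel PCA with an RBF kernel, where the high-dimensional feature space is never constructed explicitly:

```python
import numpy as np

def kernel_pca(X, m, gamma=1.0):
    # RBF kernel matrix: implicit inner products in a high-dimensional space
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)
    # center the kernel matrix (mean-centering in the implicit feature space)
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigen-decompose and keep the m leading components
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:m]
    # projections of the training data onto the principal components
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
```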