Curse of Dimensionality

Data is represented in a p-dimensional space where each dimension is a feature.

The Curse of Dimensionality

In a high-dimensional space, the amount of data needed to maintain the same sample density grows exponentially with the number of dimensions p, so the data becomes increasingly sparse.

Consequently, for a fixed sample size N, increasing the number of features p initially reduces the classification error, but the error eventually increases again due to the curse of dimensionality.
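As a toy illustration (the numbers are chosen for illustration, not from the source): if 10 samples per axis are enough to cover the unit interval at a given density, then covering the unit hypercube [0, 1]^p at the same density requires 10^p samples.

```python
# Samples needed to keep ~10 points per axis on the unit hypercube [0, 1]^p.
for p in (1, 2, 3, 10):
    print(f"p = {p:2d}: about {10 ** p:,} samples for the same density")
```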

Assessing Separability

Scatter Matrices

To assess the separability of classes, we decompose the total variance into specific scatter matrices.

Within-class scatter matrix ($S_W$): represents the average compactness of the individual classes. It is calculated as the weighted sum of the per-class covariance matrices:

$$S_W = \sum_{i=1}^{M} \frac{n_i}{N} \Sigma_i$$

Here, M is the number of classes, N is the total number of samples, $n_i$ is the number of samples in class $\omega_i$, and $\Sigma_i$ is the covariance matrix of class $\omega_i$. For good classification performance, a smaller $S_W$ is preferred.

Between-class scatter matrix ($S_B$): measures the separation between the classes. It is calculated as the weighted average of the distances of the individual class means from the global mean:

$$S_B = \sum_{i=1}^{M} \frac{n_i}{N} (\mu_i - \mu)(\mu_i - \mu)^T$$

Here, $\mu_i$ is the mean of class $\omega_i$ and $\mu$ is the overall (global) mean. For good classification performance, a larger $S_B$ is preferred.

Total scatter matrix ($S_T$): represents the overall spread (variance) of the entire dataset and is the sum of the within-class and between-class scatter matrices:

$$S_T = S_W + S_B$$
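A minimal NumPy sketch (illustrative, not from the source) of these three matrices; with biased (divide-by-N) class covariances, the decomposition $S_T = S_W + S_B$ matches the biased covariance of the pooled data:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_W), between-class (S_B) and total (S_T) scatter
    matrices for data X of shape (N, p) with class labels y."""
    N, p = X.shape
    mu = X.mean(axis=0)                                  # global mean
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for c in np.unique(y):
        Xc = X[y == c]
        n_i = len(Xc)
        mu_i = Xc.mean(axis=0)
        # weighted (biased) class covariance -> within-class contribution
        S_W += (n_i / N) * np.cov(Xc, rowvar=False, bias=True)
        # weighted outer product of the mean offset -> between-class contribution
        d = (mu_i - mu).reshape(-1, 1)
        S_B += (n_i / N) * (d @ d.T)
    return S_W, S_B, S_W + S_B

# Sanity check: S_T equals the biased covariance of the whole dataset.
X = np.random.randn(300, 4)
y = np.random.randint(0, 3, size=300)
S_W, S_B, S_T = scatter_matrices(X, y)
assert np.allclose(S_T, np.cov(X, rowvar=False, bias=True))
```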

Mahalanobis Distance

Mahalanobis Distance $D_M$ measures the distance between a point and a distribution, accounting for the variance in each direction through the covariance matrix $\Sigma$. It is defined as $D_M = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$. If $\Sigma$ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance.

To measure the distance between two classes, the Mahalanobis distance is extended. Assuming Gaussian class distributions with equal covariance matrices, the distance between the class means is defined using the within-class scatter matrix $S_W$ rather than a single global covariance matrix:

$$D_M = \sqrt{(\mu_1 - \mu_2)^T S_W^{-1} (\mu_1 - \mu_2)}$$
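A small NumPy sketch of both forms (assuming 1-D arrays for the point and means, and an invertible $\Sigma$ and $S_W$):

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance of point x from a distribution (mean mu, covariance Sigma)."""
    d = x - mu
    return float(np.sqrt(d @ np.linalg.inv(Sigma) @ d))

def class_mahalanobis(mu1, mu2, S_W):
    """Distance between two class means, using the within-class scatter S_W
    in place of a single global covariance matrix."""
    d = mu1 - mu2
    return float(np.sqrt(d @ np.linalg.inv(S_W) @ d))

# With Sigma = I the Mahalanobis distance reduces to the Euclidean distance.
x, mu = np.array([1.0, 2.0]), np.zeros(2)
assert np.isclose(mahalanobis(x, mu, np.eye(2)), np.linalg.norm(x - mu))
```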

Fisher Discriminant Ratio (FDR)

To quantify how well individual features separate the classes, the within-class and between-class scatter are combined into the Fisher Discriminant Ratio.

For a one-dimensional, two-class problem, the criterion is defined as the ratio of the squared distance between the class means to the sum of the class variances:

$$J_F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$$

This concept generalizes to finding a projection vector $a$ that maximizes the ratio of between-class scatter to within-class scatter:

$$J_F(a) = \frac{a^T S_B a}{a^T S_W a}$$
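A short sketch of both forms (an illustrative implementation, not from the source): `fdr_per_feature` scores each feature individually for a two-class problem, and `fisher_criterion` evaluates the generalized ratio for a given projection vector.

```python
import numpy as np

def fdr_per_feature(X1, X2):
    """1-D Fisher Discriminant Ratio of every feature for a two-class problem:
    (mu1 - mu2)^2 / (sigma1^2 + sigma2^2), computed feature-wise."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    var1, var2 = X1.var(axis=0), X2.var(axis=0)
    return (mu1 - mu2) ** 2 / (var1 + var2)

def fisher_criterion(a, S_B, S_W):
    """Generalized criterion J_F(a) = (a^T S_B a) / (a^T S_W a)."""
    return float(a @ S_B @ a) / float(a @ S_W @ a)
```

Features can then be ranked by this score, e.g. `np.argsort(fdr_per_feature(X1, X2))[::-1]`.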

Dimensionality Reduction Strategies

To address the curse of dimensionality, we reduce dimensions through two main methods:

  1. Feature Selection: selecting a subset of d of the original p features. The original features are retained.
  2. Feature Extraction: mapping the p original measurements to d new measurements. The original features are transformed.

Feature Selection Strategy

Feature selection keeps a subset of d of the p original features.

Exhaustive Feature Selection: finding the optimal subset requires evaluating every possible feature combination, which becomes computationally infeasible as p grows.

To avoid the cost of exhaustive search, heuristic methods are used.
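As a rough sense of scale (illustrative numbers): selecting d of p features exhaustively means scoring $\binom{p}{d}$ candidate subsets. The sketch below prints a few of these counts and outlines one common heuristic, sequential forward selection, where `J` stands for an assumed subset-scoring criterion (e.g. a separability measure such as the FDR):

```python
from math import comb

# Exhaustive search: C(p, d) candidate subsets explodes quickly.
for p, d in [(10, 5), (50, 10), (100, 10)]:
    print(f"C({p}, {d}) = {comb(p, d):,} subsets to evaluate")

def forward_selection(n_features, d, J):
    """Sequential forward selection: greedily add the feature whose addition
    maximizes the criterion J(subset) until d features are chosen."""
    selected = []
    while len(selected) < d:
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
    return selected
```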

Feature Extraction Methods

Transforming the original p features into a new set of d features.

Principal Component Analysis (PCA) is an unsupervised feature extraction method and is considered the most classical approach to dimensionality reduction. It transforms the data into a new coordinate system where the axes (principal components) are ordered by the amount of variance they capture.
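A minimal eigendecomposition sketch of PCA (in practice a library implementation such as scikit-learn's `PCA` would typically be used):

```python
import numpy as np

def pca(X, d):
    """Project X (N x p) onto its top-d principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]        # indices of the d largest
    return Xc @ eigvecs[:, top]                # N x d projected data
```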

Linear Discriminant Analysis (LDA) is a supervised feature extraction method. Its goal is to find the projection vector $a$ that maximizes the separability between the classes; to achieve this, LDA maximizes the Fisher criterion $J_F(a)$.
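The maximizing directions are the leading eigenvectors of $S_W^{-1} S_B$. A minimal sketch (reusing the hypothetical `scatter_matrices` helper sketched earlier):

```python
import numpy as np

def lda_direction(X, y):
    """Projection vector a that maximizes J_F(a), taken as the eigenvector of
    S_W^{-1} S_B with the largest eigenvalue."""
    S_W, S_B, _ = scatter_matrices(X, y)       # helper sketched earlier
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
    a = eigvecs[:, np.argmax(eigvals.real)].real
    return a / np.linalg.norm(a)
```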

Standard PCA and LDA are linear methods, so they may perform poorly when the data is not linearly separable in its original space. One remedy is to project the data into a higher-dimensional space in which it becomes linearly separable; to avoid the resulting increase in computational complexity, the kernel trick is used.
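A brief scikit-learn sketch of this idea using Kernel PCA with an RBF kernel (the dataset and `gamma` value are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Kernel PCA implicitly maps the data into a high-dimensional feature space and
# performs PCA there; the kernel trick avoids computing the map explicitly,
# since only pairwise kernel evaluations are needed.
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
```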
