Computational Learning Theory
Under what conditions is a learning algorithm guaranteed to perform well?
Probably Approximately Correct
The core philosophy of Probably Approximately Correct (PAC) is that we cannot expect a learner to learn a concept perfectly every single time (because we rely on a random sample of data). Instead, we aim for a learner that will probably (with high confidence) produce a hypothesis that is approximately correct (has very low error).
Formal definitions:
- $X$ (Instance Space): The set of all possible examples/instances.
- $C$ (Concept Class): The set of target concepts we want to learn. A concept is a function $c: X \to \{0, 1\}$.
- $\mathcal{D}$ (Distribution): A fixed, but unknown, probability distribution over the instances $X$.
- $H$ (Hypothesis Space): The set of possible hypotheses $h$ the learner considers to approximate the concept $c$.
- True Error ($\text{error}_{\mathcal{D}}(h)$): The probability that the hypothesis $h$ disagrees with the true concept $c$ on a randomly drawn instance from $\mathcal{D}$: $\text{error}_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$
- Approximately Correct: The learner outputs a hypothesis $h$ such that the error is bounded by a small parameter $\epsilon$: $\text{error}_{\mathcal{D}}(h) \leq \epsilon$
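To make the true-error definition concrete, here is a minimal Python sketch that estimates $\text{error}_{\mathcal{D}}(h)$ by sampling from $\mathcal{D}$; the uniform distribution, the interval concept, and the hypothesis are all invented for illustration:

```python
import random

def concept(x):
    """Target concept c: positive iff x lies in [0.2, 0.7] (an illustrative choice)."""
    return 1 if 0.2 <= x <= 0.7 else 0

def hypothesis(x):
    """A slightly-off hypothesis h the learner might output (an illustrative choice)."""
    return 1 if 0.25 <= x <= 0.7 else 0

def estimate_true_error(h, c, n_samples=100_000):
    """Estimate error_D(h) = Pr_{x~D}[h(x) != c(x)] with D = Uniform(0, 1)."""
    return sum(h(x) != c(x) for x in (random.random() for _ in range(n_samples))) / n_samples

print(estimate_true_error(hypothesis, concept))  # about 0.05: the missed strip [0.2, 0.25)
```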
A major goal of PAC theory is determining a bound on the number of training samples $m$ required to guarantee that the learner is probably (with probability at least $1 - \delta$) approximately (within error $\epsilon$) correct.
- Discrete Hypothesis Spaces: A consistent learner is one that produces a hypothesis with zero error on the training set. However, fitting the training data does not guarantee that the true error is low. The Version Space is the set of all hypotheses consistent with the training data. We want this version space to be $\epsilon$-exhausted, meaning it contains no hypothesis with a true error greater than $\epsilon$. The bound on the number of samples required to ensure this with probability at least $1 - \delta$ is: $m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ (see the sketch below for a worked example).
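As a minimal sketch, the bound can be evaluated directly; the values chosen for $|H|$, $\epsilon$, and $\delta$ are arbitrary:

```python
import math

def sample_bound_finite_h(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) for a consistent learner over a finite H."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# e.g. |H| = 2**20 hypotheses, 5% error, 99% confidence
print(sample_bound_finite_h(2**20, epsilon=0.05, delta=0.01))  # 370 samples suffice
```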
- Continuous Hypothesis Spaces: The slides provide a concrete example of learning an axis-parallel rectangle in a 2D plane.
  - Strategy: the learner finds the tightest-fit rectangle $h$ around the positive training examples (see the sketch below).
  - Analysis: The error is the area difference between the true rectangle $c$ and the learned rectangle $h$. By analyzing the probability that training points "miss" the error strips along the edges of $c$, we can derive a sample bound.
  - The bound: For axis-parallel rectangles, the sample complexity is $m \geq \frac{4}{\epsilon}\ln\frac{4}{\delta}$.
  - This shows that even with infinitely many possible rectangles, we can find a finite sample size that guarantees PAC learning.
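A minimal sketch of the tightest-fit strategy, assuming the positive training examples are given as (x, y) pairs; the data points and parameter values are invented for illustration:

```python
import math

def tightest_fit_rectangle(positives):
    """Return the smallest axis-parallel rectangle (x_min, x_max, y_min, y_max)
    that contains all positive training examples."""
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return min(xs), max(xs), min(ys), max(ys)

def predict(rect, point):
    """Classify a point as positive iff it falls inside the learned rectangle."""
    x_min, x_max, y_min, y_max = rect
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def rectangle_sample_bound(epsilon, delta):
    """Samples sufficient for the tightest-fit learner: m >= (4/epsilon) * ln(4/delta)."""
    return math.ceil(4 / epsilon * math.log(4 / delta))

rect = tightest_fit_rectangle([(1.0, 2.0), (2.5, 1.5), (2.0, 3.0)])
print(rect, predict(rect, (1.5, 2.0)))       # (1.0, 2.5, 1.5, 3.0) True
print(rectangle_sample_bound(0.05, 0.01))    # 480
```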
- VC-dimension: For general continuous hypothesis spaces, where we cannot count $|H|$, we use the Vapnik-Chervonenkis (VC) dimension. The VC-dimension replaces $\ln|H|$ in the sample complexity formula. It measures the "capacity" or complexity of the hypothesis space (roughly, how flexible the model is), via the largest set of points the hypothesis space can shatter, i.e. label in every possible way (illustrated in the sketch below).
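As an illustration of capacity, the sketch below (a hypothetical example, not from the slides) brute-forces the shattering check for 1-D threshold classifiers, whose VC-dimension is 1 even though the class itself is infinite:

```python
def threshold_classifier(theta):
    """h_theta(x) = 1 iff x >= theta: the infinite class of 1-D threshold hypotheses."""
    return lambda x: 1 if x >= theta else 0

def is_shattered(points, thetas):
    """Check whether the threshold family (restricted to a grid of candidate thetas)
    realizes every one of the 2^n possible labelings of the given points."""
    labelings = {tuple(threshold_classifier(t)(x) for x in points) for t in thetas}
    return len(labelings) == 2 ** len(points)

thetas = [i / 100 for i in range(-100, 201)]    # dense grid of candidate thresholds
print(is_shattered([0.5], thetas))              # True: any single point can be shattered
print(is_shattered([0.3, 0.7], thetas))         # False: the labeling (1, 0) is unreachable
# So the VC-dimension is 1 here, even though ln|H| is unbounded.
```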
There is a fundamental trade-off between the number of samples $m$, the error bound $\epsilon$, the confidence $1 - \delta$, and the complexity (size or VC-dimension) of the hypothesis space: demanding lower error, higher confidence, or a richer hypothesis space all require more samples.
No Free Lunch Theorem: Averaged across all possible problems, no single learner is better than any other. A learner only performs well if its hypothesis space is suited to the problem.
Weak vs Strong Learners
A Strong Learner (PAC learner) is a learner that can achieve an arbitrarily small error $\epsilon$ with an arbitrarily high confidence $1 - \delta$.
A Weak Learner is a learner that performs only slightly better than random guessing, with some fixed error rate and a fixed confidence.
It has been mathematically proven that if a "weak" learner exists, it can be "boosted" into a strong learner. This means that the strict PAC requirements are not always necessary; we just need a learner that is slightly better than a coin toss.
The original idea behind boosting involves resampling the training data (a minimal sketch follows the steps below):
1. Train a weak learner on the training data.
2. Identify the objects that were misclassified.
3. Repeat steps 1-2 on the misclassified objects, some number of times.
4. The final prediction is made by a majority vote of the collection of weak learners.
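A minimal sketch of this resampling scheme, assuming a 1-D decision stump as the weak learner and labels in $\{-1, +1\}$; the stump and the toy data are illustrative choices:

```python
def train_stump(data):
    """Weak learner: a 1-D decision stump, choosing the threshold/polarity with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in data}):
        for polarity in (1, -1):
            errors = sum((polarity if x >= threshold else -polarity) != y for x, y in data)
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    _, threshold, polarity = best
    return lambda x: polarity if x >= threshold else -polarity

def boost_by_resampling(data, rounds=3):
    """Original boosting idea: repeatedly retrain on the objects the last learner got wrong."""
    learners, current = [], list(data)
    for _ in range(rounds):
        h = train_stump(current)
        learners.append(h)
        misclassified = [(x, y) for x, y in data if h(x) != y]
        if not misclassified:        # nothing left to fix
            break
        current = misclassified      # next round focuses on the hard objects
    return learners

def majority_vote(learners, x):
    """Final prediction: (signed) majority vote over the collection of weak learners."""
    return 1 if sum(h(x) for h in learners) >= 0 else -1

data = [(0.1, -1), (0.2, -1), (0.4, 1), (0.6, 1), (0.8, -1)]   # 1-D toy set, labels in {-1, +1}
ensemble = boost_by_resampling(data)
print([majority_vote(ensemble, x) for x, _ in data])           # [-1, -1, 1, 1, -1]: all correct
```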
Adaptive Boosting (AdaBoost)
AdaBoost is a constructive algorithm designed to implement the theoretical promise of boosting: turning a weak learner into a strong PAC learner. Instead of resampling the data like original boosting, AdaBoost works by explicitly re-weighting the training instances.
- Linear Additive Model: AdaBoost builds a strong classifier $H(x)$ by creating a weighted sum of weak classifiers $h_t(x)$. The final decision is a linear combination of these weak learners: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$
- Exponential Loss: The algorithm optimizes the model by minimizing an exponential loss function rather than a standard 0-1 loss. The loss over $N$ training examples $(x_i, y_i)$ with $y_i \in \{-1, +1\}$ is defined as: $L = \sum_{i=1}^{N} e^{-y_i \sum_t \alpha_t h_t(x_i)}$
- Optimizing all weights $\alpha_t$ and classifiers $h_t$ simultaneously is an open problem. Therefore, AdaBoost uses a "greedy" incremental approach:
  - At each step $t$, the algorithm fixes the previously learned ensemble ($H_{t-1}$) and only attempts to find the best new weak classifier $h_t$ and its weight $\alpha_t$ to add to the sum (see the sketch below).
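A minimal sketch of the additive model and the exponential loss, using two invented stumps with hand-picked weights; this is not the full AdaBoost training loop, which is described next:

```python
import math

def ensemble_score(stumps, alphas, x):
    """Unthresholded additive model: sum_t alpha_t * h_t(x)."""
    return sum(a * h(x) for h, a in zip(stumps, alphas))

def predict(stumps, alphas, x):
    """Final decision: the sign of the weighted sum of weak classifiers."""
    return 1 if ensemble_score(stumps, alphas, x) >= 0 else -1

def exponential_loss(stumps, alphas, data):
    """L = sum_i exp(-y_i * H(x_i)): confidently correct points cost almost nothing,
    confidently wrong points are penalized exponentially."""
    return sum(math.exp(-y * ensemble_score(stumps, alphas, x)) for x, y in data)

# Two hypothetical weak classifiers (1-D stumps) with hand-picked weights, toy labels in {-1, +1}.
stumps = [lambda x: 1 if x >= 0.4 else -1,
          lambda x: -1 if x >= 0.8 else 1]
alphas = [1.0, 0.6]
data = [(0.1, -1), (0.5, 1), (0.9, -1)]
print([predict(stumps, alphas, x) for x, _ in data])   # [-1, 1, 1]: the last point is still wrong
print(exponential_loss(stumps, alphas, data))          # about 2.36
```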
The full algorithm:
- Initialize Weights: Start by assigning every training object an equal weight $w_i = 1/N$.
- Train Weak Classifier: Train a new classifier $h_t$ that minimizes the weighted error on the training set. This forces the learner to focus on objects with high weights (those that were previously misclassified).