E2 236 Foundations of ML¶

Lab 5: Gaussian Processes¶

Regression & Classification with Laplace Approximation¶

Topics covered:

  • GP priors: mean functions and covariance kernels
  • GP regression (exact posterior)
  • GP classification via the Laplace approximation
  • Predictive distributions and uncertainty quantification

Instructions: Fill in all sections marked # YOUR CODE HERE.


0. Setup & Imports¶

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import solve, cho_factor, cho_solve
from scipy.special import expit   # sigmoid σ(x) = 1/(1+e^{-x})
from scipy.optimize import minimize
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

np.random.seed(0)
%matplotlib inline
plt.rcParams.update({'figure.dpi': 110, 'font.size': 11})

1. The GP Prior¶

A Gaussian Process is specified by a mean function $m(\mathbf{x})$ and a kernel (covariance function) $k(\mathbf{x}, \mathbf{x}')$.

We write $f \sim \mathcal{GP}(m, k)$.

1.1 Implement the squared-exponential kernel¶

$$k_{\text{SE}}(x, x') = \sigma_f^2 \exp\!\left(-\frac{(x-x')^2}{2\ell^2}\right)$$

In [ ]:
def se_kernel(X1, X2, length_scale=1.0, signal_var=1.0):
    """
    Squared-exponential (RBF) kernel.

    Parameters
    ----------
    X1 : ndarray (n, d)
    X2 : ndarray (m, d)
    length_scale : float  ℓ
    signal_var   : float  σ_f²

    Returns
    -------
    K : ndarray (n, m)
    """
    # YOUR CODE HERE
    raise NotImplementedError
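For reference, one vectorised way to realise this formula — kept under a separate, illustrative name (`se_kernel_sketch`) so it does not pre-empt the exercise stub above:

```python
import numpy as np

def se_kernel_sketch(X1, X2, length_scale=1.0, signal_var=1.0):
    # Pairwise squared distances via broadcasting: (n,1,d) - (1,m,d) -> (n,m,d)
    sq_dists = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    # sigma_f^2 * exp(-d^2 / (2 * ell^2))
    return signal_var * np.exp(-0.5 * sq_dists / length_scale ** 2)
```

Note that the diagonal of $k(X, X)$ equals $\sigma_f^2$, a quick sanity check for any implementation.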

1.2 Sample from the GP Prior¶

Draw 5 functions from $\mathcal{GP}(0, k_{\text{SE}})$ on $x \in [-5, 5]$ for three length-scales: $\ell \in \{0.5, 1.0, 2.0\}$.

Hint: Compute the kernel matrix $K$, then sample $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$ by drawing $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, I)$ and computing $L\mathbf{z}$ where $L = \text{chol}(K + \epsilon I)$.
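The hint above can be sketched as follows; `sample_gp_prior` is an illustrative helper (not a required name) that accepts any kernel with the `se_kernel` call signature:

```python
import numpy as np

def sample_gp_prior(X, kernel, n_samples=5, jitter=1e-8, rng=None):
    rng = np.random.default_rng(rng)
    # jitter * I keeps the Cholesky factorisation numerically stable
    K = kernel(X, X) + jitter * np.eye(len(X))
    L = np.linalg.cholesky(K)
    Z = rng.standard_normal((len(X), n_samples))  # z ~ N(0, I)
    return L @ Z                                  # each column ~ N(0, K)
```

Plotting the columns of the result for each $\ell$ then gives the three panels.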

In [ ]:
# YOUR CODE HERE
In [ ]:
 

2. GP Regression (Exact Posterior)¶

Given observations $(\mathbf{X}, \mathbf{y})$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$, the posterior is also a GP with

$$\boldsymbol{\mu}_* = K_*^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$

$$\Sigma_* = K_{**} - K_*^\top (K + \sigma_n^2 I)^{-1} K_*$$

where $K = k(\mathbf{X}, \mathbf{X})$, $K_* = k(\mathbf{X}, \mathbf{X}_*)$, $K_{**} = k(\mathbf{X}_*, \mathbf{X}_*)$.
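These two equations translate almost line-for-line into code. A sketch using the Cholesky routines imported in §0 (the helper name `gp_posterior_sketch` is illustrative; the class below should organise the same computation into fit/predict):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior_sketch(X, y, X_star, kernel, noise_var=0.04):
    K = kernel(X, X) + noise_var * np.eye(len(X))
    K_s = kernel(X, X_star)
    K_ss = kernel(X_star, X_star)
    c_low = cho_factor(K)                        # Cholesky of K + sig_n^2 I
    alpha = cho_solve(c_low, y)                  # (K + sig_n^2 I)^{-1} y
    mu = K_s.T @ alpha                           # posterior mean
    cov = K_ss - K_s.T @ cho_solve(c_low, K_s)   # posterior covariance
    return mu, cov
```

Using a Cholesky solve rather than forming $(K + \sigma_n^2 I)^{-1}$ explicitly is both faster and numerically safer.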

In [ ]:
class GaussianProcessRegressor:
    """
    Exact GP regression with a squared-exponential kernel.

    Parameters
    ----------
    length_scale : float
    signal_var   : float   σ_f²
    noise_var    : float   σ_n²
    """

    def __init__(self, length_scale=1.0, signal_var=1.0, noise_var=0.1):
        self.length_scale = length_scale
        self.signal_var   = signal_var
        self.noise_var    = noise_var

    def _kernel(self, X1, X2):
        return se_kernel(X1, X2,
                         length_scale=self.length_scale,
                         signal_var=self.signal_var)

    def fit(self, X, y):
        """
        Store training data and precompute (K + σ_n² I)^{-1} y.

        Parameters
        ----------
        X : ndarray (n, 1)
        y : ndarray (n,)
        """
        # YOUR CODE HERE
        raise NotImplementedError

    def predict(self, X_star, return_std=False):
        """
        Compute posterior mean (and optionally std) at X_star.

        Parameters
        ----------
        X_star     : ndarray (m, 1)
        return_std : bool

        Returns
        -------
        mu   : ndarray (m,)
        std  : ndarray (m,)  — only if return_std=True
        """
        # YOUR CODE HERE
        raise NotImplementedError
In [ ]:
 

2.1 Fit and visualise¶

  • Generate 20 noisy observations of $y = \sin(x)$ on $[-4, 4]$ with $\sigma_n = 0.2$.
  • Fit your GaussianProcessRegressor.
  • Plot the posterior mean ± 2 standard deviations and 5 posterior samples.
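The data-generation step might look like this (seed and variable names are illustrative; fitting and plotting then use your `GaussianProcessRegressor`):

```python
import numpy as np

rng = np.random.default_rng(1)                          # arbitrary seed
X_train = np.sort(rng.uniform(-4, 4, (20, 1)), axis=0)  # 20 inputs on [-4, 4]
y_train = np.sin(X_train[:, 0]) + 0.2 * rng.standard_normal(20)  # sig_n = 0.2
```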
In [ ]:
# YOUR CODE HERE
In [ ]:
 

2.2 Marginal Likelihood¶

The log marginal likelihood is

$$\log p(\mathbf{y} | \mathbf{X}) = -\tfrac{1}{2}\mathbf{y}^\top(K + \sigma_n^2 I)^{-1}\mathbf{y} - \tfrac{1}{2}\log|K + \sigma_n^2 I| - \tfrac{n}{2}\log 2\pi$$

Q2.2 — Implement log_marginal_likelihood and plot it as a function of length_scale (keep $\sigma_f^2 = 1$, $\sigma_n^2 = 0.04$ fixed). Mark the optimum.
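For reference, the three terms map directly onto a Cholesky factorisation ($\log|K+\sigma_n^2 I| = 2\sum_i \log L_{ii}$). A sketch under an illustrative name, so it does not replace the stub below:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lml_sketch(X, y, kernel, noise_var=0.04):
    n = len(X)
    K = kernel(X, X) + noise_var * np.eye(n)
    c, low = cho_factor(K)
    alpha = cho_solve((c, low), y)               # (K + sig_n^2 I)^{-1} y
    # log|K + sig_n^2 I| = 2 * sum(log diag(L)) for the Cholesky factor L
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * n * np.log(2 * np.pi)
```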

In [ ]:
def log_marginal_likelihood(X, y, length_scale, signal_var=1.0, noise_var=0.04):
    """
    Compute the log marginal likelihood of the GP.

    Returns
    -------
    lml : float
    """
    # YOUR CODE HERE
    raise NotImplementedError

# Plot log marginal likelihood vs length_scale
# YOUR CODE HERE
In [ ]:
 
Best length-scale: 1.750

3. GP Classification with Laplace Approximation¶

For classification, the likelihood $p(y_i | f_i)$ is non-Gaussian (e.g. Bernoulli with a sigmoid link), so the posterior over $\mathbf{f}$ is intractable. The Laplace approximation replaces it with a Gaussian centred at the posterior mode $\hat{\mathbf{f}}$, with covariance given by the inverse of the negative Hessian of the log-posterior at that mode.

Algorithm (binary classification, sigmoid likelihood):¶

  1. Find the posterior mode $\hat{\mathbf{f}}$ by Newton's method: $$\hat{\mathbf{f}} \leftarrow (K^{-1} + W)^{-1}(W\hat{\mathbf{f}} + \nabla \log p(\mathbf{y}|\hat{\mathbf{f}}))$$ where $W = -\nabla^2 \log p(\mathbf{y}|\hat{\mathbf{f}}) = \text{diag}(\pi_i(1-\pi_i))$ and $\pi_i = \sigma(\hat{f}_i)$.

  2. Approximate posterior covariance: $\Sigma = (K^{-1} + W)^{-1}$.

  3. Predictive distribution: $$\mu_* = K_*^\top K^{-1} \hat{\mathbf{f}}$$ $$\sigma_*^2 = k_{**} - K_*^\top (K + W^{-1})^{-1} K_*$$ $$p(y_*=1 | \mathbf{X}, \mathbf{y}, \mathbf{x}_*) \approx \sigma\!\left(\frac{\mu_*}{\sqrt{1 + \pi\sigma_*^2/8}}\right)$$
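Step 1, combined with the Woodbury form from the implementation notes in `fit` below, can be sketched as a standalone function (`find_mode_sketch` is an illustrative name; `K` is the precomputed kernel matrix):

```python
import numpy as np
from scipy.special import expit

def find_mode_sketch(K, y, max_iter=50, tol=1e-6):
    # Newton iterations for the posterior mode (Rasmussen & Williams, Alg. 3.1)
    n = len(y)
    f = np.zeros(n)
    for _ in range(max_iter):
        pi = expit(f)
        W = pi * (1.0 - pi)                # diagonal of W = -Hessian of log lik
        sqrt_W = np.sqrt(W)
        B = np.eye(n) + sqrt_W[:, None] * K * sqrt_W[None, :]
        L = np.linalg.cholesky(B)          # B is well-conditioned by construction
        b = W * f + (y - pi)               # Newton right-hand side
        tmp = np.linalg.solve(L.T, np.linalg.solve(L, sqrt_W * (K @ b)))
        f_new = K @ (b - sqrt_W * tmp)     # Woodbury step: never inverts K
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f
```

At convergence the mode satisfies the self-consistency condition $\hat{\mathbf{f}} = K(\mathbf{y} - \boldsymbol{\pi})$, which is a useful unit test for your implementation.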

In [ ]:
class GPClassifierLaplace:
    """
    Binary GP classifier using the Laplace approximation.

    Parameters
    ----------
    length_scale : float
    signal_var   : float
    max_iter     : int    Newton iterations
    tol          : float  convergence tolerance
    """

    def __init__(self, length_scale=1.0, signal_var=1.0,
                 max_iter=50, tol=1e-6):
        self.length_scale = length_scale
        self.signal_var   = signal_var
        self.max_iter     = max_iter
        self.tol          = tol

    def _kernel(self, X1, X2):
        return se_kernel(X1, X2,
                         length_scale=self.length_scale,
                         signal_var=self.signal_var)

    def _log_likelihood_grad_hess(self, f, y):
        """
        Bernoulli log-likelihood with sigmoid link.

        Returns
        -------
        ll   : float     log p(y|f)
        grad : ndarray   ∂ll/∂f  =  y - σ(f)
        W    : ndarray   -∂²ll/∂f²  (diagonal vector, positive)
                         W_i = σ(f_i)(1 - σ(f_i))
        """
        # YOUR CODE HERE
        raise NotImplementedError

    def fit(self, X, y):
        """
        Run Newton iterations to find the posterior mode f_hat.

        Implementation notes
        --------------------
        1. Form B = I + W^{1/2} K W^{1/2}  (PD matrix of size n×n).
        2. Compute L = np.linalg.cholesky(B)  — plain lower-triangular factor.
           Do NOT call cho_factor on L again; use np.linalg.solve(L, ...) directly.
        3. Newton RHS: b = W*f + grad
        4. Woodbury step:
               rhs  = sqrt_W * (K @ b)
               tmp  = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
               f_new = K @ (b - sqrt_W * tmp)
        5. Store self.f_hat_, self.sqrt_W_, self.L_B_ for use in predict_proba,
           and return self so calls like GPClassifierLaplace(...).fit(X, y)
           can be chained (the demo in §3.1 relies on this).

        y should be in {0, 1}.
        """
        # YOUR CODE HERE
        raise NotImplementedError

    def predict_proba(self, X_star):
        """
        Predictive probability p(y_*=1 | ...) using Laplace approximation.

        Implementation notes
        --------------------
        mu_s   = K_s.T @ (y - σ(f_hat))
        v      = np.linalg.solve(self.L_B_, self.sqrt_W_[:, None] * K_s)   # (n, m)
        var_s  = diag(K_ss) - einsum("ij,ij->j", v, v)
        kappa  = 1 / sqrt(1 + π·var_s/8)   (probit approximation)
        return σ(kappa * mu_s)
        """
        # YOUR CODE HERE
        raise NotImplementedError

    def predict(self, X_star, threshold=0.5):
        return (self.predict_proba(X_star) >= threshold).astype(int)
In [ ]:
 

3.1 Test on 1-D toy data¶

  • Create a binary dataset: $y=1$ if $\sin(x) > 0$, else $y=0$, with a few label flips for noise.
  • Fit your GPClassifierLaplace.
  • Plot the predictive probability $p(y_*=1|x_*)$ along with the training data.
In [ ]:
# YOUR CODE HERE
In [8]:
# 3.1 — 1-D toy dataset
rng4 = np.random.default_rng(3)
X1d  = rng4.uniform(-5, 5, (60, 1))
y1d  = (np.sin(X1d[:, 0]) > 0).astype(float)
# add ~10% label noise
flip = rng4.random(60) < 0.1
y1d[flip] = 1 - y1d[flip]

gpc1d = GPClassifierLaplace(length_scale=1.2).fit(X1d, y1d)

X_s1d = np.linspace(-6, 6, 300).reshape(-1, 1)
proba = gpc1d.predict_proba(X_s1d)

plt.figure(figsize=(9, 3))
plt.plot(X_s1d, proba, 'b-', lw=2, label='$p(y=1|x_*)$')
plt.scatter(X1d[y1d==1, 0], np.ones(int(y1d.sum())), marker='|', s=60,
            c='green', label='y=1')
plt.scatter(X1d[y1d==0, 0], np.zeros(int((y1d==0).sum())), marker='|',
            s=60, c='red', label='y=0')
plt.axhline(0.5, color='k', linestyle='--', alpha=0.4)
plt.ylim(-0.1, 1.1); plt.xlabel('$x$')
plt.title('GP Classifier — Laplace Approximation (1-D)')
plt.legend(); plt.tight_layout(); plt.show()

3.2 Test on 2-D make_moons¶

  • Load make_moons(n_samples=200, noise=0.2).
  • Fit GPClassifierLaplace with length_scale=0.5.
  • Plot the decision boundary and predictive probability as a heatmap.
  • Report test accuracy.
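Your results can be cross-checked against scikit-learn's `GaussianProcessClassifier`, which also uses a Laplace approximation internally. This baseline is not part of the required solution (that should use your `GPClassifierLaplace`), but gives a sanity-check accuracy for the same data and length-scale:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# optimizer=None fixes the kernel hyperparameters, matching the lab setup
gpc = GaussianProcessClassifier(kernel=RBF(length_scale=0.5), optimizer=None)
gpc.fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, gpc.predict(X_te))
```

Your implementation should land in the same accuracy ballpark as this baseline.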
In [ ]:
# YOUR CODE HERE
In [ ]:
 
Test accuracy: 1.000

4. Hyperparameter Tuning via Log Marginal Likelihood¶

Q4 (optional / advanced) — Implement gradient-based optimisation of the GP regression hyperparameters $\{\ell, \sigma_f^2, \sigma_n^2\}$ by maximising the log marginal likelihood. Use scipy.optimize.minimize with method='L-BFGS-B' on the negative log marginal likelihood. Plot the fitted posterior before and after optimisation.
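One way to sketch the optimisation (the helper name `neg_lml`, the synthetic data, and the initial values are all illustrative assumptions; optimising in log-space keeps every parameter positive without explicit bounds):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def neg_lml(log_params, X, y):
    # Unpack in log-space so all three hyperparameters stay positive
    ell, sf2, sn2 = np.exp(log_params)
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # small jitter guards the Cholesky when the optimiser probes tiny sn2
    K = sf2 * np.exp(-0.5 * d / ell ** 2) + (sn2 + 1e-9) * np.eye(len(X))
    c_low = cho_factor(K)
    alpha = cho_solve(c_low, y)
    logdet = 2.0 * np.sum(np.log(np.diag(c_low[0])))
    return 0.5 * y @ alpha + 0.5 * logdet + 0.5 * len(X) * np.log(2 * np.pi)

rng = np.random.default_rng(2)
X = rng.uniform(-4, 4, (20, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(20)

x0 = np.log([1.0, 1.0, 0.1])                     # initial (ell, sf2, sn2)
res = minimize(neg_lml, x0, args=(X, y), method='L-BFGS-B')
ell_opt, sf2_opt, sn2_opt = np.exp(res.x)
```

Refitting the GP with the optimised hyperparameters, and plotting it next to the initial fit, completes the exercise.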

In [ ]:
# YOUR CODE HERE (optional)
In [ ]:
 
Optimised: ℓ=1.211, σ_f²=0.475, σ_n²=0.0219

5. Conceptual Questions¶

Q5.1 — What is the key difference between GP regression and GP classification in terms of tractability of the posterior?

Q5.2 — Explain the role of the noise variance $\sigma_n^2$ in GP regression. What happens when $\sigma_n^2 \to 0$?

Q5.3 — Why is the Laplace approximation not always accurate for GP classification? When would you expect it to fail?

Q5.1: > [Your answer here]

Q5.2: > [Your answer here]

Q5.3: > [Your answer here]


End of Part B¶