Understanding the Bias-Variance Decomposition with a Simulated Experiment

maths & statistics
Author

zenggyu

Published

2018-03-11

Abstract
Explains bias-variance decomposition with simulation.

Introduction

The bias-variance tradeoff is an essential factor to consider when choosing a model to minimize test error. To truly understand the underlying concepts, it is helpful to learn how exactly the test error can be decomposed into bias and variance. However, for simplicity, most textbooks do not offer a precise derivation of the process, which may lead to confusion1.

1 A common source of confusion is that textbooks like to simplify math expressions. For example, in section 7.3 of The Elements of Statistical Learning, the text seems to suggest that only one input point (\(X = x_{0}\)) is needed to compute the test error. This confusion led to the failure of my first attempt to verify the decomposition.

To address the problem, this post demonstrates the bias-variance decomposition in two ways. The first part provides a precise mathematical derivation of the decomposition process, along with some illustrative text. The second part presents a simulated experiment in R to demonstrate the theory in a more realistic setting.

Some theoretical background

Consider a regression problem where the observed outcome is denoted by \(Y\) and the predictor is denoted by \(X\). The relationship between the observed outcome and the predictor can be defined as:

\[Y = f(X) + \epsilon\]

where \(f()\) denotes the function that maps the predictor (i.e., \(X\)) to the true outcome (i.e., \(f(X)\)), and \(\epsilon\) denotes an error term.

In real-life scenarios, \(f()\) and \(\epsilon\) are unknown. But if we have a training set \(\tau\), we can obtain an estimate of \(f()\), which is denoted by \(\hat{f_{\tau}}()\). We can then use \(\hat{f_{\tau}}(X)\) to predict the outcome. Using squared-error loss, the expected prediction error based on a test set (i.e., test error) can be defined as2:

2 Some textbooks would drop the error term \(\epsilon\) and just derive \(E[(f(X) - \hat{f_{\tau}}(X)) ^ 2]\).

\[E[(Y - \hat{f_{\tau}}(X)) ^ 2] = E[(f(X) + \epsilon - \hat{f_{\tau}}(X)) ^ 2]\]

Here are some things to keep in mind before performing the bias-variance decomposition. First, the values of \(X\) and \(\epsilon\) are those from the test set. Second, the value of \(Y\) (i.e., \(f(X) + \epsilon\)) depends on both \(\epsilon\) and \(X\). Third, the realization of \(\hat{f_{\tau}}()\) depends on the training set \(\tau\) that is used to obtain the estimate, and therefore the value of \(\hat{f_{\tau}}(X)\) depends on both \(X\) and \(\tau\). Finally, note that throughout the derivation, \(\epsilon\) is assumed to have a mean of zero.

Keeping the dependencies in mind, the expression can be rewritten as3:

3 To simplify the notations while still keeping things clear, \(f(X)\) and \(\hat{f_{\tau}}(X)\) are respectively abbreviated as \(f\) and \(\hat{f}\); the dependencies are presented as subscripts of \(E[...]\).

\[E_{\tau}[E_{X, \epsilon}[(Y - \hat{f}) ^ 2]] = E_{\tau}[E_{X, \epsilon}[(f + \epsilon - \hat{f}) ^ 2]]\]

Since \(X\) and \(\epsilon\) are independent, \(E_{X, \epsilon}[...]\) can be expanded to \(E_{X}[E_{\epsilon}[...]]\). Also, we can use Fubini’s theorem to rearrange the order in which \(E_{\tau}[...]\), \(E_{X}[...]\) and \(E_{\epsilon}[...]\) are evaluated. Then, the above expression becomes:

\[E_{\tau}[E_{X, \epsilon}[(Y - \hat{f}) ^ 2]] = E_{X}[E_{\tau}[E_{\epsilon}[(f + \epsilon - \hat{f}) ^ 2]]]\]

Now, let’s solve the innermost expectation, i.e., \(E_{\epsilon}[...]\) (recall that \(\epsilon\) has a mean of zero):

\[ \begin{align} E_{\epsilon}[(Y - \hat{f}) ^ 2] &= E_{\epsilon}[(f + \epsilon - \hat{f}) ^ 2]\\ &= E_{\epsilon}[(f - \hat{f}) ^ 2 + \epsilon ^ 2 + 2\epsilon(f - \hat{f})]\\ &= (f - \hat{f}) ^ 2 + E_{\epsilon}[\epsilon ^ 2] + 2(f - \hat{f})E_{\epsilon}[\epsilon]\\ &= (f - \hat{f}) ^ 2 + (E_{\epsilon}[\epsilon ^ 2] - E_{\epsilon}[\epsilon] ^ 2) + E_{\epsilon}[\epsilon] ^ 2 + 2(f - \hat{f})E_{\epsilon}[\epsilon]\\ &= (f - \hat{f}) ^ 2 + Var_{\epsilon}[\epsilon] + 0 + 0\\ &= (f - \hat{f}) ^ 2 + Var_{\epsilon}[\epsilon] \end{align} \]

Then the middle expectation, i.e., \(E_{\tau}[...]\):

\[ \begin{align} E_{\tau}[E_{\epsilon}[(Y - \hat{f}) ^ 2]] &= E_{\tau}[(f - \hat{f}) ^ 2 + Var_{\epsilon}[\epsilon]]\\ &= E_{\tau}[(f - \hat{f}) ^ 2] + Var_{\epsilon}[\epsilon]\\ &= E_{\tau}[f ^ 2 + \hat{f} ^ 2 - 2f\hat{f}] + Var_{\epsilon}[\epsilon]\\ &= f ^ 2 + E_{\tau}[\hat{f} ^ 2] - 2fE_{\tau}[\hat{f}] + Var_{\epsilon}[\epsilon]\\ &= f ^ 2 + E_{\tau}[\hat{f} ^ 2] - 2fE_{\tau}[\hat{f}] + Var_{\epsilon}[\epsilon] + E_{\tau}[\hat{f}] ^ 2 - E_{\tau}[\hat{f}] ^ 2\\ &= (f ^ 2 + E_{\tau}[\hat{f}] ^ 2 - 2fE_{\tau}[\hat{f}]) + (E_{\tau}[\hat{f} ^ 2] - E_{\tau}[\hat{f}] ^ 2) + Var_{\epsilon}[\epsilon]\\ &= (f - E_{\tau}[\hat{f}]) ^ 2 + Var_{\tau}[\hat{f}] + Var_{\epsilon}[\epsilon] \end{align} \]

And finally, the outermost expectation, i.e., \(E_{X}[...]\):

\[ \begin{align} E_{X}[E_{\tau}[E_{\epsilon}[(Y - \hat{f}) ^ 2]]] &= E_{X}[(f - E_{\tau}[\hat{f}]) ^ 2 + Var_{\tau}[\hat{f}] + Var_{\epsilon}[\epsilon]]\\ &= E_{X}[(f - E_{\tau}[\hat{f}]) ^ 2] + E_{X}[Var_{\tau}[\hat{f}]] + Var_{\epsilon}[\epsilon] \end{align} \]

The above equation shows that the expected test error can be decomposed into:

  • \(Var_{\epsilon}[\epsilon]\), i.e., the irreducible error. This error is the amount by which the observed outcome differs from the true outcome. It can be caused by many limitations of the data (e.g., random noise, measurement error, unmeasured predictors, unmeasurable variation), and cannot be avoided unless \(\epsilon\) has a variance of zero.
  • The reducible error. This error is caused by the algorithm we choose to model the relationship, and it can be minimized using appropriate modeling techniques. The reducible error can be further decomposed into:
    • \(E_{X}[(f - E_{\tau}[\hat{f}]) ^ 2]\), i.e., the (squared) bias. This error represents the amount by which the mean of \(\hat{f_{\tau}}(X)\) differs from \(f(X)\). It is caused by erroneous assumptions in the learning algorithm, e.g., approximating an extremely complicated real-life problem with a much simpler model.
    • \(E_{X}[Var_{\tau}[\hat{f}]]\), i.e., the variance. This error represents the amount by which \(\hat{f_{\tau}}(X)\) varies around its own mean. It is caused by the sensitivity of the learning algorithm to small fluctuations in the training set.

The decomposition tells us that, in order to minimize the expected test error, we need to select a learning algorithm with both low variance and low bias. However, there is typically a tradeoff: the more flexible (i.e., complex) the algorithm, the lower the bias but the higher the variance.
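To make the tradeoff concrete, here is a minimal sketch (an addition, not part of the original derivation) that contrasts a rigid learner with a flexible one on the same kind of simulated problem used below; the degree-15 polynomial is an arbitrary stand-in for a complex model, and the seed and sample sizes are likewise arbitrary choices:

library(tidyverse)
set.seed(2)

test_x <- runif(100, min = -pi / 2, max = pi / 2)
test_f <- sin(test_x) # the true outcomes f(X) at the test points

# Fit both models on each of 100 fresh training sets, then predict on the
# same test points.
preds <- lapply(1:100, function(...) {
  train <- tibble(
    X = runif(100, min = -pi / 2, max = pi / 2),
    Y = sin(X) + rnorm(100, mean = 0, sd = 0.1)
  )
  tibble(
    linear = predict(lm(Y ~ X, data = train), newdata = tibble(X = test_x)),
    flexible = predict(lm(Y ~ poly(X, 15), data = train), newdata = tibble(X = test_x))
  )
})

for (model in c("linear", "flexible")) {
  p <- sapply(preds, `[[`, model) # rows: test points; columns: training sets
  cat(model,
      "- squared bias:", mean((test_f - rowMeans(p)) ^ 2),
      "; variance:", mean(apply(p, 1, var)), "\n")
}

Running this sketch should show the flexible model attaining a lower squared bias but a higher variance than the linear one, which is the tradeoff in action.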

A simulated experiment

The objective of this experiment was to verify the following equation with a concrete example:

\[E_{\tau}[E_{X, \epsilon}[(Y - \hat{f}) ^ 2]] = E_{X}[(f - E_{\tau}[\hat{f}]) ^ 2] + E_{X}[Var_{\tau}[\hat{f}]] + Var_{\epsilon}[\epsilon]\]

Unlike real-life scenarios, where \(f()\) and the distribution of the irreducible error \(\epsilon\) are unknown, this experiment provided explicit definitions to perform the simulation:

  1. \(f() = \sin()\).
  2. \(\epsilon\) had a mean of zero and a standard deviation of 0.1 (hence a variance of 0.01).
  3. \(X\) was uniformly distributed over \([-\frac{\pi}{2}, \frac{\pi}{2}]\).

A simple linear regression algorithm was used to estimate \(f()\), and the results were then used to verify the equation.

Step 1:

Some initial setup: load the required package and set a random seed.

library(tidyverse)
set.seed(1) # for reproducible results

Step 2:

Generate 101 data sets, each containing 100 observations. The first data set serves as the test set, while the remaining 100 serve as training sets.

data_sets <- lapply(1:101, function(...) {
  tibble(
    X = runif(100, min = -pi / 2, max = pi / 2), # the predictor
    fX = sin(X), # the true outcome, i.e., f(X)
    e = rnorm(100, mean = 0, sd = 0.1), # irreducible error
    Y = fX + e # the observed outcome
  )
}) %>% {
  list(test_set = .[[1]],
       training_sets = .[-1])
}
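Before moving on, it may help to visualize the simulated data (an optional sketch, not one of the original steps; ggplot2 is loaded as part of the tidyverse):

ggplot(data_sets$test_set, aes(X, Y)) +
  geom_point(alpha = 0.5) +                  # observed outcomes f(X) + e
  stat_function(fun = sin, colour = "red") + # the true function f()
  geom_smooth(method = "lm", se = FALSE)     # a linear fit, for comparison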

Step 3:

Train the linear model on each of the 100 training sets to obtain 100 realizations of \(\hat{f_{\tau}}()\), which are stored in the variable f_estimates in the code below.

f_estimates <- data_sets$training_sets %>%
  lapply(function(set_i) {
    lm(Y ~ X,
       data = set_i)
  })

Step 4:

Use each of the 100 realizations of \(\hat{f_{\tau}}()\) to predict the outcome of the 100 observations in the test set. This process yields 10,000 predictions (i.e., \(\hat{f_{\tau}}(X)\)), which are stored in a column named fX_estimates in the data frame results. Note that each prediction can be uniquely identified by the combination of an observation_id and a training_set_id4, which are also assigned to the data frame.

4 observation_id identifies the observation in the test set to be predicted; training_set_id identifies the training set used to train \(\hat{f_{\tau}}()\), which is then used to make the prediction.

results <- lapply(1:100, function(i) {
  data_sets$test_set %>%
    mutate(fX_estimates = predict(f_estimates[[i]], newdata = data_sets$test_set),
           observation_id = 1:100,
           training_set_id = i)
}) %>%
  bind_rows()
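As a quick sanity check (an addition to the original steps; output omitted), the shape of results can be compared against the description above:

nrow(results)                       # 100 observations x 100 training sets = 10,000
n_distinct(results$observation_id)  # 100 distinct test observations
n_distinct(results$training_set_id) # 100 distinct training sets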

Step 5:

Compute the value for each of the terms in the equation:

  1. expected_error = \(E_{\tau}[E_{X, \epsilon}[(Y - \hat{f}) ^ 2]]\)
  2. bias = \(E_{X}[(f - E_{\tau}[\hat{f}]) ^ 2]\)
  3. variance = \(E_{X}[Var_{\tau}[\hat{f}]]\)
  4. irreducible_error = \(Var_{\epsilon}[\epsilon]\)

Note that observation_id determines the values of \(X\) and \(\epsilon\); therefore, \(E_{X}[...]\), \(E_{\epsilon}[...]\) and \(E_{X, \epsilon}[...]\) can all be treated as \(E_{observation\_id}[...]\). Similarly, \(E_{\tau}[...]\) can be treated as \(E_{training\_set\_id}[...]\). These notes help explain how the calculations are performed in the code below. However, you may need to run the code yourself and examine the results to really relate the code to the math expressions5.

5 The R code is rather simplistic (e.g., training_set_id is not used for the calculation) because many operations are vectorized, which is a typical feature of R.

expected_error <- results %>%
  with({
    (Y - fX_estimates) ^ 2
  }) %>%
  mean()

bias <- results %>%
  split(.$observation_id) %>%
  sapply(function(ith_observation) {
    (ith_observation$fX[1] - mean(ith_observation$fX_estimates)) ^ 2 # `ith_observation$fX` is a vector with 100 identical values, but only one is needed for the calculation
  }) %>%
  mean()

variance <- results %>%
  split(.$observation_id) %>%
  sapply(function(ith_observation) {
    var(ith_observation$fX_estimates)
  }) %>%
  mean()

irreducible_error <- var(data_sets$test_set$e)

Finally, show the results:

expected_error
#> [1] 0.0150791
bias + variance + irreducible_error
#> [1] 0.01549488

The expected error and the sum of its components are very close. However, you may wonder why there is a slight discrepancy. This is because the derivation assumes that the irreducible error \(\epsilon\) has a mean of zero, so terms that include \(E_{\epsilon}[\epsilon]\) are dropped; but this assumption is not strictly satisfied in the experiment:

mean(data_sets$test_set$e)
#> [1] -0.001757949
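As a further rough check (again an addition, with output omitted), the largest dropped term, \(2E[\epsilon(f - \hat{f})]\), can be computed directly from results; in a finite sample it is generally nonzero and contributes to the gap:

# The cross term that vanishes in expectation when E[eps] = 0, computed
# over all 10,000 (observation, training set) pairs.
2 * mean(results$e * (results$fX - results$fX_estimates))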

Conclusions

The experiment simulated some data with a sinusoidal function, and used a simple linear regression algorithm to estimate the true function and to predict the outcome in the test set. The results showed that the expected test error was almost identical to the sum of the bias, the variance and the irreducible error. The slight discrepancy can be explained by the fact that one of the assumptions underlying the derivation, i.e., \(E_{\epsilon}[\epsilon] = 0\), was not strictly satisfied in the experiment.