Understanding the Bias-Variance Decomposition with A Simulated Experiment
Introduction
The bias-variance tradeoff is an essential factor to consider when choosing a model to minimize test error. To truly understand the underlying concepts, it helps to see exactly how the test error decomposes into bias and variance. However, for simplicity, most textbooks do not give a precise derivation, which may lead to confusion (see footnote 1).
1 A common source of confusion is that textbooks like to simplify math expressions. For example, in section 7.3 of The Elements of Statistical Learning, the text seems to suggest that only one input point (
To address the problem, this post demonstrates the bias-variance decomposition in two ways. The first part gives a precise mathematical derivation of the decomposition, along with some explanatory text. The second part presents a simulated experiment in R that illustrates the theory in a more realistic setting.
Some theoretical background
Consider a regression problem where the observed outcome is denoted by
where
In real-life scenarios,
2 Some textbooks would drop the error term
Here are some things to keep in mind before performing the bias-variance decomposition. First, the values of
Keeping the dependencies in mind, the expression can be rewritten as (see footnote 3):
3 To simplify the notations while still keeping things clear,
Since
Now, let’s solve the inner-most expectation, i.e.,
Then the second inner-most expectation, i.e.,
And finally, the outer-most expectation, i.e.,
The above equation shows that the expected test error can be decomposed into:
- Irreducible error. This error is the amount by which the observed outcome differs from the true outcome. It could be caused by many limitations of the data (e.g., random noise, measurement error, unmeasured predictors, unmeasurable variation), and cannot be avoided unless the error term has a variance of zero.
- Reducible error. This error is caused by the algorithm we choose to model the relationship, and it can be minimized using appropriate modeling techniques. The reducible error can be further decomposed into:
  - Bias. This error represents the amount by which the mean of the estimated outcome differs from the true outcome. It is caused by erroneous assumptions in the learning algorithm, e.g., approximating an extremely complicated real-life problem by a much simpler algorithm.
  - Variance. This error represents the amount by which the estimated outcome differs from its own mean. It is caused by the sensitivity of the learning algorithm to small fluctuations in the training set.
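For reference, the components listed above can be written out in one line. This is a sketch in assumed notation, since the symbols are not fixed here: $Y = f(X) + \varepsilon$ is the observed outcome, $\varepsilon$ has mean zero and is independent of $X$ and of the training set $\tau$, and $\hat{f}_\tau$ is the estimate learned from $\tau$. The term labeled "bias" is the squared bias, averaged over test inputs.

```latex
\begin{aligned}
\mathbb{E}_{X,\varepsilon,\tau}\!\left[\big(Y - \hat{f}_\tau(X)\big)^2\right]
&= \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}} \\
&\quad + \underbrace{\mathbb{E}_X\!\left[\big(f(X) - \mathbb{E}_\tau[\hat{f}_\tau(X)]\big)^2\right]}_{\text{bias}} \\
&\quad + \underbrace{\mathbb{E}_X\!\left[\mathbb{E}_\tau\!\left[\big(\hat{f}_\tau(X) - \mathbb{E}_\tau[\hat{f}_\tau(X)]\big)^2\right]\right]}_{\text{variance}}
\end{aligned}
```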
The decomposition tells us that, in order to minimize the expected test error, we need to select a learning algorithm with both low variance and low bias. However, typically, the more complicated the algorithm, the lower the bias but the higher the variance, and hence the tradeoff.
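The tradeoff can be illustrated with a small simulation. The sketch below is not part of the original experiment: it assumes the same kind of data as the experiment in the next section (a sine curve with normal noise) and compares a straight-line fit with a degree-5 polynomial fit. The names `simulate_set` and `bias_and_variance` are hypothetical helpers introduced here for illustration.

```r
# Sketch: bias and variance of polynomial fits of two different degrees.
# All names below are hypothetical and chosen for this illustration only.
set.seed(1)

# Simulate one data set: X uniform on [-pi/2, pi/2], Y = sin(X) + noise
simulate_set <- function(n = 100) {
  X <- runif(n, min = -pi / 2, max = pi / 2)
  data.frame(X = X, fX = sin(X), Y = sin(X) + rnorm(n, mean = 0, sd = 0.1))
}

test_set <- simulate_set()

bias_and_variance <- function(degree, n_train = 100) {
  # Rows = test observations, columns = training sets
  predictions <- sapply(seq_len(n_train), function(j) {
    fit <- lm(Y ~ poly(X, degree), data = simulate_set())
    predict(fit, newdata = test_set)
  })
  c(
    # Squared gap between the truth and the average prediction, per observation
    bias = mean((test_set$fX - rowMeans(predictions))^2),
    # Spread of the predictions across training sets, per observation
    variance = mean(apply(predictions, 1, var))
  )
}

simple <- bias_and_variance(degree = 1)
flexible <- bias_and_variance(degree = 5)
rbind(simple, flexible)
```

Under these assumptions, the degree-5 fit would be expected to show a smaller bias term and a larger variance term than the straight-line fit.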
A simulated experiment
The objective of this experiment was to use an example to verify the equation:
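Unlike real-life scenarios, where the true function and the error term are unknown, the experiment used simulated data generated from known components:
- The predictor X was drawn uniformly from the interval between -pi/2 and pi/2.
- The true outcome was f(X) = sin(X).
- The irreducible error e had a mean of zero and a standard deviation of 0.1 (hence a variance of 0.01).
- Additionally, e was drawn from a normal distribution, independently of X.
A simple linear regression algorithm was used to estimate the true function f.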
Step 1:
Loaded the required package and set the random seed.
library(tidyverse)
set.seed(1) # for reproducible results
Step 2:
Generated 101 data sets, each containing 100 observations. The first data set served as the test set, while the remaining 100 data sets served as training sets.
data_sets <- lapply(1:101, function(...) {
  tibble(
    X = runif(100, min = -pi / 2, max = pi / 2), # the predictor
    fX = sin(X), # the true outcome, i.e., f(X)
    e = rnorm(100, mean = 0, sd = 0.1), # irreducible error
    Y = fX + e # the observed outcome
  )
}) %>% {
  list(test_set = .[[1]],
       training_sets = .[-1])
}
Step 3:
Trained the linear model on each of the 100 training sets, yielding 100 fitted models, which are stored as f_estimates in the code below.
f_estimates <- data_sets$training_sets %>%
  lapply(function(set_i) {
    lm(Y ~ X,
       data = set_i)
  })
Step 4:
Used each of the 100 fitted models to predict the outcome in the test set, and stored the predictions as fX_estimates in the data frame results. Note that each prediction can be uniquely identified by the combination of an observation_id and a training_set_id (see footnote 4), which are also added to the data frame.
4 observation_id identifies the observation in the test set to be predicted; training_set_id identifies the training set used to fit the model.
results <- lapply(1:100, function(i) {
  data_sets$test_set %>%
    mutate(fX_estimates = predict(f_estimates[[i]], newdata = data_sets$test_set),
           observation_id = 1:100,
           training_set_id = i)
}) %>%
  bind_rows()
Step 5:
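Computed the value of each term in the equation:
- expected_error = the mean of the squared differences between Y and fX_estimates, over all predictions
- bias = the mean, over test observations, of the squared difference between fX and the mean of fX_estimates
- variance = the mean, over test observations, of the variance of fX_estimates
- irreducible_error = the variance of e in the test set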
Note that observation_id
determines the value of
5 The R code is rather terse (e.g., training_set_id is not used in the calculation) because many operations are vectorized, which is typical of R.
expected_error <- results %>%
  with({
    (Y - fX_estimates) ^ 2
  }) %>%
  mean()
bias <- results %>%
  split(.$observation_id) %>%
  sapply(function(ith_observation) {
    # `ith_observation$fX` is a vector with 100 identical values, but only one is needed for the calculation
    (ith_observation$fX[1] - mean(ith_observation$fX_estimates)) ^ 2
  }) %>%
  mean()
variance <- results %>%
  split(.$observation_id) %>%
  sapply(function(ith_observation) {
    var(ith_observation$fX_estimates)
  }) %>%
  mean()
irreducible_error <- var(data_sets$test_set$e)
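To make the grouping logic concrete, here is a tiny hand-checkable sketch in base R. The data frame `toy_results` is made up for illustration (2 test observations, 3 training sets) and only mimics the columns of results that the bias and variance calculations rely on.

```r
# Toy data (made up for illustration): 2 test observations, 3 training sets.
toy_results <- data.frame(
  observation_id = rep(1:2, times = 3),
  fX             = rep(c(0, 0.9), times = 3),  # true outcome per observation
  fX_estimates   = c(0.1, 0.8, -0.1, 1.0, 0.0, 1.2)
)

# One data frame per test observation, each holding its 3 predictions
groups <- split(toy_results, toy_results$observation_id)

# Per-observation squared bias and variance, then averaged over observations
toy_bias <- mean(sapply(groups, function(g) (g$fX[1] - mean(g$fX_estimates))^2))
toy_variance <- mean(sapply(groups, function(g) var(g$fX_estimates)))

toy_bias      # observation 1: (0 - 0)^2 = 0; observation 2: (0.9 - 1)^2 = 0.01
toy_variance  # observation 1: var = 0.01; observation 2: var = 0.04
```

Averaging the per-observation values gives 0.005 for the bias term and 0.025 for the variance term, up to floating-point rounding.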
Finally, show the results:
expected_error
#> [1] 0.0150791
bias + variance + irreducible_error
#> [1] 0.01549488
The expected error and the sum of its components are very close. However, you may wonder why there is a slight inconsistency. This is because the derivation assumes that the irreducible error has a mean of exactly zero, whereas its sample mean in the simulated test set is only approximately zero:
mean(data_sets$test_set$e)
#> [1] -0.001757949
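The gap can also be accounted for term by term. The sketch below re-runs a similar experiment in base R; it is not the code used above, and names such as `variance_pop`, `noise` and `cross_term` are introduced here for illustration. It relies on an algebraic identity that holds exactly in any finite sample: the mean squared error equals the bias term, plus the variance computed with a denominator of n rather than n - 1, plus the mean of the squared errors, plus a cross term between the error and the bias. Because the random draws are consumed in a different order, the numbers will not match those shown above.

```r
# Base-R sketch of the experiment with an exact finite-sample decomposition.
set.seed(1)
n_obs <- 100    # observations per data set
n_train <- 100  # number of training sets

simulate_set <- function(n) {
  X <- runif(n, min = -pi / 2, max = pi / 2)
  fX <- sin(X)
  e <- rnorm(n, mean = 0, sd = 0.1)
  data.frame(X = X, fX = fX, e = e, Y = fX + e)
}

test_set <- simulate_set(n_obs)

# Rows = test observations, columns = training sets
predictions <- sapply(seq_len(n_train), function(j) {
  fit <- lm(Y ~ X, data = simulate_set(n_obs))
  predict(fit, newdata = test_set)
})

mean_prediction <- rowMeans(predictions)

# Vectors are recycled down the columns, so each row is matched
# with its own test observation in the subtractions below.
expected_error <- mean((test_set$Y - predictions)^2)
bias <- mean((test_set$fX - mean_prediction)^2)
variance_pop <- mean((predictions - mean_prediction)^2)  # denominator n, not n - 1
noise <- mean(test_set$e^2)                              # not centered, unlike var()
cross_term <- 2 * mean(test_set$e * (test_set$fX - mean_prediction))

all.equal(expected_error, bias + variance_pop + noise + cross_term)
```

In this formulation the gap has three sources: `mean(e^2)` differs from `var(e)` when the sample mean of e is not zero, the cross term is not exactly zero in a finite sample, and `var()` divides by n - 1 rather than n.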
Conclusions
The experiment simulated some data with a sinusoidal function, and used a simple linear regression algorithm to estimate the true function and to predict the outcome in the test set. The results showed that the expected test error was almost identical to the sum of the bias, the variance and the irreducible error. The slight inconsistency could be explained by the fact that one of the assumptions underlying the derivation, i.e., that the irreducible error has a mean of zero, held only approximately in the simulated data.