How to Simulate Data Based on the Assumptions of Logistic Regression
Introduction
The main purpose of this post is to show how to simulate data under the assumptions of logistic models. The technique introduced should also be useful for generating simulated data to study other machine learning algorithms.
But before the simulation, I will first introduce some key concepts of logistic regression.
About the logistic function
Logistic regression uses the logistic function to denote the probability of the occurrence of an outcome:

$$p(X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$

It is apparent that the probability has a positive association with the linear predictor $\beta_0 + \beta_1 X$: the larger it gets, the closer $p(X)$ is to 1. The input can take any real value, while the output is always a valid probability between 0 and 1.
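As a quick sanity check (a sketch I added; base R's plogis() implements the same function), we can verify that the logistic function maps any real number into the (0, 1) interval and is strictly increasing:

logistic <- function(z) 1 / (1 + exp(-z))   # the logistic function

logistic(c(-5, 0, 5))
#> [1] 0.006692851 0.500000000 0.993307149

plogis(c(-5, 0, 5))   # base R equivalent
#> [1] 0.006692851 0.500000000 0.993307149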
About the logit function
The logistic function provides a good representation of the relationship between the probability of an outcome and its predictors, but it does not provide a straightforward interpretation of the coefficients ($\beta_0$ and $\beta_1$).

The logit¹ (log odds) function (denoted by $\operatorname{logit}$) transforms the probability back to the scale on which the model is linear:

$$\operatorname{logit}(p(X)) = \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$

¹ The term “logit” is an abbreviation for “logistic unit”.

Note that the right-hand side is linear in $X$. The above formula therefore provides a straightforward interpretation for the coefficients in the model: a one-unit increase in $X$ changes the log odds by $\beta_1$, or equivalently, multiplies the odds by $e^{\beta_1}$. Be careful not to read the odds $p(X)/(1 - p(X))$ as a risk ratio².
² Risk ratio, abbreviated as RR, is the ratio of the probability of an event occurring in one group of observations to the probability of the event occurring in another group of observations.
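To make the inverse relationship concrete, here is a small sketch (my addition; qlogis() and plogis() are base R's logit and logistic functions):

p <- 0.8
qlogis(p)          # logit: the log odds
#> [1] 1.386294
log(p / (1 - p))   # the same quantity written out
#> [1] 1.386294
plogis(qlogis(p))  # the logistic function undoes the logit
#> [1] 0.8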
About maximum likelihood estimation
Logistic regression uses maximum likelihood estimation to estimate the coefficients in the model: the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are the values that maximize the likelihood of the observed data,

$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \left(1 - p(x_{i'})\right).$$
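In practice glm() performs the estimation, but a minimal sketch (a toy one-predictor example of my own, not from the experiment below) shows the idea: write down the negative log likelihood and minimize it with a general-purpose optimizer.

# Toy data for a one-predictor logistic model (assumed for illustration)
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-1 + 2 * x))

# Negative log likelihood of the logistic model
neg_log_lik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# Minimizing it recovers roughly the same coefficients as
# glm(y ~ x, family = binomial)
optim(c(0, 0), neg_log_lik)$par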
An experiment to simulate data for logistic regression
In this example, I simulate a data set with a known distribution and fit a logistic regression model to see how close the result is to the truth.
Consider the probability of an event given two inputs, $x_1$ and $x_2$:

$$p = \frac{1}{1 + e^{-(-2 + 3x_1^2 + x_2 + \varepsilon)}}$$

where $x_1 \sim \mathrm{Uniform}(-1, 1)$, $x_2 \sim \mathrm{Bernoulli}(0.5)$, and $\varepsilon \sim N(0, 0.1^2)$ is a random noise term.
A data set that reflects the above definition can be simulated with the following code:
library(tidyverse)

set.seed(1)  # for reproducibility

sample_size <- 300

dat <- tibble(
  x1 = runif(sample_size, min = -1, max = 1),     # continuous predictor
  x2 = rbinom(sample_size, 1, 0.5),               # binary predictor
  e = rnorm(sample_size, sd = 0.1),               # random noise
  p = 1 / (1 + exp(-(-2 + 3 * x1^2 + x2 + e))),   # true probability
  log_odds = log(p / (1 - p)),                    # true log odds
  y = rbinom(sample_size, 1, p)                   # observed outcome
)
Figure 1 shows the simulated data set with the probability of the event on the y-axis.
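The figure itself is not reproduced here, but a plot along the following lines (the exact aesthetics of Figure 1 are my assumption) conveys the same information:

ggplot(dat, aes(x = x1, y = p, color = factor(x2))) +
  geom_point(alpha = 0.6) +
  labs(y = "Probability of the event", color = "x2")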
Now, let’s fit a logistic model on the data using the glm() function. Note how the quadratic term of x1 is represented in the formula to reflect the underlying definition.
fit <- glm(y ~ I(x1^2) + x2, data = dat, family = binomial)

fit
#>
#> Call: glm(formula = y ~ I(x1^2) + x2, family = binomial, data = dat)
#>
#> Coefficients:
#> (Intercept) I(x1^2) x2
#> -2.070 2.747 1.088
#>
#> Degrees of Freedom: 299 Total (i.e. Null); 297 Residual
#> Null Deviance: 388.5
#> Residual Deviance: 342.2 AIC: 348.2
The result shows that the estimated coefficients (-2.070 for the intercept, 2.747 for the quadratic term, and 1.088 for x2) are reasonably close to the true values (-2, 3, and 1) used to generate the data.
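A side-by-side comparison (a convenience I added, not part of the original output) makes this easier to see:

# Estimated coefficients next to the true values from the simulation
# estimated: -2.070, 2.747, 1.088; truth: -2, 3, 1
tibble(
  term      = names(coef(fit)),
  estimated = unname(coef(fit)),
  truth     = c(-2, 3, 1)
)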
The predict() function can be used to get the predicted probability for each observation, which can then be turned into a predicted class.
dat <- dat %>%
  mutate(p_predicted = predict(fit, type = "response"),      # predicted probability
         y_predicted = if_else(p_predicted > 0.5, 1L, 0L))   # predicted class
Figure 2 shows the predicted probability of the outcome for each observation in the data set. The pattern shown in Figure 2 is comparable to that shown in Figure 1.
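The same plotting sketch as for Figure 1, with p replaced by p_predicted, would produce a comparable plot:

ggplot(dat, aes(x = x1, y = p_predicted, color = factor(x2))) +
  geom_point(alpha = 0.6) +
  labs(y = "Predicted probability", color = "x2")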
The following code shows the training error (misclassification rate) of the model:
mean(dat$y != dat$y_predicted)
#> [1] 0.2733333
An unexpected discovery
Logistic regression is a parametric method, so at first I expected the experiment results to have low variance³ (see my other post on the topic) even with a small sample size. But it didn’t turn out that way. To put the issue in a comparable context: had the outcome been continuous and fitted with a correctly specified linear regression, the coefficient estimates would have been far more stable at the same sample size.
³ You can run the code with different random seeds to see how the estimates of the coefficients change every time.
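To make the variance visible without manual re-runs, here is a sketch (my own addition) that wraps the whole simulate-and-fit cycle in a function and repeats it under twenty different seeds:

# Repeat the simulation and model fit under a given seed
simulate_fit <- function(seed, n = 300) {
  set.seed(seed)
  d <- tibble(
    x1 = runif(n, min = -1, max = 1),
    x2 = rbinom(n, 1, 0.5),
    e  = rnorm(n, sd = 0.1),
    p  = 1 / (1 + exp(-(-2 + 3 * x1^2 + x2 + e))),
    y  = rbinom(n, 1, p)
  )
  coef(glm(y ~ I(x1^2) + x2, data = d, family = binomial))
}

# Each row holds the coefficient estimates from one simulated data set
estimates <- t(sapply(1:20, simulate_fit))
summary(estimates)   # shows the spread of the estimates across seeds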
To be fair, the variance was not caused by the logistic regression algorithm itself; rather, it originated from the nature of classification problems. Given the right predictors and assumptions, the uncertainty of a continuous outcome comes only from random noise. The same cannot be said for a categorical outcome: even without random noise, the outcome is still uncertain unless its probability is exactly 1 or 0, which is rarely the case. It is therefore quite likely that a small data set fails to reflect the underlying assumptions because of the variance caused by this uncertainty.

In addition, even if a good model that is consistent with the truth is built (like the one shown in the experiment), the misclassification error rate may still be high due to the aforementioned uncertainty. With this in mind, it may sometimes be preferable to use other metrics (predicted probability, sensitivity, specificity, etc.) instead of misclassification error to evaluate the performance of classification models.
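As an illustration of those alternative metrics (a sketch I added, computed on the training data from the experiment above):

# Confusion matrix of observed vs. predicted classes
conf <- table(observed = dat$y, predicted = dat$y_predicted)
conf

# Sensitivity: true positives among actual positives
conf["1", "1"] / sum(conf["1", ])

# Specificity: true negatives among actual negatives
conf["0", "0"] / sum(conf["0", ])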