一些自己编写的R语言画图函数

简介

这里记录一些自己编写的R语言画图函数,以供日常使用参考。

代码部分

运行本文所提供的函数时,请先确保已经载入tidyverse包;某些函数可能要用到其他包,具体要求请看代码。

1号函数

plot_cat_bin <- function(data, predictor, outcome, reorder = F) {
  data <- data %>%
    transmute(predictor = !!enquo(predictor),
              outcome = !!enquo(outcome))
  
  stopifnot(is.factor(data$predictor) | is.logical(data$predictor))
  stopifnot(all(data$outcome %in% c(0, 1)) | is.logical(data$outcome))
  
  data <- drop_na(data, outcome)
  prob_mean <- mean(pull(data, outcome))
  
  data <- data %>%
    mutate(predictor = forcats::fct_explicit_na(factor(predictor), "<NA>")) %>%
    group_by(predictor) %>%
    group_modify(function(.x, .y) {
      results <- prop.test(sum(pull(.x, outcome)), nrow(.x))
      tibble(prob = unname(results$estimate),
             lower = results$conf.int[1],
             upper = results$conf.int[2])
    }, keep = T) %>%
    mutate(prob_mean = !!prob_mean) %>%
    ungroup()
  
  if (reorder) {
    data <- data %>%
      mutate(predictor = forcats::fct_reorder(predictor, prob))
  }
  
  data %>%
    ggplot(aes(predictor, prob)) +
    geom_point() +
    geom_errorbar(aes(ymin = lower, ymax = upper)) +
    geom_hline(yintercept = prob_mean, linetype = "dashed") +
    labs(x = as_label(enexpr(predictor)))
}

一些笔记:

计算率的置信区间时,最常见的计算算式是\(\hat{p} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\);由该算式算出来的区间叫Wald区间。当\(n\)很小,或\(\hat{p}\)接近0或1的时候,Wald区间值非常不稳定,而且有可能不处于\([0, 1]\)区间内。

因为上述原因,R语言的prop.test()在计算率的置信区间时,使用的是适用面更广的Wilson score区间,并且在必要时还会增加连续性校正;该方法的计算算式可以参阅Newcombe RG (1998). “Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods.” 一文。prop.test()所使用的方法对应该文中的方法3(不含连续性校正)和方法4(含连续性校正)。

因此,举个例子,prop.test(12, 123, conf.level = 0.99, correct = FALSE)计算出来的置信区间\([0.0479415, 0.1883752]\)也可以通过以下方式计算出来:

p <- 12/123
n <- 123
z <- qnorm(0.995)
(2*n*p + z^2 + c(-1, 1)*z*sqrt(z^2 + 4*n*p*(1 - p))) /(2*(n + z^2))
## [1] 0.0479415 0.1883752

2号函数

plot_num_bin <- function(data, predictor, outcome, augment = F) {
  data <- data %>%
    transmute(predictor = !!enquo(predictor),
              outcome = !!enquo(outcome))
  
  stopifnot(is.numeric(data$predictor))
  stopifnot(all(data$outcome %in% c(0, 1)) | is.logical(data$outcome))
  
  data <- drop_na(data, outcome) %>%
    mutate(outcome = as.double(outcome))
  prob_mean <- mean(pull(data, outcome))
  
  gam_fit <- mgcv::gam(outcome ~ s(predictor), data = data, family = binomial)
  
  if(augment) {
    data <- tibble(predictor = runif(1000, min(data$predictor), max(data$predictor)))
  } else {
    data <- distinct(data, predictor)
  }
  
  data <- data %>%
    mutate(link = predict(gam_fit, data, type = "link"),
           se = predict(gam_fit, data, type = "link", se.fit = T)$se.fit,
           upper = link + qnorm(0.975) * se,
           lower = link - qnorm(0.975) * se,
           lower = binomial()$linkinv(lower),
           upper = binomial()$linkinv(upper),
           prob = binomial()$linkinv(link),
           prob_mean = !!prob_mean)
  
  data %>%
    ggplot(aes(x = predictor)) +
    geom_line(aes(y = prob)) +
    geom_ribbon(aes(ymin = lower, ymax = upper), fill = "darkgrey", alpha = 0.5) +
    geom_hline(yintercept = prob_mean, linetype = "dashed") +
    labs(x = as_label(enexpr(predictor)))
}

笔记:待补充。

相关

下一页
上一页
comments powered by Disqus