SQL递归查询的R语言实现

计算机技术

作者

zenggyu

发布于

2021-03-20

摘要

介绍如何用R语言实现SQL的递归查询。

引言

在之前写的另一篇文章中，我介绍了SQL递归查询的应用场景、一般形式和执行过程，并给出了一些具体的代码示例。但是，在某些情况下（例如，无法访问能够提供SQL递归查询功能的数据库时），你可能会希望通过其他工具完成相关任务。本文将介绍如何在SQL递归查询相关思路指导下，通过R语言实现相同的功能，并将提供代码示例。由于本文所介绍的实现方法与SQL递归查询紧密相关，因此建议读者先仔细阅读前文，以便更好地理解本文。

递归查询的一般形式

参照SQL递归查询的一般形式和执行过程（见前文），我们可以用R语言写出以下对应形式：

library(dplyr) # 需要用到dplyr包

result <- NULL

cte_name <- {non_recursive_term}

while (TRUE) {
  if (nrow(cte_name) == 0) {
    break
  } else {
    result <- bind_rows(result, cte_name) # result <- distinct(bind_rows(result, cte_name))
    cte_name <- {recursive_term}
  }
}

{invocation_statement}

在上述形式中，result是用于储存最终结果的变量，在一开始需要将其初始化为NULL；cte_name这个变量（在下文的示例中，我们将以t1替代cte_name作为该变量的名称；此处使用该名称只是为了与前文相关内容保持一致）用于临时储存递归查询过程中每个子查询所返回的数据框¹；non_recursive_term和recursive_term分别被称为非递归项和递归项，是用于获取前述数据框的子查询表达式，且recursive_term因含有对cte_name变量的引用而带有递归性质；在递归的迭代过程中，cte_name中的数据会被不断追加到result，而根据是否使用distinct()命令去除数据中的重复记录，追加的方式可以分为两种，分别对应于SQL递归查询中的UNION和UNION ALL方式；最后，invocation_statement是可选的查询表达式，用来对result做一些额外的查询操作。

¹ R语言的数据框与关系数据库里的表相对应，两者是相似的数据结构。

递归项recursive_term往往包含对数据框/表的连接操作；以下是一些需要特别注意的相关问题，以及R语言（本文特指dplyr包所提供的相关函数）和SQL在处理方式上的差异：

在需要进行连接的两个数据框/表之中，如果存在同名的列，会使程序产生混淆。在SQL中，我们可以通过指定表别名（alias）来避免该问题；而在R语言中，我们则需要在连接前对列进行重命名（在下文提供的示例中，我们将使用rename_with()函数来进行重命名）。
在进行连接操作时，SQL和R语言对缺失值（NULL/NA）的默认匹配规则不同。SQL不会对缺失值进行匹配；但R语言在默认情况下会对缺失值进行匹配，如果要采用与前者一致的规则，那么可以在相关函数中添加na_matches = "never"参数。
SQL和R语言对连接键的处理方式不一样。SQL会同时保留来自两个表的连接键供用户选用；但R语言在默认情况下只会保留其中一个，如果采用与前者一致的处理方式，那么可以在相关函数中添加keep = T参数。

递归查询写法示例

本文将采用与前文相同的数据集和查询（有关描述请见该文，此处不再赘述）进行演示，以便进行对比。

# 加载dplyr包
library(dplyr)

# 构建数据集
branch <- tribble(
  ~ branch_id, ~branch_name, ~parent_branch,
  1, '华东', NA,
  2, '华南', NA,
  11, '上海市', 1,
  111, '浦东新区', 11,
  12, '江苏省', 1,
  121, '南京市', 12,
  1211, '鼓楼区', 121,
  1212, '秦淮区', 121,
  13, '台湾省', 1,
  21, '广东省', 2,
  211, '广州市', 21,
  2111, '海珠区', 211,
  2112, '黄埔区', 211
)

找出华东分支（branch_id = 1）及其所有下级分支

result <- NULL

t1 <- branch %>%
  filter(branch_id == 1) %>%
  select(branch_id, branch_name)

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~ paste0("t1.", .)) %>%
      inner_join(branch %>%
                   rename_with(~ paste0("branch.", .)),
                 by = c("t1.branch_id" = "branch.parent_branch"),
                 keep = T,
                 na_matches = "never") %>%
      select(branch_id = branch.branch_id, branch_name = branch.branch_name)
  }
}

print(result)

上述语句的输出结果为：

+---------+-----------+
|branch_id|branch_name|
+---------+-----------+
|1        |华东       |
|11       |上海市     |
|12       |江苏省     |
|13       |台湾省     |
|111      |浦东新区   |
|121      |南京市     |
|1211     |鼓楼区     |
|1212     |秦淮区     |
+---------+-----------+

找出华东分支（branch_id = 1）的所有下二级分支

result <- NULL

t1 <- branch %>%
  filter(branch_id == 1) %>%
  mutate(rel_level = 0) %>%
  select(branch_id, branch_name, rel_level)

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~paste0("t1.", .)) %>%
      inner_join(branch %>%
                   rename_with(~paste0("branch.", .)),
                 by = c("t1.branch_id" = "branch.parent_branch"),
                 keep = T,
                 na_matches = "never") %>%
      mutate(t1.rel_level = t1.rel_level + 1) %>%
      select(
        branch_id = branch.branch_id,
        branch_name = branch.branch_name,
        rel_level = t1.rel_level
      )
  }
}

result %>%
  filter(rel_level == 2) %>%
  select(branch_id, branch_name) %>%
  print()

上述语句的输出结果为：

+---------+-----------+
|branch_id|branch_name|
+---------+-----------+
|111      |浦东新区   |
|121      |南京市     |
+---------+-----------+

找出华东分支（branch_id = 1）的所有最下级分支

result <- NULL

t1 <- branch %>%
  filter(branch_id == 1) %>%
  select(branch_id, branch_name, parent_branch)

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~ paste0("t1.", .)) %>%
      inner_join(branch %>%
                   rename_with(~ paste0("branch.", .)),
                 by = c("t1.branch_id" = "branch.parent_branch"),
                 keep = T,
                 na_matches = "never") %>%
      select(
        branch_id = branch.branch_id,
        branch_name = branch.branch_name,
        parent_branch = branch.parent_branch
      )
  }
}

result %>%
  anti_join(result, by = c("branch_id" = "parent_branch")) %>%
  print()

上述语句的输出结果为：

+---------+-----------+
|branch_id|branch_name|
+---------+-----------+
|13       |台湾省     |
|111      |浦东新区   |
|1211     |鼓楼区     |
|1212     |秦淮区     |
+---------+-----------+

找出黄埔区分支（branch_id = 2112）的所有上级分支

result <- NULL

t1 <- branch %>%
  filter(branch_id == 2112) %>%
  select(branch_id, branch_name, parent_branch)

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~ paste0("t1.", .)) %>%
      inner_join(
        branch %>%
          rename_with(~ paste0("branch.", .)),
        by = c("t1.parent_branch" = "branch.branch_id"),
        keep = T,
        na_matches = "never"
      ) %>%
      select(
        branch_id = branch.branch_id,
        branch_name = branch.branch_name,
        parent_branch = branch.parent_branch
      )
  }
}

result %>%
  filter(branch_id != 2112) %>%
  select(branch_id, branch_name) %>%
  print()

上述语句的输出结果为：

+---------+-----------+
|branch_id|branch_name|
+---------+-----------+
|211      |广州市     |
|21       |广东省     |
|2        |华南       |
+---------+-----------+

找出黄埔区分支（branch_id = 2112）的最上级分支

result <- NULL

t1 <- branch %>%
  filter(branch_id == 2112) %>%
  select(branch_id, branch_name, parent_branch)

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~ paste0("t1.", .)) %>%
      inner_join(
        branch %>%
          rename_with(~ paste0("branch.", .)),
        by = c("t1.parent_branch" = "branch.branch_id"),
        keep = T,
        na_matches = "never"
      ) %>%
      select(
        branch_id = branch.branch_id,
        branch_name = branch.branch_name,
        parent_branch = branch.parent_branch
      )
  }
}

result %>%
  filter(is.na(parent_branch)) %>%
  select(branch_id, branch_name) %>%
  print()

上述语句的输出结果为：

+---------+-----------+
|branch_id|branch_name|
+---------+-----------+
|2        |华南       |
+---------+-----------+

为每个分支列出其自身及所有下级分支

result <- NULL

t1 <- branch %>%
  select(
    anchor_id = branch_id,
    anchor_name = branch_name,
    branch_id = branch_id,
    branch_name = branch_name
  )

while (TRUE) {
  if (nrow(t1) == 0) {
    break
  } else {
    result <- bind_rows(result, t1)
    t1 <- t1 %>%
      rename_with(~ paste0("t1.", .)) %>%
      inner_join(
        branch %>%
          rename_with(~ paste0("branch.", .)),
        by = c("t1.branch_id" = "branch.parent_branch"),
        keep = T,
        na_matches = "never"
      ) %>%
      select(
        anchor_id = t1.anchor_id,
        anchor_name = t1.anchor_name,
        branch_id = branch.branch_id,
        branch_name = branch.branch_name
      )
  }
}

result %>%
  arrange(anchor_id, branch_id) %>%
  print(n = Inf)

上述语句的输出结果为：

+---------+-----------+---------+-----------+
|anchor_id|anchor_name|branch_id|branch_name|
+---------+-----------+---------+-----------+
|1        |华东       |1        |华东       |
|1        |华东       |11       |上海市     |
|1        |华东       |12       |江苏省     |
|1        |华东       |13       |台湾省     |
|1        |华东       |111      |浦东新区   |
|1        |华东       |121      |南京市     |
|1        |华东       |1211     |鼓楼区     |
|1        |华东       |1212     |秦淮区     |
|2        |华南       |2        |华南       |
|2        |华南       |21       |广东省     |
|2        |华南       |211      |广州市     |
|2        |华南       |2111     |海珠区     |
|2        |华南       |2112     |黄埔区     |
|11       |上海市     |11       |上海市     |
|11       |上海市     |111      |浦东新区   |
|12       |江苏省     |12       |江苏省     |
|12       |江苏省     |121      |南京市     |
|12       |江苏省     |1211     |鼓楼区     |
|12       |江苏省     |1212     |秦淮区     |
|13       |台湾省     |13       |台湾省     |
|21       |广东省     |21       |广东省     |
|21       |广东省     |211      |广州市     |
|21       |广东省     |2111     |海珠区     |
|21       |广东省     |2112     |黄埔区     |
|111      |浦东新区   |111      |浦东新区   |
|121      |南京市     |121      |南京市     |
|121      |南京市     |1211     |鼓楼区     |
|121      |南京市     |1212     |秦淮区     |
|211      |广州市     |211      |广州市     |
|211      |广州市     |2111     |海珠区     |
|211      |广州市     |2112     |黄埔区     |
|1211     |鼓楼区     |1211     |鼓楼区     |
|1212     |秦淮区     |1212     |秦淮区     |
|2111     |海珠区     |2111     |海珠区     |
|2112     |黄埔区     |2112     |黄埔区     |
+---------+-----------+---------+-----------+