dplyr 高级 — 数据科学中的 R 语言

scoped 函数

在相关章节介绍了dplyr的一些函数（mutate(), select()等等），事实上，这些函数加上后缀 _all, _at, _if，形成三组变体函数，可以方便对特定的子集进行操作。比如

对数据框所有列操作，可以用_all
对数据框指定的几列操作，可以用_at
对数据框符合条件的几列进行操作，可以用_if

Operate	_all	_at	_if
`select()`	`select_all()`	`select_at()`	`select_if()`
`mutate()`	`mutate_all()`	`mutate_at()`	`mutate_if()`
`rename()`	`rename_all()`	`rename_at()`	`rename_if()`
`arrange()`	`arrange_all()`	`arrange_at()`	`arrange_if()`
`filter()`	`filter_all()`	`filter_at()`	`filter_if()`
`distinct()`	`distinct_all()`	`distinct_at()`	`distinct_if()`
`group_by()`	`group_by_all()`	`group_by_at()`	`group_by_if()`
`summarise()`	`summarise_all()`	`summarise_at()`	`summarise_if()`
`map()`	`map_all()`	`map_at()`	`map_if()`
`modify()`	`modify_all()`	`modify_at()`	`modify_if()`

下面选取其中几个函数加以说明

mutate_if

R

iris <- iris %>% as_tibble() 

df_iris <- iris %>% head(5)

R

df_iris %>% mutate_if(is.double, as.integer)

可以一次性增加多列

R

df_iris %>% mutate_if(is.numeric, list(scale, log))

也可以把函数放在list()中，用 Purrr-style lambda 形式写出

R

df_iris %>% mutate_if(is.numeric, list(~ scale(.), ~ log(.)))

select_if()

R

df <- tibble::tibble(
  x = letters[1:3],
  y = c(1:3),
  z = c(0, 0, 0)
)
df

R

df %>% select_if(is.numeric)

R

df %>% select_if(~ n_distinct(.) &gt; 2)

select_if 多个条件的情况

R

df %>% select_if(
  list(~ (is.numeric(.) | is.character(.)))
)

R

df %>% select_if(
  ~ (is.numeric(.) | is.character(.))
)

R

to_keep <- function(x) is.numeric(x) | is.character(x)
df %>% select_if(to_keep)

R

df %>% select_if(
  list(~ (is.numeric(.) &amp;&amp; sum(.) &gt; 2))
)

R

df %>% select_if(
  list(~ (is.numeric(.) &amp;&amp; mean(.) &gt; 1))
)

我们也可以写成函数的形式

R

to_want <- function(x) is.numeric(x) &amp;&amp; sum(x) &gt; 3

df %>% select_if(to_want)

summarise_if

R

msleep <- ggplot2::msleep
msleep %>%
  dplyr::group_by(vore) %>%
  dplyr::summarise_all(~ mean(., na.rm = TRUE))

R

msleep <- ggplot2::msleep
msleep %>%
  dplyr::group_by(vore) %>%
  # summarise_if(is.numeric, ~mean(., na.rm = TRUE))
  dplyr::summarise_if(is.numeric, mean, na.rm = TRUE)

filter_if()

事实上，filter已经很强大了，有了scoped函数，就如虎添翼了

R

msleep <- ggplot2::msleep
msleep %>%
  dplyr::select(name, sleep_total) %>%
  dplyr::filter(sleep_total &gt; 18)

R

msleep %>%
  dplyr::select(name, sleep_total) %>%
  dplyr::filter(between(sleep_total, 16, 18))

R

msleep %>%
  dplyr::select(name, sleep_total) %>%
  # filter(near(sleep_total, 17,  tol=sd(sleep_total)))
  dplyr::filter(near(sleep_total, mean(sleep_total), tol = 0.5 * sd(sleep_total)))

mtcars是 R内置数据集，记录了32种不同品牌的轿车的的11个属性

R

mtcars <- mtcars %>% as_tibble()
mtcars

filter_if()配合all_vars(), any_vars()函数，可以完成很酷的工作. 比如，要求一行中所有变量的值都大于150

R

mtcars %>% filter_all(all_vars(. &gt; 150))

比如，要求一行中至少有一个变量的值都大于150

R

# Or the union:
mtcars %>% filter_all(any_vars(. &gt; 150))

R

# You can vary the selection of columns on which to apply the predicate.
# filter_at() takes a vars() specification:
mtcars %>% filter_at(vars(starts_with("d")), any_vars((. %% 2) == 0))

filter_if(.tbl, .predicate, .vars_predicate) 相对复杂点，我这里多说几句。

filter_if() 有三个参数：

.tbl, 数据框
.predicate, 应用在列上的函数，一般作为列的选择条件
.vars_predicate, 应用在一行上的函数，通过 all_vars(), any_vars()返回值决定是否选取该行。

R

# And filter_if() selects variables with a predicate function:
# filter_if(.tbl, .predicate, .vars_predicate)
# mtcars %>% map_df(~ all(floor(.) == .) )
# mtcars %>% select_if( ~ all(floor(.) == .) )

mtcars %>% filter_if(~ all(floor(.) == .), all_vars(. != 0))

所以这里是，先通过.predicate = ~ all(floor(.) == .) 选取变量值为整数的列，然后再看选取的这些列的行方向，如果每一行的值.vars_predicate = all_vars(. != 0) ，都不为0，就保留下来，否则过滤掉。

简单点说，这段代码的意思，数值全部为整数的列，不能同时为0

group_by

group_by() 用的很多，所以要多讲讲

R

mtcars %>% dplyr::group_by(cyl)

R

mtcars %>% group_by_at(vars(cyl))

R

# Group a data frame by all variables:
mtcars %>% group_by_all()

R

# Group by variables selected with a predicate:
iris %>% group_by_if(is.factor)

group_split(), group_map(), group_modify()

R

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_split()

简单点写，就是

R

iris %>%
  dplyr::group_split(Species)

如果使用group_split(), 注意分组后，返回的是列表

R

iris %>%
  dplyr::group_split(Species)

既然是列表，当然想到用前面讲到的purrr::map()家族

R

iris %>%
  dplyr::group_split(Species) %>%
  purrr::map(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

R

iris %>%
  dplyr::group_split(Species) %>%
  purrr::map_df(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

上面这个代码，数据框分割成list, 处理完后再合并成数据框，难道不觉得折腾么？为什么直接点？ tidyverse不会让我们失望的，先看看group_map()

R

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_map(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

group_map()要求 The result of .f should be a data frame(.f 必须返回数据框)
group_map() return a list of tibble(返回元素均为df的一个列表list(df1,df2,df3))

数据框进来，然后分组，依次处理成一个个数据框，最后以列表形式（a list of tibble）输出。

事实上，group_map()是返回list形式，也就是说，可以是返回任何形式，（a list of tibble）是其中特殊形式。可以看看下面这个

R

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_map(
    ~ lm(Petal.Length ~ Sepal.Length, data = .x)
  )

group_modify() 才是真正意义上的"数据框进、数据框出"。

R

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_modify(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

为了大家方便查阅和记忆，我总结下表

函数	说明	常用组合	返回值	要求
map()	列表进、列表出	df %>% <br>group_split() %>% <br>map()	list
map_df()	列表进、数据框出	df %>% <br>group_split() %>% <br>map_df()	df
group_map()	数据框进、列表出	df %>% <br>group_by() %>% <br>group_map()	返回list(list1, list2, ...) <br> 特例list(df1, df2, ...)
group_modify()	数据框进、数据框出	df %>% <br>group_by() %>% <br>group_modify()	返回grouped tibble	.f返回df
walk	列表进	df %>% <br>group_split() %>%<br>walk()	side effects
group_walk()	数据框进	df %>% <br>group_by() %>% <br>group_walk()	side effects

我常用的批量出图的语句

R

nobel_winners %>%
  dplyr::group_split(category) %>%
  purrr::map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.x$category)
  )

R

nobel_winners %>%
  dplyr::group_by(category) %>%
  dplyr::group_map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.y)
  )

R

nobel_winners %>%
  dplyr::group_by(category) %>%
  dplyr::group_walk(
    ~ ggsave(
      paste0(.y, ".png"),
      ggplot(data = .x, aes(x = prize_age)) +
        geom_density() +
        ggtitle(.y),
      device = "png",
      path = temp
    )
  ) %>%
  invisible()

其他group函数

group_nest(), group_data(), group_keys(), group_rows()

列名清理

数据框的列名，不要用有空格和中文。如果拿到的原始数据中列比较多，手动修改麻烦，可以使用janitor::clean_names()函数

R

library(readxl)
library(janitor) # install.packages("janitor")

roster_raw <- read_excel(here::here("demo_data", "dirty_data.xlsx"))

glimpse(roster_raw)

R

roster <- roster_raw %>%
  janitor::clean_names()

glimpse(roster)

缺失值检查与处理

purrr & dplyr 技巧

R

library(purrr)
airquality <- as_tibble(airquality)

airquality %>% purrr::map(~ sum(is.na(.)))

R

airquality %>%
  purrr::map_df(~ sum(is.na(.)))

R

airquality %>%
  dplyr::summarise_at(2:3, ~ sum(is.na(.)))

缺失值替换

R

airquality %>%
  mutate_all(funs(replace(., is.na(.), 0)))

R

airquality %>%
  mutate_all(replace_na, replace = 0)

R

airquality %>%
  mutate_if(is.numeric, replace_na, replace = 0)

R

airquality %>%
  mutate_all(as.numeric) %>%
  mutate_all(~ coalesce(., 0))

R

tibble(
  y = c(1, 2, NA, NA, 5),
  z = c(NA, NA, 3, 4, 5)
) %>%
  mutate_all(~ coalesce(., 0))

标准化

数据变量，在标准化之前是有单位的，如mm,kg等，标准之后就没有量纲了，而是偏离均值的程度，一般用多少方差，几个方差来度量。标准化的好处在于，不同量纲的变量可以比较分析。

R

df_mtcars

R

df_mtcars %>% select_if(funs(is.numeric))

R

# way 1
df_mtcars %>%
  mutate_at(vars(mpg, disp), ~ scale(., center = T, scale = T))

R

# way 2
df_mtcars %>%
  mutate_at(vars(mpg, disp), funs((. - mean(.)) / sd(.)))

R

# way 3
func <- function(x) (x - min(x)) / (max(x) - min(x))
df_mtcars %>%
  mutate_at(vars(mpg, disp), ~ func(.))

如果所有的列，都是数值型

R

func <- function(x) (x - min(x)) / (max(x) - min(x))

df_mtcars %>% mutate_all(~ func(.))

但这里数据中还有其他类型（fct, chr），所以这里 mutate_all() 会报错。
这种情形，用mutate_if()

R

func <- function(x) (x - min(x)) / (max(x) - min(x))

df_mtcars %>% mutate_if(is.numeric, ~ func(.))

R

funs <- list(
  centered = mean, # Function object
  scaled = ~ . - mean(.) / sd(.) # Purrr-style lambda
)

iris %>%
  mutate_if(is.numeric, funs)

across函数

数据框中向量de方向，事实上可以看做有两个方向，横着看是row-vector，竖着看是col-vector。

colwise: group_by() %>% summarise/mutate + across()
rowwise: rowwise()/nest_by() %>% summarise/mutate + c_across()

比如

R

iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(
    across(starts_with("Sepal"), mean),
    Area = mean(Petal.Length * Petal.Width),
    across(starts_with("Petal"), min)
  )

across函数替代scope函数

强大的across()函数，替代以上scope函数(_if, _at, 和 _all函数), 同时slice_max(), slice_min(), slice_n() 将替代 top_n()函数。请参考阅读相关章节。

R

df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(is.numeric, mean, na.rm = TRUE))

df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))

df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))

更方便的colwise操作

R

# multiple
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
mult <- list(x = 1, y = 10, z = 100)

df %>% mutate(across(all_of(names(mult)), ~ .x * mult[[cur_column()]]))



# weights
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
df
weights <- list(x = 0.2, y = 0.3, z = 0.5)

df %>% dplyr::mutate(
  across(all_of(names(weights)),
    list(wt = ~ .x * weights[[cur_column()]]),
    .names = "{col}.{fn}"
  )
)



# cutoffs
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
df

cutoffs <- list(x = 2, y = 3, z = 7)

df %>% dplyr::mutate(
  across(all_of(names(cutoffs)), ~ if_else(.x &gt; cutoffs[[cur_column()]], 1, 0))
)

参考资料

https://dplyr.tidyverse.org/dev/articles/rowwise.html
https://dplyr.tidyverse.org/dev/articles/colwise.html

R

# remove the objects
# rm(list=ls())
rm(df, df_iris, df_mtcars, func, funs, msleep, roster, roster_raw, to_keep, to_want)

R

pacman::p_unload(pacman::p_loaded(), character.only = TRUE)