37

dplyr 高级

窗口函数与高级操作

Tidyverse 篇

让我们继续聊聊,相见恨晚的tidyverse

R
library(tidyverse)

scoped 函数

在相关章节介绍了dplyr的一些函数(mutate(), select()等等),事实上,这些函数加上后缀 _all, _at, _if,形成三组变体函数,可以方便对特定的子集进行操作。比如

  • 对数据框所有列操作,可以用_all
  • 对数据框指定的几列操作,可以用_at
  • 对数据框符合条件的几列进行操作,可以用_if
Operate_all_at_if
select()select_all()select_at()select_if()
mutate()mutate_all()mutate_at()mutate_if()
rename()rename_all()rename_at()rename_if()
arrange()arrange_all()arrange_at()arrange_if()
filter()filter_all()filter_at()filter_if()
distinct()distinct_all()distinct_at()distinct_if()
group_by()group_by_all()group_by_at()group_by_if()
summarise()summarise_all()summarise_at()summarise_if()
map()map_all()map_at()map_if()
modify()modify_all()modify_at()modify_if()

下面选取其中几个函数加以说明

mutate_if

R
iris <- iris %>% as_tibble() 

df_iris <- iris %>% head(5)
R
df_iris %>% mutate_if(is.double, as.integer)

可以一次性增加多列

R
df_iris %>% mutate_if(is.numeric, list(scale, log))

也可以把函数放在list()中,用 Purrr-style lambda 形式写出

R
df_iris %>% mutate_if(is.numeric, list(~ scale(.), ~ log(.)))

select_if()

R
df <- tibble::tibble(
  x = letters[1:3],
  y = c(1:3),
  z = c(0, 0, 0)
)
df
R
df %>% select_if(is.numeric)
R
df %>% select_if(~ n_distinct(.) &gt; 2)

select_if 多个条件的情况

R
df %>% select_if(
  list(~ (is.numeric(.) | is.character(.)))
)
R
df %>% select_if(
  ~ (is.numeric(.) | is.character(.))
)
R
to_keep <- function(x) is.numeric(x) | is.character(x)
df %>% select_if(to_keep)
R
df %>% select_if(
  list(~ (is.numeric(.) &amp;&amp; sum(.) &gt; 2))
)
R
df %>% select_if(
  list(~ (is.numeric(.) &amp;&amp; mean(.) &gt; 1))
)

我们也可以写成函数的形式

R
to_want <- function(x) is.numeric(x) &amp;&amp; sum(x) &gt; 3

df %>% select_if(to_want)

summarise_if

R
msleep <- ggplot2::msleep
msleep %>%
  dplyr::group_by(vore) %>%
  dplyr::summarise_all(~ mean(., na.rm = TRUE))
R
msleep <- ggplot2::msleep
msleep %>%
  dplyr::group_by(vore) %>%
  # summarise_if(is.numeric, ~mean(., na.rm = TRUE))
  dplyr::summarise_if(is.numeric, mean, na.rm = TRUE)

filter_if()

事实上,filter已经很强大了,有了scoped函数,就如虎添翼了

R
msleep <- ggplot2::msleep
msleep %>%
  dplyr::select(name, sleep_total) %>%
  dplyr::filter(sleep_total &gt; 18)
R
msleep %>%
  dplyr::select(name, sleep_total) %>%
  dplyr::filter(between(sleep_total, 16, 18))
R
msleep %>%
  dplyr::select(name, sleep_total) %>%
  # filter(near(sleep_total, 17,  tol=sd(sleep_total)))
  dplyr::filter(near(sleep_total, mean(sleep_total), tol = 0.5 * sd(sleep_total)))

mtcars是 R内置数据集,记录了32种不同品牌的轿车的的11个属性

R
mtcars <- mtcars %>% as_tibble()
mtcars

filter_if()配合all_vars(), any_vars()函数,可以完成很酷的工作. 比如,要求一行中所有变量的值都大于150

R
mtcars %>% filter_all(all_vars(. &gt; 150))

比如,要求一行中至少有一个变量的值都大于150

R
# Or the union:
mtcars %>% filter_all(any_vars(. &gt; 150))
R
# You can vary the selection of columns on which to apply the predicate.
# filter_at() takes a vars() specification:
mtcars %>% filter_at(vars(starts_with("d")), any_vars((. %% 2) == 0))

filter_if(.tbl, .predicate, .vars_predicate) 相对复杂点,我这里多说几句。

filter_if() 有三个参数:

  • .tbl, 数据框
  • .predicate, 应用在列上的函数,一般作为列的选择条件
  • .vars_predicate, 应用在一行上的函数,通过 all_vars(), any_vars()返回值决定是否选取该行。
R
# And filter_if() selects variables with a predicate function:
# filter_if(.tbl, .predicate, .vars_predicate)
# mtcars %>% map_df(~ all(floor(.) == .) )
# mtcars %>% select_if( ~ all(floor(.) == .) )

mtcars %>% filter_if(~ all(floor(.) == .), all_vars(. != 0))

所以这里是,先通过.predicate = ~ all(floor(.) == .) 选取变量值为整数的列,然后再看选取的这些列的行方向,如果每一行的值.vars_predicate = all_vars(. != 0) ,都不为0,就保留下来,否则过滤掉。

简单点说,这段代码的意思,数值全部为整数的列,不能同时为0

group_by

group_by() 用的很多,所以要多讲讲

R
mtcars %>% dplyr::group_by(cyl)
R
mtcars %>% group_by_at(vars(cyl))
R
# Group a data frame by all variables:
mtcars %>% group_by_all()
R
# Group by variables selected with a predicate:
iris %>% group_by_if(is.factor)

group_split(), group_map(), group_modify()

R
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_split()

简单点写,就是

R
iris %>%
  dplyr::group_split(Species)

如果使用group_split(), 注意分组后,返回的是列表

R
iris %>%
  dplyr::group_split(Species)

既然是列表,当然想到用前面讲到的purrr::map()家族

R
iris %>%
  dplyr::group_split(Species) %>%
  purrr::map(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))
R
iris %>%
  dplyr::group_split(Species) %>%
  purrr::map_df(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

上面这个代码,数据框分割成list, 处理完后再合并成数据框,难道不觉得折腾么? 为什么直接点? tidyverse不会让我们失望的,先看看group_map()

R
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_map(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))
  • group_map()要求 The result of .f should be a data frame(.f 必须返回数据框)
  • group_map() return a list of tibble(返回元素均为df的一个列表list(df1,df2,df3))

数据框进来,然后分组,依次处理成一个个数据框,最后以列表形式(a list of tibble)输出。

事实上,group_map()是返回list形式,也就是说,可以是返回任何形式,(a list of tibble)是其中特殊形式。 可以看看下面这个

R
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_map(
    ~ lm(Petal.Length ~ Sepal.Length, data = .x)
  )

group_modify() 才是真正意义上的"数据框进、数据框出"。

R
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::group_modify(~ broom::tidy(lm(Petal.Length ~ Sepal.Length, data = .x)))

为了大家方便查阅和记忆,我总结下表

函数说明常用组合返回值要求
map()列表进、列表出df %>% <br>group_split() %>% <br>map()list
map_df()列表进、数据框出df %>% <br>group_split() %>% <br>map_df()df
group_map()数据框进、列表出df %>% <br>group_by() %>% <br>group_map()返回list(list1, list2, ...) <br> 特例list(df1, df2, ...)
group_modify()数据框进、数据框出df %>% <br>group_by() %>% <br>group_modify()返回grouped tibble.f返回df
walk列表进df %>% <br>group_split() %>%<br>walk()side effects
group_walk()数据框进df %>% <br>group_by() %>% <br>group_walk()side effects

我常用的批量出图的语句

R
nobel_winners %>%
  dplyr::group_split(category) %>%
  purrr::map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.x$category)
  )
R
nobel_winners %>%
  dplyr::group_by(category) %>%
  dplyr::group_map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.y)
  )
R
nobel_winners %>%
  dplyr::group_by(category) %>%
  dplyr::group_walk(
    ~ ggsave(
      paste0(.y, ".png"),
      ggplot(data = .x, aes(x = prize_age)) +
        geom_density() +
        ggtitle(.y),
      device = "png",
      path = temp
    )
  ) %>%
  invisible()

其他group函数

group_nest(), group_data(), group_keys(), group_rows()

列名清理

数据框的列名,不要用有空格和中文。 如果拿到的原始数据中列比较多,手动修改麻烦,可以使用janitor::clean_names()函数

R
library(readxl)
library(janitor) # install.packages("janitor")

roster_raw <- read_excel(here::here("demo_data", "dirty_data.xlsx"))

glimpse(roster_raw)
R
roster <- roster_raw %>%
  janitor::clean_names()

glimpse(roster)

缺失值检查与处理

purrr & dplyr 技巧

R
library(purrr)
airquality <- as_tibble(airquality)

airquality %>% purrr::map(~ sum(is.na(.)))
R
airquality %>%
  purrr::map_df(~ sum(is.na(.)))
R
airquality %>%
  dplyr::summarise_at(2:3, ~ sum(is.na(.)))

缺失值替换

R
airquality %>%
  mutate_all(funs(replace(., is.na(.), 0)))
R
airquality %>%
  mutate_all(replace_na, replace = 0)
R
airquality %>%
  mutate_if(is.numeric, replace_na, replace = 0)
R
airquality %>%
  mutate_all(as.numeric) %>%
  mutate_all(~ coalesce(., 0))
R
tibble(
  y = c(1, 2, NA, NA, 5),
  z = c(NA, NA, 3, 4, 5)
) %>%
  mutate_all(~ coalesce(., 0))

标准化

数据变量,在标准化之前是有单位的,如mm,kg等,标准之后就没有量纲了,而是偏离均值的程度,一般用多少方差,几个方差来度量。 标准化的好处在于,不同量纲的变量可以比较分析。

R
df_mtcars
R
df_mtcars %>% select_if(funs(is.numeric))
R
# way 1
df_mtcars %>%
  mutate_at(vars(mpg, disp), ~ scale(., center = T, scale = T))
R
# way 2
df_mtcars %>%
  mutate_at(vars(mpg, disp), funs((. - mean(.)) / sd(.)))
R
# way 3
func <- function(x) (x - min(x)) / (max(x) - min(x))
df_mtcars %>%
  mutate_at(vars(mpg, disp), ~ func(.))

如果所有的列,都是数值型

R
func <- function(x) (x - min(x)) / (max(x) - min(x))

df_mtcars %>% mutate_all(~ func(.))
  • 但这里数据中还有其他类型(fct, chr),所以这里 mutate_all() 会报错。
  • 这种情形,用mutate_if()
R
func <- function(x) (x - min(x)) / (max(x) - min(x))

df_mtcars %>% mutate_if(is.numeric, ~ func(.))
R
funs <- list(
  centered = mean, # Function object
  scaled = ~ . - mean(.) / sd(.) # Purrr-style lambda
)

iris %>%
  mutate_if(is.numeric, funs)

across函数

数据框中向量de方向,事实上可以看做有两个方向,横着看是row-vector,竖着看是col-vector。

  • colwise: group_by() %>% summarise/mutate + across()
  • rowwise: rowwise()/nest_by() %>% summarise/mutate + c_across()

比如

R
iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarise(
    across(starts_with("Sepal"), mean),
    Area = mean(Petal.Length * Petal.Width),
    across(starts_with("Petal"), min)
  )

across函数替代scope函数

强大的across()函数,替代以上scope函数(_if, _at, 和 _all函数), 同时slice_max(), slice_min(), slice_n() 将替代 top_n()函数。请参考阅读相关章节。

R
df %>% mutate_if(is.numeric, mean, na.rm = TRUE)
# ->
df %>% mutate(across(is.numeric, mean, na.rm = TRUE))

df %>% mutate_at(vars(x, starts_with("y")), mean, na.rm = TRUE)
# ->
df %>% mutate(across(c(x, starts_with("y")), mean, na.rm = TRUE))

df %>% mutate_all(mean, na.rm = TRUE)
# ->
df %>% mutate(across(everything(), mean, na.rm = TRUE))

更方便的colwise操作

R
# multiple
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
mult <- list(x = 1, y = 10, z = 100)

df %>% mutate(across(all_of(names(mult)), ~ .x * mult[[cur_column()]]))



# weights
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
df
weights <- list(x = 0.2, y = 0.3, z = 0.5)

df %>% dplyr::mutate(
  across(all_of(names(weights)),
    list(wt = ~ .x * weights[[cur_column()]]),
    .names = "{col}.{fn}"
  )
)



# cutoffs
df <- tibble(x = 1:3, y = 3:5, z = 5:7)
df

cutoffs <- list(x = 2, y = 3, z = 7)

df %>% dplyr::mutate(
  across(all_of(names(cutoffs)), ~ if_else(.x &gt; cutoffs[[cur_column()]], 1, 0))
)

参考资料

  • https://dplyr.tidyverse.org/dev/articles/rowwise.html
  • https://dplyr.tidyverse.org/dev/articles/colwise.html
R
# remove the objects
# rm(list=ls())
rm(df, df_iris, df_mtcars, func, funs, msleep, roster, roster_raw, to_keep, to_want)
R
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)