40

across 之美(一)

across 基础用法

Tidyverse 篇

dplyr 1.0版本增加了across()函数,这个函数集中体现了dplyr宏包的强大和简约,今天我用企鹅数据,来领略它的美。

R
library(tidyverse)
library(palmerpenguins)
penguins

看到数据框里有很多缺失值,需要统计每一列缺失值的数量,按照常规的写法

R
penguins %>%
  summarise(
    na_in_species = sum(is.na(species)),
    na_in_island  = sum(is.na(island)),
    na_in_length  = sum(is.na(bill_length_mm)),
    na_in_depth   = sum(is.na(bill_depth_mm)),
    na_in_flipper = sum(is.na(flipper_length_mm)),
    na_in_body    = sum(is.na(body_mass_g)),
    na_in_sex     = sum(is.na(sex)),
    na_in_year    = sum(is.na(year))
  )

幸亏数据框的列数不够多,只有8列,如果数据框有几百列,那就成体力活了,同时代码复制粘贴也容易出错。想偷懒,我们自然想到用summarise_all()

R
penguins %>%
  summarise_all(
    ~ sum(is.na(.))
  )

挺好。接着探索,我们想先按企鹅类型分组,然后统计出各体征数据的均值,这个好说,直接写代码

R
penguins %>%
  group_by(species) %>%
  summarise(
    mean_length   = mean(bill_length_mm, na.rm = TRUE),
    mean_depth    = mean(bill_depth_mm, na.rm = TRUE),
    mean_flipper  = mean(flipper_length_mm, na.rm = TRUE),
    mean_body     = mean(body_mass_g, na.rm = TRUE)
  )

或者用summarise_if()偷懒

R
d1 <- penguins %>%
  group_by(species) %>%
  summarise_if(is.numeric, mean, na.rm = TRUE)
d1

方法不错,从语义上还算很好理解。但多了一列year, 我想在summarise_if()中用 is.numeric & !year去掉year,却没成功。人类的欲望是无穷的,我们还需要统计每组下企鹅的个数,然后合并到一起。因此,我们再接再厉

R
d2 <- penguins %>%
  group_by(species) %>%
  summarise(
    n = n()
  )
d2

最后合并

R
d1 %>% left_join(d2, by = "species")

结果应该没问题,然鹅,总让人感觉怪怪的,过程有点折腾,希望不这么麻烦。

across()横空出世

across()的出现,让这一切变得简单和清晰,上面三步完成的动作,一步搞定

图片
R
penguins %>%
  group_by(species) %>%
  summarise(
    across(where(is.numeric) &amp; !year, mean, na.rm = TRUE),
    n = n()
  )

是不是很强大。大爱Hadley Wickham !!!

across()函数形式

across()函数,它有三个主要的参数:

R
across(.cols = , .fns = , .names = )
  • 第一个参数.cols = ,选取我们要需要的若干列,选取多列的语法与select()的语法一致,选择方法非常丰富和人性化
  • 基本语法
  • :,变量在位置上是连续的,可以使用类似 1:3 或者 species:island
  • !,变量名前加!,意思是求这个变量的补集,等价于去掉这个变量,比如!species
  • &|,两组变量集的交集和并集,比如 is.numeric & !year, 就是选取数值类型变量,但不包括year; 再比如 is.numeric | is.factor就是选取数值型变量和因子型变量
  • c(),选取变量的组合,比如c(a, b, x)
  • 通过人性化的语句
  • everything(): 选取所有的变量
  • last_col(): 选取最后一列,也就说倒数第一列,也可以last_col(offset = 1L) 就是倒数第二列
  • 通过变量名的特征
  • starts_with(): 指定一组变量名的前缀,也就把选取具有这一前缀的变量,starts_with("bill_")
  • ends_with(): 指定一组变量名的后缀,也就选取具有这一后缀的变量,ends_with("_mm")
  • contains(): 指定变量名含有特定的字符串,也就是选取含有指定字符串的变量,ends_with("length")
  • matches(): 同上,字符串可以是正则表达式
  • 通过字符串向量
  • all_of(): 选取字符串向量对应的变量名,比如all_of(c("species", "sex", "year")),当然前提是,数据框中要有这些变量,否则会报错。
  • any_of(): 同all_of(),只不过数据框中没有字符串向量对应的变量,也不会报错,比如数据框中没有people这一列,代码any_of(c("species", "sex", "year", "people"))也正常运行,挺人性化的
  • 通过函数
  • 常见的有数据类型函数 where(is.numeric), where(is.factor), where(is.character), where(is.date)
  • 第二个参数.fns =,我们要执行的函数(或者多个函数),函数的语法有三种形式可选:
  • A function, e.g. mean.
  • A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
  • A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x))
  • 第三个参数.names =, 如果.fns是单个函数就默认保留原来数据列的名称,即"{.col}" ;如果.fns是多个函数,就在数据列的列名后面跟上函数名,比如"{.col}_{.fn}";当然,我们也可以简单调整列名和函数之间的顺序或者增加一个标识的字符串,比如弄成"{.fn}_{.col}""{.col}_{.fn}_aa"

across()应用举例

下面通过一些小案例,继续呈现across()函数的功能

求每一列的缺失值数量

就是本章开始的需求

R
penguins %>%
  summarise(
    na_in_species = sum(is.na(species)),
    na_in_island  = sum(is.na(island)),
    na_in_length  = sum(is.na(bill_length_mm)),
    na_in_depth   = sum(is.na(bill_depth_mm)),
    na_in_flipper = sum(is.na(flipper_length_mm)),
    na_in_body    = sum(is.na(body_mass_g)),
    na_in_sex     = sum(is.na(sex)),
    na_in_year    = sum(is.na(year))
  )
R
# using across()
penguins %>%
  summarise(
    across(everything(), function(x) sum(is.na(x)))
  )


# or
penguins %>%
  summarise(
    across(everything(), ~ sum(is.na(.)))
  ) %>% 
  pivot_longer( cols = everything() )

每个类型变量下有多少组?

R
penguins %>%
  summarise(
    distinct_species = n_distinct(species),
    distinct_island  = n_distinct(island),
    distinct_sex     = n_distinct(sex)
  )

# using across()
penguins %>%
  summarise(
    across(c(species, island, sex), n_distinct)
  )

多列多个统计函数

R
penguins %>%
  group_by(species) %>%
  summarise(
    length_mean  = mean(bill_length_mm, na.rm = TRUE),
    length_sd    = sd(bill_length_mm, na.rm = TRUE),
    depth_mean   = mean(bill_depth_mm, na.rm = TRUE),
    depth_sd     = sd(bill_depth_mm, na.rm = TRUE),
    flipper_mean = mean(flipper_length_mm, na.rm = TRUE),
    flipper_sd   = sd(flipper_length_mm, na.rm = TRUE),
    n            = n()
  )


# using across()
penguins %>%
  group_by(species) %>%
  summarise(
    across(ends_with("_mm"), list(mean = mean, sd = sd), na.rm = TRUE),
    n = n()
  )

不同分组下数据变量的多个分位数

事实上,这里是across()summarise()的强大结合起来

R
penguins %>%
  group_by(species, island) %>%
  summarise(
    prob    = c(.25, .75),
    length  = quantile(bill_length_mm, prob, na.rm = TRUE),
    depth   = quantile(bill_depth_mm, prob, na.rm = TRUE),
    flipper = quantile(flipper_length_mm, prob, na.rm = TRUE)
  )


# using across()
penguins %>%
  group_by(species, island) %>%
  summarise(
    prob = c(.25, .75),
    across(
      c(bill_length_mm, bill_depth_mm, flipper_length_mm),
      ~ quantile(., prob, na.rm = TRUE)
    )
  )


# or
penguins %>%
  group_by(species, island) %>%
  summarise(
    prob = c(.25, .75),
    across(where(is.numeric) &amp; !year, ~ quantile(., prob, na.rm = TRUE))
  )

不同分组下更复杂的统计

R
# using across()
penguins %>%
  group_by(species) %>%
  summarise(
    n = n(),
    across(starts_with("bill_"), mean, na.rm = TRUE),
    Area = mean(bill_length_mm * bill_depth_mm, na.rm = TRUE),
    across(ends_with("_g"), mean, na.rm = TRUE),
  )

数据标准化处理

R
std <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# using across()
penguins %>%
  summarise(
    across(where(is.numeric), std),
    across(where(is.character), as.factor)
  )


# using across() and purrr style
penguins %>%
  drop_na() %>% 
  summarise(
    across(starts_with("bill_"), ~ (.x - mean(.x)) / sd(.x))
  )

数据对数化处理

R
# using across()
penguins %>%
  drop_na() %>%
  mutate(
    across(where(is.numeric), log),
    across(where(is.character), as.factor)
  )

# using across()
penguins %>%
  drop_na() %>%
  mutate(
    across(where(is.numeric), .fns = list(log = log), .names = "{.fn}_{.col}"),
    across(where(is.character), as.factor)
  )

案例:小于0的值,替换成NA

R
test <- tibble(
  Staff.Confirmed = c(0, 1, -999), 
  Residents.Confirmed = c(12, -192, 0)
)
R
test %>% 
  mutate(
    across(contains("Confirmed"), ~if_else(.x &lt; 0, NA_real_, .x), .names = "res_{.col}")
  )

或者

R
na_if_negative <- function(x) {
  x[x &lt; 0] <- NA
  x
}

test %>% 
  mutate(
    across(contains("Confirmed"), na_if_negative, .names = "res_{.col}")
  )

在分组建模中与cur_data()配合使用

R
penguins %>%
  group_by(species) %>%
  summarise(
    broom::tidy(lm(bill_length_mm ~ bill_depth_mm, data = cur_data()))
  )


penguins %>%
  group_by(species) %>%
  summarise(
    broom::tidy(lm(bill_length_mm ~ ., data = cur_data() %>% select(is.numeric)))
  )



penguins %>%
  group_by(species) %>%
  summarise(
    broom::tidy(lm(bill_length_mm ~ .,
                data = cur_data() %>% transmute(across(is.numeric))
    ))
  )


penguins %>%
  group_by(species) %>%
  summarise(
    broom::tidy(lm(bill_length_mm ~ ., data = across(is.numeric)))
  )

cur_column()配合使用

每一列乘以各自的系数

R
df   <- tibble(x = 1:3, y = 3:5, z = 5:7)
mult <- list(x = 1, y = 10, z = 100)

df %>% 
  mutate(across(all_of(names(mult)), ~ .x * mult[[cur_column()]]))

每一列乘以各自的权重

R
df      <- tibble(x = 1:3, y = 3:5, z = 5:7)
weights <- list(x = 0.2, y = 0.3, z = 0.5)

df %>%
  mutate(
    across(all_of(names(weights)),
           list(wt = ~ .x * weights[[cur_column()]]),
          .names = "{col}.{fn}"
    )
  )

每一列有各自的阈值,如果在阈值之上为1,否则为 0

R
df      <- tibble(x = 1:3, y = 3:5, z = 5:7)
cutoffs <- list(x = 2, y = 3, z = 7)

df %>% mutate(
  across(all_of(names(cutoffs)), ~ if_else(.x &gt; cutoffs[[cur_column()]], 1, 0))
)
  • 来一个案例
R
# 要求 x1_intercept + x1_value * x1_slope  --> x1_yhat
# 要求 x2_intercept + x2_value * x2_slope  --> x2_yhat

library(stringr)

df <- tibble(
  x1_intercept = c(0.1850, 0.1518), x2_intercept = c(0.2109, 0.3370),
  x1_value = c(0.0098, 0.0062), x2_value = c(0.0095, 0.0060),
  x1_slope = c(0.1234, 0.1241), x2_slope = c(0.1002, 0.3012),
)
df

df %>%
  mutate(
    across(
      .cols = ends_with("_intercept"),
      .fns = ~ . + get(str_replace(cur_column(), "intercept", "value")) *
        get(str_replace(cur_column(), "intercept", "slope")),
      .names = "{.col}_yhat"
    )
  ) %>%
  rename_with( ~ str_remove(., "_intercept"), ends_with("_yhat"))
  • 再来一个案例
R
df <- tibble(
  var_A_baseline = c(1, 2, 3, 4, 5),
  var_B_baseline = c(4, 1, 2, 3, 5),
  var_A_followup = c(3, 5, 4, 1, 2),
  var_B_followup = c(2, 5, 1, 3, 4)
)

# 需求 var_*_followup -  var_*_baseline

df %>%
  mutate(
    across(
      ends_with("_followup"),
      ~ . - get(sub("_followup", "_baseline", cur_column()))
    )
  )

.names参数也可用函数

R
penguins %>% 
  summarise(
    across(starts_with("bill"), 
           .fns = list(mean = ~ mean(.x, na.rm = TRUE)),
           .names = "{.col}_{.fn}"  
           )
  )


penguins %>% 
  summarise(
    across(starts_with("bill"), 
           .fns = list(mean = ~ mean(.x, na.rm = TRUE)),
           .names = "{stringr::str_remove(.col, '_mm')}_{.fn}"  
    )
  )

c_across()配合也挺默契

在一行中的占比

R
df <- tibble(x = 1:3, y = 3:5, z = 5:7)

df %>%
  rowwise() %>%
  mutate(total = sum(c_across(x:z))) %>%
  ungroup() %>%
  mutate(across(x:z, ~ . / total))

更神奇的方法,请看相关章节。

案例:替换一行中最大的值

看一行中哪个最大,最大的变为1,其余的变为0

R
df
R
replace_rowwise_max <- function(vec) {
  if (!is.vector(vec)) {
    stop("input of replace_col_max must be vector.")
  }

  if_else(vec == max(vec), 1L, 0L)
}


df %>%
  rowwise() %>%
  mutate(
    new = list(replace_rowwise_max(c_across(everything())))
  ) %>%
  unnest_wider(new, names_sep = "_")
R
df %>%
  purrr::pmap_dfr(
    ~`[<-`( c(...), seq_along(c(...)), if_else( c(...) == max(c(...)), 1, 0 )) 
  )

最风骚的是

R
df %>%
  rowwise() %>%
  mutate(
    across(x:z, ~ if_else(.x == max(c_across(x:z)), 1, 0))
  )

across()总结

我们看到了,across()函数在summarise()/mutate()/transmute()/condense()中使用,它能实现以下几个功能:

  • 数据框中的多列执行相同操作
  • 不同性质的操作,有时可以一起写出,不用再left_join()
图片
R
# remove the objects
# ls() %>% stringr::str_flatten(collapse = ", ")

rm(cutoffs, d1, d2, df, mult, std, weights, replace_col_max)
R
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)