across 之美(四)
across 高级模式
Tidyverse 篇本章讲讲mutate()中的across()与c_acorss()、map_df()、map2_dfc()系列的纠缠。
内容涉及迭代、泛函、返回数据框、数据框并入等概念。
R
library(tidyverse)
penguins <- palmerpenguins::penguins %>% drop_na()
mutate() 数据框并入
在以往的学习中,我们了解到mutate()的功能是新增一列,
R
d <- tibble(
x = 3:5
)
d %>%
mutate(x1 = x*2)
有时候我在想同时新增多列呢?
R
d %>%
mutate(
x1 = x*2,
x2 = x^2
)
从形式上看,相当于在原来d的基础上并入了一个新的数据框。或许mutate()和我们认识的不一样, 于是我试试
R
t <- tibble(
y = 4:6
)
d %>%
mutate(t)
奥利给。我们再看看,让数据框是以函数的形式返回
R
my_fun <- function(x) {
tibble(
x1 = x * 2,
x2 = x^2)
}
d %>%
mutate(
my_fun(x)
)
是不是很惊喜。 我们再看看across()函数,在mutate() 中 across() 返回的就是数据框,正好并入原来d,道理是一样的。
R
d %>%
mutate(
across(x, list(f1 = ~ .x * 2, f2 = ~ .x^2))
)
事实上,across()的返回数据框的特性,结合mutate()的并入数据框功能,让数据处理如鱼得水、如虎添翼。
从一个问题开始
计算每天水分和食物的所占比例,比如第一天water和food都是10.0,那么各自比例都是50%.
R
d <- tibble::tribble(
~water, ~food,
10.0, 10.0,
12.1, 10.3,
13.5, 19.1,
17.4, 16.0,
25.8, 15.6,
27.4, 19.8
)
d
传统的方法
传统的方法是,把数据框旋转成长表格,计算所占比例后,再旋转回来
R
d %>%
rownames_to_column() %>%
pivot_longer(
cols = !rowname
) %>%
group_by(rowname) %>%
mutate(
percent = 100 * value / sum(value)
) %>%
ungroup() %>%
pivot_wider(
names_from = name,
values_from = c(value, percent),
names_glue = "{name}_{.value}"
)
across()的方法
传统的方法,用到基本的dplyr函数,思路很清晰,但有点周折。下面,我列出几个比较新颖的方法,当然这些方法都来源于强大across()函数
方法1
R
d %>%
mutate(100 * across(.names = "%{.col}") / rowSums(across())) %>%
ungroup()
方法2
R
rowPercent <- function(df) {
df / rowSums(df) * 100
}
d %>%
mutate(rowPercent(across(.names = "%{.col}")))
方法3
R
d %>%
rowwise() %>%
mutate(
across(everything(), ~ .x / sum(c_across()) )
)
df %>%
rowwise() %>%
mutate(
across(everything(), .names = "prop_{.col}", ~ .x / sum(c_across()) )
)
df %>%
rowwise() %>%
mutate(
across(.names = "prop_{.col}", .fns = ~ .x / sum(c_across()) )
)
R
d %>%
rowwise() %>%
mutate(100 * across(.names = "%{.col}") / sum(c_across())) %>%
ungroup()
方法4
R
scale <- function(x) {
100 * x / sum(x, na.rm = TRUE)
}
d %>%
rowwise() %>%
mutate(
scale(across(.names = "%{.col}"))
)
方法5
R
d %>%
rowwise() %>%
mutate(100 * proportions(across(.names = "%{.col}")))
方法6
更好的呈现比例
R
d %>%
rowwise() %>%
mutate(
across(c(water, food), ~scales::label_percent(scale = 100)(.x /sum(c_across())), .names = "{.col}_%")
)
上面的方法虽然很多,但基本思路是一样的。
纠缠不清的迭代
我们先弄清楚迭代方向:
rowwise()一行一行的处理across()一列一列的处理rowwise() + across()这种组合,双重迭代,(一行一行+一列一列)就变成了一个一个的处理across() + purrr::map_dbl()这种组合分两种情形:purrr::map_dbl()作为across( .fns = )中的函数,即across(.cols = , .fns = map_dbl() )。across()一列一列的迭代,每一列又传入purrr::map_dbl()再次迭代,因此这里是双重迭代across()作为purrr::map_df(.x = )的数据,即purrr::map_df(.x = across(), .f = )。因为在mutate()中across()返回数据框,因此可以把across()整体视为数据框,然后这个数据框传入purrr::map_df(.x = )进行迭代,因此这种情形可以认为只有purrr::map_*()一次迭代。
R
# rowwise() + across()
# rowwise() 设定行方向后,接着across() 就行方向上的元素一个一个的执行.fns
# 循环模式:第一层,一行一行的,第二层在每一行里,一个元素到一个元素
penguins %>%
group_by(species, year) %>%
summarise(flipper_length_mm = list(flipper_length_mm)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = flipper_length_mm
) %>%
rowwise() %>%
mutate(
across(where(is.list), .fns = length)
)
R
# across(.cols = , .fns = purrr::map_dbl() )
# 用across()就是一列一列的处理,
# 此时的一列是vector or list,又可以进入purrr::map_dbl()再次迭代,对这一列的每个元素,执行.f
# 然后across()到下一列
# 循环模式:第一层,一列一列,第二层在每一列里,一个元素到一个元素
penguins %>%
group_by(species, year) %>%
summarise(flipper_length_mm = list(flipper_length_mm)) %>%
ungroup() %>%
pivot_wider(
names_from = year,
values_from = flipper_length_mm
) %>%
mutate(
across(where(is.list), ~ purrr::map_dbl(.x, length))
)
R
# `purrr::map_df(.x = across(), .f = )`
# mutate()中的`across()`整体被视为**数据框**,传入purrr::map_df(.x = across(), .f = ),然后迭代,返回数据框最后并入最初的df
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate(
map_dfc(
.x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
.f = ~.x^2
)
)
写成分步的形式,可能更好理解
R
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate({
data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
out <- map_dfc(data, .f = ~.x^2)
out
})
再回头看across()的常规用法,是否对它有了新的认识?
R
penguins %>%
select(species, starts_with("bill_")) %>%
head(5) %>%
mutate(
across(ends_with("_mm"), .fns = ~.x^2, .names = '{sub("_mm", "", .col)}')
)
案例1
觉得不过瘾,我们看下面复杂点的例子
R
tt <- penguins %>%
group_by(species, year) %>%
summarise(
across(c(bill_length_mm, bill_depth_mm), list)
) %>%
ungroup()
tt
R
tt %>%
mutate(
map_dfc(
.x = across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}'),
.f = ~ map_dbl(.x, length)
)
)
分步写法
R
tt %>%
mutate({
data <- across(ends_with("_mm"), .names = '{sub("_mm", "", .col)}')
out <- map_dfc(data, .f = ~ map_dbl(.x, length))
out
})
回到常规写法
R
tt %>%
mutate(
across(ends_with("_mm"), .fns = ~ map_dbl(.x, length), .names = '{sub("_mm", "", .col)}')
)
案例2
更变态的案例
R
tt %>%
mutate(
purrr::map2_dfr(
.x = across(bill_length_mm, .names = "cor"),
.y = across(bill_depth_mm),
.f = ~ map2_dbl(.x, .y, cor)
)
)
R
tt %>%
mutate(
purrr::map2_dfr(
.x = across(bill_length_mm, .names = "cor"),
.y = across(bill_depth_mm),
.f = ~ map2_dbl(.x, .y, cor)
)
)
分步写法
R
tt %>%
mutate({
data1 <- across(bill_length_mm, .names = "cor")
data2 <- across(bill_depth_mm)
out <- purrr::map2_dfc(data1, data2, .f = ~ map2_dbl(.x, .y, cor))
out
})
常规方法
R
tt %>%
rowwise() %>%
mutate(
cor = cor(bill_length_mm, bill_depth_mm)
)
我们这样折腾只是为了展示各种迭代.
习题
习题1
对于数据
R
df <- tibble(
id = 1:10,
sex = c("m", "m", "m", "f", "f", "f", "m", "f", "f", "m"),
lds1.x = c(NA, 1, 0, 1, NA, 0, 0, NA, 0, 1),
lds1.y = c(1, NA, 1, 1, 0, NA, 0, 3, NA, 1),
lds2.x = c(2, 1, NA, 0, 0, NA, 1, NA, NA, 1),
lds2.y = c(0, 2, 2, NA, NA, 0, 0, 3, 0, NA)
)
df
希望两两coalesce,比如,
R
df %>%
mutate(
lds1 = coalesce(lds1.x, lds1.y),
lds2 = coalesce(lds2.x, lds2.y)
)
但要求是用across()写。
解题思路:
- 在
mutate()中,把across()整体当作数据框用,比如
R
df %>%
mutate(
across(ends_with(".x"))
)
df %>%
mutate(
across(ends_with(".x"), .names = '{sub(".x","",.col)}')
)
- 在
mutate()中across()视为数据框,传递给map_dfc()后,map_dfc()将其转换成一个新的数据框,这个新的数据框最后并入df
R
df %>%
mutate(
map_dfc(
.x = across(ends_with(".x"), .names = '{sub(".x","", .col)}'),
.f = ~is.na(.x)
)
)
- 两个
across()对应两个数据框,传递给map2_dfc()函数
R
df %>%
mutate(
map2_dfr(
.x = across(ends_with(".x"), .names = '{sub(".x","",.col)}'),
.y = across(ends_with(".y")),
.f = coalesce # Vectors coalesce
)
)
- 分步写,更清晰和优雅。迭代过程:数据框df1的第一列和数据框data2的第一列coalesce,然后数据框df1的第二列和数据框df2的第二列coalesce.
R
df %>%
mutate({
df1 <- across(ends_with(".x"), .names = '{sub(".x","",.col)}')
df2 <- across(ends_with(".y"))
out <- purrr::map2_dfc(df1, df2, ~ coalesce(.x, .y))
out
})
习题2
题目:如果符合某个条件,就让指定的列反号。比如,如果x小于4,x和y两列就反号。
事实上,完成这个任务的方法很多,我们只是演示across()的某些特征。
R
d <- tibble( x = 1:4, y = 1:4)
d
R
# using data frame returns
d %>%
mutate({
test <- x < 4
x[test] <- -x[test]
y[test] <- -y[test]
data.frame(x = x, y = y)
})
R
# using across()
d %>%
mutate({
test <- x < 4
across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
})
R
# further abstract
negate_if <- function(condition, cols) {
across({{ cols }}, ~ {
.x[condition] <- -.x[condition]
.x
})
}
d %>%
mutate(negate_if(x < 4, c(x, y)))
R
# remove the objects
# ls() %>% stringr::str_flatten(collapse = ", ")
rm(d, scale, rowPercent, penguins, negate_if, df, tt)
R
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)