数据规整进阶
嵌套与高级变换
Tidyverse 篇接着上一章,罗列一些tidyr的函数
R
library(tidyverse)
fill() 缺失值填充
利用所在列的上下值进行缺失值填充
R
sales <- tibble::tribble(
~quarter, ~year, ~sales,
"Q1", 2000, 66013,
"Q2", NA, 69182,
"Q3", NA, 53175,
"Q4", NA, 21001,
"Q1", 2001, 46036,
"Q2", NA, 58842,
"Q3", NA, 44568,
"Q4", NA, 50197,
"Q1", 2002, 39113,
"Q2", NA, 41668,
"Q3", NA, 30144,
"Q4", NA, 52897
)
sales
R
sales %>% fill(year)
也可以控制填充的方向
R
sales %>% fill(year, .direction = "up")
expand() 与 complete()
指定数据框的若干列,根据其向量元素值,产生所有可能的交叉组合
R
df <- tibble::tribble(
~x, ~y, ~z,
1L, 1L, 4L,
1L, 2L, 5L,
2L, 1L, NA,
3L, 2L, 6L
)
df %>% expand(x, y)
nesting()用于限定只产生数据框已出现的组合。
R
df %>% expand(nesting(x, y))
R
df %>% expand(nesting(x, y), z)
complete() 补全,可以看做是 expand(nesting()) + fill()
R
df %>% complete(x, y)
R
df %>% complete(x, y, fill = list(z = 0))
数据在complete补全的时候,会面临有两种缺失值:
- 补位的时候造成的空缺
- 数据原先就存在缺失值
R
df %>% complete(x, y)
- 补位的时候造成的空缺,可通过
fill = list(z = 0)控制填充
R
df %>% complete(x, y, fill = list(z = 0))
- 数据原先就存在缺失值,最好通过
explicit = FALSE显式地控制是否填充
R
df %>% complete(x, y, fill = list(z = 0), explicit = FALSE)
expand_grid() 与 crossing()
产生一个新的数据框,每行对应着向量元素的所有交叉组合
R
expand_grid(x = 1:3, y = 1:2)
R
crossing(x = 1:3, y = 1:2)
向量换成数据框也可以,其结果就是数据框行与元素的交叉组合
R
expand_grid(df = data.frame(x = 1:2, y = c(2, 1)), z = 1:3)
R
crossing(df = data.frame(x = 1:2, y = c(2, 1)), z = 1:3)
crossing()可以看作是expand_grid() + distinct(), 即crossing()在完成交叉组合之后会自动去重,比如
R
expand_grid(x = c(1, 1), y = c(1:2)) # 不考虑去重
R
crossing(x = c(1, 1), y = c(1:2)) # 考虑去重
separate() 与 unite()
R
tb <- tibble::tribble(
~day, ~price,
1, "30-45",
2, "40-95",
3, "89-65",
4, "45-63",
5, "52-42"
)
R
tb1 <- tb %>%
separate(price, into = c("low", "high"), sep = "-")
tb1
R
tb1 %>%
unite(col = "price", c(low, high), sep = ":", remove = FALSE)
有时候分隔符搞不定的,可以用正则表达式,将捕获的每组弄成一列
R
dfc <- tibble(x = c("1-12week", "1-10wk", "5-12w", "01-05weeks"))
dfc
R
dfc %>% tidyr::extract(
x,
c("start", "end", "letter"), "(\\d+)-(\\d+)([a-z]+)",
remove = FALSE
)
删除缺失值所在行drop_na()与replace_na()
R
df <- tibble::tribble(
~name, ~type, ~score, ~extra,
"Alice", "english", 80, 10,
"Alice", "math", NA, 5,
"Bob", "english", NA, 9,
"Bob", "math", 69, NA,
"Carol", "english", 80, 10,
"Carol", "math", 90, 5
)
df
如果score列中有缺失值NA,就删除所在的row
R
df %>%
filter(!is.na(score))
或者用across()
R
df %>%
filter(
across(score, ~ !is.na(.x))
)
所有列,如果有缺失值NA,就删除所在的row
R
df %>%
filter(
across(everything(), ~ !is.na(.x))
)
现在有更简便的方法
R
df %>%
drop_na()
也可指定某一列
R
df %>%
drop_na(score)
没来参加考试,视为0分,可以用replace_na()
R
df %>% mutate(score = replace_na(score, 0))
或者使用coalesce()
R
df %>% mutate(score = coalesce(score, 0))
R
df %>%
mutate(
across(c(score, extra), ~ coalesce(.x, 0))
)
没来参加考试,用平均分代替
R
df %>%
mutate(
score = replace_na(score, mean(score, na.rm = TRUE))
)
当然也可以用if_else()来做
R
df %>%
mutate(
score = if_else(is.na(score), mean(score, na.rm = TRUE), score)
)
R
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)