dplyr 应用

实战案例与技巧

Tidyverse 篇

本章主要关注dplyr的一些应用。

导入数据

今天讲一个关于企鹅的数据故事。

library(tidyverse)
library(palmerpenguins)

penguins <- penguins %>% drop_na()

变量含义

variable	class	description
species	character	企鹅种类 (Adelie, Gentoo, Chinstrap)
island	character	所在岛屿 (Biscoe, Dream, Torgersen)
bill_length_mm	double	嘴峰长度 (单位毫米)
bill_depth_mm	double	嘴峰深度 (单位毫米)
flipper_length_mm	integer	鰭肢长度 (单位毫米)
body_mass_g	integer	体重 (单位克)
sex	character	性别
year	integer	记录年份

简单回顾

选择"bill_"开始的列

penguins %>% select(bill_length_mm, bill_depth_mm)

penguins %>% select(starts_with("bill_"))

选择"_mm"结尾的列

penguins %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm)

penguins %>% select(ends_with("_mm"))

选择含有"length"的列

penguins %>% select(bill_length_mm, flipper_length_mm)

penguins %>% select(contains("length"))

选择数值型的列

penguins %>% select(where(is.numeric))

选择字符串类型的列

penguins %>% select(where(is.character))

选择字符串类型以外的列

penguins %>% select(!where(is.character))

可以用多种组合来选择

penguins %>% select(species, starts_with("bill_"))

返回向量还是数据框

对应数据框my_tibble，注意返回向量还是数据框的区别

返回向量

my_tibble[["x"]] 

my_tibble$x 

my_tibble %>% 
  pull(x)

返回数据框

my_tibble["x"]

my_tibble %>% 
  select(x)

选择全部为0的列

tb <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = 0,
  z = c(-1, -2, 0, 2, 1),
  w = c(0, 1, 2, 3, 4)
)
tb



myfun <- function(x) all(x == 0)
tb %>%
  select(where(myfun))



# or
tb %>%
  select(where(~all(.x == 0)))  # 找出选择全部为0的列

tb %>%
  select(where(~sum(.x) == 0))  # 找出这一列元素之和为0的列

tb %>%
  select(where(~any(.x == 0)))  # 找出这一列元素含有0的列

课堂练习：剔除全部为NA的列或者全部为NA的行

df <- tibble(
  x = c(NA, NA, NA),
  y = c(2, 3, NA),
  z = c(NA, 5, NA)
)

# columns
df %>%
  select(where(~ !all(is.na(.x))))

# rows
df %>%
  filter(
    if_any(everything(), ~ !is.na(.x))
  )

寻找男企鹅

函数 filter() 中的逻辑运算符

Operator	Meaning
`==`	Equal to
`>`	Greater than
`<`	Less than
`>=`	Greater than or equal to
`<=`	Less than or equal to
`!=`	Not equal to
`%in%`	in
`is.na`	is a missing value (NA)
`!is.na`	is not a missing value
`&`	and
`	`	or

penguins %>% filter(sex == "male")

penguins %>% filter(species %in% c("Adelie", "Gentoo"))

penguins %>%
  filter(species == "Adelie" &amp;  bill_length_mm &gt; 40)

penguins %>%
  filter(species == "Adelie", bill_length_mm &gt; 40)

课堂练习，说出以下代码的含义

penguins %>%
  filter(species == "Adelie", bill_length_mm == max(bill_length_mm) )

更多应用

希望介绍一个技术，对应一个应用场景

弱水三千，只取一瓢

penguins %>% 
  head()

penguins %>% 
  tail()

penguins %>% 
  slice(1)

penguins %>% 
  group_by(species) %>% 
  slice(1)

嘴峰长度最大那一行

三种方法

penguins %>%
  filter(bill_length_mm == max(bill_length_mm) )

penguins %>% 
  arrange(desc(bill_length_mm)) %>% 
  slice(1)

penguins %>% 
  slice_max(bill_length_mm)

separate

tb <- tibble::tribble(
  ~day, ~price,
  1,   "30-45",
  2,   "40-95",
  3,   "89-65",
  4,   "45-63",
  5,   "52-42"
)

tb1 <- tb %>% 
  separate(price, into = c("low", "high"), sep = "-")
tb1

unite

tb1 %>% 
  unite(col = "price", c(low, high), sep = ":", remove = FALSE)

distinct

distinct()处理的对象是data.frame；功能是筛选不重复的row；返回data.frame

df <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, 1,
  1, 1, 2,
  1, 1, 1,
  2, 1, 2,
  2, 2, 3,
  3, 3, 1
)
df

df %>%
  distinct()

df %>%
  distinct(x)


df %>%
  distinct(x, y)

df %>%
  distinct(x, y, .keep_all = TRUE) # 只保留最先出现的row

df %>%
  distinct(
    across(c(x, y)),
    .keep_all = TRUE
  )

df %>%
  group_by(x) %>%
  distinct(y, .keep_all = TRUE)

n_distinct()处理的对象是vector；功能是统计不同的元素有多少个；返回一个数值

c(1, 1, 1, 2, 2, 1, 3, 3) %>% n_distinct()

df$z %>% n_distinct()

df %>%
  group_by(x) %>%
  summarise(
    n = n_distinct(z)
  )

有关NA的计算

NA很讨厌，凡是它参与的四则运算，结果都是NA，

sum(c(1, 2, NA, 4))

所以需要事先把它删除，增加参数说明 na.rm = TRUE

sum(c(1, 2, NA, 4), na.rm = TRUE)

mean(c(1, 2, NA, 4), na.rm = TRUE)

寻找企鹅中的胖子

penguins %>% 
  mutate(
    body = if_else(body_mass_g &gt; 4200, "you are fat", "you are fine")
)

随堂练习：用考试成绩的均值代替缺失值

df <- tibble::tribble(
    ~name,     ~type, ~score,
  "Alice", "english",    80,
  "Alice",    "math",    NA,
    "Bob", "english",    70,
    "Bob",    "math",    69,
  "Carol", "english",    NA,
  "Carol",    "math",    90
  )
df

df %>% 
  group_by(type) %>% 
  mutate(mean_score = mean(score, na.rm = TRUE)) %>% 
  mutate(newscore = if_else(is.na(score), mean_score, score))

给企鹅身材分类

penguins %>% 
  mutate(
   body = case_when(
     body_mass_g &lt; 3500                       ~ "best",
     body_mass_g &gt;= 3500 &amp; body_mass_g &lt; 4500 ~ "good",
     body_mass_g &gt;= 4500 &amp; body_mass_g &lt; 5500 ~ "general",
     TRUE                                     ~ "other"
  )
)

随堂练习：按嘴峰长度分成A, B, C, D 4个等级

penguins %>% 
  mutate(
    degree = case_when(
      bill_length_mm &lt; 35                        ~ "A",
      bill_length_mm &gt;= 35 &amp; bill_length_mm &lt; 45 ~ "B",
      bill_length_mm &gt;= 45 &amp; bill_length_mm &lt; 55 ~ "C",
      TRUE                                       ~ "D"
  )
)

每种类型企鹅有多少只？

知识点：n()函数，统计当前分组数据框的行数

penguins %>%
  summarise(
    n = n()
  )

penguins %>%
  group_by(species) %>%
  summarise(
    n = n()
  )

统计某个变量中各组出现的次数，可以使用count()函数

penguins %>% count(species)

不同性别的企鹅各有多少

penguins %>% count(sex, sort = TRUE)

可以统计不同组合出现的次数

penguins %>% count(island, species)

可以在count()里构建新变量，并利用这个新变量完成统计。比如，统计嘴巴长度大于40的企鹅个数

常规做法

penguins %>%
  filter(bill_length_mm &gt; 40) %>% 
  summarise(
    n = n()
  )

count()做法

penguins %>% count(longer_bill = bill_length_mm &gt; 40)

解析思路：bill_length_mm > 40 比较算符返回逻辑型向量，向量里面只有TRUR和FALSE两种值，因此上面的代码相当于统计TRUE有多少个，FALSE有多少个？

强制转换

矢量中的元素必须是相同的类型，但如果不一样呢，会发生什么？这个时候R会强制转换成相同的类型。这就涉及数据类型的转换层级

character > numeric > logical
double > integer

比如这里会强制转换成字符串类型

c("foo", 1, TRUE)

这里会强制转换成数值型

c(1, TRUE, FALSE)

c(TRUE, TRUE, FALSE) %>% sum()

随堂练习：补全下面代码，求嘴峰长度大于40mm的占比？

penguins %>% 
  mutate(is_bigger40 = bill_length_mm &gt; 40)

across()之美

我们想知道，嘴巴长度和厚度的均值

penguins %>%
  summarize(
    length = mean(bill_length_mm)
  )

接着添加下个变量

penguins %>%
  summarize(
    length = mean(bill_length_mm),
    depth  = mean(bill_length_mm)
  )

长度和厚度惊人的相等。我是不是发现新大陆了？

across()函数

更安全、更简练的写法，王老师的最爱

penguins %>%
  summarize(
     across(c(bill_depth_mm, bill_length_mm), mean)
 )

翅膀的长度加进去看看

penguins %>%
  summarize(
     across(c(bill_depth_mm, bill_length_mm, flipper_length_mm), mean)
 )

还可以更简练喔

penguins %>%
  summarize(
    across(ends_with("_mm"), mean)
 )

::: {.rmdnote}

across()函数用法

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

用在 mutate() 和summarise() 函数里面
across() 对多列执行相同的函数操作，返回数据框

:::

数据中心化

penguins %>%
  mutate(
    bill_length_mm = bill_length_mm - mean(bill_length_mm),
    bill_depth_mm  = bill_depth_mm  - mean(bill_depth_mm)
    )

更清晰的办法

centralized <- function(x) {
  x - mean(x)
}


penguins %>%
  mutate(
    across(c(bill_length_mm, bill_depth_mm), centralized)
  )

数据标准化

std <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}


penguins %>%
  mutate(
    across(c(bill_length_mm, bill_depth_mm), std)
  )

或者使用更简洁的方法

# using across() and purrr style
penguins %>%
  summarise(
    across(starts_with("bill_"), ~ (.x - mean(.x)) / sd(.x))
  )

多列多个统计函数

penguins %>%
  group_by(species) %>%
  summarise(
    across(ends_with("_mm"), list(mean = mean, sd = sd), na.rm = TRUE)
  )

随堂练习：以sex分组，对"bill_"开头的列，求出每列的最大值和最小值

penguins %>%
  group_by(sex) %>%
  summarise(
    across(starts_with("bill_"), list(max = max, min = min), na.rm = TRUE)
  )

在相关章节到相关章节会继续讲王老师的最爱across()函数。