13

dplyr 应用

实战案例与技巧

Tidyverse 篇

本章主要关注dplyr的一些应用。

导入数据

今天讲一个关于企鹅的数据故事。

R
library(tidyverse)
library(palmerpenguins)

penguins <- penguins %>% drop_na()

变量含义

variableclassdescription
speciescharacter企鹅种类 (Adelie, Gentoo, Chinstrap)
islandcharacter所在岛屿 (Biscoe, Dream, Torgersen)
bill_length_mmdouble嘴峰长度 (单位毫米)
bill_depth_mmdouble嘴峰深度 (单位毫米)
flipper_length_mminteger鰭肢长度 (单位毫米)
body_mass_ginteger体重 (单位克)
sexcharacter性别
yearinteger记录年份
图片

简单回顾

选择"bill_"开始的列

R
penguins %>% select(bill_length_mm, bill_depth_mm)
R
penguins %>% select(starts_with("bill_"))

选择"_mm"结尾的列

R
penguins %>% select(bill_length_mm, bill_depth_mm, flipper_length_mm)
R
penguins %>% select(ends_with("_mm"))

选择含有"length"的列

R
penguins %>% select(bill_length_mm, flipper_length_mm)
R
penguins %>% select(contains("length"))

选择数值型的列

R
penguins %>% select(where(is.numeric))

选择字符串类型的列

R
penguins %>% select(where(is.character))

选择字符串类型以外的列

R
penguins %>% select(!where(is.character))

可以用多种组合来选择

R
penguins %>% select(species, starts_with("bill_"))

返回向量还是数据框

对应数据框my_tibble, 注意返回向量还是数据框的区别

  • 返回向量
R
my_tibble[["x"]] 

my_tibble$x 

my_tibble %>% 
  pull(x)
  • 返回数据框
R
my_tibble["x"]

my_tibble %>% 
  select(x)

选择全部为0的列

R
tb <- tibble(
  x = c(1, 2, 3, 4, 5),
  y = 0,
  z = c(-1, -2, 0, 2, 1),
  w = c(0, 1, 2, 3, 4)
)
tb



myfun <- function(x) all(x == 0)
tb %>%
  select(where(myfun))



# or
tb %>%
  select(where(~all(.x == 0)))  # 找出选择全部为0的列

tb %>%
  select(where(~sum(.x) == 0))  # 找出这一列元素之和为0的列

tb %>%
  select(where(~any(.x == 0)))  # 找出这一列元素含有0的列

课堂练习:剔除全部为NA的列或者全部为NA的行

R
df <- tibble(
  x = c(NA, NA, NA),
  y = c(2, 3, NA),
  z = c(NA, 5, NA)
)

# columns
df %>%
  select(where(~ !all(is.na(.x))))

# rows
df %>%
  filter(
    if_any(everything(), ~ !is.na(.x))
  )

寻找男企鹅

函数 filter() 中的逻辑运算符

OperatorMeaning
==Equal to
>Greater than
<Less than
>=Greater than or equal to
<=Less than or equal to
!=Not equal to
%in%in
is.nais a missing value (NA)
!is.nais not a missing value
&and
``or
R
penguins %>% filter(sex == "male")
R
penguins %>% filter(species %in% c("Adelie", "Gentoo"))
R
penguins %>%
  filter(species == "Adelie" &amp;  bill_length_mm &gt; 40)

penguins %>%
  filter(species == "Adelie", bill_length_mm &gt; 40)

课堂练习,说出以下代码的含义

R
penguins %>%
  filter(species == "Adelie", bill_length_mm == max(bill_length_mm) )

更多应用

希望介绍一个技术,对应一个应用场景

弱水三千,只取一瓢

R
penguins %>% 
  head()

penguins %>% 
  tail()
R
penguins %>% 
  slice(1)
R
penguins %>% 
  group_by(species) %>% 
  slice(1)

嘴峰长度最大那一行

三种方法

R
penguins %>%
  filter(bill_length_mm == max(bill_length_mm) )
R
penguins %>% 
  arrange(desc(bill_length_mm)) %>% 
  slice(1)
R
penguins %>% 
  slice_max(bill_length_mm)

separate

R
tb <- tibble::tribble(
  ~day, ~price,
  1,   "30-45",
  2,   "40-95",
  3,   "89-65",
  4,   "45-63",
  5,   "52-42"
)
R
tb1 <- tb %>% 
  separate(price, into = c("low", "high"), sep = "-")
tb1

unite

R
tb1 %>% 
  unite(col = "price", c(low, high), sep = ":", remove = FALSE)

distinct

distinct()处理的对象是data.frame;功能是筛选不重复的row;返回data.frame

R
df <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, 1,
  1, 1, 2,
  1, 1, 1,
  2, 1, 2,
  2, 2, 3,
  3, 3, 1
)
df
R
df %>%
  distinct()
R
df %>%
  distinct(x)


df %>%
  distinct(x, y)
R
df %>%
  distinct(x, y, .keep_all = TRUE) # 只保留最先出现的row
R
df %>%
  distinct(
    across(c(x, y)),
    .keep_all = TRUE
  )
R
df %>%
  group_by(x) %>%
  distinct(y, .keep_all = TRUE)

n_distinct()处理的对象是vector;功能是统计不同的元素有多少个;返回一个数值

R
c(1, 1, 1, 2, 2, 1, 3, 3) %>% n_distinct()
R
df$z %>% n_distinct()
R
df %>%
  group_by(x) %>%
  summarise(
    n = n_distinct(z)
  )

有关NA的计算

NA很讨厌,凡是它参与的四则运算,结果都是NA

R
sum(c(1, 2, NA, 4))

所以需要事先把它删除,增加参数说明 na.rm = TRUE

R
sum(c(1, 2, NA, 4), na.rm = TRUE)
R
mean(c(1, 2, NA, 4), na.rm = TRUE)

寻找企鹅中的胖子

R
penguins %>% 
  mutate(
    body = if_else(body_mass_g &gt; 4200, "you are fat", "you are fine")
)

随堂练习:用考试成绩的均值代替缺失值

R
df <- tibble::tribble(
    ~name,     ~type, ~score,
  "Alice", "english",    80,
  "Alice",    "math",    NA,
    "Bob", "english",    70,
    "Bob",    "math",    69,
  "Carol", "english",    NA,
  "Carol",    "math",    90
  )
df
R
df %>% 
  group_by(type) %>% 
  mutate(mean_score = mean(score, na.rm = TRUE)) %>% 
  mutate(newscore = if_else(is.na(score), mean_score, score))

给企鹅身材分类

R
penguins %>% 
  mutate(
   body = case_when(
     body_mass_g &lt; 3500                       ~ "best",
     body_mass_g &gt;= 3500 &amp; body_mass_g &lt; 4500 ~ "good",
     body_mass_g &gt;= 4500 &amp; body_mass_g &lt; 5500 ~ "general",
     TRUE                                     ~ "other"
  )
)

随堂练习:按嘴峰长度分成A, B, C, D 4个等级

R
penguins %>% 
  mutate(
    degree = case_when(
      bill_length_mm &lt; 35                        ~ "A",
      bill_length_mm &gt;= 35 &amp; bill_length_mm &lt; 45 ~ "B",
      bill_length_mm &gt;= 45 &amp; bill_length_mm &lt; 55 ~ "C",
      TRUE                                       ~ "D"
  )
)

每种类型企鹅有多少只?

知识点:n()函数,统计当前分组数据框的行数

R
penguins %>%
  summarise(
    n = n()
  )
R
penguins %>%
  group_by(species) %>%
  summarise(
    n = n()
  )

统计某个变量中各组出现的次数,可以使用count()函数

R
penguins %>% count(species)

不同性别的企鹅各有多少

R
penguins %>% count(sex, sort = TRUE)

可以统计不同组合出现的次数

R
penguins %>% count(island, species)

可以在count()里构建新变量,并利用这个新变量完成统计。 比如,统计嘴巴长度大于40的企鹅个数

  • 常规做法
R
penguins %>%
  filter(bill_length_mm &gt; 40) %>% 
  summarise(
    n = n()
  )
  • count()做法
R
penguins %>% count(longer_bill = bill_length_mm &gt; 40)

解析思路:bill_length_mm > 40 比较算符 返回逻辑型向量,向量里面只有TRUR和FALSE两种值,因此上面的代码相当于统计TRUE有多少个,FALSE有多少个?

强制转换

矢量中的元素必须是相同的类型,但如果不一样呢,会发生什么? 这个时候R会强制转换成相同的类型。这就涉及数据类型的转换层级

  • character > numeric > logical
  • double > integer

比如这里会强制转换成字符串类型

R
c("foo", 1, TRUE)

这里会强制转换成数值型

R
c(1, TRUE, FALSE)
R
c(TRUE, TRUE, FALSE) %>% sum()

随堂练习:补全下面代码,求嘴峰长度大于40mm的占比?

R
penguins %>% 
  mutate(is_bigger40 = bill_length_mm &gt; 40)

across()之美

我们想知道,嘴巴长度和厚度的均值

R
penguins %>%
  summarize(
    length = mean(bill_length_mm)
  )

接着添加下个变量

R
penguins %>%
  summarize(
    length = mean(bill_length_mm),
    depth  = mean(bill_length_mm)
  )

长度和厚度惊人的相等。我是不是发现新大陆了?

across()函数

更安全、更简练的写法,王老师的最爱

R
penguins %>%
  summarize(
     across(c(bill_depth_mm, bill_length_mm), mean)
 )

翅膀的长度加进去看看

R
penguins %>%
  summarize(
     across(c(bill_depth_mm, bill_length_mm, flipper_length_mm), mean)
 )

还可以更简练喔

R
penguins %>%
  summarize(
    across(ends_with("_mm"), mean)
 )

::: {.rmdnote}

across()函数用法

R
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
  • 用在 mutate()summarise() 函数里面
  • across()多列执行相同的函数操作,返回数据框

:::

数据中心化

R
penguins %>%
  mutate(
    bill_length_mm = bill_length_mm - mean(bill_length_mm),
    bill_depth_mm  = bill_depth_mm  - mean(bill_depth_mm)
    )

更清晰的办法

R
centralized <- function(x) {
  x - mean(x)
}


penguins %>%
  mutate(
    across(c(bill_length_mm, bill_depth_mm), centralized)
  )

数据标准化

R
std <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}


penguins %>%
  mutate(
    across(c(bill_length_mm, bill_depth_mm), std)
  )

或者使用更简洁的方法

R
# using across() and purrr style
penguins %>%
  summarise(
    across(starts_with("bill_"), ~ (.x - mean(.x)) / sd(.x))
  )

多列多个统计函数

R
penguins %>%
  group_by(species) %>%
  summarise(
    across(ends_with("_mm"), list(mean = mean, sd = sd), na.rm = TRUE)
  )

随堂练习:以sex分组,对"bill_"开头的列,求出每列的最大值和最小值

R
penguins %>%
  group_by(sex) %>%
  summarise(
    across(starts_with("bill_"), list(max = max, min = min), na.rm = TRUE)
  )

在相关章节到相关章节会继续讲王老师的最爱across()函数。