模型整理 — 数据科学中的 R 语言

案例

还是用相关章节的gapminder案例

R

library(tidyverse)
library(gapminder)
gapminder

可视化探索

画个简单的图

R

gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2)

我们想用不同的模型拟合log(gdpPercap)与lifeExp的关联

R

library(colorspace)

model_colors <- colorspace::qualitative_hcl(4, palette = "dark 2")
# model_colors <- c("darkorange", "purple", "cyan4")

ggplot(
  data = gapminder,
  mapping = aes(x = log(gdpPercap), y = lifeExp)
) +
  geom_point(alpha = 0.2) +
  geom_smooth(
    method = "lm",
    aes(color = "OLS", fill = "OLS") # one
  ) +
  geom_smooth(
    method = "lm", formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline") # two
  ) +
  geom_smooth(
    method = "loess",
    aes(color = "LOESS", fill = "LOESS") # three
  ) +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors) +
  theme(legend.position = "top")

简单模型

还是回到我们今天的主题。我们建立一个简单的线性模型

R

out <- lm(
  formula = lifeExp ~ gdpPercap + pop + continent,
  data = gapminder
)
out

R

str(out)

R

summary(out)

模型的输出结果是一个复杂的list，图给出了out的结构 !图片

我们发现out对象包含了很多元素，比如系数、残差、模型残差自由度等等，用读取列表的方法可以直接读取

R

out$coefficients
out$residuals
out$fitted.values

事实上，前面使用的suammary()函数只是选取和打印了out对象的一小部分信息，同时这些信息的结构不适合用dplyr操作和ggplot2画图。

broom

为规整模型结果，这里我们推荐用David Robinson 开发的broom宏包。

R

library(broom)

broom 宏包将常用的100多种模型的输出结果规整成数据框 tibble()的格式，在模型比较和可视化中就可以方便使用dplyr函数了。 broom 提供了三个主要的函数:

tidy() 提取模型输出结果的主要信息，比如 coefficients 和 t-statistics
glance() 把模型视为一个整体，提取如 F-statistic，model deviance 或者 r-squared等信息
augment() 模型输出的信息添加到建模用的数据集中，比如fitted values 和 residuals

tidy

R

tidy(out)

R

out %>%
  tidy() %>%
  ggplot(mapping = aes(
    x = term,
    y = estimate
  )) +
  geom_point() +
  coord_flip()

可以很方便的获取系数的置信区间

R

out %>%
  tidy(conf.int = TRUE)

R

out %>%
  tidy(conf.int = TRUE) %>%
  filter(!term %in% c("(Intercept)")) %>%
  ggplot(aes(
    x = reorder(term, estimate),
    y = estimate, ymin = conf.low, ymax = conf.high
  )) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "", y = "OLS Estimate")

augment

augment()会返回一个数据框，这个数据框是在原始数据框的基础上，增加了模型的拟合值（.fitted）, 拟合值的标准误（.se.fit）, 残差（.resid）等列。

R

augment(out)

R

out %>%
  augment() %>%
  ggplot(mapping = aes(x = lifeExp, y = .fitted)) +
  geom_point()

glance

glance() 函数也会返回数据框，但这个数据框只有一行，内容实际上是summary()输出结果的最底下一行。

R

glance(out)

应用

broom的三个主要函数在分组统计建模时，格外方便。

R

penguins <-
  palmerpenguins::penguins %>%
  drop_na()

R

penguins %>%
  group_nest(species) %>%
  mutate(model = purrr::map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .))) %>%
  mutate(glance = purrr::map(model, ~ broom::glance(.))) %>%
  tidyr::unnest(glance)

R

fit_ols <- function(df) {
  lm(body_mass_g ~ bill_depth_mm + bill_length_mm, data = df)
}


out_tidy <- penguins %>%
  group_nest(species) %>%
  mutate(model = purrr::map(data, fit_ols)) %>%
  mutate(tidy = purrr::map(model, ~ broom::tidy(.))) %>%
  tidyr::unnest(tidy) %>%
  dplyr::filter(!term %in% "(Intercept)")

out_tidy

R

out_tidy %>%
  ggplot(aes(
    x = species, y = estimate,
    ymin = estimate - 2 * std.error,
    ymax = estimate + 2 * std.error,
    color = term
  )) +
  geom_pointrange(position = position_dodge(width = 0.25)) +
  theme(legend.position = "top") +
  labs(x = NULL, y = "Estimate", color = "coef")

练习

假定数据是

R

df <- tibble(
  x = runif(30, 2, 10),
  y = -2*x + rnorm(30, 0, 5)
  )
df

用broom::augment()和ggplot2做出类似的残差图

R

# remove the objects
# rm(list=ls())
rm(out, out_tidy, penguins, model_colors, fit_ols, df)

R

pacman::p_unload(pacman::p_loaded(), character.only = TRUE)