奥运会分析
奥运数据可视化
应用篇这是Nature期刊上的一篇文章Nature. 2004 September 30; 431(7008),
虽然觉得这个结论不太严谨,但我却无力反驳。
于是在文章补充材料里,我找到了文章使用的数据,现在的任务是,重复这张图和文章的分析过程。
导入数据
R
library(tidyverse)
library(readxl)
R
d <- read_excel("./demo_data/olympics.xlsx")
d
可视化
我们先画图看看
R
d %>%
ggplot() +
geom_point(aes(x = Olympic_year, y = Men_score), color = "blue") +
geom_point(aes(x = Olympic_year, y = Women_score), color = "red")
这样写也是可以的,只不过最好先tidy数据
R
d1 <- d %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
d1
然后在画图
R
d1 %>%
ggplot(aes(x = Olympic_year, y = winning_time, color = sex)) +
geom_point() +
# geom_smooth(method = "lm") +
scale_color_manual(
values = c("Men_score" = "blue", "Women_score" = "red")
) +
scale_x_continuous(
breaks = seq(1900, 2004, by = 4),
labels = seq(1900, 2004, by = 4)
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
回归分析
建立年份与成绩的线性关系 $$ \text{score}_i = \alpha + \beta \times \text{year}_i + \epsilon_i; \qquad \epsilon_i\in \text{Normal}(\mu, \sigma) $$
我们需要求出其中系数$\alpha$和$\beta$,写R语言代码如下 (lm(y ~ 1 + x,data = d), 要求得 $\alpha$和$\beta$,就是对应 1 和 x 前的系数)
R
fit_1 <- lm(Men_score ~ 1 + Olympic_year, data = d)
summary(fit_1)
R
fit_2 <- lm(Women_score ~ 1 + Olympic_year, data = d)
summary(fit_2)
预测
使用predict()完成预测
R
df <- data.frame(Olympic_year = 2020)
predict(fit_1, newdata = df)
为了图片中的一致,我们使用1900年到2252年(seq(1900, 2252, by = 4))建立预测项,并整理到数据框里
R
grid <- tibble(
Olympic_year = as.numeric(seq(1900, 2252, by = 4))
)
grid
R
tb <- grid %>%
mutate(
Predict_Men = predict(fit_1, newdata = grid),
Predict_Women = predict(fit_2, newdata = grid)
)
tb
有时候我喜欢用modelr::add_predictions()函数实现相同的功能
R
library(modelr)
grid %>%
add_predictions(fit_1, var = "Predict_Men") %>%
add_predictions(fit_2, var = "Predict_Women")
再次可视化
R
tb1 <- tb %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
tb1
R
tb1 %>%
ggplot(aes(
x = Olympic_year,
y = winning_time,
color = sex
)) +
geom_line(size = 2) +
geom_point(data = d1) +
scale_color_manual(
values = c(
"Men_score" = "blue",
"Women_score" = "red",
"Predict_Men" = "#588B8B",
"Predict_Women" = "#C8553D"
),
labels = c(
"Men_score" = "Men score",
"Women_score" = "Women score",
"Predict_Men" = "Men Predict score",
"Predict_Women" = "Women Predict score"
)
) +
scale_x_continuous(
breaks = seq(1900, 2252, by = 16),
labels = as.character(seq(1900, 2252, by = 16))
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
早知道nature文章这么简单,10年前我也可以写啊!
list_column
这里是另外的一种方法
R
library(modelr)
R
d1 <- d %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
fit_model <- function(df) lm(winning_time ~ Olympic_year, data = df)
d2 <- d1 %>%
group_nest(sex) %>%
mutate(
mod = map(data, fit_model)
)
d2
# d2 %>% mutate(p = list(grid, grid))
# d3 <- d2 %>% mutate(p = list(grid, grid))
# d3
# d3 %>%
# mutate(
# predictions = map2(p, mod, add_predictions),
# )
# or
tb4 <- d2 %>%
mutate(
predictions = map(mod, ~ add_predictions(grid, .))
) %>%
select(sex, predictions) %>%
unnest(predictions)
tb4 %>%
ggplot(aes(
x = Olympic_year,
y = pred,
group = sex,
color = sex
)) +
geom_point() +
geom_line(size = 2) +
geom_point(
data = d1,
aes(
x = Olympic_year,
y = winning_time,
group = sex,
color = sex
)
) +
scale_x_continuous(
breaks = seq(1900, 2252, by = 16),
labels = as.character(seq(1900, 2252, by = 16))
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
课后作业
- 探索数据,建立身高体重的线性模型
R
# remove the objects
# rm(list=ls())
rm(d, d1, d2, df, fit_1, fit_2, fit_model, grid, tb, tb1, tb4)
R
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)