---
title: "Advanced Data Manipulation Homework"
---
_(Refer back to the [Advanced Data Manipulation lesson](r-dplyr-yeast.html).)_
```{r inithw, echo=FALSE}
knitr::opts_chunk$set(echo=FALSE, message = FALSE, warning = FALSE)
```
### Key Concepts
>
- **dplyr** verbs
- the pipe `%>%`
- the `tbl_df`
- variable creation
- multiple conditions
- properties of grouped data
- aggregation
- summary functions
- window functions
### Getting Started
We're going to work with a different dataset for the homework here. It's a [cleaned-up excerpt](https://github.com/jennybc/gapminder) from the [Gapminder data](http://www.gapminder.org/data/). Download the [**gapminder.csv** data by clicking here](data/gapminder.csv) or using the link above, and save it in a `data/` subfolder of the project directory where you can access it easily from R.
Load the **dplyr** and **readr** packages, and read the gapminder data into R using the `read_csv()` function (n.b. `read_csv()` is _not_ the same as `read.csv()`). Assign the data to an object called `gm`.
In your submitted homework assignment, I would prefer you use the `read_csv()` function to read the data directly from the web (see below). This way I can run your R code without worrying about whether I have the `data/` directory or not.
```{r loaddata, echo=TRUE, eval=FALSE}
library(dplyr)
library(readr)
# Preferably: read data from web
gm <- read_csv("http://bioconnector.org/workshops/data/gapminder.csv")
# Alternatively read from file:
# gm <- read_csv("data/gapminder.csv")
# Display the data
gm
```
```{r loaddatatrue, eval=TRUE, include=FALSE}
library(dplyr)
library(readr)
gm <- read_csv("data/gapminder.csv")
gm
```
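To illustrate why `read_csv()` is not the same as `read.csv()`, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it both ways. The file contents and the object names (`tmp`, `base_df`, `readr_df`) are invented for this example and are not part of the homework data.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop", "Aland,1992,100", "Borduria,1992,250"), tmp)

# Base R reader: returns a plain data.frame
base_df <- read.csv(tmp)
class(base_df)

# readr reader: returns a tibble (its class vector includes "tbl_df")
readr_df <- read_csv(tmp)
class(readr_df)

# Same values in both objects; they differ in class and in how they print
inherits(readr_df, "tbl_df")
inherits(base_df, "tbl_df")
```

The `tbl_df` class is what gives `gm` its compact printing when you display it.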
### Problem set
Use **dplyr** functions to address the following questions:
1) How many unique countries are represented per continent?
```{r problem1}
# gm %>%
#   distinct(country, .keep_all=TRUE) %>%
#   group_by(continent) %>%
#   summarise(n = n())
gm %>%
  group_by(continent) %>%
  summarize(n=n_distinct(country))
```
2) Which European nation had the lowest GDP per capita in 1997?
```{r problem2}
gm %>%
  filter(continent == "Europe" & year == 1997) %>%
  arrange(gdpPercap) %>%
  head(1)
```
3) According to the data available, what was the average life expectancy across each continent in the 1980s?
```{r problem3}
gm %>%
  filter(year == 1982 | year == 1987) %>%
  group_by(continent) %>%
  summarize(mean.lifeExp = mean(lifeExp))
```
4) What 5 countries have the highest total GDP over all years combined?
```{r problem4}
gm %>%
  mutate(gdp = gdpPercap*pop) %>%
  group_by(country) %>%
  summarise(Total.GDP = sum(gdp)) %>%
  arrange(desc(Total.GDP)) %>%
  head(5)
```
5) What countries and years had life expectancies of _at least_ 80 years? _N.b. only output the columns of interest: country, life expectancy and year (in that order)._
```{r problem5}
gm %>%
  filter(lifeExp >= 80) %>%
  select(country, lifeExp, year)
```
6) What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?
```{r problem6}
gm %>%
  group_by(country) %>%
  summarise(r = abs(cor(lifeExp, gdpPercap))) %>%
  arrange(desc(r)) %>%
  head(10)
```
7) Which combinations of continent (besides Asia) and year have the highest average population across all countries? _N.b. your output should include all results sorted by highest average population_. With what you already know, this one may stump you. See [this Q&A](http://stackoverflow.com/q/27207963/654296) for how to `ungroup` before `arrange`ing. This also [behaves differently in more recent versions of dplyr](https://github.com/hadley/dplyr/releases/tag/v0.5.0).
```{r problem7}
gm %>%
  filter(continent != "Asia") %>%
  group_by(continent, year) %>%
  summarise(mean.pop = mean(pop)) %>%
  ungroup() %>%
  arrange(desc(mean.pop))
```
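The `ungroup()` step mentioned in problem 7 can be seen on a small made-up data frame. This is only a sketch: the columns `g1`, `g2`, and `value` and the object names are hypothetical, and the comments describe the default behavior of recent dplyr versions.

```r
library(dplyr)

# Made-up data: two grouping columns and one value column (hypothetical)
toy <- data.frame(
  g1 = c("a", "a", "b", "b"),
  g2 = c("x", "y", "x", "y"),
  value = c(1, 4, 3, 2),
  stringsAsFactors = FALSE
)

# By default summarise() drops only the last grouping level,
# so this result is still grouped by g1
grouped <- toy %>%
  group_by(g1, g2) %>%
  summarise(total = sum(value))
is_grouped_df(grouped)

# ungroup() removes the remaining grouping, so arrange() clearly
# sorts across all rows rather than within each g1
sorted <- grouped %>%
  ungroup() %>%
  arrange(desc(total))
sorted$total
```

Calling `ungroup()` explicitly keeps the sort order the same regardless of which dplyr version runs the code.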
8) Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?
```{r problem8}
gm %>%
  group_by(country) %>%
  summarize(sd.pop = sd(pop)) %>%
  arrange(sd.pop) %>%
  head(3)
```
9) Subset **gm** to only include observations from 1992 and store the results as **gm1992**. What kind of object is this?
```{r problem9}
gm1992 <-
  gm %>%
  filter(year == 1992)
gm1992 %>%
  class()
```
10) **_Bonus!_** Which observations indicate that, relative to the previous year of available data for that country, the population has *decreased* **and** the life expectancy has *increased*? See [the vignette on window functions](https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html).
```{r problem10}
gm %>%
  arrange(country, year) %>%
  group_by(country) %>%
  filter(pop < lag(pop) & lifeExp > lag(lifeExp))
```
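As a hedged illustration of the window function idea behind the bonus problem, the sketch below uses `lag()` on a small made-up data frame. The columns `id`, `step`, and `value` and the object names are hypothetical and chosen only for this example.

```r
library(dplyr)

# Made-up measurements for two hypothetical units, ordered in time by step
toy <- data.frame(
  id = c("a", "a", "a", "b", "b", "b"),
  step = c(1, 2, 3, 1, 2, 3),
  value = c(10, 12, 11, 5, 4, 6),
  stringsAsFactors = FALSE
)

# Within each id, lag() shifts values down one row; the first row gets NA
with_prev <- toy %>%
  arrange(id, step) %>%
  group_by(id) %>%
  mutate(prev = dplyr::lag(value)) %>%
  ungroup()
with_prev$prev   # NA 10 12 NA 5 4

# filter() drops rows where the comparison is NA,
# so the first row of each id can never match
increased <- with_prev %>%
  filter(value > prev)
increased$value  # 12 6
```

Grouping before calling `lag()` matters: without `group_by(id)`, the first row of `b` would be compared against the last row of `a`.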