---
title: "Homework 3 - Key"
author: "Introduction to R for Public Health Researchers"
output: html_document
---
Use this dataset on infant mortality for the following questions: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv
```{r initiatePackages, message=FALSE}
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
```
Questions
1. Read the data using `read_csv()` and name it `mort`. Rename the first column to `country` using the `rename()` command in `dplyr`. Create an object `year` variable by extracting column names (using `colnames()`) and make it to an integer `as.integer()`), excluding the first column either with string manipulations or bracket subsetting or subsetting with `is.na`.
```{r, dataImport}
mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv")
mort = mort %>%
rename(country = X1)
```
Using Bracket notation:
```{r}
year = colnames(mort)
year = year[-1]
year = as.integer(year)
```
or using `is.na`:
```{r}
year = colnames(mort)
year = as.integer(year)
year = year[ !is.na(year)]
```
or using string manipulations
```{r}
year = colnames(mort)
# start with a digit
year = year[str_detect(year, "^\\d")]
year = as.integer(year)
```
2. Reshape the data so that there is a variable named `year` corresponding to `year` (key) and a column of the mortalities named `mortality` (value). Hint: use the `tidyr` package and its `gather()` function. Name the output `long` and make `year` a numeric variable. Hint: remember that -COLUMN_NAME removes that column, gather all the columns but country.
Let's create a new vector for the years:
```{r, question2}
# can use quotes
long = mort %>%
gather(key = "year", value = "mortality", -country)
# or without
long = mort %>%
gather(year, mortality, -country)
long = long %>%
mutate(year = as.numeric(year))
```
3. Read in this the tab-delim file: http://johnmuschelli.com/intro_to_r/data/country_pop.txt and call it `pop`, which contains population information on each country (hint: use `read_tsv()`). Rename the second column to `"Country"` and the column `"% of world population"`, to `percent`
```{r, question3}
pop = read_tsv("http://johnmuschelli.com/intro_to_r/data/country_pop.txt")
pop = pop %>%
rename(Country = `Country (or dependent territory)`,
percent = `% of world population`)
```
4. Determine the population of each country in `pop` using `arrange()`. Get the order of the countries based on this (first is the highest population), and extract that column and call it `pop_levels`. Make a variable in the `long` data set named `sorted` that is the `country` variable coded as a factor based on `pop_levels`.
```{r, question4}
pop = pop %>%
arrange(desc(Population))
# this is sorted !
pop_levels = pop$Country
long = long %>%
mutate(sorted = factor(country, levels = pop_levels))
```
As an aside, we should do some cleaning and checking before doing this, as we see not all the countries in the `long` data set exactly match those in the `pop` data set:
```{r}
# you would want to clean these up in practice
# setdiff shows the "set difference"
setdiff(long$country, pop$Country)
# some are now set to missing (as factors do)
sum(is.na(long$country))
sum(is.na(long$sorted))
```
5. Subset `long` based on years 1975-2010, including 1975 and 2010 and call this `long_sub` using `&` or the `between()` function. Subset the following countries using `dplyr::filter()` and the `%in%` operator on the sorted country factor (`sorted`) for `long_sub` with
`c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")` and reassign to `long_sub`. Lastly, remove missing rows for `mortality` using `filter()` and `is.na()`.
Subsetting long:
```{r, question5}
long_sub = long %>%
filter(year >= 1975 & year <= 2010)
range(long_sub$year)
```
There is a function `between` that helps us with this for shorthand
```{r, question5b}
long_sub = long %>%
filter(between(year, 1975, 2010))
range(long_sub$year)
```
```{r}
long_sub = long_sub %>%
filter(sorted %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")) %>%
filter(!is.na(mortality))
```
6. Plotting: create "spaghetti"/line plots for the countries in `long_sub`, using different colors for different countries, using `sorted`. The x-axis should be `year`, and the y-axis should be `mortality`. Make the plot using `qplot` and also `ggplot`.
```{r}
qplot(year, y = mortality, data = long_sub, color = sorted, geom = "line")
```
```{r}
long_sub %>%
ggplot(aes(x = year, y = mortality)) +
geom_line(aes(colour = sorted))
```
```{r}
g = long_sub %>%
ggplot(aes(x = year, y = mortality, colour = sorted))
g
```
7. Bonus, load the `plotly` package (`library(plotly)`) and assign the plot from 6) to `g` and run `ggplotly(g)`.
```{r}
library(plotly)
ggplotly(g)
```