--- title: '**KEY** Introduction to R: Homework 3' author: "Authored by Andrew Jaffe and John Muschelli" date: "11 June 2020" output: html_document: default --- ### **Instructions** 1. You must submit both the RMD and "knitted" HTML files as one compressed .zip to the Homework 3 Drop Box on CoursePlus.
2. All assignments are due by the end of the grading period for this term (26 June 2020). ### **Getting Started** In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.
The packages listed below are simply suggestions, but please edit this list as you see fit. ```{r initiatePackages, message=FALSE} ## you can add more, or change...these are suggestions library(tidyverse) library(readr) library(dplyr) library(ggplot2) library(tidyr) ``` ### **Problem Set** 1. Read the data using `read_csv()` and name it `mort`. Rename the first column to `country` using the `rename()` command in `dplyr`. Create an object `year` variable by extracting column names (using `colnames()`) and make it to an integer `as.integer()`), excluding the first column either with string manipulations or bracket subsetting or subsetting with `is.na()`. ```{r, dataImport} mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv") mort = mort %>% rename(country = X1) ``` Using Bracket notation: ```{r} year = colnames(mort) year = year[-1] year = as.integer(year) ``` or using `is.na`: ```{r} year = colnames(mort) year = as.integer(year) year = year[ !is.na(year)] ``` or using string manipulations ```{r} year = colnames(mort) # start with a digit year = year[str_detect(year, "^\\d")] year = as.integer(year) ``` 2. Reshape the data so that there is a variable named `year` corresponding to `year` (key) and a column of the mortalities named `mortality` (value), using the `tidyr` package and its `gather()` function. Name the output `long` and make `year` a numeric variable.
**Hint:** remember that -COLUMN_NAME removes that column, gather all the columns but country. ```{r, question2} # can use quotes long = mort %>% gather(key = "year", value = "mortality", -country) # or without long = mort %>% gather(year, mortality, -country) long = long %>% mutate(year = as.numeric(year)) ``` 3. Read in this the tab-delim file and call it `pop`: http://johnmuschelli.com/intro_to_r/data/country_pop.txt. The file contains population information on each country. Rename the second column to `"Country"` and the column `"% of world population"`, to `percent`.
**Hint:** use `read_tsv()` ```{r, question3} pop = read_tsv("http://johnmuschelli.com/intro_to_r/data/country_pop.txt") pop = pop %>% rename(Country = `Country (or dependent territory)`, percent = `% of world population`) ``` 4. Determine the population of each country in `pop` using `arrange()`. Get the order of the countries based on this (first is the highest population), and extract that column and call it `pop_levels`. Make a variable in the `long` data set named `sorted` that is the `country` variable coded as a factor based on `pop_levels`. ```{r, question4} pop = pop %>% arrange(desc(Population)) # this is sorted ! pop_levels = pop\$Country long = long %>% mutate(sorted = factor(country, levels = pop_levels)) ``` As an aside, we should do some cleaning and checking before doing this, as we see not all the countries in the `long` data set exactly match those in the `pop` data set: ```{r} # you would want to clean these up in practice # setdiff shows the "set difference" setdiff(long\$country, pop\$Country) # some are now set to missing (as factors do) sum(is.na(long\$country)) sum(is.na(long\$sorted)) ``` 5. Parts a, b, and c below are only broken up here for clarity, but all three components can be addressed in one chunk of code/as one function, using `%>%` as necessary.
**a.** Subset `long` based on years 1975-2010, including 1975 and 2010 and call this `long_sub` using `&` or the `between()` function.
**b.** Further subset `long_sub` for the following countries using `dplyr::filter()` and the `%in%` operator on the sorted country factor (`sorted`):`c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")`.
**c.** Lastly, remove missing rows for `mortality` using `filter()` and `is.na()`.
**Hint:** Be sure to assign your final object created from a through c as `long_sub` so you can use it in questions 6 and 7. Subsetting long: ```{r, question5} long_sub = long %>% filter(year >= 1975 & year <= 2010) range(long_sub\$year) ``` There is a function `between` that helps us with this for shorthand ```{r, question5b} long_sub = long %>% filter(between(year, 1975, 2010)) range(long_sub\$year) ``` ```{r} long_sub = long_sub %>% filter(sorted %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")) %>% filter(!is.na(mortality)) ``` 6. Plotting: create "spaghetti"/line plots for the countries in `long_sub`, using different colors for different countries, using `sorted`. The x-axis should be `year`, and the y-axis should be `mortality`. Make the plot using a.`qplot` and b. `ggplot`. ```{r} qplot(year, y = mortality, data = long_sub, color = sorted, geom = "line") ``` ```{r} long_sub %>% ggplot(aes(x = year, y = mortality)) + geom_line(aes(colour = sorted)) ``` ```{r} g = long_sub %>% ggplot(aes(x = year, y = mortality, colour = sorted)) g ``` 7. **Bonus:** load the `plotly` package (`library(plotly)`) and assign the plot from question 6 to `g` and run `ggplotly(g)`. ```{r} library(plotly) ggplotly(g) ```