Instructions

  1. You must submit both the RMD and “knitted” HTML files as one compressed .zip to the Homework 3 Drop Box on CoursePlus.
  2. All assignments are due by the end of the grading period for this term (26 June 2020).

Getting Started

In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.

The packages listed below are simply suggestions, but please edit this list as you see fit.

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

Problem Set

  1. Read the data using read_csv() and name it mort. Rename the first column to country using the rename() command in dplyr. Create an object year variable by extracting column names (using colnames()) and make it to an integer as.integer() ), excluding the first column either with string manipulations or bracket subsetting or subsetting with is.na().

  2. Reshape the data so that there is a variable named year corresponding to year (key) and a column of the mortalities named mortality (value), using the tidyr package and its gather() function. Name the output long and make year a numeric variable.
    Hint: remember that -COLUMN_NAME removes that column, gather all the columns but country.

  3. Read in this the tab-delim file and call it pop: http://johnmuschelli.com/intro_to_r/data/country_pop.txt. The file contains population information on each country. Rename the second column to "Country" and the column "% of world population", to percent.
    Hint: use read_tsv()

  4. Determine the population of each country in pop using arrange(). Get the order of the countries based on this (first is the highest population), and extract that column and call it pop_levels. Make a variable in the long data set named sorted that is the country variable coded as a factor based on pop_levels.

  5. Parts a, b, and c below are only broken up here for clarity, but all three components can be addressed in one chunk of code/as one function, using %>% as necessary.

    a. Subset long based on years 1975-2010, including 1975 and 2010 and call this long_sub using & or the between() function.
    b. Further subset long_sub for the following countries using dplyr::filter() and the %in% operator on the sorted country factor (sorted):c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti").
    c. Lastly, remove missing rows for mortality using filter() and is.na().

    Hint: Be sure to assign your final object created from a through c as long_sub so you can use it in questions 6 and 7.

  6. Plotting: create “spaghetti”/line plots for the countries in long_sub, using different colors for different countries, using sorted. The x-axis should be year, and the y-axis should be mortality. Make the plot using a.qplot and b. ggplot.

  7. Bonus: load the plotly package (library(plotly)) and assign the plot from question 6 to g and run ggplotly(g).