Instructions

  1. You must submit both the RMD and “knitted” HTML files as one compressed .zip to the Homework 3 Drop Box on CoursePlus.
  2. All assignments are due by the end of the grading period for this term (26 June 2020).

Getting Started

In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.

The packages listed below are only suggestions; edit the list as you see fit.

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

Problem Set

  1. Read the data using read_csv() and name it mort. Rename the first column to country using the rename() command in dplyr. Create an object named year by extracting the column names (using colnames()) and converting them to integers (using as.integer()), excluding the first column with string manipulation, bracket subsetting, or subsetting with is.na().
mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   X1 = col_character()
## )
## See spec(...) for full column specifications.
mort = mort %>% 
  rename(country = X1)

Using bracket notation:

year = colnames(mort)
year = year[-1]
year = as.integer(year)

Or using is.na():

year = colnames(mort)
year = as.integer(year)
## Warning: NAs introduced by coercion
year = year[ !is.na(year)]

Or using string manipulation:

year = colnames(mort)
# keep only the names that start with a digit
year = year[str_detect(year, "^\\d")]
year = as.integer(year)
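
As a quick check, the three approaches can be compared on a small made-up table. This is only a sketch: the toy tibble and its values are invented for illustration and are not taken from the mortality file.

```r
library(tibble)
library(stringr)

# hypothetical stand-in for mort after renaming:
# one text column followed by year-named columns
toy = tibble(country = c("A", "B"),
             `1990` = c(1.5, 2.5),
             `1991` = c(1.2, 2.1))

cn = colnames(toy)

# bracket subsetting: drop the first column name
year_bracket = as.integer(cn[-1])

# is.na(): non-numeric names become NA (with a warning) and are dropped
year_na = suppressWarnings(as.integer(cn))
year_na = year_na[!is.na(year_na)]

# string manipulation: keep the names that start with a digit
year_str = as.integer(cn[str_detect(cn, "^\\d")])

# all three give the same integer vector
identical(year_bracket, year_na)
identical(year_bracket, year_str)
```

Bracket subsetting relies on the text column being first, while the other two approaches work wherever the non-year columns sit.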
  2. Reshape the data so that there is a variable named year corresponding to year (key) and a column of the mortalities named mortality (value), using the tidyr package and its gather() function. Name the output long and make year a numeric variable.
    Hint: remember that -COLUMN_NAME removes that column, so gather all the columns except country.
# can use quotes
long = mort %>% 
  gather(key = "year", value = "mortality", -country)
# or without
long = mort %>% 
  gather(year, mortality, -country)

long = long %>% 
  mutate(year = as.numeric(year))
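
To see what the reshape does, here is a minimal sketch on an invented two-country table (the values are made up). It also shows pivot_longer(), the newer tidyr interface that replaces gather(), assuming tidyr 1.0.0 or later is installed; the assignment itself asks for gather().

```r
library(tibble)
library(dplyr)
library(tidyr)

# hypothetical wide table: one row per country, one column per year
toy = tibble(country = c("A", "B"),
             `1990` = c(1.5, 2.5),
             `1991` = c(1.2, NA))

# gather(): column names go to year, cell values go to mortality
toy_long = toy %>%
  gather(key = "year", value = "mortality", -country) %>%
  mutate(year = as.numeric(year))

# pivot_longer(): same reshape with the newer interface (assumes tidyr >= 1.0.0)
toy_long2 = toy %>%
  pivot_longer(-country, names_to = "year", values_to = "mortality") %>%
  mutate(year = as.numeric(year))

# 2 countries x 2 years = 4 rows either way; only the row order differs
nrow(toy_long)
nrow(toy_long2)
```

Missing cells are kept as NA rows in both versions, which is why a later step filters on is.na(mortality).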
  3. Read in this tab-delimited file and call it pop: http://johnmuschelli.com/intro_to_r/data/country_pop.txt. The file contains population information on each country. Rename the second column to Country and the column "% of world population" to percent.
    Hint: use read_tsv()
pop = read_tsv("http://johnmuschelli.com/intro_to_r/data/country_pop.txt")
## Parsed with column specification:
## cols(
##   Rank = col_double(),
##   `Country (or dependent territory)` = col_character(),
##   Population = col_number(),
##   Date = col_character(),
##   `% of world population` = col_character(),
##   Source = col_character()
## )
pop = pop %>% 
  rename(Country = `Country (or dependent territory)`,
         percent = `% of world population`)
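
Column names containing spaces or symbols must be wrapped in backticks inside rename(). The sketch below uses an invented table with the same awkward column names; the rows are made up. It also shows renaming by position, which assumes dplyr 1.0.0 or later.

```r
library(tibble)
library(dplyr)

# hypothetical table with non-syntactic column names
toy_pop = tibble(
  Rank = c(1, 2),
  `Country (or dependent territory)` = c("A", "B"),
  `% of world population` = c("60%", "40%")
)

# rename by name: backticks protect the spaces and symbols
by_name = toy_pop %>%
  rename(Country = `Country (or dependent territory)`,
         percent = `% of world population`)

# rename by position (assumes dplyr >= 1.0.0): second and third columns
by_position = toy_pop %>%
  rename(Country = 2, percent = 3)

colnames(by_name)
colnames(by_position)
```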
  4. Sort pop by population using arrange(), so that the country with the highest population comes first. Extract the Country column in this order and call it pop_levels. Make a variable in the long data set named sorted that is the country variable coded as a factor with levels given by pop_levels.
pop = pop %>% 
  arrange(desc(Population))
# pop is now sorted from highest to lowest population
pop_levels = pop$Country
long = long %>% 
  mutate(sorted = factor(country, levels = pop_levels))

As an aside, we should do some cleaning and checking before doing this, as we see not all the countries in the long data set exactly match those in the pop data set:

# you would want to clean these up in practice
# setdiff shows the "set difference"
setdiff(long$country, pop$Country)
##  [1] "Aruba"                 "Central African Rep."  "Channel Islands"      
##  [4] "Congo, Dem. Rep."      "Congo, Rep."           "Cote d'Ivoire"        
##  [7] "Czech Rep."            "Dominican Rep."        "French Guiana"        
## [10] "French Polynesia"      "Guadeloupe"            "Guam"                 
## [13] "Hong Kong, China"      "Korea, Dem. Rep."      "Korea, Rep."          
## [16] "Macao, China"          "Macedonia, FYR"        "Martinique"           
## [19] "Mayotte"               "Micronesia, Fed. Sts." "Netherlands Antilles" 
## [22] "New Caledonia"         "Puerto Rico"           "Reunion"              
## [25] "Sao Tome and Principe" "Slovak Republic"       "West Bank and Gaza"   
## [28] "Western Sahara"        "Virgin Islands (U.S.)" "Yemen, Rep."
# countries not found in pop_levels become NA in the factor
sum(is.na(long$country))
## [1] 0
sum(is.na(long$sorted))
## [1] 7620
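
The behavior above can be reproduced on a tiny invented example, together with one possible way of cleaning the names before building the factor. The populations and the name mapping ("Korea, Rep." to "South Korea") are hypothetical and only illustrate the idea; the real spellings in pop would need to be checked.

```r
library(tibble)
library(dplyr)

# hypothetical population table and long data (all values made up)
toy_pop = tibble(Country = c("Haiti", "Chile", "South Korea"),
                 Population = c(11, 19, 51))
toy_long = tibble(country = c("Chile", "Korea, Rep.", "Haiti"))

# levels ordered from highest to lowest population
toy_levels = toy_pop %>%
  arrange(desc(Population)) %>%
  pull(Country)

# names in the long data with no match in the levels
unmatched = setdiff(toy_long$country, toy_levels)

# unmatched names silently become NA when the factor is built
sorted_raw = factor(toy_long$country, levels = toy_levels)
sum(is.na(sorted_raw))

# hypothetical fix: recode the mismatched spelling first
cleaned = recode(toy_long$country, "Korea, Rep." = "South Korea")
sorted_clean = factor(cleaned, levels = toy_levels)
sum(is.na(sorted_clean))
```

Running setdiff() before and after the recoding is a simple way to confirm that no names are left unmatched.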
  5. Parts a, b, and c below are only broken up here for clarity, but all three components can be addressed in one chunk of code/as one function, using %>% as necessary.

    a. Subset long based on years 1975-2010, including 1975 and 2010 and call this long_sub using & or the between() function.
    b. Further subset long_sub for the following countries using dplyr::filter() and the %in% operator on the sorted country factor (sorted): c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti").
    c. Lastly, remove missing rows for mortality using filter() and is.na().

    Hint: Be sure to assign your final object created from a through c as long_sub so you can use it in questions 6 and 7.

Subsetting long:

long_sub = long %>% 
  filter(year >= 1975 & year <= 2010)
range(long_sub$year)
## [1] 1975 2010

The between() function is shorthand for the same inclusive range check:

long_sub = long %>% 
  filter(between(year, 1975, 2010))
range(long_sub$year)
## [1] 1975 2010
long_sub = long_sub %>% 
  filter(sorted %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile",  "Western Sahara", "Azerbaijan", "Argentina", "Haiti")) %>% 
  filter(!is.na(mortality))
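
As the question notes, parts a through c can be combined into one pipeline. The sketch below does this on an invented data set (years, countries, and values are all made up) and confirms that between() matches the explicit >= and <= comparison.

```r
library(tibble)
library(dplyr)

# hypothetical long data with a factor column and one missing mortality
toy_long = tibble(
  year = c(1974, 1975, 2010, 2011, 1990),
  sorted = factor(c("Chile", "Chile", "Haiti", "Haiti", "Peru")),
  mortality = c(1, 2, NA, 4, 5)
)

# between() is inclusive on both ends, like >= and <=
a = toy_long %>% filter(year >= 1975 & year <= 2010)
b = toy_long %>% filter(between(year, 1975, 2010))
identical(a$year, b$year)

# parts a, b, and c in a single filter() call;
# conditions separated by commas are combined with "and"
toy_sub = toy_long %>%
  filter(between(year, 1975, 2010),
         sorted %in% c("Chile", "Haiti"),
         !is.na(mortality))
toy_sub
```

Because %in% compares the factor's labels, a country whose sorted value is NA can never match the list.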
  6. Plotting: create “spaghetti”/line plots for the countries in long_sub, using different colors for different countries, using sorted. The x-axis should be year, and the y-axis should be mortality. Make the plot using a. qplot and b. ggplot.
qplot(year, y = mortality, data = long_sub, color = sorted, geom = "line")

long_sub %>% 
  ggplot(aes(x = year, y = mortality)) +
  geom_line(aes(colour = sorted))

# store the plot, including its line layer, in g for question 7
g = long_sub %>% 
  ggplot(aes(x = year, y = mortality, colour = sorted)) +
  geom_line()
g

  7. Bonus: load the plotly package (library(plotly)) and assign the plot from question 6 to g and run ggplotly(g).
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(g)
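
A ggplot object only draws lines once a geom layer has been added to it, so the object handed to ggplotly() should include geom_line(). The sketch below uses invented data and counts the layers to show the difference; it assumes the layers can be read with g$layers, as in current ggplot2 releases.

```r
library(tibble)
library(ggplot2)

# hypothetical data: two countries, two years each (values made up)
toy = tibble(year = c(2000, 2001, 2000, 2001),
             mortality = c(1, 0.9, 2, 1.8),
             sorted = factor(c("A", "A", "B", "B")))

# aesthetics only: axes would be drawn, but no lines
g_base = ggplot(toy, aes(x = year, y = mortality, colour = sorted))

# adding geom_line() gives one line per level of sorted
g_line = g_base + geom_line()

length(g_base$layers)
length(g_line$layers)
```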