In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.
The packages listed below are simply suggestions, but please edit this list as you see fit.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
Read the data using read_csv()
and name it mort
. Rename the first column to country
using the rename()
command in dplyr
. Create an object year
variable by extracting column names (using colnames()
) and make it to an integer as.integer()
), excluding the first column either with string manipulations or bracket subsetting or subsetting with is.na()
.
Reshape the data so that there is a variable named year
corresponding to year
(key) and a column of the mortalities named mortality
(value), using the tidyr
package and its gather()
function. Name the output long
and make year
a numeric variable.
Hint: remember that -COLUMN_NAME removes that column, gather all the columns but country.
Read in this the tab-delim file and call it pop
: http://johnmuschelli.com/intro_to_r/data/country_pop.txt. The file contains population information on each country. Rename the second column to "Country"
and the column "% of world population"
, to percent
.
Hint: use read_tsv()
Determine the population of each country in pop
using arrange()
. Get the order of the countries based on this (first is the highest population), and extract that column and call it pop_levels
. Make a variable in the long
data set named sorted
that is the country
variable coded as a factor based on pop_levels
.
Parts a, b, and c below are only broken up here for clarity, but all three components can be addressed in one chunk of code/as one function, using %>%
as necessary.
a. Subset long
based on years 1975-2010, including 1975 and 2010 and call this long_sub
using &
or the between()
function.
b. Further subset long_sub
for the following countries using dplyr::filter()
and the %in%
operator on the sorted country factor (sorted
):c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")
.
c. Lastly, remove missing rows for mortality
using filter()
and is.na()
.
Hint: Be sure to assign your final object created from a through c as long_sub
so you can use it in questions 6 and 7.
Plotting: create “spaghetti”/line plots for the countries in long_sub
, using different colors for different countries, using sorted
. The x-axis should be year
, and the y-axis should be mortality
. Make the plot using a.qplot
and b. ggplot
.
Bonus: load the plotly
package (library(plotly)
) and assign the plot from question 6 to g
and run ggplotly(g)
.