---
title: 'Introduction to R: Homework 3'
author: "Authored by Andrew Jaffe and John Muschelli"
date: "11 June 2020"
output:
  html_document: default
---

### **Instructions**
1. Submit both the R Markdown (.Rmd) file and the "knitted" HTML file as a single compressed .zip file to the Homework 3 Drop Box on CoursePlus.<br />
2. All assignments are due by the end of the grading period for this term (26 June 2020).

### **Getting Started**
In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.<br />

The packages listed below are only suggestions; edit the list as you see fit.
```{r initiatePackages, message=FALSE}
## These are suggestions; add or change packages as needed.
## Note: library(tidyverse) already attaches readr, dplyr, ggplot2, and tidyr.
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
```

### **Problem Set**

1.	Read the data using `read_csv()` and name it `mort`. Rename the first column to `country` using the `rename()` command in `dplyr`. Create an object named `year` by extracting the column names (using `colnames()`) and converting them to integers (using `as.integer()`), excluding the first column with string manipulation, bracket subsetting, or subsetting with `is.na()`.
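
**Example (toy data, not the homework data):** The sketch below illustrates the mechanics of renaming a column and turning column names into integers. The data frame `toy`, its first column name `X1`, and its values are hypothetical; the name that `read_csv()` gives the first column of the real file may differ, so check `colnames(mort)` first.

```r
library(dplyr)

# Hypothetical wide table: one identifier column, then one column per year
toy <- data.frame(
  X1 = c("A", "B"),
  `1990` = c(0.50, 0.40),
  `1991` = c(0.45, 0.38),
  check.names = FALSE,        # keep "1990" and "1991" as column names
  stringsAsFactors = FALSE
)

# rename() uses new_name = old_name
toy <- toy %>% rename(country = X1)

# Option 1: drop the first column name with bracket subsetting
yr <- as.integer(colnames(toy)[-1])

# Option 2: convert every name, then keep the ones that are not NA
# ("country" cannot be converted, so it becomes NA with a warning)
all_names <- suppressWarnings(as.integer(colnames(toy)))
yr_alt <- all_names[!is.na(all_names)]
```

Both options give the same integer vector here; which one reads better is a matter of taste.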

```{r, dataImport}

```

2.	Reshape the data so that there is a column named `year` holding the years (key) and a column named `mortality` holding the mortality values (value), using the `tidyr` package and its `gather()` function. Name the output `long` and make `year` a numeric variable.<br />**Hint:** remember that `-COLUMN_NAME` excludes that column; gather all the columns except `country`.
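
**Example (toy data, not the homework data):** A minimal sketch of a wide-to-long reshape with `gather()`, assuming a hypothetical data frame `toy` with a `country` column and one column per year. The object names `toy` and `toy_long` are illustrative only.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table
toy <- data.frame(
  country = c("A", "B"),
  `1990` = c(0.50, 0.40),
  `1991` = c(0.45, 0.38),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Gather every column except country into key/value pairs;
# the key column starts out as character, so convert it afterwards
toy_long <- toy %>%
  gather(key = "year", value = "mortality", -country) %>%
  mutate(year = as.numeric(year))
```

Each (country, year) pair becomes its own row, so two countries and two years yield four rows.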

```{r, question2}

```


3.	Read in this tab-delimited file and call it `pop`: http://johnmuschelli.com/intro_to_r/data/country_pop.txt.
The file contains population information on each country. Rename the second column to `"Country"` and the column `"% of world population"` to `percent`.<br />**Hint:** use `read_tsv()`
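
**Example (toy data, not the homework data):** The sketch below writes a small tab-delimited file to a temporary location and reads it back, to show how a column name containing spaces or symbols can be renamed with backticks. The file contents and the column names `Rank`, `Name`, and `% of total` are hypothetical.

```r
library(readr)
library(dplyr)

# Write a tiny, made-up tab-delimited file to a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c("Rank\tName\t% of total", "1\tA\t60", "2\tB\t40"), path)

toy_pop <- read_tsv(path)

# Non-syntactic names (spaces, %) must be wrapped in backticks
toy_pop <- toy_pop %>%
  rename(Country = Name, percent = `% of total`)
```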

```{r, question3}


```

4.	Order the countries in `pop` by population using `arrange()`, with the highest population first. Extract the country column from this ordered data and call it `pop_levels`. Make a variable in the `long` data set named `sorted` that is the `country` variable coded as a factor with levels given by `pop_levels`.
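
**Example (toy data, not the homework data):** A minimal sketch of ordering one table and using that order as factor levels in another. The data frames `toy_pop` and `toy_long`, the column name `Population`, and all values are hypothetical.

```r
library(dplyr)

toy_pop <- data.frame(
  Country = c("A", "B", "C"),
  Population = c(10, 300, 45),
  stringsAsFactors = FALSE
)

# desc() sorts from largest to smallest; pull() extracts one column as a vector
pop_order <- toy_pop %>%
  arrange(desc(Population)) %>%
  pull(Country)

toy_long <- data.frame(
  country = c("A", "B", "C", "A"),
  mortality = c(0.50, 0.40, 0.30, 0.45),
  stringsAsFactors = FALSE
)

# The factor's levels follow the population ordering, not the alphabet
toy_long <- toy_long %>%
  mutate(sorted = factor(country, levels = pop_order))
```

Any value of `country` that does not appear in the levels vector becomes `NA` in the factor, so it is worth checking for mismatched country names.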

```{r, question4}


```


5.	Parts a, b, and c are separated here only for clarity; all three can be addressed in a single chunk of code or a single pipeline, using `%>%` as necessary.<br />

    **a.** Subset `long` to the years 1975-2010 (inclusive) using `&` or the `between()` function, and call the result `long_sub`.<br />
    **b.** Further subset `long_sub` to the following countries using `dplyr::filter()` and the `%in%` operator on the sorted country factor (`sorted`): `c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")`.<br />
    **c.** Lastly, remove rows where `mortality` is missing using `filter()` and `is.na()`.<br />

    **Hint:** Be sure to assign your final object created from a through c as `long_sub` so you can use it in questions 6 and 7.
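
**Example (toy data, not the homework data):** The three filtering steps can be chained in one pipeline. The sketch below uses a hypothetical data frame `toy_long` with made-up years, countries, and values.

```r
library(dplyr)

toy_long <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  year = c(1970, 1980, 1980, 1990, 1980, 1990),
  mortality = c(0.5, 0.4, NA, 0.3, 0.2, 0.1),
  stringsAsFactors = FALSE
)

toy_sub <- toy_long %>%
  filter(between(year, 1975, 1990)) %>%   # between() includes both endpoints
  filter(country %in% c("A", "B")) %>%    # keep only the listed countries
  filter(!is.na(mortality))               # drop rows with missing mortality
```

`between(year, 1975, 1990)` is equivalent to `year >= 1975 & year <= 1990`.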

```{r, question5}


```


6.  Plotting: create "spaghetti" (line) plots for the countries in `long_sub`, using a different color for each country based on `sorted`. The x-axis should be `year` and the y-axis should be `mortality`. Make the plot using (a) `qplot()` and (b) `ggplot()`.
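
**Example (toy data, not the homework data):** A minimal sketch of a line plot with one colored line per group, built with `ggplot()`. The data frame `toy` and its column `grp` are hypothetical stand-ins for the homework variables.

```r
library(ggplot2)

toy <- data.frame(
  grp = rep(c("A", "B"), each = 3),
  year = rep(c(1990, 1991, 1992), times = 2),
  mortality = c(0.50, 0.45, 0.40, 0.30, 0.28, 0.25),
  stringsAsFactors = FALSE
)

# Mapping colour to a discrete variable draws a separate line for each group
p <- ggplot(toy, aes(x = year, y = mortality, colour = grp)) +
  geom_line()

p
```

The `qplot()` version takes the same mappings as arguments together with `geom = "line"`.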

```{r, question6}


```


7.  **Bonus:** Load the `plotly` package (`library(plotly)`), assign the plot from question 6 to `g`, and run `ggplotly(g)`.

```{r, question7}


```