In this assignment, we will be working with the infant mortality data set, found here: http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv.
The packages listed below are simply suggestions, but please edit this list as you see fit.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
1. Read in the data using read_csv() and name it mort. Rename the first column to country using the rename() command in dplyr. Create an object year by extracting the column names (using colnames()) and converting them to integers (using as.integer()), excluding the first column with string manipulations, bracket subsetting, or subsetting with is.na().
mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## See spec(...) for full column specifications.
mort = mort %>%
  rename(country = X1)
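The name readr gives the unnamed first column depends on the readr version: the output above shows X1, while readr 2.0 and later generally produce ...1 instead. The sketch below renames by position so the code works either way. It uses a small made-up tibble (toy, with invented values) in place of the real file; both are assumptions for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for the real file: the first column carries an
# auto-generated name ("X1" or "...1", depending on the readr version).
toy = tibble(
  `...1` = c("Afghanistan", "Albania"),
  `1800` = c(3.1, 2.9),  # made-up values, not the real data
  `1801` = c(3.0, 2.8)
)

# Rename the first column by position rather than by its generated name
toy = toy %>%
  rename(country = 1)

colnames(toy)
```

The same rename(country = 1) call could be applied to mort if rename(country = X1) fails with an error that X1 does not exist.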
Using bracket notation:
year = colnames(mort)
year = year[-1]
year = as.integer(year)
Or using is.na():
year = colnames(mort)
year = as.integer(year)
## Warning: NAs introduced by coercion
year = year[ !is.na(year)]
Or using string manipulations:
year = colnames(mort)
# keep only the column names that start with a digit
year = year[str_detect(year, "^\\d")]
year = as.integer(year)
2. Reshape the data so that there is a column named year corresponding to year (key) and a column of the mortalities named mortality (value), using the tidyr package and its gather() function. Name the output long and make year a numeric variable.
# can use quotes
long = mort %>%
  gather(key = "year", value = "mortality", -country)
# or without
long = mort %>%
  gather(year, mortality, -country)
long = long %>%
  mutate(year = as.numeric(year))
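In current tidyr, gather() is marked as superseded in favor of pivot_longer(). As a minimal sketch of the same reshape, assuming a small made-up table (toy) rather than the real mort data:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical wide table shaped like mort: one row per country,
# one column per year (values are made up)
toy = tibble(
  country = c("A", "B"),
  `1800` = c(3.1, 2.9),
  `1801` = c(3.0, NA)
)

# names_to plays the role of gather()'s key, values_to the role of value
long_toy = toy %>%
  pivot_longer(cols = -country, names_to = "year", values_to = "mortality") %>%
  mutate(year = as.numeric(year))

long_toy
```

One difference to be aware of is row order: pivot_longer() keeps each country's rows together, whereas gather() stacks the data one year column at a time.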
3. Read in the following tab-delimited data and call it pop: http://johnmuschelli.com/intro_to_r/data/country_pop.txt. The file contains population information on each country. Rename the second column to "Country" and the column "% of world population" to percent. Use read_tsv() to read the file.
pop = read_tsv("http://johnmuschelli.com/intro_to_r/data/country_pop.txt")
## Parsed with column specification:
## cols(
## Rank = col_double(),
## `Country (or dependent territory)` = col_character(),
## Population = col_number(),
## Date = col_character(),
## `% of world population` = col_character(),
## Source = col_character()
## )
pop = pop %>%
  rename(Country = `Country (or dependent territory)`,
         percent = `% of world population`)
4. Sort pop in descending order of population using arrange(). Get the order of the countries based on this (first is the highest population), extract that column, and call it pop_levels. Make a variable in the long data set named sorted that is the country variable coded as a factor based on pop_levels.
pop = pop %>%
  arrange(desc(Population))
# pop is now sorted, so its Country column is already in the order we want
pop_levels = pop$Country
long = long %>%
  mutate(sorted = factor(country, levels = pop_levels))
As an aside, we should do some cleaning and checking before doing this, as we see that not all the countries in the long data set exactly match those in the pop data set:
# you would want to clean these up in practice
# setdiff shows the "set difference"
setdiff(long$country, pop$Country)
## [1] "Aruba" "Central African Rep." "Channel Islands"
## [4] "Congo, Dem. Rep." "Congo, Rep." "Cote d'Ivoire"
## [7] "Czech Rep." "Dominican Rep." "French Guiana"
## [10] "French Polynesia" "Guadeloupe" "Guam"
## [13] "Hong Kong, China" "Korea, Dem. Rep." "Korea, Rep."
## [16] "Macao, China" "Macedonia, FYR" "Martinique"
## [19] "Mayotte" "Micronesia, Fed. Sts." "Netherlands Antilles"
## [22] "New Caledonia" "Puerto Rico" "Reunion"
## [25] "Sao Tome and Principe" "Slovak Republic" "West Bank and Gaza"
## [28] "Western Sahara" "Virgin Islands (U.S.)" "Yemen, Rep."
# countries that are not in pop_levels become NA when converted to a factor
sum(is.na(long$country))
## [1] 0
sum(is.na(long$sorted))
## [1] 7620
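One way to do that cleaning is a lookup vector that maps the spellings in long to the spellings in pop before building the factor. The sketch below is illustrative only: the toy data, the names fixes and country_clean, and the target spellings ("Czech Republic", "South Korea") are assumptions, since the exact spellings used in pop would need to be checked.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for long and pop_levels (values are made up)
long_toy = tibble(
  country = c("Czech Rep.", "Chile", "Korea, Rep."),
  mortality = c(0.1, 0.2, 0.3)
)
pop_levels_toy = c("Chile", "South Korea", "Czech Republic")

# Assumed mapping: name as spelled in long -> name as spelled in pop
fixes = c("Czech Rep." = "Czech Republic", "Korea, Rep." = "South Korea")

long_toy = long_toy %>%
  mutate(
    # use the corrected name when one exists, otherwise keep the original
    country_clean = coalesce(unname(fixes[country]), country),
    sorted = factor(country_clean, levels = pop_levels_toy)
  )

# no values are lost to NA once the names agree
sum(is.na(long_toy$sorted))
```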
5. Parts a, b, and c below are broken up only for clarity; all three can be addressed in one chunk of code (or as one function), using %>% as necessary.
a. Subset long to the years 1975-2010, including 1975 and 2010, using & or the between() function, and call the result long_sub.
b. Further subset long_sub to the following countries using dplyr::filter() and the %in% operator on the sorted country factor (sorted): c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti").
c. Lastly, remove rows with missing mortality using filter() and is.na().
Hint: Be sure to assign the final object created from a through c as long_sub so you can use it in questions 6 and 7.
Subsetting long:
long_sub = long %>%
  filter(year >= 1975 & year <= 2010)
range(long_sub$year)
## [1] 1975 2010
The between() function is a shorthand for the same condition:
long_sub = long %>%
  filter(between(year, 1975, 2010))
range(long_sub$year)
## [1] 1975 2010
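# "Western Sahara" appears in the setdiff() output above, so it is not a level
# of sorted (its value is NA) and no rows will match it here
long_sub = long_sub %>%
  filter(sorted %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Chile", "Western Sahara", "Azerbaijan", "Argentina", "Haiti")) %>%
  filter(!is.na(mortality))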
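Since parts a through c can be combined, here is a minimal sketch of doing all three in a single filter() call, where conditions separated by commas are combined with AND. It uses a small made-up data set (toy_long) and a shortened country list (keep), both hypothetical, so that the result is easy to check by eye.

```r
library(dplyr)
library(tibble)

# Hypothetical data with the same columns used above (values are made up)
toy_long = tibble(
  sorted = factor(c("Chile", "Chile", "Haiti", "Haiti", "Japan")),
  year = c(1970, 1980, 1990, 2010, 2000),
  mortality = c(1.2, 0.9, NA, 0.8, 0.1)
)
keep = c("Chile", "Haiti")

toy_sub = toy_long %>%
  filter(
    between(year, 1975, 2010),  # part a: years 1975 through 2010, inclusive
    sorted %in% keep,           # part b: only the requested countries
    !is.na(mortality)           # part c: drop rows with missing mortality
  )

toy_sub
```

Only the Chile 1980 and Haiti 2010 rows remain: the 1970 row is outside the year range, the 1990 row has missing mortality, and Japan is not in keep.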
6. Plot the mortality data in long_sub as lines, using different colors for different countries, using sorted. The x-axis should be year, and the y-axis should be mortality. Make the plot using a. qplot and b. ggplot.
qplot(year, y = mortality, data = long_sub, color = sorted, geom = "line")
long_sub %>%
  ggplot(aes(x = year, y = mortality)) +
  geom_line(aes(colour = sorted))
# g needs a geom layer as well; without one it draws only empty axes
g = long_sub %>%
  ggplot(aes(x = year, y = mortality, colour = sorted)) +
  geom_line()
g
7. Load the plotly package (library(plotly)), assign the plot from question 6 to g, and run ggplotly(g).
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(g)