---
title: "Statistics Lab Key"
output: html_document
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r, include = FALSE}
library(tidyverse)
library(broom)
library(jhur)
library(psych)
```
# Part 1
Read in the following infant mortality data from http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv
```{r}
mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv")
```
or use
```{r}
mort = read_mortality()
```
Then change the first column name to `country`"
```{r}
mort = mort %>%
rename(country = X1)
```
or
```{r}
colnames(mort)[1] = "country"
```
1. Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality data.
No need to save this in an object. Just display the result to the screen.
```{r}
sub = mort %>%
select(`1980`, `1990`, `2000`, `2010`)
cor(sub)
cor(sub, use = "complete.obs")
psych::corr.test(sub, use = "complete.obs")
psych::corr.test(sub)
cor(mort[,c("1980", "1990", "2000", "2010")])
```
What's going on?! Seems we have `NA`s in the `2010` column
```{r}
summary(mort[,c("1980", "1990", "2000", "2010")])
```
Both are equivalent
```{r}
cor(mort[,c("1980", "1990", "2000", "2010")], use = "complete.obs")
cor(mort %>% select("1980", "1990", "2000", "2010"), use = "complete.obs")
```
2. a. Compute the correlation between the `Myanmar`, `China`, and `United States` mortality data. Store this correlation matrix in an object called `country_cor`
b. Extract the Myanmar-US correlation from the correlation matrix.
```{r}
mort_subs = mort %>%
filter(country %in% c( "China", "Myanmar","United States")) %>%
arrange(country)
long = mort_subs %>%
gather(key = "year", value = death, -country) %>%
filter(!is.na(death))
long = long %>%
spread(key = country, value = death)
sub = long %>%
select(Myanmar, China, `United States`)
cor(sub)
mort_subs = mort_subs %>%
select(-country)
mort_subs = t(mort_subs)
country_cor = cor(mort_subs,
use = "complete.obs")
```
Run the following to see that the order is China, Myanmar, US:
```{r}
mort %>% filter(country %in% c("Myanmar", "China", "United States"))
```
Extract the Myanmar-US correlation
```{r}
country_cor[1,3]
```
3. Is there a difference between mortality information from `1990` and `2000`?
Run a t-test and a Wilcoxon rank sum test to assess this.
Hint: to extract the column of information for `1990`, use `mort$"1990"`
```{r}
t.test(mort$"1990", mort$"2000")
t.test(mort$`1990`, mort$`2000`)
t.test(mort$`1990`, mort$`2000`, paired = TRUE)
wilcox.test(mort$"1990", mort$"2000")
wilcox.test(mort$"1990", mort$"2000", paired = TRUE)
```
# Part 2
Read in the Kaggle cars auction dataset from http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv with the argument `col_types = cols(VehBCost = col_double())`:
```{r}
cars = read_csv("http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv",
col_types = cols(VehBCost = col_double()))
```
4. Fit a linear regression model with vehicle cost (`VehBCost`) as the outcome and
vehicle age (`VehicleAge`) and whether it's an online sale (`IsOnlineSale`) as predictors as well as their interaction. Save the model fit in an object called `lmfit_cars` and display the summary table.
```{r}
lmfit_cars = lm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge*IsOnlineSale, data = cars)
lmfit_cars = lm(VehBCost ~ VehicleAge*IsOnlineSale, data = cars)
tidy(lmfit_cars)
```
5. Create a variable called `expensive` in the `cars` data that indicates if the
vehicle cost is over `$10,000`. Use a chi-squared test to assess if there is a
relationship between a car being expensive and it being labeled as a "bad buy" (`IsBadBuy`).
```{r}
cars = cars %>%
mutate(expensive = VehBCost > 10000)
tab = table(cars$expensive, cars$IsBadBuy)
chisq.test(tab)
fisher.test(tab)
prop.test(tab)
```
6. Fit a logistic regression model where the outcome is "bad buy" status and predictors are the `expensive` status and vehicle age (`VehicleAge`).
Save the model fit in an object called `logfit_cars` and display the summary table.
Use `summary` or `tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)`
or `tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)` for log odds ratios
```{r}
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial())
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = "binomial")
summary(logfit_cars)
tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)
tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)
```
7. Randomly sample `10,000` observations (rows) with replacement from the `cars` dataset and store this new dataset in an object called `cars_subsample`. Hint: `sample_n` function
```{r}
set.seed(1)
cars_subsample = cars %>%
sample_n(size = 10000)
```
or another way
```{r}
rows = sample(nrow(cars), size = 10000, replace = TRUE)
cars_subsample = cars[rows,]
```
8. Fit the same logistic regression model as in problem 6 above
with this sample and display the summary table using the `tidy` function. How do the results compare? Call this model `logfit_cars_sub`:
```{r}
logfit_cars_sub = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
summary(logfit_cars_sub)
tidy(logfit_cars_sub, conf.int = TRUE, exponentiate = TRUE)
```