---
title: "Statistics Lab Key"
output: html_document
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

```{r, include = FALSE}
library(tidyverse)
library(broom)
library(jhur)
library(psych)
```

# Part 1
Read in the following infant mortality data from http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv
```{r}
mort = read_csv("http://johnmuschelli.com/intro_to_r/data/indicatordeadkids35.csv")
```
or use 
```{r}
mort = read_mortality()
```

Then change the first column name to `country`"

```{r}
mort = mort %>% 
  rename(country = X1)
```

or 

```{r}
colnames(mort)[1] = "country"
```

1. Compute the correlation between the `1980`, `1990`, `2000`, and `2010` mortality data.
No need to save this in an object. Just display the result to the screen.
```{r}
sub = mort %>% 
  select(`1980`, `1990`, `2000`, `2010`)
cor(sub)
cor(sub, use = "complete.obs")
psych::corr.test(sub, use = "complete.obs")
psych::corr.test(sub)
cor(mort[,c("1980", "1990", "2000", "2010")])
```

What's going on?! Seems we have `NA`s in the `2010` column

```{r}
summary(mort[,c("1980", "1990", "2000", "2010")])
```
Both are equivalent
```{r}
cor(mort[,c("1980", "1990", "2000", "2010")], use = "complete.obs")
cor(mort %>% select("1980", "1990", "2000", "2010"), use = "complete.obs")
```

2. a. Compute the correlation between the `Myanmar`, `China`, and `United States` mortality data. Store this correlation matrix in an object called `country_cor`

b. Extract the Myanmar-US correlation from the correlation matrix.
```{r}
mort_subs = mort %>% 
  filter(country %in% c( "China", "Myanmar","United States")) %>% 
  arrange(country)
long = mort_subs %>% 
  gather(key = "year", value = death, -country) %>% 
  filter(!is.na(death))
long = long %>% 
  spread(key = country, value = death)
sub = long %>% 
  select(Myanmar, China, `United States`)
cor(sub)
mort_subs = mort_subs %>% 
  select(-country)
mort_subs = t(mort_subs)
country_cor = cor(mort_subs, 
                  use = "complete.obs")
```

Run the following to see that the order is China, Myanmar, US:
```{r}
mort %>% filter(country %in% c("Myanmar", "China", "United States"))
```

Extract the Myanmar-US correlation

```{r}
country_cor[1,3]
```

3. Is there a difference between mortality information from `1990` and `2000`?
Run a t-test and a Wilcoxon rank sum test to assess this.
Hint: to extract the column of information for `1990`, use `mort$"1990"`

```{r}
t.test(mort$"1990", mort$"2000")
t.test(mort$`1990`, mort$`2000`)
t.test(mort$`1990`, mort$`2000`, paired = TRUE)
wilcox.test(mort$"1990", mort$"2000")
wilcox.test(mort$"1990", mort$"2000", paired = TRUE)
```

# Part 2

Read in the Kaggle cars auction dataset from http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv with the argument `col_types = cols(VehBCost = col_double())`:
```{r}
cars = read_csv("http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv",
col_types = cols(VehBCost = col_double()))
```

4. Fit a linear regression model with vehicle cost (`VehBCost`) as the outcome and 
vehicle age (`VehicleAge`) and whether it's an online sale (`IsOnlineSale`) as predictors as well as their interaction.  Save the model fit in an object called `lmfit_cars` and display the summary table.

```{r}
lmfit_cars = lm(VehBCost ~ VehicleAge + IsOnlineSale + VehicleAge*IsOnlineSale, data = cars)
lmfit_cars = lm(VehBCost ~ VehicleAge*IsOnlineSale, data = cars)
tidy(lmfit_cars)
```

5. Create a variable called `expensive` in the `cars` data that indicates if the 
vehicle cost is over `$10,000`. Use a chi-squared test to assess if there is a
relationship between a car being expensive and it being labeled as a "bad buy" (`IsBadBuy`).

```{r}
cars = cars %>%
mutate(expensive = VehBCost > 10000)
tab = table(cars$expensive, cars$IsBadBuy)
chisq.test(tab)
fisher.test(tab)
prop.test(tab)
```

6. Fit a logistic regression model where the outcome is "bad buy" status and predictors are the `expensive` status and vehicle age (`VehicleAge`).
Save the model fit in an object called `logfit_cars` and display the summary table.
Use `summary` or `tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)`
or `tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)` for log odds ratios 

```{r}
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial())
logfit_cars = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = "binomial")
summary(logfit_cars)
tidy(logfit_cars, conf.int = TRUE, exponentiate = TRUE)
tidy(logfit_cars, conf.int = TRUE, exponentiate = FALSE)
```

7. Randomly sample `10,000` observations (rows) with replacement from the `cars` dataset and store this new dataset in an object called `cars_subsample`.  Hint: `sample_n` function

```{r}
set.seed(1)
cars_subsample = cars %>% 
  sample_n(size = 10000)
```

or another way

```{r}
rows = sample(nrow(cars), size = 10000, replace = TRUE)
cars_subsample = cars[rows,]
```

8. Fit the same logistic regression model as in problem 6 above 
with this sample and display the summary table using the `tidy` function. How do the results compare?  Call this model `logfit_cars_sub`:

```{r}
logfit_cars_sub = glm(IsBadBuy ~ expensive + VehicleAge, data = cars, family = binomial)
summary(logfit_cars_sub)
tidy(logfit_cars_sub, conf.int = TRUE, exponentiate = TRUE)
```