---
title: "Statistics"
author: "Introduction to R for Public Health Researchers"
output:
  ioslides_presentation:
    css: ../styles.css
    widescreen: yes
---

```{r knit-setup, include=FALSE}
library(knitr)
opts_chunk$set(echo = TRUE, 
               message = FALSE, 
               warning = FALSE,
               fig.height = 4,
               fig.width = 7, 
               comment = "")
library(dplyr)
```

## Statistics

Now we are going to cover how to perform a variety of basic statistical tests in R. 

* Correlation
* T-tests/Rank-sum tests
* Linear Regression
* Logistic Regression
* Proportion tests
* Chi-squared
* Fisher's Exact Test

Note: We will be glossing over the statistical theory and "formulas" for these tests. There are plenty of resources online for learning more about these tests, as well as dedicated Biostatistics series at the School of Public Health

## Correlation 

`cor()` performs correlation in R

```
cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))
```

Like other functions, if there are NAs, you get NA as the result.  But if you specify use only the complete observations, then it will give you correlation on the non-missing data. 
```{r cor1, comment="", message = FALSE}
library(readr)
circ = read_csv("http://johnmuschelli.com/intro_to_r/data/Charm_City_Circulator_Ridership.csv")
cor(circ$orangeAverage, circ$purpleAverage, use="complete.obs")
```


## Correlation with `corrr` 

The `corrr` package allows you to do correlations easily:

```{r cor2, comment="", message = TRUE}
library(corrr); library(dplyr)
circ %>% 
  select(orangeAverage, purpleAverage) %>% 
  correlate()
```

## Correlation  

`correlate` is usually better with more than 2 columns:

```{r cor3, comment=""}
avgs = circ %>% select(ends_with("Average"))
cobj = avgs %>% correlate(use = "complete.obs", diagonal = 1)
cobj %>% fashion(decimals = 3)
```

## Correlation  

```{r corplot, comment=""}
cobj %>% rplot()
```


<!-- ## Correlation  -->

<!-- You can also get the correlation between matrix columns -->

<!-- Or between columns of two matrices/dfs, column by column. -->

<!-- ```{r cor3, comment=""} -->
<!-- op = avgs %>% select(orangeAverage, purpleAverage) -->
<!-- gb = avgs %>% select(greenAverage, bannerAverage) -->
<!-- signif(cor(op, gb, use = "complete.obs"), 3) -->
<!-- ``` -->

## Correlation  

You can also use `cor.test()` to test for whether correlation is significant (ie non-zero). Note that linear regression may be better, especially if you want to regress out other confounders.  

```{r cor4, comment=""}
ct = cor.test(circ$orangeAverage, circ$purpleAverage, 
              use = "complete.obs")
ct
```

## Correlation

For many of these testing result objects, you can extract specific slots/results as numbers, as the `ct` object is just a list.

```{r cor5, comment=""}
# str(ct)
names(ct)
ct$statistic
ct$p.value
```

## Broom package

The `broom` package has a `tidy` function that puts most objects into `data.frame`s so that they are easily manipulated:

```{r broom, comment=""}
library(broom)
tidy_ct = tidy(ct)
tidy_ct
```


<!-- ## Correlation {.smaller} -->

<!-- Note that you can add the correlation to a plot, via the legend() function.  -->

<!-- ```{r cor4a, comment="",  fig.height=4,fig.width=4} -->
<!-- txt = paste0("r=", signif(ct$estimate,3)) -->
<!-- plot(circ$orangeAverage, circ$purpleAverage, -->
<!--      xlab="Orange Line", ylab="Purple Line", -->
<!--      main="Average Ridership",cex.axis=1.5, -->
<!--      cex.lab=1.5,cex.main=2) -->
<!-- legend("topleft", txt, bty="n",cex=1.5) -->
<!-- ``` -->

## Correlation 

Note that you can add the correlation to a plot, via the `annotate`

```{r cor_ggplot, comment="",  fig.height=4,fig.width=4}
library(ggplot2)
txt = paste0("r=", signif(ct$estimate,3))
q = qplot(data = circ, x = orangeAverage, y = purpleAverage)
q + annotate("text", x = 4000, y = 8000, label = txt, size = 5)
```


## T-tests

The T-test is performed using the `t.test()` function, which essentially tests for the difference in means of a variable between two groups.

In this syntax, x and y are the column of data for each group.

```{r tt1, comment=""}
tt = t.test(circ$orangeAverage, circ$purpleAverage)
tt
```

## T-tests

Using `t.test` treats the data as independent.  Realistically, this data should be treated as a paired t-test.  The `paired = TRUE` argument to do a paired test

```{r tt1_paired, comment=""}
t.test(circ$orangeAverage, circ$purpleAverage, paired = TRUE)
```

## T-tests

`t.test` saves a lot of information: the difference in means `estimate`, confidence interval for the difference `conf.int`, the p-value `p.value`, etc.

```{r tt1_1, comment=""}
names(tt)
```

## T-tests

```{r tt1_broom, comment=""}
tidy(tt)
```

## T-tests

You can also use the 'formula' notation.  In this syntax, it is `y ~ x`, where `x` is a factor with 2 levels or a binary variable and `y` is a vector of the same length.  

```{r long_tt} 
library(tidyr)
long = circ %>% 
  select(date, orangeAverage, purpleAverage) %>% 
  gather(key = line, value = avg, -date)
tt = t.test(avg ~ line, data = long)
tidy(tt)
```

## Wilcoxon Rank-Sum Tests

Nonparametric analog to t-test (testing medians):

```{r wt, comment=""}
tidy(wilcox.test(avg ~ line, data = long))
```


## Lab Part 1

[Website](http://johnmuschelli.com/intro_to_r/index.html)

## Analysis of Variance

The `aov` function exists for ANOVA, but we'd recommend `lm` (Linear models):

```{r aov, comment=""}
long3 = circ %>% 
  select(date, orangeAverage, purpleAverage, bannerAverage) %>% 
  gather(key = line, value = avg, -date)
anova_res = aov(avg ~ line, data = long3)
anova_res
```


## Linear Regression

Now we will briefly cover linear regression. I will use a little notation here so some of the commands are easier to put in the proper context.
$$
y_i = \alpha + \beta x_i + \varepsilon_i 
$$
where:

* $y_i$ is the outcome for person i
* $\alpha$ is the intercept
* $\beta$ is the slope
* $x_i$ is the predictor for person i
* $\varepsilon_i$ is the residual variation for person i

## Linear Regression

The `R` version of the regression model is:

```
y ~ x
```

where: 

* y is your outcome
* x is/are your predictor(s)

## Linear Regression

For a linear regression, when the predictor is binary this is the same as a t-test:

```{r regress1, comment=""}
fit = lm(avg ~ line, data = long)
fit
```

'(Intercept)' is $\alpha$

'linepurpleAverage' is $\beta$


## Linear Regression

```{r regress1000, comment=""}
fit = lm(avg ~ line, data = long3)
fit
```

## Linear Regression

The `summary` command gets all the additional information (p-values, t-statistics, r-square) that you usually want from a regression.

```{r regress2, comment=""}
sfit = summary(fit)
print(sfit)
```

## Linear Regression

We can `tidy` linear models as well and it gives us all of this in a tibble:

```{r tidy_lm, comment=""}
tidy(fit)
```

## Linear Regression

The `confint` argument allows for confidence intervals
```{r tidy_lm_onf, comment=""}
tidy(fit, conf.int = TRUE)
```

## Using Cars Data

```{r tt2, comment="", message = FALSE}
http_data_dir = "http://johnmuschelli.com/intro_to_r/data/"
cars = read_csv(
  paste0(http_data_dir, "kaggleCarAuction.csv"),   
  col_types = cols(VehBCost = col_double()))
head(cars)
```

## Linear Regression

We'll look at vehicle odometer value by vehicle age:

```{r}
fit = lm(VehOdo~VehicleAge, data = cars)
print(fit)
```

<!-- ## Linear Regression {.smaller} -->

<!-- We can visualize the vehicle age/odometer relationshp using scatter plots or box plots (with regression lines).  The function `abline` will plot the regresion line on the plot. -->


<!-- ```{r regress4, comment="", fig.height=4,fig.width=8} -->
<!-- par(mfrow=c(1,2)) -->
<!-- plot(VehOdo ~ jitter(VehicleAge,amount=0.2), data=cars, pch = 19, -->
<!--      col = scales::alpha("black",0.05), xlab = "Vehicle Age (Yrs)") -->
<!-- abline(fit, col = "red",lwd=2) -->
<!-- legend("topleft", paste("p =",summary(fit)$coef[2,4])) -->
<!-- boxplot(VehOdo ~ VehicleAge, data=cars, varwidth=TRUE) -->
<!-- abline(fit, col="red",lwd=2) -->
<!-- ``` -->

## Linear Regression

Note that you can have more than 1 predictor in regression models.The interpretation for each slope is change in the predictor corresponding to a one-unit change in the outcome, holding all other predictors constant.

```{r regress5, comment="", fig.height=4,fig.width=8}
fit2 = lm(VehOdo ~ IsBadBuy + VehicleAge, data = cars)
tidy(fit2)  
```


## Linear Regression: Interactions

The `*` does interactions:

```{r regress9, comment="", fig.height=4,fig.width=8}
fit3 = lm(VehOdo ~ IsBadBuy * VehicleAge, data = cars)
tidy(fit3)  
```


## Linear Regression: Interactions

You can take out main effects with minus

```{r regress10, comment="", fig.height=4,fig.width=8}
fit4 = lm(VehOdo ~ IsBadBuy * VehicleAge -IsBadBuy , data = cars)
tidy(fit4)  
```


## Linear Regression {.smaller}

Factors get special treatment in regression models - lowest level of the factor is the comparison group, and all other factors are relative to its values.

```{r regress6, comment="", fig.height=4,fig.width=8}
fit3 = lm(VehOdo ~ factor(TopThreeAmericanName), data = cars)
tidy(fit3)  
```

## Logistic Regression and GLMs {.smaller}

Generalized Linear Models (GLMs) allow for fitting regressions for non-continuous/normal outcomes.  The `glm` has similar syntax to the `lm` command. Logistic regression is one example. See `?family` for 

```{r regress7, comment="", fig.height=4,fig.width=8}
glmfit = glm(IsBadBuy ~ VehOdo + VehicleAge, data=cars, family = binomial())
tidy(glmfit)  
```

## Tidying GLMs

```{r tidy_glmfit, comment="", fig.height=4,fig.width=8}
tidy(glmfit, conf.int = TRUE)
```

## Tidying GLMs

```{r tidy_glm_exp, comment="", fig.height=4,fig.width=8}
tidy(glmfit, conf.int = TRUE, exponentiate = TRUE)
```

## Logistic Regression 

Note the coefficients are on the original scale, we must exponentiate them for odds ratios:

```{r regress8, comment="", fig.height=4,fig.width=8}
exp(coef(glmfit))
```

## Chi-squared tests

`chisq.test()` performs chi-squared contingency table tests and goodness-of-fit tests.

```
chisq.test(x, y = NULL, correct = TRUE,
           p = rep(1/length(x), length(x)), rescale.p = FALSE,
           simulate.p.value = FALSE, B = 2000)
```

```{r chisq1, comment=""}
tab = table(cars$IsBadBuy, cars$IsOnlineSale)
tab
```

## Chi-squared tests

You can also pass in a table object (such as `tab` here)
```{r chisq2, comment=""}
cq = chisq.test(tab)
cq
names(cq)
cq$p.value
```

## Chi-squared tests

Note that does the same test as `prop.test`, for a 2x2 table (`prop.test` not relevant for greater than 2x2).

```{r chisq3, comment=""}
chisq.test(tab)
prop.test(tab)
```

## Fisher's Exact test

`fisher.test()` performs contingency table test using the hypogeometric distribution (used for small sample sizes).

```
fisher.test(x, y = NULL, workspace = 200000, hybrid = FALSE,
            control = list(), or = 1, alternative = "two.sided",
            conf.int = TRUE, conf.level = 0.95,
            simulate.p.value = FALSE, B = 2000)
```

```{r fish.test, comment=""}
fisher.test(tab)
```

## Sampling

Also, if you want to only plot a subset of the data (for speed/time or overplotting)

```{r samp_plot, comment="", fig.height=4,fig.width=7}
samp.cars = sample_n(cars, size = 10000)
samp.cars = sample_frac(cars, size = 0.2)
ggplot(aes(x = VehBCost, y = VehOdo), 
       data = samp.cars) + geom_point() 
```

## Lab Part 2

[Website](http://johnmuschelli.com/intro_to_r/index.html)