---
title: "Data Summarization"
author: "Introduction to R for Public Health Researchers"
output:
ioslides_presentation:
css: ../styles.css
widescreen: yes
---
```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
library(jhur)
```
## Data Summarization
* Basic statistical summarization
* `mean(x)`: takes the mean of x
* `sd(x)`: takes the standard deviation of x
* `median(x)`: takes the median of x
* `quantile(x)`: displays sample quantities of x. Default is min, IQR, max
* `range(x)`: displays the range. Same as `c(min(x), max(x))`
* `sum(x)`: sum of x
* **all have a **`na.rm` for missing data - discussed later
* Transformations
* `log` - log (base `e`) transformation
* `log2` - log base 2 transform
* `log10` - log base 10 transform
* `sqrt` - square root
## Some examples
We can use the `jhu_cars` to explore different ways of summarizing data. The `head` command displays the first `6` (default) rows of an object:
```{r}
library(jhur)
head(jhu_cars)
```
## Statistical summarization
Note - the `$` references/selects columns from a `data.frame`/`tibble`:
```{r}
mean(jhu_cars$hp)
quantile(jhu_cars$hp)
```
## Statistical summarization
```{r}
median(jhu_cars$wt)
quantile(jhu_cars$wt, probs = 0.6)
```
## Statistical summarization
`t.test` will be covered more in detail later, gives a mean and 95\% CI:
```{r}
t.test(jhu_cars$wt)
broom::tidy(t.test(jhu_cars$wt))
```
## Statistical summarization
Note that many of these functions have additional inputs regarding missing data, typically requiring the `na.rm` argument ("remove NAs").
```{r}
x = c(1,5,7,NA,4,2, 8,10,45,42)
mean(x)
mean(x, na.rm = TRUE)
quantile(x, na.rm = TRUE)
```
## Data Summarization on matrices/data frames
* Basic statistical summarization
* `rowMeans(x)`: takes the means of each row of x
* `colMeans(x)`: takes the means of each column of x
* `rowSums(x)`: takes the sum of each row of x
* `colSums(x)`: takes the sum of each column of x
* `summary(x)`: for data frames, displays the quantile information
* The `matrixStats` package has additional `row*` and `col*` functions
* Like `rowSds`, `colQuantiles`
## Lab Part 1
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## TB Incidence
Please download the TB incidence data:
http://johnmuschelli.com/intro_to_r/data/tb_incidence.xlsx
Here we will read in a `tibble` of values from TB incidence:
```{r}
library(readxl)
# tb <- read_excel("http://johnmuschelli.com/intro_to_r/data/tb_incidence.xlsx")
tb = jhur::read_tb()
colnames(tb)
```
## Indicator of TB
We can rename the first column to be the country measured using the `rename` function in `dplyr` (we have to use the \` things because there are spaces in the name):
```{r}
library(dplyr)
tb = tb %>% rename(country = `TB incidence, all forms (per 100 000 population per year)`)
```
`colnames` will show us the column names and show that country is renamed:
```{r}
colnames(tb)
```
## Summarize the data: `dplyr` `summarize` function
`dplyr::summarize` will allow you to summarize data. Format is `new = SUMMARY`. If you don't set a `new` name, it will be a messy output:
```{r}
tb %>%
summarize(mean_2006 = mean(`2006`, na.rm = TRUE),
media_2007 = median(`2007`, na.rm = TRUE),
median(`2004`, na.rm = TRUE))
```
## Column and Row means
`colMeans` and `rowMeans` must work on **all numeric data**. We will subset years before 2000 (starting with 1):
```{r colMeans}
avgs = select(tb, starts_with("1"))
colMeans(avgs, na.rm = TRUE)
```
```{r}
tb$before_2000_avg = rowMeans(avgs, na.rm = TRUE)
head(tb[, c("country", "before_2000_avg")])
```
## Summarize the data: `dplyr` `summarize` function
`dplyr::summarize` will allow you to summarize data. If you would like to summarize **all** columns, you can use `summarize_all` and pass in a function (with other arguments):
```{r, echo = TRUE, eval=FALSE}
summarize_all(DATASET, FUNCTION, OTHER_FUNCTION_ARGUMENTS) # how to use
```
```{r}
summarize_all(avgs, mean, na.rm = TRUE)
```
## Summary Function
Using `summary` can give you rough snapshots of each column, but you would likely use `mean`, `min`, `max`, and `quantile` when necessary (and number of NAs):
```{r summary1}
summary(tb)
```
## Youth Tobacco Survey
Here we will be using the Youth Tobacco Survey data:
http://johnmuschelli.com/intro_to_r/data/Youth_Tobacco_Survey_YTS_Data.csv .
```{r}
yts = jhur::read_yts()
head(yts)
```
## Length and unique
`unique(x)` will return the unique elements of `x`
```{r, message = FALSE}
head(unique(yts$LocationDesc), 10)
```
`length` will tell you the length of a vector. Combined with `unique`, tells you the number of unique elements:
```{r}
length(unique(yts$LocationDesc))
```
## Table
`table(x)` will return a frequency table of unique elements of `x`
```{r, message = FALSE}
head(table(yts$LocationDesc))
```
## `dplyr`: `count`
```{r, message = FALSE}
yts %>% count(LocationDesc)
```
## `dplyr`: `count`
```{r, message = FALSE}
yts %>% count(LocationDesc, Age)
```
## Subsetting to specific columns
Let's just take smoking status measures for all genders in middle school current smoking using `filter`, and the columns that represent the year, state and percentage using `select`:
```{r, message=FALSE}
library(dplyr)
sub_yts = filter(yts, MeasureDesc == "Smoking Status",
Gender == "Overall", Response == "Current",
Education == "Middle School")
sub_yts = select(sub_yts, YEAR, LocationDesc, Data_Value, Data_Value_Unit)
head(sub_yts, 4)
```
## Perform Operations By Groups: dplyr
`group_by` allows you group the data set by grouping variables:
```{r}
sub_yts = group_by(sub_yts, YEAR)
head(sub_yts)
```
- doesn't change the data in any way, but how **functions operate on it**
## Summarize the data
It's grouped!
```{r}
sub_yts %>% summarize(year_avg = mean(Data_Value, na.rm = TRUE))
```
## Using the `pipe`
Pipe `sub_yts` into `group_by`, then pipe that into `summarize`:
```{r}
yts_avgs = sub_yts %>%
group_by(YEAR) %>%
summarize(year_avg = mean(Data_Value, na.rm = TRUE),
year_median = median(Data_Value, na.rm = TRUE))
head(yts_avgs)
```
## Ungroup the data
You usually want to perform operations on groups and may want to redefine the groups. The `ungroup` function will allow you to clear the groups from the data:
```{r}
sub_yts = ungroup(sub_yts)
sub_yts
```
## `group_by` with `mutate` - just add data
We can also use `mutate` to calculate the mean value for each year and add it as a column:
```{r}
sub_yts %>%
group_by(YEAR) %>%
mutate(year_avg = mean(Data_Value, na.rm = TRUE)) %>%
arrange(LocationDesc, YEAR) # look at year 2000 value
```
## Counting
Standard statistics can be calculated. There are other functions, such as `n()` count the number of observations.
```{r}
sub_yts %>%
group_by(YEAR) %>%
summarize(n = n(),
mean = mean(Data_Value, na.rm = TRUE)) %>%
head
```
## Lab Part 2
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Data Summarization/Visualization: ggplot2
`ggplot2` is a package of plotting that is very popular and powerful (using the **g**rammar of **g**raphics). We will use `qplot` ("quick plot") for most of the basic examples:
```{r, eval = FALSE}
qplot
```
```{r, echo = FALSE}
args(qplot)
```
## Basic Plots
Plotting is an important component of exploratory data analysis. We will review some of the more useful and informative plots here. We will go over formatting and making plots look nicer in additional lectures.
## Scatterplot
```{r}
library(ggplot2)
qplot(x = disp, y = mpg, data = jhu_cars)
```
## Histograms
```{r}
qplot(x = before_2000_avg, data = tb, geom = "histogram")
```
## Plot with a line
```{r}
qplot(x = YEAR, y = year_avg, data = yts_avgs, geom = "line")
```
## Density
Over all years and states, this is the density of smoking status incidence:
```{r}
qplot(x = Data_Value, data = sub_yts, geom = "density")
```
## Boxplots
```{r}
qplot(x = LocationDesc, y = Data_Value, data = sub_yts, geom = "boxplot")
```
## Boxplots
```{r}
qplot(x = LocationDesc, y = Data_Value,
data = sub_yts, geom = "boxplot") + coord_flip()
```
```{r ggally_pairs, warning=FALSE, echo = FALSE}
library(GGally)
# ggpairs(avgs)
```
## Lab Part 3
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Base functions for plotting
* Basic summarization plots:
* `plot(x,y)`: scatterplot of x and y
* `boxplot(y~x)`: boxplot of y against levels of x
* `hist(x)`: histogram of x
* `density(x)`: kernel density plot of x
## Conclusion
- `group_by` is very powerful, especially with `summarise/summarize`
- Using `group_by` and `mutate` keeps all the rows and repeates a value, `summarize` reduces the number of rows
- The `matrixStats` package extends this to `colMedians`, `colMaxs`, etc.
## Website
[Website](http://johnmuschelli.com/intro_to_r/index.html)
# Base R Plots - not covered
## Basic Plots
Plotting is an important component of exploratory data analysis. We will review some of the more useful and informative plots here. We will go over formatting and making plots look nicer in additional lectures.
## Scatterplot
```{r scatter1}
plot(jhu_cars$mpg, jhu_cars$disp)
```
## Histograms
```{r hist1}
hist(tb$before_2000_avg)
```
## Plot with a line
`type = "l"` means a line
```{r hist_date}
plot(yts_avgs$YEAR, yts_avgs$year_avg, type = "l")
```
## Density
Over all years and states, this is the density of smoking status incidence:
```{r dens1,fig.width=5,fig.height=5}
plot(density(sub_yts$Data_Value))
```
## Boxplots
```{r box1}
boxplot(sub_yts$Data_Value ~ sub_yts$LocationDesc)
```
## Boxplots
```{r box2}
boxplot(Data_Value ~ LocationDesc, data = sub_yts)
```
## Data Summarization for data.frames
* Basic summarization plots
* `matplot(x,y)`: scatterplot of two matrices, x and y
* `pairs(x,y)`: plots pairwise scatter plots of matrices x and y, column by column
## Matrix plot
```{r matplot2}
pairs(avgs)
```
## Apply statements
You can apply more general functions to the rows or columns of a matrix or data frame, beyond the mean and sum.
```
apply(X, MARGIN, FUN, ...)
```
> X : an array, including a matrix.
>
> MARGIN : a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
>
> FUN : the function to be applied: see 'Details'.
>
> ... : optional arguments to FUN.
## Apply statements
```{r apply1}
apply(avgs,2,mean, na.rm=TRUE) # column means
head(apply(avgs,1,mean, na.rm=TRUE)) # row means
apply(avgs,2,sd, na.rm=TRUE) # columns sds
apply(avgs,2,max, na.rm=TRUE) # column maxs
```
## Other Apply Statements
* `tapply()`: 'grouping' apply
* `lapply()`: 'list' apply [tomorrow]
* `sapply()`: 'simple' apply [tomorrow]
* Other less used ones...
See more details here: http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
- Commonly used, but we will discuss how to do all steps in `dplyr`