---
title: "Knitr"
author: "Introduction to R for Public Health Researchers"
output:
html_document:
toc: yes
pdf_document:
toc: yes
word_document:
toc: yes
---
The three "back ticks" (\`) must be followed by curly brackets "{", and then "r" to tell the computer that you are using R code. This line is then closed off by another curly bracket "}".
Anything before three more back ticks "```" are then considered R code (a script).
If any code in the document has just a backtick \` then nothing, then another backtick, then that word is just printed as if it were code, such as `hey`.
I'm reading in the bike lanes here.
```{r readin, message = FALSE}
# readin is just a "label" for this code chunk
## code chunk is just a "chunk" of code, where this code usually
## does just one thing, aka a module
### comments are still # here
### you can do all your reading in there
### let's say we loaded some packages
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
fname <- "http://johnmuschelli.com/intro_to_r/data/Bike_Lanes.csv"
bike = read_csv(fname)
```
You can write your introduction here.
## Introduction
Bike lanes are in Baltimore. People like them. Why are they so long?
## Exploratory Analysis
Let's look at some plots of bike length. Let's say we wanted to look at what affects bike length.
### Plots of bike length
Note we made the subsection by using three "hashes" (pound signs): ###.
We can turn off R code output by using `echo = FALSE` on the knitr code chunks.
```{r, echo = FALSE}
no.missyear <- bike %>% filter(dateInstalled != 0)
plot(no.missyear$dateInstalled, no.missyear$length)
boxplot(length ~ dateInstalled, data = no.missyear)
```
```{r, echo = TRUE}
no.missyear = no.missyear %>% mutate(dateInstalled = factor(dateInstalled))
library(ggplot2)
gbox = no.missyear %>% ggplot(aes(x = dateInstalled, y = length)) + geom_boxplot()
print(gbox)
```
We have a total of `r nrow(no.missyear)` rows.
What does it look like if we took the log (base 10) of the bike length:
```{r}
no.missyear <- no.missyear %>% mutate(log.length = log10(length))
### see here that if you specify the data argument, you don't need to do the $
boxplot(log.length ~ dateInstalled, data = no.missyear,
main = "Boxplots of Bike Lenght by Year",
xlab="Year",
ylab="Bike Length")
glogbox = no.missyear %>% ggplot(aes(x = dateInstalled, y = log.length)) + geom_boxplot() +
ggtitle("Boxplots of Bike Lenght by Year") +
xlab("Year") +
ylab("Bike Length")
print(glogbox)
```
I want my boxplots colored, so I set the `col` argument.
```{r}
boxplot(log.length ~ dateInstalled,
data=no.missyear,
main="Boxplots of Bike Lenght by Year",
xlab="Year",
ylab="Bike Length",
col="red")
glogbox + geom_boxplot(fill = "red")
```
As we can see, 2006 had a much higher bike length. What about for the type of bike path?
```{r}
### type is a character, but when R sees a "character" in a "formula", then it automatically converts it to factor
### a formula is something that has a y ~ x, which says I want to plot y against x
### or if it were a model you would do y ~ x, which meant regress against y
boxplot(log.length ~ type, data=no.missyear, main="Boxplots of Bike Lenght by Year", xlab="Year", ylab="Bike Length")
```
## Multiple Facets
We can do the plot with different panels for each type.
```{r}
glogbox + facet_wrap(~ type)
```
NOTE, this is different than if we colored on type:
```{r}
glogbox + aes(colour = type)
```
### Means by type
What if we want to extract means by each type?
Let's show a few ways:
```{r}
no.missyear %>% group_by(type) %>%
dplyr::summarise(mean = mean(log.length))
```
Let's show a what if we wanted to go over `type` and `dateInstalled`:
```{r}
no.missyear %>% group_by(type, dateInstalled) %>%
dplyr::summarise(mean = mean(log.length),
median = median(log.length),
Std.Dev = sd(log.length))
```
## Linear Models
OK let's do some linear model
```{r}
### type is a character, but when R sees a "character" in a "formula", then it automatically converts it to factor
### a formula is something that has a y ~ x, which says I want to plot y against x
### or if it were a model you would do y ~ x, which meant regress against y
mod.type = lm(log.length ~ type, data = no.missyear)
mod.yr = lm(log.length ~ factor(dateInstalled), data = no.missyear)
mod.yrtype = lm(log.length ~ type + factor(dateInstalled), data = no.missyear)
summary(mod.type)
```
That's rather UGLY, so let's use a package called `pander` and then make this model into an `pander` object and then print it out nicely.
## Grabbing coefficients
We can use the `coef` function on a summary, or do `smod$coef` to get the coefficients. But they are in a `matrix`:
```{r}
smod = summary(mod.type)
coef(smod)
class(coef(smod))
```
### Broom package
The `broom` package can "tidy" up the output to actually put the terms into a column of a data.frame that you can grab values from:
```{r}
library(broom)
smod2 = tidy(mod.type)
class(smod2)
better = smod2 %>% mutate(term = str_replace(term, "^type", ""))
better
better %>% filter(term == "SIDEPATH")
write.csv(better, file = "Best_Model_Coefficients.csv")
```
BUT I NEEEEEED an XLSX! The `xlsx` package can do it, but I still tend to use CSVs. May need to look at [this post](https://stackoverflow.com/questions/30738974/rjava-load-error-in-rstudio-r-after-upgrading-to-osx-yosemite) if getting "Reason: image not found" error.
```{r}
library(xlsx) # may need to run sudo R CMD javareconf
write.xlsx(better, file = "Best_Model_Coefficients.xlsx")
```
### Testing Nested Models
The `anova` command will test nested models and give you a table of results:
```{r}
my_lrtest = anova(mod.yrtype, mod.yr)
print(my_lrtest)
print(tidy(my_lrtest))
```
Similarly with year:
```{r}
my_lrtest = anova(mod.yrtype, mod.type)
print(tidy(my_lrtest))
```
ASIDE: the `aov` function fits what you think of when you think ANOVA.
## Pander
Pander can output tables (as well as other things such as models), so let's print this using the `pander` command from the `pander` package. So `pander` is really good when you are trying to print out a table (in html, otherwise make the table and use `write.csv` to get it in Excel and then format) really quickly and in a report.
```{r}
# devtools::install_github('Rapporter/pander') # need this version!
library(pander)
pander(mod.yr)
```
It is the same if we write out the summary, but more information is in the **footer**.
```{r}
pander(summary(mod.yr))
```
### Formatting
Let's format the rows and the column names a bit better:
#### Changing the terms
```{r}
ptable = tidy(mod.yr)
ptable$term = ptable$term %>%
str_replace(fixed("factor(dateInstalled)"), "") %>%
str_replace(fixed("(Intercept)"), "Intercept")
```
#### Column Names
Now we can reset the column names if we didn't like them before:
```{r}
colnames(ptable) = c("Variable", "Beta", "SE", "tstatistic", "p.value")
pander(ptable)
```
#### Confidence Intervals
Let's say we want the beta, the 95% CI. We can use `confint` on the model, `merge` it to `ptable` and then paste the columns together (after rounding) with a comma and bound them in parentheses.
```{r}
cint = confint(mod.yr)
print(cint)
print(class(cint))
```
Tidying it up
```{r}
cint = tidy(cint)
colnames(cint) = c("Variable", "lower", "upper")
cint$Variable = cint$Variable %>%
str_replace(fixed("factor(dateInstalled)"), "") %>%
str_replace(fixed("(Intercept)"), "Intercept")
ptable = left_join(ptable, cint, by = "Variable")
ptable = ptable %>% mutate(lower = round(lower, 2),
upper = round(lower, 2),
Beta = round(Beta, 2),
p.value = ifelse(p.value < 0.01, "< 0.01",
round(p.value,2)))
ptable = ptable %>% mutate(ci = paste0("(", lower, ", ", upper, ")"))
ptable = dplyr::select(ptable, Beta, ci, p.value)
pander(ptable)
```
## Multiple Models
OK, that's pretty good, but let's say we have all three models. You can't put doesn't work so well with *many* models together.
```{r}
# pander(mod.yr, mod.yrtype) does not work
# pander(list(mod.yr, mod.yrtype)) # will give 2 separate tables
```
If we use the `memisc` package, we can combine the models:
```{r, message = FALSE}
library(memisc)
mtab_all <- mtable("Model Year" = mod.yr,
"Model Type" = mod.type,
"Model Both" = mod.yrtype,
summary.stats = c("sigma","R-squared","F","p","N"))
print(mtab_all)
```
If you want to write it out (for Excel), it is tab delimited:
```{r}
write.mtable(mtab_all, file = "my_tab.txt")
```
```{r}
pander(mtab_all)
```
### Not covered - making `mtable` better:
```{r}
renamer = function(model) {
names(model$coefficients) = names(model$coefficients) %>%
str_replace(fixed("factor(dateInstalled)"), "") %>%
str_replace(fixed("(Intercept)"), "Intercept")
names(model$contrasts) = names(model$contrasts) %>%
str_replace(fixed("factor(dateInstalled)"), "") %>%
str_replace(fixed("(Intercept)"), "Intercept")
return(model)
}
mod.yr = renamer(mod.yr)
mod.yrtype = renamer(mod.yrtype)
mod.type = renamer(mod.type)
mtab_all_better <- mtable("Model Year" = mod.yr,
"Model Type" = mod.type,
"Model Both" = mod.yrtype,
summary.stats = c("sigma","R-squared","F","p","N"))
pander(mtab_all_better)
```
Another package called `stargazer` can put models together easily and print them out. So let's use stargazer. Again, you need to use `install.packages("stargazer")` if you don't have function.
```{r}
require(stargazer)
```
OK, so what's the difference here? First off, we said results are "markup", so that it will not try to reformat the output. Also, I didn't want those # for comments, so I just made comment an empty string "".
```{r, results='markup', comment=""}
stargazer(mod.yr, mod.type, mod.yrtype, type = "text")
```
If we use
```{r, results='asis', comment=""}
stargazer(mod.yr, mod.type, mod.yrtype, type="html")
```
## Data Extraction
Let's say I want to get data INTO my text. Like there are N number of bike lanes with a date installed that isn't zero. There are `r nrow(no.missyear)` bike lanes with a date installed after 2006. So you use one backtick ` and then you say "r" to tell that it's R code. And then you run R code that gets evaulated and then returns the value. Let's say you want to compute a bunch of things:
```{r computes}
### let's get number of bike lanes installed by year
n.lanes = no.missyear %>% group_by(dateInstalled) %>% dplyr::summarize(n())
class(n.lanes)
print(n.lanes)
n.lanes = as.data.frame(n.lanes)
print(n.lanes)
```
```{r}
colnames(n.lanes) <- c("date", "nlanes")
n2009 <- filter(n.lanes, date == 2009)
n2010 <- filter(n.lanes, date == 2010)
getwd()
```
Now I can just say there are `r n2009` lanes in 2009 and `r n2010` in 2010.
```{r}
fname <- "http://johnmuschelli.com/intro_to_r/data/Charm_City_Circulator_Ridership.csv"
## file.path takes a directory and makes a full name with a full file path
charm = read.csv(fname, as.is=TRUE)
library(chron)
days = levels(weekdays(1, abbreviate=FALSE))
charm$day <- factor(charm$day, levels=days)
charm$date <- as.Date(charm$date, format="%m/%d/%Y")
cn <- colnames(charm)
daily <- charm[, c("day", "date", "daily")]
```
```{r}
charm$daily <- NULL
require(reshape)
long.charm <- melt(charm, id.vars = c("day", "date"))
long.charm$type <- "Boardings"
long.charm$type[ grepl("Alightings", long.charm$variable)] <- "Alightings"
long.charm$type[ grepl("Average", long.charm$variable)] <- "Average"
long.charm$line <- "orange"
long.charm$line[ grepl("purple", long.charm$variable)] <- "purple"
long.charm$line[ grepl("green", long.charm$variable)] <- "green"
long.charm$line[ grepl("banner", long.charm$variable)] <- "banner"
long.charm$variable <- NULL
long.charm$line <-factor(long.charm$line, levels=c("orange", "purple",
"green", "banner"))
head(long.charm)
### NOW R has a column of day, the date, a "value", the type of value and the
### circulator line that corresponds to it
### value is now either the Alightings, Boardings, or Average from the charm dataset
```
Let's do some plotting now!
```{r plots}
require(ggplot2)
### let's make a "ggplot"
### the format is ggplot(dataframe, aes(x=COLNAME, y=COLNAME))
### where COLNAME are colnames of the dataframe
### you can also set color to a different factor
### other options in AES (fill, alpha level -which is the "transparency" of points)
g <- ggplot(long.charm, aes(x=date, y=value, color=line))
### let's change the colors to what we want- doing this manually, not letting it choose
### for me
g <- g + scale_color_manual(values=c("orange", "purple", "green", "blue"))
### plotting points
g + geom_point()
### Let's make Lines!
g + geom_line()
### let's make a new plot of poitns
gpoint <- g + geom_point()
### let's plot the value by the type of value - boardings/average, etc
gpoint + facet_wrap(~ type)
```
OK let's turn off some warnings - making `warning=FALSE` (in knitr) as an option.
```{r, warning=FALSE}
## let's compare vertically
gpoint + facet_wrap(~ type, ncol=1)
gfacet = g + facet_wrap(~ type, ncol=1)
```
We can also smooth the data to give us a overall idea of how the average changes over time. I don't want to do a standard error (`se`).
```{r, warning=FALSE}
## let's smooth this - get a rough estimate of what's going on
gfacet + geom_smooth(se=FALSE)
```
OK, I've seen enough code, let's turn that off, using `echo=FALSE`.
```{r, echo=FALSE, warning=FALSE, fig.width=10, fig.height=5}
#### COMBINE! - let's make the line width bigger (lwd)
### also making the "alpha level" (transparency) low for the point sos we can see the lines
g + geom_point(alpha=0.2) + geom_smooth(se=FALSE, lwd=1.5) + facet_wrap( ~ type)
```
There are still messages, but we can turn these off with `message = FALSE`
```{r, echo=FALSE, warning=FALSE, message = FALSE, fig.width=10, fig.height=5}
g + geom_point(alpha=0.2) + geom_smooth(se=FALSE, lwd=1.5) + facet_wrap( ~ type)
```