---
title: "Data Classes"
author: "Introduction to R for Public Health Researchers"
output:
  ioslides_presentation:
    css: ../styles.css
    widescreen: yes
---
```{r, echo = FALSE, message=FALSE}
# library(dplyr)
suppressPackageStartupMessages(library(dplyr))
library(readr)
library(forcats)
```
## Data Types:
* One dimensional types ('vectors'):
    * Character: strings or individual characters, quoted
    * Numeric: any real number(s)
    * Integer: any integer(s)/whole numbers
    * Factor: categorical/qualitative variables
    * Logical: variables composed of TRUE or FALSE
    * Date/POSIXct: represents calendar dates and times
## Character and numeric
We have already covered `character` and `numeric` types.
```{r numChar}
class(c("Andrew", "Jaffe"))
class(c(1, 4, 7))
```
## Integer
`Integer` is a special subset of `numeric` that contains only whole numbers.

A sequence of numbers is an example of the integer type:
```{r seq}
x = seq(from = 1, to = 5) # seq() is a function
x
class(x)
```
## Integer
The colon `:` is a shortcut for making sequences of numbers.

It makes a consecutive integer sequence from `[num1]` to `[num2]` by 1:
```{r seqShort}
1:5
```
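## Integer

A minimal sketch of how the integer/numeric distinction shows up in practice. The variable names `whole` and `whole_int` are hypothetical and used only for illustration.

```{r}
whole = 5       # a bare number is stored as numeric (double)
whole_int = 5L  # the L suffix asks R for an integer
class(whole)
class(whole_int)
class(1:5)          # the colon operator returns integers
whole == whole_int  # the two still compare as equal
```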
## Logical
`logical` is a type that only has two possible elements: `TRUE` and `FALSE`
```{r logical1}
x = c(TRUE, FALSE, TRUE, TRUE, FALSE)
class(x)
is.numeric(c("Andrew", "Jaffe"))
is.character(c("Andrew", "Jaffe"))
```
## Logical
Note that `logical` elements are NOT in quotes.
```{r logical2}
z = c("TRUE", "FALSE", "TRUE", "FALSE")
class(z)
as.logical(z)
```
Bonus: `sum()` and `mean()` work on `logical` vectors - they return the total and proportion of `TRUE` elements, respectively.
```{r logical_z}
sum(as.logical(z))
```
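## Logical

As a small illustration of the bonus above, `mean()` gives the proportion of `TRUE` values. The `ages` vector below is made-up example data, not part of the course data sets.

```{r}
ages = c(12, 25, 31, 47, 8)  # hypothetical ages
is_adult = ages >= 18        # a comparison returns a logical vector
is_adult
sum(is_adult)   # how many are TRUE
mean(is_adult)  # proportion that are TRUE
```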
## General Class Information
There are two useful functions associated with practically all R classes, which relate to logically checking the underlying class (`is.CLASS_()`) and coercing between classes (`as.CLASS_()`).
```{r logical_coercion}
is.numeric(c("Andrew", "Jaffe"))
is.character(c("Andrew", "Jaffe"))
```
## General Class Information
There are two useful functions associated with practically all R classes, which relate to logically checking the underlying class (`is.CLASS_()`) and coercing between classes (`as.CLASS_()`).
```{r logical_coercion2}
as.character(c(1, 4, 7))
as.numeric(c("Andrew", "Jaffe"))
```
## Factors
A `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables:
```{r factor1}
x = factor(c("boy", "girl", "girl", "boy", "girl"))
x
class(x)
```
Note that levels are, by default, in alphanumerical order.
## Factors
Factors are used to represent categorical data, and can also be used for ordinal data (i.e., categories that have an intrinsic ordering).

Note that before R 4.0.0, `read.csv()` read character strings in as factors by default (`stringsAsFactors = TRUE`); since R 4.0.0 the default is `FALSE`. `read_csv()` does not convert strings to factors by default.

From `?factor`: "The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered."
```
factor(x = character(), levels, labels = levels,
exclude = NA, ordered = is.ordered(x))
```
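## Factors

A short sketch of what `ordered = TRUE` buys you, assuming a hypothetical `severity` variable: comparisons and `max()` follow the level order rather than the alphabet.

```{r}
severity = factor(c("mild", "severe", "moderate", "mild"),
                  levels = c("mild", "moderate", "severe"),
                  ordered = TRUE)  # hypothetical ordinal data
severity
severity < "severe"  # compares by position in `levels`
max(severity)
```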
## Necessary for the lab: `%in%`
```{r}
x = c(0, 2, 2, 3, 4)
(x == 0 | x == 2)
```
Introduce the `%in%` operator:
```{r}
x %in% c(0, 2) # NEVER has NA and returns logical
```
This reads "return `TRUE` if `x` is 0 or 2" (like `inlist` in Stata).
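## Necessary for the lab: `%in%`

To illustrate the "never has NA" comment, here is a small sketch with a missing value. The vector `vals` is made up for this example.

```{r}
vals = c(0, 2, NA, 3)  # hypothetical data with a missing value
vals == 0 | vals == 2  # the NA carries through the comparison
vals %in% c(0, 2)      # the NA becomes FALSE
```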
## Lab Part 1
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Factors
Suppose we have a vector of case-control status
```{r factor2}
cc = factor(c("case","case","case",
"control","control","control"))
cc
```
We can reset the levels using the `levels` function, but this is **bad** and can cause problems: it renames the existing levels in place, so the data values themselves change. You should do this using the `levels` argument in `factor()`.
```{r}
levels(cc) = c("control","case")
cc
```
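## Factors

A minimal side-by-side sketch of the two approaches, using a hypothetical `status` vector, to show how assigning to `levels()` relabels the data while the `levels` argument only reorders the levels.

```{r}
status = c("case", "case", "control")  # hypothetical data
relabeled = factor(status)
levels(relabeled) = c("control", "case")  # renames levels in place
as.character(relabeled)                   # the data changed!
reordered = factor(status, levels = c("control", "case"))
as.character(reordered)                   # the data are unchanged
```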
## Factors
Note that the levels are alphabetically ordered by default. We can also specify the levels within the factor call
```{r factor_cc_again}
casecontrol = c("case","case","case","control",
"control","control")
factor(casecontrol, levels = c("control","case") )
factor(casecontrol, levels = c("control","case"),
ordered=TRUE)
```
## Factors
Another way to do this once you already have the factor made is with the `relevel()` function.
```{r factorCheck}
cc = factor(c("case","case","case",
"control","control","control"))
relevel(cc, "control")
```
## Factors
One of the core "tidyverse" packages is `forcats` which offers useful functionality for interacting with factors. For example, there is a function for releveling factors here:
```{r}
fct_relevel(cc, "control")
```
## Factors
There are other useful functions for dictating the levels of factors, such as ordering them as they appear in the vector, ordering them by frequency, or collapsing them into groups.
```{r}
levels(fct_inorder(chickwts$feed))
levels(fct_infreq(chickwts$feed))
levels(fct_lump(chickwts$feed, n=1))
```
## Factors
Factors can be converted to `numeric` or `character` very easily
```{r factor3}
x = factor(casecontrol,
levels = c("control","case") )
as.character(x)
as.numeric(x)
```
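## Factors

One caution worth sketching: `as.numeric()` on a factor returns the internal level codes, not the labels. The `dose` factor below is hypothetical; if the labels look like numbers, going through `as.character()` first recovers them.

```{r}
dose = factor(c("10", "50", "10", "100"))  # hypothetical numeric-looking labels
levels(dose)                    # sorted as strings, not as numbers
as.numeric(dose)                # level codes
as.numeric(as.character(dose))  # the original numbers
```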
## Creating categorical variables
The `rep()` ["repeat"] function is useful for creating new variables
```{r rep1}
bg = rep(c("boy","girl"),each=50)
head(bg)
bg2 = rep(c("boy","girl"),times=50)
head(bg2)
length(bg) == length(bg2)
```
## Lab Part 2
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Dates
You can convert date-like strings to the `Date` class (see http://www.statmethods.net/input/dates.html for more info) using the `lubridate` package!
```{r, message = FALSE}
circ = jhur::read_circulator()
head(sort(circ$date))
library(lubridate) # great for dates!
circ = mutate(circ, newDate2 = mdy(date))
head(circ$newDate2)
range(circ$newDate2) # gives you the range of the data
```
## Works great - but you still need to specify the correct format
See `?ymd` and `?ymd_hms`
```{r, message = FALSE}
x = c("2014-02-4 05:02:00", "2016/09/24 14:02:00")
ymd_hms(x)
```
```{r}
ymd_hm(x)
```
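## Dates

As a rough illustration of matching the parser to the format, the `lubridate` function name spells out the order of the components. The strings below are made-up examples that all describe the same day.

```{r, message = FALSE}
library(lubridate)
iso_day = ymd("2014-02-04")  # year, month, day
us_day = mdy("02/04/2014")   # month, day, year
eu_day = dmy("04/02/2014")   # day, month, year
iso_day
us_day == iso_day
eu_day == iso_day
```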
## POSIXct
The `POSIXct` class is like a more general date format (with hours, minutes, seconds).
```{r, message = FALSE}
x = c("2014-02-4 05:02:00", "2016/09/24 14:02:00")
dates = ymd_hms(x)
class(dates)
```
## Adding Periods of time
The `as.period` command is helpful for adding time to a date:
```{r}
theTime = Sys.time()
theTime
class(theTime)
theTime + as.period(20, unit = "minutes") # the future
```
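## Adding Periods of time

A reproducible sketch using a fixed time instead of `Sys.time()`. It uses the `lubridate` helpers `minutes()`, `hours()` and `days()`, which build periods directly; the `shift_start` value is hypothetical.

```{r, message = FALSE}
library(lubridate)
shift_start = ymd_hms("2020-01-01 08:00:00")  # hypothetical start time
shift_start + minutes(20)
shift_start + days(1) + hours(2)
```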
## Differences in Times
You can subtract times as well. The `difftime` function is helpful because you can set the units (note that it computes `time1 - time2`):
```{r}
the_future = ymd_hms("2020-12-31 11:59:59")
the_future - theTime
difftime(the_future, theTime, units = "weeks")
```
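## Differences in Times

Because the result above depends on when the slides are knit, here is a small sketch with two fixed, made-up dates so the output is reproducible.

```{r}
start_day = as.Date("2020-01-01")  # hypothetical dates
end_day = as.Date("2020-12-31")
end_day - start_day                # units chosen automatically
gap = difftime(end_day, start_day, units = "weeks")
gap
as.numeric(gap, units = "days")    # plain number, in the units you ask for
```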
## Lab Part 3
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Website
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Data Classes:
* Two dimensional classes:
    * `data.frame`: traditional 'Excel' spreadsheets
        * Each column can have a different class, from above
    * Matrix: two-dimensional data, composed of rows and columns. Unlike data frames, the entire matrix is composed of one R class, e.g. all numeric or all characters.
## Matrices
```{r matrix}
n = 1:9
n
mat = matrix(n, nrow = 3)
mat
```
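## Matrices

`matrix()` fills column by column unless told otherwise. A short sketch with hypothetical object names `by_col` and `by_row`, using the `byrow` argument:

```{r}
by_col = matrix(1:6, nrow = 2)                # fills down the columns
by_row = matrix(1:6, nrow = 2, byrow = TRUE)  # fills across the rows
by_col
by_row
dim(by_row)  # number of rows, then number of columns
```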
## Data Selection
Matrices have two "slots" you can use to select data, which represent rows and columns, that are separated by a comma, so the syntax is `matrix[row,column]`. Note you cannot use `dplyr` functions on matrices.
```{r subset3}
mat[1, 1] # individual entry: row 1, column 1
mat[1, ] # first row
mat[, 1] # first column
```
## Data Selection
Note that the class of the returned object is no longer a matrix: selecting a single row or column returns a plain vector.
```{r subset4}
class(mat[1, ])
class(mat[, 1])
```
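## Data Selection

If you need the result to stay a matrix, one option is the `drop = FALSE` argument to `[`. A minimal sketch, using a hypothetical matrix called `grid`:

```{r}
grid = matrix(1:9, nrow = 3)           # hypothetical matrix
one_row = grid[1, , drop = FALSE]      # keep the matrix structure
dim(one_row)
is.matrix(one_row)
is.matrix(grid[1, ])                   # the default drops to a vector
```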
## Data Frames
To review, the `data.frame`/`tbl_df` is the other two-dimensional class.

Again, data frames are like matrices, but each column is a vector that can have its own class. So some columns might be `character` and others might be `numeric`, while others may be a `factor`.
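## Data Frames

A small sketch of a data frame whose columns have different classes. The `patients` data are invented for illustration.

```{r}
patients = data.frame(
  id = 1:3,
  name = c("A", "B", "C"),
  group = factor(c("case", "control", "case")),
  stringsAsFactors = FALSE  # keep `name` as character in any R version
)
sapply(patients, class)  # one class per column
```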
## Lists
* One other data type, the most generic, is the `list`.
* Can be created using `list()`
* Can hold vectors, strings, matrices, models, lists of other lists, lists upon lists!
* Can reference data using `$` (if the elements are named), or using `[]`, or `[[]]`
```{r makeList, comment="", prompt=TRUE}
mylist <- list(letters=c("A", "b", "c"),
numbers=1:3, matrix(1:25, ncol=5))
```
## List Structure
```{r Lists, comment="", prompt=TRUE}
head(mylist)
```
## List referencing
```{r Listsref1, comment="", prompt=TRUE}
mylist[1] # returns a list
mylist["letters"] # returns a list
```
## List referencing
```{r Listsrefvec, comment="", prompt=TRUE}
mylist[[1]] # returns the vector 'letters'
mylist$letters # returns vector
mylist[["letters"]] # returns the vector 'letters'
```
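## List referencing

To make the difference between single and double brackets concrete, this sketch checks the class and length of each result. The `pets` list is a made-up example.

```{r}
pets = list(names = c("Rex", "Tom"), ages = c(3, 5))  # hypothetical list
class(pets["names"])     # single brackets keep the list wrapper
class(pets[["names"]])   # double brackets pull out the element itself
length(pets["names"])    # a list holding one element
length(pets[["names"]])  # the vector inside it
```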
## List referencing
You can also select multiple elements of a list with the single brackets.
```{r Listsref2, comment="", prompt=TRUE}
mylist[1:2] # returns a list
```
## List referencing
You can also select down several levels of a list at once:
```{r Listsref3, comment="", prompt=TRUE}
mylist$letters[1]
mylist[[2]][1]
mylist[[3]][1:2,1:2]
```
## Quick Aside: "slicing" data: like _n and _N in Stata
In `dplyr`, there are `first`, `last` and `nth` functions.

If you first sort a data set using `arrange`, you can grab the first or last value like so:
```{r, message=FALSE}
circ %>%
mutate(first_date = first(newDate2),
last_date = last(newDate2),
third_date = nth(newDate2, 3)) %>%
select(day, date, first_date, last_date, third_date) %>% head(3)
```
## Quick Aside: "slicing" data
Many times, you need to group first
```{r, message=FALSE}
circ %>%
group_by(day) %>%
mutate(first_date = first(newDate2),
last_date = last(newDate2),
third_date = nth(newDate2, 3)) %>%
select(day, date, first_date, last_date, third_date) %>% head(3)
```
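## Quick Aside: "slicing" data

A self-contained sketch of the same idea on a tiny made-up data set (`visits`), so the grouped results are easy to check by eye. It uses `summarize()` rather than `mutate()` to return one row per group; that choice is an assumption for illustration, not the approach used above.

```{r, message = FALSE}
library(dplyr)
visits = data.frame(
  clinic = c("A", "A", "A", "B", "B"),  # hypothetical grouping variable
  count = c(5, 9, 2, 7, 4)
)
visit_summary = visits %>%
  group_by(clinic) %>%
  summarize(first_count = first(count),
            last_count = last(count),
            second_count = nth(count, 2))
visit_summary
```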
## Differences in Times
```{r, message=FALSE}
circ = circ %>%
group_by(day) %>%
mutate(first_date = first(newDate2),
diff_from_first = difftime( # time1 - time2
time1 = newDate2, time2 = first_date))
head(circ$diff_from_first, 10)
units(circ$diff_from_first) = "days"
head(circ$diff_from_first, 10)
```
## Website
[Website](http://johnmuschelli.com/intro_to_r/index.html)