--- title: "Data Classes" author: "Introduction to R for Public Health Researchers" output: ioslides_presentation: css: ../styles.css widescreen: yes --- ```{r, echo = FALSE, message=FALSE} # library(dplyr) suppressPackageStartupMessages(library(dplyr)) library(readr) library(forcats) ``` ## Data Types: * One dimensional types ('vectors'): * Character: strings or individual characters, quoted * Numeric: any real number(s) * Integer: any integer(s)/whole numbers * Factor: categorical/qualitative variables * Logical: variables composed of TRUE or FALSE * Date/POSIXct: represents calendar dates and times ## Character and numeric We have already covered `character` and `numeric` types. ```{r numChar} class(c("Andrew", "Jaffe")) class(c(1, 4, 7)) ``` ## Integer `Integer` is a special subset of `numeric` that contains only whole numbers A sequence of numbers is an example of the integer type ```{r seq} x = seq(from = 1, to = 5) # seq() is a function x class(x) ``` ## Integer The colon `:` is a shortcut for making sequences of numbers It makes consecutive integer sequence from `[num1]` to `[num2]` by 1 ```{r seqShort} 1:5 ``` ## Logical `logical` is a type that only has two possible elements: `TRUE` and `FALSE` ```{r logical1} x = c(TRUE, FALSE, TRUE, TRUE, FALSE) class(x) is.numeric(c("Andrew", "Jaffe")) is.character(c("Andrew", "Jaffe")) ``` ## Logical Note that `logical` elements are NOT in quotes. ```{r logical2} z = c("TRUE", "FALSE", "TRUE", "FALSE") class(z) as.logical(z) ``` Bonus: `sum()` and `mean()` work on `logical` vectors - they return the total and proportion of `TRUE` elements, respectively. ```{r logical_z} sum(as.logical(z)) ``` ## General Class Information There are two useful functions associated with practically all R classes, which relate to logically checking the underlying class (`is.CLASS_()`) and coercing between classes (`as.CLASS_()`). ```{r logical_coercion} is.numeric(c("Andrew", "Jaffe")) is.character(c("Andrew", "Jaffe")) ``` ## General Class Information There are two useful functions associated with practically all R classes, which relate to logically checking the underlying class (`is.CLASS_()`) and coercing between classes (`as.CLASS_()`). ```{r logical_coercion2} as.character(c(1, 4, 7)) as.numeric(c("Andrew", "Jaffe")) ``` ## Factors A `factor` is a special `character` vector where the elements have pre-defined groups or 'levels'. You can think of these as qualitative or categorical variables: ```{r factor1} x = factor(c("boy", "girl", "girl", "boy", "girl")) x class(x) ``` Note that levels are, by default, in alphanumerical order. ## Factors Factors are used to represent categorical data, and can also be used for ordinal data (ie categories have an intrinsic ordering) Note that R reads in character strings as factors by default in functions like `read.csv()` (but not `read_csv`) 'The function factor is used to encode a vector as a factor (the terms 'category' and 'enumerated type' are also used for factors). If argument ordered is TRUE, the factor levels are assumed to be ordered.' ``` factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x)) ``` ## Necessary for the lab: `%in%` ```{r} x = c(0, 2, 2, 3, 4) (x == 0 | x == 2) ``` Introduce the `%in%` operator: ```{r} x %in% c(0, 2) # NEVER has NA and returns logical ``` reads "return `TRUE` if `x` is in 0 or 2". (Like `inlist` in Stata). ## Lab Part 1 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Factors Suppose we have a vector of case-control status ```{r factor2} cc = factor(c("case","case","case", "control","control","control")) cc ``` We can reset the levels using the `levels` function, but this is **bad** and can cause problems. You should do this using the `levels` argument in the `factor()` ```{r} levels(cc) = c("control","case") cc ``` ## Factors Note that the levels are alphabetically ordered by default. We can also specify the levels within the factor call ```{r factor_cc_again} casecontrol = c("case","case","case","control", "control","control") factor(casecontrol, levels = c("control","case") ) factor(casecontrol, levels = c("control","case"), ordered=TRUE) ``` ## Factors Another way to do this once you already have the factor made is with the `relevel()` function. ```{r factorCheck} cc = factor(c("case","case","case", "control","control","control")) relevel(cc, "control") ``` ## Factors One of the core "tidyverse" packages is `forcats` which offers useful functionality for interacting with factors. For example, there is a function for releveling factors here: ```{r} fct_relevel(cc, "control") ``` ## Factors There are other useful functions for dictating the levels of factors, like in the order they appears in the vector, by frequency, or into collapsed groups. ```{r} levels(fct_inorder(chickwts$feed)) levels(fct_infreq(chickwts$feed)) levels(fct_lump(chickwts$feed, n=1)) ``` ## Factors Factors can be converted to `numeric` or `character` very easily ```{r factor3} x = factor(casecontrol, levels = c("control","case") ) as.character(x) as.numeric(x) ``` ## Creating categorical variables The `rep()` ["repeat"] function is useful for creating new variables ```{r rep1} bg = rep(c("boy","girl"),each=50) head(bg) bg2 = rep(c("boy","girl"),times=50) head(bg2) length(bg) == length(bg2) ``` ## Lab Part 2 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Dates You can convert date-like strings in the `Date` class (http://www.statmethods.net/input/dates.html for more info) using the `lubridate` package! ```{r, message = FALSE} circ = jhur::read_circulator() head(sort(circ$date)) library(lubridate) # great for dates! circ = mutate(circ, newDate2 = mdy(date)) head(circ$newDate2) range(circ$newDate2) # gives you the range of the data ``` ## Works great - but need to specy the correct format still See `?ymd` and `?ymd_hms` ```{r, message = FALSE} x = c("2014-02-4 05:02:00", "2016/09/24 14:02:00") ymd_hms(x) ``` ```{r} ymd_hm(x) ``` ## POSIXct The `POSIXct` class is like a more general date format (with hours, minutes, seconds). ```{r, message = FALSE} x = c("2014-02-4 05:02:00", "2016/09/24 14:02:00") dates = ymd_hms(x) class(dates) ``` ## Adding Periods of time The `as.Period` command is helpful for adding time to a date: ```{r} theTime = Sys.time() theTime class(theTime) theTime + as.period(20, unit = "minutes") # the future ``` ## Differences in Times You can subtract times as well, the `difftime` function is helpful as you can set the units (note it does `time1 - time2`): ```{r} the_future = ymd_hms("2020-12-31 11:59:59") the_future - theTime difftime(the_future, theTime, units = "weeks") ``` ## Lab Part 3 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Website [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Data Classes: * Two dimensional classes: * `data.frame`: traditional 'Excel' spreadsheets * Each column can have a different class, from above * Matrix: two-dimensional data, composed of rows and columns. Unlike data frames, the entire matrix is composed of one R class, e.g. all numeric or all characters. ## Matrices ```{r matrix} n = 1:9 n mat = matrix(n, nrow = 3) mat ``` ## Data Selection Matrices have two "slots" you can use to select data, which represent rows and columns, that are separated by a comma, so the syntax is `matrix[row,column]`. Note you cannot use `dplyr` functions on matrices. ```{r subset3} mat[1, 1] # individual entry: row 1, column 1 mat[1, ] # first row mat[, 1] # first columns ``` ## Data Selection Note that the class of the returned object is no longer a matrix ```{r subset4} class(mat[1, ]) class(mat[, 1]) ``` ## Data Frames To review, the `data.frame`/`tbl_df` are the other two dimensional variable classes. Again, data frames are like matrices, but each column is a vector that can have its own class. So some columns might be `character` and others might be `numeric`, while others maybe a `factor`. ## Lists * One other data type that is the most generic are `lists`. * Can be created using list() * Can hold vectors, strings, matrices, models, list of other list, lists upon lists! * Can reference data using $ (if the elements are named), or using [], or [[]] ```{r makeList, comment="", prompt=TRUE} mylist <- list(letters=c("A", "b", "c"), numbers=1:3, matrix(1:25, ncol=5)) ``` ## List Structure ```{r Lists, comment="", prompt=TRUE} head(mylist) ``` ## List referencing ```{r Listsref1, comment="", prompt=TRUE} mylist[1] # returns a list mylist["letters"] # returns a list ``` ## List referencing ```{r Listsrefvec, comment="", prompt=TRUE} mylist[[1]] # returns the vector 'letters' mylist$letters # returns vector mylist[["letters"]] # returns the vector 'letters' ``` ## List referencing You can also select multiple lists with the single brackets. ```{r Listsref2, comment="", prompt=TRUE} mylist[1:2] # returns a list ``` ## List referencing You can also select down several levels of a list at once ```{r Listsref3, comment="", prompt=TRUE} mylist$letters[1] mylist[[2]][1] mylist[[3]][1:2,1:2] ``` ## Quick Aside: "slicing" data: like _n and _N in Stata In `dplyr`, there are `first`, `last` and `nth` operators. If you first sort a data set using `arrange`, you can grab the first or last as so: ```{r, message=FALSE} circ %>% mutate(first_date = first(newDate2), last_date = last(newDate2), third_date = nth(newDate2, 3)) %>% select(day, date, first_date, last_date, third_date) %>% head(3) ``` ## Quick Aside: "slicing" data Many times, you need to group first ```{r, message=FALSE} circ %>% group_by(day) %>% mutate(first_date = first(newDate2), last_date = last(newDate2), third_date = nth(newDate2, 3)) %>% select(day, date, first_date, last_date, third_date) %>% head(3) ``` ## Differences in Times ```{r, message=FALSE} circ = circ %>% group_by(day) %>% mutate(first_date = first(newDate2), diff_from_first = difftime( # time1 - time2 time1 = newDate2, time2 = first_date)) head(circ$diff_from_first, 10) units(circ$diff_from_first) = "days" head(circ$diff_from_first, 10) ``` ## Website [Website](http://johnmuschelli.com/intro_to_r/index.html)