--- title: "Data Cleaning" author: "Introduction to R for Public Health Researchers" output: ioslides_presentation: css: ../styles.css widescreen: yes --- ```{r, echo = FALSE, message = FALSE} library(knitr) opts_chunk$set(comment = "") library(readr) suppressPackageStartupMessages(library(dplyr)) library(tidyverse) ``` # Before Cleaning - Subsetting with Brackets ## Select specific elements using an index Often you only want to look at subsets of a data set at any given time. Elements of an R object are selected using the brackets (`[` and `]`). For example, `x` is a vector of numbers and we can select the second element of `x` using the brackets and an index (2): ```{r} x = c(1, 4, 2, 8, 10) x[2] ``` `dplyr`: ```{r} nth(x, n = 2) ``` ## Select specific elements using an index We can select the fifth or second AND fifth elements below: ```{r} x = c(1, 2, 4, 8, 10) x[5] x[c(2,5)] ``` ```{r, error=TRUE} nth(x, n = c(2, 5)) # nth only returns one number ``` ## Subsetting by deletion of entries You can put a minus (`-`) before integers inside brackets to remove these indices from the data. ```{r negativeIndex} x[-2] # all but the second ``` Note that you have to be careful with this syntax when dropping more than 1 element: ```{r negativeIndex2} x[-c(1,2,3)] # drop first 3 # x[-1:3] # shorthand. R sees as -1 to 3 x[-(1:3)] # needs parentheses ``` ## Select specific elements using logical operators What about selecting rows based on the values of two variables? We use logical statements. Here we select only elements of `x` greater than 2: ```{r} x x > 2 x[ x > 2 ] ``` ## Select specific elements using logical operators You can have multiple logical conditions using the following: * `&` : AND * `|` : OR ```{r} x[ x > 2 & x < 5 ] x[ x > 5 | x == 2 ] ``` ## which function The `which` functions takes in logical vectors and returns the index for the elements where the logical value is `TRUE`. ```{r} which(x > 5 | x == 2) # returns index x[ which(x > 5 | x == 2) ] x[ x > 5 | x == 2 ] ``` ## Data Cleaning In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable. MOST IMPORTANT RULE - LOOK AT YOUR DATA! ## Useful checking functions - `is.na` - is `TRUE` if the data is `FALSE` otherwise - `!` - negation (NOT) - if `is.na(x)` is `TRUE`, then `!is.na(x)` is `FALSE` - `all` takes in a `logical` and will be `TRUE` if ALL are `TRUE` - `all(!is.na(x))` - are all values of `x` NOT NA - `any` will be `TRUE` if ANY are true - `any(is.na(x))` - do we have any `NA`'s in `x`? - `complete.cases` - returns `TRUE` if EVERY value of a row is NOT `NA` - very stringent condition - `FALSE` missing one value (even if not important) - `tidyr::drop_na` will drop rows with **any** missing # Dealing with Missing Data ## Missing data types One of the most important aspects of data cleaning is missing values. Types of "missing" data: * `NA` - general missing data * `NaN` - stands for "**N**ot **a** **N**umber", happens when you do 0/0. * `Inf` and `-Inf` - Infinity, happens when you take a positive number (or negative number) by 0. ## Finding Missing data Each missing data type has a function that returns `TRUE` if the data is missing: * `NA` - `is.na` * `NaN` - `is.nan` * `Inf` and `-Inf` - `is.infinite` * `is.finite` returns `FALSE` for all missing data and `TRUE` for non-missing ## Missing Data with Logicals One important aspect (esp with subsetting) is that logical operations return `NA` for `NA` values. Think about it, the data could be `> 2` or not we don't know, so `R` says there is no `TRUE` or `FALSE`, so that is missing: ```{r} x = c(0, NA, 2, 3, 4, -0.5, 0.2) x > 2 ``` ## Missing Data with Logicals What to do? What if we want if `x > 2` and `x` isn't `NA`? Don't do `x != NA`, do `x > 2` and `x` is NOT `NA`: ```{r} x != NA x > 2 & !is.na(x) ``` ## Missing Data with Logicals What about seeing if a value is equal to multiple values? You can do `(x == 1 | x == 2) & !is.na(x)`, but that is not efficient. ```{r} (x == 0 | x == 2) # has NA (x == 0 | x == 2) & !is.na(x) # No NA ``` what to do? ## Missing Data with Logicals: `%in%` Filter removes missing values, have to keep them if you want them: ```{r} df = tibble(x = x) df %>% filter(x > 2) df %>% filter(between(x, -1, 3) | is.na(x)) ``` ## `dplyr::filter` Be careful with missing data using subsetting: ```{r} x %in% c(0, 2, NA) # this x %in% c(0, 2) | is.na(x) # versus this ``` ## Missing Data with Operations Similarly with logicals, operations/arithmetic with `NA` will result in `NA`s: ```{r} x + 2 x * 2 ``` ## Lab Part 1 [Website](http://johnmuschelli.com/intro_to_r/index.html) # Tables and Tabulations ## Useful checking functions - `unique` - gives you the unique values of a variable - `table(x)` - will give a one-way table of `x` - `table(x, useNA = "ifany")` - will have row NA - `table(x, y)` - will give a cross-tab of `x` and `y` - `df %>% count(x, y)` - `df %>% group_by(x, y) %>% tally` ## Creating One-way Tables Here we will use `table` to make tabulations of the data. Look at `?table` to see options for missing data. ```{r table} unique(x) table(x) table(x, useNA = "ifany") # will not df %>% count(x) ``` ## Creating One-way Tables `useNA = "ifany"` will not have NA in table heading if no `NA`: ```{r onetab_ifany} table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3), useNA = "ifany") tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3)) %>% count(x) ``` ## Creating One-way Tables You can set `useNA = "always"` to have it always have a column for `NA` ```{r onetab} table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3), useNA = "always") ``` ## Tables with Factors If you use a `factor`, all levels will be given even if no exist! - (May be wanted or not): ```{r onetab_fact, error = TRUE} fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), levels = 1:4) tab = table(fac) tab tab[ tab > 0 ] tibble(x = fac) %>% count(x) ``` ## Creating Two-way Tables A two-way table. If you pass in 2 vectors, `table` creates a 2-dimensional table. ```{r} tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3), useNA = "always") tab ``` ## Creating Two-way Tables ```{r} tab_df = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3)) tab_df %>% count(x, y) ``` ## Finding Row or Column Totals `margin.table` finds the marginal sums of the table. `margin` is 1 for rows, 2 for columns in general in `R`. Here is the column sums of the table: ```{r margin} margin.table(tab, 2) ``` ## Proportion Tables `prop.table` finds the marginal proportions of the table. Think of it dividing the table by it's respective marginal totals. If `margin` not set, divides by overall total. ```{r table2} prop.table(tab) prop.table(tab,1) * 100 ``` ## Creating Two-way Tables ```{r} tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = n / sum(n)) ``` ## Creating Two-way Tables ```{r} library(scales) tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = percent(n / sum(n))) ``` ## Lab Part 2 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Download Salary FY2014 Data From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53, from https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv Read the CSV into R `Sal`: ```{r readSal, echo = TRUE, eval = FALSE} Sal = jhur::read_salaries() # or Sal = read_csv("https://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv") Sal = rename(Sal, Name = name) ``` ```{r readSal_csv, echo= FALSE, eval = TRUE} Sal = read_csv("https://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv", col_types = cols( name = col_character(), JobTitle = col_character(), AgencyID = col_character(), Agency = col_character(), HireDate = col_character(), AnnualSalary = col_character(), GrossPay = col_character() )) Sal = rename(Sal, Name = name) ``` ## Checking for logical conditions * `any()` - checks if there are any `TRUE`s * `all()` - checks if ALL are true ```{r isna} head(Sal,2) any(is.na(Sal$Name)) # are there any NAs? ``` # Recoding Variables ## Example of Recoding For example, let's say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements. In `dplyr` you can use the `recode` function: ```{r, eval = FALSE} data = data %>% mutate(gender = recode(gender, M = "Male", m = "Male", M = "Male")) ``` or use `ifelse`: ```{r, eval = FALSE} data %>% mutate(gender = ifelse(gender %in% c("Male", "M", "m"), "Male", gender)) ``` ## Example of Cleaning: more complicated Sometimes though, it's not so simple. That's where functions that find patterns come in very useful. ```{r gender, echo=FALSE} set.seed(4) # random sample below - make sure same every time gender <- sample(c("Male", "mAle", "MaLe", "M", "MALE", "Ma", "FeMAle", "F", "Woman", "Man", "Fm", "FEMALE"), 1000, replace = TRUE) ``` ```{r gentab} table(gender) ``` ## Example of Cleaning: more complicated ```{r gender2, echo=FALSE} gender = gender %>% tolower %>% recode(m = "Male", f = "Female", ma = "Male", woman = "Female", man = "Male") ``` ```{r gentab2} table(gender) ``` # Strings functions ## Splitting/Find/Replace and Regular Expressions * R can do much more than find exact matches for a whole string * Like Perl and other languages, it can use regular expressions. * What are regular expressions? * Ways to search for specific strings * Can be very complicated or simple * Highly Useful - think "Find" on steroids ## A bit on Regular Expressions * http://www.regular-expressions.info/reference.html * They can use to match a large number of strings in one statement * `.` matches any single character * `*` means repeat as many (even if 0) more times the last character * `?` makes the last thing optional * `^` matches start of vector `^a` - starts with "a" * `$` matches end of vector `b$` - ends with "b" ## The `stringr` package The `stringr` package: * Makes string manipulation more intuitive * Has a standard format for most functions * the first argument is a string like first argument is a `data.frame` in `dplyr` * We will not cover `grep` or `gsub` - base R functions - are used on forums for answers * Almost all functions start with `str_*` ## Let's look at modifier for `stringr` `?modifiers` * `fixed` - match everything exactly * `regex` - default - uses **reg**ular **exp**ressions * `ignore_case` is an option to not have to use `tolower` ## Substring and String Splitting * `str_sub(x, start, end)` - substrings from position start to position end * `str_split(string, pattern)` - splits strings up - returns list! ```{r str_split_orig} library(stringr) x <- c("I really", "like writing", "R code programs") y <- str_split(x, " ") # returns a list y ``` ## Using a fixed expression One example case is when you want to split on a period "`.`". In regular expressions `.` means **ANY** character, so ```{r} str_split("I.like.strings", ".") str_split("I.like.strings", fixed(".")) ``` ## Let's extract from `y` ```{r stsplit2} y[[2]] sapply(y, dplyr::first) # on the fly sapply(y, nth, 2) # on the fly sapply(y, last) # on the fly ``` ## Separating columns based on a separator * From `tidyr`, you can split a data set into multiple columns: ```{r separate_df} df = tibble(x = c("I really", "like writing", "R code programs")) ``` ```{r} df %>% separate(x, into = c("first", "second", "third")) ``` ## Separating columns based on a separator * From `tidyr`, you can split a data set into multiple columns: ```{r separate_df2_for_easy} df = tibble(x = c("I really", "like writing", "R code programs")) ``` ```{r} df %>% separate(x, into = c("first", "second")) ``` ## Separating columns based on a separator * `extra = "merge"` will not drop data. Also, you can specify the separator ```{r separate_df2} df = tibble(x = c("I really", "like. _writing R. But not", "R code programs")) ``` ```{r} df %>% separate(x, into = c("first", "second", "third"), extra = "merge") ``` ## Separating columns based on a separator * `extra = "merge"` will not drop data. Also, you can specify the separator ```{r} df %>% separate(x, into = c("first", "second", "third"), extra = "merge", sep = " ") ``` ## 'Find' functions: `stringr` `str_detect`, `str_subset`, `str_replace`, and `str_replace_all` search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results. * `str_detect` - returns `TRUE` if `pattern` is found * `str_subset` - returns only the strings which pattern were detected * convenient wrapper around `x[str_detect(x, pattern)]` * `str_extract` - returns only strings which pattern were detected, but ONLY the pattern * `str_replace` - replaces `pattern` with `replacement` the first time * `str_replace_all` - replaces `pattern` with `replacement` as many times matched ## 'Find' functions: Finding Logicals These are the indices where the pattern match occurs: ```{r RawlMatch_log} head(str_detect(Sal$Name, "Rawlings")) ``` ## 'Find' functions: Finding Indices These are the indices where the pattern match occurs: ```{r RawlMatch} which(str_detect(Sal$Name, "Rawlings")) ``` ## Showing difference in `str_extract` `str_extract` extracts just the matched string ```{r ggrep2} ss = str_extract(Sal$Name, "Rawling") head(ss) ss[ !is.na(ss)] ``` ## 'Find' functions: finding values, `stringr` and `dplyr` ```{r ggrep} str_subset(Sal$Name, "Rawlings") Sal %>% filter(str_detect(Name, "Rawlings")) ``` ## Using Regular Expressions * Look for any name that starts with: * Payne at the beginning, * Leonard and then an S * Spence then capital C ```{r grepstar_stringr} head(str_subset( Sal$Name, "^Payne.*"), 3) ``` ```{r grepstar2_stringr} head(str_subset( Sal$Name, "Leonard.?S")) head(str_subset( Sal$Name, "Spence.*C.*")) ``` ## Showing differnce in `str_extract` and `str_extract_all` `str_extract_all` extracts all the matched strings - `\\d` searches for DIGITS/numbers ```{r } head(str_extract(Sal$AgencyID, "\\d")) head(str_extract_all(Sal$AgencyID, "\\d"), 2) ``` ## Showing differnce in `str_replace` and `str_replace_all` `str_replace_all` extracts all the matched strings ```{r } head(str_replace(Sal$Name, "a", "j")) head(str_replace_all(Sal$Name, "a", "j"), 2) ``` ## Replace Let's say we wanted to sort the data set by Annual Salary: ```{r classSal} class(Sal$AnnualSalary) ``` ```{r destringSal} head(Sal$AnnualSalary, 4) head(as.numeric(Sal$AnnualSalary), 4) ``` R didn't like the `$` so it thought turned them all to `NA`. ## Replacing and substituting Now we can replace the `$` with nothing (used `fixed("$")` because `$` means ending): ```{r orderSal} Sal = Sal %>% mutate( AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""), AnnualSalary = as.numeric(AnnualSalary) ) %>% arrange(desc(AnnualSalary)) ``` ## Pasting strings with `paste` and `paste0` Paste can be very useful for joining vectors together: ```{r Paste} paste("Visit", 1:5, sep = "_") paste("Visit", 1:5, sep = "_", collapse = " ") paste("To", "is going be the ", "we go to the store!", sep = "day ") # and paste0 can be even simpler see ?paste0 paste0("Visit",1:5) ``` ## Uniting columns based on a separator * From `tidyr`, you can unite: ```{r unite_df} df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5)) ``` ```{r} df %>% unite(col = "unique_id", id, visit, sep = "_") ``` ## Uniting columns based on a separator * From `tidyr`, you can unite: ```{r unite_df2} df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5)) ``` ```{r} df %>% unite(col = "unique_id", id, visit, sep = "_", remove = FALSE) ``` ## Paste Depicting How Collapse Works ```{r Paste2} paste(1:5) paste(1:5, collapse = " ") ``` ## Useful String Functions Useful String functions * `toupper()`, `tolower()` - uppercase or lowercase your data: * `str_trim()` (in the `stringr` package) or `trimws` in base - will trim whitespace * `nchar` - get the number of characters in a string ## Sorting characters - `sort` - reorders the data - characters work, but not correctly - `rank` - gives the rank of the data - ties are split - `order` - gives the indices, if subset, would give the data sorted - `x[order(x)]` is the same as sorting ```{r orderrank} sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data) order(c("1", "2", "10")) x = rnorm(10) x[1] = x[2] # create a tie rank(x) ``` ## Lab Part 3 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Website [Website](http://johnmuschelli.com/intro_to_r/index.html) # Comparison of `stringr` to base R - not covered # Splitting Strings ## Substringing Very similar: Base R * `substr(x, start, stop)` - substrings from position start to position stop * `strsplit(x, split)` - splits strings up - returns list! `stringr` * `str_sub(x, start, end)` - substrings from position start to position end * `str_split(string, pattern)` - splits strings up - returns list! ## Splitting String: base R In base R, `strsplit` splits a vector on a string into a `list` ```{r} x <- c("I really", "like writing", "R code programs") y <- strsplit(x, split = " ") # returns a list y ``` ## Showing differnce in `str_extract` and `str_extract_all` `str_extract_all` extracts all the matched strings - `\\d` searches for DIGITS/numbers ```{r} head(str_extract(Sal$AgencyID, "\\d")) head(str_extract_all(Sal$AgencyID, "\\d"), 2) ``` ## 'Find' functions: base R `grep`: `grep`, `grepl`, `regexpr` and `gregexpr` search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results. `grep(pattern, x, fixed=FALSE)`, where: * pattern = character string containing a regular expression to be matched in the given character vector. * x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. * If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find) ## 'Find' functions: stringr compared to base R Base R does not use these functions. Here is a "translator" of the `stringr` function to base R functions * `str_detect` - similar to `grepl` (return logical) * `grep(value = FALSE)` is similar to `which(str_detect())` * `str_subset` - similar to `grep(value = TRUE)` - return value of matched * `str_replace` - similar to `sub` - replace one time * `str_replace_all` - similar to `gsub` - replace many times ## Important Comparisons Base R: * Argument order is `(pattern, x)` * Uses option `(fixed = TRUE)` `stringr` * Argument order is `(string, pattern)` aka `(x, pattern)` * Uses function `fixed(pattern)` ## 'Find' functions: Finding Indices These are the indices where the pattern match occurs: ```{r} grep("Rawlings",Sal$Name) which(grepl("Rawlings", Sal$Name)) which(str_detect(Sal$Name, "Rawlings")) ``` ## 'Find' functions: Finding Logicals These are the indices where the pattern match occurs: ```{r} head(grepl("Rawlings",Sal$Name)) head(str_detect(Sal$Name, "Rawlings")) ``` ## 'Find' functions: finding values, base R ```{r} grep("Rawlings",Sal$Name,value=TRUE) Sal[grep("Rawlings",Sal$Name),] ``` ## Showing differnce in `str_extract` `str_extract` extracts just the matched string ```{r} ss = str_extract(Sal$Name, "Rawling") head(ss) ss[ !is.na(ss)] ``` ## Showing differnce in `str_extract` and `str_extract_all` `str_extract_all` extracts all the matched strings ```{r} head(str_extract(Sal$AgencyID, "\\d")) head(str_extract_all(Sal$AgencyID, "\\d"), 2) ``` ## Using Regular Expressions * Look for any name that starts with: * Payne at the beginning, * Leonard and then an S * Spence then capital C ```{r} head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3) ``` ```{r} head(grep("Leonard.?S", x = Sal$Name, value = TRUE)) head(grep("Spence.*C.*", x = Sal$Name, value = TRUE)) ``` ## Using Regular Expressions: `stringr` ```{r} head(str_subset( Sal$Name, "^Payne.*"), 3) ``` ```{r} head(str_subset( Sal$Name, "Leonard.?S")) head(str_subset( Sal$Name, "Spence.*C.*")) ``` ## Replace Let's say we wanted to sort the data set by Annual Salary: ```{r} class(Sal$AnnualSalary) ``` ```{r} sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data) order(c("1", "2", "10")) ``` ## Replace So we must change the annual pay into a numeric: ```{r} head(Sal$AnnualSalary, 4) head(as.numeric(Sal$AnnualSalary), 4) ``` R didn't like the `$` so it thought turned them all to `NA`. `sub()` and `gsub()` can do the replacing part in base R. ## Replacing and subbing Now we can replace the `$` with nothing (used `fixed=TRUE` because `$` means ending): ```{r} Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="", Sal$AnnualSalary, fixed=TRUE)) Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ] Sal[1:5, c("Name", "AnnualSalary", "JobTitle")] ``` ## Replacing and subbing: `stringr` We can do the same thing (with 2 piping operations!) in dplyr ```{r} dplyr_sal = Sal dplyr_sal = dplyr_sal %>% mutate( AnnualSalary = AnnualSalary %>% str_replace( fixed("$"), "") %>% as.numeric) %>% arrange(desc(AnnualSalary)) check_Sal = Sal rownames(check_Sal) = NULL all.equal(check_Sal, dplyr_sal) ``` ## Website [Website](../index.html)