---
title: "Data Cleaning"
author: "Introduction to R for Public Health Researchers"
output:
ioslides_presentation:
css: ../styles.css
widescreen: yes
---
```{r, echo = FALSE, message = FALSE}
library(knitr)
opts_chunk$set(comment = "")
library(readr)
suppressPackageStartupMessages(library(dplyr))
library(tidyverse)
```
# Before Cleaning - Subsetting with Brackets
## Select specific elements using an index
Often you only want to look at subsets of a data set at any given time. Elements of an R object are selected using the brackets (`[` and `]`).
For example, `x` is a vector of numbers and we can select the second element of `x` using the brackets and an index (2):
```{r}
x = c(1, 4, 2, 8, 10)
x[2]
```
`dplyr`:
```{r}
nth(x, n = 2)
```
## Select specific elements using an index
We can select the fifth or second AND fifth elements below:
```{r}
x = c(1, 2, 4, 8, 10)
x[5]
x[c(2,5)]
```
```{r, error=TRUE}
nth(x, n = c(2, 5)) # nth only returns one number
```
## Subsetting by deletion of entries
You can put a minus (`-`) before integers inside brackets to remove these indices from the data.
```{r negativeIndex}
x[-2] # all but the second
```
Note that you have to be careful with this syntax when dropping more than 1 element:
```{r negativeIndex2}
x[-c(1,2,3)] # drop first 3
# x[-1:3] # shorthand. R sees as -1 to 3
x[-(1:3)] # needs parentheses
```
## Select specific elements using logical operators
What about selecting rows based on the values of two variables? We use logical statements. Here we select only elements of `x` greater than 2:
```{r}
x
x > 2
x[ x > 2 ]
```
## Select specific elements using logical operators
You can have multiple logical conditions using the following:
* `&` : AND
* `|` : OR
```{r}
x[ x > 2 & x < 5 ]
x[ x > 5 | x == 2 ]
```
## which function
The `which` functions takes in logical vectors and returns the index for the elements where the logical value is `TRUE`.
```{r}
which(x > 5 | x == 2) # returns index
x[ which(x > 5 | x == 2) ]
x[ x > 5 | x == 2 ]
```
## Data Cleaning
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
## Useful checking functions
- `is.na` - is `TRUE` if the data is `FALSE` otherwise
- `!` - negation (NOT)
- if `is.na(x)` is `TRUE`, then `!is.na(x)` is `FALSE`
- `all` takes in a `logical` and will be `TRUE` if ALL are `TRUE`
- `all(!is.na(x))` - are all values of `x` NOT NA
- `any` will be `TRUE` if ANY are true
- `any(is.na(x))` - do we have any `NA`'s in `x`?
- `complete.cases` - returns `TRUE` if EVERY value of a row is NOT `NA`
- very stringent condition
- `FALSE` missing one value (even if not important)
- `tidyr::drop_na` will drop rows with **any** missing
# Dealing with Missing Data
## Missing data types
One of the most important aspects of data cleaning is missing values.
Types of "missing" data:
* `NA` - general missing data
* `NaN` - stands for "**N**ot **a** **N**umber", happens when you do 0/0.
* `Inf` and `-Inf` - Infinity, happens when you take a positive number (or negative number) by 0.
## Finding Missing data
Each missing data type has a function that returns `TRUE` if the data is missing:
* `NA` - `is.na`
* `NaN` - `is.nan`
* `Inf` and `-Inf` - `is.infinite`
* `is.finite` returns `FALSE` for all missing data and `TRUE` for non-missing
## Missing Data with Logicals
One important aspect (esp with subsetting) is that logical operations return `NA` for `NA` values. Think about it, the data could be `> 2` or not we don't know,
so `R` says there is no `TRUE` or `FALSE`, so that is missing:
```{r}
x = c(0, NA, 2, 3, 4, -0.5, 0.2)
x > 2
```
## Missing Data with Logicals
What to do? What if we want if `x > 2` and `x` isn't `NA`?
Don't do `x != NA`, do `x > 2` and `x` is NOT `NA`:
```{r}
x != NA
x > 2 & !is.na(x)
```
## Missing Data with Logicals
What about seeing if a value is equal to multiple values? You can do `(x == 1 | x == 2) & !is.na(x)`, but that is not efficient.
```{r}
(x == 0 | x == 2) # has NA
(x == 0 | x == 2) & !is.na(x) # No NA
```
what to do?
## Missing Data with Logicals: `%in%`
Filter removes missing values, have to keep them if you want them:
```{r}
df = tibble(x = x)
df %>% filter(x > 2)
df %>% filter(between(x, -1, 3) | is.na(x))
```
## `dplyr::filter`
Be careful with missing data using subsetting:
```{r}
x %in% c(0, 2, NA) # this
x %in% c(0, 2) | is.na(x) # versus this
```
## Missing Data with Operations
Similarly with logicals, operations/arithmetic with `NA` will result in `NA`s:
```{r}
x + 2
x * 2
```
## Lab Part 1
[Website](http://johnmuschelli.com/intro_to_r/index.html)
# Tables and Tabulations
## Useful checking functions
- `unique` - gives you the unique values of a variable
- `table(x)` - will give a one-way table of `x`
- `table(x, useNA = "ifany")` - will have row NA
- `table(x, y)` - will give a cross-tab of `x` and `y`
- `df %>% count(x, y)`
- `df %>% group_by(x, y) %>% tally`
## Creating One-way Tables
Here we will use `table` to make tabulations of the data. Look at `?table` to see options for missing data.
```{r table}
unique(x)
table(x)
table(x, useNA = "ifany") # will not
df %>% count(x)
```
## Creating One-way Tables
`useNA = "ifany"` will not have NA in table heading if no `NA`:
```{r onetab_ifany}
table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3),
useNA = "ifany")
tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3)) %>% count(x)
```
## Creating One-way Tables
You can set `useNA = "always"` to have it always have a column for `NA`
```{r onetab}
table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3),
useNA = "always")
```
## Tables with Factors
If you use a `factor`, all levels will be given even if no exist!
- (May be wanted or not):
```{r onetab_fact, error = TRUE}
fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
levels = 1:4)
tab = table(fac)
tab
tab[ tab > 0 ]
tibble(x = fac) %>% count(x)
```
## Creating Two-way Tables
A two-way table. If you pass in 2 vectors, `table` creates a 2-dimensional table.
```{r}
tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3),
useNA = "always")
tab
```
## Creating Two-way Tables
```{r}
tab_df = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3),
y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3))
tab_df %>% count(x, y)
```
## Finding Row or Column Totals
`margin.table` finds the marginal sums of the table. `margin` is 1 for rows, 2 for columns in general in `R`. Here is the column sums of the table:
```{r margin}
margin.table(tab, 2)
```
## Proportion Tables
`prop.table` finds the marginal proportions of the table. Think of it dividing the table by it's respective marginal totals. If `margin` not set, divides by overall total.
```{r table2}
prop.table(tab)
prop.table(tab,1) * 100
```
## Creating Two-way Tables
```{r}
tab_df %>%
count(x, y) %>%
group_by(x) %>% mutate(pct_x = n / sum(n))
```
## Creating Two-way Tables
```{r}
library(scales)
tab_df %>%
count(x, y) %>%
group_by(x) %>% mutate(pct_x = percent(n / sum(n)))
```
## Lab Part 2
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Download Salary FY2014 Data
From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53, from https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv
Read the CSV into R `Sal`:
```{r readSal, echo = TRUE, eval = FALSE}
Sal = jhur::read_salaries() # or
Sal = read_csv("https://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv")
Sal = rename(Sal, Name = name)
```
```{r readSal_csv, echo= FALSE, eval = TRUE}
Sal = read_csv("https://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv", col_types = cols(
name = col_character(),
JobTitle = col_character(),
AgencyID = col_character(),
Agency = col_character(),
HireDate = col_character(),
AnnualSalary = col_character(),
GrossPay = col_character()
))
Sal = rename(Sal, Name = name)
```
## Checking for logical conditions
* `any()` - checks if there are any `TRUE`s
* `all()` - checks if ALL are true
```{r isna}
head(Sal,2)
any(is.na(Sal$Name)) # are there any NAs?
```
# Recoding Variables
## Example of Recoding
For example, let's say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
In `dplyr` you can use the `recode` function:
```{r, eval = FALSE}
data = data %>%
mutate(gender = recode(gender, M = "Male", m = "Male", M = "Male"))
```
or use `ifelse`:
```{r, eval = FALSE}
data %>%
mutate(gender = ifelse(gender %in% c("Male", "M", "m"),
"Male", gender))
```
## Example of Cleaning: more complicated
Sometimes though, it's not so simple. That's where functions that find patterns come in very useful.
```{r gender, echo=FALSE}
set.seed(4) # random sample below - make sure same every time
gender <- sample(c("Male", "mAle", "MaLe", "M", "MALE", "Ma", "FeMAle", "F", "Woman", "Man", "Fm", "FEMALE"), 1000, replace = TRUE)
```
```{r gentab}
table(gender)
```
## Example of Cleaning: more complicated
```{r gender2, echo=FALSE}
gender = gender %>%
tolower %>%
recode(m = "Male", f = "Female", ma = "Male",
woman = "Female", man = "Male")
```
```{r gentab2}
table(gender)
```
# Strings functions
## Splitting/Find/Replace and Regular Expressions
* R can do much more than find exact matches for a whole string
* Like Perl and other languages, it can use regular expressions.
* What are regular expressions?
* Ways to search for specific strings
* Can be very complicated or simple
* Highly Useful - think "Find" on steroids
## A bit on Regular Expressions
* http://www.regular-expressions.info/reference.html
* They can use to match a large number of strings in one statement
* `.` matches any single character
* `*` means repeat as many (even if 0) more times the last character
* `?` makes the last thing optional
* `^` matches start of vector `^a` - starts with "a"
* `$` matches end of vector `b$` - ends with "b"
## The `stringr` package
The `stringr` package:
* Makes string manipulation more intuitive
* Has a standard format for most functions
* the first argument is a string like first argument is a `data.frame` in `dplyr`
* We will not cover `grep` or `gsub` - base R functions
- are used on forums for answers
* Almost all functions start with `str_*`
## Let's look at modifier for `stringr`
`?modifiers`
* `fixed` - match everything exactly
* `regexp` - default - uses **reg**ular **exp**ressions
* `ignore_case` is an option to not have to use `tolower`
## Substring and String Splitting
* `str_sub(x, start, end)` - substrings from position start to position end
* `str_split(string, pattern)` - splits strings up - returns list!
```{r str_split_orig}
library(stringr)
x <- c("I really", "like writing", "R code programs")
y <- str_split(x, " ") # returns a list
y
```
## Using a fixed expression
One example case is when you want to split on a period "`.`". In regular expressions `.` means **ANY** character, so
```{r}
str_split("I.like.strings", ".")
str_split("I.like.strings", fixed("."))
```
## Let's extract from `y`
```{r stsplit2}
y[[2]]
sapply(y, dplyr::first) # on the fly
sapply(y, nth, 2) # on the fly
sapply(y, last) # on the fly
```
## Separating columns based on a separator
* From `tidyr`, you can split a data set into multiple columns:
```{r separate_df}
df = tibble(x = c("I really", "like writing", "R code programs"))
```
```{r}
df %>% separate(x, into = c("first", "second", "third"))
```
## Separating columns based on a separator
* From `tidyr`, you can split a data set into multiple columns:
```{r separate_df2_for_easy}
df = tibble(x = c("I really", "like writing", "R code programs"))
```
```{r}
df %>% separate(x, into = c("first", "second"))
```
## Separating columns based on a separator
* `extra = "merge"` will not drop data. Also, you can specify the separator
```{r separate_df2}
df = tibble(x = c("I really", "like. _writing R. But not", "R code programs"))
```
```{r}
df %>% separate(x, into = c("first", "second", "third"), extra = "merge")
```
## Separating columns based on a separator
* `extra = "merge"` will not drop data. Also, you can specify the separator
```{r}
df %>% separate(x, into = c("first", "second", "third"),
extra = "merge", sep = " ")
```
## 'Find' functions: `stringr`
`str_detect`, `str_subset`, `str_replace`, and `str_replace_all` search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
* `str_detect` - returns `TRUE` if `pattern` is found
* `str_subset` - returns only the strings which pattern were detected
* convenient wrapper around `x[str_detect(x, pattern)]`
* `str_extract` - returns only strings which pattern were detected, but ONLY the pattern
* `str_replace` - replaces `pattern` with `replacement` the first time
* `str_replace_all` - replaces `pattern` with `replacement` as many times matched
## 'Find' functions: Finding Logicals
These are the indices where the pattern match occurs:
```{r RawlMatch_log}
head(str_detect(Sal$Name, "Rawlings"))
```
## 'Find' functions: Finding Indices
These are the indices where the pattern match occurs:
```{r RawlMatch}
which(str_detect(Sal$Name, "Rawlings"))
```
## Showing difference in `str_extract`
`str_extract` extracts just the matched string
```{r ggrep2}
ss = str_extract(Sal$Name, "Rawling")
head(ss)
ss[ !is.na(ss)]
```
## 'Find' functions: finding values, `stringr` and `dplyr`
```{r ggrep}
str_subset(Sal$Name, "Rawlings")
Sal %>% filter(str_detect(Name, "Rawlings"))
```
## Using Regular Expressions
* Look for any name that starts with:
* Payne at the beginning,
* Leonard and then an S
* Spence then capital C
```{r grepstar_stringr}
head(str_subset( Sal$Name, "^Payne.*"), 3)
```
```{r grepstar2_stringr}
head(str_subset( Sal$Name, "Leonard.?S"))
head(str_subset( Sal$Name, "Spence.*C.*"))
```
## Showing differnce in `str_extract` and `str_extract_all`
`str_extract_all` extracts all the matched strings
- `\\d` searches for DIGITS/numbers
```{r }
head(str_extract(Sal$AgencyID, "\\d"))
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
```
## Showing differnce in `str_replace` and `str_replace_all`
`str_replace_all` extracts all the matched strings
```{r }
head(str_replace(Sal$Name, "a", "j"))
head(str_replace_all(Sal$Name, "a", "j"), 2)
```
## Replace
Let's say we wanted to sort the data set by Annual Salary:
```{r classSal}
class(Sal$AnnualSalary)
```
```{r destringSal}
head(Sal$AnnualSalary, 4)
head(as.numeric(Sal$AnnualSalary), 4)
```
R didn't like the `$` so it thought turned them all to `NA`.
## Replacing and substituting
Now we can replace the `$` with nothing (used `fixed("$")` because `$` means ending):
```{r orderSal}
Sal = Sal %>% mutate(
AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""),
AnnualSalary = as.numeric(AnnualSalary)
) %>%
arrange(desc(AnnualSalary))
```
## Pasting strings with `paste` and `paste0`
Paste can be very useful for joining vectors together:
```{r Paste}
paste("Visit", 1:5, sep = "_")
paste("Visit", 1:5, sep = "_", collapse = " ")
paste("To", "is going be the ", "we go to the store!", sep = "day ")
# and paste0 can be even simpler see ?paste0
paste0("Visit",1:5)
```
## Uniting columns based on a separator
* From `tidyr`, you can unite:
```{r unite_df}
df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5))
```
```{r}
df %>% unite(col = "unique_id", id, visit, sep = "_")
```
## Uniting columns based on a separator
* From `tidyr`, you can unite:
```{r unite_df2}
df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5))
```
```{r}
df %>% unite(col = "unique_id", id, visit, sep = "_", remove = FALSE)
```
## Paste Depicting How Collapse Works
```{r Paste2}
paste(1:5)
paste(1:5, collapse = " ")
```
## Useful String Functions
Useful String functions
* `toupper()`, `tolower()` - uppercase or lowercase your data:
* `str_trim()` (in the `stringr` package) or `trimws` in base
- will trim whitespace
* `nchar` - get the number of characters in a string
## Sorting characters
- `sort` - reorders the data - characters work, but not correctly
- `rank` - gives the rank of the data - ties are split
- `order` - gives the indices, if subset, would give the data sorted
- `x[order(x)]` is the same as sorting
```{r orderrank}
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
order(c("1", "2", "10"))
x = rnorm(10)
x[1] = x[2] # create a tie
rank(x)
```
## Lab Part 3
[Website](http://johnmuschelli.com/intro_to_r/index.html)
## Website
[Website](http://johnmuschelli.com/intro_to_r/index.html)
# Comparison of `stringr` to base R - not covered
# Splitting Strings
## Substringing
Very similar:
Base R
* `substr(x, start, stop)` - substrings from position start to position stop
* `strsplit(x, split)` - splits strings up - returns list!
`stringr`
* `str_sub(x, start, end)` - substrings from position start to position end
* `str_split(string, pattern)` - splits strings up - returns list!
## Splitting String: base R
In base R, `strsplit` splits a vector on a string into a `list`
```{r}
x <- c("I really", "like writing", "R code programs")
y <- strsplit(x, split = " ") # returns a list
y
```
## Showing differnce in `str_extract` and `str_extract_all`
`str_extract_all` extracts all the matched strings
- `\\d` searches for DIGITS/numbers
```{r}
head(str_extract(Sal$AgencyID, "\\d"))
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
```
## 'Find' functions: base R
`grep`: `grep`, `grepl`, `regexpr` and `gregexpr` search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
`grep(pattern, x, fixed=FALSE)`, where:
* pattern = character string containing a regular expression to be matched in the given character vector.
* x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
* If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
## 'Find' functions: stringr compared to base R
Base R does not use these functions. Here is a "translator" of the `stringr` function to base R functions
* `str_detect` - similar to `grepl` (return logical)
* `grep(value = FALSE)` is similar to `which(str_detect())`
* `str_subset` - similar to `grep(value = TRUE)` - return value of matched
* `str_replace` - similar to `sub` - replace one time
* `str_replace_all` - similar to `gsub` - replace many times
## Important Comparisons
Base R:
* Argument order is `(pattern, x)`
* Uses option `(fixed = TRUE)`
`stringr`
* Argument order is `(string, pattern)` aka `(x, pattern)`
* Uses function `fixed(pattern)`
## 'Find' functions: Finding Indices
These are the indices where the pattern match occurs:
```{r}
grep("Rawlings",Sal$Name)
which(grepl("Rawlings", Sal$Name))
which(str_detect(Sal$Name, "Rawlings"))
```
## 'Find' functions: Finding Logicals
These are the indices where the pattern match occurs:
```{r}
head(grepl("Rawlings",Sal$Name))
head(str_detect(Sal$Name, "Rawlings"))
```
## 'Find' functions: finding values, base R
```{r}
grep("Rawlings",Sal$Name,value=TRUE)
Sal[grep("Rawlings",Sal$Name),]
```
## Showing differnce in `str_extract`
`str_extract` extracts just the matched string
```{r}
ss = str_extract(Sal$Name, "Rawling")
head(ss)
ss[ !is.na(ss)]
```
## Showing differnce in `str_extract` and `str_extract_all`
`str_extract_all` extracts all the matched strings
```{r}
head(str_extract(Sal$AgencyID, "\\d"))
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
```
## Using Regular Expressions
* Look for any name that starts with:
* Payne at the beginning,
* Leonard and then an S
* Spence then capital C
```{r}
head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
```
```{r}
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
```
## Using Regular Expressions: `stringr`
```{r}
head(str_subset( Sal$Name, "^Payne.*"), 3)
```
```{r}
head(str_subset( Sal$Name, "Leonard.?S"))
head(str_subset( Sal$Name, "Spence.*C.*"))
```
## Replace
Let's say we wanted to sort the data set by Annual Salary:
```{r}
class(Sal$AnnualSalary)
```
```{r}
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
order(c("1", "2", "10"))
```
## Replace
So we must change the annual pay into a numeric:
```{r}
head(Sal$AnnualSalary, 4)
head(as.numeric(Sal$AnnualSalary), 4)
```
R didn't like the `$` so it thought turned them all to `NA`.
`sub()` and `gsub()` can do the replacing part in base R.
## Replacing and subbing
Now we can replace the `$` with nothing (used `fixed=TRUE` because `$` means ending):
```{r}
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="",
Sal$AnnualSalary, fixed=TRUE))
Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ]
Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
```
## Replacing and subbing: `stringr`
We can do the same thing (with 2 piping operations!) in dplyr
```{r}
dplyr_sal = Sal
dplyr_sal = dplyr_sal %>% mutate(
AnnualSalary = AnnualSalary %>%
str_replace(
fixed("$"),
"") %>%
as.numeric) %>%
arrange(desc(AnnualSalary))
check_Sal = Sal
rownames(check_Sal) = NULL
all.equal(check_Sal, dplyr_sal)
```
## Website
[Website](../index.html)