Often you only want to look at subsets of a data set at any given time. Elements of an R object are selected using the brackets ([
and ]
).
For example, x
is a vector of numbers and we can select the second element of x
using the brackets and an index (2):
x = c(1, 4, 2, 8, 10) x[2]
[1] 4
dplyr
:
nth(x, n = 2)
[1] 4
We can select the fifth or second AND fifth elements below:
x = c(1, 2, 4, 8, 10) x[5]
[1] 10
x[c(2,5)]
[1] 2 10
nth(x, n = c(2, 5)) # nth only returns one number
Error in format_error_bullets(x[-1]): nms %in% c("i", "x", "") is not TRUE
You can put a minus (-
) before integers inside brackets to remove these indices from the data.
x[-2] # all but the second
[1] 1 4 8 10
Note that you have to be careful with this syntax when dropping more than 1 element:
x[-c(1,2,3)] # drop first 3
[1] 8 10
# x[-1:3] # shorthand. R sees as -1 to 3 x[-(1:3)] # needs parentheses
[1] 8 10
What about selecting rows based on the values of two variables? We use logical statements. Here we select only elements of x
greater than 2:
x
[1] 1 2 4 8 10
x > 2
[1] FALSE FALSE TRUE TRUE TRUE
x[ x > 2 ]
[1] 4 8 10
You can have multiple logical conditions using the following:
&
: AND|
: ORx[ x > 2 & x < 5 ]
[1] 4
x[ x > 5 | x == 2 ]
[1] 2 8 10
The which
functions takes in logical vectors and returns the index for the elements where the logical value is TRUE
.
which(x > 5 | x == 2) # returns index
[1] 2 4 5
x[ which(x > 5 | x == 2) ]
[1] 2 8 10
x[ x > 5 | x == 2 ]
[1] 2 8 10
In general, data cleaning is a process of investigating your data for inaccuracies, or recoding it in a way that makes it more manageable.
MOST IMPORTANT RULE - LOOK AT YOUR DATA!
is.na
- is TRUE
if the data is FALSE
otherwise!
- negation (NOT)
is.na(x)
is TRUE
, then !is.na(x)
is FALSE
all
takes in a logical
and will be TRUE
if ALL are TRUE
all(!is.na(x))
- are all values of x
NOT NAany
will be TRUE
if ANY are true
any(is.na(x))
- do we have any NA
’s in x
?complete.cases
- returns TRUE
if EVERY value of a row is NOT NA
FALSE
missing one value (even if not important)tidyr::drop_na
will drop rows with any missingOne of the most important aspects of data cleaning is missing values.
Types of “missing” data:
NA
- general missing dataNaN
- stands for “Not a Number”, happens when you do 0/0.Inf
and -Inf
- Infinity, happens when you take a positive number (or negative number) by 0.Each missing data type has a function that returns TRUE
if the data is missing:
NA
- is.na
NaN
- is.nan
Inf
and -Inf
- is.infinite
is.finite
returns FALSE
for all missing data and TRUE
for non-missingOne important aspect (esp with subsetting) is that logical operations return NA
for NA
values. Think about it, the data could be > 2
or not we don’t know, so R
says there is no TRUE
or FALSE
, so that is missing:
x = c(0, NA, 2, 3, 4, -0.5, 0.2) x > 2
[1] FALSE NA FALSE TRUE TRUE FALSE FALSE
What to do? What if we want if x > 2
and x
isn’t NA
?
Don’t do x != NA
, do x > 2
and x
is NOT NA
:
x != NA
[1] NA NA NA NA NA NA NA
x > 2 & !is.na(x)
[1] FALSE FALSE FALSE TRUE TRUE FALSE FALSE
What about seeing if a value is equal to multiple values? You can do (x == 1 | x == 2) & !is.na(x)
, but that is not efficient.
(x == 0 | x == 2) # has NA
[1] TRUE NA TRUE FALSE FALSE FALSE FALSE
(x == 0 | x == 2) & !is.na(x) # No NA
[1] TRUE FALSE TRUE FALSE FALSE FALSE FALSE
what to do?
%in%
Filter removes missing values, have to keep them if you want them:
df = tibble(x = x) df %>% filter(x > 2)
# A tibble: 2 x 1 x <dbl> 1 3 2 4
df %>% filter(between(x, -1, 3) | is.na(x))
# A tibble: 6 x 1 x <dbl> 1 0 2 NA 3 2 4 3 5 -0.5 6 0.2
dplyr::filter
Be careful with missing data using subsetting:
x %in% c(0, 2, NA) # this
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
x %in% c(0, 2) | is.na(x) # versus this
[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE
Similarly with logicals, operations/arithmetic with NA
will result in NA
s:
x + 2
[1] 2.0 NA 4.0 5.0 6.0 1.5 2.2
x * 2
[1] 0.0 NA 4.0 6.0 8.0 -1.0 0.4
unique
- gives you the unique values of a variabletable(x)
- will give a one-way table of x
table(x, useNA = "ifany")
- will have row NAtable(x, y)
- will give a cross-tab of x
and y
df %>% count(x, y)
df %>% group_by(x, y) %>% tally
Here we will use table
to make tabulations of the data. Look at ?table
to see options for missing data.
unique(x)
[1] 0.0 NA 2.0 3.0 4.0 -0.5 0.2
table(x)
x -0.5 0 0.2 2 3 4 1 1 1 1 1 1
table(x, useNA = "ifany") # will not
x -0.5 0 0.2 2 3 4 <NA> 1 1 1 1 1 1 1
df %>% count(x)
# A tibble: 7 x 2 x n <dbl> <int> 1 -0.5 1 2 0 1 3 0.2 1 4 2 1 5 3 1 6 4 1 7 NA 1
useNA = "ifany"
will not have NA in table heading if no NA
:
table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3), useNA = "ifany")
0 1 2 3 1 1 4 4
tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3)) %>% count(x)
# A tibble: 4 x 2 x n <dbl> <int> 1 0 1 2 1 1 3 2 4 4 3 4
You can set useNA = "always"
to have it always have a column for NA
table(c(0, 1, 2, 3, 2, 3, 3, 2, 2, 3), useNA = "always")
0 1 2 3 <NA> 1 1 4 4 0
If you use a factor
, all levels will be given even if no exist! - (May be wanted or not):
fac = factor(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), levels = 1:4) tab = table(fac) tab
fac 1 2 3 4 1 4 4 0
tab[ tab > 0 ]
fac 1 2 3 1 4 4
tibble(x = fac) %>% count(x)
# A tibble: 4 x 2 x n <fct> <int> 1 1 1 2 2 4 3 3 4 4 <NA> 1
A two-way table. If you pass in 2 vectors, table
creates a 2-dimensional table.
tab <- table(c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3), useNA = "always") tab
0 1 2 3 4 <NA> 0 1 0 0 0 0 0 1 0 1 0 0 0 0 2 0 0 2 0 2 0 3 0 0 0 4 0 0 <NA> 0 0 0 0 0 0
tab_df = tibble(x = c(0, 1, 2, 3, 2, 3, 3, 2,2, 3), y = c(0, 1, 2, 3, 2, 3, 3, 4, 4, 3)) tab_df %>% count(x, y)
# A tibble: 5 x 3 x y n <dbl> <dbl> <int> 1 0 0 1 2 1 1 1 3 2 2 2 4 2 4 2 5 3 3 4
margin.table
finds the marginal sums of the table. margin
is 1 for rows, 2 for columns in general in R
. Here is the column sums of the table:
margin.table(tab, 2)
0 1 2 3 4 <NA> 1 1 2 4 2 0
prop.table
finds the marginal proportions of the table. Think of it dividing the table by it’s respective marginal totals. If margin
not set, divides by overall total.
prop.table(tab)
0 1 2 3 4 <NA> 0 0.1 0.0 0.0 0.0 0.0 0.0 1 0.0 0.1 0.0 0.0 0.0 0.0 2 0.0 0.0 0.2 0.0 0.2 0.0 3 0.0 0.0 0.0 0.4 0.0 0.0 <NA> 0.0 0.0 0.0 0.0 0.0 0.0
prop.table(tab,1) * 100
0 1 2 3 4 <NA> 0 100 0 0 0 0 0 1 0 100 0 0 0 0 2 0 0 50 0 50 0 3 0 0 0 100 0 0 <NA>
tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = n / sum(n))
# A tibble: 5 x 4 # Groups: x [4] x y n pct_x <dbl> <dbl> <int> <dbl> 1 0 0 1 1 2 1 1 1 1 3 2 2 2 0.5 4 2 4 2 0.5 5 3 3 4 1
library(scales)
Attaching package: 'scales'
The following object is masked from 'package:purrr': discard
The following object is masked from 'package:readr': col_factor
tab_df %>% count(x, y) %>% group_by(x) %>% mutate(pct_x = percent(n / sum(n)))
# A tibble: 5 x 4 # Groups: x [4] x y n pct_x <dbl> <dbl> <int> <chr> 1 0 0 1 100% 2 1 1 1 100% 3 2 2 2 50% 4 2 4 2 50% 5 3 3 4 100%
From https://data.baltimorecity.gov/City-Government/Baltimore-City-Employee-Salaries-FY2015/nsfe-bg53, from https://data.baltimorecity.gov/api/views/nsfe-bg53/rows.csv
Read the CSV into R Sal
:
Sal = jhur::read_salaries() # or Sal = read_csv("https://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv") Sal = rename(Sal, Name = name)
any()
- checks if there are any TRUE
sall()
- checks if ALL are truehead(Sal,2)
# A tibble: 2 x 7 Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Aaron,Pa… Facilities/Off… A03031 OED-Employm… 10/24/1… $55314.00 $53626.… 2 Aaron,Pe… ASSISTANT STAT… A29045 States Atto… 09/25/2… $74000.00 $73000.…
any(is.na(Sal$Name)) # are there any NAs?
[1] FALSE
For example, let’s say gender was coded as Male, M, m, Female, F, f. Using Excel to find all of these would be a matter of filtering and changing all by hand or using if statements.
In dplyr
you can use the recode
function:
data = data %>% mutate(gender = recode(gender, M = "Male", m = "Male", M = "Male"))
or use ifelse
:
data %>% mutate(gender = ifelse(gender %in% c("Male", "M", "m"), "Male", gender))
Sometimes though, it’s not so simple. That’s where functions that find patterns come in very useful.
table(gender)
gender F FeMAle FEMALE Fm M Ma mAle Male MaLe MALE Man 80 88 76 87 99 76 84 83 79 93 84 Woman 71
table(gender)
gender female Female fm male Male 164 151 87 339 259
.
matches any single character*
means repeat as many (even if 0) more times the last character?
makes the last thing optional^
matches start of vector ^a
- starts with “a”$
matches end of vector b$
- ends with “b”stringr
packageThe stringr
package:
data.frame
in dplyr
grep
or gsub
- base R functions
str_*
stringr
?modifiers
fixed
- match everything exactlyregex
- default - uses regular expressionsignore_case
is an option to not have to use tolower
str_sub(x, start, end)
- substrings from position start to position endstr_split(string, pattern)
- splits strings up - returns list!library(stringr) x <- c("I really", "like writing", "R code programs") y <- str_split(x, " ") # returns a list y
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
One example case is when you want to split on a period “.
”. In regular expressions .
means ANY character, so
str_split("I.like.strings", ".")
[[1]] [1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
str_split("I.like.strings", fixed("."))
[[1]] [1] "I" "like" "strings"
y
y[[2]]
[1] "like" "writing"
sapply(y, dplyr::first) # on the fly
[1] "I" "like" "R"
sapply(y, nth, 2) # on the fly
[1] "really" "writing" "code"
sapply(y, last) # on the fly
[1] "really" "writing" "programs"
tidyr
, you can split a data set into multiple columns:df = tibble(x = c("I really", "like writing", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2 rows [1, 2].
# A tibble: 3 x 3 first second third <chr> <chr> <chr> 1 I really <NA> 2 like writing <NA> 3 R code programs
tidyr
, you can split a data set into multiple columns:df = tibble(x = c("I really", "like writing", "R code programs"))
df %>% separate(x, into = c("first", "second"))
Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
# A tibble: 3 x 2 first second <chr> <chr> 1 I really 2 like writing 3 R code
extra = "merge"
will not drop data. Also, you can specify the separatordf = tibble(x = c("I really", "like. _writing R. But not", "R code programs"))
df %>% separate(x, into = c("first", "second", "third"), extra = "merge")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 x 3 first second third <chr> <chr> <chr> 1 I really <NA> 2 like writing R. But not 3 R code programs
extra = "merge"
will not drop data. Also, you can specify the separatordf %>% separate(x, into = c("first", "second", "third"), extra = "merge", sep = " ")
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
# A tibble: 3 x 3 first second third <chr> <chr> <chr> 1 I really <NA> 2 like. _writing R. But not 3 R code programs
stringr
str_detect
, str_subset
, str_replace
, and str_replace_all
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
str_detect
- returns TRUE
if pattern
is foundstr_subset
- returns only the strings which pattern were detected
x[str_detect(x, pattern)]
str_extract
- returns only strings which pattern were detected, but ONLY the patternstr_replace
- replaces pattern
with replacement
the first timestr_replace_all
- replaces pattern
with replacement
as many times matchedThese are the indices where the pattern match occurs:
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
These are the indices where the pattern match occurs:
which(str_detect(Sal$Name, "Rawlings"))
[1] 10256 10257 10258
str_extract
str_extract
extracts just the matched string
ss = str_extract(Sal$Name, "Rawling") head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
[1] "Rawling" "Rawling" "Rawling"
stringr
and dplyr
str_subset(Sal$Name, "Rawlings")
[1] "Rawlings,Kellye A" "Rawlings,Paula M" [3] "Rawlings-Blake,Stephanie C"
Sal %>% filter(str_detect(Name, "Rawlings"))
# A tibble: 3 x 7 Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 Rawlings,Ke… EMERGENCY D… A40302 M-R Info Te… 01/06/2… $48940.00 $73356.… 2 Rawlings,Pa… COMMUNITY A… A04015 R&P-Recreat… 12/10/2… $19802.00 $10443.… 3 Rawlings-Bl… MAYOR A01001 Mayors Offi… 12/07/1… $167449.00 $165249…
head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne El,Boaz L" "Payne El,Jackie" [3] "Payne Johnson,Nickole A"
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Payne,Leonard S" "Szumlanski,Leonard S"
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Charles A" "Spencer,Clarence W" "Spencer,Michael C"
str_extract
and str_extract_all
str_extract_all
extracts all the matched strings - \\d
searches for DIGITS/numbers
head(str_extract(Sal$AgencyID, "\\d"))
[1] "0" "2" "6" "9" "4" "9"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "0" "3" "0" "3" "1" [[2]] [1] "2" "9" "0" "4" "5"
str_replace
and str_replace_all
str_replace_all
extracts all the matched strings
head(str_replace(Sal$Name, "a", "j"))
[1] "Ajron,Patricia G" "Ajron,Petra L" "Abjineh,Yohannes T" [4] "Abbene,Anthony M" "Abbey,Emmjnuel" "Abbott-Cole,Michelle"
head(str_replace_all(Sal$Name, "a", "j"), 2)
[1] "Ajron,Pjtricij G" "Ajron,Petrj L"
Let’s say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "character"
head(Sal$AnnualSalary, 4)
[1] "$55314.00" "$74000.00" "$64500.00" "$46309.00"
head(as.numeric(Sal$AnnualSalary), 4)
Warning in head(as.numeric(Sal$AnnualSalary), 4): NAs introduced by coercion
[1] NA NA NA NA
R didn’t like the $
so it thought turned them all to NA
.
Now we can replace the $
with nothing (used fixed("$")
because $
means ending):
Sal = Sal %>% mutate( AnnualSalary = str_replace(AnnualSalary, fixed("$"), ""), AnnualSalary = as.numeric(AnnualSalary) ) %>% arrange(desc(AnnualSalary))
paste
and paste0
Paste can be very useful for joining vectors together:
paste("Visit", 1:5, sep = "_")
[1] "Visit_1" "Visit_2" "Visit_3" "Visit_4" "Visit_5"
paste("Visit", 1:5, sep = "_", collapse = " ")
[1] "Visit_1 Visit_2 Visit_3 Visit_4 Visit_5"
paste("To", "is going be the ", "we go to the store!", sep = "day ")
[1] "Today is going be the day we go to the store!"
# and paste0 can be even simpler see ?paste0 paste0("Visit",1:5)
[1] "Visit1" "Visit2" "Visit3" "Visit4" "Visit5"
tidyr
, you can unite:df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5))
df %>% unite(col = "unique_id", id, visit, sep = "_")
# A tibble: 15 x 1 unique_id <chr> 1 1_1 2 2_1 3 3_1 4 4_1 5 5_1 6 1_2 7 2_2 8 3_2 9 4_2 10 5_2 11 1_3 12 2_3 13 3_3 14 4_3 15 5_3
tidyr
, you can unite:df = tibble(id = rep(1:5, 3), visit = rep(1:3, each = 5))
df %>% unite(col = "unique_id", id, visit, sep = "_", remove = FALSE)
# A tibble: 15 x 3 unique_id id visit <chr> <int> <int> 1 1_1 1 1 2 2_1 2 1 3 3_1 3 1 4 4_1 4 1 5 5_1 5 1 6 1_2 1 2 7 2_2 2 2 8 3_2 3 2 9 4_2 4 2 10 5_2 5 2 11 1_3 1 3 12 2_3 2 3 13 3_3 3 3 14 4_3 4 3 15 5_3 5 3
paste(1:5)
[1] "1" "2" "3" "4" "5"
paste(1:5, collapse = " ")
[1] "1 2 3 4 5"
Useful String functions
toupper()
, tolower()
- uppercase or lowercase your data:str_trim()
(in the stringr
package) or trimws
in base
nchar
- get the number of characters in a stringsort
- reorders the data - characters work, but not correctlyrank
- gives the rank of the data - ties are splitorder
- gives the indices, if subset, would give the data sorted
x[order(x)]
is the same as sortingsort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
x = rnorm(10) x[1] = x[2] # create a tie rank(x)
[1] 2.5 2.5 10.0 7.0 4.0 1.0 8.0 5.0 9.0 6.0
stringr
to base R - not coveredVery similar:
Base R
substr(x, start, stop)
- substrings from position start to position stopstrsplit(x, split)
- splits strings up - returns list!stringr
str_sub(x, start, end)
- substrings from position start to position endstr_split(string, pattern)
- splits strings up - returns list!In base R, strsplit
splits a vector on a string into a list
x <- c("I really", "like writing", "R code programs") y <- strsplit(x, split = " ") # returns a list y
[[1]] [1] "I" "really" [[2]] [1] "like" "writing" [[3]] [1] "R" "code" "programs"
str_extract
and str_extract_all
str_extract_all
extracts all the matched strings - \\d
searches for DIGITS/numbers
head(str_extract(Sal$AgencyID, "\\d"))
[1] "2" "9" "6" "2" "0" "0"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "2" "9" "0" "0" "1" [[2]] [1] "9" "9" "3" "9" "0"
grep
: grep
, grepl
, regexpr
and gregexpr
search for matches to argument pattern within each element of a character vector: they differ in the format of and amount of detail in the results.
grep(pattern, x, fixed=FALSE)
, where:
pattern = character string containing a regular expression to be matched in the given character vector.
x = a character vector where matches are sought, or an object which can be coerced by as.character to a character vector.
If fixed=TRUE, it will do exact matching for the phrase anywhere in the vector (regular find)
Base R does not use these functions. Here is a “translator” of the stringr
function to base R functions
str_detect
- similar to grepl
(return logical)grep(value = FALSE)
is similar to which(str_detect())
str_subset
- similar to grep(value = TRUE)
- return value of matchedstr_replace
- similar to sub
- replace one timestr_replace_all
- similar to gsub
- replace many timesBase R:
(pattern, x)
(fixed = TRUE)
stringr
(string, pattern)
aka (x, pattern)
fixed(pattern)
These are the indices where the pattern match occurs:
grep("Rawlings",Sal$Name)
[1] 9 6854 13284
which(grepl("Rawlings", Sal$Name))
[1] 9 6854 13284
which(str_detect(Sal$Name, "Rawlings"))
[1] 9 6854 13284
These are the indices where the pattern match occurs:
head(grepl("Rawlings",Sal$Name))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(str_detect(Sal$Name, "Rawlings"))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
grep("Rawlings",Sal$Name,value=TRUE)
[1] "Rawlings-Blake,Stephanie C" "Rawlings,Kellye A" [3] "Rawlings,Paula M"
Sal[grep("Rawlings",Sal$Name),]
# A tibble: 3 x 7 Name JobTitle AgencyID Agency HireDate AnnualSalary GrossPay <chr> <chr> <chr> <chr> <chr> <dbl> <chr> 1 Rawlings-Bl… MAYOR A01001 Mayors Offi… 12/07/1… 167449 $165249… 2 Rawlings,Ke… EMERGENCY D… A40302 M-R Info Te… 01/06/2… 48940 $73356.… 3 Rawlings,Pa… COMMUNITY A… A04015 R&P-Recreat… 12/10/2… 19802 $10443.…
str_extract
str_extract
extracts just the matched string
ss = str_extract(Sal$Name, "Rawling") head(ss)
[1] NA NA NA NA NA NA
ss[ !is.na(ss)]
[1] "Rawling" "Rawling" "Rawling"
str_extract
and str_extract_all
str_extract_all
extracts all the matched strings
head(str_extract(Sal$AgencyID, "\\d"))
[1] "2" "9" "6" "2" "0" "0"
head(str_extract_all(Sal$AgencyID, "\\d"), 2)
[[1]] [1] "2" "9" "0" "0" "1" [[2]] [1] "9" "9" "3" "9" "0"
head(grep("^Payne.*", x = Sal$Name, value = TRUE), 3)
[1] "Payne,James R" "Payne,Karen V" "Payne,Jasman T"
head(grep("Leonard.?S", x = Sal$Name, value = TRUE))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"
head(grep("Spence.*C.*", x = Sal$Name, value = TRUE))
[1] "Spencer,Michael C" "Spencer,Clarence W" "Spencer,Charles A"
stringr
head(str_subset( Sal$Name, "^Payne.*"), 3)
[1] "Payne,James R" "Payne,Karen V" "Payne,Jasman T"
head(str_subset( Sal$Name, "Leonard.?S"))
[1] "Szumlanski,Leonard S" "Payne,Leonard S"
head(str_subset( Sal$Name, "Spence.*C.*"))
[1] "Spencer,Michael C" "Spencer,Clarence W" "Spencer,Charles A"
Let’s say we wanted to sort the data set by Annual Salary:
class(Sal$AnnualSalary)
[1] "numeric"
sort(c("1", "2", "10")) # not sort correctly (order simply ranks the data)
[1] "1" "10" "2"
order(c("1", "2", "10"))
[1] 1 3 2
So we must change the annual pay into a numeric:
head(Sal$AnnualSalary, 4)
[1] 238772 211785 200000 192500
head(as.numeric(Sal$AnnualSalary), 4)
[1] 238772 211785 200000 192500
R didn’t like the $
so it thought turned them all to NA
.
sub()
and gsub()
can do the replacing part in base R.
Now we can replace the $
with nothing (used fixed=TRUE
because $
means ending):
Sal$AnnualSalary <- as.numeric(gsub(pattern = "$", replacement="", Sal$AnnualSalary, fixed=TRUE)) Sal <- Sal[order(Sal$AnnualSalary, decreasing=TRUE), ] Sal[1:5, c("Name", "AnnualSalary", "JobTitle")]
# A tibble: 5 x 3 Name AnnualSalary JobTitle <chr> <dbl> <chr> 1 Mosby,Marilyn J 238772 STATE'S ATTORNEY 2 Batts,Anthony W 211785 Police Commissioner 3 Wen,Leana 200000 Executive Director III 4 Raymond,Henry J 192500 Executive Director III 5 Swift,Michael 187200 CONTRACT SERV SPEC II
stringr
We can do the same thing (with 2 piping operations!) in dplyr
dplyr_sal = Sal dplyr_sal = dplyr_sal %>% mutate( AnnualSalary = AnnualSalary %>% str_replace( fixed("$"), "") %>% as.numeric) %>% arrange(desc(AnnualSalary)) check_Sal = Sal rownames(check_Sal) = NULL all.equal(check_Sal, dplyr_sal)
[1] TRUE