--- title: "Subsetting Data in R" author: "Introduction to R for Public Health Researchers" output: ioslides_presentation: css: ../styles.css widescreen: yes --- ```{r, echo = FALSE} library(knitr) opts_chunk$set(comment = "") suppressPackageStartupMessages(library(dplyr)) library(dplyr) library(jhur) ``` ## Overview We showed one way to read data into R using `read_csv` and `read.csv`. In this module, we will show you how to: 1. Select specific elements of an object by an index or logical condition 2. Renaming columns of a `data.frame` 3. Subset rows of a `data.frame` 4. Subset columns of a `data.frame` 5. Add/remove new columns to a `data.frame` 6. Order the columns of a `data.frame` 7. Order the rows of a `data.frame` ## Setup We will show you how to do each operation in base R then show you how to use the `dplyr` package to do the same operation (if applicable). Many resources on how to use `dplyr` exist and are straightforward: * https://cran.rstudio.com/web/packages/dplyr/vignettes/ * https://stat545-ubc.github.io/block009_dplyr-intro.html * https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial The `dplyr` package also interfaces well with tibbles. ## Loading in dplyr and tidyverse ```{r} library(tidyverse) ``` Note, when loading `dplyr`, it says objects can be "masked"/conflicts. That means if you use a function defined in 2 places, it uses the one that is loaded in **last**. ## Loading in dplyr and tidyverse For example, if we print `filter`, then we see at the bottom `namespace:dplyr`, which means when you type `filter`, it will use the one from the `dplyr` package. ```{r} filter ``` ## Loading in dplyr and tidyverse A `filter` function exists by default in the `stats` package, however. If you want to make sure you use that one, you use `PackageName::Function` with the colon-colon ("`::`") operator. ```{r} head(stats::filter,2) ``` This is important when loading many packages, and you may have some conflicts/masking. ## Creating a `data.frame` to work with Here we use one of the datasets that comes with `jhu` called `jhu_cars`, which is a (copy of another called `mtcars`) create a toy data.frame named `df` using random data: ```{r} data(jhu_cars) df = jhu_cars # df is a copy of jhu_cars head(df) # changing df does **not** change jhu_cars ``` ## Creating a `data.frame` to work with If we would like to create a `tibble` ("fancy" data.frame), we can using `as.tbl` or `as_tibble`. ```{r} tbl = as_tibble(df) head(tbl) ``` ## No rownames in tibbles! In the "tidy" data format, all information of interest is a variable (not a name). **as of tibble 2.0, rownames are removed**. For example, `mtcars` has each car name as a row name: ```{r} head(mtcars, 2) head(as_tibble(mtcars), 2) ``` # Renaming Columns ## Renaming Columns of a `data.frame`: dplyr To rename columns in `dplyr`, you use the `rename` command ```{r} df = dplyr::rename(df, MPG = mpg) head(df) df = rename(df, mpg = MPG) # reset - don't need :: b/c not masked ``` ## Renaming All Columns of a `data.frame`: dplyr To rename all columns you use the `rename_all` command (with a function) ```{r} df_upper = dplyr::rename_all(df, toupper) head(df_upper) ``` ## Lab Part 1 [Website](http://johnmuschelli.com/intro_to_r/index.html) # Subsetting Columns ## Subset columns of a `data.frame`: We can grab the `carb` column using the `$` operator. ```{r} df$carb ``` ## Subset columns of a `data.frame`: dplyr The `select` command from `dplyr` allows you to subset (gives a `tibble`!) ```{r} select(df, mpg) ``` ## Subset columns of a `data.frame`: dplyr If you wanted it to be a single vector (not a `tibble`), use `pull`: ```{r} pull(select(df, mpg)) ``` ## Select columns of a `data.frame`: dplyr The `select` command from `dplyr` allows you to subset columns matching strings: ```{r} select(df, mpg, cyl) select(df, starts_with("c")) ``` ## See the Select "helpers" Run the command: ```{r, eval = FALSE} ??tidyselect::select_helpers ``` Here are a few: ```{r, eval = FALSE} one_of() last_col() ends_with() contains() # like searching matches() # Matches a regular expression - cover later ``` ## Lab Part 2 [Website](http://johnmuschelli.com/intro_to_r/index.html) # Subsetting Rows ## Subset rows of a `data.frame`: dplyr The command in `dplyr` for subsetting rows is `filter`. Try `?filter` ```{r} filter(df, mpg > 20 | mpg < 14) ``` Note, no `$` or subsetting is necessary. R "knows" `mpg` refers to a column of `df`. ## Subset rows of a `data.frame`: dplyr You can have multiple logical conditions using the following: * `==` : equals to * `!` : not/negation * `>` / `<`: greater than / less than * `>=` or `<=`: greater than or equal to / less than or equal to * `&` : AND * `|` : OR ## Subset rows of a `data.frame`: dplyr By default, you can separate conditions by commas, and `filter` assumes these statements are joined by `&`: ```{r} filter(df, mpg > 20 & cyl == 4) filter(df, mpg > 20, cyl == 4) ``` ## Subset rows of a `data.frame`: dplyr If you want OR statements, you need to do the pipe `|` explicitly: ```{r} filter(df, mpg > 20 | cyl == 4) ``` ## Lab Part 3 [Website](http://johnmuschelli.com/intro_to_r/index.html) ## Combining `filter` and `select` You can combine `filter` and `select` to subset the rows and columns, respectively, of a `data.frame`: ```{r} select(filter(df, mpg > 20 & cyl == 4), cyl, hp) ``` In `R`, the common way to perform multiple operations is to wrap functions around each other in a nested way such as above ## Assigning Temporary Objects One can also create temporary objects and reassign them: ```{r} df2 = filter(df, mpg > 20 & cyl == 4) df2 = select(df2, cyl, hp) ``` ## Using the `pipe` (comes with `dplyr`): Recently, the pipe `%>%` makes things such as this much more readable. It reads left side "pipes" into right side. RStudio `CMD/Ctrl + Shift + M` shortcut. Pipe `df` into `filter`, then pipe that into `select`: ```{r} df %>% filter(mpg > 20 & cyl == 4) %>% select(cyl, hp) ``` # Adding/Removing Columns ## Adding new columns to a `data.frame`: base R You can add a new column, called `newcol` to `df`, using the `$` operator: ```{r} df$newcol = df$wt/2.2 head(df,3) ``` ## Adding columns to a `data.frame`: dplyr The `$` method is very common. The `mutate` function in `dplyr` allows you to add or replace columns of a `data.frame`: ```{r} df = mutate(df, newcol = wt/2.2) ``` ```{r, echo = FALSE} print(head({df = mutate(df, newcol = wt/2.2)}, 2)) ``` ## Creating conditional variables One frequently-used tool is creating variables with conditions. A general function for creating new variables based on existing variables is the `ifelse()` function, which "returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is `TRUE` or `FALSE`." ``` ifelse(test, yes, no) # test: an object which can be coerced to logical mode. # yes: return values for true elements of test. # no: return values for false elements of test. ``` ## Adding columns to a `data.frame`: dplyr Combined with `ifelse(condition, TRUE, FALSE)`, it can give you: ```{r} df = mutate(df, disp_cat = ifelse( disp <= 200, "Low", ifelse(disp <= 400, "Medium", "High") ) ) head(df$disp_cat) ``` ## Adding columns to a `data.frame`: dplyr Alternatively, `case_when` provides a clean syntax as well: ```{r} df = mutate(df, disp_cat2 = case_when( disp <= 200 ~ "Low", disp > 200 & disp <= 400 ~ "Medium", disp > 400 ~ "High", )) head(df$disp_cat2) ``` ## Removing columns to a `data.frame`: base R You can remove a column by assigning to `NULL`: ```{r, eval = FALSE} df$newcol = NULL ``` ## Removing columns to a `data.frame`: dplyr The `NULL` method is still very common. The `select` function can remove a column with minus (`-`): ```{r, eval = FALSE} select(df, -newcol) ``` ```{r, echo = FALSE} head(select(df, -newcol)) ``` ## Removing columns to a `data.frame`: dplyr Remove `newcol` and `drat` ```{r, eval = FALSE} select(df, -one_of("newcol", "drat")) ``` ```{r, echo = FALSE} head(select(df, -one_of("newcol", "drat"))) ``` # Ordering columns ## Ordering the columns of a `data.frame`: dplyr The `select` function can reorder columns. Put `newcol` first, then select the rest of columns: ```{r, eval = FALSE} select(df, newcol, everything()) ``` ```{r, echo = FALSE} head(select(df, newcol, everything())) ``` ## Ordering the columns of a `data.frame`: dplyr Put `newcol` at the end ("remove, everything, then add back in"): ```{r, eval = FALSE} select(df, -newcol, everything(), newcol) ``` ```{r, echo = FALSE} head(select(df, -newcol, everything(), newcol)) ``` # Ordering rows ## Ordering the rows of a `data.frame`: dplyr The `arrange` function can reorder rows By default, `arrange` orders in ascending order: ```{r} arrange(df, mpg) ``` ## Ordering the rows of a `data.frame`: dplyr Use the `desc` to arrange the rows in descending order: ```{r} arrange(df, desc(mpg)) ``` ## Ordering the rows of a `data.frame`: dplyr It is a bit more straightforward to mix increasing and decreasing orderings: ```{r} arrange(df, mpg, desc(hp)) ``` ## Transmutation The `transmute` function in `dplyr` combines both the `mutate` and `select` functions. One can create new columns and keep the only the columns wanted: ```{r} transmute(df, newcol2 = wt/2.2, mpg, hp) ``` ## Lab Part 4 [Website](http://johnmuschelli.com/intro_to_r/index.html) # Extra Slides ## Renaming Columns of a `data.frame`: base R We can use the `colnames` function to extract and/or directly reassign column names of `df`: ```{r} colnames(df) # just prints colnames(df)[1:3] = c("MPG", "CYL", "DISP") # reassigns head(df) colnames(df)[1:3] = c("mpg", "cyl", "disp") #reset - just to keep consistent ``` ## Renaming Columns of a `data.frame`: base R We can assign the column names, change the ones we want, and then re-assign the column names: ```{r} cn = colnames(df) cn[ cn == "drat"] = "DRAT" colnames(df) = cn head(df) colnames(df)[ colnames(df) == "DRAT"] = "drat" #reset ``` ## Subset rows of a `data.frame` with indices: Let's select **rows** 1 and 3 from `df` using brackets: ```{r} df[ c(1, 3), ] ``` ## Subset columns of a `data.frame`: We can also subset a `data.frame` using the bracket `[, ]` subsetting. For `data.frame`s and matrices (2-dimensional objects), the brackets are `[rows, columns]` subsetting. We can grab the `x` column using the index of the column or the column name ("`carb`") ```{r} df[, 11] df[, "carb"] ``` ## Biggest difference between `tbl` and `data.frame`: Mostly, `tbl` (tibbles) are the same as `data.frame`s, except they don't print all lines. When subsetting only one column using brackets, a `data.frame` will return a vector, but a `tbl` will return a `tbl` ```{r} df[, 1] tbl[, 1] tbl[, "mpg"] df[, 1, drop = FALSE] ``` ## Subset columns of a `data.frame`: We can select multiple columns using multiple column names: ```{r} df[, c("mpg", "cyl")] ``` ## No rownames in tibbles! If you run into losing a variable contained in your row names, use `rownames_to_column` to add it before turning it into a `tibble` to keep them: ```{r} head(rownames_to_column(mtcars, var = "car"), 2) head(as_tibble(rownames_to_column(mtcars, var = "car")), 2) ```