---
title: "Subsetting Data in R"
author: "Introduction to R for Public Health Researchers"
output:
  ioslides_presentation:
    css: ../styles.css
    widescreen: yes
---

```{r, echo = FALSE}
library(knitr)
opts_chunk$set(comment = "")
suppressPackageStartupMessages(library(dplyr))
library(dplyr)
library(jhur)
```

## Overview

We showed one way to read data into R using `read_csv` and `read.csv`.  In this module, we will show you how to:

1. Select specific elements of an object by an index or logical condition
2. Renaming columns of a `data.frame`
3. Subset rows of a `data.frame`
4. Subset columns of a `data.frame`
5. Add/remove new columns to a `data.frame`
6. Order the columns of a `data.frame`
7. Order the rows of a `data.frame`

## Setup

We will show you how to do each operation in base R then show you how to use the `dplyr` package to do the same operation (if applicable).

Many resources on how to use `dplyr` exist and are straightforward: 

* https://cran.rstudio.com/web/packages/dplyr/vignettes/
* https://stat545-ubc.github.io/block009_dplyr-intro.html
* https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial

The `dplyr` package also interfaces well with tibbles.

## Loading in dplyr and tidyverse

```{r}
library(tidyverse)
```

Note, when loading `dplyr`, it says objects can be "masked"/conflicts.  That means if you 
use a function defined in 2 places, it uses the one that is loaded in **last**.

## Loading in dplyr and tidyverse

For example, if we print `filter`, then we see at the bottom `namespace:dplyr`, which
means when you type `filter`, it will use the one from the `dplyr` package.

```{r}
filter
```

## Loading in dplyr and tidyverse

A `filter` function exists by default in the `stats` package, however.  If you want
to make sure you use that one, you use `PackageName::Function` with the colon-colon 
("`::`") operator. 

```{r}
head(stats::filter,2)
```

This is important when loading many packages, and you may have
some conflicts/masking.


## Creating a `data.frame` to work with

Here we use one of the datasets that comes with `jhu` called `jhu_cars`, which is a (copy of another called `mtcars`) create a toy data.frame named `df` using random data:
```{r}
data(jhu_cars)
df = jhu_cars # df is a copy of jhu_cars
head(df) # changing df does **not** change jhu_cars
```

## Creating a `data.frame` to work with

If we would like to create a `tibble` ("fancy" data.frame), we can using `as.tbl` or `as_tibble`.  

```{r}
tbl = as_tibble(df) 
head(tbl)
```


## No rownames in tibbles!
 
In the "tidy" data format, all information of interest is a variable (not a name). **as of tibble 2.0, rownames are removed**.  For example, `mtcars`  has each car name as a row name: 

```{r}
head(mtcars, 2)
head(as_tibble(mtcars), 2)
```

# Renaming Columns

## Renaming Columns of a `data.frame`: dplyr

To rename columns in `dplyr`, you use the `rename` command

```{r}
df = dplyr::rename(df, MPG = mpg)
head(df)
df = rename(df, mpg = MPG) # reset - don't need :: b/c not masked
```

## Renaming All Columns of a `data.frame`: dplyr

To rename all columns you use the `rename_all` command (with a function)

```{r}
df_upper = dplyr::rename_all(df, toupper)
head(df_upper)
```


## Lab Part 1

[Website](http://johnmuschelli.com/intro_to_r/index.html)

# Subsetting Columns

## Subset columns of a `data.frame`: 

We can grab the `carb` column using the `$` operator. 

```{r}
df$carb
```

## Subset columns of a `data.frame`: dplyr

The `select` command from `dplyr` allows you to subset (gives a `tibble`!)
```{r}
select(df, mpg)
```

## Subset columns of a `data.frame`: dplyr

If you wanted it to be a single vector (not a `tibble`), use `pull`:
```{r}
pull(select(df, mpg))
```

## Select columns of a `data.frame`: dplyr

The `select` command from `dplyr` allows you to subset columns matching strings:
```{r}
select(df, mpg, cyl)
select(df, starts_with("c"))
```

## See the Select "helpers"

Run the command:

```{r, eval = FALSE}
??tidyselect::select_helpers
```

Here are a few:

```{r, eval = FALSE}
one_of()
last_col()
ends_with()
contains() # like searching
matches() # Matches a regular expression - cover later
```

## Lab Part 2

[Website](http://johnmuschelli.com/intro_to_r/index.html)

# Subsetting Rows

## Subset rows of a `data.frame`: dplyr

The command in `dplyr` for subsetting rows is `filter`.  Try `?filter`

```{r}
filter(df, mpg > 20 | mpg < 14)
```
Note, no `$` or subsetting is necessary.  R "knows" `mpg` refers to a column of `df`.


## Subset rows of a `data.frame`: dplyr

You can have multiple logical conditions using the following:

* `==` : equals to
* `!` : not/negation
* `>` / `<`: greater than / less than
* `>=` or `<=`: greater than or equal to / less than or equal to
* `&` : AND
* `|` : OR

## Subset rows of a `data.frame`: dplyr

By default, you can separate conditions by commas, and `filter` assumes these statements are joined by `&`:

```{r}
filter(df, mpg > 20 & cyl == 4)
filter(df, mpg > 20, cyl == 4)
```

## Subset rows of a `data.frame`: dplyr

If you want OR statements, you need to do the pipe `|` explicitly:

```{r}
filter(df, mpg > 20 | cyl == 4)
```


## Lab Part 3

[Website](http://johnmuschelli.com/intro_to_r/index.html)

## Combining `filter` and `select`

You can combine `filter` and `select` to subset the rows and columns, respectively, of a `data.frame`:

```{r}
select(filter(df, mpg > 20 & cyl == 4), cyl, hp)
```

In `R`, the common way to perform multiple operations is to wrap functions around each other in a nested way such as above 

## Assigning Temporary Objects

One can also create temporary objects and reassign them:

```{r}
df2 = filter(df, mpg > 20 & cyl == 4)
df2 = select(df2, cyl, hp)
```

## Using the `pipe` (comes with `dplyr`):

Recently, the pipe `%>%` makes things such as this much more readable.  It reads left side "pipes" into right side.  RStudio `CMD/Ctrl + Shift + M` shortcut. Pipe `df` into `filter`, then pipe that into `select`:


```{r}
df %>% filter(mpg > 20 & cyl == 4) %>% select(cyl, hp)
```


# Adding/Removing Columns

## Adding new columns to a `data.frame`: base R

You can add a new column, called `newcol` to `df`, using the `$` operator:
```{r}
df$newcol = df$wt/2.2
head(df,3)
```


## Adding columns to a `data.frame`: dplyr

The `$` method is very common.

The `mutate` function in `dplyr` allows you to add or replace columns of a `data.frame`:
```{r}
df = mutate(df, newcol = wt/2.2)
```

```{r, echo = FALSE}
print(head({df = mutate(df, newcol = wt/2.2)}, 2))
```

## Creating conditional variables

One frequently-used tool is creating variables with conditions.

A general function for creating new variables based on existing variables is the `ifelse()` function, which "returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is `TRUE` or `FALSE`."

```
ifelse(test, yes, no)

# test: an object which can be coerced
    to logical mode.
# yes: return values for true elements of test.
# no: return values for false elements of test.
```


## Adding columns to a `data.frame`: dplyr

Combined with `ifelse(condition, TRUE, FALSE)`, it can give you:

```{r}
df = mutate(df, 
            disp_cat = ifelse(
              disp <= 200,
              "Low",
              ifelse(disp <= 400,
                     "Medium",
                     "High")
            )
)
head(df$disp_cat)
```

## Adding columns to a `data.frame`: dplyr

Alternatively, `case_when` provides a clean syntax as well:

```{r}
df = mutate(df, 
            disp_cat2 = case_when(
              disp <= 200 ~ "Low",
              disp > 200 & disp <= 400 ~ "Medium",
              disp > 400 ~ "High",
            ))
head(df$disp_cat2)
```


## Removing columns to a `data.frame`: base R

You can remove a column by assigning to `NULL`:
```{r, eval = FALSE}
df$newcol = NULL
```

## Removing columns to a `data.frame`: dplyr

The `NULL` method is still very common.

The `select` function can remove a column with minus (`-`):
```{r, eval = FALSE}
select(df, -newcol)
```

```{r, echo = FALSE}
head(select(df, -newcol))
```

## Removing columns to a `data.frame`: dplyr

Remove `newcol` and `drat`
```{r, eval = FALSE}
select(df, -one_of("newcol", "drat"))
```

```{r, echo = FALSE}
head(select(df, -one_of("newcol", "drat")))
```

# Ordering columns

## Ordering the columns of a `data.frame`: dplyr

The `select` function can reorder columns.  Put `newcol` first, then select the rest of columns:
```{r, eval = FALSE}
select(df, newcol, everything())
```

```{r, echo = FALSE}
head(select(df, newcol, everything()))
```

## Ordering the columns of a `data.frame`: dplyr

Put `newcol` at the end ("remove, everything, then add back in"):

```{r, eval = FALSE}
select(df, -newcol, everything(), newcol)
```

```{r, echo = FALSE}
head(select(df, -newcol, everything(), newcol))
```

# Ordering rows

## Ordering the rows of a `data.frame`: dplyr

The `arrange` function can reorder rows  By default, `arrange` orders in ascending order:
```{r}
arrange(df, mpg)
```

## Ordering the rows of a `data.frame`: dplyr
Use the `desc` to arrange the rows in descending order:
```{r}
arrange(df, desc(mpg))
```

## Ordering the rows of a `data.frame`: dplyr
It is a bit more straightforward to mix increasing and decreasing orderings:
```{r}
arrange(df, mpg, desc(hp))
```

## Transmutation

The `transmute` function in `dplyr` combines both the `mutate` and `select` functions. One can create new columns and keep the only the columns wanted:

```{r}
transmute(df, newcol2 = wt/2.2, mpg, hp)
```

## Lab Part 4

[Website](http://johnmuschelli.com/intro_to_r/index.html)


# Extra Slides

## Renaming Columns of a `data.frame`: base R

We can use the `colnames` function to extract and/or directly reassign column names of `df`:

```{r}
colnames(df) # just prints
colnames(df)[1:3] = c("MPG", "CYL", "DISP") # reassigns
head(df)
colnames(df)[1:3] = c("mpg", "cyl", "disp") #reset - just to keep consistent
```

## Renaming Columns of a `data.frame`: base R

We can assign the column names, change the ones we want, and then re-assign
the column names:
```{r}
cn = colnames(df)
cn[ cn == "drat"] = "DRAT"
colnames(df) = cn
head(df)
colnames(df)[ colnames(df) == "DRAT"] = "drat" #reset
```

## Subset rows of a `data.frame` with indices: 

Let's select **rows** 1 and 3 from `df` using brackets:

```{r}
df[ c(1, 3), ]
```

## Subset columns of a `data.frame`: 

We can also subset a `data.frame` using the bracket `[, ]` subsetting. 

For `data.frame`s and matrices (2-dimensional objects), the brackets are `[rows, columns]` subsetting.  We can grab the `x` column using the index of the column or the column name ("`carb`") 

```{r}
df[, 11]
df[, "carb"]
```

## Biggest difference between `tbl` and `data.frame`: 

Mostly, `tbl` (tibbles) are the same as `data.frame`s, except they don't print all lines.  When subsetting only one column using brackets, a `data.frame` will return a vector, but a `tbl` will return a `tbl`

```{r}
df[, 1]
tbl[, 1]
tbl[, "mpg"]
df[, 1, drop = FALSE]
```

## Subset columns of a `data.frame`: 

We can select multiple columns using multiple column names:

```{r}
df[, c("mpg", "cyl")]
```


## No rownames in tibbles!

If you run into losing a variable contained in your row names, use `rownames_to_column` to add it before turning it into a `tibble` to keep them:

```{r}
head(rownames_to_column(mtcars, var = "car"),  2)
head(as_tibble(rownames_to_column(mtcars, var = "car")),  2)
```