---
title: "Data Cleaning Lab"
output: html_document
editor_options:
chunk_output_type: console
---
# Data used
Bike Lanes Dataset: BikeBaltimore is the Department of Transportation's bike program.
The data is from http://data.baltimorecity.gov/Transportation/Bike-Lanes/xzfj-gyms
You can Download as a CSV in your current working directory. Note its also available at: http://johnmuschelli.com/intro_to_r/data/Bike_Lanes.csv
```{r}
library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(jhur)
library(tidyverse)
library(broom)
bike = read_csv(
"http://johnmuschelli.com/intro_to_r/data/Bike_Lanes.csv")
```
# Part 1
1. Count the number of rows of the bike data and count the number of complete cases of the bike data. Use `sum` and `complete.cases`.
```{r}
```
2. Create a data set called `namat` which is equal to `is.na(bike)`.
What is the class of `namat`? Run rowSums and colSums on `namat.`
These represent the number of missing values in the rows and columns of
`bike`. Don't print `rowSums`, but do a table of the `rowSums`
Try `nabike = bike %>% mutate_all(is.na)`. Then try `nabike %>% summarise_all(sum)`. What do you get?
```{r}
```
3. Filter rows of bike that are NOT missing the `route` variable, assign
this to the object `have_route.` Do a table of the `subType` using `table`,
including the missing `subType`s Get the same frequency distribution
using `group_by(subType)` and `tally()`
```{r}
```
4. Filter rows of bike that have the type `SIDEPATH` or `BIKE LANE`
using `%in%`. Call it `side_bike.` Confirm this gives you the same number of results using the `|` and
`==`.
```{r}
```
# Part 2
5. Do a cross tabulation of the bike `type` and the number of lanes.
Call it `tab`. Do a `prop.table` on the rows and columns margins.
Try `as.data.frame(tab)` or `broom::tidy(tab)`
```{r}
```
# Part 3
**New Data set**
Download the "Real Property Taxes" Data from my website (via OpenBaltimore):
http://johnmuschelli.com/intro_to_r/data/Real_Property_Taxes.csv.gz
note you don't need to unzip it to read it into R
6. Read the Property Tax data into R and call it the variable `tax`
```{r}
```
7. How many addresses pay property taxes?
```{r}
```
8. What is the total city and state tax paid?
You need to remove the `$` from the `CityTax` variable
then you need to make it numeric. Try `str_replace`, but remember
$ is "special" and you need `fixed()` around it.
```{r}
```
9. Using `table()` or `group_by` and `summarize(n())` or `tally()`
a. how many observations/properties are in each ward?
```{r}
```
b. what is the mean state tax per ward? use `group_by` and `summarize`
```{r}
```
c. what is the maximum amount still due in each ward? different summarization (`max`)
```{r}
```
d. What is the 75th percentile of city and state tax paid by Ward? (`quantile`)
```{r}
```
10. Make boxplots using showing `cityTax` (`y` -variable) by whether the property is a principal residence (`x`) or not.
```{r}
```
11. Subset the data to only retain those houses that are principal residences.
which command subsets rows? Filter or select?
a) How many such houses are there?
```{r}
```
b) Describe the distribution of property taxes on these residences. Use
hist with certain breaks or `plot(density(variable))`
```{r}
```
**New Data Set**
Read in the Salary FY2015 dataset
```{r}
sal = read_csv("http://johnmuschelli.com/intro_to_r/data/Baltimore_City_Employee_Salaries_FY2015.csv")
```
12. Make an object called health.sal using the salaries data set, with only agencies of those with `"fire"` (or any forms), if any, in the name remember `fixed( ignore_case = TRUE)` will ignore cases
```{r}
```
13. Make a data set called `trans` which contains only agencies that contain "TRANS".
```{r}
```
14. What is/are the profession(s) of people who have `"abra"` in their name for Baltimore's Salaries? Case should be ignored
```{r}
```
15. What is the distribution of annual salaries look like? (use `hist`, 20 breaks) What is the IQR?
Hint: first convert to numeric. Try `str_replace`, but remember`$` is "special" and you need `fixed()` around it.
```{r}
```
16. Convert `HireDate` to the `Date` class - plot Annual Salary vs Hire Date.
Use `AnnualSalary ~ HireDate` with a `data = sal` argument in plot or use
x, y notation in `scatter.smooth`
Use `lubridate` package. Is it `mdy(date)` or `dmy(date)` for this data - look at `HireDate`
```{r}
```
17. Create a smaller dataset that only includes the Police Department, Fire Department and Sheriff's Office. Use the Agency variable with string matching. Call this `emer`. How many employees are in this new dataset?
```{r}
```
18. Create a variable called `dept` in the emer data set.
`dept = str_extract(Agency, ".*(ment|ice)")`. E.g. want to extract all characters
up until `ment` or `ice` (we can group in `regex` using parentheses) and then discard
the rest. Replot annual salary versus hire date, color by `dept` (not yet - using ggplot)
```{r}
```
19 (Bonus). Convert the 'LotSize' variable to a numeric square feet variable in the tax data set. Some tips:
a) 1 acre = 43560 square feet
b) The hyphens represent inches (not decimals)
```{r}
```