This is the second homework assignment for Intro to R. You must submit both the RMD and “knitted” HTML files are due before class on Wednesday (Day 3)

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

This is a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. More details on this dataset are here: http://www.kaggle.com/c/DontGetKicked/details/Background

Questions:

  1. Get the dataset: http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv read the data set in using read_csv(), name the dataset cars.

  2. Read the “dictionary”: http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary_formatted.txt use the read_tsv() function and name it key.

  3. What would you do if the data was formatted like this: http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary.txt with spaces and tabs and such? See the readLines function.

  4. Save the key and data in an “.rda” file so you can access the data offline using the save() function.

  5. How many cars are in the dataset? How many variables are recorded for each car?

  6. What is the range of the manufacturer’s years of the vehicles? Use VehYear

  7. How many cars were from before 2004, and what percent/proportion do these represent? Hint: use summarize/filter or sum

  8. Drop any vehicles that cost less than or equal to $1500 (VehBCost), or missing - (filter() removes missing values!). How many vehicles were removed? Note the rest of the questions expect answers based on this reduced dataset.

  9. How many different vehicle a) manufacturers/makes (Make), b) models (Model) and c) sizes (Size) are there? Hint: use table() or group_by() with tally()/summarize()

  10. Which vehicle a) make, b) model and c) color had the highest average acquisition cost paid for the vehicle at time of purchase, and what was this cost? Hint: use group_by() with summarize()

  11. Which vehicle a) make, b) model and c) color had the highest variability in acquisition cost paid for the vehicle at time of purchase?

  12. How many vehicles (using filter() or sum()):
  1. Were red and have fewer than 30,000 miles?
  2. Are made by GMC and were purchased in Texas?
  3. Are blue or red?
  4. Are made by Chrysler or Nissan and are white or silver?
  5. Are automatic, blue, Pontiac cars with under 40,000 miles?