Instructions

  1. You must submit both the RMD and “knitted” HTML files as one compressed .zip to the Homework 2 Drop Box on CoursePlus.
  2. All assignments are due by the end of the grading period for this term (26 June 2020).

Getting Started

In this assignment, we will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. More details on this dataset are here: https://www.kaggle.com/c/DontGetKicked/overview/background.

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

Problem Set

  1. Get the dataset: http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv.
    Read the data set in using read_csv() and name the dataset “cars”.

  2. Import the “dictionary” from http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary_formatted.txt.
    Use the read_tsv() function and name it “key”.

  3. (Optional) What would you do if the data was formatted with spaces and tabs and such, like this one: http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary.txt
    Hint: see the readLines() function.

  4. Save the key and data in an .rda file so you can access the data offline using the save() function.

  5. How many cars are in the dataset? How many variables are recorded for each car?

  6. What is the range of the manufacturer’s years of the vehicles? Use “VehYear”.

  7. How many cars were from before 2004, and what percent/proportion do these represent?
    Hint: use summarize() and filter() functions or sum().

  8. Drop any vehicles that cost less than or equal to $1500 (“VehBCost”) or that have missing values. How many vehicles were removed?
    Hint: filter() removes missing values.

Use this reduced dataset (generated in question 8) for all subsequent questions (9 through 12).

  1. How many different vehicle a) manufacturers/makes (“Make”), b) models (“Model”), and c) sizes (“Size”) are there?
    Hint: use table() or group_by() with tally() or summarize().

  2. Which vehicle a) make, b) model, and c) color had the highest average acquisition cost paid for the vehicle at time of purchase, and what was this cost?
    Hint: use group_by() with summarize(). Be sure to use the key to find the column that corresponds to the aquisition cost paid for the vehicle at time of purchase!

  3. Which vehicle a) make, b) model, and c) color had the highest variability in acquisition cost paid for the vehicle at time of purchase?

  4. How many vehicles (using filter() or sum() ):

    1. Were red and have fewer than 30,000 miles?
    2. Are made by GMC and were purchased in Texas?
    3. Are blue or red?
    4. Are made by Chrysler or Nissan and are white or silver?
    5. Are automatic, blue, Pontiac cars with under 40,000 miles?