In this assignment, we will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. More details on this dataset are here: https://www.kaggle.com/c/DontGetKicked/overview/background.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
Get the dataset: http://johnmuschelli.com/intro_to_r/data/kaggleCarAuction.csv.
Read the data set in using read_csv() and name the dataset “cars”.
Import the “dictionary” from http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary_formatted.txt. Use the read_tsv() function and name it “key”.
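As a rough illustration of the two readers, the sketch below writes two tiny stand-in files to a temporary location and reads them back. The file contents, the key's column names, and the object names `cars_toy` and `key_toy` are made up for the example; with the real data you would pass the URLs above instead of the temporary paths.

```r
library(readr)

# Toy stand-ins for the real files; contents are illustrative only.
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".txt")
writeLines(c("RefId,VehYear,VehBCost", "1,2006,7100", "2,2004,7600"), csv_path)
writeLines(c("Field\tDefinition", "VehYear\tYear of the vehicle"), tsv_path)

cars_toy <- read_csv(csv_path)  # comma-separated values
key_toy  <- read_tsv(tsv_path)  # tab-separated values

dim(cars_toy)  # number of rows, then number of columns
```

Both functions return a tibble and guess column types from the file contents.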
(Optional) What would you do if the data was formatted with spaces and tabs and such, like this one: http://johnmuschelli.com/intro_to_r/data/Carvana_Data_Dictionary.txt
Hint: see the
Save the key and data in an .rda file so you can access the data offline using the
How many cars are in the dataset? How many variables are recorded for each car?
What is the range of the manufacturer’s years of the vehicles? Use “VehYear”.
How many cars were from before 2004, and what percent/proportion do these represent?
filter() functions or
Drop any vehicles that cost less than or equal to $1500 (“VehBCost”) or that have missing values. How many vehicles were removed?
filter() removes missing values.
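The behaviour described in this hint can be seen on a small made-up data frame. This is only a sketch: the column names `id` and `cost` and the values are hypothetical, and it covers missing values in the filtered column only (other columns would need their own check, for example with tidyr's drop_na()).

```r
library(dplyr)

# Toy data: made-up values, not taken from the auction file.
toy <- data.frame(
  id   = 1:5,
  cost = c(1200, 1500, 3000, NA, 8000)
)

# filter() keeps only rows where the condition is TRUE,
# so rows where the condition evaluates to NA are dropped as well.
kept <- toy %>% filter(cost > 1500)

# Rows removed = rows before minus rows after.
n_removed <- nrow(toy) - nrow(kept)
n_removed
```

Comparing row counts before and after the filter is one simple way to report how many rows were dropped.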
Use this reduced dataset (generated in question 8) for all subsequent questions (9 through 12).
How many different vehicle a) manufacturers/makes (“Make”), b) models (“Model”), and c) sizes (“Size”) are there?
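One possible way to count distinct values is dplyr's n_distinct(). The sketch below uses a made-up data frame; the values are hypothetical and only the column names “Make” and “Size” come from the question.

```r
library(dplyr)

# Toy data: made-up values for illustration.
toy <- data.frame(
  Make = c("FORD", "FORD", "DODGE", "KIA"),
  Size = c("MEDIUM", "LARGE", "MEDIUM", "COMPACT")
)

# n_distinct() counts the unique values in a column.
counts <- toy %>%
  summarize(n_make = n_distinct(Make), n_size = n_distinct(Size))

counts
```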
Which vehicle a) make, b) model, and c) color had the highest average acquisition cost paid for the vehicle at time of purchase, and what was this cost?
summarize(). Be sure to use the key to find the column that corresponds to the acquisition cost paid for the vehicle at time of purchase!
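A grouped average can be sketched as follows on made-up data. The column name `cost` is a placeholder, not the real acquisition-cost column (look that up in the key), and the values are hypothetical.

```r
library(dplyr)

# Toy data: "cost" is a placeholder column name, values are made up.
toy <- data.frame(
  Make = c("FORD", "FORD", "DODGE", "DODGE", "KIA"),
  cost = c(4000, 6000, 7000, 9000, 5500)
)

# One row per group, with the group mean, sorted from highest to lowest.
avg_by_make <- toy %>%
  group_by(Make) %>%
  summarize(mean_cost = mean(cost, na.rm = TRUE)) %>%
  arrange(desc(mean_cost))

avg_by_make
```

Sorting in descending order puts the group with the highest average in the first row; the same pattern applies to any other grouping column.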
Which vehicle a) make, b) model, and c) color had the highest variability in acquisition cost paid for the vehicle at time of purchase?
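Variability can be measured in several ways; the sketch below assumes the standard deviation, which is a choice made for this example and not something the question specifies. The data and the `cost` column name are made up.

```r
library(dplyr)

# Toy data: "cost" is a placeholder column name, values are made up.
toy <- data.frame(
  Color = c("RED", "RED", "BLUE", "BLUE", "GOLD"),
  cost  = c(4000, 8000, 5000, 5200, 6000)
)

# sd() of a single observation is NA, so keep the group size alongside it.
sd_by_color <- toy %>%
  group_by(Color) %>%
  summarize(sd_cost = sd(cost, na.rm = TRUE), n = n()) %>%
  arrange(desc(sd_cost))

sd_by_color
```

Groups with only one vehicle get an NA standard deviation and sort to the bottom, so it is worth checking `n` before naming a "most variable" group.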
How many vehicles (using