Reference: R4DS Chapter 11
Here is a review of
read_csv and the
parse_ functions that convert raw text data to R vectors. Let’s create some example data:
ssc <- "***Sesame Street Characters*** id,name,v2,v3 1,Oscar,$30.0,2017-10-01 1,Oscar,$27.0,2017-09-09 2,Big Bird,$18.3,2017-08-31 2,Big Bird,$17.5,na 4,Grover,$16.0,2017-10-03 " id <- c(1,1,2,2,4) nm <- c('Oscar','Oscar','Big Bird','Big Bird','Grover') v2 <- c('$30.0','$27.0','$18.3','$17.5','$16.0') v3 <- c('2017-10-01','2017-09-09','2017-08-31','na','2017-10-03')
For numeric data:
## Warning: 5 parsing failures. ## row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 1 NA a double $30.0 row 2 2 NA a double $27.0 col 3 3 NA a double $18.3 expected 4 4 NA a double $17.5 actual 5 5 NA a double $16.0 ##  NA NA NA NA NA ## attr(,"problems") ## # A tibble: 5 x 4 ## row col expected actual ## <int> <int> <chr> <chr> ## 1 1 NA a double $30.0 ## 2 2 NA a double $27.0 ## 3 3 NA a double $18.3 ## 4 4 NA a double $17.5 ## 5 5 NA a double $16.0
parse_double is strict about number formatting. For the dollar amounts in
v2 we can use
parse_number(v2) # removes the dollar signs
##  30.0 27.0 18.3 17.5 16.0
For categorical variables we can use
clevels <- c('Big Bird','Grover','Bert','Ernie','Oscar') parse_factor(nm, levels=clevels)
##  Oscar Oscar Big Bird Big Bird Grover ## Levels: Big Bird Grover Bert Ernie Oscar
Dates are formatted with
## Warning: 1 parsing failure. ## row # A tibble: 1 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 4 NA date like na ##  "2017-10-01" "2017-09-09" "2017-08-31" NA "2017-10-03"
Specify the values that should be treated as
##  "2017-10-01" "2017-09-09" "2017-08-31" NA "2017-10-03"
Putting it all together, we can import this data set with
read_csv, specifying the variable types with
read_csv(ssc, skip = 1, # skip the first line of the file col_types = cols( id = col_integer(), name = col_factor(levels=clevels), v2 = col_number(), # removes the dollar signs v3 = col_date() ), na=c('','na') )
## # A tibble: 5 x 4 ## id name v2 v3 ## <int> <fctr> <dbl> <date> ## 1 1 Oscar 30.0 2017-10-01 ## 2 1 Oscar 27.0 2017-09-09 ## 3 2 Big Bird 18.3 2017-08-31 ## 4 2 Big Bird 17.5 NA ## 5 4 Grover 16.0 2017-10-03
We will use data on Airbnb listings in Boston compiled by Inside Airbnb. Download the first file for Boston, described as “Detailed listings data for Boston”. Decompress the
.gz file and save the resulting
.csv file in your working directory.
Let’s try read this file without specifying data types.
ab <- read_csv('listings.csv')
Examine the column types that were chosen by default:
What data type was given to the
price variable? Print a few
price values to understand what is stored in this variable.
I’ve created a column specification for you to try. First we will need the levels (unique values) for a few categorical variables.
(rmlvl <- unique(ab$room_type))
##  "Private room" "Entire home/apt" "Shared room"
Using the above code as a model, create two other vectors named
bedlvl that contain the unique values of the variables
Now copy and paste the following code to import the data with our own column specification. The
cols_only function will only import the listed columns.
colspec <- cols_only( price = col_double(), zipcode = col_character(), city = col_character(), description = col_character(), name = col_character(), state = col_character(), is_location_exact = col_character(), latitude = col_double(), longitude = col_double(), neighbourhood_cleansed = col_character(), calculated_host_listings_count = col_integer(), availability_365 = col_integer(), host_is_superhost = col_character(), host_since = col_date(), review_scores_rating = col_integer(), property_type = col_factor(levels = proplvl), room_type = col_factor(levels = rmlvl), bathrooms = col_double(), bedrooms = col_integer(), beds = col_integer(), accommodates = col_integer(), bed_type = col_factor(levels = bedlvl), guests_included = col_integer(), cleaning_fee = col_double(), minimum_nights = col_integer(), number_of_reviews = col_integer(), review_scores_rating = col_double(), instant_bookable = col_character(), last_review = col_date(), first_review = col_date(), market = col_character() ) ab <- read_csv('listings.csv', col_types = colspec)
What happened? Enter the command
Fix the column specification so that
cleaning_fee are properly formatted. You can do this by completing the following commands:
colspec$cols[['cleaning_fee']] <- # complete this command colspec$cols[['price']] <- # complete this command ab <- read_csv('listings.csv', col_types = colspec)
aba tibble or a regular
data.frame? How do you know?
Use pipes, the
$ operator, and the
table function to list the number of properties in each
city. List the number of properties in each neighborhood using the
Suppose I store a variable name as a string:
vn <- 'price'
Use pipes and
[[ to select the variable stored in
ab. Pipe the result to the
summary function. You should see output like this:
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.0 80.0 140.0 172.8 200.0 4000.0
Plot a histogram of the prices (rental cost for a single night). Are there any unusual (high or low) values? Try a few different
Remove properites with absurd (high or low) prices. What cutoff values did you choose and why?
Create a vector called
nbh50 with the names of neighborhoods with at least 50 listings, sorted by median price. I suggest using
filter. What neighborhoods have the highest and lowest median price?
Then enter this command to create a new factor variable with levels corresponding to the sorted neighborhoods:
ab <- mutate(ab, nbh_sorted = factor(neighbourhood_cleansed, levels=nbh50))
Compute the quintiles of the price distribution across all properties. Store the result in a vector called
pquint. There should be six values in this vector, including the minimum and maximum prices. Here is an example of the
quantile(c(100,50,73,1,2), probs=c(0, 0.1,0.5,0.9,1))
## 0% 10% 50% 90% 100% ## 1.0 1.4 50.0 89.2 100.0
Enter this command, which makes a factor variable containing the price quintile for each property:
ab <- mutate(ab, price_q = cut(price, breaks = pquint))
abto only contain properties in those neighborhoods, and map the
price_qcategorical variable to either
fill. Some possibilities include boxplots, histograms, density plots, and (jittered) points. Use
nbh_sortedas your neighborhood variable.