Lab 6, October 24, Solutions
Reading a csv file
Let’s try reading this file without specifying data types.
ab <- read_csv('listings.csv')
What data type was given to the price variable?
The price variable was imported as a character vector, but it should represent a numeric value.
I’ve created a column specification for you to try. First we will need the levels (unique values) for a few categorical variables.
(rmlvl <- unique(ab$room_type))
## [1] "Private room" "Entire home/apt" "Shared room"
Using the above code as a model, create two other vectors named proplvl and bedlvl that contain the unique values of the variables property_type and bed_type.
proplvl <- unique(ab$property_type)
bedlvl <- unique(ab$bed_type)
Now copy and paste the following code to import the data with our own column specification. The cols_only function will only import the listed columns.
# code omitted
col_double() for price failed because the dollar signs cause parsing errors.
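To see this failure in isolation, here is a minimal sketch using readr's vector parsers on two made-up price strings (they are illustrative values, not rows from listings.csv). It assumes only that the readr package is installed.

```r
library(readr)

# Made-up price strings in the same "$" format as the lab data (an assumption).
raw_price <- c("$85.00", "$1,200.00")

# parse_double() cannot handle the "$" or the "," and returns NA with a warning.
as_double <- suppressWarnings(parse_double(raw_price))

# parse_number() drops non-numeric characters and the grouping mark.
as_number <- parse_number(raw_price)

print(as_double)
print(as_number)
```

The same distinction carries over to col_double() and col_number() inside a column specification.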
Fix the column specification so that price and cleaning_fee are properly formatted.
col_number() will remove the dollar signs for us:
colspec$cols[['cleaning_fee']] <- col_number()
colspec$cols[['price']] <- col_number()
ab <- read_csv('listings.csv', col_types = colspec)
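Since the lab's full column specification is omitted here, the following is only a sketch of how cols_only, col_factor, and col_number fit together. The three-row CSV, the id column, and the names toy_csv, toy_spec, and toy are hypothetical stand-ins, and the I() wrapper for literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical three-row stand-in for listings.csv (not the real data).
toy_csv <- I(paste(
  "id,room_type,price,cleaning_fee",
  "1,Private room,$85.00,$10.00",
  "2,Entire home/apt,\"$1,200.00\",$75.00",
  "3,Shared room,$40.00,",
  sep = "\n"
))

# Only the columns named here are imported; id is dropped.
toy_spec <- cols_only(
  room_type    = col_factor(levels = c("Private room", "Entire home/apt", "Shared room")),
  price        = col_number(),
  cleaning_fee = col_number()
)

toy <- read_csv(toy_csv, col_types = toy_spec)
print(toy)
```

The empty cleaning_fee in the last row is read as NA, which is readr's default treatment of empty fields.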
Is ab a tibble or a regular data.frame? How do you know?
ab is a tibble: read_csv returns one, and printing ab shows a "# A tibble" header.
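One way to check this programmatically is sketched below on two small made-up objects (tb and df are illustrative names, not part of the lab data). It assumes the tibble package is installed.

```r
library(tibble)

tb <- tibble(x = 1:3)       # a tibble
df <- data.frame(x = 1:3)   # a regular data.frame

print(is_tibble(tb))  # TRUE for a tibble
print(is_tibble(df))  # FALSE for a plain data.frame
print(class(tb))      # a tibble's class vector includes "tbl_df"
```

Running is_tibble(ab) or class(ab) on the imported data answers the question the same way.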
Use pipes, the $ operator, and the table function to list the number of properties in each city. List the number of properties in each neighborhood using the neighbourhood_cleansed variable.
ab %>% .$city %>% table
ab %>% .$neighbourhood_cleansed %>% table
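The same pattern can be tried on a tiny made-up data set. In this sketch the city values and the names toy, tab, and cnt are invented for illustration; dplyr's count function is shown only as an alternative that returns a tibble instead of a table object.

```r
library(dplyr)

# Made-up cities, not values from listings.csv.
toy <- tibble(city = c("Oakland", "Berkeley", "Oakland", "Oakland", "Berkeley"))

# Pipe the column into table(), as in the lab.
tab <- toy %>% .$city %>% table()
print(tab)

# Alternative: count() keeps the result as a tibble, sorted by frequency.
cnt <- count(toy, city, sort = TRUE)
print(cnt)
```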
Suppose I store a variable name as a string:
vn <- 'price'
Use pipes and [[ to select the variable stored in vn from ab. Pipe the result to the summary function.
ab %>% .[[vn]] %>% summary
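The reason for using [[ here is that $ treats what follows it as a literal column name. A small sketch on made-up data (toy, by_name, by_dollar, and med are illustrative names) shows the difference, along with dplyr's .data pronoun as one way to use a stored name inside summarise.

```r
library(dplyr)

toy <- tibble(price = c(40, 85, 1200))  # made-up prices
vn <- "price"

# [[ looks up the column whose name is stored in vn.
by_name <- toy %>% .[[vn]]

# $ looks for a column literally called "vn", which does not exist.
by_dollar <- suppressWarnings(toy$vn)

# Inside dplyr verbs, .data[[vn]] plays the same role as [[.
med <- summarise(toy, med = median(.data[[vn]]))

print(by_name)
print(is.null(by_dollar))
print(med)
```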
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0    80.0   140.0   172.8   200.0  4000.0
Plot a histogram of the prices (rental cost for a single night). Are there any unusual (high or low) values?
There are some prices as high as 3000 or 4000.
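A histogram along these lines could be drawn as sketched below. To keep the example self-contained it uses simulated prices with a few extreme values appended (toy is not the lab data), and the bin width of 50 is an arbitrary choice. With the real data, ab would replace toy.

```r
library(ggplot2)

# Simulated nightly prices plus a few extreme values (illustrative only).
set.seed(1)
toy <- data.frame(
  price = c(round(rlnorm(500, meanlog = 5, sdlog = 0.6)), 0, 3000, 4000)
)

p <- ggplot(toy, aes(x = price)) +
  geom_histogram(binwidth = 50) +
  xlab("Price per night") +
  ylab("Number of listings")

print(p)

# The range makes the extreme values easy to spot numerically.
print(range(toy$price))
```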
Remove properties with absurd (high or low) prices.
ab <- filter(ab, price <= 2500, price > 0)
Create a vector called nbh50 with the names of neighborhoods with at least 50 listings, sorted by median price. I suggest using group_by, summarise, arrange, and filter. What neighborhoods have the highest and lowest median price?
nbh50 <- ab %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(np = n(), medprice = median(price)) %>%
  filter(np >= 50) %>%   # "at least 50 listings"
  arrange(medprice) %>%
  .[[1]]                 # first column: the neighborhood names
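The group, summarise, filter, and arrange steps can be checked on a tiny made-up data set where the answer is easy to verify by hand. In this sketch toy, nbh, min_n, and nbh_sorted_toy are hypothetical names, and the threshold of 3 stands in for the lab's 50. dplyr's pull function extracts the name column as a plain vector.

```r
library(dplyr)

# Made-up listings: A has median 120, B has median 60, C has one listing.
toy <- tibble(
  nbh   = c("A", "A", "A", "B", "B", "B", "C"),
  price = c(100, 120, 140, 50, 60, 70, 500)
)

min_n <- 3  # stand-in for the lab's threshold of 50

nbh_sorted_toy <- toy %>%
  group_by(nbh) %>%
  summarise(np = n(), medprice = median(price)) %>%
  filter(np >= min_n) %>%   # keep groups with at least min_n listings
  arrange(medprice) %>%     # lowest median price first
  pull(nbh)

print(nbh_sorted_toy)
```

Because the vector is sorted by increasing median price, its first and last elements are the neighborhoods with the lowest and highest median price.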
Then enter this command to create a new factor variable with levels corresponding to the sorted neighborhoods:
ab <- mutate(ab, nbh_sorted = factor(neighbourhood_cleansed, levels=nbh50))
Compute the quintiles of the price distribution across all properties. Store the result in a vector called pquint. There should be six values in this vector, including the minimum and maximum prices.
(pquint <- ab %>% .[[vn]] %>% quantile(probs=seq(0,1,0.2)))
##   0%  20%  40%  60%  80% 100%
##   17   75  119  169  235 2000
Enter this command, which makes a factor variable containing the price quintile for each property:
ab <- mutate(ab, price_q = cut(price, breaks = pquint))
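One detail of cut is worth knowing here: by default its intervals are open on the left, so a value equal to the smallest break falls outside every interval and becomes NA. The sketch below illustrates this on the made-up vector 1 to 10 (x, brk, q_default, and q_closed are illustrative names). The include.lowest argument is a standard option of cut, not something the lab's command uses.

```r
# Made-up values; the breaks are their quintiles, as in the lab.
x <- 1:10
brk <- quantile(x, probs = seq(0, 1, 0.2))

# Default: intervals are (a, b], so the minimum value gets NA.
q_default <- cut(x, breaks = brk)

# include.lowest = TRUE closes the first interval on the left.
q_closed <- cut(x, breaks = brk, include.lowest = TRUE)

print(sum(is.na(q_default)))
print(sum(is.na(q_closed)))
print(table(q_closed))
```

With the lab's command as written, properties priced at exactly the minimum get NA for price_q, so any later use of price_q needs to handle those missing values.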
Finally, create one or two informative graphs that display the distribution of prices within each of the nbh50 neighborhoods. Filter ab to only contain properties in those neighborhoods, and map the price_q categorical variable to an aesthetic such as color.
ggplot(filter(ab, neighbourhood_cleansed %in% nbh50, !is.na(price_q))) +
  geom_point(aes(x = nbh_sorted, y = price, color = price_q),
             position = position_jitter(w = 0.2, h = 0)) +
  scale_color_brewer(palette = 'YlOrRd') +
  theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
  xlab("") + ylab("Price")