Brook Luers
Statistics 306, Fall 2017

Lab 5, October 10, Solutions

Exam review exercises (solutions)

  1. The day variable is a factor (categorical) variable containing the day of the week when the cyclist crash occurred. month contains the month name.

    Consider the following graphs:

    (a) What geom is used in both graphs?
    geom_bar
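    (b) What variables are mapped to the x and y aesthetics?
    month is mapped to x. No variable is mapped to y: geom_bar computes the bar heights (counts) itself. day is mapped to the fill aesthetic.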
    (c) Write the commands to create both graphs.

        ggplot(cr) + geom_bar(aes(x = month, fill = day), position = 'fill')
        ggplot(cr) + geom_bar(aes(x = month, fill = day))
    

    (d) How would a ggplot expert describe the difference between the two graphs (using R and ggplot jargon)?
    The position adjustment is fill in the left graph and stack (the default) in the right-side graph.

    (e) How would a regular person describe the difference between the two graphs (what do they communicate about cyclist-involved crashes)?
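    With position='fill', each bar shows the proportion of that month's crashes that fell on each day of the week, instead of raw counts. The months have different total numbers of crashes, so with raw counts it is difficult to compare the day-of-week distribution of crashes across months.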
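
The effect of the two position adjustments can be checked on a small made-up data frame. This is only a sketch: the `toy` data frame and its values are hypothetical stand-ins for `cr`, and it assumes ggplot2 is installed. `layer_data()` returns the bar coordinates that ggplot computes, so the top of each stacked bar can be compared between the two plots.

```r
library(ggplot2)

# Hypothetical toy data: one row per crash (not the real `cr` data frame)
toy <- data.frame(
  month = c("Jan", "Jan", "Jan", "Feb", "Feb"),
  day   = c("Mon", "Tue", "Tue", "Mon", "Tue")
)

p_stack <- ggplot(toy) + geom_bar(aes(x = month, fill = day))
p_fill  <- ggplot(toy) + geom_bar(aes(x = month, fill = day), position = "fill")

d_stack <- layer_data(p_stack)
d_fill  <- layer_data(p_fill)

# Height of the full bar for each month (largest ymax at each x position)
top_stack <- tapply(d_stack$ymax, as.integer(d_stack$x), max)
top_fill  <- tapply(d_fill$ymax,  as.integer(d_fill$x),  max)

top_stack  # month totals, so the bars differ in height
top_fill   # every bar is rescaled to height 1
```

With the default stack adjustment the bar heights are the month totals, while with fill every bar has height 1 and only the split among days remains visible.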

  2. Write a command which creates a new data frame, called cr_wayne, which only contains crashes that occurred in Wayne county in 2013, 2014, and 2015 (the latest year in the data frame cr).

    cr_wayne <-
      filter(cr, County=='Wayne', year >= 2013)
    

    Then consider this command:

    cr_wayne %>% 
      group_by(month_num, year) %>%
      summarize(ncr = n()) %>%
      ggplot(aes(x=month_num, y=ncr)) + 
      geom_line()
    

    What aesthetic should be altered to fix the plot? Write the correct ggplot command so that the number of crashes in each month is plotted separately for each year.

    We need to map the group aesthetic to year, so that geom_line draws one line per year:

    cr_wayne %>% 
      group_by(month_num, year) %>%
      summarize(ncr = n()) %>%
      ggplot(aes(x=month_num, y=ncr, group=year)) + 
      geom_line()
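
To see what the group aesthetic changes, the following sketch builds both versions of the plot on made-up data and counts the line groups ggplot would draw. The `toy` data frame and its numbers are hypothetical, and the sketch assumes ggplot2 is installed.

```r
library(ggplot2)

# Hypothetical monthly crash counts for two years (made-up numbers)
toy <- data.frame(
  month_num = rep(1:3, times = 2),
  year      = rep(c(2014, 2015), each = 3),
  ncr       = c(4, 6, 5, 3, 7, 8)
)

p_one  <- ggplot(toy, aes(x = month_num, y = ncr)) + geom_line()
p_year <- ggplot(toy, aes(x = month_num, y = ncr, group = year)) + geom_line()

# Number of distinct line groups in each plot
n_groups_one  <- length(unique(layer_data(p_one)$group))   # one path through all points
n_groups_year <- length(unique(layer_data(p_year)$group))  # one path per year

n_groups_one
n_groups_year
```

Without the group aesthetic, all points belong to a single path, which produces the zig-zag between years at each month.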
    
  3. Consider the following command.

    filter(cr, County=="Washtenaw") %>%                                 # line 1
      group_by(year, month) %>%                                         #      2
      summarise(ncr = n()) %>%                                          #      3
      group_by(year) %>%                                                #      4
      mutate(rank_ncr = min_rank(desc(ncr))) %>% filter(rank_ncr <= 2) %>% #   5
      select(-rank_ncr) %>%                                             #      6
      arrange(desc(year))                                               #      7
    

    (a) Describe, in words, what is accomplished by lines 1–3. Write down two (possible) rows of the data frame that results from running only lines 1–3 (with the final %>% operator removed).
    These lines compute the number of crashes in each year-month combination for Washtenaw county. For example:

        ## # A tibble: 137 x 3
        ## # Groups:   year [?]
        ##    year  month   ncr
        ##   <int> <fctr> <int>
        ## 1  2004  March     2
        ## 2  2004  April     5
        ## 3  2004    May    12
    

    (b) To describe lines 4 and 5, complete these sentences:
    For each year, rank the months (in that year) by the number of crashes that occurred. Then keep the two months with the highest number of crashes.

    (c) Now describe what the entire command accomplishes. Write down possible values for the first three rows of the result.
    For each year in Washtenaw county, find the two months with the highest number of cyclist-involved crashes. Sort from latest year to earliest year.

    (d) How many rows does the resulting data frame contain? Assume that there were cyclist-car crashes in all years and months in Washtenaw county. There are 12 years represented in this data set.
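    26 rows. Without ties the answer would be 2 × 12 = 24 rows; min_rank gives tied months the same rank, so any year with a tie for first or second place contributes more than two rows.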
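
The row count depends on how min_rank treats ties. The following sketch applies lines 4 and 5 of the command to a made-up table in which two months tie for second place in one year. The `toy` tibble and its values are hypothetical, and the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical monthly counts; in 2015 two months tie for second place
toy <- tibble(
  year  = c(2014, 2014, 2014, 2015, 2015, 2015, 2015),
  month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar", "Apr"),
  ncr   = c(3, 8, 5, 10, 7, 7, 2)
)

top2 <- toy %>%
  group_by(year) %>%
  mutate(rank_ncr = min_rank(desc(ncr))) %>%  # rank 1 = most crashes in that year
  filter(rank_ncr <= 2) %>%
  ungroup()

# Rows kept per year: 2014 keeps 2 rows, 2015 keeps 3 because of the tie
rows_per_year <- count(top2, year)
rows_per_year
```

If exactly two rows per year are required regardless of ties, a tie-breaking ranking such as row_number() could be used in place of min_rank().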

  4. Recall the data frame cr_year, which contains the number of cyclist-car crashes in each year and County:

    head(cr_year)
    
    ## # A tibble: 6 x 3
    ## # Groups:   County [6]
    ##    County  year ncrash
    ##    <fctr> <int>  <int>
    ## 1 Allegan  2004     18
    ## 2  Alpena  2004     12
    ## 3  Antrim  2004      2
    ## 4  Baraga  2004      1
    ## 5   Barry  2004      9
    ## 6     Bay  2004     31
    

    Suppose I create a vector containing Washtenaw county and the counties that surround it, like this:

    county_list <-  c('Wayne','Washtenaw',
                      "Livingston","Jackson","Ingham",
                      "Oakland","Lenawee","Monroe")
    

    Fill in the following code to create the graph below. The blue line and points plot the number of crashes for Washtenaw county.

    ggplot(filter(cr_year, County %in% county_list), 
            aes(x=year,y=ncrash,group=County)) +
      geom_line(aes(color=County=="Washtenaw"), show.legend = FALSE) +
      geom_point(aes(color=County=="Washtenaw"),show.legend = FALSE)
          
    

Extra exercise (using your computer)

Continuing with the crash data, recreate the following plot:

Map the x aesthetic to the hour_num variable. You will need to compute the y variable using group_by, summarise and mutate. The plot displays, for each day of the week, the proportion of that day's crashes that occur during each hour of the day (all years and counties are pooled).

cr %>% 
  filter(!is.na(hour_num)) %>%
  group_by(day, hour_num) %>%
  summarise(ncr_hour = n()) %>%
  group_by(day) %>%
  mutate(cr_prop_hour = ncr_hour / sum(ncr_hour)) %>%
  ggplot(aes(x=hour_num,y=cr_prop_hour)) + 
  geom_line(aes(group=day)) + 
  facet_wrap(~day) + xlab("Hour of day")+
  ylab("Proportion of crashes")+
  ggtitle("Within-day timing of cyclist-car crashes")
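
The group_by, summarise, mutate pattern used to build the y variable can be checked on a few made-up rows. In this sketch the `toy` tibble is hypothetical, dplyr is assumed to be installed, and `.groups = "drop"` (available in dplyr 1.0 and later) is an addition that is not part of the lab code above.

```r
library(dplyr)

# Hypothetical crash records (made-up), one row per crash
toy <- tibble(
  day      = c("Mon", "Mon", "Mon", "Mon", "Sat", "Sat"),
  hour_num = c(8, 8, 17, NA, 12, 14)
)

props <- toy %>%
  filter(!is.na(hour_num)) %>%                          # drop crashes with unknown hour
  group_by(day, hour_num) %>%
  summarise(ncr_hour = n(), .groups = "drop") %>%       # crashes per day-hour pair
  group_by(day) %>%
  mutate(cr_prop_hour = ncr_hour / sum(ncr_hour)) %>%   # share of that day's crashes
  ungroup()

# Within each day the proportions should add up to 1
totals <- props %>% group_by(day) %>% summarise(total = sum(cr_prop_hour))
totals
```

Because the second group_by uses day alone, sum(ncr_hour) is the total for that day, so each facet of the plot shows proportions that sum to 1.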