Brook Luers
Teaching    >    Statistics 306, Fall 2017    >    Lab 8

Lab 8, November 7, Solutions

Exercise: Cyclist crashes in MI (again)

cr <- read_csv("cyclist_crashes.txt") %>% 
  rename(city = City.or.Township) # for convenience

Recall that this dataset contains one row for each bicycle-involved crash in Michigan between 2004 and 2015.

  1. Compute the number of crashes in each County-year combination. Store the result in a data frame called cr_year.

    cr_year <-
      cr %>%
      group_by(County, year) %>%
      summarise(ncr = n())
    
    print(cr_year)
    
    ## # A tibble: 857 x 3
    ## # Groups:   County [?]
    ##     County  year   ncr
    ##      <chr> <int> <int>
    ##  1  Alcona  2006     1
    ##  2  Alcona  2009     1
    ##  3   Alger  2008     2
    ##  4   Alger  2009     1
    ##  5   Alger  2012     1
    ##  6 Allegan  2004    18
    ##  7 Allegan  2005     9
    ##  8 Allegan  2006    14
    ##  9 Allegan  2007    10
    ## 10 Allegan  2008    11
    ## # ... with 847 more rows
    
  2. The dataset cr_year illustrates “implicit” missing values. For example:

    filter(cr_year, County=="Alcona")
    
    ## # A tibble: 2 x 3
    ## # Groups:   County [1]
    ##   County  year   ncr
    ##    <chr> <int> <int>
    ## 1 Alcona  2006     1
    ## 2 Alcona  2009     1
    

    This means that between 2004 and 2015 only two crashes were recorded in Alcona county, and they occurred during 2006 and 2009. To fill in the missing values, use the complete function. First read the documentation for complete here. Then use complete so that cr_year has a row for every year-county combination and ncr is zero when there were no crashes in the corresponding county-year period.

    cr_year <-
      cr_year %>% 
      ungroup %>% 
      complete(County, year, fill=list(ncr = 0 ))
    
    print(cr_year)	  
    
    ## # A tibble: 996 x 3
    ##    County  year   ncr
    ##     <chr> <int> <dbl>
    ##  1 Alcona  2004     0
    ##  2 Alcona  2005     0
    ##  3 Alcona  2006     1
    ##  4 Alcona  2007     0
    ##  5 Alcona  2008     0
    ##  6 Alcona  2009     1
    ##  7 Alcona  2010     0
    ##  8 Alcona  2011     0
    ##  9 Alcona  2012     0
    ## 10 Alcona  2013     0
    ## # ... with 986 more rows
    
  3. As a review, use spread to produce the following table:

    spread(cr_year, key=County,value=ncr)
    
    ## # A tibble: 12 x 84
    ##     year Alcona Alger Allegan Alpena Antrim Arenac Baraga Barry   Bay
    ##  * <int>  <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
    ##  1  2004      0     0      18     12      2      0      1     9    31
    ##  2  2005      0     0       9     12      0      1      0     5    28
    ##  3  2006      1     0      14      8      2      1      0     4    37
    ##  4  2007      0     0      10     10      4      2      1     3    28
    ##  5  2008      0     2      11      8      1      1      0     4    31
    ##  6  2009      1     1       8      3      2      0      1     5    31
    ##  7  2010      0     0      19      7      1      2      0     7    24
    ##  8  2011      0     0      16      8      1      0      1     5    28
    ##  9  2012      0     1      15      7      3      1      0     6    26
    ## 10  2013      0     0      10     10      1      2      0     4    18
    ## 11  2014      0     0      11      8      2      1      0     1    37
    ## 12  2015      0     0      12      7      2      0      0     5    21
    ## # ... with 74 more variables: Benzie <dbl>, Berrien <dbl>, Branch <dbl>,
    ##     --- additional output omitted ---
    
  4. How is the city variable formatted? Run this command:

    cr %>%
      select(year, Crash.Instance, city)
    

    Now use separate so that the city variable no longer contains the county name.

    cr  <-
      cr %>% separate(city, c('county2','city'), sep=': ')
    

    To check your answer:

    cr %>% select(Crash.Instance, city, County, county2)
    
    ## # A tibble: 23,809 x 4
    ##    Crash.Instance           city    County          county2
    ##  *          <int>          <chr>     <chr>            <chr>
    ##  1     2004374322      Roseville    Macomb    Macomb County
    ##  2     2004374216   Lasalle Twp.    Monroe    Monroe County
    ##  3     2004373946   Lincoln Twp.   Berrien   Berrien County
    ##  4     2004373597     Hazel Park   Oakland   Oakland County
    ##  5     2004371832 Hillsdale Twp. Hillsdale Hillsdale County
    ##    --- additional output omitted ---
    
  5. Create a data frame called cr_year_wash which contains the number of crashes in each city in Washtenaw county for the years 2004–2015.

    cr_year_wash <-
      filter(cr, County=='Washtenaw') %>% 
      group_by(city, year) %>% 
      summarise(ncr=n())
    
    cr_year_wash
    
    ## # A tibble: 153 x 3
    ## # Groups:   city [?]
    ##         city  year   ncr
    ##        <chr> <int> <int>
    ##  1 Ann Arbor  2004    45
    ##  2 Ann Arbor  2005    28
    ##  3 Ann Arbor  2006    50
    ##  4 Ann Arbor  2007    34
    ##  5 Ann Arbor  2008    58
    ##  6 Ann Arbor  2009    61
    ##  7 Ann Arbor  2010    59
    ##  8 Ann Arbor  2011    59
    ##  9 Ann Arbor  2012    64
    ## 10 Ann Arbor  2013    63
    ## # ... with 143 more rows
    
  6. Are there any missing values in cr_year_wash? Ensure that there are no explicit or implicit missing values. If there were no crashes in a given year for any of the cities in Washtenaw county, make sure that ncr is zero for that year. If you call complete you should first call ungroup so there is no grouping variable in cr_year_wash.
    see below

  7. Create a plot of the number of crashes over time for the cities in Washtenaw county. Your plot should look something like this:

    cr_year_wash %>% ungroup %>%
      complete(city, year, fill = list(ncr = 0)) %>%
      ggplot(aes(x=year, y=ncr)) + geom_line(aes(color=city=='Ann Arbor', group=city)) +
      scale_color_discrete(breaks=c("FALSE", "TRUE"), labels=c("Other", "Ann Arbor"), name='')