Brook's homepage Statistics 306, Fall 2017

Lab 12, December 5

Solutions

library(tidyverse)
library(stringr)

for loops

Reference: section 21.2

A for loop executes a block of code (the “body” of the loop) once for each item in a sequence. In this example, the sequence consists of the integers 1, 2, …, n. The loop body is executed once for each item in the sequence, and the variable i is assigned to the corresponding sequence item in each iteration. Each iteration of this loop computes the ith Fibonacci number.

prev <- 0
cur <- 1
n <- 10
fibn <- vector('integer', n)
for (i in seq_along(fibn)){
  print(str_c("i = ", i))
  fibn[i] <- prev + cur
  prev <- cur
  cur <- fibn[i]
}
## [1] "i = 1"
## [1] "i = 2"
## [1] "i = 3"
## [1] "i = 4"
## [1] "i = 5"
## [1] "i = 6"
## [1] "i = 7"
## [1] "i = 8"
## [1] "i = 9"
## [1] "i = 10"
fibn
##  [1]  1  2  3  5  8 13 21 34 55 89

Exercise Part 1

This exercise demonstrates how to import many data files into a single data frame and reviews many of the concepts we have covered this semetser.

Start by downloading acs-income.zip from our Canvas site. Uncompress this file in your working directory. The data, from the American Communtiy Survey, describe the household income distribution of each county in the six New England states. There are five pairs of files, one for each of the years 2011, 2012, 2013, 2014, and 2015. The _metadata.csv files contain column names and descriptions for each column in the corresponding data files.

  1. Start by creating a character vector containing the names of each metadata file. Complete this command by specifying a regular expression to match the files ending with _metadata.csv. If your uncompressed files are located in a folder called acs-income, your command will look like this:

    fnames <- dir(path="acs-income/",
                  pattern="") # complete this line
    

    If the 10 csv files are in your working directory, then you can omit the path= argument.

    fnames
    
    ## [1] "ACS_11_5YR_S1901_metadata.csv" "ACS_12_5YR_S1901_metadata.csv"
    ## [3] "ACS_13_5YR_S1901_metadata.csv" "ACS_14_5YR_S1901_metadata.csv"
    ## [5] "ACS_15_5YR_S1901_metadata.csv"
    
  2. Extract the two-digit year from each file name in fnames, then convert the two-digit years to four-digit years. Complete the following commands.

    First extract the two-digit year:

    fyears <- str_extract(fnames, ...) # complete this command
    
    fyears
    
    ## [1] "11" "12" "13" "14" "15"
    

    Then prepend 20 to convert the two-digit years to four-digit years.

    fyears <- str_c(...) # complete this command
    
    fyears
    
    ## [1] "2011" "2012" "2013" "2014" "2015"
    
  3. Run these commands to create a placeholder list where we will store the metadata for each year:

    metadata <- vector("list", length(fyears))
    names(metadata) <- fyears
    metadata
    
    ## $`2011`
    ## NULL
    ## 
    ## $`2012`
    ## NULL
    ## 
    ## $`2013`
    ## NULL
    ## 
    ## $`2014`
    ## NULL
    ## 
    ## $`2015`
    ## NULL
    
  4. Write the command to display the second element of the vector fnames.

  5. Complete the body of the following for loop. Each iteration of the loop imports the metadata file corresponding to the ith element of fnames and stores the result as the ith element of metadata. In each iteration, provide an argument to read_csv so the columns of the imported data frame are named varname and vardesc.

    for (i in seq_along(fnames)) { 
      metadata[[i]] <- read_csv(...) # complete this line
    }
    

    To check your answer, print the element of the list metadata named "2014":

    ## # A tibble: 131 x 2
    ##              varname                                         vardesc
    ##                <chr>                                           <chr>
    ##  1            GEO.id                                              Id
    ##  2           GEO.id2                                             Id2
    ##  3 GEO.display-label                                       Geography
    ##  4     HC01_EST_VC01                     Households; Estimate; Total
    ##  5     HC01_MOE_VC01              Households; Margin of Error; Total
    ##  6     HC02_EST_VC01                       Families; Estimate; Total
    ##  7     HC02_MOE_VC01                Families; Margin of Error; Total
    ##  8     HC03_EST_VC01        Married-couple families; Estimate; Total
    ##  9     HC03_MOE_VC01 Married-couple families; Margin of Error; Total
    ## 10     HC04_EST_VC01           Nonfamily households; Estimate; Total
    ## # ... with 121 more rows
    
  6. Now we can combine this list of data frames into a single data frame with bind_rows. This function takes a list of data frames and “stacks” them vertically. The .id argument will create a column called year to identify the source data frame for each row.

    metadata <- bind_rows(metadata, .id='year')
    head(metadata)
    
    ## # A tibble: 6 x 3
    ##    year           varname                            vardesc
    ##   <chr>             <chr>                              <chr>
    ## 1  2011            GEO.id                                 Id
    ## 2  2011           GEO.id2                                Id2
    ## 3  2011 GEO.display-label                          Geography
    ## 4  2011     HC01_EST_VC01        Households; Estimate; Total
    ## 5  2011     HC01_MOE_VC01 Households; Margin of Error; Total
    ## 6  2011     HC02_EST_VC01          Families; Estimate; Total
    
  7. Import the five data files into a data frame called incomedata, using the previous few steps as a guide. Start by creating another vector of file names, fnames2. Use the dir function and a regular expression to match the data file names, which all end with S1901.csv.

    fnames2 # the data file names
    
    ## [1] "ACS_11_5YR_S1901.csv" "ACS_12_5YR_S1901.csv" "ACS_13_5YR_S1901.csv"
    ## [4] "ACS_14_5YR_S1901.csv" "ACS_15_5YR_S1901.csv"
    

    Now write a for loop to import the five data files. Do not specify column names when importing these files; we will use the variable descriptions in metadata to understand what each column represents.

    incomedata <- vector("list", length(fnames2))
    names(incomedata) <- str_c(..., str_extract(fnames2, ... )) # complete this line, the names correspond to the four-digit years
    for (...) {               # complete this line
      incomedata[[i]] <- ...  # complete this line
    }
    incomedata <- bind_rows(incomedata, .id='year')
    

    Check your answer:

    dim(incomedata)
    
    ## [1] 335 132
    
    count(incomedata, year)
    
    ## # A tibble: 5 x 2
    ##    year     n
    ##   <chr> <int>
    ## 1  2011    67
    ## 2  2012    67
    ## 3  2013    67
    ## 4  2014    67
    ## 5  2015    67
    
  8. Print the unique values of metadata$vardesc containing the string Households; Estimate;. You should see 16 variable descriptions, most of which correspond to income categories (e.g. households earning between 10,000 and 14,999 dollars per year). We will focus on just two of these variables: the proportion of households earning more than $200,000 per year and the median household income.
  9. Complete this command to produce the output that follows.

    filter(metadata, str_detect(vardesc, ...))
    
    ## # A tibble: 5 x 3
    ##    year       varname                                vardesc
    ##   <chr>         <chr>                                  <chr>
    ## 1  2011 HC01_EST_VC11 Households; Estimate; $200,000 or more
    ## 2  2012 HC01_EST_VC11 Households; Estimate; $200,000 or more
    ## 3  2013 HC01_EST_VC11 Households; Estimate; $200,000 or more
    ## 4  2014 HC01_EST_VC11 Households; Estimate; $200,000 or more
    ## 5  2015 HC01_EST_VC11 Households; Estimate; $200,000 or more
    

    We see that every year has the same variable name for the proportion of households earning more than $200,000 (this might not always be the case with Census data). Store the variable name (HC01...) in an object called var_maxinc.

  10. Complete the following command to check the variable name for median household income.

    filter(metadata, str_detect(vardesc, ...))
    
    ## # A tibble: 5 x 3
    ##    year       varname                                       vardesc
    ##   <chr>         <chr>                                         <chr>
    ## 1  2011 HC01_EST_VC13 Households; Estimate; Median income (dollars)
    ## 2  2012 HC01_EST_VC13 Households; Estimate; Median income (dollars)
    ## 3  2013 HC01_EST_VC13 Households; Estimate; Median income (dollars)
    ## 4  2014 HC01_EST_VC13 Households; Estimate; Median income (dollars)
    ## 5  2015 HC01_EST_VC13 Households; Estimate; Median income (dollars)
    

    Again, all five years use the same variable name for median household income. Store this variable name in an object called var_median.

  11. From incomedata, select the columns year, GEO.display-label, and the two columns representing the median household income and the proportion of households making more than $200,000. Rename these columns so your result looks like this:

    ## # A tibble: 335 x 4
    ##     year                         county median gt200k
    ##    <int>                          <chr>  <int>  <dbl>
    ##  1  2011  Fairfield County, Connecticut  82558   16.7
    ##  2  2011   Hartford County, Connecticut  64007    6.5
    ##  3  2011 Litchfield County, Connecticut  71497    6.6
    ##  4  2011  Middlesex County, Connecticut  77095    7.8
    ##  5  2011  New Haven County, Connecticut  62497    5.9
    ##  6  2011 New London County, Connecticut  67010    5.2
    ##  7  2011    Tolland County, Connecticut  80333    5.6
    ##  8  2011    Windham County, Connecticut  60063    2.6
    ##  9  2011     Androscoggin County, Maine  45699    1.4
    ## 10  2011        Aroostook County, Maine  37138    1.1
    ## # ... with 325 more rows
    

    Make sure your year variable is either an integer or double. Store this data frame in an object called incomedata_small.

  12. Use separate to add a state column to incomedata_small:

    incomedata_small <-
      separate(incomedata_small, ...) # complete this line
    incomedata_small
    
    ## # A tibble: 335 x 5
    ##     year        county       state median gt200k
    ##  * <int>         <chr>       <chr>  <int>  <dbl>
    ##  1  2011    Fairfield  Connecticut  82558   16.7
    ##  2  2011     Hartford  Connecticut  64007    6.5
    ##  3  2011   Litchfield  Connecticut  71497    6.6
    ##  4  2011    Middlesex  Connecticut  77095    7.8
    ##  5  2011    New Haven  Connecticut  62497    5.9
    ##  6  2011   New London  Connecticut  67010    5.2
    ##  7  2011      Tolland  Connecticut  80333    5.6
    ##  8  2011      Windham  Connecticut  60063    2.6
    ##  9  2011 Androscoggin        Maine  45699    1.4
    ## 10  2011    Aroostook        Maine  37138    1.1
    ## # ... with 325 more rows
    
  13. Write a command to check if there are any county names that are duplicated across states. Start by filtering incomedata_small to a single year. Then use one or more dplyr functions to find county names that are repeated across multiple states.

  14. Create a data frame called top2counties that contains, for each state, the two counties with the highest proportion of households earning more than $200,000 in 2015:

    top2counties
    
    ## # A tibble: 12 x 5
    ## # Groups:   state [6]
    ##     year        county         state median gt200k
    ##    <int>         <chr>         <chr>  <int>  <dbl>
    ##  1  2015    Fairfield    Connecticut  84233   18.1
    ##  2  2015    Middlesex    Connecticut  79893    9.6
    ##  3  2015   Cumberland          Maine  60051    5.7
    ##  4  2015         York          Maine  57919    3.9
    ##  5  2015    Middlesex  Massachusetts  85118   13.4
    ##  6  2015      Norfolk  Massachusetts  88262   14.4
    ##  7  2015 Hillsborough  New Hampshire  71244    7.0
    ##  8  2015   Rockingham  New Hampshire  81198    8.8
    ##  9  2015      Bristol   Rhode Island  72458    9.6
    ## 10  2015      Newport   Rhode Island  69526    7.8
    ## 11  2015   Chittenden        Vermont  65350    5.7
    ## 12  2015   Grand Isle        Vermont  62608    5.6
    
  15. Complete the following command to produce the plot below.

    incomedata_small %>% 
      ggplot(...) + 
       ... +         # more ggplot commands
      geom_text(aes(label = county), size=2, hjust=1, vjust=0,
              data = top2counties)
    

  16. Produce a similar plot using the median household income in each county. First create a data frame top2counties_med with the two counties in each state with the highest median income in 2015:

    top2counties_med
    
    ## # A tibble: 12 x 5
    ## # Groups:   state [6]
    ##     year        county         state median gt200k
    ##    <int>         <chr>         <chr>  <int>  <dbl>
    ##  1  2015    Fairfield    Connecticut  84233   18.1
    ##  2  2015    Middlesex    Connecticut  79893    9.6
    ##  3  2015   Cumberland          Maine  60051    5.7
    ##  4  2015         York          Maine  57919    3.9
    ##  5  2015    Middlesex  Massachusetts  85118   13.4
    ##  6  2015      Norfolk  Massachusetts  88262   14.4
    ##  7  2015 Hillsborough  New Hampshire  71244    7.0
    ##  8  2015   Rockingham  New Hampshire  81198    8.8
    ##  9  2015      Bristol   Rhode Island  72458    9.6
    ## 10  2015   Washington   Rhode Island  72807    7.5
    ## 11  2015   Chittenden        Vermont  65350    5.7
    ## 12  2015   Grand Isle        Vermont  62608    5.6
    

    Then reproduce the following plot:

Exercise part 2 (extra review)

Here are some more exercises using the same data from above. You need to complete steps 1 through 8 from part 1 before attempting part 2.

  1. Reproduce the following sequence using str_pad:

    ##  [1] "02" "03" "04" "05" "06" "07" "08" "09" "10" "11"
    

    Combine your previous command with str_c to produce the vector incvars_dist:

    incvars_dist <- str_c(..., #complete this line
                          str_pad(...)) # your command from the previous question
    
    incvars_dist
    
    ##  [1] "HC01_EST_VC02" "HC01_EST_VC03" "HC01_EST_VC04" "HC01_EST_VC05"
    ##  [5] "HC01_EST_VC06" "HC01_EST_VC07" "HC01_EST_VC08" "HC01_EST_VC09"
    ##  [9] "HC01_EST_VC10" "HC01_EST_VC11"
    
  2. Apply your separate command from an earlier question (see above) to the original incomedata data frame, so we have separate state and county columns.

    incomedata <-
          separate(incomedata, ...) # complete this line
    

    Now use filter and select to create a data frame incomedata_nh that only contains rows for counties in New Hampshire and has columns for year, county, and the 10 variables named in the vector incvars_dist:

    incomedata_nh <- incomedata %>% 
      filter(...) %>% # complete this line 
      select(...)   # complete this line
    
    incomedata_nh
    
    ## # A tibble: 50 x 12
    ##     year        county HC01_EST_VC02 HC01_EST_VC03 HC01_EST_VC04
    ##    <chr>         <chr>         <dbl>         <dbl>         <dbl>
    ##  1  2011      Belknap            4.8           4.8           8.9
    ##  2  2011      Carroll            5.5           5.3           8.8
    ##  3  2011     Cheshire            4.8           4.7          10.3
    ##  4  2011         Coos            7.5           7.5          14.3
    ##  5  2011      Grafton            5.6           4.5           9.9
    ##  6  2011 Hillsborough            4.1           3.5           7.9
    ##  7  2011    Merrimack            4.2           3.9           8.3
    ##  8  2011   Rockingham            3.1           3.0           6.6
    ##  9  2011    Strafford            5.9           4.6           8.6
    ## 10  2011     Sullivan            5.8           6.1           9.0
    ## # ... with 40 more rows, and 7 more variables: HC01_EST_VC05 <dbl>,
    ## #   HC01_EST_VC06 <dbl>, HC01_EST_VC07 <dbl>, HC01_EST_VC08 <dbl>,
    ## #   HC01_EST_VC09 <dbl>, HC01_EST_VC10 <dbl>, HC01_EST_VC11 <dbl>
    
  3. Convert incomedata_nh to “long” format using gather. One of the arguments to gather should be your vector incvars_dist.

    incomedata_nh <-
      incomedata_nh %>% gather(...) # complete this line
    
    incomedata_nh
    
    ## # A tibble: 500 x 4
    ##     year        county       varname proportion
    ##    <chr>         <chr>         <chr>      <dbl>
    ##  1  2011      Belknap  HC01_EST_VC02        4.8
    ##  2  2011      Carroll  HC01_EST_VC02        5.5
    ##  3  2011     Cheshire  HC01_EST_VC02        4.8
    ##  4  2011         Coos  HC01_EST_VC02        7.5
    ##  5  2011      Grafton  HC01_EST_VC02        5.6
    ##  6  2011 Hillsborough  HC01_EST_VC02        4.1
    ##  7  2011    Merrimack  HC01_EST_VC02        4.2
    ##  8  2011   Rockingham  HC01_EST_VC02        3.1
    ##  9  2011    Strafford  HC01_EST_VC02        5.9
    ## 10  2011     Sullivan  HC01_EST_VC02        5.8
    ## # ... with 490 more rows
    
  4. We are now going to replace these coded variable names with the descriptions in metadata. First run this command:

    filter(metadata, year==2011, varname %in% incvars_dist)
    
    ## # A tibble: 10 x 3
    ##     year       varname                                    vardesc
    ##    <chr>         <chr>                                      <chr>
    ##  1  2011 HC01_EST_VC02    Households; Estimate; Less than $10,000
    ##  2  2011 HC01_EST_VC03   Households; Estimate; $10,000 to $14,999
    ##  3  2011 HC01_EST_VC04   Households; Estimate; $15,000 to $24,999
    ##  4  2011 HC01_EST_VC05   Households; Estimate; $25,000 to $34,999
    ##  5  2011 HC01_EST_VC06   Households; Estimate; $35,000 to $49,999
    ##  6  2011 HC01_EST_VC07   Households; Estimate; $50,000 to $74,999
    ##  7  2011 HC01_EST_VC08   Households; Estimate; $75,000 to $99,999
    ##  8  2011 HC01_EST_VC09 Households; Estimate; $100,000 to $149,999
    ##  9  2011 HC01_EST_VC10 Households; Estimate; $150,000 to $199,999
    ## 10  2011 HC01_EST_VC11     Households; Estimate; $200,000 or more
    

    Pipe the results of the previous command to extract the vardesc column and store it as a vector income_descriptions.

    income_descriptions
    
    ##  [1] "Households; Estimate; Less than $10,000"   
    ##  [2] "Households; Estimate; $10,000 to $14,999"  
    ##  [3] "Households; Estimate; $15,000 to $24,999"  
    ##  [4] "Households; Estimate; $25,000 to $34,999"  
    ##  [5] "Households; Estimate; $35,000 to $49,999"  
    ##  [6] "Households; Estimate; $50,000 to $74,999"  
    ##  [7] "Households; Estimate; $75,000 to $99,999"  
    ##  [8] "Households; Estimate; $100,000 to $149,999"
    ##  [9] "Households; Estimate; $150,000 to $199,999"
    ## [10] "Households; Estimate; $200,000 or more"
    
  5. Remove the string Households; Estimate; from each element of income_descriptions, so this vector looks like this:

    income_descriptions
    
    ##  [1] "Less than $10,000"    "$10,000 to $14,999"   "$15,000 to $24,999"  
    ##  [4] "$25,000 to $34,999"   "$35,000 to $49,999"   "$50,000 to $74,999"  
    ##  [7] "$75,000 to $99,999"   "$100,000 to $149,999" "$150,000 to $199,999"
    ## [10] "$200,000 or more"
    
  6. Run the following command so that the vector income_descriptions is named.

    names(income_descriptions) <- incvars_dist
    

    This allows us to select the description using the variable name:

    income_descriptions['HC01_EST_VC02']
    
    ##       HC01_EST_VC02 
    ## "Less than $10,000"
    

    Now run this command to add these variable descriptions to our New Hampshire income data:

    incomedata_nh <- 
      incomedata_nh %>% mutate(vardesc = income_descriptions[varname],
                               vardesc = factor(vardesc, levels=income_descriptions))
    
    # check the results:
    sample_n(incomedata_nh, 5) %>% select(varname, vardesc)
    
    ## # A tibble: 5 x 2
    ##         varname              vardesc
    ##           <chr>               <fctr>
    ## 1 HC01_EST_VC10 $150,000 to $199,999
    ## 2 HC01_EST_VC09 $100,000 to $149,999
    ## 3 HC01_EST_VC07   $50,000 to $74,999
    ## 4 HC01_EST_VC11     $200,000 or more
    ## 5 HC01_EST_VC06   $35,000 to $49,999
    

    (We are storing vardesc as a factor so that the ordering of the categories is preserved when we use them to make graphics.)

  7. Sort the New Hampshire county names in descending order according to the proportion of households making more than $200,000 in 2015. Complete this command:

    counties_ordered <- 
      incomedata_nh %>% filter(year==2015, ...) %>% # complete this line
      arrange(...) %>%  # complete this line
      .$county
    

    Now change the county variable to a factor using this command:

    incomedata_nh <-
      incomedata_nh %>% mutate(county = factor(county, levels = counties_ordered))
    
  8. Finally, complete this ggplot command to visualize the income distribution in New Hampshire counties in 2015.

    filter(incomedata_nh, ... )%>% # complete this line
      ggplot() + 
      geom_tile(...) + # complete this line
      theme(axis.text.x = element_text(angle=45, hjust=1)) + # to make the x-axis labels legble
      scale_fill_gradient2(low='white', high='darkgreen',
                           name="", breaks=c(5,10,15,20),
                           labels=c("5","10","15","20 percent")) +
      ...  # change the default axis labels and plot title