Statistics 306, Fall 2017

Lab 1, September 12

Introductory Poll

If you haven’t already, please take this Canvas survey.

Installing R and RStudio

The classroom machines (and the machines in any campus computer lab) should have RStudio set up and ready to use.
To use your own laptop:

  1. First install R
  2. Then install RStudio
  3. Open RStudio
  4. Install the tidyverse package: install.packages("tidyverse")
  5. Load the tidyverse package: library(tidyverse)

    Each time you open RStudio, you will need to load the tidyverse package (step 5). The other setup steps only need to be performed once.

R Preliminaries

Scripts, working directory, workspace

A script is a file with a list of R commands. It will be helpful to create and save a script file for each lab or homework assignment.

Your R session takes place in a working directory, which is a folder on your computer. If you load data from a file and give only the file name (or another relative path), R looks for that file in the working directory.

getwd() # Print the current working directory
setwd('~/Documents/your_directory/') # Set the working directory
list.files() # List the files in your working directory
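Before importing a file, it can help to confirm that R can see it from the working directory. The following is a minimal sketch that uses a temporary folder and a made-up file name (example_data.txt) so that it runs on any machine; substitute your own folder and file.

```r
# Use a temporary folder as a stand-in for your own lab folder.
# setwd() returns the previous working directory, so we can restore it later.
old_dir <- setwd(tempdir())

# Create a small placeholder file (hypothetical name, for illustration only)
writeLines(c("a b", "1 2"), "example_data.txt")

# file.exists() resolves a bare file name relative to the working directory
found <- file.exists("example_data.txt")
print(found)

# Clean up and return to the original working directory
file.remove("example_data.txt")
setwd(old_dir)
```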

Your R workspace contains all of the objects you have created in your R session. Each time you assign a value to a variable name, you create an object.

x <- 3
ls() # List all of the objects in your workspace
## [1] "x"
print(y) # y has never been assigned, so this is an error
## Error in print(y): object 'y' not found
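The "object not found" error means the name has not been assigned in your workspace. As a small sketch, exists() checks whether a name is defined and rm() removes an object; the variable names has_x and has_x_after are only for illustration.

```r
x <- 3
has_x <- exists("x")        # x was assigned above, so it can be found

rm(x)                       # remove x from the workspace
has_x_after <- exists("x")  # x can no longer be found in a fresh session

print(c(has_x, has_x_after))
```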

R documentation

Use ?function_name to see the help page for any function. The online ggplot documentation is available here.
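When you only need a quick reminder of a function's arguments rather than the full help page, args() and formals() can be handy. A short sketch using read.table (the variable name arg_names is just for illustration):

```r
# ?read.table (or help("read.table")) opens the full help page.
# args() prints the argument names and their default values.
args(read.table)

# formals() returns the arguments as a list, so we can inspect them in code
arg_names <- names(formals(read.table))
print(head(arg_names))
```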

R Vocabulary

Importing data

On our Canvas site, download big10_gradrates.txt. This file contains six-year graduation rates for universities in the Big 10 (source). Save this file in your working directory.

Now load the file into your workspace. The function read.table imports data from a text file. Look at the help page for read.table. Import the big10_gradrates.txt file, using the following information to help you specify the arguments to read.table:

The results of read.table are returned as a data frame object. Use the assignment operator <- to name your data frame grad.
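The call pattern looks roughly like the sketch below. It reads a tiny made-up file, not big10_gradrates.txt, and the choices header = TRUE and sep = "\t" are assumptions made for this toy file; check the lab file and the help page to pick the right arguments for the real data.

```r
# Write a tiny made-up, tab-separated file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("school\tyear\trate",
             "School A\t1995\t80",
             "School B\t1995\t70"), tmp)

# header = TRUE: the first line holds the variable names
# sep = "\t": fields are separated by tabs (so names may contain spaces)
toy <- read.table(tmp, header = TRUE, sep = "\t")
print(toy)
```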

Let’s see what variables are in this data set:

## [1] "SCL_NAME"    "SCL_PRIVATE" "year"        "Gender"      "Population" 
## [6] "GradRate"
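Output of this form is what names() prints for a data frame (colnames() gives the same result). A sketch on a small made-up data frame called toy:

```r
# A made-up data frame, used only to show the call
toy <- data.frame(school = c("School A", "School B"),
                  year = c(1995, 1995),
                  rate = c(80, 70))

names(toy)  # the variable (column) names
## [1] "school" "year"   "rate"
```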

The Population variable indicates whether the graduation rate (GradRate) is computed for student-athletes or all students. The Gender variable indicates whether the graduation rate is computed for both men and women, only men, or only women.

Examining a data frame

Here are some functions that help us understand the structure of a data set:
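A few base R functions commonly used for this purpose are dim(), str(), head(), and summary(). The sketch below applies them to a made-up data frame (toy); the same calls work on grad.

```r
# A made-up data frame, used only to show the calls
toy <- data.frame(school = c("School A", "School B", "School C"),
                  year = c(1995, 1996, 1997),
                  rate = c(80, 70, 75))

dim(toy)      # number of rows, then number of columns
str(toy)      # type and first few values of each variable
head(toy, 2)  # the first two rows (head() shows six rows by default)
summary(toy)  # a summary of each variable
```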

Let’s take a look at the graduation rate data set.

SCL_NAME                                    SCL_PRIVATE  year  Gender    Population    GradRate
University of Illinois at Urbana-Champaign  (0) Public   1995  Combined  Student Body  78
Northwestern University                     (1) Private  1995  Combined  Student Body  92
Indiana University-Bloomington              (0) Public   1995  Combined  Student Body  68
University of Iowa                          (0) Public   1995  Combined  Student Body  65
University of Maryland-College Park         (0) Public   1995  Combined  Student Body  65
University of Michigan-Ann Arbor            (0) Public   1995  Combined  Student Body  83
## [1] "Combined" "Female"   "Male"
## [1] "Student Athletes" "Student Body"
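One way to list the distinct values of a categorical variable is unique(). This is a sketch on made-up data and not necessarily the call that produced the output above; the names toy and group are hypothetical.

```r
# A made-up categorical variable stored as character strings
toy <- data.frame(group = c("Male", "Female", "Combined", "Female"),
                  stringsAsFactors = FALSE)

unique(toy$group)                  # distinct values, in order of first appearance
values <- sort(unique(toy$group))  # the same values, sorted alphabetically
print(values)
```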

Lab exercise

Let’s start by plotting the graduation rate for the University of Michigan. Filter the data set to only include graduation rates for Michigan (we will learn more about filtering later in the course):

grad_mich <- filter(grad, SCL_NAME == "University of Michigan-Ann Arbor")
ggplot(grad_mich) + geom_point(aes(x=year, y=GradRate))

This doesn’t look so great. Maybe a line would be better.

ggplot(grad_mich) + geom_line(aes(x=year, y=GradRate))

Yikes. Remember the data set has graduation rates for student athletes, all students, men, and women. The table function can be helpful here:

# The number of observations in each 
# combination of these categorical variables
table(grad_mich$Population, grad_mich$Gender) 
##                    Combined Female Male
##   Student Athletes       14     14   14
##   Student Body           14     14   14

There are 14 years in this data set. To correctly plot the graduation rates over time we should have one line per group of 14 observations.
Let’s ignore the gender variable right now and plot one line for athletes and one line for the whole student body.

grad_mich <- filter(grad_mich, Gender=="Combined")
ggplot(grad_mich) +
  geom_line(aes(x=year, y=GradRate, linetype=Population)) +
  scale_linetype_discrete() # Pass arguments here to customize the legend

Examine the help page for scale_linetype_discrete. Adding a discrete scale like this lets you control the legend title, labels, etc.
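As an illustration of what the scale can control, the sketch below sets a legend title and relabels the two groups. The data frame toy, its values, the title "Group", and the labels are all made up for this example; they are not taken from the lab data or the target figure.

```r
library(ggplot2)

# Made-up data standing in for the filtered Michigan data frame
toy <- data.frame(
  year = rep(2001:2003, times = 2),
  GradRate = c(70, 72, 74, 85, 86, 88),  # placeholder values, not real rates
  Population = rep(c("Student Athletes", "Student Body"), each = 3)
)

p <- ggplot(toy) +
  geom_line(aes(x = year, y = GradRate, linetype = Population)) +
  # name sets the legend title; labels replaces the level names in order
  # (levels here sort as "Student Athletes", then "Student Body")
  scale_linetype_discrete(name = "Group",
                          labels = c("Athletes", "All students"))
print(p)
```

The labels are matched to the levels by position, so it is worth checking the level order (for example with sort(unique(toy$Population))) before relabeling.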

Assignment (in class)

Your task is to recreate this figure:


Lab 1 solution

Solution to lab exercise

ggplot extras