If you haven’t already, please take this Canvas survey
The classroom machines (and the machines in any campus computer lab) should have RStudio set up and ready to use.
To use your own laptop:
Each time you open RStudio, you will need to load the tidyverse package (step 5). The other setup steps only need to be performed once.
A script is a file with a list of R commands. It will be helpful to create and save a script file for each lab or homework assignment.
Type library(tidyverse) into the first line of your script.
Your R session takes place in a working directory, which is a folder on your computer. If you are loading data into R using a file, R will look for that file in the working directory.
getwd() # Print the current working directory
setwd('~/Documents/your_directory/') # Set the working directory
list.files() # List the files in your working directory
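As a minimal sketch of how the working directory affects file lookup, the following uses a temporary folder from tempdir() as a stand-in for your own course folder. The folder name lab_example and the file name example.txt are made up for illustration.

```r
# Remember where we started so we can switch back at the end
old_dir <- getwd()

# tempdir() is a stand-in for a real course folder
lab_dir <- file.path(tempdir(), "lab_example")
dir.create(lab_dir, showWarnings = FALSE)
setwd(lab_dir)

# Create a small file; the name example.txt is made up for illustration
writeLines("hello", "example.txt")

# R resolves relative file names against the working directory
list.files()
file.exists("example.txt")

# Return to the original working directory
setwd(old_dir)
```

If a file you expect is missing from list.files(), the working directory is probably not the folder you think it is.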
Your R workspace contains all of the objects you have created in your R session. Each time you assign a value to a variable name you have created an object.
x <- 3
ls() # List all of the objects in your workspace
## [1] "x"
print(y) # y has not been created, so this fails
## Error in print(y): object 'y' not found
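The following sketch, assuming a fresh R session with no existing object named y, shows how assignment adds objects to the workspace and how rm() removes them again.

```r
x <- 3            # assignment creates the object x
exists("x")       # checks whether an object named x can be found
exists("y")       # y has not been created in a fresh session
y <- x + 1        # now y exists as well
ls()              # lists the names of the objects created so far
rm(y)             # removes y from the workspace again
```

Using exists() before print() is one way to avoid the "object not found" error when you are unsure whether a variable has been created.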
Type ?function_name in the console to see the help page for any function. Documentation for ggplot is also available online.
Now load the file into your workspace. The function read.table imports data from a text file. Look at the help page for read.table. Import the big10_gradrates.txt file, using the following information to help you specify the arguments to read.table:
The results of read.table are returned as a data frame object. Use the assignment operator <- to name your data frame grad.
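The layout of big10_gradrates.txt is not shown here, so the sketch below writes a small made-up tab-separated file and reads it back. The arguments header, sep, and stringsAsFactors are real read.table arguments, but the values chosen are assumptions for this toy file, not the settings for the course data.

```r
# Write a tiny made-up file to a temporary location (not the real course data)
tmp <- tempfile(fileext = ".txt")
writeLines(c("school\tyear\trate",
             "School A\t1995\t78",
             "School B\t1995\t92"), tmp)

# header = TRUE: the first line holds the variable names
# sep = "\t": values are separated by tabs
# stringsAsFactors = TRUE: store text columns as factors
toy <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = TRUE)

toy
```

Without the assignment to toy, the data frame would only be printed and not saved in the workspace.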
Let’s see what variables are in this data set:
## [1] "SCL_NAME"    "SCL_PRIVATE" "year"        "Gender"      "Population"
## [6] "GradRate"
The Population variable indicates whether the graduation rate (GradRate) is computed for student-athletes or all students. The Gender variable indicates whether the graduation rate is computed for both men and women, only men, or only women.
Here are some functions that help us understand the structure of a data set:
str(grad): display a data frame's variable names, number of observations, and other information.
head(grad): display the first few rows of the data frame.
grad$year: the dollar sign returns all of the values for a single variable in a data frame.
levels(grad$SCL_NAME): print the unique values of a factor (categorical) variable.
unique(grad$year): print the unique values of a variable.
View(grad): look at the data frame in a new window, displayed like an Excel spreadsheet.
table(grad$Population): count the number of observations for each level of a factor.
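To try these functions without the course file, here is a sketch on a small made-up data frame named toy, which stands in for grad. The column names are invented for illustration.

```r
# A small made-up data frame standing in for grad
toy <- data.frame(
  school = factor(c("A", "A", "B", "B")),
  year   = c(1995, 1996, 1995, 1996),
  rate   = c(78, 80, 92, 93)
)

str(toy)             # variable names, types, and number of observations
head(toy, 2)         # first two rows
toy$year             # all values of one variable
levels(toy$school)   # unique values of a factor
unique(toy$year)     # unique values of any variable
table(toy$school)    # number of observations per level
```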
Let’s take a look at the graduation rate data set.
| SCL_NAME | SCL_PRIVATE | year | Gender | Population | GradRate |
|---|---|---|---|---|---|
| University of Illinois at Urbana-Champaign | (0) Public | 1995 | Combined | Student Body | 78 |
| Northwestern University | (1) Private | 1995 | Combined | Student Body | 92 |
| Indiana University-Bloomington | (0) Public | 1995 | Combined | Student Body | 68 |
| University of Iowa | (0) Public | 1995 | Combined | Student Body | 65 |
| University of Maryland-College Park | (0) Public | 1995 | Combined | Student Body | 65 |
| University of Michigan-Ann Arbor | (0) Public | 1995 | Combined | Student Body | 83 |
## [1] "Combined" "Female"   "Male"
## [1] "Student Athletes" "Student Body"
Let’s start by plotting the graduation rate for the University of Michigan. Filter the data set to only include graduation rates for Michigan (we will learn more about filtering later in the course):
grad_mich <- filter(grad, SCL_NAME == "University of Michigan-Ann Arbor")
ggplot(grad_mich) + geom_point(aes(x=year, y=GradRate))
This doesn’t look so great. Maybe a line would be better.
ggplot(grad_mich) + geom_line(aes(x=year, y=GradRate))
Yikes. Remember that the data set has graduation rates for student athletes, all students, men, and women, so each year has several values and a single line zigzags between them. The table function can be helpful here:
# The number of observations in each
# combination of these categorical variables
table(grad_mich$Population, grad_mich$Gender)
##
##                    Combined Female Male
##   Student Athletes       14     14   14
##   Student Body           14     14   14
There are 14 years in this data set. To correctly plot the graduation rates over time we should have one line per group of 14 observations.
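A sketch with made-up data may help show why one line per group is needed. The data frame toy and its columns are invented for illustration, and only ggplot2 is loaded so the example runs without the full tidyverse.

```r
library(ggplot2)

# Made-up data: two groups observed in each of three years
toy <- data.frame(
  year  = rep(2001:2003, times = 2),
  rate  = c(60, 62, 64, 80, 82, 84),
  group = rep(c("a", "b"), each = 3)
)

# More than one observation per year means one line cannot be right
table(toy$year)

# One line through all six points: it jumps between the two groups
p_single <- ggplot(toy) + geom_line(aes(x = year, y = rate))

# The group aesthetic draws a separate line for each group
p_grouped <- ggplot(toy) + geom_line(aes(x = year, y = rate, group = group))

p_grouped
```

Printing p_single instead shows the same jagged pattern as the Michigan plot, on a smaller scale.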
Let’s ignore the gender variable right now and plot one line for athletes and one line for the whole student body.
grad_mich <- filter(grad_mich, Gender == "Combined")
ggplot(grad_mich) +
  geom_line(aes(x=year, y=GradRate, linetype=Population)) +
  scale_linetype_discrete(name="")
Examine the help page for scale_linetype_discrete. Adding a discrete scale like this lets you control the legend title, labels, etc.
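As a small sketch of what the scale controls, the example below sets both a legend title and custom legend labels on made-up data. The data frame toy and the label text are assumptions for illustration; name and labels are standard arguments of discrete scales.

```r
library(ggplot2)

# Made-up data: two groups observed in each of three years
toy <- data.frame(
  year  = rep(2001:2003, times = 2),
  rate  = c(60, 62, 64, 80, 82, 84),
  group = rep(c("a", "b"), each = 3)
)

p <- ggplot(toy) +
  geom_line(aes(x = year, y = rate, linetype = group)) +
  # name sets the legend title; labels replaces the text shown for each level
  scale_linetype_discrete(name = "Population", labels = c("Group A", "Group B"))

p
```

Setting name = "" as in the Michigan plot hides the legend title while keeping the legend itself.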
Your task is to recreate this figure:
Use the grad data set, not grad_mich.
Add scale_color_hue to your plot, and set the group aesthetic to the appropriate variable.
To label the x-axis, add + xlab("my label")
To add a title, add + ggtitle("A title")