Lab 1, September 12
- Introductory Poll
- Installing R and RStudio
- R Preliminaries
- Importing data
- Examining a data frame
- Lab exercise
- Lab 1 solution
If you haven’t already, please take this Canvas survey
Installing R and RStudio
The classroom machines (and the machines in any campus computer lab) should have RStudio set up and ready to use.
To use your own laptop:
- First install R
- Then install RStudio
- Open RStudio
- Install the tidyverse package: install.packages("tidyverse")
- Load the tidyverse package: library(tidyverse)
Each time you open RStudio, you will need to load the tidyverse package (step 5). The other setup steps only need to be performed once.
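The install-once, load-every-session pattern can also be written so that the same lines are safe to run at the top of any script. This is a minimal sketch, assuming an internet connection and a configured CRAN mirror are available the first time it runs:

```r
# Install tidyverse only if it is missing (the one-time setup step).
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load tidyverse (needed in every new R session).
library(tidyverse)
```

Once tidyverse is installed, the first block does nothing, so only the library() call has any effect in later sessions.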
Scripts, working directory, workspace
A script is a file with a list of R commands. It will be helpful to create and save a script file for each lab or homework assignment.
- In RStudio, select File > New File > R Script.
- Type the command library(tidyverse) into the first line.
- Save the script with the name lab1.R in a convenient folder on your computer.
- Add further commands to the file and save your work periodically.
Your R session takes place in a working directory, which is a folder on your computer. If you are loading data into R using a file, R will look for that file in the working directory.
getwd() # Print the current working directory
setwd('~/Documents/your_directory/') # Set the working directory
list.files() # List the files in your working directory
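To see how the working directory affects file lookup, here is a small sketch that uses a temporary folder so it does not touch your own files. The folder name lab1_demo and the file example.txt are made up for illustration:

```r
# Remember where we started so we can return afterwards.
old_dir <- getwd()

# Create a throwaway folder (hypothetical name) inside R's temporary directory.
lab_dir <- file.path(tempdir(), "lab1_demo")
dir.create(lab_dir, showWarnings = FALSE)

setwd(lab_dir)                       # R now looks for files in lab_dir
writeLines("hello", "example.txt")   # create a small file in that folder

found_inside <- file.exists("example.txt")  # TRUE: found via the working directory
files_inside <- list.files()                # includes "example.txt"

setwd(old_dir)                       # restore the original working directory
```

After the final setwd(), a call such as file.exists("example.txt") would normally return FALSE, because R is looking in a different folder again.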
Your R workspace contains all of the objects you have created in your R session. Each time you assign a value to a variable name, you create an object.
x <- 3
ls() # List all of the objects in your workspace
## [1] "x"
print(y) # y has not been assigned, so R reports an error
## Error in print(y): object 'y' not found
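If you want to check whether an object exists without triggering an error, base R provides exists(). A short sketch, assuming nothing named y has been assigned in your session:

```r
x <- 3                        # assignment creates the object x

has_x <- exists("x")          # TRUE: x is in the workspace
has_y <- exists("y")          # FALSE, assuming y was never assigned

rm(x)                         # remove x from the workspace
has_x_after <- exists("x")    # FALSE: x is gone
```

Here has_x, has_y, and has_x_after are just illustrative names for storing the results.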
Type ?function_name to see the help page for any function. The ggplot2 documentation is also available online.
- Data frame: an R object roughly similar to an Excel spreadsheet. Each row is an observation and each column contains a single variable.
- Factor: This is what R calls a categorical variable. Levels are the unique possible values of a factor.
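These two terms are easier to see on a tiny example. The data frame below is entirely made up for illustration (it is not part of the lab data):

```r
# Each row is one observation; each column is one variable.
pets <- data.frame(
  name   = c("Rex", "Tom", "Ada", "Bo"),
  kind   = factor(c("dog", "cat", "dog", "bird")),  # a categorical variable
  weight = c(30.5, 4.2, 25.0, 0.1)
)

nrow(pets)            # number of observations (rows)
ncol(pets)            # number of variables (columns)
is.factor(pets$kind)  # TRUE: kind is stored as a factor
levels(pets$kind)     # the unique possible values, sorted alphabetically by default
```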
Now load the file into your workspace. The function read.table imports data from a text file.
Look at the help page for read.table. Import the big10_gradrates.txt file, using the following information to help you specify the arguments to read.table:
- The file name in your working directory should be big10_gradrates.txt
- The first row of the file contains variable names
- The fields (variable values) in the file are separated with commas
The results of read.table are returned as a data frame object. Use the assignment operator <- to name your data frame grad.
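One way the read.table call might look is sketched below. Because the real file is not included here, the sketch first writes a small stand-in file containing a few rows copied from the data preview later in this lab. The names demo_file and grad_demo are hypothetical, and the header line is an assumption about how the real file is laid out. The stringsAsFactors = TRUE argument is also an assumption: in R 4.0 and later, text columns are not converted to factors by default, and functions such as levels() need factors.

```r
# Write a small stand-in file to a temporary location (hypothetical file name).
demo_file <- file.path(tempdir(), "big10_gradrates_demo.txt")
writeLines(c(
  "SCL_NAME,SCL_PRIVATE,year,Gender,Population,GradRate",
  "University of Illinois at Urbana-Champaign,(0) Public,1995,Combined,Student Body,78",
  "Northwestern University,(1) Private,1995,Combined,Student Body,92",
  "University of Iowa,(0) Public,1995,Combined,Student Body,65"
), demo_file)

grad_demo <- read.table(
  demo_file,
  header = TRUE,            # the first row contains variable names
  sep = ",",                # fields are separated with commas
  stringsAsFactors = TRUE   # assumption: store text columns as factors
)

names(grad_demo)  # list the variable names
```

For the lab itself, the first argument would be "big10_gradrates.txt" and the result would be assigned to grad.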
Let’s see what variables are in this data set:
## [1] "SCL_NAME"    "SCL_PRIVATE" "year"        "Gender"      "Population"
## [6] "GradRate"
The Population variable indicates whether the graduation rate (GradRate) is computed for student-athletes or all students. The Gender variable indicates whether the graduation rate is computed for both men and women, only men, or only women.
Examining a data frame
Here are some functions that help us understand the structure of a data set:
- str(grad): display a data frame’s variable names, number of observations, and other information.
- head(grad): display the first few rows of the data frame.
- grad$year: the dollar sign returns all of the values for a single variable in a data frame.
- levels(grad$SCL_NAME): print the unique values of a factor (categorical) variable.
- unique(grad$year): print the unique values of a variable.
- View(grad): look at the data frame in a new window, displayed like an Excel spreadsheet.
- table(grad$Population): count the number of observations for each level of a factor.
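The sketch below tries most of these functions on a three-row stand-in built from values in this lab's data preview, so it runs without the full file. The name grad_demo is hypothetical, and View() is left out because it only works interactively.

```r
# A small stand-in data frame (three 1995 rows from the data preview).
grad_demo <- data.frame(
  SCL_NAME   = factor(c("Northwestern University",
                        "University of Iowa",
                        "University of Michigan-Ann Arbor")),
  year       = c(1995, 1995, 1995),
  Population = factor(rep("Student Body", 3)),
  GradRate   = c(92, 65, 83)
)

str(grad_demo)               # variable names, types, number of observations
head(grad_demo, 2)           # first two rows
grad_demo$GradRate           # all values of one variable
levels(grad_demo$SCL_NAME)   # unique values of a factor
unique(grad_demo$year)       # unique values of any variable
table(grad_demo$Population)  # number of observations per level
```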
Let’s take a look at the graduation rate data set.
|SCL_NAME|SCL_PRIVATE|year|Gender|Population|GradRate|
|---|---|---|---|---|---|
|University of Illinois at Urbana-Champaign|(0) Public|1995|Combined|Student Body|78|
|Northwestern University|(1) Private|1995|Combined|Student Body|92|
|Indiana University-Bloomington|(0) Public|1995|Combined|Student Body|68|
|University of Iowa|(0) Public|1995|Combined|Student Body|65|
|University of Maryland-College Park|(0) Public|1995|Combined|Student Body|65|
|University of Michigan-Ann Arbor|(0) Public|1995|Combined|Student Body|83|
## [1] "Combined" "Female"   "Male"
## [1] "Student Athletes" "Student Body"
Let’s start by plotting the graduation rate for the University of Michigan. Filter the data set to only include graduation rates for Michigan (we will learn more about filtering later in the course):
grad_mich <- filter(grad, SCL_NAME == "University of Michigan-Ann Arbor")
ggplot(grad_mich) + geom_point(aes(x=year, y=GradRate))
This doesn’t look so great. Maybe a line would be better.
ggplot(grad_mich) + geom_line(aes(x=year, y=GradRate))
Yikes. Remember the data set has graduation rates for student athletes, all students, men, and women. The table function can be helpful here:
# The number of observations in each
# combination of these categorical variables
table(grad_mich$Population, grad_mich$Gender)
##
##                    Combined Female Male
##   Student Athletes       14     14   14
##   Student Body           14     14   14
There are 14 years in this data set. To correctly plot the graduation rates over time we should have one line per group of 14 observations.
Let’s ignore the gender variable right now and plot one line for athletes and one line for the whole student body.
grad_mich <- filter(grad_mich, Gender == "Combined")
ggplot(grad_mich) +
  geom_line(aes(x = year, y = GradRate, linetype = Population)) +
  scale_linetype_discrete(name = "")
Examine the help page for scale_linetype_discrete. Adding a discrete scale like this lets you control the legend title, labels, etc.
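The sketch below shows the difference between an ungrouped line and one line per population, and how the name and labels arguments of scale_linetype_discrete change the legend. The data frame toy contains made-up numbers for illustration only; they are not real graduation rates.

```r
library(ggplot2)

# Made-up values, NOT real graduation rates.
toy <- data.frame(
  year       = rep(2001:2004, times = 2),
  Population = rep(c("Student Athletes", "Student Body"), each = 4),
  GradRate   = c(70, 72, 71, 74, 80, 81, 83, 84)
)

# With no grouping, geom_line joins all eight points in order of year,
# so the line zigzags between the two populations.
p_zigzag <- ggplot(toy) +
  geom_line(aes(x = year, y = GradRate))

# Mapping linetype to Population draws one line per population.
p_lines <- ggplot(toy) +
  geom_line(aes(x = year, y = GradRate, linetype = Population)) +
  scale_linetype_discrete(
    name   = "Group",                        # legend title
    labels = c("Athletes", "All students")   # legend labels, in alphabetical order of the values
  )

p_lines  # display the plot
```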
Assignment (in class)
Your task is to recreate this figure:
- Use the original grad data set, not grad_mich.
- Add scale_color_hue to your plot, specifying the …
- You will need to map the group aesthetic to the appropriate variable.
Lab 1 solution
- Edit the x-axis label: + xlab("my label")
- Add a title: + ggtitle("A title")
- You can control most of the visual elements of a plot using theme.
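As a rough sketch of how these pieces combine, the example below adds an axis label, a title, and two theme settings to a simple plot. The data values are made up, and the particular theme choices (bold title, legend position) are only examples of what theme can adjust.

```r
library(ggplot2)

# Made-up values for illustration only.
toy <- data.frame(year = 2001:2004, GradRate = c(80, 81, 83, 84))

p <- ggplot(toy) +
  geom_line(aes(x = year, y = GradRate)) +
  xlab("Year") +                                  # edit the x-axis label
  ggtitle("A title") +                            # add a title
  theme(
    plot.title      = element_text(face = "bold"),  # make the title bold
    legend.position = "bottom"                      # where a legend would be placed
  )

p  # display the plot
```

See ?theme for the full list of elements that can be changed.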