Brook Luers Statistician, data scientist
CV     Teaching     Research     Blog    

Simple OpenStreetMap Queries with R

I wanted to learn about the affordability of “walkable” homes in a handful of cities. A walkable home, for me, has grocery stores, bus or train stops, cafes, and other conveniences within about a half-mile radius. To do this I decided to create my own version of the WalkScore, since it costs money to obtain the actual WalkScore® for a large set of addresses.

A future post will discuss the affordability of walkable homes. Here I will describe the processes of querying OpenStreetMap (OSM) for neighborhood features surrounding a specific address using R.

My desired workflow was as follows:

street address (latitude, longitude)
   --> coordinates for a walkable region around the address
   --> OpenStreetMap query for useful features inside the walkable region
   --> results in an R data frame

Chicago’s Union Station is located at 225 S. Canal St in Chicago, with approximate coordinates 41.878516, -87.639607.

The results of the OSM query will look like this:

Union Station query results

…with data that looks like this:

@type @id @lat @lon amenity name shop highway railway public_transport bicycle
node 1629279491 41.87748 -87.62850 restaurant Plymouth NA NA NA NA NA
node 2792878028 38.90000 -77.03390 restaurant Joe’s Seafood, Prime Steak & Stone Crab - Washington DC NA NA NA NA NA
node 4595627821 41.87924 -87.63513 NA Franklin & Adams NA bus_stop NA NA NA
node 6280526718 41.88711 -87.64608 restaurant Carnivale NA NA NA NA NA
node 804650032 38.90395 -77.04293 restaurant Poki DC NA NA NA NA NA
node 6267343176 41.88098 -87.64145 restaurant Blackwood BBQ NA NA NA NA NA
node 6299447606 41.88411 -87.64102 NA Pastoral bakery NA NA NA NA
node 5249818041 41.88343 -87.64874 restaurant Parlor Pizza Bar NA NA NA NA NA
node 4887024365 38.90284 -77.03883 bar Off The Record NA NA NA NA NA
node 611023873 38.90191 -77.02552 cafe Lawson Grill NA NA NA NA NA

Geographic coordinates for the region surrounding an address

Before finding the coordinates of a region surrounding an address, you need the coordinates of the address itself. A service like geocod.io will convert a list of addresses to geographic coordinates.

The walkable region surrounding an address will be specified using a bounding box. For simplicity we will define the walkable region based on “crow-flies” distance (the length of the arc connecting two points on the surface of a sphere), not the actual walking distance between the points.

This calculation is described here and implemented below in R.

deg2rad <- function(x){
  return(pi * x / 180)
}
rad2deg <- function(x){
  return(x * 180 / pi)
}
get_bbox <- function(lat, lon, dist_km = 1){
  latrad <- deg2rad(lat)
  lonrad <- deg2rad(lon)
  RADIUS <- 6371 # km
  r <- dist_km /RADIUS
  latmin <- latrad - r
  latmax <- latrad + r
  
  T1 <- asin(sin(latrad) / cos(r))
  delta <- asin(sin(r) / cos(latrad))
  lonmin <- lonrad - delta
  lonmax <- lonrad + delta
  
  bbox <- 
    cbind(rad2deg(latmin),rad2deg(lonmin),rad2deg(latmax),rad2deg(lonmax))
  return(bbox)
}

For OpenStreetMap queries, a bounding box has the format (bottom, left, top, right), where bottom is the minimum latitude value for the box, left is the minimum longitude value, and so on.

Querying OpenStreetMap for neighborhood features

You can query the OpenStreetMap API, called Overpass, using overpass turbo, an interactive web tool. In the next section we’ll see how to query the API using R.

Useful features of a neighborhood, like stores and bus stops, are called nodes, ways, or relations in OpenStreetMap. All of the features I used to create my walkability index are nodes or ways. A node is a single point and a way is a list of nodes. So, for example, a Pick ‘n Save grocery store is likely represented by a way containing nodes which define the outline of the building (see example here).

For illustration, we’ll search in the area surrounding Union Station, which has approximate coordinates 41.87852, -87.63961. A 1-kilometer (about 0.6 miles) bounding box centered at this point is:

( bbox <- get_bbox(41.87852, -87.63961) ) 
##          [,1]      [,2]     [,3]      [,4]
## [1,] 41.86953 -87.65169 41.88751 -87.62753

Find nodes with specific tags

To find useful features near Union Station, we want to find all nodes and ways with specific tags, such as “supermarket” or “cafe”. A wiki page lists the various OSM tags. Here is a starter OSM query:

[out:json];   //specify the output format
// parentheses create a union of elements
(    
   node["highway"="bus_stop"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["railway"="station"]["station"="subway"](41.8695227,-87.651685,41.8875092,-87.6275284);
);
// return the tags and their values, and the center point (lat,lon) of each element
out tags center; 

Union Station: nearby bus and rail stations

This returns all nodes in the bounding box with tag highway equal to bus_stop, and all nodes with tag railway equal to station. In general, the statement node["key"="value"] returns all nodes with key equal to value; we restrict the results to a given bounding box by appending (bottom, left, top right) to the node statement.

The outer parentheses (enclosing the two node statements) make a difference: remove them and the query would only return the subway station nodes.

We can also use regular expressions when filtering by a tag value:

node["shop"~"supermarket|bakery"]

The ~ indicates that we are finding nodes with tag shop matching the regular expression supermarket|bakery.

Find stores, cafes, restuarants, and transit (with csv output)

Combining these ideas, we can search for stores, public transit stations, and bicycle paths with a query like this:

[out:csv(::type, ::id, ::lat, ::lon, amenity, name, shop, highway, railway, public_transport, bicycle)];

(
   node["shop"~"supermarket|bakery|convenience|coffee|deli|greengrocer|tea|clothes|hardware|houseware|bicycle"](41.8695227,-87.651685,41.8875092,-87.6275284);
   way["shop"~"supermarket|bakery|convenience|coffee|deli|greengrocer|tea|clothes|hardware|houseware|bicycle"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["amenity"~"cafe|library|restaurant|bar"](41.8695227,-87.651685,41.8875092,-87.6275284);
   way["amenity"~"cafe|library|restaurant|bar"](41.8695227,-87.651685,41.8875092,-87.6275284);
   way["bicycle"="designated"]["highway"="path"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["highway"="bus_stop"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["railway"="station"]["station"="subway"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["railway"="tram_stop"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["public_transport"="platform"](41.8695227,-87.651685,41.8875092,-87.6275284);
   node["railway"="subway_entrance"](41.8695227,-87.651685,41.8875092,-87.6275284);
);
out tags center;

Cafes, restaurants, and many other useful things are tagged as amenities, not shops. This query also demonstrates how to output tabular data with a header row. The fields like ::type and ::lat are “special fields”, described on the wiki. We now have a query which returns most of the features surrounding an address which make it easier to live there without a car.

The language guide provides more details about the OSM query language.

Use R to query OpenStreetMap

Now we are ready to build queries and interact with the API using R. Here is a data.frame with two example addresses:

address zip beds baths sqfeet price city state Latitude Longitude
1600 Pennsylvania Ave NW 20500 40 35 55000 1e+07 Washington DC 38.89742 -77.03648
225 S Canal St 60606 2 1 10000 2e+06 chicago il 41.87852 -87.63961

For each address, we want to run the same OSM query for the walkable region surrounding the address. We will take each latitude, longitude pair in the data.frame, calculate its bounding box, and insert this bounding box into a string containing the query we built previously.

( bboxes <- get_bbox(d$Latitude, d$Longitude) )
##          [,1]      [,2]     [,3]      [,4]
## [1,] 38.88842 -77.04803 38.90641 -77.02492
## [2,] 41.86952 -87.65169 41.88751 -87.62753
( bboxes_string <- apply(bboxes, 1, function(brow) 
    return(paste("(", paste(brow, collapse=','), ")", sep="")))
)
## [1] "(38.8884237839408,-77.0480313718395,38.9064102160592,-77.0249206281604)"
## [2] "(41.8695227839408,-87.651685533676,41.8875092160592,-87.6275284663239)"

Programatically build query for multiple locations

The “meat” of the query consists of the node and way statements:

# Apply this regular expression to both nodes and ways
shopTags <-
  'supermarket|bakery|convenience|coffee|deli|greengrocer|tea|clothes|hardware|houseware|bicycle'
shopTagFilter <- paste('["shop"~"', shopTags, '"]', sep = '')
amenTags <- 'cafe|library|restaurant|bar'
amenTagFilter <- paste('["amenity"~"', amenTags, '"]', sep = '')
queryKeys <-
  c(
    paste('node', shopTagFilter, sep = ''),
    paste('way', shopTagFilter, sep = ''),
    paste('node', amenTagFilter, sep = ''),
    paste('way', amenTagFilter, sep = ''),
    'way["bicycle"="designated"]["highway"="path"]',
    'node["highway"="bus_stop"]',
    'node["railway"="station"]["station"="subway"]',
    'node["railway"="tram_stop"]',
    'node["public_transport"="platform"]',
    'node["railway"="subway_entrance"]'
  )
cat(paste(queryKeys,collapse='\n'))
## node["shop"~"supermarket|bakery|convenience|coffee|deli|greengrocer|tea|clothes|hardware|houseware|bicycle"]
## way["shop"~"supermarket|bakery|convenience|coffee|deli|greengrocer|tea|clothes|hardware|houseware|bicycle"]
## node["amenity"~"cafe|library|restaurant|bar"]
## way["amenity"~"cafe|library|restaurant|bar"]
## way["bicycle"="designated"]["highway"="path"]
## node["highway"="bus_stop"]
## node["railway"="station"]["station"="subway"]
## node["railway"="tram_stop"]
## node["public_transport"="platform"]
## node["railway"="subway_entrance"]

We just need to paste each bounding box to the end of these statements and specify the output format:

# Specify the output format and the output content
startQuery <-
  '[out:csv(::type, ::id, ::lat, ::lon,amenity, name,shop,highway,railway,public_transport,bicycle)];('
endQuery <- ');out tags center;'
# note the opening and closing parentheses

# Combine the bounding boxes with the node and way statements
queryBig <- sapply(bboxes_string, function(bbs)
  return(paste(
    paste(queryKeys, bbs, ';', sep = ''),
    collapse = ''
  )))

# length(queryBig) = (number of addresses)
# ...combine the address-specific queries into a single string: 
queryBig <- paste(queryBig, collapse='') 
fullQuery <- paste(startQuery, queryBig, endQuery, sep = '')
str(fullQuery)
##  chr "[out:csv(::type, ::id, ::lat, ::lon,amenity, name,shop,highway,railway,public_transport,bicycle)];(node[\"shop\"| __truncated__

Obtain results via HTTP request

Finally, we can use the httr package to send our query to the Overpass API and manipulate the results.

baseurl <- 'http://overpass-api.de/api/interpreter'
resp <- GET(baseurl, query = list(data = fullQuery))
head(read_tsv(content(resp, 'text')), 10)
## # A tibble: 10 x 11
##    `@type`    `@id` `@lat` `@lon` amenity name       shop  highway railway
##    <chr>      <dbl>  <dbl>  <dbl> <chr>   <chr>      <chr> <chr>   <chr>  
##  1 node      2.58e8   41.9  -87.6 <NA>    UIC-Halst… <NA>  <NA>    station
##  2 node      2.58e8   41.9  -87.6 <NA>    Clinton    <NA>  <NA>    station
##  3 node      2.58e8   41.9  -87.6 <NA>    Harrison   <NA>  <NA>    station
##  4 node      2.58e8   41.9  -87.6 <NA>    Jackson    <NA>  <NA>    station
##  5 node      2.58e8   41.9  -87.6 <NA>    Monroe     <NA>  <NA>    station
##  6 node      2.59e8   41.9  -87.6 <NA>    Clinton    <NA>  <NA>    station
##  7 node      2.90e8   41.9  -87.6 <NA>    Washington <NA>  <NA>    station
##  8 node      2.90e8   41.9  -87.6 <NA>    Monroe     <NA>  <NA>    station
##  9 node      2.90e8   41.9  -87.6 <NA>    Jackson    <NA>  <NA>    station
## 10 node      3.18e8   41.9  -87.6 <NA>    State & M… <NA>  bus_st… <NA>   
## # ... with 2 more variables: public_transport <chr>, bicycle <chr>

The object content(resp, 'text') is a raw character vector, which can be converted to a data.frame with read_tsv, from the readr package.