Fun with the Yelp API

This post details my first attempt at using the Yelp API with R (and Tableau). Going in, I wasn’t quite sure which direction to take. I had some big ideas but quickly realized I should start with something simple to figure things out. The search API has one major limitation: it returns a maximum of 20 records per search. This is pretty unkind to “big” ideas. So I decided to go with this:

  • Pull the top 20 highest rated “food” establishments by state
  • Each restaurant has a list of categories. I’d look at the most frequently occurring category across the 20 records returned for a given state, and map the results in Tableau.

Pretty useless, and pretty straightforward for getting my feet wet. A couple of caveats: the code is ugly, and pretty much 100% old-school R. I’m certain there are better ways to go about this, but efficiency wasn’t a concern, at least for the time being. I just wanted to throw together something that worked. Also, for the basic API connection I referenced this post at stackoverflow.

consumerKey = "AirdUEybmOOnaU2ctKHj6Q"
consumerSecret = "ZfmDvGS0zZrROiyVxs0T9mlbQ6M"
token = "VS_b5omGNQ6_CCBO5Ou9bDDjBYSrfWK6"
token_secret = "_deyWlngdD0ziH14bIqMMFqgMk0"

require(httr)
require(httpuv)
require(jsonlite)
#load plyr before dplyr so dplyr's functions win
library(plyr); library(dplyr)

# authorization
myapp = oauth_app("YELP", key = consumerKey, secret = consumerSecret)
sig = sign_oauth1.0(myapp, token = token, token_secret = token_secret)

#state codes
states<-read.csv("States.csv", sep=",", header=FALSE)
states<-as.vector(states[,1])
#replace ambiguous state codes: the API reads LA as Los Angeles
#and WA as Washington DC, so we use the full state name for
#Louisiana and the city of Olympia as a proxy for Washington
states<-replace(states, states=="LA", "Louisiana")
states<-replace(states, states=="WA", "Olympia")

#initialize a list. We'll iterate through each state code, 
#create a data frame of the top 20 restaurants
#then save the data frame to a list
rest_data<-list()

#iterate through all 51 state codes (50 states plus DC) and 
#pull the top 20 restaurants by rating
for (i in 1:51) {
  #sort=2 asks the search API for highest-rated results
  yelpurl <- paste0("http://api.yelp.com/v2/search/?term=food&limit=20&sort=2&location=", states[i])
  locationdata <- GET(yelpurl, sig)
  locationdataContent <- content(locationdata)
  locationdataList <- jsonlite::fromJSON(toJSON(locationdataContent))
  rest_data[[i]] <- data.frame(locationdataList)
}

For this first chunk, we take care of the requisite API authorization. I then load up a vector of state codes to iterate through. I ran into a bit of a problem here, as the API confuses the state code LA with the city of Los Angeles, and the state code WA with Washington DC. I’ve got a hacky solution to replace LA with “Louisiana”, but as far as WA goes I’m simply using Olympia for now. You can also specify a geographical bounding box via coordinates, which would be easy to do (see the sketch below).
We then iterate through each state, do the API call, parse the results, and save each resulting data.frame to a list object.
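
As an aside, here’s roughly what that bounding-box version could look like. The v2 search API takes a bounds parameter formatted as sw_latitude,sw_longitude|ne_latitude,ne_longitude; the coordinates below are just my rough guess at a box around Washington state, so treat them as placeholders.

#rough bounding box around Washington state (approximate coordinates)
sw <- c(45.5, -124.8)  #southwest corner: latitude, longitude
ne <- c(49.0, -116.9)  #northeast corner: latitude, longitude
locationdata <- GET("http://api.yelp.com/v2/search/", sig,
                    query = list(term = "food", limit = 20, sort = 2,
                                 bounds = paste0(sw[1], ",", sw[2], "|", ne[1], ",", ne[2])))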

Remember, I mentioned this is old-school, so you can expect many for-loops.

#names of the features we want to keep
keeps <- c("businesses.rating", "businesses.review_count", 
"businesses.name", "businesses.categories", "city", "state")
#iterate through each state and build the dataframe 
#of top 20 locations per state
for (i in 1:51) {
  df<-rest_data[[i]]
  #flatten the lists
  df$businesses.categories <- vapply(df$businesses.categories, 
paste, collapse = ", ", character(1L))
  df$businesses.rating<-unlist(df$businesses.rating)
  df$businesses.review_count<-unlist(df$businesses.review_count)
  df$businesses.name<-unlist(df$businesses.name)
  df$city<-unlist(df$businesses.location$city)
  df$state<-unlist(df$businesses.location$state_code)
  #keep the fields we want
  df<-df[,(names(df) %in% keeps)]
  rest_data[[i]]<-df
}

The API returns a nested data.frame, which makes binding the list together impossible. So I iterate through each data.frame in my rest_data list, flatten the list-columns as needed, and drop the fields I don’t need.
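
As another aside, jsonlite can do some of this flattening for you: fromJSON() has a flatten argument that un-nests nested data.frames on the way in. I haven’t tested it against this particular payload, so consider it a sketch:

#flatten nested data.frames while parsing, which may avoid
#some of the manual unlist() calls above
locationdataList <- jsonlite::fromJSON(toJSON(locationdataContent), flatten = TRUE)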

#use dplyr's bind_rows() to bind all data frames 
#for each state
top_rest<-bind_rows(rest_data)
#aggregate total review counts by state. We'll add a 
#column for the top category per state below
top_cat<-aggregate(top_rest$businesses.review_count, 
list(state=top_rest$state), sum)

#now we create a new data frame going through each 
#state and returning the top category
for (i in 1:51) {
  subset_state<-subset(top_rest, state==top_cat$state[i])
  cats<-paste(subset_state$businesses.categories)
  #switch to lowercase to compare strings
  cats<-tolower(cats)
  # we see many versions of american, such as 
  # tradamerican, american (new), etc., so replace them all with american
  
  cats<-gsub("([^, ]+?american)|(american[^,]+)", 
"american", cats)
  #extract only unique words from each category string, 
  #as each string represents a single restaurant
  cats<-vapply(lapply(strsplit(cats, " "), unique), paste, 
character(1L), collapse = " ")
  #put everything back together in one big string
  cats<-paste(cats, collapse=' ')
  #split string by individual words into a vector
  split_cats<-unlist(strsplit(cats, split=", "))
  
  top_cat$top_cat[i]<-names(sort(table(split_cats), decreasing=TRUE))[1]
}

After going through and fixing each data.frame in my list, I bind everything into one data.frame using bind_rows() from the dplyr package (maybe the only non-old-school thing here). Maybe I should have stuck with rbind() to keep with the trend. We then use aggregate() to summarize review counts by state. The resulting data.frame forms the basis for the visualization.
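
For what it’s worth, the aggregate() call has a direct dplyr equivalent if you’d rather go full non-old-school; a quick sketch:

#same summary as the aggregate() call, dplyr-style
top_cat <- top_rest %>%
  group_by(state) %>%
  summarise(x = sum(businesses.review_count)) %>%
  as.data.frame()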

Now we get to the fun part. Each business record has a comma-delimited category field which looks like “American (New), Italian, Seafood, newamerican, italian, seafood”. For each restaurant I standardize the various forms of “american” to, um… “american” (I admit spending way too much time on this; good thing regexr.com is fun), and remove duplicates. I then split everything into individual categories and throw everything (categories for all restaurants in that state) back into one big string. Extract the most frequently occurring category for the state, and we’re done.
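
To show what that regex actually does, here it is on a made-up category string:

cats <- "american (new), tradamerican, italian, seafood"
gsub("([^, ]+?american)|(american[^,]+)", "american", cats)
#[1] "american, american, italian, seafood"

Both the suffixed “american (new)” and the prefixed “tradamerican” collapse to plain “american”, while “italian” and “seafood” pass through untouched.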

We now have the data.frame top_cat (based on our earlier aggregation), from which we get the state code, the total review count across the top 20 results (the x column), and the most frequently occurring category.

  state     x            top_cat
1    AK  3172           american
2    AL  2073           american
3    AR  1691           american
4    AZ 10118           american
5    CA 42671 breakfast & brunch
6    CO 13553         sandwiches

I dumped out a csv and brought it into Tableau to map the results. Turns out “American” food is a pretty big deal. How informative!
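
The dump itself is a one-liner, something like this (the filename is arbitrary):

#write the summary out for Tableau
write.csv(top_cat, "top_cat.csv", row.names = FALSE)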

Embedding doesn’t seem to work great, so check it out at Tableau Public here.

The code pulls from the API every time it’s run, so these results will change.

The code can be found via Github here.

