A choropleth map of American commute times


I just posted my first Kaggle script, which uses American census data to look at commute times. Turns out if you hate commuting, the worst place to live is Maryland and the best is… South Dakota. It also turns out that while women typically earn far less than men, their commutes are pretty well identical. So they travel the same distance to earn less money.

Check it out on Kaggle here: https://www.kaggle.com/philjette/2013-american-community-survey/the-commuters

or on GitHub here: https://github.com/philjette/CensusData


Fun with the Yelp API

This post details my first attempt at using the Yelp API with R (and Tableau). Going in, I wasn’t quite sure in which direction I wanted to go. I had some big ideas but quickly came to the realization that I should come up with something simple first to figure things out. The search API has one major limitation: it returns a maximum of 20 records per search. This is pretty unkind to “big” ideas. So I decided to go with this:

  • Pull the top 20 highest rated “food” establishments by state
  • Each restaurant has a category. I’d look at the most frequently occurring category in the 20 records returned for a given state, and map the results in Tableau.

Pretty useless, and pretty straightforward to get my feet wet. A couple caveats: the code is ugly, and pretty well 100% old school R. I’m certain there are better ways to go about this, but efficiency wasn’t a concern at least for the time being. I just wanted to throw together something that worked. Also, in terms of the basic API connection I referenced this post at stackoverflow.

consumerKey = "AirdUEybmOOnaU2ctKHj6Q"
consumerSecret = "ZfmDvGS0zZrROiyVxs0T9mlbQ6M"
token = "VS_b5omGNQ6_CCBO5Ou9bDDjBYSrfWK6"
token_secret = "_deyWlngdD0ziH14bIqMMFqgMk0"

library(httr); library(plyr); library(dplyr)

# authorization
myapp <- oauth_app("YELP", key = consumerKey, secret = consumerSecret)
sig <- sign_oauth1.0(myapp, token = token, token_secret = token_secret)

#state codes
states<-read.csv("States.csv", sep=",", header=FALSE)
#replace state codes w duplicate city names. We use
#Olympia as a proxy for Washington
states<-replace(states, states=="LA", "Louisiana")
states<-replace(states, states=="WA", "Olympia")

#initialize a list. We'll iterate through each state code,
#create a data frame of the top 20 restaurants
#then save the data frame to a list
rest_data <- vector("list", 51)

#iterate through all 51 states and pull the top
#20 restaurants by rating
for (i in 1:51) {
  yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=20&sort=2&category_filter=food&location=",
                    states[i, 1])
  locationdata <- GET(yelpurl, sig)
  locationdataContent <- content(locationdata)
  #flatten the parsed JSON into a (nested) data frame
  rest_data[[i]] <- data.frame(jsonlite::fromJSON(jsonlite::toJSON(locationdataContent)))
  #record which state this batch belongs to
  rest_data[[i]]$state <- states[i, 1]
}

For this first chunk, we take care of the requisite API authorization. I then load a vector of state codes to iterate through. I ran into a bit of a problem here: the API confuses the state code LA with the city of Los Angeles, and the state code WA with Washington DC. I've got a hacky solution to replace LA with "Louisiana", but as far as WA goes I'm simply using Olympia for now. You could also specify a geographical bounding box via coordinates, which would be easy to do.
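For reference, a minimal sketch of that bounding-box alternative. The `bounds` parameter format and the coordinates below are my assumptions about the v2 search API, not something from the original script:

```r
# build a search URL constrained to a bounding box instead of a location name
# assumed bounds format: sw_latitude,sw_longitude|ne_latitude,ne_longitude
base <- "http://api.yelp.com/v2/search/?limit=20&sort=2&category_filter=food"
# hypothetical box roughly around Olympia, WA
yelpurl <- paste0(base, "&bounds=46.9,-123.1|47.2,-122.7")
```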
We then iterate through each state, do the API call, parse the results, and save each resulting data.frame to a list object.

Remember I mentioned this is old-school so you can expect many for-loops.

#names of the features we want to keep
keeps <- c("businesses.rating", "businesses.review_count",
"businesses.name", "businesses.categories", "city", "state")
#iterate through each state and build the dataframe
#of top 20 locations per state
for (i in 1:51) {
  df <- rest_data[[i]]
  #flatten the category lists into comma-delimited strings
  df$businesses.categories <- vapply(df$businesses.categories,
                                     paste, collapse = ", ", character(1L))
  #keep the fields we want
  df <- df[, (names(df) %in% keeps)]
  rest_data[[i]] <- df
}

The API returns a nested data.frame, which makes binding the list directly impossible. So I iterate through each data.frame in my rest_data list, flatten as needed, and drop the fields I don't need.

#bind all the per-state data frames into one
top_rest <- bind_rows(rest_data)
#aggregate a new df. we will then include a col for the
#top category for each state
top_cat <- aggregate(top_rest$businesses.review_count,
                     by = list(state = top_rest$state), sum)

#now we create a new data frame going through each
#state and returning the top category
for (i in 1:51) {
  subset_state <- subset(top_rest, state == top_cat$state[i])
  #switch to lowercase to compare strings
  cats <- tolower(subset_state$businesses.categories)
  #we see many versions of american such as
  #tradamerican, american (new), etc... so replace with american
  cats <- gsub("([^, ]+?american)|(american[^,]+)",
               "american", cats)
  #extract only unique words from each category string,
  #as each string represents a resto
  cats <- vapply(lapply(strsplit(cats, " "), unique), paste,
                 character(1L), collapse = " ")
  #put everything back together in one big string
  cats <- paste(cats, collapse = ', ')
  #split string by individual words into a vector
  split_cats <- unlist(strsplit(cats, split = ", "))
  #the most frequently occurring category wins
  top_cat$top_cat[i] <- names(which.max(table(split_cats)))
}


After going through and fixing each data.frame in my list, I bind everything into one data.frame using bind_rows() from the dplyr package. Maybe the only non-old-school thing here; maybe I should have stuck with rbind() to keep with the trend. We then use aggregate to summarize by state. The resulting data.frame forms the basis for the visualization.

Now we get to the fun part. Each business record has a comma-delimited category field which looks like "American (New), Italian, Seafood, newamerican, italian, seafood". For each restaurant I standardize the various forms of "american" to um… "american" (I admit spending way too much time on this; good thing regexr.com is fun), and remove duplicates. I then split everything into individual categories and throw everything (categories for all restaurants in that state) back into one big string, extract the most frequently occurring category for the state, and we're done.
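As a quick illustration of that cleanup step, here's the standardization regex applied to a made-up sample category string:

```r
# lowercase, then collapse every variant of "american" to plain "american"
cats <- tolower("American (New), Italian, Seafood, newamerican")
cats <- gsub("([^, ]+?american)|(american[^,]+)", "american", cats)
cats
# "american, italian, seafood, american"
```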

We now have the data.frame top_cat (based on our earlier aggregation) from which we get the state code, the number of reviews for the top 20 results, and the most frequently occurring category.

  state     x            top_cat
1    AK  3172           american
2    AL  2073           american
3    AR  1691           american
4    AZ 10118           american
5    CA 42671 breakfast & brunch
6    CO 13553         sandwiches

I dumped out a csv and brought it into Tableau to map the results. Turns out “American” food is a pretty big deal. How informative!
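The export itself is one line; for completeness, a sketch using a small stand-in for the aggregated data frame (the filename is arbitrary):

```r
# stand-in for the aggregated top_cat data frame described above
top_cat <- data.frame(state = c("AK", "AL"),
                      x = c(3172, 2073),
                      top_cat = c("american", "american"))
# one row per state, ready to map in Tableau
write.csv(top_cat, "top_cat.csv", row.names = FALSE)
```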

Embedding doesn’t seem to work great so check it out at Tableau Public here.

The code pulls from the API every time it’s run, so these results will change.

The code can be found via Github here.



Feature Selection using Information Gain in R

When considering a predictive model, you might be interested in knowing which features of your data provide the most information about the target variable of interest. For example, suppose we’d like to predict the species of Iris based on sepal length and width as well as petal length and width (using the iris dataset in R).


Which of these 4 features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species, and could only ask for the value of 1 feature, which feature would give you the greatest likelihood of winning your bet?

While there are many R packages out there for attribute selection, I’ve coded a few basic functions for my own usage for selecting attributes based on Information Gain (and hence on Shannon Entropy).

For starters, let’s define what we mean by Entropy and Information Gain.

Shannon Entropy
H(p_1 \dots p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i

Where p_i is the probability of value i and n is the number of possible values. For example in the iris dataset, we have 3 possible values for Species (Setosa, Versicolor, Virginica), each representing \frac{1}{3} of the data. Therefore

-\sum_{i=1}^{3} \frac{1}{3} \log_2 \frac{1}{3} = 1.59
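This is easy to check in R (a quick sketch, not the function from my repo):

```r
# Shannon entropy of a discrete variable, in bits
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]   # drop empty levels so 0 * log2(0) doesn't produce NaN
  -sum(p * log2(p))
}
entropy(iris$Species)   # 1.584963, i.e. log2(3)
```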

Information Gain
IG = H_p - \sum_{i=1}^{n} p_{ci}H_{ci}

 Where H_p is the entropy of the parent (the complete, unsegmented dataset), n  is the number of values of our target variable (and the number of child segments), p_{ci} is the probability that an observation is in child i (the weighting), and H_{ci} is the entropy of child (segment) i.

Continuing with our iris example, we could ask the following: “Can we improve (reduce) the entropy of the parent dataset by segmenting on Sepal Length?”

In this case, Sepal Length is numeric. You’ll notice the code provides functions for both numeric and categorical variables. For categorical variables, we simply segment on each possible value. However in the numeric case, we will bin the data according to the desired number of breaks (which is set to 4 by default).
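A minimal sketch of the numeric case follows. This illustrates the same idea, but it is not the exact IG_numeric function from the repo; the entropy() helper and the equal-width binning via cut() are my assumptions:

```r
# Shannon entropy of a discrete variable, in bits
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# information gain of a binned numeric feature with respect to a target:
# parent entropy minus the weighted entropy of each bin (child)
ig_numeric_sketch <- function(data, feature, target, bins = 4) {
  segments   <- split(data[[target]], cut(data[[feature]], breaks = bins))
  weights    <- vapply(segments, length, integer(1)) / nrow(data)
  h_children <- vapply(segments, entropy, numeric(1))
  entropy(data[[target]]) - sum(weights * h_children)
}

ig_numeric_sketch(iris, "Sepal.Length", "Species", bins = 5)
```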

If we segment using 5 breaks, we get 5 children. Note e is the computed entropy for this subset, p is the proportion of records, N is the number of records, and min and max are… the min and max.


We improve on the entropy of the parent in each child. In fact, segment 5 is perfectly pure, though weighted lightly due to the low proportion of records it contains. We can formalize this using the information gain formula noted above. Calling the IG_numeric function, we see that IG(Sepal.Length) = .64 using 5 breaks.

Note that the categorical and numeric functions are called as follows

IG_numeric(data, feature, target, bins=4)

Both functions return the IG value, however you can change return(IG) to return(dd_data) to return the summary of the segments as a data.frame for investigation.

You could easily modify the code to:

– Optimize the number of splits for numeric attributes

– Iterate through a pre-determined index of attributes and rank their IG in a data.frame

I’ll add these features once I have the time to do so, but please feel free to let me know if either I’m out to lunch or if you have any questions/comments/proposed improvements.

Here’s the code: https://github.com/philjette/InformationGain
