About

I’m a data geek working in Technology Strategy at Telus Communications in Toronto, ON. I’m happiest when problem solving and am generally attracted to things I don’t quite understand. My interests include business intelligence, data science, and programming. I also enjoy music, cycling, and the hoppiest of pale ales.

This site is simply a repository for various projects.

Twitter: https://twitter.com/philjette | LinkedIn: https://ca.linkedin.com/in/philjette


A choropleth map of American commute times

[Choropleth map of average commute times by state]

I just posted my first Kaggle script, which uses American census data to look at commute times. Turns out if you hate commuting, the worst place to live is Maryland and the best is… South Dakota. It also turns out that while women typically earn far less than men, their commutes are pretty well identical. So they travel the same distance to earn less money.

Check it out on Kaggle here: https://www.kaggle.com/philjette/2013-american-community-survey/the-commuters

or on GitHub here: https://github.com/philjette/CensusData
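If you just want the gist of the aggregation behind the map, here’s a rough sketch (assuming the person-level PUMS file from the Kaggle release, with its JWMNP commute-minutes, ST state, and SEX columns; the file name below is hypothetical and the actual script does more than this):

library(dplyr)

#hypothetical file name: the Kaggle release splits the person-level file into parts
pums <- read.csv("ss13pusa.csv")

#average commute time by state (JWMNP = travel time to work, in minutes)
commute_by_state <- pums %>%
  filter(!is.na(JWMNP)) %>%
  group_by(ST) %>%
  summarise(avg_commute = mean(JWMNP))

#average commute time by sex (SEX: 1 = male, 2 = female)
commute_by_sex <- pums %>%
  filter(!is.na(JWMNP)) %>%
  group_by(SEX) %>%
  summarise(avg_commute = mean(JWMNP))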


Fun with the Yelp API

This post details my first attempt at using the Yelp API with R (and Tableau). Going in, I wasn’t quite sure in which direction I wanted to go. I had some big ideas but quickly came to the realization that I should come up with something simple first to figure things out. The search API has one major limitation: it returns a maximum of 20 records per search. This is pretty unkind to “big” ideas. So I decided to go with this:

  • Pull the top 20 highest rated “food” establishments by state
  • Each restaurant has a category. I’d look at the most frequently occurring category in the 20 records returned for a given state, and map the results in Tableau.

Pretty useless, and pretty straightforward to get my feet wet. A couple caveats: the code is ugly, and pretty well 100% old school R. I’m certain there are better ways to go about this, but efficiency wasn’t a concern at least for the time being. I just wanted to throw together something that worked. Also, in terms of the basic API connection I referenced this post at stackoverflow.

consumerKey = "AirdUEybmOOnaU2ctKHj6Q"
consumerSecret = "ZfmDvGS0zZrROiyVxs0T9mlbQ6M"
token = "VS_b5omGNQ6_CCBO5Ou9bDDjBYSrfWK6"
token_secret = "_deyWlngdD0ziH14bIqMMFqgMk0"

require(httr)
require(httpuv)
require(jsonlite)
require(plyr)
require(dplyr) #load plyr before dplyr so dplyr's verbs aren't masked

# authorization
myapp = oauth_app("YELP", key=consumerKey, secret=consumerSecret)
sig=sign_oauth1.0(myapp, token=token,token_secret=token_secret)

#state codes
states<-read.csv("States.csv", sep=",", header=FALSE)
states<-as.vector(states[,1])
#replace ambiguous state codes: the API reads LA as Los Angeles
#and WA as Washington, DC, so use the full state name for
#Louisiana and the city of Olympia as a proxy for Washington
states<-replace(states, states=="LA", "Louisiana")
states<-replace(states, states=="WA", "Olympia")

#initialize a list. We'll iterate through each state code, 
#create a data frame of the top 20 restaurants
#then save the data frame to a list
rest_data<-list()

#iterate through all 51 state codes (50 states plus DC)
#and pull the top 20 restaurants by rating
for (i in 1:51) {
  #build the search URL: top 20 "food" businesses in the state, sorted by rating
  yelpurl <- paste0("http://api.yelp.com/v2/search/?limit=20&sort=2&category_filter=food&location=", states[i])
  locationdata <- GET(yelpurl, sig)
  locationdataContent <- content(locationdata)
  locationdataList <- jsonlite::fromJSON(toJSON(locationdataContent))
  rest_data[[i]] <- data.frame(locationdataList)
}

For this first chunk, we take care of the requisite API authorization. I then load up a vector of state codes to iterate through. I ran into a bit of a problem here, as the API mistakes the state code LA for the city of Los Angeles, and the state code WA for Washington, DC. I’ve got a hacky solution here that replaces LA with “Louisiana”, but as far as WA goes I’m simply using Olympia for now. You could also specify a geographical bounding box via coordinates, which would be easy enough to do.
We then iterate through each state, do the API call, parse the results, and save each resulting data.frame to a list object.

Remember I mentioned this is old-school so you can expect many for-loops.

#names of the features we want to keep
keeps <- c("businesses.rating", "businesses.review_count", 
"businesses.name", "businesses.categories", "city", "state")
#iterate through each state and build the dataframe 
#of top 20 locations per state
for (i in 1:51) {
  df<-rest_data[[i]]
  #flatten the lists
  df$businesses.categories <- vapply(df$businesses.categories, 
paste, collapse = ", ", character(1L))
  df$businesses.rating<-unlist(df$businesses.rating)
  df$businesses.review_count<-unlist(df$businesses.review_count)
  df$businesses.name<-unlist(df$businesses.name)
  df$city<-unlist(df$businesses.location$city)
  df$state<-unlist(df$businesses.location$state_code)
  #keep the fields we want
  df<-df[,(names(df) %in% keeps)]
  rest_data[[i]]<-df
}

The API returns nested data.frames, which makes it impossible to bind the list directly. So I iterate through each data.frame in my rest_data list, flatten as needed, and get rid of the fields I don’t need.

#use dplyr's bind_rows() to bind all data frames
#for each state
top_rest<-bind_rows(rest_data)
#aggregate a new df summing review counts by state; we will
#then add a column for the top category in each state
top_cat<-aggregate(top_rest$businesses.review_count, 
list(state=top_rest$state), sum)

#now go through each state and add its most frequently
#occurring category to top_cat
for (i in 1:51) {
  subset_state<-subset(top_rest, state==top_cat$state[i])
  cats<-paste(subset_state$businesses.categories)
  #switch to lowercase to compare strings
  cats<-tolower(cats)
  #we see many variants of american, such as
  #tradamerican, american (new), etc., so replace them all with american
  
  cats<-gsub("([^, ]+?american)|(american[^,]+)", 
"american", cats)
  #extract only unique words from each category string, 
as each string represents a resto
  cats<-vapply(lapply(strsplit(cats, " "), unique), paste, 
character(1L), collapse = " ")
  #put everything back together in one big string
  cats<-paste(cats, collapse=' ')
  #split the combined string into a vector of individual categories
  split_cats<-unlist(strsplit(cats, split=", "))
  
  top_cat$top_cat[i]<-unlist(as.vector(names(sort(table(split_cats),decreasing=TRUE)[1:1])))
}

After going through and fixing each data.frame in my list, I bind everything into one data.frame using bind_rows() from the dplyr package. Maybe the only non-old-school thing here. Maybe I should have stuck with rbind() to keep with the trend. We then use aggregate() to sum review counts by state. The resulting data.frame forms the basis for the visualization.

Now we get to the fun part. Each business record has a comma-delimited category field which looks like “American (New), Italian, Seafood, newamerican, italian, seafood”. For each restaurant I standardize the various forms of “american” to, um… “american” (I admit spending way too much time on this; good thing regexr.com is fun), and remove duplicates. I then split everything into individual categories, throw everything (the categories for all restaurants in that state) back into one big string, extract the most frequently occurring category for the state, and we’re done.

We now have the data.frame top_cat (based on our earlier aggregation) from which we get the state code, the number of reviews for the top 20 results, and the most frequently occurring category.

  state     x            top_cat
1    AK  3172           american
2    AL  2073           american
3    AR  1691           american
4    AZ 10118           american
5    CA 42671 breakfast & brunch
6    CO 13553         sandwiches

I dumped out a csv and brought it into Tableau to map the results. Turns out “American” food is a pretty big deal. How informative!

Embedding doesn’t seem to work great so check it out at Tableau Public here.

The code pulls from the API every time it’s run, so these results will change.

The code can be found via Github here.

[Tableau map of the most common top-20 restaurant category by state]


Passing the SAP Certified Application Associate exam for Project System (C_TPLM22_64)

So I passed my Certified Application Associate in Project System for SAP ERP (C_TPLM22_64) exam. Just in time for the November announcement of S/4, but I guess that’s how it goes with IT certifications. Project System is hardly the sexiest product in the SAP landscape at the moment, so while studying I had a reasonably difficult time uncovering details on what I might expect from the exam itself, how others had prepared, what resources proved most useful, and so on. I figured that if I were to pass, I’d throw a quick post together outlining my experience.

Preparation 

First off, you’ll find the basic curriculum and practice questions provided by SAP. This pretty well represents the entirety of information you’ll find regarding the exam. Based on the curriculum, I stuck mostly to the PLM series course material. I focused on PLM200, PLM210, PLM220, and PLM230. Due to the sub-8% weighting of the Reporting category, I skipped PLM240 altogether, assuming applied experience would get me by. This was a mistake, and I would definitely suggest reviewing PLM240. I also read Kieron Dowling’s SAP Project System Handbook.

I was quite new to PS when I sat for the exam, so my prep time will likely differ from yours. I knew I wanted to take the exam about 6 months prior to taking it, but I’d say I only got serious about studying 2 months prior, really ramping up with 1 month to go.

Strategy

Don’t waste your time looking for practice questions. You’ll find a bunch of shady sites offering so-called “brain dumps” of recent exams (which is completely against SAP’s policies), and you’ll only waste precious, constructive study time. I did sign up for an account at sapexam.com, which provides legitimate practice questions based on the PLM course materials. The exam questions were much more consulting oriented than those provided by sapexam.com, but I’d say this was still useful, if for nothing more than to throw some variety into your study sessions. Just be careful not to fall into the rote memorization trap, where a false sense of confidence is achieved via answer memorization as opposed to actual understanding of the topics. To avoid this I kept my practice exams to a minimum, and was careful to review both correct and incorrect answers. From this perspective, sapexam did help guide my studies.

I first read through the PLM course materials quickly without making any annotations. I then went back through each with greater care, highlighting areas of greater importance. I went back through one last time and compiled notes on each course based off of the highlighted areas. Every few days I would do a mini (40 question) practice exam, and identify problem areas. I also used SCN and SAP Help for various topics, such as account assignment. I spent a great deal of time on PLM230 (Project Accounting) as this was my weakest point. From my summarized versions of each course, I made flashcards which were easy to review before bed, with my morning coffee, etc…

The flashcards were only somewhat useful. I wasted time memorizing transaction codes and definitions, when the exam itself focused much more on consulting type questions. There were no “when would you use transaction xxxx” type questions (at least in the exam I took).

Exam Day

I cleared my schedule for 2 days prior to the exam, which I had scheduled on a Friday at 10AM. I did very little studying in these 2 days, but did review problem areas and completed a full practice exam both days. Aside from that, I just relaxed. I made it to the testing centre a bit early on the day of, and was pretty well checked right in to write the exam.

The exam itself was split into categories in accordance with the curriculum provided by SAP. All questions were one of the following types:

  • True/False
  • Multiple choice, single correct answer
  • Multiple choice, multiple correct answers

Of course for those with multiple correct answers, you need to get them all in order to get it right. As I mentioned previously, the questions themselves were frequently consulting oriented, from the point of view of a client requiring particular functionality. Some were definition based. I was lucky that PLM240 had a very low weighting in the exam, because I naively believed that experience alone would suffice. Don’t make this mistake, and while you likely won’t pay as much attention to PLM240 as PLM230, you should still make sure to give it a review.

I made it through the exam quite quickly, not nearly using up the 3 hours allocated. The exam allows you to flag questions for review, and I used this function extensively. Upon completion, you can then go back through your flagged questions prior to submitting.

That’s it!

Head over to the exam invigilator’s desk and collect your printout. Hopefully you’ll see “PASS” and feel a bit of relief. This was my first certification, and a serious morale booster to succeed.

In review

  • Don’t waste time on shady “brain dump” sites. If you want to take some practice exams, sapexam.com is a decent source. Actually, for this exam they’re likely the only legitimate source. For the record, I have no affiliation with them.
  • Don’t fall into the rote memorization trap. Memorizing answers from either the PLM chapter reviews or from sapexam can give you a false sense of readiness. Make sure to focus on consulting scenarios.
  • 2 months of focused prep should be adequate, based on the strategy described above. This is also in line with comments I’ve come across on SCN. That said, significant consulting experience would cut this down.

Good luck!


Ode to a young family

On Sunday, September 27th, 3 young children and their grandfather were killed by a drunk driver North of Toronto.

Over the next weeks, months, and even years, they’ll slowly collapse under the weight of endless platitudes. The children looked so peaceful, the grandfather so young. They’ve gone to a better place now, you’ll rejoin them in the inevitable afterlife. They surely died without suffering, succumbing at once to the impact. God called his precious angels back. Le petit jésus les aimait trop.

And for their part they will acquiesce. They will give interviews, forced to recount their horror, to endure as the mass media’s search for the elusive ratings bump uncovers details of ever-increasing poignancy. Where were they going, what were they wearing, what’s the last thing they said. Because what better than the mass extinction of a young family to draw in viewers. The outrage! He was from a rich, well to do family! In this photo, we see him seated comfortably, defiantly, in a Ferrari.

But the fact is these endless platitudes, and even the would-be heir turned mass murderer himself mean nothing. We have nothing to say. We fill the air with noises because it comforts us, not them. I can’t even listen to them because what they have to say scares me. What they represent I can’t even comprehend. Life is fickle, short, unpredictable, and meaningless. The universe owes you nothing. The young are in no way owed the future, afforded the beauty and luxury of time.

And why? Because selfishness reigns. Because we can’t get it through our fucking thick skulls that we aren’t alone on this earth, that our actions affect others. That getting behind the wheel after your last beer, just one more beer, can cause someone to wish they died along with the family you took from them. Something is wrong. But we already knew this. How many times must we be shocked into the realization?

If they stood in front of me today, I would have nothing to say. Nothing.


Trimmed Mean in R

I found myself having to compute the trimmed mean today in R, and couldn’t immediately find a function that would accomplish this end. What do I mean (ha!) by the trimmed mean?

The trimmed mean removes outliers: we keep only values x inside the range

LQ - 1.5 \times IQR < x < UQ + 1.5 \times IQR

where LQ and UQ are the lower and upper quartiles and IQR is the interquartile range; anything outside this range is treated as an outlier and dropped.

Here’s our function:

#function to compute the trimmed mean
trimmed_means<-function(data_to_trim){
  Outlier_UQ = quantile(data_to_trim, .75)+1.5*IQR(data_to_trim)
  Outlier_LQ = quantile(data_to_trim, .25)-1.5*IQR(data_to_trim)
  #keep only the observations strictly inside the outlier bounds
  data_trimmed<-data_to_trim[data_to_trim < Outlier_UQ & data_to_trim > Outlier_LQ]
  return(mean(data_trimmed))
}

Simple as that. You can use this in something like ddply to compare with the mean.
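For example, a quick sketch with ddply on the built-in iris data (purely illustrative), comparing the plain mean against the trimmed mean by group:

library(plyr)

ddply(iris, .(Species), summarise,
      mean_sepal    = mean(Sepal.Length),
      trimmed_sepal = trimmed_means(Sepal.Length))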

Using the JohnsonJohnson dataset on quarterly earnings (dollars) per Johnson & Johnson share 1960–80, we get

> mean(JohnsonJohnson)
[1] 4.799762
> trimmed_means(JohnsonJohnson)
[1] 4.523902

The trimmed mean above is computed over only those observations that fall inside the outlier bounds.

 


Feature Selection using Information Gain in R

When considering a predictive model, you might be interested in knowing which features of your data provide the most information about the target variable of interest. For example, suppose we’d like to predict the species of Iris based on sepal length and width as well as petal length and width (using the iris dataset in R).

[First few rows of the iris dataset: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species]

Which of these 4 features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species, and could only ask for the value of 1 feature, which feature would give you the greatest likelihood of winning your bet?

While there are many R packages out there for attribute selection, I’ve coded a few basic functions for my own usage for selecting attributes based on Information Gain (and hence on Shannon Entropy).

For starters, let’s define what we mean by Entropy and Information Gain.

Shannon Entropy
H(p_1 \dots p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i

Where p_i is the probability of value i and n is the number of possible values. For example in the iris dataset, we have 3 possible values for Species (Setosa, Versicolor, Virginica), each representing \frac{1}{3} of the data. Therefore

-\sum_{i=1}^{3} \frac{1}{3} \log_2 \frac{1}{3} = 1.59
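As a quick sanity check in R (just computing the entropy of the Species column directly):

#proportion of each species, then Shannon entropy in bits
p <- table(iris$Species) / nrow(iris)
-sum(p * log2(p))
#> 1.584963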

Information Gain
IG = H_p - \sum_{i=1}^{n} p_{ci}H_{ci}

Where H_p is the entropy of the parent (the complete, unsegmented dataset), n is the number of values of our target variable (and the number of child segments), p_{ci} is the probability that an observation is in child i (the weighting), and H_{ci} is the entropy of child (segment) i.

Continuing with our iris example, we could ask the following: “Can we improve (reduce) the entropy of the parent dataset by segmenting on Sepal Length?”

In this case, Sepal Length is numeric. You’ll notice the code provides functions for both numeric and categorical variables. For categorical variables, we simply segment on each possible value. However in the numeric case, we will bin the data according to the desired number of breaks (which is set to 4 by default).
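To make the mechanics concrete, here is a stripped-down sketch of that calculation (my own simplified version for illustration, not the actual IG_numeric code; it assumes equal-width bins from cut() and a small helper I’ve called entropy()):

#Shannon entropy of a discrete variable
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]  #drop empty levels so 0 * log2(0) doesn't produce NaN
  -sum(p * log2(p))
}

#information gain from binning a numeric feature and splitting on the bins
ig_numeric_sketch <- function(data, feature, target, bins = 4) {
  parent_entropy <- entropy(data[[target]])
  #bin the numeric feature into equal-width intervals
  segments <- cut(data[[feature]], breaks = bins)
  #weighted sum of the child entropies
  child_entropy <- sum(sapply(levels(segments), function(s) {
    child <- data[[target]][segments == s]
    (length(child) / nrow(data)) * entropy(child)
  }))
  parent_entropy - child_entropy
}

#example call with 5 breaks (exact values depend on how the bins are chosen)
ig_numeric_sketch(iris, "Sepal.Length", "Species", bins = 5)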

If we segment using 5 breaks, we get 5 children. Note e is the computed entropy for this subset, p is the proportion of records, N is the number of records, and min and max are… the min and max.

[Summary of the five Sepal.Length segments: entropy e, proportion of records p, record count N, and the min and max of each bin]

We improve on the entropy of the parent in each child. In fact, segment 5 is perfectly pure, though weighted lightly due to the low proportion of records it contains. We can formalize this using the information gain formula noted above. Calling the IG_numeric function, we see that IG(Sepal.Length) = 0.64 using 5 breaks.

Note that the categorical and numeric functions are called as follows

IG_numeric(data, feature, target, bins=4)
IG_cat(data,feature,target)

Both functions return the IG value; however, you can change return(IG) to return(dd_data) to return the summary of the segments as a data.frame for investigation.
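For instance, assuming the functions are sourced and take the feature and target as column-name strings (as in the calls above), you could rank all four iris features in one shot:

features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
sort(sapply(features, function(f) IG_numeric(iris, f, "Species", bins = 5)),
     decreasing = TRUE)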

You could easily modify the code to:

– Optimize the number of splits for numeric attributes

– Iterate through a pre-determined index of attributes and rank their IG in a data.frame

I’ll add these features once I have the time to do so, but please feel free to let me know if either I’m out to lunch or if you have any questions/comments/proposed improvements.

Here’s the code: https://github.com/philjette/InformationGain
