I’m a data geek working in Technology Strategy at TELUS Communications in Toronto, ON. I’m happiest when problem solving and am generally attracted to things I don’t quite understand. My interests include music, data science, and trail running (the latter to the point of obsession).

Twitter: https://twitter.com/philjette | LinkedIn: https://ca.linkedin.com/in/philjette

GitHub: https://github.com/philjette/ | RPubs: https://rpubs.com/philjet



Feature Selection using Information Gain in R

When considering a predictive model, you might be interested in knowing which features of your data provide the most information about the target variable of interest. For example, suppose we’d like to predict the species of Iris based on sepal length and width as well as petal length and width (using the iris dataset in R).


Which of these 4 features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species, and could only ask for the value of 1 feature, which feature would give you the greatest likelihood of winning your bet?

While there are many R packages out there for attribute selection, I’ve coded a few basic functions for my own usage for selecting attributes based on Information Gain (and hence on Shannon Entropy).

For starters, let’s define what we mean by Entropy and Information Gain.

Shannon Entropy
H(p_1 \dots p_n) = \sum_{i=1}^{n} p_i\log_2 p_i

Where p_i is the probability of value i and n is the number of possible values. For example in the iris dataset, we have 3 possible values for Species (Setosa, Versicolor, Virginica), each representing \frac{1}{3} of the data. Therefore

\sum_{i=1}^{3} \frac{1}{3}_i \log_2 \frac{1}{3}_i = 1.59

Information Gain
IG = H_p - \sum_{i=1}^{n} p_{ci}H_{ci}

 Where H_p is the entropy of the parent (the complete, unsegmented dataset), n  is the number of values of our target variable (and the number of child segments), p_{ci} is the probability that an observation is in child i (the weighting), and H_{ci} is the entropy of child (segment) i.

Continuing with our iris example, we could ask the following: “Can we improve (reduce) the entropy of the parent dataset by segmenting on Sepal Length?”

In this case, Sepal Length is numeric. You’ll notice the code provides functions for both numeric and categorical variables. For categorical variables, we simply segment on each possible value. However in the numeric case, we will bin the data according to the desired number of breaks (which is set to 4 by default).

If we segment using 5 breaks, we get 5 children. Note e is the computed entropy for this subset, p is the proportion of records, N is the number of records, and min and max are… the min and max.


We improve on the entropy of the parent in each child. In fact, segment 5 is perfectly pure, though weighted lightly due to the low proportion of records it contains. We can formalize this using the information gain formula noted above. Calling the IG_numeric function, we see the that IG(Sepal.Length) = .64 using 5 breaks.

Note that the categorical and numeric functions are called as follows

IG_numeric(data, feature, target, bins=4)

Both functions return the IG value, however you can change return(IG) to return(dd_data) to return the summary of the segments as a data.frame for investigation.

You could easily modify the code to:

– Optimize the number of splits for numeric attributes

– Iterate through a pre-determined index of attributes and rank their IG in a data.frame

I’ll add these features once I have the time to do so, but please feel free to let me know if either I’m out to lunch or if you have any questions\comments\proposed improvements.

Here’s the code: https://github.com/philjette/InformationGain

Feature Selection using Information Gain in R

Aviation Accident Data in R

A couple of months ago, following the devastating crash of a Germanwings plane in the French Alps killing all those on board, I wrote some code to scrape historical data on aviation accidents for further analysis in R. The code is written in pure (or nearly pure) old-school R, and gathers data from planecrashinfo.com.

The code was then expanded upon using the Hadleyverse by hrbrmstr at R-bloggers. Check it out here.

The planecrashinfo.com accident database includes:

  • All civil and commercial aviation accidents of scheduled and non-scheduled passenger airliners worldwide, which resulted in a fatality (including all U.S. Part 121 and Part 135 fatal accidents)
  • All cargo, positioning, ferry and test flight fatal accidents.
  • All military transport accidents with 10 or more fatalities.
  • All commercial and military helicopter accidents with greater than 10 fatalities.
  • All civil and military airship accidents involving fatalities.
  • Aviation accidents involving the death of famous people.
  • Aviation accidents or incidents of noteworthy interest.


Here are a couple quick plots from the data

Average proportion of fatalities. We see very little improvement since the early 50s.


Crash fatalities vs. passengers from 2000 to 2015


Aviation Accident Data in R