Feature Selection using Information Gain in R

When considering a predictive model, you might be interested in knowing which features of your data provide the most information about the target variable of interest. For example, suppose we’d like to predict the species of Iris based on sepal length and width as well as petal length and width (using the iris dataset in R).

(The iris data frame: 150 observations of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.)

Which of these 4 features provides the “purest” segmentation with respect to the target? Or put differently, if you were to place a bet on the correct species, and could only ask for the value of 1 feature, which feature would give you the greatest likelihood of winning your bet?

While there are many R packages out there for attribute selection, I’ve coded a few basic functions for my own use that select attributes based on Information Gain (and hence on Shannon Entropy).

For starters, let’s define what we mean by Entropy and Information Gain.

Shannon Entropy
H(p_1 \dots p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i

Where p_i is the probability of value i and n is the number of possible values. For example, in the iris dataset we have 3 possible values for Species (Setosa, Versicolor, Virginica), each representing \frac{1}{3} of the data. Therefore

-\sum_{i=1}^{3} \frac{1}{3} \log_2 \frac{1}{3} \approx 1.59
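In R, this is only a couple of lines. Here’s a quick sketch of the calculation (the entropy() helper is just for illustration and isn’t necessarily how the repo’s code is organized):

# Shannon entropy (in bits) of a discrete vector
# (illustrative helper; not necessarily how the repo's code is written)
entropy <- function(x) {
  p <- table(x) / length(x)   # observed probability of each value
  p <- p[p > 0]               # drop empty levels so 0 * log2(0) doesn't give NaN
  -sum(p * log2(p))           # H = -sum(p_i * log2(p_i))
}

entropy(iris$Species)
# [1] 1.584963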

Information Gain
IG = H_p - \sum_{i=1}^{n} p_{ci}H_{ci}

Where H_p is the entropy of the parent (the complete, unsegmented dataset), n is the number of child segments produced by the split (one per value of the splitting feature, or one per bin for a numeric feature), p_{ci} is the probability that an observation falls in child i (the weighting), and H_{ci} is the entropy of child (segment) i.
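To make the formula concrete, here’s a rough sketch of the gain calculation for a categorical split, reusing the entropy() helper above (again, illustrative rather than the repo’s IG_cat):

# Information gain of splitting `target` on a categorical `feature`
# (illustrative sketch of the formula above, not the repo's IG_cat)
info_gain <- function(feature, target) {
  H_parent <- entropy(target)
  children <- split(target, feature)          # one child segment per feature value
  weights  <- sapply(children, length) / length(target)
  H_parent - sum(weights * sapply(children, entropy))
}

# Sanity check: splitting on the target itself recovers the full parent entropy
info_gain(iris$Species, iris$Species)
# [1] 1.584963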

Continuing with our iris example, we could ask the following: “Can we improve (reduce) the entropy of the parent dataset by segmenting on Sepal Length?”

In this case, Sepal Length is numeric. You’ll notice the code provides functions for both numeric and categorical variables. For categorical variables, we simply segment on each possible value. However, in the numeric case, we bin the data according to the desired number of breaks (which is set to 4 by default).
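One way to do the binning (assuming equal-width intervals, which is what cut() produces when given a single number of breaks; the repo’s code may bin differently) is:

# Bin Sepal.Length into 5 equal-width intervals (assumed binning scheme)
bins <- cut(iris$Sepal.Length, breaks = 5)
table(bins)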

If we segment using 5 breaks, we get 5 children. In the summary below, e is the computed entropy of each subset, p is the proportion of records, N is the number of records, and min and max are… the min and max.

(Sepal_IG: the segment summary, one row per child with columns e, p, N, min, and max.)

We improve on the entropy of the parent in each child. In fact, segment 5 is perfectly pure, though it is weighted lightly due to the low proportion of records it contains. We can formalize this using the information gain formula noted above. Calling the IG_numeric function, we see that IG(Sepal.Length) = 0.64 using 5 breaks.
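For reference, here’s a hedged reconstruction of that segment summary and the gain calculation, using the helpers sketched above with equal-width bins. Because the exact figures depend on how IG_numeric bins the data, the numbers may not match the 0.64 above exactly:

# Per-segment summary for Sepal.Length split into 5 bins:
# e = entropy, p = proportion of records, N = record count, min/max of the bin
segments <- split(iris, cut(iris$Sepal.Length, breaks = 5))
Sepal_IG <- data.frame(
  e   = sapply(segments, function(s) entropy(s$Species)),
  p   = sapply(segments, nrow) / nrow(iris),
  N   = sapply(segments, nrow),
  min = sapply(segments, function(s) min(s$Sepal.Length)),
  max = sapply(segments, function(s) max(s$Sepal.Length))
)
Sepal_IG

# Information gain of the split: parent entropy minus weighted child entropies
entropy(iris$Species) - sum(Sepal_IG$p * Sepal_IG$e)
# equivalently: info_gain(cut(iris$Sepal.Length, breaks = 5), iris$Species)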

Note that the categorical and numeric functions are called as follows:

IG_numeric(data, feature, target, bins=4)
IG_cat(data, feature, target)

Both functions return the IG value; however, you can change return(IG) to return(dd_data) to return the summary of the segments as a data.frame for further investigation.

You could easily modify the code to:

– Optimize the number of splits for numeric attributes

– Iterate through a pre-determined index of attributes and rank their IG in a data.frame (sketched below)
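For the second idea, a rough sketch (using the generic helpers from above rather than the repo’s functions) could look like this:

# Rank every predictor in iris by its information gain with respect to Species,
# binning numeric features into 5 equal-width intervals as above
features <- setdiff(names(iris), "Species")
ig <- sapply(features, function(f) {
  x <- iris[[f]]
  if (is.numeric(x)) x <- cut(x, breaks = 5)
  info_gain(x, iris$Species)
})
ranking <- data.frame(feature = features, IG = ig)
ranking[order(-ranking$IG), ]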

I’ll add these features once I have the time to do so, but please feel free to let me know if either I’m out to lunch or if you have any questions/comments/proposed improvements.

Here’s the code: https://github.com/philjette/InformationGain
