LabelBinarizer vs get_dummies
LabelBinarizer and get_dummies are two tools, from Scikit-Learn and Pandas respectively, that convert categorical data into numerical form so that it can be used to train and test models. Like everything, each comes with its own pros and cons: there are situations where one is appropriate to use, and situations where using it limits how effectively you can build a model.
In this post we are going to explore how both of them work with data, and by the end we will be able to see whether one of them is the better choice for us based on its capabilities.
To get started we will need to import some Python libraries and then create a DataFrame to try these tools on.
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
(get_dummies is a Pandas function and is called as: pd.get_dummies)
(LabelBinarizer is a Scikit-Learn class and can be imported separately, as we have done above, without importing the entire Scikit-Learn library)
Let's get started with the pd.get_dummies function.
The purpose of this function is to take a column of categorical values and convert it into new columns of numerical values (0 and 1). This has to be done before modelling because most algorithms cannot take in categorical values directly.
Let's try this out.
First off we need to create a DataFrame:
(I am going to create a DataFrame consisting of famous male and female athletes)
data = pd.DataFrame({
'name': ['James Harden', 'Serena Williams', 'Lionel Messi', 'Mike Trout', 'Tom Brady', 'Joe Thornton', 'Sania Mirza'],
'sex': ['male', 'female', 'male', 'male', 'male', 'male', 'female'],
'sport': ['basketball', 'tennis', 'soccer', 'baseball', 'football', 'hockey', 'tennis'],
'country': ['USA', 'USA', 'Argentina', 'USA', 'USA', 'Canda', 'India'],
'won_championship': ['no', 'yes', 'yes', 'no', 'yes', 'no', 'yes']
})
data
| | name | sex | sport | country | won_championship |
---|---|---|---|---|---|
0 | James Harden | male | basketball | USA | no |
1 | Serena Williams | female | tennis | USA | yes |
2 | Lionel Messi | male | soccer | Argentina | yes |
3 | Mike Trout | male | baseball | USA | no |
4 | Tom Brady | male | football | USA | yes |
5 | Joe Thornton | male | hockey | Canda | no |
6 | Sania Mirza | female | tennis | India | yes |
Here you can see that we have created a table of data on some famous male and female athletes from around the world. We have columns consisting of:
-Player name
-Sex
-The sport they play
-Nationality
-Whether they have won a championship
Now let's take a closer look at the columns in the DataFrame. Every column I have entered holds categorical values. If we wanted to analyze this data and compare the variables, this would be difficult because, as mentioned already, we need numerical values such as 1's and 0's to do that.
Taking a closer look we can see that two of our columns ('sex' and 'won_championship') are populated with just two values each:
'sex' = male or female and 'won_championship' = yes or no.
Looking at this we can see right away that these categorical values can be quickly converted into numbers. Let's go ahead and do that.
data['sex'] = data['sex'].replace(['male', 'female'], [1,0])
(here we are using the .replace() method to replace the categorical values 'male' and 'female' with 1 and 0 respectively)
data['won_championship'] = data['won_championship'].replace(['yes', 'no'], [1,0])
(once again we are using .replace(), this time to replace 'yes' and 'no' in the 'won_championship' column with 1 and 0)
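As a variation on the same step, here is a minimal sketch that uses .map() with a dictionary instead of .replace(). The small DataFrame and the names sex_codes and won_codes are made up for this illustration. One reason you might prefer it: any label that is missing from the dictionary becomes NaN, so a typo in the data is easy to spot.

```python
import pandas as pd

# Small stand-in for the 'sex' and 'won_championship' columns above.
df = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'won_championship': ['no', 'yes', 'yes'],
})

# An explicit dictionary keeps each label next to its code.
sex_codes = {'male': 1, 'female': 0}
won_codes = {'yes': 1, 'no': 0}

df['sex'] = df['sex'].map(sex_codes)
df['won_championship'] = df['won_championship'].map(won_codes)

# Any label missing from a dictionary would be counted as NaN here.
print(df.isna().sum().sum())
print(df)
```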
data
| | name | sex | sport | country | won_championship |
---|---|---|---|---|---|
0 | James Harden | 1 | basketball | USA | 0 |
1 | Serena Williams | 0 | tennis | USA | 1 |
2 | Lionel Messi | 1 | soccer | Argentina | 1 |
3 | Mike Trout | 1 | baseball | USA | 0 |
4 | Tom Brady | 1 | football | USA | 1 |
5 | Joe Thornton | 1 | hockey | Canda | 0 |
6 | Sania Mirza | 0 | tennis | India | 1 |
In our new table you can now see that the 'sex' and 'won_championship' columns have had their categorical values replaced with numerical values of 1 and 0. This is easy to do when there are just two categorical values within a column. But what are we supposed to do when there are more than two? This is when the get_dummies function can come in handy.
Let's take a look at the 'country' column. Looking at it quickly we can see that there are 4 different categorical values:
USA, Argentina, Canada and India.
You might think that we could replace these categorical values the same way we did with the previous two columns ('sex' and 'won_championship'), but then we would have to use the numbers 1-4 to represent the categories. This would not work well because the algorithm would read those numbers as quantities: it would treat a country coded 4 as 'more' than a country coded 1, an ordering that does not exist in our data, and the column would no longer be of much use to us for modelling purposes.
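To see the problem with a 1-4 code in action, here is a small sketch. The codes dictionary is hypothetical (assigned alphabetically just for this example); the country values are the ones from our DataFrame.

```python
import pandas as pd

countries = pd.Series(['USA', 'USA', 'Argentina', 'USA', 'USA', 'Canda', 'India'])

# Hypothetical integer codes, assigned alphabetically.
codes = {'Argentina': 1, 'Canda': 2, 'India': 3, 'USA': 4}
encoded = countries.map(codes)

# Arithmetic on the codes runs without complaint, but the results mean
# nothing: there is no "average country", and USA is not four times Argentina.
print(encoded.mean())
print(codes['USA'] / codes['Argentina'])
```

A model that sees this single column has no way of knowing the numbers are just labels, which is why we split the column apart instead.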
Instead of converting the categories into the numbers 1-4 and keeping them in the same column, we are going to use the pd.get_dummies function to break the column apart into one column per category, filled with 1's and 0's, so that the algorithm can make use of this piece of data.
pd.get_dummies(data['country'])
| | Argentina | Canda | India | USA |
---|---|---|---|---|
0 | 0 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 |
2 | 1 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 1 |
4 | 0 | 0 | 0 | 1 |
5 | 0 | 1 | 0 | 0 |
6 | 0 | 0 | 1 | 0 |
Using the pd.get_dummies function has taken the 'country' column and broken it up into 4 separate columns (in alphabetical order) for the 4 different countries that were represented in our DataFrame.
Let's take a minute to now understand these new columns.
If you scroll up and take a look at the original DataFrame you can see that the basketball player James Harden is at Index 0 (the very first row in our table).
Now taking a look at our new dummy columns for 'country', you can see that for Index 0 there is a 0 under Argentina, Canada and India and a 1 under USA.
This is because our original DataFrame says that James Harden is from the USA. When get_dummies creates the dummy columns, it puts a 1 in the 'USA' column for his row and a 0 in every other column, because he is not from any of those countries. The same concept applies to every other athlete in the DataFrame.
To simplify:
the number 1 represents that the player is from that specific country and a 0 specifies 'not from this country'
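Here is a quick sketch that checks this rule on the 'country' values from our DataFrame. One assumption to flag: I pass dtype=int so that the output is 0/1, because newer versions of Pandas show True/False by default.

```python
import pandas as pd

country = pd.Series(['USA', 'USA', 'Argentina', 'USA', 'USA', 'Canda', 'India'],
                    name='country')

# dtype=int keeps the output as 0/1 (newer Pandas versions default to True/False).
dummies = pd.get_dummies(country, dtype=int)

# Each athlete has exactly one country, so every row adds up to 1.
print(dummies.sum(axis=1).tolist())

# The column that holds the 1 is the athlete's original country.
recovered = dummies.idxmax(axis=1)
print((recovered == country).all())
```

Nothing is lost in the conversion: the original category can always be read back from whichever column holds the 1.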
Now we can take the newly created dummy columns and merge them into our original DataFrame using the pd.concat function so that we can get a better picture.
dummy = pd.get_dummies(data['country'])
data = pd.concat([data, dummy], axis=1)
data
| | name | sex | sport | country | won_championship | Argentina | Canda | India | USA |
---|---|---|---|---|---|---|---|---|---|
0 | James Harden | 1 | basketball | USA | 0 | 0 | 0 | 0 | 1 |
1 | Serena Williams | 0 | tennis | USA | 1 | 0 | 0 | 0 | 1 |
2 | Lionel Messi | 1 | soccer | Argentina | 1 | 1 | 0 | 0 | 0 |
3 | Mike Trout | 1 | baseball | USA | 0 | 0 | 0 | 0 | 1 |
4 | Tom Brady | 1 | football | USA | 1 | 0 | 0 | 0 | 1 |
5 | Joe Thornton | 1 | hockey | Canda | 0 | 0 | 1 | 0 | 0 |
6 | Sania Mirza | 0 | tennis | India | 1 | 0 | 0 | 1 | 0 |
Because we now have separate columns representing where each athlete is from, we no longer need the 'country' column, so we can go ahead and drop it from the DataFrame using the .drop() method.
data = data.drop(columns='country')
data
| | name | sex | sport | won_championship | Argentina | Canda | India | USA |
---|---|---|---|---|---|---|---|---|
0 | James Harden | 1 | basketball | 0 | 0 | 0 | 0 | 1 |
1 | Serena Williams | 0 | tennis | 1 | 0 | 0 | 0 | 1 |
2 | Lionel Messi | 1 | soccer | 1 | 1 | 0 | 0 | 0 |
3 | Mike Trout | 1 | baseball | 0 | 0 | 0 | 0 | 1 |
4 | Tom Brady | 1 | football | 1 | 0 | 0 | 0 | 1 |
5 | Joe Thornton | 1 | hockey | 0 | 0 | 1 | 0 | 0 |
6 | Sania Mirza | 0 | tennis | 1 | 0 | 0 | 1 | 0 |
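The get_dummies, concat and drop steps above can also be done in a single call by passing the whole DataFrame together with the columns argument. This is only a sketch on a cut-down DataFrame, and note one difference: the new columns are prefixed with the original column name (country_USA rather than USA).

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['James Harden', 'Lionel Messi', 'Sania Mirza'],
    'country': ['USA', 'Argentina', 'India'],
})

# columns=['country'] encodes that column and removes the original in one call.
# dtype=int keeps the output as 0/1 on newer Pandas versions.
encoded = pd.get_dummies(df, columns=['country'], dtype=int)
print(encoded.columns.tolist())
```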
There you go! We have now converted the majority of our columns into numerical values of 1's and 0's.
pd.get_dummies is a great function for this, but now the real question is:
Is this suitable for creating models and using the data that we have to make a prediction?
Let's find out!
What we are now going to attempt is to build a model and predict, based on all our data:
what are the chances of an athlete winning a championship in his/her respective sport?
To do this we will have to define our X and y variables and then train-test split our data.
(X represents all the variables (columns) we will use to make a prediction and y represents what we are predicting; in this scenario that is whether an athlete won a championship)
The first step is to decide which pieces of information we want for our X and y variables.
Because we want to predict 'won_championship', we are going to drop that column from our DataFrame and assign what is left to X, and assign the 'won_championship' column to y (our prediction target).
Because every row is already identified by an Index number, we can also go ahead and drop the 'name' column. A name is different for every athlete, so it gives the model nothing to learn from.
X = data.drop(columns = ['name', 'won_championship'])
y = data['won_championship']
This is how our X and y variables look after the variable assignment:
X
| | sex | sport | Argentina | Canda | India | USA |
---|---|---|---|---|---|---|
0 | 1 | basketball | 0 | 0 | 0 | 1 |
1 | 0 | tennis | 0 | 0 | 0 | 1 |
2 | 1 | soccer | 1 | 0 | 0 | 0 |
3 | 1 | baseball | 0 | 0 | 0 | 1 |
4 | 1 | football | 0 | 0 | 0 | 1 |
5 | 1 | hockey | 0 | 1 | 0 | 0 |
6 | 0 | tennis | 0 | 0 | 1 | 0 |
y
0 0
1 1
2 1
3 0
4 1
5 0
6 1
Name: won_championship, dtype: int64
Now we need to import train_test_split so we can split our data and begin to model it
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
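With only seven athletes it helps to know how many rows end up on each side of the split. The sketch below uses placeholder data (X_demo and y_demo are made-up names) and the same default settings as the call above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Seven placeholder rows standing in for the seven athletes.
X_demo = pd.DataFrame({'feature': range(7)})
y_demo = pd.Series([0, 1, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# With the default test_size of 0.25, 7 rows split into 5 for training
# and 2 for testing (the test count is rounded up).
print(len(X_tr), len(X_te))
```

So our model will learn from five athletes and be scored on just two.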
If you take a look at the columns that we assigned to our X variable you will notice that one column ('sport') is still housing categorical values. Before we can begin modelling our data we will need to create dummy variables for that column using the pd.get_dummies function.
We will go ahead and create dummy variables now for our X_train and X_test variables:
X_train1 = pd.get_dummies(X_train, columns=['sport'])
X_train1
| | sex | Argentina | Canda | India | USA | sport_baseball | sport_football | sport_hockey | sport_soccer | sport_tennis |
---|---|---|---|---|---|---|---|---|---|---|
5 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
X_test1 = pd.get_dummies(X_test, columns=['sport'])
X_test1
| | sex | Argentina | Canda | India | USA | sport_basketball | sport_tennis |
---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
And there we have it:
our training and test DataFrames now consist only of numerical values of 0's and 1's thanks to the get_dummies function.
Let's now go ahead and model our data and obtain a score.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train1, y_train)
model.score(X_test1, y_test)
C:\Users\v_sha\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-f729cc9cba07> in <module>
4
5 model.fit(X_train1, y_train)
----> 6 model.score(X_test1, y_test)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\base.py in score(self, X, y, sample_weight)
288 """
289 from .metrics import accuracy_score
--> 290 return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
291
292
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
279 Predicted class label per sample.
280 """
--> 281 scores = self.decision_function(X)
282 if len(scores.shape) == 1:
283 indices = (scores > 0).astype(np.int)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\base.py in decision_function(self, X)
260 if X.shape[1] != n_features:
261 raise ValueError("X has %d features per sample; expecting %d"
--> 262 % (X.shape[1], n_features))
263
264 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 7 features per sample; expecting 10
Oops...
we got ourselves an error...
Something went wrong when we tried to model our data. Let's take a look and see if we can figure out what has happened:
ValueError: X has 7 features per sample; expecting 10
This is the error that was returned when we tried to score our model. The reason is that we called get_dummies separately on X_train and X_test, and get_dummies only creates columns for the categories it finds in the data it is given. The training data contained five sports, so X_train1 ended up with 10 columns. The test data contained only two sports (one of which, basketball, never appeared in the training data), so X_test1 ended up with 7. The model was fitted on 10 features, which is why we are getting an error telling us that:
X has 7 features, but 10 were expected. This is where get_dummies is no longer helpful.
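To see the mismatch directly, here is a small sketch using just the 'sport' values that landed in each split. It also shows one manual patch, .reindex(), that forces the test columns to match the training columns. This is a workaround I am adding for illustration, not something get_dummies does for you, and the variable names are made up for the example.

```python
import pandas as pd

# Stand-ins for the 'sport' values that landed in each split above.
train = pd.DataFrame({'sport': ['hockey', 'soccer', 'football', 'baseball', 'tennis']})
test = pd.DataFrame({'sport': ['basketball', 'tennis']})

train_d = pd.get_dummies(train, columns=['sport'], dtype=int)
test_d = pd.get_dummies(test, columns=['sport'], dtype=int)

# get_dummies only creates columns for the values it sees in each call.
print(sorted(set(train_d.columns) - set(test_d.columns)))  # missing from test
print(sorted(set(test_d.columns) - set(train_d.columns)))  # unknown to training

# Manual patch: force the test columns to match the training columns.
# Missing columns are added as all zeros; sport_basketball is dropped.
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned)
```

The patch works, but you have to remember to carry the training columns around yourself every time new data comes in.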
We could have converted all our columns into dummy variables before doing the train-test split, but that defeats the purpose of holding out test data. Models need to be able to make predictions on data they have not seen before...that's the whole purpose of them. Doing too much cleaning or data altering before the train-test split lets the test data influence how the features are built, and it does not help once new data arrives with categories we have not encoded.
This is as far as get_dummies can take us...
But thanks to Scikit-Learn we have a tool that is basically get_dummies on steroids.
Allow me to introduce you to:
LabelBinarizer()
As I already mentioned, LabelBinarizer is basically pd.get_dummies on steroids.
By this I mean it has the same capabilities, but it also remembers the categories it learned from the training data and produces exactly those columns every time it transforms new data. That lets us model our data and return a score for our prediction, something get_dummies wasn't able to do for us because each call starts from scratch with whatever categories it is given.
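Here is a minimal sketch of that memory on a toy list of sports (lb_demo is a made-up name so it does not clash with anything in the notebook).

```python
from sklearn.preprocessing import LabelBinarizer

lb_demo = LabelBinarizer()

# fit() learns the categories and stores them, sorted, in classes_.
lb_demo.fit(['hockey', 'soccer', 'tennis'])
print(lb_demo.classes_)

# transform() always produces one column per learned class, in that order.
# 'basketball' was never seen during fit, so its row is all zeros.
encoded = lb_demo.transform(['basketball', 'tennis'])
print(encoded)
```

The number of columns is fixed at fit time, no matter what shows up later.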
Let's now try and create our model using LabelBinarizer.
(we already imported the class at the beginning of the notebook)
First we will need to once again assign the correct variables to our X and y (X = features, y = what we are predicting)
X = data.drop(columns = ['name', 'won_championship'])
y = data['won_championship']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
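Now we bring in LabelBinarizer
lb = LabelBinarizer()
# learn the sports present in the training data
lb.fit(X_train['sport'])
# the learned categories, in sorted order
lb.classes_
# one row of 0's and 1's per athlete, one column per learned sport
lb.transform(X_train['sport'])
# attach the encoded columns to X_train, keeping the original row index
X_train = X_train.join(pd.DataFrame(lb.fit_transform(X_train['sport']),
                                    columns=lb.classes_,
                                    index=X_train.index))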
X_train
| | sex | sport | Argentina | Canda | India | USA | baseball | football | hockey | soccer | tennis |
---|---|---|---|---|---|---|---|---|---|---|---|
5 | 1 | hockey | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | soccer | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | football | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
3 | 1 | baseball | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | tennis | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
So far, using LabelBinarizer, we have gone ahead and recreated what pd.get_dummies did for us by taking the 'sport' column and breaking it up into new dummy columns representing each sport separately.
The big difference between the two is that LabelBinarizer keeps the list of sports it learned from the training data (in lb.classes_) and will reuse it whenever we transform more data.
We now need to do this same step for our testing data (X_test), this time calling only lb.transform so that the columns learned from the training data are reused:
X_test = X_test.join(pd.DataFrame(lb.transform(X_test['sport']),
                                  columns=lb.classes_,
                                  index=X_test.index))
X_test
| | sex | sport | Argentina | Canda | India | USA | baseball | football | hockey | soccer | tennis |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | basketball | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | tennis | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
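We have now also broken up the 'sport' column for our test data, using the same sport columns that were learned from the training data. Notice that James Harden's row (Index 0) has a 0 in every sport column: basketball never appeared in the training data, so LabelBinarizer has no column for it.
Now that we have blown up the original 'sport' column into individual columns, we can go ahead and drop the 'sport' column from our datasets.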
X_train = X_train.drop(columns = 'sport')
X_test = X_test.drop(columns = 'sport')
X_train
| | sex | Argentina | Canda | India | USA | baseball | football | hockey | soccer | tennis |
---|---|---|---|---|---|---|---|---|---|---|
5 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
X_test
| | sex | Argentina | Canda | India | USA | baseball | football | hockey | soccer | tennis |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
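The fit, transform, join and drop steps above can be bundled into one small helper. This is only a sketch: binarize_column is a hypothetical name, not part of Scikit-Learn or Pandas, and the toy train and test DataFrames are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer


def binarize_column(train, test, column):
    """Hypothetical helper: encode `column` using categories learned from `train` only."""
    lb = LabelBinarizer()
    lb.fit(train[column])

    def encode(frame):
        dummies = pd.DataFrame(lb.transform(frame[column]),
                               columns=lb.classes_,
                               index=frame.index)
        # Swap the original column for its 0/1 columns.
        return frame.drop(columns=column).join(dummies)

    return encode(train), encode(test)


train = pd.DataFrame({'sex': [1, 1, 0], 'sport': ['hockey', 'soccer', 'tennis']},
                     index=[5, 2, 6])
test = pd.DataFrame({'sex': [1, 0], 'sport': ['basketball', 'tennis']},
                    index=[0, 1])

train_enc, test_enc = binarize_column(train, test, 'sport')
print(list(train_enc.columns) == list(test_enc.columns))
print(test_enc)
```

One caveat: the sketch assumes the training column has three or more distinct values. With exactly two, LabelBinarizer returns a single column, so columns=lb.classes_ would no longer match.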
Alright, now that we have transformed the 'sport' column using LabelBinarizer, let's go ahead and see if we can get a score for what we were trying to predict:
What are the chances of an athlete winning in his/her sport?
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
C:\Users\v_sha\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
FutureWarning)
0.5
IT WORKED!!!
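According to this score, our model predicted correctly for 50% of the athletes in our test data. The score is the model's accuracy, so with only two test athletes it got one right and one wrong.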
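To make the meaning of that 0.5 concrete: for a classifier like LogisticRegression, model.score() returns accuracy, the share of test rows predicted correctly. The sketch below uses the true outcomes of our two test athletes, but the predictions are hypothetical, picked only to show the arithmetic.

```python
from sklearn.metrics import accuracy_score

# True outcomes for the two test athletes (Index 0 and 1 in y above).
y_true = [0, 1]

# Hypothetical predictions, chosen only to illustrate the arithmetic;
# the notebook does not show what the model actually predicted.
y_pred = [1, 1]

# Accuracy = correct predictions / total predictions = 1 / 2.
print(accuracy_score(y_true, y_pred))
```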
(this data was created to show you how get_dummies and LabelBinarizer work, and this is by no means an accurate measure of how likely an athlete is to win a championship in his/her respective sport)
Well there you go...
LabelBinarizer was able to accomplish what get_dummies could not do for us.
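One related tool worth knowing about, which this post has not used: Scikit-Learn documents LabelBinarizer mainly as a tool for target labels, and its counterpart for feature columns is OneHotEncoder. The sketch below is my own illustration on toy data, not part of the walkthrough above, and it shows the same fit-on-train, transform-on-test pattern.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'sport': ['hockey', 'soccer', 'tennis']})
test = pd.DataFrame({'sport': ['basketball', 'tennis']})

# handle_unknown='ignore' encodes categories not seen during fit as all zeros
# instead of raising an error.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['sport']])

# The default output is a sparse matrix, so convert it for display.
test_encoded = encoder.transform(test[['sport']]).toarray()
print(encoder.categories_[0])
print(test_encoded)
```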