Categorical Encoding, One Hot Encoding and why use it?

What is categorical encoding?

In data science, categorical values are encoded as enumerated (integer) values so that algorithms can process them numerically and relate them to the other features used for learning.

Name    Age  Zip Code  Salary
Jim     43   94404     45000
Jon     37   94407     80000
Merry   36   94404     65000
Tim     42   94403     75000
Hailey  29   94407     60000

In the example above, Zip Code is not a true numeric value; each number instead identifies a certain area. Treating Zip Code as a number would therefore not capture a meaningful relationship with other features such as Age or Salary, because arithmetic on zip codes (differences, averages) has no meaning. If we encode it as categorical, its relationship with the other features can be represented properly. So we treat the Zip Code feature as categorical (enum) when we feed the data to a machine learning algorithm.

String or character features should be set to categorical (enum) as well, so that their relationship with the other features can be learned. If we add another feature named “Sex” to the dataset above, as shown below, then treating “Sex” as categorical lets the algorithm relate it to the other features.

Name    Age  Zip Code  Sex  Salary
Jim     43   94404     M    45000
Jon     37   94407     M    80000
Merry   36   94404     F    65000
Tim     42   94403     M    75000
Hailey  29   94407     F    60000

After encoding the Zip Code and Sex features as enums, the dataset looks like this:

Name    Age  Zip Code  Sex  Salary
Jim     43   1         1    45000
Jon     37   2         1    80000
Merry   36   1         0    65000
Tim     42   3         1    75000
Hailey  29   2         0    60000
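
The encoding in the table above can be sketched in a few lines of plain Python. This is only an illustration, not a library implementation: the helper name build_codes is made up for this example, zip codes are numbered from 1 in order of first appearance (an assumption that happens to reproduce the table), and the M -> 1, F -> 0 mapping is written out by hand to match the table.

```python
# Rows from the example table: (name, age, zip_code, sex, salary)
rows = [
    ("Jim", 43, "94404", "M", 45000),
    ("Jon", 37, "94407", "M", 80000),
    ("Merry", 36, "94404", "F", 65000),
    ("Tim", 42, "94403", "M", 75000),
    ("Hailey", 29, "94407", "F", 60000),
]


def build_codes(values, start=0):
    """Assign an integer code to each distinct value, in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = start + len(codes)
    return codes


# Zip codes are numbered from 1 in order of appearance, as in the table above.
zip_codes = build_codes((row[2] for row in rows), start=1)

# The table uses M -> 1 and F -> 0, so that mapping is spelled out explicitly.
sex_codes = {"F": 0, "M": 1}

encoded = [
    (name, age, zip_codes[zip_code], sex_codes[sex], salary)
    for name, age, zip_code, sex, salary in rows
]

for row in encoded:
    print(row)
```

The integer assigned to each category is arbitrary; any consistent mapping would serve equally well as an enum.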

The Name feature does not help us relate Age, Zip Code and Sex to Salary, so we can drop it and use Age, Zip Code and Sex first to learn how they relate to Salary and then to predict Salary for new values. The input dataset then looks like this:

Age  Zip Code  Sex
43   1         1
37   2         1
36   1         0
42   3         1
29   2         0

As you can see above, all the data is now in numeric format and ready to be processed by an algorithm, which first learns the relationships in the data and then predicts.
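
As a minimal sketch of this step, the encoded rows can be split into an input matrix and a target column. The names X and y are a common convention and an assumption here, not something the example dataset prescribes.

```python
# Encoded rows from the example: (name, age, zip_code, sex, salary)
encoded = [
    ("Jim", 43, 1, 1, 45000),
    ("Jon", 37, 2, 1, 80000),
    ("Merry", 36, 1, 0, 65000),
    ("Tim", 42, 3, 1, 75000),
    ("Hailey", 29, 2, 0, 60000),
]

# Drop Name; keep Age, Zip Code and Sex as the input features.
X = [[age, zip_code, sex] for _name, age, zip_code, sex, _salary in encoded]

# Salary is the value we want to learn and later predict.
y = [salary for *_features, salary in encoded]

print(X)
print(y)
```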

What is One Hot Encoding?

In the example above, the values Male and Female are both part of a single feature named “Sex”, so their exposure to the other features is not very rich or deep. What if Male and Female were features in their own right, like Age or Zip Code? In that case, being Male or being Female could each be related directly to the rest of the data. Using one hot encoding for a feature gives each of its distinct values its own explicit representation, which can help learning.

One Hot Encoding does exactly that. It takes the distinct values of a feature and converts each one into a feature of its own to improve its relationship with the overall data. If we apply One Hot Encoding to the “Sex” feature, the dataset looks like this:

Age  Zip Code  M  F  Salary
43   1         1  0  45000
37   2         1  0  80000
36   1         0  1  65000
42   3         1  0  75000
29   2         0  1  60000

If we apply One Hot Encoding to Zip Code as well, the dataset looks like this:

Age  94404  94407  94403  M  F  Salary
43   1      0      0      1  0  45000
37   0      1      0      1  0  80000
36   1      0      0      0  1  65000
42   0      0      1      1  0  75000
29   0      1      0      0  1  60000
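
The table above can be reproduced with a small hand-written function. This is a sketch in plain Python, not how any particular library implements it; the function name one_hot is hypothetical, and the column order is fixed by hand so that it matches the table.

```python
def one_hot(values, categories):
    """Return one row of 0/1 indicators per value, one column per category."""
    return [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]


ages = [43, 37, 36, 42, 29]
zip_values = ["94404", "94407", "94404", "94403", "94407"]
sex_values = ["M", "M", "F", "M", "F"]
salaries = [45000, 80000, 65000, 75000, 60000]

# Column order is chosen explicitly to match the table above.
zip_categories = ["94404", "94407", "94403"]
sex_categories = ["M", "F"]

zip_hot = one_hot(zip_values, zip_categories)
sex_hot = one_hot(sex_values, sex_categories)

header = ["Age"] + zip_categories + sex_categories + ["Salary"]
table = [
    [age] + zip_row + sex_row + [salary]
    for age, zip_row, sex_row, salary in zip(ages, zip_hot, sex_hot, salaries)
]

print(header)
for row in table:
    print(row)
```

Each value sets exactly one of its feature's new columns to 1 and leaves the rest at 0, which is where the name "one hot" comes from.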

As you can see above, each value now has its own representation and a direct relationship with the other values. One hot encoding is also called the one-of-K scheme.

One Hot Encoding can use either a dense or a sparse implementation when it creates the features from the encoded values.
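
To illustrate the difference, here is a rough sketch of the two representations for the one-hot Zip Code columns. The coordinate-list layout and the helper name to_dense are assumptions chosen for clarity; real libraries use more compact sparse formats.

```python
# Dense form: every cell is stored (columns: 94404, 94407, 94403).
dense = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]

# Sparse form: store only the (row, column) positions of the non-zero cells.
sparse = [
    (r, c)
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
]


def to_dense(coords, n_rows, n_cols):
    """Rebuild the dense 0/1 matrix from the stored non-zero positions."""
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c in coords:
        out[r][c] = 1
    return out


dense_cells = sum(len(row) for row in dense)
print(sparse)
print(len(sparse), "stored entries instead of", dense_cells, "cells")
```

With only three zip codes the saving is small, but the number of stored entries in the sparse form stays at one per row no matter how many distinct categories the feature has.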

Why use it?

There are several good reasons to use One Hot Encoding in the data.

As you can see, One Hot Encoding introduces sparsity into the original dataset: most of the encoded entries are zero. When the data is stored in a sparse format, this is memory friendly, and it can improve learning time if the algorithm is designed to handle sparse data properly.

Other Resources:

Please visit the following link to see the One-Hot-Encoding implementation in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
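
As a brief, hedged example of that class, the snippet below applies OneHotEncoder with its default settings to the Zip Code and Sex columns from this post. It assumes scikit-learn is installed. Note that scikit-learn sorts the categories of each feature, so the column order differs from the hand-built tables above.

```python
from sklearn.preprocessing import OneHotEncoder

# Zip Code and Sex columns from the example dataset, kept as strings.
X = [
    ["94404", "M"],
    ["94407", "M"],
    ["94404", "F"],
    ["94403", "M"],
    ["94407", "F"],
]

encoder = OneHotEncoder()

# With default settings the result is a SciPy sparse matrix.
encoded = encoder.fit_transform(X)

# Convert to a dense NumPy array only for display.
dense = encoded.toarray()

# Categories found for each input column, in sorted order.
print(encoder.categories_)
print(dense)
```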

For in depth feature engineering please visit the following slides from HJ Van Veen:
