What is categorical encoding?
In data science, categorical values are encoded as enumerations so that algorithms can use them numerically when processing the data and learning relationships with the other features.
In the above example the Zip Code is not a numeric value; each number represents a certain area. Using Zip Code as a plain number will not create a meaningful relationship with other features such as Age or Salary, but if we encode it as categorical, its relationships with the other features are defined properly. So we treat the Zip Code feature as categorical (an enum) when we feed it to a machine learning algorithm.
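As a minimal sketch of this idea, the snippet below (with hypothetical values, since the article's actual table is not reproduced here) casts a Zip Code column to a categorical dtype in pandas so that downstream code treats it as labels rather than magnitudes:

```python
import pandas as pd

# Hypothetical dataset; the values are illustrative only.
df = pd.DataFrame({
    "Age": [25, 32, 41],
    "Zip Code": [94105, 10001, 60601],
    "Salary": [70000, 85000, 62000],
})

# As a plain integer, 94105 > 10001 implies an ordering with no real meaning.
# A categorical dtype marks these as labels, each mapped to an internal code.
df["Zip Code"] = df["Zip Code"].astype("category")

print(df["Zip Code"].dtype)                 # category
print(df["Zip Code"].cat.codes.tolist())    # integer codes, one per distinct area
```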
A string or character feature should be set to categorical or enum as well, so that the relationships among features generalize. In the above dataset, if we add another feature named “Sex” as below, then treating the “Sex” feature as categorical will improve its relationship with the other features.
After encoding the Zip Code and Sex features as enums, both features will look as below:
Since the Name feature will not help us relate Age, Zip Code, and Sex, we can drop it and stick with Age, Zip Code, and Sex to first understand Salary and then predict Salary for new values. The input dataset will look as below:
Above you can see that all the data is in numeric format and ready to be processed by an algorithm, which can find the relationships within it to first learn and then predict.
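The steps above can be sketched end to end as follows (again with hypothetical rows standing in for the article's table): drop the uninformative Name column, then enum-encode the categorical columns so every column is numeric:

```python
import pandas as pd

# Hypothetical rows; the article's actual table is not reproduced here.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 32, 41],
    "Zip Code": [94105, 10001, 94105],
    "Sex": ["Female", "Male", "Female"],
    "Salary": [70000, 85000, 62000],
})

# Name does not relate to the other features, so drop it.
df = df.drop(columns=["Name"])

# Replace each categorical column with its integer enum codes.
for col in ["Zip Code", "Sex"]:
    df[col] = df[col].astype("category").cat.codes

print(df.dtypes)  # every remaining column is numeric
```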
What is One Hot Encoding?
In the above example you can see that the values Male and Female are part of the feature named “Sex”, so their exposure to the other features is not very rich or deep. What if Male and Female were features themselves, like Age or Zip Code? In that case the relationship of being Male or Female with the rest of the dataset would be much stronger. Using one hot encoding for a specific feature provides a proper representation of the distinct values of that feature, which helps improve learning.
One Hot Encoding does exactly that. It takes each distinct value of the feature and converts it into a feature of its own, improving its relationship with the overall data. If we apply One Hot Encoding to the “Sex” feature, the dataset will look as below:
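A quick way to see this transformation is `pd.get_dummies`, which replaces the “Sex” column with one new column per distinct value (the rows here are illustrative):

```python
import pandas as pd

# Hypothetical two-row sample of the dataset.
df = pd.DataFrame({"Age": [25, 32], "Sex": ["Female", "Male"]})

# get_dummies turns each distinct value of "Sex" into its own 0/1 column.
encoded = pd.get_dummies(df, columns=["Sex"])

print(encoded.columns.tolist())  # ['Age', 'Sex_Female', 'Sex_Male']
```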
If we decide to apply One Hot Encoding to Zip Code as well, then our dataset will look as below:
Above you can see that each value has its own representation and a deeper relationship with the other values. One hot encoding is also called the one-of-K scheme.
One Hot Encoding can use either a dense or a sparse implementation when it creates features from the encoded values.
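The dense-versus-sparse distinction can be sketched with NumPy and SciPy: a one-hot block is mostly zeros, so a sparse matrix that stores only the non-zero entries is far smaller (the category counts below are made up for illustration):

```python
import numpy as np
from scipy import sparse

# Dense one-hot encoding: 4 rows, 5 categories, mostly zeros.
dense = np.eye(5, dtype=np.int8)[[0, 3, 3, 1]]

# Sparse (CSR) representation stores only the non-zero entries.
sp = sparse.csr_matrix(dense)

print(dense.nbytes)    # bytes for the full 4x5 grid
print(sp.data.nbytes)  # bytes for just the 4 stored ones
```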
Why Use it?
There are several good reasons to use One Hot Encoding on your data.
As you can see, One Hot Encoding introduces sparsity into the original dataset, which is more memory friendly and improves learning time if the algorithm is designed to handle sparse data properly.
Please visit the following link to see the One-Hot-Encoding implementation in scikit-learn:
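As a minimal sketch of that implementation (the linked documentation covers the full API), scikit-learn's `OneHotEncoder` fits on a column of labels and emits a sparse matrix by default:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single hypothetical "Sex" column, shaped as (n_samples, 1).
X = np.array([["Male"], ["Female"], ["Female"]])

enc = OneHotEncoder()            # output is a sparse matrix by default
onehot = enc.fit_transform(X)

print(enc.categories_)           # categories sorted: ['Female', 'Male']
print(onehot.toarray())          # densify only for display
```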
For in depth feature engineering please visit the following slides from HJ Van Veen: