Categorical Encoding, One-Hot Encoding, and Why Use It?

What is categorical encoding?

In data science, categorical values are encoded as enumerators so that algorithms can use them numerically when processing the data and learning relationships with the other features.

Name     Age   Zip Code   Salary
Jim      43    94404      45000
Jon      37    94407      80000
Merry    36    94404      65000
Tim      42    94403      75000
Hailey   29    94407      60000

In the above example, the Zip Code is not really a numeric value; each number represents a certain area. Using the zip code as a plain number will not create a meaningful relationship with other features such as age or salary, but if we encode it as categorical, the relationships with the other features are defined properly. So we treat the Zip Code feature as categorical (an enum) when we feed it to a machine learning algorithm.

String or character features should be set to categorical (enum) as well, so that the relationships among features generalize. If we add another feature named “Sex” to the dataset above, treating it as categorical will likewise improve the relationships with the other features:

Name     Age   Zip Code   Sex   Salary
Jim      43    94404      M     45000
Jon      37    94407      M     80000
Merry    36    94404      F     65000
Tim      42    94403      M     75000
Hailey   29    94407      F     60000

After encoding the Zip Code and Sex features as enums, the two features look like this:

Name     Age   Zip Code   Sex   Salary
Jim      43    1          1     45000
Jon      37    2          1     80000
Merry    36    1          0     65000
Tim      42    3          1     75000
Hailey   29    2          0     60000

The Name feature will not help us relate Age, Zip Code, and Sex in any way, so we can drop it and stick with Age, Zip Code, and Sex to first understand Salary and then predict Salary for new values. The input dataset will look like this:

Age   Zip Code   Sex
43    1          1
37    2          1
36    1          0
42    3          1
29    2          0

Above you can see that all the data is in numeric format, ready to be processed by an algorithm that can learn the relationships in it and then predict.
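
To make this concrete, here is a minimal sketch (assuming pandas is available; the exact integer codes depend on the order pandas assigns to the categories) that reproduces the enum-style encoding above:

import pandas as pd

# Rebuild the example dataset from the tables above
df = pd.DataFrame({
    'Name': ['Jim', 'Jon', 'Merry', 'Tim', 'Hailey'],
    'Age': [43, 37, 36, 42, 29],
    'Zip Code': ['94404', '94407', '94404', '94403', '94407'],
    'Sex': ['M', 'M', 'F', 'M', 'F'],
    'Salary': [45000, 80000, 65000, 75000, 60000],
})

# Drop Name (it carries no useful relationship to the other features)
# and encode Zip Code and Sex as categorical (enum) integer codes.
X = df.drop(['Name', 'Salary'], axis=1)
for col in ['Zip Code', 'Sex']:
    X[col] = X[col].astype('category').cat.codes

print(X)  # all-numeric features, ready for a learning algorithm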

What is One-Hot Encoding?

In the above example, the values Male and Female are just levels of the “Sex” feature, so their exposure to the other features is not very rich or deep. What if Male and Female were features themselves, like Age or Zip Code? In that case, the relationship between being male or female and the rest of the dataset would be much stronger. Using one-hot encoding for a specific feature gives each distinct value of that feature its own proper representation, which helps improve learning.

One-hot encoding does exactly that. It takes the distinct values of a feature and turns each one into a feature of its own, improving its relationship with the overall data. If we apply one-hot encoding to the “Sex” feature, the dataset looks like this:

Age   Zip Code   M   F   Salary
43    1          1   0   45000
37    2          1   0   80000
36    1          0   1   65000
42    3          1   0   75000
29    2          0   1   60000

If we decide to one-hot encode Zip Code as well, the dataset becomes:

Age   94404   94407   94403   M   F   Salary
43    1       0       0       1   0   45000
37    0       1       0       1   0   80000
36    1       0       0       0   1   65000
42    0       0       1       1   0   75000
29    0       1       0       0   1   60000

Above, each value now has its own explicit representation and a direct relationship with the other values. One-hot encoding is also called the one-of-K scheme.
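
For illustration, here is one possible way to produce this expansion in Python, a small sketch using pandas get_dummies on the same toy dataset (scikit-learn's OneHotEncoder, linked below, is another option):

import pandas as pd

df = pd.DataFrame({
    'Age': [43, 37, 36, 42, 29],
    'Zip Code': ['94404', '94407', '94404', '94403', '94407'],
    'Sex': ['M', 'M', 'F', 'M', 'F'],
    'Salary': [45000, 80000, 65000, 75000, 60000],
})

# Each distinct value of Zip Code and Sex becomes its own 0/1 column,
# mirroring the tables above.
encoded = pd.get_dummies(df, columns=['Zip Code', 'Sex'])
print(encoded)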

One-hot encoding can use either a dense or a sparse implementation when it creates the features from the encoded values.
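
Here is a rough sketch of the dense versus sparse distinction using scikit-learn's OneHotEncoder; the input is already label-encoded, since older scikit-learn versions require numeric input:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# 0 = F, 1 = M (already label-encoded)
sex_codes = np.array([[1], [1], [0], [1], [0]])

encoder = OneHotEncoder()          # sparse output is the default
sparse_mat = encoder.fit_transform(sex_codes)
print(type(sparse_mat))            # a scipy.sparse matrix storing only non-zeros

dense_mat = sparse_mat.toarray()   # densify: materializes all the zeros
print(dense_mat)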

Why Use It?

There are good reasons to use one-hot encoding on your data.

As you can see, one-hot encoding introduces sparsity into the original dataset, which is more memory friendly and can improve training time if the algorithm is designed to handle sparse data properly.
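
To make the memory argument concrete, here is an illustrative sketch (synthetic sizes, assuming NumPy and SciPy are available) comparing a dense one-hot block with its sparse equivalent:

import numpy as np
from scipy import sparse

# A toy one-hot block: 10000 rows, 1000 categories, a single 1 per row.
n_rows, n_cats = 10000, 1000
rows = np.arange(n_rows)
cols = np.random.randint(0, n_cats, size=n_rows)
data = np.ones(n_rows)

one_hot_sparse = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_cats))
one_hot_dense = one_hot_sparse.toarray()

print('dense bytes: ', one_hot_dense.nbytes)  # 10000 * 1000 * 8 bytes, ~80 MB
print('sparse bytes:', one_hot_sparse.data.nbytes
      + one_hot_sparse.indices.nbytes + one_hot_sparse.indptr.nbytes)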

Other Resources:

Please visit the following link to see the one-hot encoding implementation in scikit-learn:

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

For in-depth feature engineering, please see the following slides from HJ van Veen:


My experiment using LightGBM (Microsoft) from scratch on OS X

LightGBM is a fast, distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or MART) framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks. It is part of Microsoft's DMTK project (http://github.com/microsoft/dmtk).

Pre-requisite:

  • cmake
  • gcc (the commands below use Homebrew's gcc-6/g++-6)

Test Environment:

$ cmake -version
cmake version 3.6.2
$ gcc --version 
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Get Source

$ git clone --recursive https://github.com/Microsoft/LightGBM.git

Preparation:

$ cd LightGBM
$ mkdir build
$ cd build

Configuration:

$ cmake -DCMAKE_CXX_COMPILER=g++-6 -DCMAKE_C_COMPILER=gcc-6 .. 
-- The C compiler identification is GNU 6.2.0
-- The CXX compiler identification is GNU 6.2.0
-- Checking whether C compiler has -isysroot
-- Checking whether C compiler has -isysroot - yes
-- Checking whether C compiler supports OSX deployment target flag
-- Checking whether C compiler supports OSX deployment target flag - yes
-- Check for working C compiler: /usr/local/bin/gcc-6
-- Check for working C compiler: /usr/local/bin/gcc-6 -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Checking whether CXX compiler has -isysroot
-- Checking whether CXX compiler has -isysroot - yes
-- Checking whether CXX compiler supports OSX deployment target flag
-- Checking whether CXX compiler supports OSX deployment target flag - yes
-- Check for working CXX compiler: /usr/local/bin/g++-6
-- Check for working CXX compiler: /usr/local/bin/g++-6 -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Try OpenMP C flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Try OpenMP CXX flag = [-fopenmp]
-- Performing Test OpenMP_FLAG_DETECTED
-- Performing Test OpenMP_FLAG_DETECTED - Success
-- Found OpenMP: -fopenmp
-- Configuring done
CMake Warning (dev):
 Policy CMP0042 is not set: MACOSX_RPATH is enabled by default. Run "cmake
 --help-policy CMP0042" for policy details. Use the cmake_policy command to
 set the policy and suppress this warning.

MACOSX_RPATH is not specified for the following targets:

_lightgbm

This warning is for project developers. Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/avkashchauhan/src/github.com/microsoft/LightGBM/build

Build now:

$ make -j 
Scanning dependencies of target lightgbm
Scanning dependencies of target _lightgbm
[ 6%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
....
.....
....


[ 97%] Linking CXX shared library ../lib_lightgbm.so
[100%] Linking CXX executable ../lightgbm
[100%] Built target _lightgbm
[100%] Built target lightgbm

Install:

$ make install 
[ 50%] Built target _lightgbm
[100%] Built target lightgbm
Install the project...
-- Install configuration: ""
-- Installing: /usr/local/bin/lightgbm
-- Installing: /usr/local/lib/lib_lightgbm.so
-- Installing: /usr/local/include/LightGBM
-- Installing: /usr/local/include/LightGBM/application.h
-- Installing: /usr/local/include/LightGBM/bin.h
-- Installing: /usr/local/include/LightGBM/boosting.h
-- Installing: /usr/local/include/LightGBM/c_api.h
-- Installing: /usr/local/include/LightGBM/config.h
-- Installing: /usr/local/include/LightGBM/dataset.h
-- Installing: /usr/local/include/LightGBM/dataset_loader.h
-- Installing: /usr/local/include/LightGBM/export.h
-- Installing: /usr/local/include/LightGBM/feature.h
-- Installing: /usr/local/include/LightGBM/meta.h
-- Installing: /usr/local/include/LightGBM/metric.h
-- Installing: /usr/local/include/LightGBM/network.h
-- Installing: /usr/local/include/LightGBM/objective_function.h
-- Installing: /usr/local/include/LightGBM/tree.h
-- Installing: /usr/local/include/LightGBM/tree_learner.h
-- Installing: /usr/local/include/LightGBM/utils
-- Installing: /usr/local/include/LightGBM/utils/array_args.h
-- Installing: /usr/local/include/LightGBM/utils/common.h
-- Installing: /usr/local/include/LightGBM/utils/log.h
-- Installing: /usr/local/include/LightGBM/utils/openmp_wrapper.h
-- Installing: /usr/local/include/LightGBM/utils/pipeline_reader.h
-- Installing: /usr/local/include/LightGBM/utils/random.h
-- Installing: /usr/local/include/LightGBM/utils/text_reader.h
-- Installing: /usr/local/include/LightGBM/utils/threading.h

Test it now:

$ python -c 'import lightgbm as lg;print(lg.__version__)'
0.1

Sample code (Jupyter Notebook):

# In[1]:
import json
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error

# In[2]:
# load or create your dataset
print('Load data...')
df_train = pd.read_csv('~/src/github.com/microsoft/LightGBM/examples/regression/regression.train', header=None, sep='\t')
df_test = pd.read_csv('~/src/github.com/microsoft/LightGBM/examples/regression/regression.test', header=None, sep='\t')

# In[4]:
df_train.shape

# In[5]:
df_test.shape

# In[6]:
y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# In[8]:
y_train.shape

# In[10]:
X_train.shape

# In[11]:
X_test.shape

# In[12]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# In[13]:
# specify your configurations as a dict
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'auc'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# In[14]:
print('Start training...')
# train
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                early_stopping_rounds=5)

# In[15]:
print('Start predicting...')
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# eval
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)


# In[16]:
print('Dump model to JSON as : lightgbm_model.json')
# dump model to json (and save to file)
model_json = gbm.dump_model()

with open('lightgbm_model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)

print('lightgbm_model.json is saved to your local file system, typically the directory where the Jupyter notebook was started')

# In[17]:
print('Feature Importance Results:')
print('Feature names:', gbm.feature_name())
print('Calculate feature importances...')
# feature importances
print('Feature importances:', list(gbm.feature_importance()))

# In[18]:
print('Save model...')
# save model to file
gbm.save_model('lightgbm_model.txt')
print('lightgbm_model.txt is saved to your local file system, typically the directory where the Jupyter notebook was started')
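
As a follow-up usage sketch (not part of the original notebook), the saved model can be loaded back with lgb.Booster and used for prediction:

# In[19]:
# Load the saved model back and predict with it (usage sketch, not in
# the original notebook)
bst = lgb.Booster(model_file='lightgbm_model.txt')
y_pred_loaded = bst.predict(X_test)
print('The rmse of the reloaded model is:', mean_squared_error(y_test, y_pred_loaded) ** 0.5)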


A great way to probe personal traits through simple questions

Source: https://medium.com/the-coffeelicious/questions-that-truly-reveal-someones-personality-and-capabilities-ecf9f37fc4e2

stock-vector-personality-chart-with-keywords-and-icons-348777452

I really like these questions, which could open a personal window into anyone if asked properly…

I want to give full credit to the author Tiffany Sun for composing the list below.

Enjoy!

  1. If you could have superpowers, would you use them for good or for evil?
  2. How old would you be if you didn’t know how old you are?
  3. Would you accept the gift of reading other people’s minds if it meant you could never turn it off?
  4. If the average human life span was 40 years, how would you live your life differently?
  5. Do you think crying is a sign of weakness or strength?
  6. Would you rather be able to eat as much as you want with no weight gain, or require only 3 hours of sleep a day?
  7. If you had to choose to live without one of your 5 senses, which one would you give up?
  8. In what ways are you the same as your childhood self?
  9. If you had your own TV network, what would it be about?
  10. If you’re in a bad mood, do you prefer to be left alone or have someone cheer you up?
  11. Would you rather know without a doubt the purpose and direction of your life or never have to worry about money for the rest of your life?
  12. If you could master one skill you don’t have right now, what would it be?
  13. What song typifies the last 24 hours of your life?
  14. What words would you pass to your childhood self?
  15. If you had to do it over again, what would you study in school?
  16. If you could have any accent, which one would it be?
  17. Would you rather be married in an arranged marriage or spend the rest of your life single?
  18. If you could be someone of the opposite sex for a day, what would be the first thing you do?
  19. Would you rather have an extra hour every day or have $40 given to you free and clear every day?
  20. If you were to be stranded on a deserted island with one other person, who would it be?
  21. What would you do differently if you knew nobody would judge you?
  22. Would you rather spend 48 straight hours in a public restroom or spend the next 2 months taking only public transportation?
  23. What did you learn in school that has proven to be the least useful?
  24. If you had an extra hour every day, what would you do with it?
  25. Would you rather lose your sense of taste and smell or lose all of your hair?
  26. If you could invent something, what would it be and why?
  27. Would you rather have more than 5 friends or fewer than 5 friends?
  28. What stands between you and happiness?
  29. If today were to be your last day in your country, what would you want to do?
  30. Would you rather lose all of your old memories, or never be able to make new ones?
  31. What was the last thing you got for free?
  32. Would you rather be extremely attractive or be married to someone who is extremely attractive?
  33. What do you want to be remembered for?
  34. Would you rather have $50,000 free and clear or $1,000,000 that is illegal?
  35. If you could trade lives with one of your friends, who would it be?
  36. Would you rather discover something great and share it? Or discover something evil and prevent it?
  37. What movie deserves a sequel?
  38. If you could see 24 hours into the future, what would you be doing?



The state of cybersecurity startups and VC funding


Eighteen months ago, VentureBeat mapped 120 Israeli cybersecurity startups to serve as an internal guide and as a public resource for visitors to the Israeli cybersecurity scene.

Today, the landscape has grown to encompass 150 companies, including 40 new startups. These companies fall into 16 distinct categories, with two new ones: “deception” and “software development life cycle.”

More info can be gleaned from the article below:


Source: http://venturebeat.com/2017/02/03/elon-musk-other-ceos-voice-travel-ban-concerns-to-trump/