Installing R packages: missing shared library error



— Please select a CRAN mirror for use in this session —
Secure CRAN mirrors

1: 0-Cloud [https] 2: Algeria [https]
3: Australia (Canberra) [https] 4: Australia (Melbourne 1) [https]
5: Australia (Melbourne 2) [https] 6: Australia (Perth) [https]
7: Austria [https] 8: Belgium (Ghent) [https]
9: Brazil (PR) [https] 10: Brazil (RJ) [https]
11: Brazil (SP 1) [https] 12: Brazil (SP 2) [https]
13: Bulgaria [https] 14: Chile 1 [https]
15: Chile 2 [https] 16: China (Hong Kong) [https]
17: China (Guangzhou) [https] 18: China (Lanzhou) [https]
19: China (Shanghai 1) [https] 20: China (Shanghai 2) [https]
21: Colombia (Cali) [https] 22: Czech Republic [https]
23: Denmark [https] 24: East Asia [https]
25: Ecuador (Cuenca) [https] 26: Ecuador (Quito) [https]
27: Estonia [https] 28: France (Lyon 1) [https]
29: France (Lyon 2) [https] 30: France (Marseille) [https]
31: France (Montpellier) [https] 32: France (Paris 2) [https]
33: Germany (Erlangen) [https] 34: Germany (Göttingen) [https]
35: Germany (Münster) [https] 36: Greece [https]
37: Iceland [https] 38: India [https]
39: Indonesia (Jakarta) [https] 40: Ireland [https]
41: Italy (Padua) [https] 42: Japan (Tokyo) [https]
43: Japan (Yonezawa) [https] 44: Korea (Busan) [https]
45: Korea (Seoul 1) [https] 46: Korea (Ulsan) [https]
47: Malaysia [https] 48: Mexico (Mexico City) [https]
49: Norway [https] 50: Philippines [https]
51: Serbia [https] 52: Spain (A Coruña) [https]
53: Spain (Madrid) [https] 54: Sweden [https]
55: Switzerland [https] 56: Turkey (Denizli) [https]
57: Turkey (Mersin) [https] 58: UK (Bristol) [https]
59: UK (London 1) [https] 60: USA (CA 1) [https]
61: USA (IA) [https] 62: USA (KS) [https]
63: USA (MI 1) [https] 64: USA (NY) [https]
65: USA (OR) [https] 66: USA (TN) [https]
67: USA (TX 1) [https] 68: Uruguay [https]
69: Vietnam [https] 70: (other mirrors)
Selection: 60
trying URL ‘’

Content type ‘application/x-gzip’ length 712037 bytes (695 KB)

downloaded 695 KB
The downloaded binary packages are in
Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
unable to load shared object ‘/Library/Frameworks/R.framework/Resources/modules//’:
dlopen(/Library/Frameworks/R.framework/Resources/modules//, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
Referenced from: /Library/Frameworks/R.framework/Resources/modules//
Reason: image not found



Solution: Just install XQuartz from the link below; you don't need to set X11 as the default server at all. It should fix your problem.


Holdout prediction with cross validation in K-means modeling

Sometimes you may need to obtain the combined holdout predictions while the keep_cross_validation_predictions parameter is enabled. Here is sample Python code:

import h2o
from h2o.estimators.kmeans import H2OKMeansEstimator

h2o.init()
prostate = h2o.import_file("")
predictors = ["AGE", "RACE", "VOL", "GLEASON"]
kmeans_model = H2OKMeansEstimator(k=10, nfolds=5,
                                  keep_cross_validation_predictions=True)
kmeans_model.train(predictors, training_frame=prostate)
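The idea behind combining holdout predictions can be sketched without H2O at all. Below is a toy, stdlib-only illustration (the function name and the mean-predictor "model" are my own inventions, not H2O internals): each row's holdout prediction comes from the model trained on every fold except the row's own.

```python
# Conceptual sketch of cross-validation holdout predictions: with nfolds
# folds, each row is predicted by the model trained WITHOUT that row's fold.

def holdout_predictions(values, nfolds):
    # Assign rows to folds round-robin: row i belongs to fold i % nfolds.
    folds = [values[i::nfolds] for i in range(nfolds)]
    preds = [None] * len(values)
    for f in range(nfolds):
        # Train on every fold except f; the "model" is just the training mean.
        train = [v for g in range(nfolds) if g != f for v in folds[g]]
        model = sum(train) / len(train)
        # Rows in fold f get their holdout prediction from this model.
        for i in range(f, len(values), nfolds):
            preds[i] = model
    return preds

print(holdout_predictions([1.0, 2.0, 3.0, 4.0], 2))  # [3.0, 2.0, 3.0, 2.0]
```

With keep_cross_validation_predictions=True, H2O retains the analogous per-fold holdout predictions so they can be combined across the whole training frame.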



Sparkling Water 2.0 Walkthrough with pysparkling



Pysparkling Command:

$ bin/pysparkling --num-executors 2 --executor-memory 2g --driver-memory 2g --conf spark.dynamicAllocation.enabled=false
 Python 2.7.10 (default, Jul 30 2016, 18:31:42)
 [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
 Type "help", "copyright", "credits" or "license" for more information.
 Using Spark's default log4j profile: org/apache/spark/
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel).
 16/10/20 09:29:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 Welcome to
 ____ __
 / __/__ ___ _____/ /__
 _\ \/ _ \/ _ `/ __/ '_/
 /__ / .__/\_,_/_/ /_/\_\ version 2.0.1
 Using Python version 2.7.10 (default, Jul 30 2016 18:31:42)
 SparkSession available as 'spark'.

Now enter the following commands:

>>> from pysparkling import *
>>> from pyspark import SparkContext
>>> from pyspark.sql import SQLContext
>>> import h2o
>>> sqlContext = SQLContext(sc)
>>> sqlContext

>>> hc = H2OContext.getOrCreate(sc)

Here is the successful output:

 16/10/20 09:31:10 WARN InternalH2OBackend: Increasing 'spark.locality.wait' to value 30000
 16/10/20 09:31:10 WARN InternalH2OBackend: The property 'spark.scheduler.minRegisteredResourcesRatio' is not specified!
 We recommend to pass `--conf spark.scheduler.minRegisteredResourcesRatio=1`
 Warning: if you don't want to start local H2O server, then use of `h2o.connect()` is preferred.
 Checking whether there is an H2O instance running at connected.
 -------------------------- ----------------------------------------
 H2O cluster uptime: 09 secs
 H2O cluster version:
 H2O cluster version age: 1 month
 H2O cluster name: sparkling-water-avkashchauhan_2132345410
 H2O cluster total nodes: 3
 H2O cluster free memory: 2.364 Gb
 H2O cluster total cores: 24
 H2O cluster allowed cores: 24
 H2O cluster status: accepting new members, healthy
 H2O connection url:
 H2O connection proxy:
 Python version: 2.7.10 final
 -------------------------- ----------------------------------------

Now verify the Sparkling Water package and make sure the h2o module references the pysparkling 2.0-2.0.0 egg, as shown below.

>>> h2o

<module ‘h2o’ from ‘/private/var/folders/x7/331tvwcd6p17jj9zdmhnkpyc0000gn/T/avkashchauhan/spark/work/spark-28af708d-a149-435a-9a53-41e63d9ba7f5/userFiles-cf7c7aaf-610f-439d-9037-2dcddca73524/h2o_pysparkling_2.0-2.0.0-py2.7.egg/h2o/__init__.pyc’>

Getting Help for h2o:

>>> help(h2o)

Getting Cluster Status

>>> h2o.cluster_status()

Feature Binning in GBM models and prediction

We have a categorical variable with about 900 levels, some of which occur very rarely: out of millions of observations, some levels appear fewer than 500 times. The minimum number of rows chosen to fit the model was 500. In addition, we didn't want to restrict the number of bins, so we went with the default nbins_cats=1024.

I had a look at the POJO and for that categorical variable we got this part of the script:

// The class representing column C11
class GBM_model_python_1_ColInfo_99 implements … {
  public static final String[] VALUES = new String[406];
  static {
    GBM_model_python_1_ColInfo_99_0.fill(VALUES);
  }
  static final class GBM_model_python_1_ColInfo_99_0 implements … {
    static final void fill(String[] sa) {
      sa[0] = "C003";
      …
      sa[405] = "C702";
    }
  }
}

As you can see, in the script there are only 406 bins created. My questions are the following:

Question 1: What happens with the other bins up to 900? If when scoring the model we have a value for variable C99 from the bins not included in the 406, will that observation be scored? And how exactly variable C99 will be used when scoring the model?

Question 2: If the model will see for variable C99 a value outside the 900 distinct levels we had in our training data, will the model score that observation? If yes, how?

Yes, the H2O model (whether in POJO/MOJO form or running inside H2O) will score rows containing unseen categorical levels.


In GBM, during training, the optimal split direction for every feature value (numeric and categorical, including missing values/NAs) is computed, and stored for future use during scoring.

This means that missing numeric values, missing categorical values, and previously unseen categorical levels are all treated as NAs at scoring time.

Specifically, if there are NO NAs in the training data, then NAs in the test data follow the majority direction (the direction with the most observations); If there are NAs in the training data, then NAs in the test set follow the direction that is optimal for the NAs of the training data.
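As an illustration only (hypothetical data structure and function name, not H2O internals), the routing rule above can be sketched for a single categorical split:

```python
# Illustration of how one categorical split can route seen levels, NAs, and
# unseen levels at scoring time.

def route(value, split):
    """Return 'left' or 'right' for one categorical split."""
    seen = split["left_levels"] | split["right_levels"]
    if value is None or value not in seen:
        # NAs and unseen levels follow the stored NA direction: the majority
        # direction if training had no NAs, else the direction learned for NAs.
        return split["na_direction"]
    return "left" if value in split["left_levels"] else "right"

split = {
    "left_levels": {"C003", "C101"},   # levels that went left in training
    "right_levels": {"C702"},          # levels that went right in training
    "na_direction": "right",           # say, the majority direction
}

print(route("C003", split))  # "left"  (seen level)
print(route(None, split))    # "right" (NA)
print(route("C999", split))  # "right" (unseen level)
```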

For more details we have an updated GBM documentation page at:

Categorical Encoding, One Hot Encoding and why use it?

What is categorical encoding?

In data science, categorical values are encoded as enumerations so that algorithms can use them numerically when processing the data and learning relationships with the other features.

Name Age Zip Code Salary
Jim 43 94404 45000
Jon 37 94407 80000
Merry 36 94404 65000
Tim 42 94403 75000
Hailey 29 94407 60000

In the above example, the Zip Code is not really a numeric value; each number represents a certain area. Using Zip Code as a number will not create meaningful relationships with other features such as Age or Salary; however, if we encode it as categorical, those relationships can be defined properly. So we treat the Zip Code feature as categorical (enum) when we feed it to a machine learning algorithm.

String or character features should be set to categorical (enum) as well to generalize the relationships among features. If we add another feature named “Sex” to the dataset above, treating it as categorical will likewise improve those relationships.

Name Age Zip Code Sex Salary
Jim 43 94404 M 45000
Jon 37 94407 M 80000
Merry 36 94404 F 65000
Tim 42 94403 M 75000
Hailey 29 94407 F 60000

After encoding the Zip Code and Sex features as enums, both features will look like this:

Name Age Zip Code Sex Salary
Jim 43 1 1 45000
Jon 37 2 1 80000
Merry 36 1 0 65000
Tim 42 3 1 75000
Hailey 29 2 0 60000

Since the Name feature does not help relate Age, Zip Code, and Sex to Salary, we can drop it and keep Age, Zip Code, and Sex to learn Salary first, and then predict Salary for new values. The input dataset will then look like this:

Age Zip Code Sex
43 1 1
37 2 1
36 1 0
42 3 1
29 2 0

Above you can see that all the data is numeric and ready to be processed by an algorithm, which can first learn the relationships within it and then predict.
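As a concrete illustration, the enum encoding in the tables above can be reproduced with a few lines of plain Python (the function name is my own; H2O and other frameworks do this internally):

```python
# Minimal label (enum) encoding sketch using only the standard library.
# Levels are numbered in order of first appearance, matching the Zip Code
# column above (94404 -> 1, 94407 -> 2, 94403 -> 3).

def label_encode(values, mapping=None):
    """Map each distinct value to a small integer code."""
    if mapping is None:
        mapping = {}
        for v in values:
            if v not in mapping:
                mapping[v] = len(mapping) + 1
    return [mapping[v] for v in values], mapping

zip_codes, _ = label_encode([94404, 94407, 94404, 94403, 94407])
print(zip_codes)  # [1, 2, 1, 3, 2]

# For Sex the example table uses an explicit mapping: M -> 1, F -> 0.
sex_codes, _ = label_encode(["M", "M", "F", "M", "F"], {"M": 1, "F": 0})
print(sex_codes)  # [1, 1, 0, 1, 0]
```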

What is One Hot Encoding?

In the above example, the values Male and Female belong to the single feature “Sex”, so their interaction with the other features is limited. What if Male and Female were features themselves, like Age or Zip Code? Then their relationship with the rest of the dataset would be much stronger. Using one hot encoding for a specific feature gives each distinct element its own representation, which helps improve learning.

One Hot Encoding does exactly that: it turns each distinct value of a feature into a feature of its own, improving its relationship with the overall data. If we apply One Hot Encoding to the “Sex” feature, the dataset will look like this:

Age Zip Code M F Salary
43 1 1 0 45000
37 2 1 0 80000
36 1 0 1 65000
42 3 1 0 75000
29 2 0 1 60000

If we decide to set One Hot Encoding to Zip Code as well then our data set will look like as below:

Age 94404 94407 94403 M F Salary
43 1 0 0 1 0 45000
37 0 1 0 1 0 80000
36 1 0 0 0 1 65000
42 0 0 1 1 0 75000
29 0 1 0 0 1 60000

Above you can see that each value now has its own explicit representation and a direct relationship with the other values. One hot encoding is also called the one-of-K scheme.

One Hot encoding can use either a dense or a sparse implementation when it creates features from the encoded values.
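A dense one hot encoder is only a few lines of plain Python (the function name is my own; libraries such as scikit-learn provide production implementations):

```python
# Minimal dense one-hot encoding sketch (stdlib only): each distinct level
# becomes its own 0/1 column, as in the M/F columns above.

def one_hot(values):
    levels = sorted(set(values))  # fix a deterministic column order
    rows = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, rows

levels, rows = one_hot(["M", "M", "F", "M", "F"])
print(levels)  # ['F', 'M']
print(rows)    # [[0, 1], [0, 1], [1, 0], [0, 1], [1, 0]]
```

A sparse variant would store only the column index of the single 1 in each row instead of the full 0/1 row, which is what makes one hot encoding memory friendly for high-cardinality features.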

Why Use it?

There are several good reasons to use One Hot Encoding in the data.

As you can see, One Hot encoding introduces sparsity into the original dataset, which is more memory friendly and can improve learning time if the algorithm is designed to handle sparse data properly.

Other Resources:

Please visit the following link to see the One-Hot-Encoding implementation in scikit-learn:

For in depth feature engineering please visit the following slides from HJ Van Veen:

A great way to probe personal traits through simple questions



I really like these questions; asked properly, they can open a window into anyone's personality…

I want to give full credit to the author Tiffany Sun for composing the list below.


  1. If you could have superpowers, would you use them for good or for evil?
  2. How old would you be if you didn’t know how old you are?
  3. Would you accept the gift of reading other people’s minds if it meant you could never turn it off?
  4. If the average human life span was 40 years, how would you live your life differently?
  5. Do you think crying is a sign of weakness or strength?
  6. Would you rather be able to eat as much as you want with no weight gain, or require only 3 hours of sleep a day?
  7. If you had to choose to live without one of your 5 senses, which one would you give up?
  8. In what ways are you the same as your childhood self?
  9. If you had your own TV network, what would it be about?
  10. If you’re in a bad mood, do you prefer to be left alone or have someone cheer you up?
  11. Would you rather know without a doubt the purpose and direction of your life or never have to worry about money for the rest of your life?
  12. If you could master one skill you don’t have right now, what would it be?
  13. What song typifies the last 24 hours of your life?
  14. What words would you pass to your childhood self?
  15. If you had to do it over again, what would you study in school?
  16. If you could have any accent, which one would it be?
  17. Would you rather be married in an arranged marriage or spend the rest of your life single?
  18. If you could be someone of the opposite sex for a day, what would be the first thing you do?
  19. Would you rather have an extra hour every day or have $40 given to you free and clear every day?
  20. If you were to be stranded on a deserted island with one other person, who would it be?
  21. What would you do differently if you knew nobody would judge you?
  22. Would you rather spend 48 straight hours in a public restroom or spend the next 2 months taking only public transportation?
  23. What did you learn in school that has proven to be the least useful?
  24. If you had an extra hour every day, what would you do with it?
  25. Would you rather lose your sense of taste and smell or lose all of your hair?
  26. If you could invent something, what would it be and why?
  27. Would you rather have more than 5 friends or fewer than 5 friends?
  28. What stands between you and happiness?
  29. If today were to be your last day in your country, what would you want to do?
  30. Would you rather lose all of your old memories, or never be able to make new ones?
  31. What was the last thing you got for free?
  32. Would you rather be extremely attractive or be married to someone who is extremely attractive?
  33. What do you want to be remembered for?
  34. Would you rather have $50,000 free and clear or $1,000,000 that is illegal?
  35. If you could trade lives with one of your friends, who would it be?
  36. Would you rather discover something great and share it? Or discover something evil and prevent it?
  37. What movie deserves a sequel?
  38. If you could see 24 hours into the future, what would you be doing?




Handling various errors when installing Tensorflow

The problem happens if protobuf for Python is older than 3.1.0 and TF is older too. I had exactly the problem below:

$ python

Python 2.7.10 (default, Jul 30 2016, 19:40:32)
>>> from tensorflow.tools.tfprof import tfprof_log_pb2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named tfprof

This is how I solved it:

First, update setuptools to version 32.x or newer (I did this because my TF install was failing to update setuptools to 32.x on its own):

pip install --upgrade --user setuptools

After that I installed TF as below:

$ export TF_BINARY_URL=
$ sudo pip install --upgrade $TF_BINARY_URL

You will note that TF 0.12.1 installs the following:

Collecting tensorflow==0.12.1 from
Collecting numpy>=1.11.0 (from tensorflow==0.12.1)
Collecting protobuf>=3.1.0 (from tensorflow==0.12.1)
Collecting setuptools (from protobuf>=3.1.0->tensorflow==0.12.1)

After successful TF Install:

Successfully installed protobuf-3.1.0.post1 tensorflow-0.12.1
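Since the fix hinges on protobuf being at least 3.1.0, here is a small stdlib-only sketch (the helper name is mine) for comparing dotted version strings like the ones pip reports:

```python
# Compare dotted version strings by their leading numeric components, so a
# string like "3.1.0.post1" is treated as (3, 1, 0).

def version_tuple(v):
    parts = []
    for p in v.split("."):
        if not p.isdigit():
            break  # stop at suffixes like "post1"
        parts.append(int(p))
    return tuple(parts)

print(version_tuple("3.1.0.post1") >= version_tuple("3.1.0"))  # True
print(version_tuple("3.0.0") >= version_tuple("3.1.0"))        # False
```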

Then I retried the import that originally failed:

$ python

Python 2.7.10 (default, Jul 30 2016, 19:40:32)
>>> from tensorflow.tools.tfprof import tfprof_log_pb2
>>> tfprof_log_pb2
<module 'tensorflow.tools.tfprof.tfprof_log_pb2' from '/Library/Python/2.7/site-packages/tensorflow/tools/tfprof/tfprof_log_pb2.pyc'>