If you would like to know what word2vec is and why you should use it, there is plenty of material available. You can learn more about the H2O implementation of Word2Vec here, along with its configuration and interpretation.
In this Scala example we will use the H2O Word2Vec algorithm to build a Word2Vec model from a given text (supplied as a text file or an Array).
The full Scala code for the following example is available on my GitHub.
Let's start the H2O cluster first:
import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
Next, import the required libraries:
import scala.io.Source
import _root_.hex.word2vec.{Word2Vec, Word2VecModel}
import _root_.hex.word2vec.Word2VecModel.Word2VecParameters
import water.fvec.Vec
Next, create a list of stop words, which are not useful for text mining and will be removed from the source text:
val STOP_WORDS = Set("ourselves", "hers", "between", "yourself", "but", "again", "there", "about",
"once", "during", "out", "very", "having", "with", "they", "own", "an", "be", "some", "for", "do",
"its", "yours", "such", "into", "of", "most", "itself", "other", "off", "is", "s", "am", "or", "who", "as",
"from", "him", "each", "the", "themselves", "until", "below", "are", "we", "these", "your", "his", "through", "don", "nor", "me", "were", "her",
"more", "himself", "this", "down", "should", "our", "their", "while", "above", "both", "up",
"to", "ours", "had", "she", "all", "no", "when", "at", "any", "before", "them", "same", "and", "been", "have", "in", "will", "on", "does", "yourselves", "then", "that", "because", "what", "over", "why", "so", "can",
"did", "not", "now", "under", "he", "you", "herself", "has", "just", "where", "too", "only", "myself", "which", "those", "i", "after", "few", "whom", "t", "being", "if", "theirs", "my", "against", "a", "by", "doing",
"it", "how", "further", "was", "here", "than")
Now let's ingest the text data. We want to run the Word2Vec algorithm to vectorize the data first, and then run machine learning experiments on it.
I have downloaded a free story, “The Adventures of Sherlock Holmes”, from the Internet and am using it as my source.
val filename = "/Users/avkashchauhan/Downloads/TheAdventuresOfSherlockHolmes.txt"
val lines = Source.fromFile(filename).getLines.toArray
val sparkframe = sc.parallelize(lines)
Now let's define the tokenize function, which converts our input text into tokens:
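def tokenize(line: String) = {
  // Split on runs of non-word characters (punctuation, whitespace) rather than just " "
  line.split("""\W+""")
    .map(_.toLowerCase)
    // Remove the stop words defined above, then append null so that each line
    // ends with a missing value, which separates word sequences in the H2O frame
    .filterNot(word => STOP_WORDS.contains(word)) :+ null
}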
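To see what the tokenizer produces, here is a minimal, self-contained sketch that runs in plain Scala without Spark or H2O. The object name TokenizeSketch, the shortened stop-word set, and the sample sentence are assumptions made for illustration; the steps follow the tokenize function above.

```scala
object TokenizeSketch {
  // Shortened stop-word set, for illustration only (hypothetical subset).
  val stopWords: Set[String] = Set("the", "of", "a", "was", "to")

  // Split on non-word characters, lower-case, drop stop words,
  // then append null as an end-of-line marker.
  def tokenize(line: String): Array[String] =
    line.split("""\W+""")
      .map(_.toLowerCase)
      .filterNot(stopWords.contains) :+ null

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("To Sherlock Holmes she was always the woman.")
    // Print one token per line; the final entry is the null marker.
    tokens.foreach(t => println(Option(t).getOrElse("<null>")))
  }
}
```

Running the sketch on a single line makes it easy to check which words survive the stop-word filter before applying the function to the whole text.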
Now we call the tokenize function to create a collection of labelled words:
val allLabelledWords = sparkframe.flatMap(d => tokenize(d))
Note: You can also use your own tokenize function, or one from a library; you just need to map it over the RDD in the same way.
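As one example of a custom tokenizer, the sketch below also drops empty tokens (which split can produce for blank lines or lines that begin with punctuation) and purely numeric tokens. It is plain Scala with no Spark or H2O dependency; the object name CustomTokenizer, the shortened stop-word set, and the filtering rules are assumptions chosen for illustration, not part of the original example.

```scala
object CustomTokenizer {
  // Hypothetical, shortened stop-word set for the demonstration.
  val stopWords: Set[String] = Set("the", "in", "of", "and")

  def tokenize(line: String): Array[String] =
    line.split("""\W+""")
      .map(_.toLowerCase)
      .filter(_.nonEmpty)                     // "" appears for blank lines or leading punctuation
      .filterNot(_.forall(_.isDigit))         // drop purely numeric tokens such as "1888"
      .filterNot(stopWords.contains) :+ null  // keep the null end-of-line marker

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("\"In 1888, the year of the scandal.\"")
    println(tokens.mkString(", "))
  }
}
```

It can be mapped over the input in the same way as the original function, for example sparkframe.flatMap(d => CustomTokenizer.tokenize(d)).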
Now let's convert the collection of labelled words into an H2O frame:
val h2oFrame = h2oContext.asH2OFrame(allLabelledWords)
Now it is time to use the H2O Word2Vec algorithm, starting with its parameters:
val w2vParams = new Word2VecParameters
w2vParams._train = h2oFrame._key
w2vParams._epochs = 500
w2vParams._min_word_freq = 0
w2vParams._init_learning_rate = 0.05f
w2vParams._window_size = 20
w2vParams._vec_size = 20
w2vParams._sent_sample_rate = 0.0001f
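To build some intuition for _window_size, the sketch below lists the (center, context) word pairs that a skip-gram style model would see for a fixed window. This is a simplified illustration in plain Scala, not H2O's implementation: the object name WindowSketch and the helper contextPairs are hypothetical, and real word2vec implementations typically also subsample frequent words and vary the effective window.

```scala
object WindowSketch {
  // For each position i, pair tokens(i) with every token at most `window`
  // positions to its left or right (excluding position i itself).
  def contextPairs(tokens: Seq[String], window: Int): Seq[(String, String)] =
    for {
      i <- tokens.indices
      j <- math.max(0, i - window) to math.min(tokens.length - 1, i + window)
      if j != i
    } yield (tokens(i), tokens(j))

  def main(args: Array[String]): Unit = {
    val tokens = Seq("sherlock", "holmes", "always", "woman")
    contextPairs(tokens, window = 1).foreach(println)
  }
}
```

A larger window pairs each word with more distant neighbours, which tends to capture broader topical context at the cost of more training work.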
Now we will perform the real action, building the model:
val w2v = new Word2Vec(w2vParams).trainModel().get()
Now we can apply the model:
Let's start with a first test: finding synonyms with this word2vec model. We call the findSynonyms method with a word and the number of synonyms to return; the result is the top ‘count’ synonyms with their distance values:
w2v.findSynonyms("love", 3)
w2v.findSynonyms("help", 2)
w2v.findSynonyms("hate", 1)
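The idea behind a synonym lookup can be sketched in plain Scala, assuming that closeness is measured by cosine similarity between word vectors, which is the usual choice for word2vec models. The object name SynonymSketch, the helper nearestWords, and the toy 2-dimensional vectors are made up for this illustration and do not come from the trained model.

```scala
object SynonymSketch {
  // Cosine similarity between two non-zero vectors of equal length.
  def cosine(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x.toDouble * y }.sum
    val normA = math.sqrt(a.map(x => x.toDouble * x).sum)
    val normB = math.sqrt(b.map(x => x.toDouble * x).sum)
    dot / (normA * normB)
  }

  // Return the `count` words most similar to `word`, best match first.
  def nearestWords(vectors: Map[String, Array[Float]],
                   word: String,
                   count: Int): Seq[(String, Double)] =
    vectors.get(word) match {
      case None => Seq.empty // unknown word: nothing to compare against
      case Some(target) =>
        (vectors - word).toSeq
          .map { case (w, v) => (w, cosine(target, v)) }
          .sortBy { case (_, sim) => -sim }
          .take(count)
    }

  def main(args: Array[String]): Unit = {
    // Toy vectors, invented for the example.
    val vectors = Map(
      "love"   -> Array(1.0f, 0.0f),
      "adore"  -> Array(0.9f, 0.1f),
      "hate"   -> Array(-1.0f, 0.0f),
      "violin" -> Array(0.0f, 1.0f)
    )
    nearestWords(vectors, "love", 2).foreach(println)
  }
}
```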
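Let's transform words using the w2v model. The call below passes AggregateMethod.NONE, which maps each word to its own vector; pass AggregateMethod.AVERAGE instead to average the word vectors of each sequence.
The transform() function takes an H2O Vec as its first parameter; here the Vec is extracted from the H2O frame h2oFrame.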
val newSparkFrame = w2v.transform(h2oFrame.vec(0), Word2VecModel.AggregateMethod.NONE).toTwoDimTable()
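For comparison with AggregateMethod.NONE, the sketch below shows the idea behind averaging: the word vectors of each sequence, delimited by null, are averaged component-wise into a single vector. This is plain Scala written for illustration and is not H2O's implementation; the object name AverageSketch, the helper averagePerSequence, and the toy vectors are assumptions, and skipping words that have no vector is a choice made here for simplicity.

```scala
object AverageSketch {
  // Split the token stream at null markers, then average the vectors of each sequence.
  def averagePerSequence(tokens: Seq[String],
                         vectors: Map[String, Array[Float]],
                         vecSize: Int): Seq[Array[Float]] = {
    // Group tokens into sequences; a null closes the current sequence.
    val sequences = tokens.foldLeft(Vector(Vector.empty[String])) { (acc, t) =>
      if (t == null) acc :+ Vector.empty[String]
      else acc.init :+ (acc.last :+ t)
    }.filter(_.nonEmpty)

    sequences.map { seq =>
      val known = seq.flatMap(w => vectors.get(w)) // words without a vector are skipped
      val sum = new Array[Float](vecSize)
      known.foreach(v => for (i <- 0 until vecSize) sum(i) += v(i))
      if (known.isEmpty) sum else sum.map(_ / known.size)
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy vectors, invented for the example.
    val vectors = Map(
      "red"   -> Array(1.0f, 0.0f),
      "blue"  -> Array(0.0f, 1.0f),
      "green" -> Array(1.0f, 1.0f)
    )
    val tokens = Seq("red", "blue", null, "green", null)
    averagePerSequence(tokens, vectors, 2)
      .foreach(v => println(v.mkString("[", ", ", "]")))
  }
}
```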
That's it, enjoy!!