List of machine learning algorithms supported in Mahout 0.7

Here is a list of the built-in algorithms and utility programs available in Mahout 0.7:

  •   arff.vector: Generate Vectors from an ARFF file or directory
  •   baumwelch: Baum-Welch algorithm for unsupervised HMM training
  •   canopy: Canopy clustering
  •   cat: Print a file or resource as the logistic regression models would see it
  •   cleansvd: Cleanup and verification of SVD output
  •   clusterdump: Dump cluster output to text
  •   clusterpp: Groups Clustering Output In Clusters
  •   cmdump: Dump confusion matrix in HTML or text formats
  •   cvb: LDA via Collapsed Variational Bayes (0th deriv. approx)
  •   cvb0_local: LDA via Collapsed Variational Bayes, in memory locally.
  •   dirichlet: Dirichlet Clustering
  •   eigencuts: Eigencuts spectral clustering
  •   evaluateFactorization: compute RMSE and MAE of a rating matrix factorization against probes
  •   fkmeans: Fuzzy K-means clustering
  •   fpg: Frequent Pattern Growth
  •   hmmpredict: Generate random sequence of observations by given HMM
  •   itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering
  •   kmeans: K-means clustering
  •   lucene.vector: Generate Vectors from a Lucene index
  •   matrixdump: Dump matrix in CSV format
  •   matrixmult: Take the product of two matrices
  •   meanshift: Mean Shift clustering
  •   minhash: Run Minhash clustering
  •   parallelALS: ALS-WR factorization of a rating matrix
  •   recommendfactorized: Compute recommendations using the factorization of a rating matrix
  •   recommenditembased: Compute recommendations using item-based collaborative filtering
  •   regexconverter: Convert text files on a per line basis based on regular expressions
  •   rowid: Map SequenceFile<Text,VectorWritable> to {SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
  •   rowsimilarity: Compute the pairwise similarities of the rows of a matrix
  •   runAdaptiveLogistic: Score new production data using a probably trained and validated AdaptivelogisticRegression model
  •   runlogistic: Run a logistic regression model against CSV data
  •   seq2encoded: Encoded Sparse Vector generation from Text sequence files
  •   seq2sparse: Sparse Vector generation from Text sequence files
  •   seqdirectory: Generate sequence files (of Text) from a directory
  •   seqdumper: Generic Sequence File dumper
  •   seqmailarchives: Creates SequenceFile from a directory containing gzipped mail archives
  •   seqwiki: Wikipedia xml dump to sequence file
  •   spectralkmeans: Spectral k-means clustering
  •   split: Split Input data into test and train sets
  •   splitDataset: split a rating dataset into training and probe parts
  •   ssvd: Stochastic SVD
  •   svd: Lanczos Singular Value Decomposition
  •   testnb: Test the Vector-based Bayes classifier
  •   trainAdaptiveLogistic: Train an AdaptivelogisticRegression model
  •   trainlogistic: Train a logistic regression using stochastic gradient descent
  •   trainnb: Train the Vector-based Bayes classifier
  •   transpose: Take the transpose of a matrix
  •   validateAdaptiveLogistic: Validate an AdaptivelogisticRegression model against hold-out data set
  •   vecdist: Compute the distances between a set of Vectors (or Cluster or Canopy, they must fit in memory) and a list of Vectors
  •   vectordump: Dump vectors from a sequence file to text
  •   viterbi: Viterbi decoding of hidden states from given output states sequence

To learn more about any of the above, pass its name to the mahout command and it will print details on how to use it:

$ mahout kmeans

usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths>              comma separated archives to be unarchived
                               on the compute machines.
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-files <paths>                 comma separated files to be copied to the
                               map reduce cluster
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-libjars <paths>               comma separated jar files to include in the classpath.
-tokenCacheFile <tokensFile>   name of the file with the tokens

Missing required option --clusters

Usage:
[--input <input> --output <output> --distanceMeasure <distanceMeasure>
--clusters <clusters> --numClusters <k> --convergenceDelta <convergenceDelta>
--maxIter <maxIter> --overwrite --clustering --method <method>
--outlierThreshold <outlierThreshold> --help --tempDir <tempDir> --startPhase
<startPhase> --endPhase <endPhase>]
--clusters (-c) clusters    The input centroids, as Vectors.  Must be a
                            SequenceFile of Writable, Cluster/Canopy.  If k is
                            also specified, then a random set of vectors will
                            be selected and written out to this path first
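
The error above goes away once --clusters is supplied. As a rough sketch (not a tested recipe), the script below assembles a kmeans command line using only flags that appear in the usage text. Every path and numeric value is a hypothetical placeholder, and other options may also be needed in your setup. The script only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Sketch only: every path and number below is a hypothetical placeholder.
INPUT="vectors/tfidf-vectors"        # input vectors (assumed location)
CLUSTERS="kmeans/initial-clusters"   # initial centroids; with --numClusters set,
                                     # random vectors are written here first
OUTPUT="kmeans/output"               # where the clustering result is written
K=20                                 # number of clusters (--numClusters)
MAX_ITER=10                          # iteration cap (--maxIter)

# Build the command line from flags listed in the usage text and print it,
# so nothing is submitted to the cluster until it has been checked.
kmeans_cmd() {
  printf '%s ' mahout kmeans \
    --input "$INPUT" \
    --clusters "$CLUSTERS" \
    --output "$OUTPUT" \
    --numClusters "$K" \
    --maxIter "$MAX_ITER" \
    --overwrite \
    --clustering
  printf '\n'
}

kmeans_cmd
```

Once the printed line looks right, it can be pasted into the shell or executed with `kmeans_cmd | sh`.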
