Pattern Analysis Computation Methods and Algorithms for Machine Learning

In machine learning, pattern recognition is the assignment of a label to a given input value. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is “spam” or “non-spam”). However, pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input; sequence labeling, which assigns a class to each member of a sequence of values (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.
Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to do “fuzzy” matching of inputs. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors. In contrast to pattern recognition, pattern matching is generally not considered a type of machine learning, although pattern-matching algorithms (especially with fairly general, carefully tailored patterns) can sometimes succeed in providing similar-quality output to the sort provided by pattern-recognition algorithms.
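
The contrast can be made concrete with a small sketch. It is only an illustration under stated assumptions: the regular expression, the toy labeled examples, and the helper names `matches_pattern` and `nearest_label` are all hypothetical, and the string-similarity nearest-neighbor rule is a crude stand-in for a trained classifier, not a method described above.

```python
import re
from difflib import SequenceMatcher

# Pattern matching: a hand-written regular expression that only fires
# on an exact occurrence of the phrase (ignoring case).
SPAM_PATTERN = re.compile(r"free money", re.IGNORECASE)


def matches_pattern(text):
    """Return True only if the text contains the exact phrase."""
    return SPAM_PATTERN.search(text) is not None


# Hypothetical labeled examples, invented for this illustration.
LABELED_EXAMPLES = [
    ("free money now", "spam"),
    ("win a free prize today", "spam"),
    ("meeting moved to monday", "non-spam"),
    ("lunch at noon tomorrow", "non-spam"),
]


def nearest_label(text):
    """Return the label of the most similar example (1-nearest neighbor).

    Unlike the regular expression, this always produces an answer,
    even for inputs that match no stored example exactly.
    """
    def similarity(example):
        return SequenceMatcher(None, text.lower(), example[0]).ratio()

    return max(LABELED_EXAMPLES, key=similarity)[1]


obfuscated = "fr3e m0ney now"
exact_hit = matches_pattern(obfuscated)   # the regex misses the obfuscated text
fuzzy_label = nearest_label(obfuscated)   # the similarity rule still labels it
```

The regular expression either matches or it does not, while the similarity rule degrades gradually as the input drifts away from the stored examples.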


Pattern Analysis Computation Methods:

  • Ridge regression
  • Regularized Fisher discriminant
  • Regularized kernel Fisher discriminant
  • Maximizing variance
  • Maximizing covariance
  • Canonical correlation analysis
  • Kernel CCA
  • Regularized CCA
  • Kernel regularized CCA
  • Smallest enclosing hypersphere
  • Soft minimal hypersphere
  • nu-soft minimal hypersphere
  • Hard margin SVM
  • 1-norm soft margin SVM
  • 2-norm soft margin SVM
  • Ridge regression optimization
  • Quadratic epsilon-insensitive
  • Linear epsilon-insensitive SVR
  • nu-SVR 
  • Soft ranking 
  • Cluster quality 
  • Cluster optimization strategy 
  • Multiclass clustering
  • Relaxed multiclass clustering 
  • Visualization quality
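
As a worked instance of one entry in the list above, ridge regression has a closed-form solution. The notation is an assumption made for this sketch and does not come from the list itself: X is the matrix of training inputs (one row per example), y is the vector of targets, λ > 0 is the regularization parameter, and K = XXᵀ is the kernel (Gram) matrix with entries k(x_i, x_j).

```latex
\min_{w} \; \lambda \lVert w \rVert^{2} + \sum_{i=1}^{\ell} \bigl( y_i - \langle w, x_i \rangle \bigr)^{2}

% Primal solution
w = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} y

% Dual solution, with w = X^{\top} \alpha
\alpha = \left( K + \lambda I \right)^{-1} y,
\qquad
f(x) = \sum_{i=1}^{\ell} \alpha_i \, k(x_i, x)
```

The dual form depends on the data only through inner products, which is what allows a kernel to be substituted for K.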

Pattern Analysis Algorithms:

  • Normalization 
  • Centering data 
  • Simple novelty detection 
  • Parzen based classifier 
  • Cholesky decomposition or dual Gram-Schmidt
  • Standardizing data 
  • Kernel Fisher discriminant 
  • Primal PCA 
  • Kernel PCA 
  • Whitening 
  • Primal CCA 
  • Kernel CCA 
  • Principal components regression 
  • PLS feature extraction 
  • Primal PLS 
  • Kernel PLS 
  • Smallest hypersphere enclosing data
  • Soft hypersphere minimization
  • nu-soft minimal hypersphere
  • Hard margin SVM 
  • Alternative hard margin SVM 
  • 1-norm soft margin SVM 
  • nu-SVM
  • 2-norm soft margin SVM
  • Kernel ridge regression
  • 2-norm SVR
  • 1-norm SVR
  • nu-support vector regression
  • Kernel perceptron
  • Kernel adatron 
  • On-line SVR
  • nu-ranking
  • On-line ranking
  • Kernel k-means
  • MDS for kernel-embedded data
  • Data visualization
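
To give a feel for one entry in the list above, here is a minimal sketch of a kernel perceptron in plain Python. It rests on several assumptions that are not in the list: a Gaussian (RBF) kernel with `gamma=1.0`, no bias term, labels in {-1, +1}, and a toy XOR dataset. The function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def decision_value(x, X, y, alpha, kernel):
    """Dual-form score: sum_j alpha_j * y_j * k(x_j, x)."""
    return sum(a * y_j * kernel(x_j, x) for a, x_j, y_j in zip(alpha, X, y))


def train_kernel_perceptron(X, y, kernel=rbf_kernel, max_epochs=100):
    """Return dual coefficients; alpha[i] counts the mistakes made on example i."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x_i, y_i) in enumerate(zip(X, y)):
            # A non-positive margin counts as a mistake and triggers an update.
            if y_i * decision_value(x_i, X, y, alpha, kernel) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: stop early
            break
    return alpha


def predict(x, X, y, alpha, kernel=rbf_kernel):
    """Classify a point as +1 or -1 from the sign of the dual-form score."""
    return 1 if decision_value(x, X, y, alpha, kernel) > 0 else -1


# XOR: not linearly separable in the input space, separable with the RBF kernel.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(X, y)
predictions = [predict(x, X, y, alpha) for x in X]
```

The algorithm never forms a weight vector explicitly; it only stores per-example mistake counts and evaluates the kernel, which is the pattern shared by the other kernel methods in the list.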


Keywords: Data Mining, Machine Learning, Algorithms



Details on scikit-learn, a Python-Based Machine Learning Library

scikit-learn is a Python-based machine learning library released as open source under the BSD license. It requires:

  • Python >= 2.6
  • NumPy >= 1.3
  • SciPy >= 0.7

Includes supervised learning algorithms:

  • Generalized Linear Models with scipy.sparse bindings for wide-feature datasets
  • Support Vector Machine (SVM) based on libsvm
  • Stochastic Gradient Descent
  • Bayesian methods
  • Gaussian Processes
  • Nearest Neighbors
  • Partial Least Squares
  • Naive Bayes
  • Decision Trees
  • Ensemble methods
  • Multiclass and multilabel algorithms
  • Feature selection
  • L1 and L1+L2 regularized regression methods aka Lasso and Elastic Net models implemented with algorithms such as LARS and coordinate descent
  • Linear and Quadratic Discriminant Analysis

Includes unsupervised clustering algorithms:

  • Gaussian mixture models
  • k-means++
  • mean shift
  • affinity propagation
  • Manifold learning
  • spectral clustering
  • Decomposing signals in components (matrix factorization problems)
  • Covariance estimation
  • Novelty and Outlier Detection
  • Hidden Markov Models (HMMs)

Includes other tools:

  • feature extractors for text content (token and char n-grams + hashing vectorizer)
  • univariate feature selection
  • a simple pipeline tool
  • numerous implementations of cross-validation strategies
  • performance metrics evaluation and plotting (ROC curve, AUC, confusion matrix, …)
  • a grid search utility to perform hyperparameter tuning using parallel cross-validation
  • integration with joblib for caching partial results when working in an interactive environment (e.g. using IPython)


  • Each algorithm implementation comes with sample programs demonstrating its usage on either toy data or real-life datasets.
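
The pipeline, cross-validation, and grid search tools listed above can be combined in a few lines. The following is a sketch rather than a recommended setup, and it assumes a recent scikit-learn release in which `GridSearchCV` and `cross_val_score` live in `sklearn.model_selection` (older releases, such as those matching the version requirements above, placed them in other modules). The dataset, the step names `scale` and `svc`, and the grid of `C` values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# A small built-in dataset, used here only as toy data.
X, y = load_iris(return_X_y=True)

# Pipeline: standardize the features, then fit an RBF-kernel SVM.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

# Grid search over the SVM regularization parameter with 5-fold
# cross-validation. Raising n_jobs evaluates candidates in parallel.
param_grid = {"svc__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=1)
search.fit(X, y)

# Cross-validated accuracy of the best pipeline found by the search.
scores = cross_val_score(search.best_estimator_, X, y, cv=5)
print(search.best_params_, scores.mean())
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not influence the preprocessing.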

Source code:

  • Get the source code from GitHub

Top Algorithms Used in Data Mining

I am trying to compile a comprehensive list of data mining algorithms, and while doing so I found that a top 10 list can be created in several ways.

Based on a scientific research paper, here are the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006. These are among the most influential data mining algorithms in the research community:

  1. C4.5
  2. k-Means
  3. SVM
  4. Apriori
  5. EM
  6. PageRank
  7. AdaBoost
  8. kNN
  9. Naive Bayes
  10. CART

Public Voting:

  1. Decision Trees/Rules
  2. Regression
  3. Clustering
  4. Statistics (descriptive)
  5. Visualization
  6. Time series/Sequence analysis
  7. Support Vector Machines (SVM)
  8. Association rules
  9. Ensemble methods
  10. Text Mining
  11. Neural Nets
  12. Boosting
  13. Bayesian
  14. Bagging
  15. Factor Analysis
  16. Anomaly/Deviation detection
  17. Social Network Analysis
  18. Survival Analysis
  19. Genetic algorithms
  20. Uplift modeling

Based on voting on the Mahout user mailing list, here is the list:

  1. Matrix factorization (SVD)
  2. k-means
  3. Naive Bayes
  4. Dirichlet Process Clustering
  5. Matrix Factorization
  6. Frequent Pattern Matching
  7. LDA
  8. Expectation Maximization
  9. SVM
  10. Decision Trees
  11. Logistic Regression
  12. Random Forest