My Experiment with a Python-based Open Source Data Visualization and Analysis Tool – Orange (Part 1)

Install the complete Orange 2.0b package for Windows, which will install pyMl, PythonWin, Qt and several other Python libraries along with Orange itself.

http://orange.biolab.si/

With Orange you can load regular tab- or comma-delimited data, or C4.5 data, which is split across two files: *.data and *.names. Overall, Orange supports the C4.5, Assistant, Retis, and tab-delimited (native Orange) data formats.
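
For reference, a native Orange tab-delimited file carries three header rows: attribute names, attribute types, and optional flags (the flag "class" marks the class attribute). Below is an illustrative sketch of how a file such as lenses.tab is laid out; the columns are tab-separated (shown here with spaces for readability), and the sample rows simply mirror the examples printed further down rather than being a verbatim copy of the shipped file:

age        prescription   astigmatic   tear_rate   lenses
discrete   discrete       discrete     discrete    discrete
                                                   class
young      myope          no           reduced     none
young      myope          no           normal      soft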

Loading Tab-Delimited Data:

Launch PythonWin and load a tab-delimited file:

>>> import orange
>>> print orange.version
2.0b (19:12:26, Feb 14 2012)
>>> data = orange.ExampleTable("C:/Python27/lenses.tab")
>>> print data.domain.attributes
<Orange.feature.Discrete 'age', Orange.feature.Discrete 'prescription', Orange.feature.Discrete 'astigmatic', Orange.feature.Discrete 'tear_rate'>
>>>
>>> for i in data.domain.attributes:
...     print i.name
...

age
prescription
astigmatic
tear_rate
>>> for i in range(3):
...     print data[i]
...
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
>>> for i in range(5):
...     print data[i]
...
['young', 'myope', 'no', 'reduced', 'none']
['young', 'myope', 'no', 'normal', 'soft']
['young', 'myope', 'yes', 'reduced', 'none']
['young', 'myope', 'yes', 'normal', 'hard']
['young', 'hypermetrope', 'no', 'reduced', 'none']
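
The class attribute is kept separate from the ordinary attributes. As a small follow-up, here is an illustrative script-style sketch (continuing the same session) that looks up the class variable and tallies how many examples fall into each class; it relies on Orange 2.x's standard getclass() accessor on examples:

# continues from above: 'data' is the lenses ExampleTable
print data.domain.classVar.name          # name of the class attribute
counts = {}
for ex in data:
    label = str(ex.getclass())           # class value of this example
    counts[label] = counts.get(label, 0) + 1
print counts                             # number of examples per class value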

Loading C4.5 Data:

Launch PythonWin and load C4.5-formatted data:

>>> import os
>>> os.chdir("c:/Python27/ascdata")
>>> os.listdir(os.curdir)
['car.data', 'car.names', 'mydata.txt']
>>> car_data = orange.ExampleTable("car")
>>> print car_data.domain.attributes
<Orange.feature.Discrete 'buying', Orange.feature.Discrete 'maint', Orange.feature.Discrete 'doors', Orange.feature.Discrete 'persons', Orange.feature.Discrete 'lugboot', Orange.feature.Discrete 'safety'>

>>> for i in range(10):
...     print car_data[i]
...
['v-high', 'v-high', '2', '2', 'small', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'small', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'small', 'high', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'med', 'high', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'low', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'med', 'unacc']
['v-high', 'v-high', '2', '2', 'big', 'high', 'unacc']
['v-high', 'v-high', '2', '4', 'small', 'low', 'unacc']
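
Once the data is in an ExampleTable it can be fed straight into a learner. As an illustrative sketch (not part of the original session), here is Orange 2.x's built-in naive Bayes learner trained on car_data and applied back to a few rows; the exact predictions of course depend on the data:

# continues from the session above: car_data is the C4.5 ExampleTable
learner = orange.BayesLearner()                # naive Bayes learner
classifier = learner(car_data)                 # train on the whole table
for i in range(5):
    ex = car_data[i]
    print ex.getclass(), "->", classifier(ex)  # actual vs. predicted class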

Machine Learning Libraries in Python

Here is a collection of Machine Learning Libraries in Python:

PyBrain (http://pybrain.org/)

PyBrain is a modular Machine Learning Library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for Machine Learning Tasks and a variety of predefined environments to test and compare your algorithms. PyBrain is short for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, we came up with the name first and later reverse-engineered this quite descriptive “Backronym”.
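
As a quick taste of the API, here is a small illustrative sketch of PyBrain's feed-forward/backprop workflow on the XOR problem, using its standard buildNetwork, SupervisedDataSet and BackpropTrainer shortcuts; treat it as a toy sketch rather than a tuned example:

from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer

# a tiny network: 2 inputs, 3 hidden units, 1 output
net = buildNetwork(2, 3, 1)

# XOR training data
ds = SupervisedDataSet(2, 1)
for inp, target in [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]:
    ds.addSample(inp, target)

trainer = BackpropTrainer(net, ds)
for _ in range(100):              # a few training epochs
    trainer.train()

print net.activate((0, 1))        # should move toward 1.0 after training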

mlpy (http://mlpy.sourceforge.net/)

mlpy is a Python module for Machine Learning built on top of NumPy/SciPy and the GNU Scientific Library (GSL). mlpy provides a wide range of state-of-the-art machine learning methods for supervised and unsupervised problems, and it aims to find a reasonable compromise among modularity, maintainability, reproducibility, usability and efficiency. mlpy is multiplatform, works with Python 2 and 3, and is open source, distributed under the GNU General Public License version 3.
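
To give a feel for mlpy's learn/pred style, here is a tiny illustrative sketch using its linear discriminant classifier (LDAC) on made-up data; the class and method names follow the mlpy tutorial, but check them against the version you install:

import numpy as np
import mlpy

# toy two-class data: 4 samples, 2 features each
x = np.array([[1.0, 2.0], [1.5, 2.5], [5.0, 8.0], [6.0, 9.0]])
y = np.array([1, 1, 2, 2])

clf = mlpy.LDAC()          # linear discriminant analysis classifier
clf.learn(x, y)            # fit on the training data
print clf.pred(x)          # predict labels for the same samples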

PyML: Machine Learning in Python (http://pyml.sourceforge.net/)
PyML is an interactive object oriented framework for machine learning written in Python. PyML focuses on SVMs and other kernel methods. It is supported on Linux and Mac OS X.
Features

  • Classifiers: support vector machines, nearest neighbor, ridge regression
  • Multi-class methods (one-against-rest and one-against-one)
  • Feature selection (filter methods, RFE)
  • Model selection
  • Preprocessing and normalization
  • Syntax for combining classifiers
  • Classifier testing (cross-validation, error rates, ROC curves)
  • Various kernels for biological sequences (several variants of the spectrum kernel, and the weighted-degree kernel).
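
A minimal sketch of what an SVM cross-validation run looks like in PyML; the file name is a placeholder and the constructor and method names used here (VectorDataSet, labelsColumn, train, cv) are assumptions drawn from the PyML tutorial, so treat this as illustrative only:

from PyML import VectorDataSet, SVM

# 'mydata.data' is a placeholder for a delimited file whose last column holds the labels
data = VectorDataSet('mydata.data', labelsColumn=-1)

s = SVM()                  # default SVM classifier
s.train(data)              # train on the full dataset
results = s.cv(data, 5)    # 5-fold cross-validation
print results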

Shogun – A Large Scale Machine Learning Toolbox (http://www.shogun-toolbox.org/)
The machine learning toolbox's focus is on large-scale kernel methods and especially on Support Vector Machines (SVM) [1]. It provides a generic SVM object interfacing to several different SVM implementations, among them the state-of-the-art OCAS [21], LibLinear [20], LibSVM [2], SVMLight [3], SVMLin [4] and GPDT [5]. Each of the SVMs can be combined with a variety of kernels. The toolbox not only provides efficient implementations of the most common kernels, like the Linear, Polynomial, Gaussian and Sigmoid kernels, but also comes with a number of recent string kernels, e.g. the Locality Improved [6], Fisher [7], TOP [8], Spectrum [9], and Weighted Degree kernel (with shifts) [10] [11] [12]. For the latter, the efficient LINADD [12] optimizations are implemented. For linear SVMs, the COFFIN framework [22] [23] allows feature spaces to be computed on demand, on the fly, and even allows mixing sparse, dense and other data types. Furthermore, SHOGUN offers the freedom of working with custom pre-computed kernels.

One of its key features is the combined kernel, which can be constructed as a weighted linear combination of a number of sub-kernels, each of which need not work on the same domain. An optimal sub-kernel weighting can be learned using Multiple Kernel Learning [13] [14] [18] [19]. Currently, one-class, two-class and multiclass SVM classification and regression problems can be dealt with.

SHOGUN also implements a number of linear methods like Linear Discriminant Analysis (LDA), Linear Programming Machine (LPM) and (kernel) perceptrons, and provides algorithms to train hidden Markov models. The input feature objects can be dense, sparse or strings of type int/short/double/char, and can be converted into different feature types. Chains of preprocessors (e.g. subtracting the mean) can be attached to each feature object, allowing for on-the-fly pre-processing.
SHOGUN is implemented in C++ and interfaces to Matlab(tm), R, Octave and Python and is proudly released as Machine Learning Open Source Software.
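
For a feel of the Python interface, here is an illustrative sketch in the style of SHOGUN's old python_modular examples (Gaussian kernel plus LibSVM); the module, class and method names vary between SHOGUN versions, so treat this as a pseudocode-level sketch rather than a working recipe:

import numpy as np
from shogun.Features import RealFeatures, Labels
from shogun.Kernel import GaussianKernel
from shogun.Classifier import LibSVM

# feature matrix: one column per example (2 features, 4 examples)
train = RealFeatures(np.array([[0.0, 1.0, 2.0, 3.0],
                               [0.0, 1.0, 2.0, 3.0]]))
labels = Labels(np.array([-1.0, -1.0, 1.0, 1.0]))

kernel = GaussianKernel(train, train, 1.0)   # kernel width = 1.0
svm = LibSVM(1.0, kernel, labels)            # regularization C = 1.0
svm.train()
print svm.apply().get_labels()               # predictions on the training data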

MDP – Modular toolkit for Data Processing (http://mdp-toolkit.sourceforge.net/)
Modular toolkit for Data Processing (MDP) is a Python data processing framework.
From the user’s perspective, MDP is a collection of supervised and unsupervised learning algorithms and other data processing units that can be combined into data processing sequences and more complex feed-forward network architectures.
From the scientific developer's perspective, MDP is a modular framework which can easily be expanded. The implementation of new algorithms is easy and intuitive, and newly implemented units are automatically integrated with the rest of the library.
The base of available algorithms is steadily increasing and includes signal processing methods (Principal Component Analysis, Independent Component Analysis, Slow Feature Analysis), manifold learning methods ([Hessian] Locally Linear Embedding), several classifiers, probabilistic methods (Factor Analysis, RBM), data pre-processing methods, and many others.
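
For a sense of the node-and-flow style, here is a small illustrative sketch that runs PCA on random data and then chains PCA and FastICA into a flow, using MDP's documented node/Flow API (the data is meaningless and only demonstrates the calls):

import numpy as np
import mdp

x = np.random.random((200, 10))        # 200 observations, 10 variables

# a single node: train it, then project onto 3 principal components
pca = mdp.nodes.PCANode(output_dim=3)
pca.train(x)
pca.stop_training()
y = pca.execute(x)

# nodes can also be chained into a feed-forward flow
flow = mdp.Flow([mdp.nodes.PCANode(output_dim=5), mdp.nodes.FastICANode()])
flow.train(x)
out = flow(x)
print out.shape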

Orange http://orange.biolab.si/

Open source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting. Components for machine learning. Add-ons for bioinformatics and text mining. Packed with features for data analytics.
MILK: MACHINE LEARNING TOOLKIT (http://packages.python.org/milk/)
Milk is a machine learning toolkit in Python. Its focus is on supervised classification with several classifiers available: SVMs (based on libsvm), k-NN, random forests, decision trees. It also performs feature selection. These classifiers can be combined in many ways to form different classification systems.
For unsupervised learning, milk supports k-means clustering and affinity propagation. Milk is flexible about its inputs: it is optimised for numpy arrays but can often handle anything (for example, for SVMs you can use any datatype and any kernel and it does the right thing).
There is a strong emphasis on speed and low memory usage. Therefore, most of the performance-sensitive code is in C++, hidden behind Python-based interfaces for convenience.
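
For a feel of the API, here is a tiny illustrative sketch in the style of milk's README: a default classifier trained on toy numpy data, plus k-means on the same features (toy data only; defaults may vary between milk versions):

import numpy as np
import milk

# toy data: 20 samples, 3 features, binary labels
features = np.random.rand(20, 3)
labels = np.zeros(20)
labels[10:] = 1

learner = milk.defaultclassifier()       # milk picks a reasonable pipeline
model = learner.train(features, labels)
print model.apply(features[0])           # predicted label for one sample

# unsupervised: k-means clustering into 2 groups
assignments, centroids = milk.kmeans(features, 2)
print assignments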


scikit-learn: machine learning in Python (http://scikit-learn.sourceforge.net/stable/)

scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit world of scientific Python packages (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems that are accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.
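
A small illustrative example of the basic estimator API (fit/predict/score) on a bundled toy dataset; module paths follow the sklearn package layout:

from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()              # bundled toy dataset
X, y = iris.data, iris.target

clf = SVC(kernel='linear')               # support vector classifier
clf.fit(X[:-10], y[:-10])                # train on all but the last 10 samples
print clf.predict(X[-10:])               # predict the held-out samples
print clf.score(X[-10:], y[-10:])        # accuracy on the held-out samples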


Cloud Fair 2012 Presentation on Apache Hadoop, Windows Azure and Open Source

Recently I was given an opportunity to talk about “Apache Hadoop on Windows Azure” and “Open Source & Cloud Computing” at Cloud Fair 2012 in Seattle. If you wish to get the presentations, please follow the info below:

Presentation Topic: Apache Hadoop on Windows Azure

Since Microsoft's adoption of open source Apache Hadoop in its cloud offering and the CTP release of “Apache Hadoop on Windows Azure”, Hadoop has become one of the most exciting and widely adopted technologies for analyzing large amounts of data in a very simple way. The service is offered as an elastic service for both on-premise and public clouds, based on Microsoft and Apache Hadoop technologies. Scale-invariant insight, information processing, and analytics are now available to all participants in the Microsoft ecosystem – cloud and enterprise. Best of all, these capabilities have been bridged into the vibrant domains of Office BI and Collaboration, Data Warehousing, and Visualization/Reporting.

You will learn how to unlock business insights from all your structured and unstructured data, including large volumes of data not previously activated, with Microsoft's Big Data solution. Using live demonstrations, I will explain how you can use the enterprise-class Hadoop-based solutions designed by Microsoft on both Windows Server and Windows Azure. This talk is developer-oriented, and the tutorial covers installation, configuration and simplified Big Data analysis with JavaScript.

Please download the Presentation from the link below:
http://www.papershare.com/paper/Apache-Hadoop-on-Windows-Azure

Presentation Title: Open Source and Cloud Computing

Targeting the Open Source developer community will unlock significant growth opportunities for any cloud service vendor; however, for an open source developer it is very hard to decide which cloud option to choose. Cost reduction is one of the main value propositions of both cloud services and open source tools, so combining the two promises a significant reduction in cost; other hidden and unknown factors, however, may appear later. A significant portion of the development community uses the LAMP stack, and when you add Java, JavaScript, Ruby, CGI, Python, Node.js, *SQL, *DB, and Hadoop to the list, you have a majority of the applications currently striving to move to the cloud.
This interactive session targets open source developers, suggesting which cloud services could be the best option and what they should really look for in a cloud platform. Attendees will learn about cloud services, both big and small, that have successfully adopted open source application support in their service offerings. Attendees can use this information to understand what technical limitations exist and how to overcome these hurdles.

Please download the Presentation from the link below:
http://www.papershare.com/paper/Open-Source-and-Cloud-Computing

MapReduce in Cloud

When someone looks at the cloud for MapReduce to process a large amount of data, I think this is what they are looking for:

  1. A collection of machines that are Hadoop/MapReduce-ready and instantly available
  2. No need to build Hadoop (HDFS/MapReduce) instances from scratch: several IaaS services will give you hundreds of machines in the cloud, but building a Hadoop cluster on them yourself is a nightmare
  3. The ability to hook up your data and push MapReduce jobs immediately
  4. Being in the cloud means harvesting the power of thousands of machines available “instantly” and paying only for the CPU hours you actually consume
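
For orientation, the kind of job you push is often as simple as a streaming word count. Here is a minimal illustrative sketch of the two Python scripts Hadoop Streaming expects; the file names are placeholders and nothing here is tied to a particular cloud offering:

# mapper.py -- reads lines from stdin, emits "word<TAB>1" for every word
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print "%s\t%d" % (word, 1)

# reducer.py -- input arrives sorted by key; sum the counts per word
import sys
current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print "%s\t%d" % (current, total)
        current, total = word, 0
    total += int(count)
if current is not None:
    print "%s\t%d" % (current, total)

Both scripts are submitted through the hadoop-streaming jar with its -input, -output, -mapper and -reducer options; the services below differ mainly in how they provision the cluster underneath.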

Here are a few options that are available now, all of which I tried before writing this:

Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure, but it is under a limited CTP; you can provide your information and request CTP access at the link below:
https://www.hadooponazure.com/

The Developer Preview for the Apache Hadoop-based Services for Windows Azure is available by invitation.

Amazon Elastic MapReduce:
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
http://aws.amazon.com/elasticmapreduce/

Google BigQuery:
Besides these, you can also try Google BigQuery, where you first have to move your data into Google's proprietary storage and then run BigQuery on it. Remember that BigQuery is based on Dremel, which is similar to MapReduce but faster thanks to its column-based processing.
Google BigQuery is invitation-only; however, you can certainly request access:
https://developers.google.com/bigquery/

Mortar Data:
Another option is to use Mortar Data. They have used Python and Pig intelligently to make it easy to write jobs and visualize the results. I found it very interesting; please have a look:
http://mortardata.com/#!/how_it_works

Big Data in Astronomical scale HDF and HUDF

Scientists in general, and astronomers in particular, have been at the forefront when it comes to dealing with large amounts of data. These days, the “Big Data” community, as it is known, includes almost every scientific endeavor — and even you.

In fact, Big Data is not just about extremely large collections of information hidden in databases inside archives like the Barbara A. Mikulski Archive for Space Telescopes. Big Data includes the hidden data you carry with you all the time in now-ubiquitous smart phones: calendars, photographs, SMS messages, usage information and records of our current and past locations. As we live our lives, we leave behind us a “data exhaust” that tells something about ourselves.

Star-Forming Region LH 95 in the Large Magellanic Cloud

…..

In late 1995, the Hubble Space Telescope took hundreds of exposures of a seemingly empty patch of sky near the constellation of Ursa Major (the Big Dipper). The Hubble Deep Field (HDF), as it is known, uncovered a mystifying collection of about 3,000 galaxies at various stages of their evolution. Most of the galaxies were faint, and from them we began to learn a story about our Universe that had not been told before.

……

So was the HDF unique? Were we just lucky to observe a crowded but faint patch of sky? To address this question, and determine if indeed the HDF was a “lucky shot,” in 2004  Hubble took a million-second-long exposure in a similarly “empty” patch of sky: The Hubble Ultra Deep Field (HUDF). The result was even more breathtaking. Containing an estimated 10,000 galaxies, the HUDF revealed glimpses of the first galaxies as they emerge from the so-called “dark ages” — the time shortly after the Big Bang when the first stars reheated the cold, dark universe. As with the HDF, the HUDF data was made immediately available to the community, and has spawned hundreds of publications and several follow-up observations.

Read Full Article at: http://hubblesite.org/blog/2012/04/data-exhaust/

Open Source system for data mining – RapidMiner

RapidMiner is unquestionably the world-leading open-source system for data mining. It is available as a stand-alone application for data analysis and as a data mining engine for integration into your own products. Thousands of applications of RapidMiner in more than 40 countries give their users a competitive edge.

  • Data Integration, Analytical ETL, Data Analysis, and Reporting in one single suite
  • Powerful but intuitive graphical user interface for the design of analysis processes
  • Repositories for process, data and metadata handling
  • Only solution with metadata transformation: forget trial and error and inspect results already at design time
  • Only solution which supports on-the-fly error recognition and quick fixes
  • Complete and flexible: Hundreds of data loading, data transformation, data modeling, and data visualization methods

[Screenshot: RapidMiner design perspective]