Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

What is Kylin?

  • Kylin is an open source distributed analytics engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) for extremely large datasets on Hadoop.


Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for datasets of 10+ billion rows
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions (see the query sketch after this list)
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, faster than Hive queries against the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build cubes in Kylin from more than 10 billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration with BI tools such as Tableau
  • Other Highlights:
    • Job management and monitoring
    • Compression and encoding support
    • Incremental refresh of cubes
    • Leverages HBase coprocessors to reduce query latency
    • Approximate query capability for distinct count (HyperLogLog)
    • Easy web interface to manage, build, monitor, and query cubes
    • Security capability to set ACLs at the cube/project level
    • LDAP integration support
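
Since Kylin exposes its SQL interface over a REST endpoint (in addition to JDBC/ODBC drivers), a quick way to try a query is plain HTTP. Below is a minimal Python sketch; the host, the default ADMIN/KYLIN credentials, and the learn_kylin sample project with its kylin_sales table are assumptions, so substitute your own values:

    import requests

    KYLIN = "http://localhost:7070/kylin/api"  # assumed host and port

    session = requests.Session()
    session.auth = ("ADMIN", "KYLIN")  # Kylin's default credentials; change in production

    # Submit an ANSI SQL query against a pre-built cube.
    payload = {
        "sql": "SELECT part_dt, SUM(price) AS total FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",  # assumed sample project
        "limit": 10,
    }
    resp = session.post(KYLIN + "/query", json=payload)
    resp.raise_for_status()

    # Each result row comes back as a list of column values.
    for row in resp.json()["results"]:
        print(row)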

Keywords: Kylin, Big Data, Hadoop, Jobs, OLAP, SQL, Query


Big Data – Transitioning from Velocity, Variety, and Volume to Adding Variability and Complexity to the Mix

Previous Definition: Velocity, Variety and Volume

New Definition: Velocity, Variety and Volume + Variability and Complexity


Visualizing Social Network Data based on Twitter #Hashtag using NodeXL

In the live demo you will learn how to visualize social network data using NodeXL. NodeXL is an open source add-in for Excel that is great for downloading social network data from Twitter, Facebook, or Flickr. In this example I use a Twitter #hashtag to collect the social network data.

NodeXL: http://nodexl.codeplex.com/
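
If you would rather script the collection step than drive it from the Excel add-in, the same idea fits in a few lines of Python. The sketch below assumes you have Twitter API credentials and the tweepy library (search_tweets is the tweepy 4.x method name; older versions call it search), and it writes a mention edge list that NodeXL can import:

    import tweepy

    # Placeholder credentials; supply your own Twitter API keys.
    auth = tweepy.OAuth1UserHandler(
        "CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET"
    )
    api = tweepy.API(auth)

    # Collect (author, mentioned_user) pairs from tweets matching a hashtag.
    edges = []
    for tweet in tweepy.Cursor(api.search_tweets, q="#bigdata").items(200):
        author = tweet.user.screen_name
        for mention in tweet.entities.get("user_mentions", []):
            edges.append((author, mention["screen_name"]))

    # Write an edge list that NodeXL (or any graph tool) can import.
    with open("hashtag_edges.csv", "w") as f:
        f.write("Vertex1,Vertex2\n")
        for a, b in edges:
            f.write(a + "," + b + "\n")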

Big Data at Astronomical Scale: HDF and HUDF

Scientists in general, and astronomers in particular, have been at the forefront when it comes to dealing with large amounts of data. These days, the “Big Data” community, as it is known, includes almost every scientific endeavor — and even you.

In fact, Big Data is not just about extremely large collections of information hidden in databases inside archives like the Barbara A. Mikulski Archive for Space Telescopes. Big Data includes the hidden data you carry with you all the time in now-ubiquitous smartphones: calendars, photographs, SMS messages, usage information, and records of our current and past locations. As we live our lives, we leave behind a “data exhaust” that tells something about who we are.

Star-Forming Region LH 95 in the Large Magellanic Cloud

…..

In late 1995, the Hubble Space Telescope took hundreds of exposures of a seemingly empty patch of sky near the constellation of Ursa Major (the Big Dipper). The Hubble Deep Field (HDF), as it is known, uncovered a mystifying collection of about 3,000 galaxies at various stages of their evolution. Most of the galaxies were faint, and from them we began to learn a story about our Universe that had not been told before.

……

So was the HDF unique? Were we just lucky to observe a crowded but faint patch of sky? To address this question, and determine if indeed the HDF was a “lucky shot,” in 2004 Hubble took a million-second-long exposure in a similarly “empty” patch of sky: The Hubble Ultra Deep Field (HUDF). The result was even more breathtaking. Containing an estimated 10,000 galaxies, the HUDF revealed glimpses of the first galaxies as they emerge from the so-called “dark ages” — the time shortly after the Big Bang when the first stars reheated the cold, dark universe. As with the HDF, the HUDF data was made immediately available to the community, and has spawned hundreds of publications and several follow-up observations.

Read Full Article at: http://hubblesite.org/blog/2012/04/data-exhaust/

Amazon AWS Free Usage Tier

Below are the highlights of AWS’s free usage tiers. All are available for one year (except Amazon SimpleDB, SQS, SNS, and CloudWatch, which are free indefinitely):

AWS Free Usage Tier (Per Month):

  • 750 hours of Amazon EC2 Linux Micro Instance usage (613 MB of memory and 32-bit and 64-bit platform support) – enough hours to run continuously each month*
  • 750 hours of Amazon EC2 Microsoft Windows Server Micro Instance usage (613 MB of memory and 32-bit and 64-bit platform support) – enough hours to run continuously each month*
  • 750 hours of an Elastic Load Balancer plus 15 GB data processing*
  • 30 GB of Amazon Elastic Block Storage, plus 2 million I/Os and 1 GB of snapshot storage*
  • 5 GB of Amazon S3 standard storage, 20,000 Get Requests, and 2,000 Put Requests*
  • 15 GB of bandwidth out aggregated across all AWS services*
  • 25 Amazon SimpleDB Machine Hours and 1 GB of Storage**
  • 100,000 Requests of Amazon Simple Queue Service**
  • 100,000 Requests, 100,000 HTTP notifications and 1,000 email notifications for Amazon Simple Notification Service**
  • 10 Amazon Cloudwatch metrics, 10 alarms, and 1,000,000 API requests**

In addition to these services, the AWS Management Console is available at no charge to help you build and manage your application on AWS.

* These free tiers are available to new AWS customers and to existing AWS customers who signed up for the Free Tier after October 20, 2010, and they last for 12 months following your AWS sign-up date. When your free usage expires, or if your application’s usage exceeds the free tier, you simply pay standard pay-as-you-go service rates (see each service page for full pricing details). Restrictions apply; see the offer terms for more details.

** These free tiers do not expire after 12 months and are available to both existing and new AWS customers indefinitely.

The AWS free usage tier applies to participating services across all AWS regions: US East (Northern Virginia), US West (Oregon), US West (Northern California), EU (Ireland), Asia Pacific (Singapore), Asia Pacific (Tokyo), and South America (São Paulo). Your free usage is calculated each month across all regions and automatically applied to your bill – free usage does not accumulate.
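
To see why the 750 instance-hours listed above are “enough hours to run continuously each month”, note that even the longest month needs only 24 x 31 = 744 hours. A trivial Python check of that arithmetic:

    import calendar

    FREE_INSTANCE_HOURS = 750  # monthly EC2 micro-instance free-tier allowance

    for month in range(1, 13):
        days = calendar.monthrange(2012, month)[1]  # 2012 is an arbitrary example year
        hours = days * 24
        # Every month fits inside the allowance, with a few hours to spare.
        print("month %2d: %d hours needed, %d to spare"
              % (month, hours, FREE_INSTANCE_HOURS - hours))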

http://aws.amazon.com/free/

A List of Top Open Source Projects of 2011 Compiled by OpenLogic

Overall, according to OpenLogic, the top 16 hottest open-source projects based on growth from 2010 to 2011 were:

1. HBase (http://hbase.apache.org/)

2. Node.js (http://nodejs.org/)

3. nginx (http://nginx.org/)

4. Hadoop (http://hadoop.apache.org/)

5. Ruby on Rails (http://rubyonrails.org/)

6. MongoDB (http://www.mongodb.org/)

7. Tomcat (http://tomcat.apache.org/)

8. MySQL (http://www.mysql.com/)

9. Apache HTTP Server (http://httpd.apache.org/)

10. Spring Framework (http://www.springsource.org/)

11. PostgreSQL (tie)  (http://www.postgresql.org/)

11. Grails (tie) (http://grails.org/)

12. Struts (http://struts.apache.org/)

13. JBoss (http://www.jboss.org/)

14. GlassFish (http://glassfish.java.net/)

15. CouchDB (http://couchdb.apache.org/)

OpenLogic also evaluated each category and classified projects as “trending up” (up across most or all metrics), “trending level” (up and down on a roughly equal number of metrics), or “trending down” (down across all or most metrics).

For the Application Server/Web Server category: Node.js and nginx were trending up; Tomcat and Apache HTTP Server were trending level; and JBoss and GlassFish were trending down.

For the Frameworks category: Ruby on Rails was trending up; Spring, Grails, and Struts were trending level; no projects were trending down.

For the Database and Big Data category: HBase, Hadoop, and MongoDB were trending up; MySQL and PostgreSQL were trending level; and CouchDB was trending down.

The OpenLogic report ranked projects year over year (January through November 2010 versus January through November 2011) based on growth or loss across several metrics. The data comes from several sources, including public data from Google searches. The report also evaluated OpenLogic OLEX search volume, views of packages, downloads, requests within corporations to use a project, and matches against a project during scans. OpenLogic OLEX is a software-as-a-service solution for the governance and provisioning of open-source software, used by many Fortune 100 companies. OpenLogic also aggregated data on customers purchasing support contracts from OpenLogic for each project, as well as projects that users deployed through OpenLogic CloudSwing, an open platform-as-a-service offering.


Source:

http://www.eweek.com/c/a/Linux-and-Open-Source/OpenLogic-Ranks-Top-OpenSource-Projects-of-2011-365276/

Scientific Computing in Cloud using MATLAB and Windows Azure

There are a couple of things you can do when planning to run MATLAB on Windows Azure. Here I will provide a few resources to get you started. I am also working on a sample based on the Windows Azure SDK 1.6 and the MATLAB Runtime 7.1 installer, which I will release shortly.

Understanding the Windows Azure HPC Scheduler SDK and how it works with Windows Azure in general:

Getting Started with Application Deployment with the Windows Azure HPC Scheduler (Document Walkthrough)

Windows Azure HPC Scheduler code sample: Overview (Video instructions – part 1)

Windows Azure HPC Scheduler code sample: Publish the service (Video instructions – part 2)

Step-by-step hands-on training guide for using Windows HPC with burst to Windows Azure:

Learn more about the Message Passing Interface (MPI): MPI is a platform-independent standard for messaging between HPC nodes. Microsoft MPI (MS-MPI) is the MPI implementation used for MPI applications executed by Windows HPC Server 2008 R2 SP2. The integration of Windows HPC Server 2008 R2 SP2 with Windows Azure supports running MPI applications on Windows Azure nodes.
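
The messaging model at the core of MPI is easy to see in a few lines of code. The sketch below uses the mpi4py Python bindings purely for illustration (an assumption on my part; the setup described in this post runs MS-MPI applications, and MATLAB ships its own MPI layer):

    # Run with: mpiexec -n 2 python mpi_hello.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # each process in the job gets a unique rank

    if rank == 0:
        # Rank 0 sends a small Python object to rank 1.
        comm.send({"payload": [1, 2, 3]}, dest=1, tag=11)
        print("rank 0: sent")
    elif rank == 1:
        data = comm.recv(source=0, tag=11)
        print("rank 1: received", data)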

You can use the Windows Azure HPC Scheduler (follow the link below for more info).

Using MATLAB with the Windows Azure HPC Scheduler SDK:

MATLAB’s Parallel Computing Toolbox includes an MPI implementation based on MPICH2.

Windows Azure HPC Scheduler allows you to spawn worker nodes on which MPI jobs can be run. With a local head node, and for compute-bound workloads, you could:

  • have MATLAB cloud-burst to Azure via MPI
  • use a local non-MATLAB Windows Server 2008 R2 head node and run MPI jobs using common numeric libraries

Installing MATLAB Runtime with your Windows Azure application:

To install the MCR (MATLAB Compiler Runtime) on Windows Azure you could do the following:

  1. Create a startup task to download the MCR zip and then install it (a sketch of such a task follows this list).
    1. Learn more about startup tasks here
    2. You can use the Azure Bootstrapper application to download and install it very easily
  2. If you are using a Worker Role and want to set a specific application as the role entry point, configure the role to launch your MCR-dependent executable directly.
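
A startup task is normally a small script; the sketch below expresses the same download-and-install steps in Python for illustration. The blob URL is a hypothetical placeholder, and the installer’s silent-install switches are an assumption; check the documentation for your MCR version:

    import subprocess
    import urllib.request
    import zipfile

    # Hypothetical blob-storage URL hosting the MCR package.
    MCR_URL = "https://myaccount.blob.core.windows.net/deps/MCRInstaller.zip"

    # 1. Download the MCR zip from blob storage.
    urllib.request.urlretrieve(MCR_URL, "MCRInstaller.zip")

    # 2. Unpack it next to the role code.
    with zipfile.ZipFile("MCRInstaller.zip") as zf:
        zf.extractall("mcr")

    # 3. Run the installer unattended (the switches below are an assumption;
    #    consult the MCR docs for the exact silent-install flags).
    subprocess.check_call(
        ["mcr\\MCRInstaller.exe", "-mode", "silent", "-agreeToLicense", "yes"]
    )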

Other Useful Resources:

Some case studies from Curtin University, Australia, using MATLAB and Windows Azure:

Several Microsoft Research projects, notably ModisAzure, use MATLAB on Azure; you can find them at the link below:

A presentation by Techila at Microsoft TechDays 2011 showed MATLAB code running in Windows Azure.