When someone is looking at cloud to find MapReduce to process your large amount of data, I think this is what you are looking for:
- A collection of machines which are Hadoop/MapReduce ready and instant available
- You just don’t want to build Hadoop(HDFS/MapReduce) instances from scratch because there are several IaaS service available give you hundreds of machines in cloud however building a Hadoop cluster will be nightmare.
- It means you just need to hook your data and push MapReduce jobs immediately
- Being in cloud, means you just want to harvest the power of thousands of machines available in cloud “instantly” and want to pay the cost of CPU usage per hour you will consume.
Here are a few options which are available now, which I tried before writing here:
Apache Hadoop on Windows Azure:
Microsoft also has Hadoop/MapReduce running on Windows Azure but it is under limited CTP, however you can provide your information and request for CTP access at link below:
The Developer Preview for the Apache Hadoop- based Services for Windows Azure is available by invitation.
Amazon: Elastic Map Reduce
Amazon Elastic MapReduce (Amazon EMR) is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Google Big Query:
Besides that you can also try Google BigQuery in which you will have to move your data to Google propitiatory Storage first and then run BigQuery on it. Remember BigQuery is based on Dremel which is similar to MapReduce however faster due to column based search processing.
Google BigQuery is invitation only however you sure can request for access:
There is another option is to use Mortar Data, as they have used python and pig, intelligently to write jobs easily and visualize the results. I found it very interesting, please have a look: