Big Data – From Velocity, Variety and Volume to Variability and Complexity

Previous Definition: Velocity, Variety and Volume

New Definition: Velocity, Variety and Volume + Variability and Complexity



A Collection of Big Data Books from Packt Publishing

I found that Packt Publishing has several great books on Big Data, and here is a collection of the ones I found most useful:

Packt is giving its readers a chance to dive into its comprehensive catalog of over 2000 books and videos for the next 7 days with its LevelUp program:


Packt is offering all of its eBooks and videos at just $10 each or less.

The more EXP customers want to gain, the more they save:

  • Any 1 or 2 eBooks/Videos – $10 each
  • Any 3 to 5 eBooks/Videos – $8 each
  • Any 6 or more eBooks/Videos – $6 each

More information is available at bit.ly/Yj6oWq | bit.ly/1yu4679

For full details, please visit www.packtpub.com/packt/offers/levelup

Big Data $1B Club – Top 20 Players

Here is a list of the top players in the Big Data world with direct or indirect influence over a billion dollars (or more) in Big Data projects (in no particular order):

  1. Microsoft
  2. Google
  3. Amazon
  4. IBM
  5. HP
  6. Oracle
  7. VMware
  8. Teradata
  9. EMC
  10. Facebook
  11. GE
  12. Intel
  13. Cloudera
  14. SAS
  15. 10gen
  16. SAP
  17. Hortonworks
  18. MapR
  19. Palantir
  20. Splunk

The list is based on each company's direct or indirect involvement in Big Data, whether or not through a dedicated product. All of the above companies are involved in Big Data projects worth a billion dollars or more.

Watch Spark Summit 2014 on UStream

You can check out the Spark Summit 2014 agenda here: http://spark-summit.org/2014/agenda

Ustream Sessions:

http://www.ustream.tv/channel/spark-summit-2014

Please register at the summit site to get more detailed information.

Keywords: Spark Summit, Hadoop, Spark

Apache Ambari 1.6.0 with Blueprints support is released

What is an Ambari Blueprint?

An Ambari Blueprint allows an operator to instantiate a Hadoop cluster quickly and to reuse the blueprint to replicate that cluster elsewhere, for example as development and test clusters, staging clusters, performance-testing clusters, or co-located clusters.
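As a rough sketch of how this works in practice, a blueprint is a JSON document POSTed to Ambari's REST API, followed by a second POST that maps its host groups onto real hosts. In the Python sketch below, the Ambari host, credentials, cluster name, and component layout are all placeholder assumptions:

import json
import requests

AMBARI = "http://ambari.example.com:8080/api/v1"  # hypothetical Ambari server
AUTH = ("admin", "admin")                         # default Ambari credentials
HEADERS = {"X-Requested-By": "ambari"}            # header required by the Ambari API

# 1. Register a single-node blueprint for the HDP 2.1 stack.
blueprint = {
    "Blueprints": {"blueprint_name": "dev-blueprint",
                   "stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [{
        "name": "master",
        "cardinality": "1",
        "components": [{"name": "NAMENODE"}, {"name": "SECONDARY_NAMENODE"},
                       {"name": "DATANODE"}, {"name": "RESOURCEMANAGER"},
                       {"name": "NODEMANAGER"}, {"name": "ZOOKEEPER_SERVER"}],
    }],
}
requests.post(AMBARI + "/blueprints/dev-blueprint",
              data=json.dumps(blueprint), auth=AUTH, headers=HEADERS)

# 2. Instantiate a cluster from the blueprint by mapping the host group
#    onto a concrete host; Ambari then installs and starts everything.
cluster = {
    "blueprint": "dev-blueprint",
    "host_groups": [{"name": "master",
                     "hosts": [{"fqdn": "node1.example.com"}]}],
}
requests.post(AMBARI + "/clusters/dev",
              data=json.dumps(cluster), auth=AUTH, headers=HEADERS)

Because the blueprint itself carries no host names, the same document can be replayed against a different host mapping to stamp out the test, staging, or performance clusters mentioned above.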

Release URL: http://www.apache.org/dyn/closer.cgi/ambari/ambari-1.6.0

Ambari now supports PostgreSQL:

Ambari now extends database support for the Ambari DB, Hive, and Oozie to include PostgreSQL. This means Ambari now supports the key databases used in enterprises today: PostgreSQL, MySQL, and Oracle. The PostgreSQL configuration choice is reflected in Ambari's database support matrix.


Content Source: http://hortonworks.com/blog/apache-ambari-1-6-0-released-blueprints

Learning Cloudera Impala – Book Availability

Learning Cloudera Impala:

Learning Cloudera Impala is for those who really want to take advantage of their Hadoop cluster by processing extremely large amounts of raw data in Hadoop at real-time speed. Prior knowledge of Hadoop and some exposure to Hive and MapReduce is expected.

You will learn from this book:

  • Understand the various ways of installing Impala in your Hadoop cluster
  • Use the Impala shell API to interact with Impala components
  • Utilize the Impala Query Language and built-in functions to work with data (see the sketch after this list)
  • Administer and fine-tune Impala for high availability
  • Identify and troubleshoot problems in a variety of ways
  • Get acquainted with various input data formats in Hadoop and how to use them with Impala
  • Comprehend how third-party applications can connect with Impala to provide data visualization and various other enhancements
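To give a flavor of the query-language material, here is a minimal sketch of running a query from Python with the impyla package (pip install impyla); the host, port, and the web_logs table are hypothetical placeholders:

from impala.dbapi import connect

# Connect to an impalad daemon; 21050 is the default HiveServer2-protocol port.
conn = connect(host="impalad.example.com", port=21050)
cur = conn.cursor()

# Run a simple aggregate over a hypothetical table and fetch the results.
cur.execute("SELECT status_code, COUNT(*) AS hits "
            "FROM web_logs GROUP BY status_code")
for status_code, hits in cur.fetchall():
    print(status_code, hits)

cur.close()
conn.close()

The third-party BI and visualization tools mentioned in the last bullet typically talk to the same HiveServer2-compatible port via ODBC/JDBC.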


20 TB Earth Science Dataset on AWS from NASA NEX Available to the Public

AWS has been working with the NASA Earth Exchange (NEX) team to make it easier and more efficient for researchers to access and process earth science data. The goal is to make a number of important data sets accessible to a wider audience of full-time researchers, students, and citizen scientists. This important new project is called OpenNEX. Up until now, it has been logistically difficult for researchers to gain easy access to this data due to its dynamic nature and immense size (tens of terabytes). Limitations on download bandwidth, local storage, and on-premises processing power made in-house processing impractical.


Access Dataset: s3://nasanex/NEX-DCP30

Consult the detail page and the tech note to learn more about the provenance, format, structure, and attribution requirements.
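Since the bucket is public, a minimal sketch for browsing it with boto3 (pip install boto3) looks like the following; the listing uses anonymous (unsigned) requests, and the commented-out download key is illustrative only:

import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Anonymous (unsigned) client: no AWS credentials are needed for a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List the first few objects under the NEX-DCP30 prefix.
resp = s3.list_objects_v2(Bucket="nasanex", Prefix="NEX-DCP30/", MaxKeys=20)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one object locally (substitute a key taken from the listing above):
# s3.download_file("nasanex", "<key-from-listing>", "local-file.nc")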

NASA Earth Exchange (NEX):

The NASA Earth Exchange (NEX) Downscaled Climate Projections (NEX-DCP30) dataset comprises downscaled climate scenarios for the conterminous United States, derived from the General Circulation Model (GCM) runs conducted under the Coupled Model Intercomparison Project Phase 5 (CMIP5) [Taylor et al. 2012] across the four greenhouse-gas emissions scenarios known as Representative Concentration Pathways (RCPs) [Meinshausen et al. 2011] developed for the Fifth Assessment Report of the Intergovernmental Panel on Climate Change (IPCC AR5). The dataset includes downscaled projections from 33 models, as well as ensemble statistics calculated for each RCP from all available model runs. The purpose of these datasets is to provide high-resolution, bias-corrected climate-change projections that can be used to evaluate climate-change impacts on processes that are sensitive to finer-scale climate gradients and to the effects of local topography on climate conditions.

Each of the climate projections includes monthly averaged maximum temperature, minimum temperature, and precipitation for the periods from 1950 through 2005 (Retrospective Run) and from 2006 to 2099 (Prospective Run).
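As a sketch of what reading one of these files looks like with the netCDF4 Python package (pip install netCDF4): the file name below and the CMIP5-style variable name tasmax (monthly maximum near-surface air temperature) are assumptions, so check the tech note for the exact layout:

from netCDF4 import Dataset

# Open one monthly projection file (name is illustrative; pull a real key
# from the s3://nasanex/NEX-DCP30 listing shown earlier).
ds = Dataset("tasmax_amon_BCSD_rcp85_r1i1p1_CONUS_CCSM4_200601-201012.nc")

print(ds.variables.keys())       # e.g. time, lat, lon, tasmax
tasmax = ds.variables["tasmax"]  # shape: (time, lat, lon) on the 30-arc-second grid

# Mean of the first monthly slice; netCDF4 returns a masked array, so
# fill values are excluded from the mean automatically.
print(float(tasmax[0, :, :].mean()))

ds.close()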

Website: NASA NEX

Summary

  • Short Name: NEX-DCP30
  • Version: 1
  • Format: netCDF4 classic
  • Spatial Coverage: CONUS
  • Temporal Coverage:
    • 1950 – 2005 historical or 2006 – 2099 RCP
  • Data Resolution:
    • Latitude Resolution: 30 arc second
    • Longitude Resolution: 30 arc second
    • Temporal Resolution: monthly
  • Data Size:
    • Total Dataset Size: 17 TB
    • Individual file size: 2 GB

Learn more about NEX – NASA Earth Exchange Downscaled Project

NEX Virtual Workshop: https://nex.nasa.gov/nex/projects/1328/