What is big data?


According to IBM, we create 2.5 quintillion (2.5 × 10^18) bytes of data every day, so much that about 90% of the world’s data today has been created in the last two years alone. Data generated this fast and in this volume poses many challenges for data science and related fields in analyzing and using it. This fast-growing, varied, and hard-to-handle data is called big data.

Big data is not a single technology but a combination of old and new technologies that help companies gain actionable insight. In that sense, big data is the capability to manage a huge volume of disparate data, at the right speed and within the right time frame, to allow real-time analysis and action.

The major challenges of big data are:

Volume: How much data there is.

Velocity: How fast the data is processed.

Variety: The different types of data.

Big data comprises almost every kind of data available in the world, both structured and unstructured. Unstructured data is data that does not follow a particular data model; it can be text, sensor data, audio, video, images, click streams, or log files, to name a few. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. More recently, analysts have predicted that data will grow 800% over the next five years. Computerworld says that unstructured information might account for more than 70-80% of all data in an organization. So it’s crucial to analyze and use these vast amounts of data for the benefit of the organization.

Global Market for Big data:

  • Digital information is growing at 57% per annum globally.
  • With global social network penetration and mobile internet penetration both under 20%, this growth has only just begun.
  • All the data generated is valuable, but only if it can be interpreted in a timely and cost-effective manner.
  • IDC expects revenues for big data technology infrastructure to grow by 40% per annum for the next three years.

IDC estimated that the world produced 0.18 zettabytes of digital information in 2006. That figure grew to 1.8 zettabytes in 2011 and is projected to reach 35 zettabytes by 2020.
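To put those IDC figures on a per-year basis, here is a small sketch that computes the compound annual growth rate they imply. The helper name `cagr` is made up for this illustration, and the calculation assumes smooth year-over-year growth between the quoted data points.

```python
def cagr(start, end, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end / start) ** (1 / years) - 1


# Figures in zettabytes, taken from the IDC estimates quoted above.
growth_2006_2011 = cagr(0.18, 1.8, 2011 - 2006)
growth_2011_2020 = cagr(1.8, 35, 2020 - 2011)

print(f"2006-2011: {growth_2006_2011:.1%} per year")
print(f"2011-2020: {growth_2011_2020:.1%} per year")
```

The 2006-2011 period works out to about 58.5% per year, in the same range as the 57% per annum figure in the list above, while the projection to 2020 implies a slower rate of roughly 39% per year.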

A few statistics to demonstrate the ‘big’ part of big data:

  1. Twitter generates nearly 12 TB of data per day from 58 million tweets per day.
  2. Walmart handles more than 1 million customer transactions every hour. All of this information is fed into databases holding over 2.5 petabytes of data.
  3. According to FICO, the credit card fraud detection system currently in place helps protect over 2 billion accounts around the globe.
  4. Facebook currently holds more than 45 billion photos across its user base, and the number is growing rapidly.
  5. Google processes about 20 PB of data daily, and monthly worldwide searches on Google sites total 87.8 billion.

Here are some interesting statistics from YouTube alone:

  1. More than 1 billion unique users visit YouTube every month.
  2. Over 4 billion hours of video are watched each month.
  3. 72 hours of video are uploaded every minute. (It would take 3 days to watch them all without sleep.)

So big data is the next big thing happening to the IT industry. To be successful in the industry, it’s crucial to adopt big data analytics and make use of the exploding amount of data that’s available now and will be in the future.


Understanding MapReduce

As Google says, MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce is the heart of Hadoop and big data analytics, and it scales to very large amounts of data using commodity PCs.

MapReduce consists of two major tasks, Map and Reduce, which do all the data processing. The first is the Map task, which takes the given inputs, fed as key/value pairs, and generates intermediate key/value pairs according to the problem being solved. The second is the Reduce task, which takes the Map task’s key/value output and combines it to produce a smaller set of final output.

A MapReduce Example:

Say we have a collection of sports documents containing the scores of each sportsperson from the various games they played in their career. We want to find the highest score of each sportsperson. Each document can contain more than one entry for a particular sportsperson. Scores are represented as key/value pairs in the documents: the sportsperson’s name is the key and the score is the value.

For simplicity, we take only three sportspeople and their scores. Here is one sample document:

Mike, 100

Joseph, 150

Robert, 200

Mike, 130

Joseph, 90

Robert, 160

We take four more documents of the same form and process all five using MapReduce. First, the Map tasks are fed these five documents and generate intermediate key/value pairs as output.

Here are the sample results of the Map tasks for those five documents.

The result for the first document (given above) is:

(Mike, 130) (Joseph, 150) (Robert, 200)

Note that the Map task emitted only the highest score of each person found in its document.
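The Map step for the first document can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `map_max_scores` is made up, and the parsing assumes every non-empty line has the form "name, score". In Hadoop terms, keeping only the local maximum is work often given to a combiner; here it is folded into the Map step, as the example describes.

```python
def map_max_scores(document):
    """Map task: parse "name, score" lines and keep each person's highest score."""
    best = {}
    for line in document.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        name, score = line.split(",")
        name, score = name.strip(), int(score)
        if name not in best or score > best[name]:
            best[name] = score
    return list(best.items())


first_document = """
Mike, 100
Joseph, 150
Robert, 200
Mike, 130
Joseph, 90
Robert, 160
"""

print(map_max_scores(first_document))
# [('Mike', 130), ('Joseph', 150), ('Robert', 200)]
```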

In the same way, the other four Map tasks (inputs not shown, for simplicity) produced the following results:

(Joseph, 100) (Robert, 150) (Mike, 130)

(Joseph, 170) (Robert, 110)

(Mike, 120) (Joseph, 155) (Robert, 220)

(Mike, 170) (Joseph, 130) (Robert, 250)

The five Map tasks have now produced the five sets of key/value pairs above. These five results are fed to the Reduce task to generate the final result. There can be five Reduce tasks or fewer, but they coordinate among themselves to produce one final result. After all the results from the Map tasks are processed, the Reduce task produces the following result containing the highest score of each sportsperson.

(Mike, 170) (Joseph, 170) (Robert, 250)
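As a rough sketch, the Reduce step over the five Map results listed above amounts to taking the maximum per key. The name `reduce_max_scores` is made up for this illustration, and a single reducer is assumed.

```python
# The five Map results from the example, one list per document.
map_outputs = [
    [("Mike", 130), ("Joseph", 150), ("Robert", 200)],
    [("Joseph", 100), ("Robert", 150), ("Mike", 130)],
    [("Joseph", 170), ("Robert", 110)],
    [("Mike", 120), ("Joseph", 155), ("Robert", 220)],
    [("Mike", 170), ("Joseph", 130), ("Robert", 250)],
]


def reduce_max_scores(outputs):
    """Reduce task: combine all intermediate pairs into one maximum per name."""
    final = {}
    for pairs in outputs:
        for name, score in pairs:
            final[name] = max(score, final.get(name, score))
    return final


print(reduce_max_scores(map_outputs))
# {'Mike': 170, 'Joseph': 170, 'Robert': 250}
```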

As the name MapReduce implies, the Reduce task always runs after the Map task. Most importantly, all of this work is done in parallel on a cluster of commodity machines. In our example, the five Map tasks first run in parallel to produce five results, and these results are fed into the Reduce tasks, which again run in parallel to produce the final result. Because of this high degree of parallelism and scalability, very large amounts of data can be processed in a reasonable amount of time.
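To illustrate how the Reduce side can also run in parallel, here is a sketch that splits the intermediate pairs across two reducers by key and runs them on a thread pool. Everything here is an assumption made for illustration: the number of reducers, the CRC32-based partitioning, and the use of threads on one machine standing in for separate cluster nodes. A real framework distributes this work across machines.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# The five Map results from the example above.
map_outputs = [
    [("Mike", 130), ("Joseph", 150), ("Robert", 200)],
    [("Joseph", 100), ("Robert", 150), ("Mike", 130)],
    [("Joseph", 170), ("Robert", 110)],
    [("Mike", 120), ("Joseph", 155), ("Robert", 220)],
    [("Mike", 170), ("Joseph", 130), ("Robert", 250)],
]

NUM_REDUCERS = 2  # assumption for this sketch


def partition(name, num_reducers):
    """Pick a reducer for a key; all pairs with the same key go to the same one."""
    return zlib.crc32(name.encode("utf-8")) % num_reducers


def shuffle(outputs, num_reducers):
    """Group intermediate values by key, one bucket of keys per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in outputs:
        for name, score in pairs:
            buckets[partition(name, num_reducers)][name].append(score)
    return buckets


def reduce_task(bucket):
    """Reduce task: highest score for every name in this reducer's bucket."""
    return {name: max(scores) for name, scores in bucket.items()}


with ThreadPoolExecutor(max_workers=NUM_REDUCERS) as pool:
    partial_results = list(pool.map(reduce_task, shuffle(map_outputs, NUM_REDUCERS)))

# Each key lives in exactly one partition, so merging is a plain union.
final = {}
for part in partial_results:
    final.update(part)

print(sorted(final.items()))
# [('Joseph', 170), ('Mike', 170), ('Robert', 250)]
```

Because each name is routed to exactly one reducer, the reducers never need to see each other’s data, which is what lets them run independently.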

In a real deployment, MapReduce is a software framework with an interface that makes it easy to plug in the Map and Reduce functions. Typically the user writes their own Map and Reduce logic and passes it to the framework, which uses it to process the input data and produce the output. The framework hides most of the complexity of parallelization, fault tolerance, data distribution, and load balancing in the library and lets the user focus only on the Map and Reduce logic.
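The plug-in idea can be sketched with a toy, single-process driver. This is not the API of Hadoop or any real library: `run_mapreduce`, `score_mapper`, `max_reducer`, and `count_reducer` are hypothetical names, and the driver leaves out everything a real framework provides (distribution, fault tolerance, load balancing).

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver: apply map_fn to each record, group by key, then reduce each group."""
    intermediate = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}


def score_mapper(line):
    """User-supplied Map logic: turn "name, score" into a (name, score) pair."""
    name, score = line.split(",")
    yield name.strip(), int(score)


def max_reducer(name, scores):
    """User-supplied Reduce logic: highest score for one person."""
    return max(scores)


def count_reducer(name, scores):
    """Alternative Reduce logic: number of entries for one person."""
    return len(scores)


# Entries from the sample document in the example above.
lines = ["Mike, 100", "Joseph, 150", "Robert, 200",
         "Mike, 130", "Joseph, 90", "Robert, 160"]

print(run_mapreduce(lines, score_mapper, max_reducer))
# {'Mike': 130, 'Joseph': 150, 'Robert': 200}
print(run_mapreduce(lines, score_mapper, count_reducer))
# {'Mike': 2, 'Joseph': 2, 'Robert': 2}
```

Swapping `max_reducer` for `count_reducer` changes the result without touching the driver, which is the separation of concerns the framework is meant to give the user.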