What is big data?


According to IBM, every day we create 2.5 quintillion (2.5 × 10¹⁸) bytes of data, so much that about 90% of the world's data today has been created in the last two years alone. This vast amount of data, generated so fast, poses a lot of challenges to data science and related fields in analyzing and utilizing it. This fast-growing, varied and difficult-to-handle data is called big data.

Big data is not a single technology but a combination of old and new technologies that help companies gain actionable insight. So big data is the capability to manage huge volumes of different data, at the right speed and within the right time frame, to allow real-time analysis and action.

The major challenges of big data are:

Volume: how much data is generated.

Velocity: how fast the data is generated and processed.

Variety: the different types of data.

Big data comprises almost all kinds of data available in the world, both structured and unstructured. Unstructured data is data that does not follow a particular data model; it can be text, sensor data, audio, video, images, click streams or log files, to name a few. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. Analysts have recently predicted that data will grow 800% over the next five years, and Computerworld says that unstructured information might account for more than 70-80% of all data in an organization. So it's extremely crucial to analyze and utilize these vast amounts of data for the benefit of the organization.

Global Market for Big data:

  • Digital information is growing at 57% per annum globally.
  • With global social network penetration and mobile internet penetration both under 20%, this growth has only just begun.
  • All the data generated is valuable, but only if it can be interpreted in a timely and cost effective manner.
  • IDC expects revenues for big data technology infrastructure to grow by 40% per annum for the next three years.

In 2006, IDC estimated that the world produced 0.18 zettabytes of digital information. That grew to 1.8 zettabytes in 2011 and is expected to reach 35 zettabytes by 2020.

A few statistics to demonstrate the 'big' part of big data:

  1. Twitter generates nearly 12 TB of data per day, about 58 million tweets per day.
  2. Every hour Walmart handles more than 1 million customer transactions, all of which are transferred into databases holding over 2.5 petabytes of information.
  3. According to FICO, the credit card fraud system currently in place helps protect over 2 billion accounts all over the globe.
  4. Facebook currently holds more than 45 billion photos across its entire user base, and the number is growing rapidly.
  5. Google processes about 20 PB of data daily, and monthly worldwide searches on Google sites total 87.8 billion.

Here are some interesting statistics from YouTube alone:

  1. More than 1 billion UNIQUE users visit YouTube every month.
  2. Over 4 billion hours of video are watched each month.
  3. 72 hours of video are uploaded every minute (it would take 3 days, without sleep, to watch just one minute's worth of uploads).

So, big data is the next big thing in the IT industry. To be successful in IT, it's really crucial to adopt big data analytics to make use of the exploding amount of data that's available now and in the future.


Understanding MapReduce

As Google puts it, MapReduce is a programming model and an associated implementation for processing and generating large data sets. MapReduce is the heart of Hadoop and big data analytics, and it is highly scalable, processing very large amounts of data using clusters of commodity PCs.

MapReduce consists of two major tasks (jobs), Map and Reduce, which do all the data processing. The Map task takes the given input, fed as key/value pairs, and generates intermediate key/value pairs based on the specific functional problem. The Reduce task takes the Map task's key/value output and combines it to produce a smaller set of final output.

A MapReduce Example:

Say we have a collection of sports documents containing the total score of each sportsperson in the various games he/she played over a career. We have to find the highest score of each sportsperson. Each document can contain more than one entry for a particular person. A score is represented as a key/value pair in the documents: the sportsperson's name is the key and his/her score is the value.

For simplicity we take only three sportspeople and their scores; here is one sample document.

Mike, 100

Joseph, 150

Robert, 200

Mike, 130

Joseph, 90

Robert, 160

We take four more documents of the same kind and process all of them using MapReduce. Initially, the Map jobs are fed these five documents to generate intermediate key/value pairs as output.

Here is the sample result from Map task out of all those five documents.

The result of the first document (given above) is:

(Mike, 130) (Joseph, 150) (Robert, 200)

Please note that each Map job produced the highest score of each person within its own task.

In the same way, the other four Map jobs (inputs not shown for simplicity) produced the following results:

(Joseph, 100) (Robert, 150) (Mike, 130)

(Joseph, 170) (Robert, 110)

(Mike, 120) (Joseph, 155) (Robert, 220)

(Mike, 170) (Joseph, 130) (Robert, 250)

Now all five Map jobs have produced the above five key/value results, and these are fed to the Reduce job to generate the final result. There can be five Reduce jobs or fewer, but they will have the capability to sync with each other to produce one final result. After all the results from the Map jobs are processed, the Reduce job produces the following result containing the highest score of each sportsperson.

(Mike, 170) (Joseph, 170) (Robert, 250)

As the name MapReduce implies, the Reduce job is always performed after the Map job. Most importantly, all of this work is done in parallel in a clustered environment of commodity machines. In our example, the five Map jobs run in parallel to produce five results, and these five results are fed into the Reduce job, which again runs in parallel to produce the final result. Because of this high degree of parallelism and scalability, any amount of data can be processed and generated using this framework in a reasonably short time.

In a real MapReduce environment, MapReduce is a software framework with a clean interface for plugging the Map and Reduce functionality into the framework. The user writes his own Map and Reduce logic and passes it to the framework, which uses that logic to process the input data and produce the output data. The framework hides most of the complexity of parallelization, fault tolerance, data distribution and load balancing in the library and lets the user focus only on the Map and Reduce functionality.
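The whole flow above can be sketched as a toy simulation in plain Python (this is not the actual Hadoop API; the `map_task`/`reduce_task` names are illustrative only):

```python
from collections import defaultdict

def map_task(document):
    """Map: emit each player's highest score within one document."""
    best = {}
    for line in document:
        name, score = line.split(",")
        score = int(score)
        if score > best.get(name, 0):
            best[name] = score
    return list(best.items())  # intermediate key/value pairs

def reduce_task(intermediate_pairs):
    """Reduce: combine all intermediate pairs into the global maximum per player."""
    grouped = defaultdict(list)
    for name, score in intermediate_pairs:
        grouped[name].append(score)
    return {name: max(scores) for name, scores in grouped.items()}

# The sample document from the article
doc1 = ["Mike,100", "Joseph,150", "Robert,200", "Mike,130", "Joseph,90", "Robert,160"]
intermediate = map_task(doc1)
print(reduce_task(intermediate))  # {'Mike': 130, 'Joseph': 150, 'Robert': 200}
```

In a real cluster, the framework would run many `map_task` calls in parallel on different nodes and shuffle their intermediate pairs to the reducers; here everything runs sequentially just to show the data flow.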

Scalable Application Architecture

Architecting a scalable system is challenging, and when it's done right it's rewarding. Top tech companies like Google and Amazon are world-famous for their versatile, high-performing systems, and those systems are versatile precisely because they are highly scalable. Simply put, scalability is the capability of a system to efficiently meet its growing demand and user base without adversely affecting its performance, cost or maintainability.

Performance is the capability of a system to perform its tasks efficiently in a given time and to utilize its resources well; it is among the most important features of a system. In a non-profit public application it may be tolerable for a task to take two to three seconds, whereas in commercial and mission-critical applications it is not at all tolerable when the time taken for certain tasks exceeds even two to three seconds. When a commercial system has just hit the market and is still gaining users, it is not yet vulnerable to scalability issues, and its performance can be considerably high. Once the application picks up in the market and system demand grows drastically over time, there is a good chance it becomes highly vulnerable to performance and scalability issues. When a system is not designed with future performance and scalability in mind, it will surely lose its hard-earned customers over time.

General techniques for scalable system design

Performance and scalability design and implementation happen in almost all phases of system development, from planning and requirement analysis through to the production environment. So here we will discuss how to make decisions about the performance and scalability capabilities of a system in three stages of the system life cycle.

  1. Planning and Requirement analysis phase
  2. Design and Development phase
  3. Production and Maintenance phase

1. Planning and Requirement analysis phase:

Major software design decisions and system modeling are made at this stage. Making the right, forward-looking decisions here is crucial to the success of the project. A system and performance model can be created to evaluate the system design before investing time and resources in implementing a flawed design.

The time, effort and money invested upfront in performance modeling should be proportional to project risk. For a project with significant risk, where performance is critical, you may spend more time and energy up front developing the model. For a project where performance is less of a concern, modeling might be simple.

A budget represents the constraints and enables us to specify how much we can spend (resource-wise) and how we plan to spend it. Constraints govern our total spending, and we then decide where to spend to reach that total. We assign budgets in terms of response time, throughput, latency and resource utilization.

The performance model we develop helps us capture the following important information upfront:

Application description: The design of the application in terms of its layers and its target infrastructure.
Scenarios: Critical and significant use cases, sequence diagrams, and user stories relevant to performance.
Performance objectives: Response time, throughput and resource utilization.
Budgets: Constraints we set on the execution of use cases, such as maximum execution time and resource utilization levels, including CPU, memory, disk I/O and network I/O.
Workload goals: Goals for the number of users, concurrent users, data volumes, and the desired use of the application.
Quality-of-service (QoS) requirements: Requirements such as security, maintainability and interoperability may impact performance, so we should have an agreement across the software and infrastructure teams about QoS restrictions and requirements.

2. Design and Development phase:

The design and development phase is another important phase of software development where we have to focus on performance and scalability requirements. Typically, any web application will have static content, business logic and data storage. It is wise to layer them separately so that each layer can scale independently of the others. A flexible architecture is important so that when a new change comes, the architecture can adapt easily without affecting performance or scalability.

Static content – consists of images (JPEG, GIF or other formats), HTML, JavaScript and CSS, and can be layered separately so that it can be hosted independently from the other parts of the system, leading to efficient scaling. Since static content is stateless, replicas can easily be created in a cluster environment for high availability. A Content Delivery Network spanning geographically different locations can be used for static content, which promotes high scalability and availability.

Business logic layer – consists of application logic written in C#, Java or another programming language; it should also be layered separately and be hostable on its own servers to promote scalability. A great approach is to implement and expose the business logic as web services, SOAP-based or REST-based; REST-based services are more scalable and interoperable than SOAP-based ones. Parallelism can also be applied easily to business logic when it is designed and implemented in a highly modular way, and parallel applications can perform really well when user demand grows and more users access the system concurrently. Another useful idea for this layer: when the logical parts of the system are grouped and layered separately by functional scenario, such as an order processing module versus a report processing module, it becomes easy to allocate more cluster nodes to a heavily accessed module like order processing than to a lightly accessed one like report processing.

3. Production and Maintenance phase:

In this final phase of the SDLC we see the actual result of all the effort we put into building the system. During the initial launch there may be no availability issues, as the user base is still growing and the system can easily cope with low or medium demand. When user counts shoot up beyond the capability of the system, that's when availability, performance and scalability issues start pouring in.

Suppose we have reached the threshold of the system's capability, beyond which it starts to experience issues. The following are the areas to target to sort out those issues.

  1. Performance tuning.
  2. Scaling up (vertical scaling).
  3. Scaling out (Horizontal scaling).

Performance tuning.

This step consists of refactoring the application's source code, analyzing its configuration settings, attempting to further parallelize its logic and implementing caching strategies. As an application matures over time, tuning in these areas may no longer be possible, since there won't be any scope left for further improvement.

Scaling up (vertical scaling)

Scaling up is adding more resources, such as more processor cores or more physical memory, to the existing server in order to increase the performance of the system when demand grows. Vertical scaling is relatively simple because it requires no changes to the application; the web application simply goes from running on a single node to running on a single node with more resources.

Even though vertical scaling is the simplest scaling technique, it has limitations. They can come from the operating system itself or from operational constraints like security, management or a provider's architecture.

Operating system               OS type   Physical memory limit
Windows Server 2008 Standard   32-bit    4 GB
Windows Server 2008 Standard   64-bit    32 GB
Linux                          32-bit    1 GB to 4 GB
Linux                          64-bit    4 GB to 32 GB

As the table shows, operating systems have limits on memory; beyond them the system cannot be upgraded, so performance cannot be increased past that point.

Scaling out (Horizontal scaling)

Scaling out refers to adding more nodes to form a cluster and making the application run on those nodes to provide high availability. In scaling terminology, this implies that the node on which a web application runs cannot be equipped with more CPU, memory, bandwidth or I/O capacity, and thus the web application is split to run on multiple boxes or nodes.

When we plan to horizontally scale our web application, we have to concentrate on how to scale out its three major layers: the static content layer, the business logic layer and the data storage layer.

Scaling out the static content layer is always easy because it is stateless: no matter which node of the cluster serves the static content, it is the same content. So when should we scale the static content layer? When we experience increased latency in loading static content, such as HTML, CSS and JavaScript files, in the browser.

When we plan to horizontally scale static content, the first thing to address is how to set up a master-slave architecture: the master is the node where you apply changes to the static content layer, and the slave node(s) receive updates from the master (replicated/synchronized) at predetermined times.
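The master-slave idea can be sketched with a toy in-memory model (illustrative Python only; real deployments use tools like rsync, scheduled jobs or CDN synchronization rather than classes like these):

```python
class StaticContentNode:
    """A node serving static files, modeled as a path -> content map."""
    def __init__(self):
        self.files = {}

class Master(StaticContentNode):
    """The only node where content changes are applied."""
    def publish(self, path, content):
        self.files[path] = content

def synchronize(master, slaves):
    """Push the master's current static content to every slave.
    In practice this runs at predetermined intervals."""
    for slave in slaves:
        slave.files = dict(master.files)

master = Master()
slaves = [StaticContentNode(), StaticContentNode()]
master.publish("/css/site.css", "body { margin: 0 }")
synchronize(master, slaves)
print(slaves[0].files["/css/site.css"])  # body { margin: 0 }
```

Note that between synchronization runs the slaves may serve slightly stale content; for static assets this is usually an acceptable trade-off for the scalability gained.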

Unlike the static content layer, the business logic layer is sensitive to scaling out because it maintains state. For example, if node 1 handles orders 1 to 5000 in an order processing system and node 2 handles orders 5001 to 10000, the customer who owns order 200 can carry out transactions as long as he/she is routed to node 1. A problem arises when he/she is routed to node 2, because node 2 has no clue about order 200.

One solution to this problem lies in specialized software such as Terracotta, GigaSpaces or Oracle Coherence, which solve the issue through replication and synchronization. The problem can also be addressed by making the business logic tier and the permanent storage tier work together: we can choose a database that scales horizontally well in a distributed cluster environment, and there are many such databases on the market, such as Apache Cassandra and Amazon SimpleDB.
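To make the routing problem concrete, here is a minimal partitioning sketch (illustrative Python only; real products like Terracotta or Cassandra use far more sophisticated replication and consistent hashing). The key point is that any deterministic rule that always sends the same order to the same node avoids the "node 2 has no clue about order 200" situation:

```python
NODES = ["node1", "node2"]

def route_by_range(order_id, orders_per_node=5000):
    """Range partitioning as in the article: orders 1-5000 -> node1,
    orders 5001-10000 -> node2."""
    return NODES[(order_id - 1) // orders_per_node]

def route_by_hash(order_id):
    """Hash partitioning: a deterministic function of the order id,
    so every front-end server computes the same destination node."""
    return NODES[order_id % len(NODES)]

print(route_by_range(200))  # node1 -- node2 would have no state for this order
print(route_by_hash(200))   # node1 -- always the same node for order 200
```

Either rule works only as long as every request for a given order is routed through the same function; replication (as the specialized products above provide) is what removes that constraint entirely.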

Scaling permanent storage tier is a huge topic by itself, so I am leaving it as a scope for my future articles. 🙂

Introductory demo on ASP.NET WebAPI

Welcome back. This is the second article in the series on the groundbreaking Microsoft REST-based SOA technology framework, ASP.NET Web API. I am going to show you a hands-on demo of implementing ASP.NET Web API using Microsoft Visual Studio 2012.

My previous article in this series on Introduction to ASP.NET WebAPI can be found in this link: https://alagesann.com/2012/10/31/a-detailed-introduction-to-asp-net-web-api/

If you already know ASP.NET MVC, you probably know most of what is needed to implement a Web API, because an ASP.NET Web API project is structured, implemented and routed almost like an ASP.NET MVC application, though there are some significant differences which we will explore shortly. So I assume here that you have at least a little knowledge of ASP.NET MVC.

I am using Visual Studio 2012 Ultimate for this demo.

You can create a simple Web API project in just 2 steps:

  1. Create a new project using the ASP.NET MVC 4 Web Application project template.

Give the project a name; I am using the name "BookStore". Choose a location to store your project and click OK.

2. In the following window, choose Web API to make the project ready for implementing a web API. You can choose any template to build a Web API project, but I am using the Web API template so it gives me a sample API controller automatically. Leave the Razor view engine selected and, since I am not using unit tests for now for simplicity, leave that check box unchecked.

You now land in the Visual Studio working window, where you can see an initial window like this:

Important points to note here with respect to WebApi are:

  1. The System.Web.Http assembly is referenced automatically for you to build the Web API functionality.
  2. Web API is installed by default as a NuGet package for projects you create in VS 2012. You can check this through the "Manage NuGet Packages" option in the context menu that appears when you right-click the project. "Microsoft ASP.NET Web API" is installed, as shown in the following NuGet Packages window:
  3. A default Web API controller named "ValuesController" is added to the "Controllers" folder; this controller derives from "ApiController" (we will explore why shortly). Most importantly, this controller has the methods "Get, Get by Id, Post, Put and Delete" implemented.
  4. A Web API route configuration file, "WebApiConfig.cs", is added to the "App_Start" folder.

Okay, enough analyzing the basic setup; let's run the project and try to navigate to the default APIs created by Visual Studio.

I am using Google Chrome, so let's use it to play with the URLs. When I try the URL /Values/Get/
to invoke the "Get" action of the ValuesController, it returns a 404 error. You would expect it to work, since that is how it works in ASP.NET MVC, but it doesn't.
So let's check the routing configuration for this Web API project. Routing is defined in WebApiConfig.cs as:

config.Routes.MapHttpRoute(
    name: "DefaultApi",
    routeTemplate: "api/{controller}/{id}",
    defaults: new { id = RouteParameter.Optional }
);

So it turns out that Web API routing works slightly differently from ASP.NET MVC routing: a Web API route has the word "api" before the controller name. So the URL must be /api/Values.
Also note from the route template that there is no action name after the controller name. So how is the appropriate action invoked when we hit the URL? That is another aspect in which Web API routing differs from ASP.NET MVC.

HTTP verbs like GET, POST, PUT and DELETE, used when invoking the URL, are mapped directly to the controller actions.

So when I hit the URL "/api/Values", it invokes the Get action of the ValuesController, because "GET" is the HTTP verb used internally by the browser, and I see the result as follows:

Cool, our API now works and returns the response as XML. But wait: we just returned an array of strings from the Get action, so how is it returned as XML? It's all the magic of the content-negotiation feature of the Web API framework.

Web API supports many content formats, from XML and JSON to ATOM, and best of all this capability can be extended programmatically.
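The idea behind content negotiation can be sketched roughly as follows (a toy Python illustration, not the actual Web API pipeline; the formatter registry and the `negotiate` function are assumptions made up for this demonstration):

```python
import json
import xml.etree.ElementTree as ET

def to_xml(values):
    """Serialize a list of strings the way an XML formatter might."""
    root = ET.Element("ArrayOfString")
    for v in values:
        ET.SubElement(root, "string").text = v
    return ET.tostring(root, encoding="unicode")

# A minimal "formatter registry": media type -> serializer
FORMATTERS = {
    "application/json": json.dumps,
    "application/xml": to_xml,
}

def negotiate(accept_header, data):
    """Pick the first media type in the Accept header that has a formatter;
    fall back to a default formatter, as Web API does."""
    for media_type in (m.strip() for m in accept_header.split(",")):
        if media_type in FORMATTERS:
            return media_type, FORMATTERS[media_type](data)
    return "application/json", json.dumps(data)

print(negotiate("application/xml", ["value1", "value2"]))
```

The same controller result thus becomes XML for one client and JSON for another, purely based on the request headers; the framework's real negotiator also weighs quality factors and custom formatters.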

Okay, XML is fine. Now, how do we retrieve the values as JSON? This needs a little more control over the request than the browser provides, so I am going to use the Fiddler tool, which you can download and install from http://www.fiddler2.com/fiddler2/version.asp; there are lots of videos and guides on how to use it. So let's use Fiddler to get the JSON representation of the response.

My request header looks like this:

Host: localhost:8425
User-Agent: Fiddler
Content-Type: application/json; charset=utf-8

And hit Execute.

We get the response as:

Note the JSON format of the response.


We explored how to create a sample ASP.NET Web API project, its structure and its routing mechanism. We also executed the APIs using the browser and Fiddler. In the next article, we will explore how to implement a real service project, a BookStore service, using ASP.NET Web API.

A detailed Introduction to ASP.NET Web API

This is a series of articles on Microsoft's new framework, ASP.NET Web API, released as part of ASP.NET MVC 4. ASP.NET Web API is Microsoft's recent, solid answer for HTTP-oriented services, in other words REST services. Microsoft defines it as "a framework that makes it easy to build HTTP services that reach a broad range of clients, including browsers and mobile devices. ASP.NET Web API is an ideal platform for building RESTful applications on the .NET Framework."

HTTP oriented RESTful services

HTTP-oriented RESTful services are services that use HTTP verbs for their CRUD functionality: the GET verb for read/retrieve, POST for creation, PUT for update and DELETE for removal.

For example, if you are creating an online bookstore application, you will use:

  1. The HTTP GET verb for retrieving the details of all customers of the book store, or the details of a particular customer.
  2. The HTTP POST verb for creating a new customer when he/she signs up for your online book store.
  3. The HTTP PUT verb for updating existing customer information such as name, email, address or a subscription plan change.
  4. The HTTP DELETE verb for removing a customer from the book store's customer list when he/she terminates the subscription. Whether you remove the customer permanently or just set a flag marking the customer as removed, without deleting the entire account, is an internal detail determined by the application requirements.
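The verb-to-operation mapping above can be sketched with a tiny in-memory store (a Python illustration of the REST semantics, not actual ASP.NET code; the `BookStoreCustomers` class is made up for this example):

```python
class BookStoreCustomers:
    """Maps the four HTTP verbs to CRUD operations on an in-memory customer store."""
    def __init__(self):
        self._store = {}
        self._next_id = 1

    def get(self, customer_id=None):     # HTTP GET: read one customer, or all
        if customer_id is None:
            return list(self._store.values())
        return self._store.get(customer_id)

    def post(self, customer):            # HTTP POST: create a new customer
        customer_id = self._next_id
        self._next_id += 1
        self._store[customer_id] = {"id": customer_id, **customer}
        return customer_id

    def put(self, customer_id, changes): # HTTP PUT: update an existing customer
        self._store[customer_id].update(changes)

    def delete(self, customer_id):       # HTTP DELETE: remove the customer
        self._store.pop(customer_id, None)

api = BookStoreCustomers()
cid = api.post({"name": "Mike", "email": "mike@example.com"})
api.put(cid, {"email": "mike.new@example.com"})
print(api.get(cid))  # {'id': 1, 'name': 'Mike', 'email': 'mike.new@example.com'}
api.delete(cid)
print(api.get())     # []
```

In a real Web API controller the framework dispatches each incoming verb to the matching action for you; the sketch only shows which operation each verb conventionally performs.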

HTTP-oriented services have been in use for a few years already and have been a trendy way to implement web services (I mean services that run over the web, not classic Web Services specifically, as there is a major difference between an HTTP service and a classic Web Service). But the real push towards HTTP services came when hand-held and portable devices like smartphones and tablets evolved to a great extent, since they use rich internet applications with rich user interfaces.

The main driving forces of HTTP oriented services are:

  1. Hand-held portable devices like smartphones and tablets: these revolutionary devices brought a whole new computing trend by bringing lots and lots of apps with great user interfaces that need large amounts of dynamic data from their servers.
  2. Browser-based applications, such as games, weather apps and personal apps, that run as embedded apps in the browser with rich user interfaces.
  3. RIA/SPA (Single Page Applications) on the web, which need dynamic data far more frequently than plain HTML from their servers; that data is rendered on the page using client technologies like jQuery, Knockout and others.
  4. Windows 8, which brought a whole new trend of desktop apps that run on the Windows desktop like tablet apps. Those apps send and receive data over the net, and they can be Microsoft's own apps or third-party apps installed from the Microsoft store.

How ASP.NET Web API differs from (and improves on) classic Web Services and WCF:

Though both frameworks can be used to develop services that support heterogeneous devices and platforms, their main difference lies in their communication protocols: classic Web Services use SOAP, while ASP.NET Web API uses HTTP as its communication protocol under the hood.

Diagram 1: Heterogeneous devices, possibly with various OSes and applications, communicating over SOAP; a classic Web Service scenario.

Diagram 2: Heterogeneous devices, possibly with various OSes and applications, communicating over the HTTP protocol; an ASP.NET Web API scenario.

Advantages of ASP.NET Web API, which uses the HTTP protocol:

  1. HTTP messages can be cached, are firewall friendly and are lightweight.
  2. They can be encrypted and can be processed even by the processors in mobile devices.
  3. Most of the devices in the world can send and receive HTTP messages, so the reach is huge.
  4. ASP.NET Web API provides many features, like content negotiation and URI routing, out of the box, whereas in WCF you have to implement them with some plumbing.
  5. Full support for ASP.NET routing and filters.
  6. Strong support for a variety of output formats like JSON, XML and ATOM.
  7. Easy to unit test, with self-hosting support provided.

Comparing ASP.NET Web API with Web Forms and MVC:

  1. Web API is designed especially for the current trend of rich clients/applications that run in the web browser, are developed with client technologies like jQuery and Knockout, and consume lightweight services over the HTTP protocol. In such cases the web server's main role is not just serving HTML content; it is really about accepting data and returning data in different formats. The clients might not even be browsers; they can be mobile devices.
  2. Web Forms and MVC, on the other hand, are mainly for serving HTML. They can do the other things Web API does, but that takes a lot of plumbing work. In other words, MVC is page/HTML oriented while Web API is API oriented.


ASP.NET Web API is definitely a strong framework for building RESTful services compared to existing alternatives like ASP.NET MVC and WCF. I strongly recommend it to anyone planning a service implementation targeting global reach.

Let's dive into some coding with ASP.NET Web API in the next article of this series.

Next article in this series is here :


Thanks for reading.