Monthly Archives: May 2013

Boyer-Moore Algorithm

Unlike the Knuth-Morris-Pratt (KMP) algorithm, which runs in linear time, the Boyer-Moore algorithm can search for a pattern in sublinear time. Until KMP, people worked hard to come up with an algorithm that could search for a pattern in linear time, i.e., O(M+N) time, where M is the total length of the pattern and N is the total length of the string being searched.

Boyer and Moore set out to design an algorithm that works even more efficiently than KMP, in sublinear time. By “sublinear” they mean that the algorithm can find a pattern in the string while inspecting only a minimal number of characters rather than every character in the string. They devised a clever way to skip inspecting some characters and still reach a successful match whenever the pattern exists in the string.

Though the Boyer-Moore algorithm usually works in sublinear time, it has a few limitations and in some cases ends up working in linear time. Search efficiency increases as the pattern gets longer, since a longer pattern gives more characters to skip. Efficiency falls back to linear time when the pattern is only a single character.

The idea of how it works:

The main idea behind the algorithm is that it gains more information by matching the pattern from right to left rather than the usual left to right. Assume the pattern has characters p0, p1, p2, …, pm and the string being searched has characters s0, s1, s2, …, sn. The algorithm gains its edge by comparing sm with pm, sm-1 with pm-1, and so on down to s0 with p0.

An example: (Image 1) Assume that there is no space between the characters in either the pattern or the string, and that the search happens at the place where the upward arrow points.

Since the pattern length is 7, the search starts at the string location “F” and steps left one character at a time. Searching is done by picking the character in the string pointed to by the upward arrow and looking for that character in the pattern.

The character “F” from the string is compared with the char “T” of the pattern, then with “A”, then with “H”, and so on. Since the char “F” does not match ANY character of the pattern, we can confirm that the whole pattern cannot match at the current position in the string. By current position, I mean the span from the string char aligned with the first char of the pattern to the string char aligned with the last char of the pattern (that’s where the actual searching happens now). In this case, the pattern cannot match anywhere from “W” to “F”.

The key point is that for a match to succeed at this position, the char “F” would have to appear in at least one place in the pattern. Since it doesn’t appear even once, we can confirm the pattern won’t match at the current position. So we can safely skip comparing all the other chars to its left, from “-” back to “W” in the string, against the chars of the pattern.

Boyer-Moore observation 1:  If the char is known not to occur in the pattern, then we need not consider the possibility of an occurrence of the pattern starting at string positions 1, 2, 3, …, or pattern length: such an occurrence would require that char to be a character of the pattern.

According to observation 1, we can safely skip the first 7 chars of the string, and the first char of the pattern is now aligned with the 8th char of the string.

(Image 2) Now matching covers the chars from “I” to “-” in the string, and the upward arrow points at “-”; that’s where the matching starts.

“-” is compared with the chars of the pattern from right to left, and it matches at the fifth position of the pattern. Importantly, only the first (rightmost) occurrence in the pattern is considered. When such a match occurs, the pattern has to be realigned with the string so that the matched char of the pattern lines up with the matching char of the string. In this case, the pattern is realigned to match those two hyphens. After realignment the arrow, too, has to be moved forward by 4 chars so that it again points at the string char aligned with the last char of the pattern.

Since our search char “-” matched at the fifth position of the pattern (counting from the right), not the first position (the rightmost char of the pattern), we know for sure that it’s pointless to compare the remaining chars of the string from “Y” to “I”. So moving the pattern right by 4 is safe.

Boyer-Moore observation 2:  If the last (rightmost) occurrence of the char in the pattern is delta1 characters from the right end of the pattern, then we know we can slide the pattern down delta1 positions without checking for matches.
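
To make observations 1 and 2 concrete, here is a minimal Java sketch of the delta1 (“bad character”) table they describe. It assumes plain ASCII text, and it assumes the walkthrough’s pattern is “AT-THAT” (inferred from the skip distances in the images); the class and method names are illustrative, not taken from any library.

```java
import java.util.Arrays;

// Minimal sketch of the delta1 ("bad character") table behind observations 1 and 2.
// Assumes plain ASCII text; the class and method names are illustrative only.
public class BadCharacterTable {

    static int[] buildDelta1(String pattern) {
        int m = pattern.length();
        int[] delta1 = new int[128];
        // Observation 1: a char that never occurs in the pattern allows a
        // shift of the full pattern length.
        Arrays.fill(delta1, m);
        // Observation 2: otherwise the shift is the distance from the char's
        // rightmost occurrence in the pattern to the right end of the pattern.
        for (int j = 0; j < m; j++) {
            delta1[pattern.charAt(j)] = m - 1 - j;
        }
        return delta1;
    }

    public static void main(String[] args) {
        int[] delta1 = buildDelta1("AT-THAT");   // assumed walkthrough pattern
        System.out.println(delta1['F']);         // 7: 'F' is not in the pattern (observation 1)
        System.out.println(delta1['-']);         // 4: rightmost '-' is 4 chars from the right end (observation 2)
    }
}
```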

Current position after applying observation 2

(Image 3) Now matching starts at “T”, and the current position taken for matching in the string runs from “T” to “T”; that’s where the pattern is positioned now.

The first comparison succeeds, as the “T” in the string matches the “T” in the pattern.

The upward arrow moves one position to the left to point at “L”, and “L” is compared with “A” in the pattern, but this time the match fails.

Boyer-Moore observation 3a:  If the mismatch occurs k characters from the start of the pattern and the mismatched character is not in the pattern, then we can advance at least k characters.

‘L’ is not in the pattern and the mismatch occurred against p6, hence we can advance (at least) 6 characters.

Current position is:

(Image 4) However, we can actually do better than this:

Boyer-Moore observation 3b:  We know that we have already matched some characters (1 in this case). If the matched characters don’t match the start of the pattern, then we can actually jump forward a little more, by a distance of delta2.

Current position is:

(Image 5) Now the pointer points at the hyphen, and since that hyphen occurs at the third character of the pattern, we can apply observation 2, which moves the pattern 4 characters forward to line up those two hyphens.

Current position is:

(Image 6) Now the upward arrow points at the last char of the string, “T”, and when we compare each character of the string with the pattern from right to left, every char matches. So we FOUND the pattern in the string.

Performance analysis:

Though the Boyer-Moore algorithm’s performance is sublinear in the best case, it has an O(MN) runtime in the worst case, for instance when the pattern is very short or very few characters can be skipped.

The Boyer-Moore algorithm is really fast on larger alphabets; for a small set of characters it’s recommended to use the Knuth-Morris-Pratt algorithm instead.
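
For completeness, here is a simplified sketch of the whole search loop using only the bad-character (delta1) shift from observation 2; a full implementation would also apply the delta2 (good suffix) shift from observation 3b. The class name, the ASCII-only table, and the example strings are assumptions for illustration, not the canonical implementation.

```java
import java.util.Arrays;

// Simplified Boyer-Moore search sketch using only the bad-character rule.
// The delta2 ("good suffix") shift from observation 3b is omitted for brevity.
public class BoyerMooreSketch {

    static int search(String text, String pattern) {
        int m = pattern.length();
        int n = text.length();

        // delta1 table from observations 1 and 2 (see the earlier sketch).
        int[] delta1 = new int[128];
        Arrays.fill(delta1, m);
        for (int j = 0; j < m; j++) {
            delta1[pattern.charAt(j)] = m - 1 - j;
        }

        int i = m - 1;                            // text index under the "upward arrow"
        while (i < n) {
            int j = m - 1;
            int k = i;
            // Compare right to left, as in the walkthrough above.
            while (j >= 0 && text.charAt(k) == pattern.charAt(j)) {
                j--;
                k--;
            }
            if (j < 0) {
                return k + 1;                     // match found; return its start index
            }
            // Bad-character shift; always advance by at least one position.
            i += Math.max(1, delta1[text.charAt(k)] - (m - 1 - j));
        }
        return -1;                                // pattern not found
    }

    public static void main(String[] args) {
        // Example strings assumed from the walkthrough images.
        System.out.println(search("WHICH-FINALLY-HALTS.--AT-THAT-POINT", "AT-THAT"));  // prints 22
    }
}
```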

What is big data?


According to IBM, every day we create 2.5 quintillion (2.5×10^18) bytes of data in the world, so much that about 90% of the world’s data today has been created in the last two years alone. This vast amount of data, generated so quickly, poses a lot of challenges to data science and related fields in analyzing and utilizing it. This fast-growing, varied, and challenging data is called big data.

Big data is not a single technology but a combination of old and new technologies that help companies gain actionable insight. So big data is the capability to manage a huge volume of varied data, at the right speed and within the right time frame, to allow real-time analysis and action.

The major challenges of big data are:

Volume: How much data there is.

Velocity: How fast the data is processed.

Variety: The different types of data.

Big data comprises almost all kinds of data available in the world, both structured and unstructured. Unstructured data is data that does not follow a particular data model; it can be text, sensor data, audio, video, images, click streams, or log files, to name a few.  In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form.  More recently, analysts have predicted that data will grow 800% over the next five years, and Computerworld says that unstructured information might account for more than 70-80% of all data in an organization. So it’s extremely important to analyze and utilize these vast amounts of data for the benefit of the organization.

Global Market for Big data:

  • Digital information is growing at 57% per annum globally.
  • With global social network penetration and mobile internet penetration both under 20%, this growth has only just begun.
  • All the data generated is valuable, but only if it can be interpreted in a timely and cost-effective manner.
  • IDC expects revenues for big data technology infrastructure to grow by 40% per annum for the next three years.

In 2006, IDC estimated that the world produced 0.18 zettabytes of digital information. It grew to 1.8 zettabytes in 2011 and is projected to reach 35 zettabytes by 2020.

A few statistics to demonstrate the ‘big’ part of big data:

  1. Twitter generates nearly 12 TB of data per day, about 58 million tweets per day.
  2. Every hour Walmart handles more than 1 million customer transactions. All of this information is transferred into a database holding over 2.5 petabytes of information.
  3. According to FICO, the credit card fraud system currently in place helps protect over 2 billion accounts all over the globe.
  4. Currently Facebook holds more than 45 billion photos across its entire user base, and the number of photos is growing rapidly.
  5. The amount of data processed daily by Google is 20 PB, and monthly worldwide searches on Google sites total 87.8 billion.

Here are some interesting statistics from YouTube alone:

  1. More than 1 billion UNIQUE users visit YouTube every month.
  2. Over 4 billion hours of video are watched each month.
  3. 72 hours of video are uploaded every minute. (It would take 3 days without sleep to watch them all.)

So, big data is the next big thing happening to the IT industry. To be successful in the IT industry it’s crucial to adopt big data analytics and make use of the exploding amount of data that’s available now and will be in the future.

Understanding MapReduce

As Google says, MapReduce is a programming model and an associated implementation for processing and generating large data sets.  MapReduce is the heart of Hadoop and big data analytics and is highly scalable, processing very large amounts of data using commodity PCs.

MapReduce consists of two major tasks (jobs), Map and Reduce, which do all the data processing.  The first is the Map task, which takes the given inputs, fed as key/value pairs, and generates intermediate key/value pairs based on the specific problem. The second is the Reduce task, which takes the Map task’s key/value output and combines it to produce a smaller set of final output.

A MapReduce Example:

Say we have a collection of sports documents containing the scores of each sportsperson from the various games he/she has played over a career. Now we have to find the highest score of each sportsperson. Each document can contain more than one entry for a particular sportsperson. A score is represented as a key/value pair in the documents: the sportsperson’s name is the key and his/her score is the value.

For simplicity we take only 3 sportspersons and their scores; here is one sample document.

Mike, 100

Joseph, 150

Robert, 200

Mike, 130

Joseph, 90

Robert, 160

We take four more documents of the same form and process them using MapReduce. Initially, the Map jobs are fed all five documents and generate intermediate key/value pairs as output.

Here are the sample results from the Map tasks for all five documents.

The result of the first document (given above) is:

(Mike, 130) (Joseph, 150) (Robert, 200)

Please note that the Map job produced the highest score of each person within its own input document.

In the same way, the other four Map jobs (their input documents are not shown for simplicity) produced the following results:

(Joseph, 100) (Robert, 150) (Mike, 130)

(Joseph, 170) (Robert, 110)

(Mike, 120) (Joseph, 155) (Robert, 220)

(Mike, 170) (Joseph, 130) (Robert, 250)

Now all five Map jobs have produced the above five sets of key/value pairs. These five results are fed to the Reduce job to generate the final result. There can be five Reduce jobs or fewer, but they will synchronize among themselves to produce one final result. After all the results from the Map jobs are processed, the Reduce job will produce the following result containing the highest score of each sportsperson.

(Mike, 170) (Joseph, 170) (Robert, 250)

As the name MapReduce implies, the Reduce job is always performed after the Map job. Most importantly, all of this work is done in parallel in a clustered environment of commodity machines. In our example, the first five Map jobs run in parallel to produce five results, and these five results are fed into the Reduce job, which again runs in parallel to produce the final result. Because of this high degree of parallelism and scalability, any amount of data can be processed and generated using this framework in a reasonably short amount of time.

In a real MapReduce environment, MapReduce is a software framework with an interface for easily plugging Map and Reduce functionality into the framework. Usually the user writes his or her own Map and Reduce logic and passes it to the framework, which uses that logic to process the input data and produce the output data. The framework hides most of the complexity of parallelization, fault tolerance, data distribution, and load balancing inside the library, letting the user focus only on the Map and Reduce functions.
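
To make this concrete, here is a minimal, framework-agnostic Java sketch of what user-supplied Map and Reduce logic for the max-score example above might look like. The class and method names and the tiny in-memory “shuffle” driver are illustrative assumptions only; they are not the API of Hadoop or any particular MapReduce framework.

```java
import java.util.*;

// Framework-agnostic sketch of the max-score example: user-defined map()
// and reduce() plus a tiny in-memory driver standing in for the framework.
public class MaxScoreMapReduce {

    // Map: read one document's "name, score" lines and emit intermediate pairs.
    // Like the example above, each map emits the highest score per person
    // found within its own document.
    static Map<String, Integer> map(List<String> documentLines) {
        Map<String, Integer> localMax = new HashMap<>();
        for (String line : documentLines) {
            String[] parts = line.split(",");
            String name = parts[0].trim();
            int score = Integer.parseInt(parts[1].trim());
            localMax.merge(name, score, Math::max);
        }
        return localMax;
    }

    // Reduce: given one key and all intermediate values for it, emit the overall maximum.
    static int reduce(String name, List<Integer> scores) {
        return Collections.max(scores);
    }

    public static void main(String[] args) {
        // Only a few of the five sample documents are shown here.
        List<List<String>> documents = List.of(
            List.of("Mike, 100", "Joseph, 150", "Robert, 200",
                    "Mike, 130", "Joseph, 90", "Robert, 160"),
            List.of("Joseph, 100", "Robert, 150", "Mike, 130"),
            List.of("Joseph, 170", "Robert, 110"));

        // "Shuffle" phase: group every intermediate value by its key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (List<String> doc : documents) {
            map(doc).forEach((name, score) ->
                grouped.computeIfAbsent(name, k -> new ArrayList<>()).add(score));
        }

        // Reduce phase: one reduce call per key, e.g. (Robert, 200) for the documents above.
        grouped.forEach((name, scores) ->
            System.out.println("(" + name + ", " + reduce(name, scores) + ")"));
    }
}
```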

Scalable Application Architecture

Architecting a scalable system is challenging, and when it’s done right it is very rewarding.  Top tech companies like Google and Amazon are world-famous for their versatile, high-performing systems, and those systems are versatile and famous largely because they are highly scalable.  Simply put, scalability is the capability of a system to efficiently meet its growing demand/users without adversely affecting its performance, cost, maintainability, etc.

Performance is the capability of a system to carry out its tasks efficiently in a given time while utilizing its resources well; it is another very important feature of a system.  In a non-profit public application it may be tolerable for a task to take two to three seconds, whereas in commercial and mission-critical applications it is not at all tolerable for a task to take even that long. When a commercial system has just hit the market and is still gaining users/popularity, it is not yet vulnerable to scalability issues, and its performance can be quite high. Once the application picks up in the market and demand grows drastically over time, there is a good chance it will become highly vulnerable to performance and scalability issues. When the system is not designed with future performance and scalability in mind, it will surely lose its hard-earned customers over time.

General techniques for scalable system design

Performance and scalability work is done in almost all phases of system development, from planning/requirement analysis through to after deployment to the production environment. So we will discuss how to make decisions about the performance and scalability of a system in three stages of the system life cycle.

  1. Planning and Requirement analysis phase
  2. Design and Development phase
  3. Production and Maintenance phase

Planning and Requirement analysis phase:

Major software design decisions and system modeling are made at this stage. Making the right, forward-looking decisions at this stage is crucial to the success of the project.  System and performance models can be created to evaluate the system design before investing time and resources in implementing a flawed design.

The time, effort and money invested upfront in performance modeling should be proportional to project risk. For a project with significant risk, where performance is critical, you may spend more time and energy up front developing the model. For a project where performance is less of a concern, modeling might be simple.

Budget represents the constraints and enables us to specify how much we can spend (resource-wise) and how we plan to spend it. Constraints govern our total spending, and then we can decide where to spend to get to the total. We assign budget in terms of response time, throughput, latency and resource utilization.

The performance model we develop helps us capture the following important information upfront:

  • Application description: The design of the application in terms of its layers and its target infrastructure.
  • Scenarios: Critical and significant use cases, sequence diagrams, and user stories relevant to performance.
  • Performance objectives: Response time, throughput, and resource utilization.
  • Budgets: Constraints we set on the execution of use cases, such as maximum execution time and resource utilization levels, including CPU, memory, disk I/O, and network I/O.
  • Workload goals: Goals for the number of users, concurrent users, data volumes, and information about the desired use of the application.
  • Quality-of-service (QoS) requirements: QoS requirements, such as security, maintainability, and interoperability, may impact performance, so we should have agreement across software and infrastructure teams about QoS restrictions and requirements.

2. Design and Development phase:

The design and development phase is another important phase of software development in which we have to focus on performance and scalability requirements. Typically any web application will have static content, business logic, and data storage. It’s wise to layer them separately so that each layer can scale independently of the others. A flexible architecture is also important, so that when any new change comes along the architecture can adapt to it easily without affecting performance or scalability.

Static content layer – consists of images (JPEG, GIF, or other formats), HTML, JavaScript, and CSS. It can be layered separately so that it can be hosted independently from the rest of the system, leading to efficient scaling. Since static content is stateless, replicas can easily be created in a cluster environment for high availability. A Content Delivery Network (CDN) can also be set up across geographically different locations for the static content, which promotes high scalability and availability.

Business logic layer – consists of the application logic written in C#, Java, or another programming language. It should also be layered separately and should be able to be hosted on its own server to promote scalability.  A great approach is to implement and expose the business logic as web services, SOAP based or REST based; REST based services are highly scalable and interoperable compared to SOAP based ones. Parallelism can also be applied easily to the business logic when it is designed and implemented in a highly modular way, and parallel processing applications perform really well when user demand grows and more users access the system concurrently (see the sketch below).  Another useful idea for the business logic layer is to group/layer the logical parts of the system separately by functional scenario, for example the order processing module separate from the report processing module; then in a cluster environment it’s easy to allocate more nodes to a heavily used module like order processing than to a less used module like report processing.
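
As a small illustration of the parallelism point above, here is a hedged Java sketch that processes a batch of orders concurrently on a thread pool. The Order type, the processOrder method, and the pool size are hypothetical placeholders for whatever the modular business logic actually does; they are not part of any specific framework.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of running modular business logic (order
// processing) in parallel on a thread pool within one node.
public class OrderProcessor {

    record Order(int id) { }

    // Placeholder for the real order-processing business logic.
    static void processOrder(Order order) {
        System.out.println("processed order " + order.id());
    }

    public static void main(String[] args) throws InterruptedException {
        List<Order> orders = List.of(new Order(1), new Order(2), new Order(3));

        ExecutorService pool = Executors.newFixedThreadPool(4);  // pool size is arbitrary here
        for (Order order : orders) {
            pool.submit(() -> processOrder(order));              // each order handled independently
        }
        pool.shutdown();                                         // stop accepting new work
        pool.awaitTermination(10, TimeUnit.SECONDS);             // wait for in-flight orders
    }
}
```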

3. Production and Maintenance phase:

In this final phase of the SDLC, we see the actual result of all the effort we put into bringing up the system. During the initial launch there may not be availability issues, as the user base is still growing and the system is well able to cope with low or medium demand. When the user count shoots up beyond the capability of the system, that’s when the availability, performance, and scalability issues start to pour in.

Suppose we have reached the threshold of the system’s capability, beyond which it tends to experience issues; the following are the areas to target in order to sort those issues out.

  1. Performance tuning.
  2. Scaling up (vertical scaling).
  3. Scaling out (Horizontal scaling).

Performance tuning.

This step consists of refactoring the application’s source code, analyzing its configuration settings, attempting to further parallelize its logic, and implementing caching strategies. As the application matures over time, tuning in these areas may no longer be possible, as there won’t be any scope left for further performance tuning.

Scaling up (vertical scaling)

Scaling up means adding more resources to the existing server in order to increase the performance of the system as demand grows. The added resources can be more processor cores or more physical memory. Vertical scaling is relatively simple because it requires no changes to the application: a web application simply goes from running on a single node to running on a single node with more resources.

Even though vertical scaling is the simplest of scaling techniques, there are limitations to it. Limitations on vertical scaling can be due to the operating system itself or an operational constraint like security, management or a provider’s architecture.

  • Windows Server 2008 Standard, 32-bit: 4 GB physical memory
  • Windows Server 2008 Standard, 64-bit: 32 GB physical memory
  • Linux, 32-bit: 1 GB to 4 GB physical memory
  • Linux, 64-bit: 4 GB to 32 GB physical memory

As you can see, each operating system has a limit on physical memory; beyond that limit the node cannot be upgraded, so performance cannot be increased past that point.

Scaling out (Horizontal scaling)

Scaling out refers to adding more nodes to form a cluster and running the application across those nodes to provide high availability. In scaling terminology, this implies that the node on which a web application is running cannot be equipped with more CPU, memory, bandwidth, or I/O capacity, and so the web application is split to run on multiple boxes or nodes.

When we plan to horizontally scale our web application, we have to consider how to scale out its three major layers: the static content layer, the business logic layer, and the data storage layer.

Scaling out the static content layer is always easy because it’s stateless: no matter which node of the cluster we use to retrieve our static content, it’s going to be the same content. So when should we scale the static content layer? When we experience increased latency loading static content, such as HTML, CSS, and JavaScript files, in the browser.

When we plan to horizontally scale static content, the first thing to address is how to set up a master-slave architecture: the master is the node where changes to the application’s static content are applied, and the slave node(s) are the ones that receive the updates (replicated/synchronized) from the master at predetermined times.

Unlike the static content layer, the business logic layer is sensitive to scaling out because it maintains state.  For example, say node 1 handles orders 1 to 5000 in an order processing system and node 2 handles orders 5001 to 10000; the customer who owns order 200 can carry out transactions as long as he/she is routed to node 1. The problem arises when he/she is routed to node 2, as node 2 has no clue about order 200.

To handle this problem, one solution lies in specialized software such as Terracotta, GigaSpaces, Oracle Coherence, etc., which solves the issue through replication and synchronization. The problem can also be solved by making the business logic tier and the permanent storage tier work together: we can choose a database that scales horizontally well in a distributed cluster environment. There are many such databases on the market, such as Apache Cassandra and Amazon SimpleDB.

Scaling the permanent storage tier is a huge topic by itself, so I am leaving it for future articles. 🙂