Wednesday, March 27, 2013

BSP - Thinking Beyond Mapreduce


In 2004, Google published research papers on GFS and  MapReduce that became the basis for the open source Hadoop platform now used by Yahoo, Facebook, and Microsoft.  From then on MapReduce has become synonymous with Big Data. 

However, certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, documents and other graph data structures.

Therefore, Google built Pregel, a large bulk synchronous processing application for petabyte -scale graph processing on distributed commodity machines.

Pregel, published in 2010 shows how to implement a number of algorithms like Google’s PageRank, shortest path, bipartite matching and semi-clustering algorithm.


Why BSP?

Today, many practical data processing applications require a more flexible programming abstraction model that is compatible to run on highly scalable and massive data systems (e.g., HDFS, HBase, etc). 

A message passing paradigm beyond Map-Reduce framework would increase its flexibility in its communication capability. 

Bulk Synchronous Parallel (BSP) model fills the bill appropriately. Some of its significant advantages over MapReduce and MPI are:

·         Supports message passing paradigm style of application development
·         Provides a flexible, simple, and easy-to-use small APIs
·         Enables to perform better than MPI for communication-intensive applications
·         Guarantees impossibility of deadlocks or collisions in the communication mechanisms
  

What is BSP?

The Bulk Synchronous Parallel (BSP) was developed by Leslie Valiant during the 1980s. 

The definitive article was published in 1990.

A BSP computation proceeds in a series of global supersteps.

A superstep consists of three components:

1.    Concurrent computation
Several computations take place on every participating processor. Each process only uses values stored in the local memory of the processor. The computations are independent in the sense that they occur asynchronously of all the others.

2.    Communication
The processes exchange data between themselves. This exchange takes the form of one-sided put and get calls, rather than two-sided send and receive calls.

3.    Barrier synchronization
When a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.

                                                    Fig 1 - A BSP Superstep



BSP Frameworks:

The various options available in the open source world derived from Pregel are

Apache HAMA vs GIRAPH:

Here let’s focus on the two open source Apache projects  the implements BSP programming model.

HAMA
GIRAPH
Apache Top Level Project (from May-2012)
Apache Top Level Project (from May-2012)
Latest Version 0.6.0 released on 28-Nov-2012
Latest Version 0.1.0 on 08-Feb-2012
Pure BSP Engine
Uses BSP, but BSP API is not exposed
Generally for iterative processing like Matrix, Graph processing.
Just for Graph Processing.
Jobs are run as BSP Job on HDFS
Jobs are run as Mapper only Job on Hadoop.




                                                  Fig 2 – Apache Hama vs Giraph



Hama's fault tolerance capability is immature at the moment.

Apache HAMA focus is currently on messaging scalability, then on YARN, and fault tolerance.

In my next post, i will mention why not MapReduce for iterative processing.


But, It’s time to think beyond MapReduce and Apache HAMA looks promising.

No comments:

Post a Comment