In 2004, Google published research papers on GFS and MapReduce that became the basis for the open source Hadoop platform, now used by Yahoo, Facebook, and Microsoft. Since then, MapReduce has become almost synonymous with Big Data. However, certain core assumptions of MapReduce are fundamentally at odds with analyzing networks of people, documents, and other graph data structures.
Google therefore built Pregel, a large-scale bulk synchronous processing system for petabyte-scale graph processing on distributed commodity machines. The Pregel paper, published in 2010, shows how to implement a number of algorithms such as Google's PageRank, shortest paths, bipartite matching, and semi-clustering.
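Pregel's vertex-centric model is easy to sketch: in every superstep each vertex runs a small compute function that reads the messages delivered to it in the previous superstep and sends new messages along its out-edges. Below is a minimal single-process Python simulation of Pregel-style PageRank; the class and function names are illustrative, not Pregel's actual API.

```python
# A minimal, single-process sketch of Pregel-style vertex-centric PageRank.
# This only simulates the model; Vertex, superstep, and the message lists
# are illustrative names, not Pregel's real API.

class Vertex:
    def __init__(self, vid, out_edges):
        self.id = vid
        self.out_edges = out_edges      # neighbour vertex ids
        self.value = 1.0                # current PageRank estimate
        self.incoming = []              # messages received last superstep

def superstep(vertices, damping=0.85):
    """Run one synchronous superstep: compute, then exchange messages."""
    n = len(vertices)
    outbox = {v.id: [] for v in vertices}
    # 1. Concurrent computation: each vertex uses only its own state.
    for v in vertices:
        if v.incoming:                  # skip the update in the first superstep
            v.value = (1 - damping) / n + damping * sum(v.incoming)
        # 2. Communication: send this vertex's rank share along its out-edges.
        for nbr in v.out_edges:
            outbox[nbr].append(v.value / len(v.out_edges))
    # 3. Barrier: messages become visible only after every vertex has finished.
    for v in vertices:
        v.incoming = outbox[v.id]

# Tiny 3-node cycle: the ranks converge to 1/3 each.
graph = [Vertex(0, [1]), Vertex(1, [2]), Vertex(2, [0])]
for _ in range(60):
    superstep(graph)
print([round(v.value, 3) for v in graph])   # → [0.333, 0.333, 0.333]
```

In a real Pregel or Hama deployment the vertices would be partitioned across workers and the message exchange would go over the network, but the superstep structure is the same.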
Why BSP?
Today, many practical data processing applications require a more flexible programming abstraction than MapReduce, one that still runs on highly scalable, massive data systems (e.g., HDFS, HBase). A message passing paradigm beyond the MapReduce framework would increase flexibility in communication, and the Bulk Synchronous Parallel (BSP) model fits the bill. Some of its significant advantages over MapReduce and MPI are:
· Supports a message-passing style of application development
· Provides a flexible, simple, and easy-to-use API
· Can perform better than MPI for communication-intensive applications
· Prevents deadlocks and collisions in the communication mechanism
What is BSP?
The Bulk Synchronous Parallel (BSP) model was developed by Leslie Valiant during the 1980s; the definitive article was published in 1990.
A BSP computation proceeds in a series of global supersteps.
A superstep consists of three components:
1. Concurrent computation
Several computations take place on every participating processor. Each process uses only values stored in the local memory of its processor; the computations are independent in the sense that they occur asynchronously of one another.
2. Communication
The processes exchange data among themselves. This exchange takes the form of one-sided put and get calls rather than two-sided send and receive calls.
3. Barrier synchronization
When a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.
Fig 1 - A BSP Superstep
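The three components above can be demonstrated with ordinary threads and a real barrier. The sketch below is a toy, assuming a trivial task invented for illustration: each worker squares its local value, "puts" the result into every peer's inbox, and after the barrier reads the global maximum. It simulates the model in a single process; it is not a real BSP framework.

```python
# A runnable sketch of one BSP superstep using threads and a barrier.
# The task (square a value, share it, find the maximum) is a made-up
# example; only the compute/communicate/barrier structure is the point.
import threading

N = 4
mailboxes = [[] for _ in range(N)]      # one inbox per worker
barrier = threading.Barrier(N)          # all N workers must arrive to proceed
results = [None] * N

def worker(rank, local_value):
    # 1. Concurrent computation: uses only local state.
    computed = local_value * local_value
    # 2. Communication: one-sided "put" into the other workers' inboxes.
    for peer in range(N):
        if peer != rank:
            mailboxes[peer].append(computed)
    # 3. Barrier synchronization: wait until everyone has communicated.
    barrier.wait()
    # The next superstep may now safely read the delivered messages.
    results[rank] = max(mailboxes[rank] + [computed])

threads = [threading.Thread(target=worker, args=(r, r + 1)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)   # every worker sees the global maximum: [16, 16, 16, 16]
```

Because reads happen only after the barrier, no worker can observe a half-delivered set of messages; this is what makes deadlocks and races in the communication phase impossible by construction.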
BSP Frameworks:
The main options available in the open source world derived from Pregel are Apache Hama and Apache Giraph.
Apache HAMA vs GIRAPH:
Here, let's focus on the two open source Apache projects that implement the BSP programming model.
HAMA | GIRAPH
Apache Top Level Project (from May-2012) | Apache Top Level Project (from May-2012)
Latest version 0.6.0, released 28-Nov-2012 | Latest version 0.1.0, released 08-Feb-2012
Pure BSP engine | Uses BSP, but the BSP API is not exposed
General iterative processing, e.g., matrix and graph computation | Graph processing only
Jobs run as BSP jobs on HDFS | Jobs run as mapper-only jobs on Hadoop
Fig 2 – Apache Hama vs Giraph
Hama's fault tolerance is immature at the moment; the project's current focus is on messaging scalability, then on YARN support and fault tolerance.
In my next post, I will discuss why MapReduce is unsuitable for iterative processing. But it's time to think beyond MapReduce, and Apache HAMA looks promising.

