In my previous post (http://arasan-blog.blogspot.in/2013/03/bsp-thinking-beyond-mapreduce.html),
I explained why BSP matters and compared two Apache BSP implementations,
Apache HAMA and Apache Giraph.
This post compares BSP and MapReduce.
Dissection of MapReduce
Let’s see why MapReduce falls short by dissecting the Hadoop MapReduce programming model.
The figure below shows the high-level MapReduce pipeline.
Fig. 1 - High-Level MapReduce Pipeline
In the MapReduce model, map and reduce tasks execute in isolation. There is no
communication between the mappers.
A typical Hadoop MapReduce job consists of the following steps, which are
highlighted in almost all the material available on the internet:
- Map task
- Shuffle & sort
- Reduce task
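The three steps above can be sketched in a few lines of Python. This is only an in-memory toy word count, not the Hadoop API: the function names (`map_task`, `shuffle_and_sort`, `reduce_task`, `run_job`) are made up for illustration, and a real job would write intermediate data to disk and move it over the network between the steps.

```python
from collections import defaultdict


def map_task(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(mapped_pairs):
    """Shuffle & sort step: group values by key, then order the groups by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    # Each line is mapped in isolation; the mappers never talk to each other.
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_task(key, values) for key, values in grouped)


result = run_job(["bsp and mapreduce", "mapreduce and hadoop"])
print(result)
```

Note that the only place where data from different mappers meets is the shuffle & sort step, which is exactly where the network transfer happens in a real cluster.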
In the picture above, I have highlighted the I/O operations involved
in a Hadoop MapReduce job.
Cost of one job = Map + Reduce + Network Data Transfer + 4 I/O operations
I believe everyone agrees that I/O operations are costly.
Consider a typical enterprise Hadoop workload that needs 5 iterations, or rather
5 MapReduce jobs, to solve a particular problem.
The I/O operations alone then add up to 5 x 4 = 20.
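The arithmetic is simple enough to check by hand, but a tiny sketch makes it easy to try other iteration counts. The figure of 4 I/O operations per job comes from the pipeline picture in this post; the function name is hypothetical.

```python
IO_OPS_PER_JOB = 4  # I/O operations per MapReduce job, as counted in this post


def total_io_operations(num_jobs, io_ops_per_job=IO_OPS_PER_JOB):
    """I/O operations for a chain of MapReduce jobs run one after another."""
    return num_jobs * io_ops_per_job


total = total_io_operations(5)
print(total)  # prints 20
```

The count grows linearly with the number of iterations, so an algorithm that needs 50 passes over the data would pay for 200 I/O operations under the same assumption.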
There is also the time spent starting up a JVM for the map and reduce tasks of
each MR job (even though performance tunings such as JVM reuse are available).
KMeans Clustering – BSP vs MapReduce
Please find the following links that may be useful.
I compared Mahout (MapReduce) KMeans clustering with the HAMA (BSP) KMeans implementation, and in my experiment HAMA was far ahead of Mahout in KMeans clustering execution.
Thus, for iterative processing problems, BSP overshadows the MapReduce programming model.
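To make the iterative pattern concrete, here is a rough single-process simulation of KMeans written in BSP style, where every iteration is one superstep: local computation on each peer, message exchange, then a barrier. It is not the HAMA API and not the code used in my experiment. The names (`local_superstep`, `bsp_kmeans`), the 1-D points and the sample data are all assumptions made for illustration.

```python
def local_superstep(points, centers):
    """Compute phase on one peer: per-center [sum, count] over its own points."""
    stats = [[0.0, 0] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        stats[nearest][0] += p
        stats[nearest][1] += 1
    return stats


def bsp_kmeans(partitions, centers, max_supersteps=10):
    """Run KMeans as a sequence of supersteps over partitions kept in memory."""
    centers = list(centers)
    for _ in range(max_supersteps):
        # Compute: every peer works only on its own partition.
        messages = [local_superstep(part, centers) for part in partitions]
        # Communicate + barrier: all messages are in before the centers change.
        new_centers = []
        for i, old in enumerate(centers):
            total = sum(m[i][0] for m in messages)
            count = sum(m[i][1] for m in messages)
            new_centers.append(total / count if count else old)
        if new_centers == centers:
            break  # converged, no further superstep needed
        centers = new_centers
    return centers


partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]  # two simulated peers
final_centers = bsp_kmeans(partitions, [0.0, 5.0])
print(final_centers)  # prints [2.0, 10.0]
```

The point of the sketch is that the partitions stay where they are between iterations and only the small per-center statistics are exchanged, whereas a chain of MapReduce jobs would re-read its input and write its output for every iteration.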

See also, Apache MRQL ;-)