Saturday, March 30, 2013

BSP - MapReduce is not the answer to every problem




In my previous post (http://arasan-blog.blogspot.in/2013/03/bsp-thinking-beyond-mapreduce.html) I explained why BSP matters and compared two BSP implementations from Apache, HAMA and Giraph.
This post compares BSP with MapReduce.

Dissection of MapReduce
Let's see why MapReduce is not always the answer by dissecting the Hadoop MapReduce programming model.

The figure below shows the high-level MapReduce pipeline.

Fig. 1 - High Level MapReduce Pipeline

In the MapReduce model, Map and Reduce tasks execute in isolation. There is no communication between the mappers.

A typical Hadoop MapReduce job consists of the following steps, which are highlighted in almost every tutorial on the internet:
  1. Map task
  2. Shuffle and sort
  3. Reduce task
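
To make the three steps concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (map_task, shuffle_and_sort, reduce_task, run_job) and the word-count problem are assumptions chosen only for illustration. Note that each map_task call sees only its own input line, which mirrors the isolation between mappers.

```python
from collections import defaultdict

def map_task(line):
    # Step 1: emit a (word, 1) pair for every word in this mapper's own input.
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(mapped_pairs):
    # Step 2: group the values by key and order the keys between the two phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_task(key, values):
    # Step 3: combine all values that share a key.
    return key, sum(values)

def run_job(lines):
    # Mappers never talk to each other; their outputs only meet in the shuffle.
    mapped = [pair for line in lines for pair in map_task(line)]
    return [reduce_task(key, values) for key, values in shuffle_and_sort(mapped)]

print(run_job(["BSP and MapReduce", "MapReduce and Hadoop"]))
```

In a real cluster the map output is written to local disk and pulled over the network by the reducers; the sketch keeps everything in memory and only shows the data flow.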


In the above picture, I have highlighted the I/O operations involved in a Hadoop MapReduce job. The cost of a single job adds up to:
Map + Reduce + Network Data Transfer + 4 x (I/O operation)

I believe everyone agrees that I/O operations are costly.

Consider a typical enterprise Hadoop workload that needs 5 iterations, or rather 5 MapReduce jobs, for a particular problem. The I/O operations alone then add up to 5 x 4 = 20.
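
The arithmetic can be written as a tiny back-of-the-envelope calculation. The constant of 4 I/O operations per job comes from the pipeline described above; the function name mapreduce_io_ops is hypothetical, and the model deliberately ignores data volume and counts only the number of I/O stages.

```python
IO_OPS_PER_MR_JOB = 4  # I/O operations per MapReduce job, as counted in this post

def mapreduce_io_ops(num_jobs, io_per_job=IO_OPS_PER_MR_JOB):
    # Each chained job repeats the full set of I/O stages.
    return num_jobs * io_per_job

for jobs in range(1, 6):
    print(jobs, "job(s):", mapreduce_io_ops(jobs), "I/O operations")

print(mapreduce_io_ops(5))  # 20
```

The count grows linearly with the number of chained jobs, which is why iterative algorithms are hit hardest.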

There is also the time spent starting up a JVM for the map and reduce tasks of each MR job (even though performance tunings such as JVM reuse are available).


KMeans Clustering – BSP vs MapReduce

Please find the following links that may be useful.

I compared the Mahout (MapReduce) KMeans clustering implementation with the HAMA (BSP) KMeans implementation, and in my experiment HAMA was far ahead of Mahout in execution time.
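
To illustrate why the BSP model suits an iterative algorithm like KMeans, here is a toy single-process simulation of BSP supersteps in plain Python. It is only a sketch under stated assumptions: it is neither the HAMA nor the Mahout implementation, it uses no HAMA API, and all names (nearest, local_compute, bsp_kmeans) and the sample data are hypothetical. Each "peer" keeps its partition in memory across supersteps and exchanges only small partial sums, whereas a chain of MapReduce jobs would start a new job for every iteration.

```python
def nearest(point, centers):
    # Index of the closest center by squared Euclidean distance.
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def local_compute(points, centers):
    # One peer's work in a superstep: partial sums and counts per cluster.
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        i = nearest(point, centers)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += point[d]
    return sums, counts

def bsp_kmeans(partitions, centers, max_supersteps=10):
    for _ in range(max_supersteps):
        # Computation phase: every peer works on its own in-memory partition.
        messages = [local_compute(part, centers) for part in partitions]
        # Communication phase and barrier: partial results are combined
        # before any peer starts the next superstep.
        new_centers = []
        for i, old in enumerate(centers):
            total = sum(counts[i] for _, counts in messages)
            if total == 0:
                new_centers.append(old)  # keep an empty cluster's center unchanged
                continue
            new_centers.append([sum(sums[i][d] for sums, _ in messages) / total
                                for d in range(len(old))])
        if new_centers == centers:
            break  # converged: the centers no longer move
        centers = new_centers
    return centers

partitions = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
print(bsp_kmeans(partitions, [[0.0, 0.0], [10.0, 10.0]]))
```

The input partitions are loaded once and reused in every superstep; only the per-cluster sums and counts cross the "network" at each barrier.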


Thus, for iterative processing problems, BSP overshadows the MapReduce programming model.
