Saturday, March 30, 2013

BSP - MapReduce is not the answer to every problem




In my previous post, http://arasan-blog.blogspot.in/2013/03/bsp-thinking-beyond-mapreduce.html, I explained why BSP matters and compared the two Apache BSP implementations, HAMA and Giraph.
This post compares BSP and MapReduce.

Dissecting MapReduce
Let's see why MapReduce falls short by dissecting the Hadoop MapReduce programming model.

The figure below shows the high-level MapReduce pipeline.

               Fig 1 - High-Level MapReduce Pipeline

In the MapReduce model, map and reduce tasks execute in isolation; there is no communication between the mappers.

A typical Hadoop MapReduce job consists of the following steps, highlighted in almost every write-up on the internet (a sketch of steps 1 and 3 follows the list):
  1. Map task
  2. Shuffle & sort
  3. Reduce task
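
To make the isolation concrete, below is a minimal sketch of steps 1 and 3 using the standard Hadoop MapReduce API (the classic word count; the class names are mine, not from any particular project):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Step 1: map tasks run in isolation; each sees only its own input split.
public class TokenMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String token : value.toString().split("\\s+")) {
      word.set(token);
      context.write(word, ONE); // map output is spilled to local disk
    }
  }
}

// Step 3: reduce tasks see the shuffled & sorted map output (step 2),
// grouped by key; they never talk to the mappers directly.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // final output goes to HDFS
  }
}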


In Fig 1 above, I have highlighted the I/O operations involved in a Hadoop MapReduce job:
Map + Reduce + network data transfer + 4 I/O operations (reading the input from HDFS, spilling the map output to local disk, reading the shuffled data on the reducer, and writing the final output to HDFS)

I believe everyone will agree that I/O operations are costly.

A typical enterprise problem may need 5 iterations, i.e. a chain of 5 MapReduce jobs; the I/O operations alone then amount to 20.

Also, each MR job pays JVM startup time for its map and reduce tasks (even though performance tunings such as JVM reuse are available).
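
To see where the cost comes from, here is a hedged sketch of such a 5-iteration driver using the standard Hadoop API (mapper/reducer setup is omitted, and the class and path names are mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);

    for (int i = 0; i < 5; i++) {
      Job job = new Job(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // ... set mapper, reducer, and key/value classes here ...

      Path output = new Path(args[1] + "/iter-" + i);
      FileInputFormat.addInputPath(job, input);    // re-read from HDFS (I/O)
      FileOutputFormat.setOutputPath(job, output); // write back to HDFS (I/O)

      job.waitForCompletion(true); // fresh map/reduce JVMs every iteration
      input = output;              // the next job consumes this from disk
    }
  }
}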


KMeans Clustering – BSP vs MapReduce


I have compared Mahout's (MapReduce) KMeans clustering with HAMA's (BSP) KMeans implementation, and in my experiments HAMA finished far ahead of Mahout. The reason is structural: a BSP engine keeps its task processes alive across iterations, holding the points in memory and exchanging only small centroid updates as messages, whereas each Mahout iteration is a fresh MapReduce job that re-reads all the points from HDFS.


Thus, for iterative processing problems, BSP overshadows the MapReduce programming model.
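
To show the shape of that BSP loop, below is a hypothetical 1-D KMeans (k = 2) written against HAMA's BSP API as I understand it from the 0.6 docs; this is not Hama's actual KMeans code, and the synthetic points stand in for real input:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// Hypothetical 1-D KMeans (k = 2): the points are loaded once and stay in
// memory across all iterations; only small partial-sum messages cross the
// network, and no intermediate result is written to HDFS.
public class KMeansBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, Text> {

  private static final int ITERATIONS = 5;

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, Text> peer)
      throws IOException, SyncException, InterruptedException {
    Random rnd = new Random(peer.getPeerName().hashCode());
    double[] points = new double[1000];        // synthetic local partition
    for (int i = 0; i < points.length; i++) {
      points[i] = rnd.nextDouble() * 100;
    }
    double[] centroids = {25.0, 75.0};         // shared initial centroids

    for (int iter = 0; iter < ITERATIONS; iter++) {
      // Superstep computation: assign points, accumulate local partial sums.
      double[] sum = new double[2];
      long[] count = new long[2];
      for (double p : points) {
        int nearest =
            Math.abs(p - centroids[0]) <= Math.abs(p - centroids[1]) ? 0 : 1;
        sum[nearest] += p;
        count[nearest]++;
      }

      // Communication: broadcast this peer's partial sums to every peer.
      String body = sum[0] + "," + count[0] + "," + sum[1] + "," + count[1];
      for (String other : peer.getAllPeerNames()) {
        peer.send(other, new Text(body));
      }

      // Barrier: after sync() the messages from all peers are readable.
      peer.sync();

      double[] totalSum = new double[2];
      long[] totalCount = new long[2];
      Text received;
      while ((received = peer.getCurrentMessage()) != null) {
        String[] parts = received.toString().split(",");
        for (int c = 0; c < 2; c++) {
          totalSum[c] += Double.parseDouble(parts[2 * c]);
          totalCount[c] += Long.parseLong(parts[2 * c + 1]);
        }
      }
      for (int c = 0; c < 2; c++) {
        if (totalCount[c] > 0) {
          centroids[c] = totalSum[c] / totalCount[c]; // updated in memory
        }
      }
    }
    // Writing out the final centroids is omitted in this sketch.
  }
}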

Wednesday, March 27, 2013

BSP - Thinking Beyond Mapreduce


In 2003 and 2004, Google published research papers on GFS and MapReduce that became the basis for the open source Hadoop platform, now used by Yahoo, Facebook, and Microsoft. From then on, MapReduce became synonymous with Big Data.

However, certain core assumptions of MapReduce are fundamentally at odds with analyzing networks of people, documents, and other graph data structures.

Therefore, Google built Pregel, a large-scale bulk synchronous processing system for petabyte-scale graph processing on distributed commodity machines.

The Pregel paper, published in 2010, shows how to implement a number of algorithms such as Google's PageRank, shortest paths, bipartite matching, and semi-clustering.


Why BSP?

Today, many practical data processing applications require a more flexible programming abstraction that runs well on highly scalable, massive data systems (e.g., HDFS, HBase).

A message-passing paradigm beyond the MapReduce framework increases flexibility through richer communication.

The Bulk Synchronous Parallel (BSP) model fits the bill. Some of its significant advantages over MapReduce and MPI are:

• Supports a message-passing style of application development
• Provides a flexible, simple, and easy-to-use small API
• Can perform better than MPI for communication-intensive applications
• Guarantees deadlock- and collision-free communication
  

What is BSP?

The Bulk Synchronous Parallel (BSP) model was developed by Leslie Valiant during the 1980s; the definitive article was published in 1990.

A BSP computation proceeds in a series of global supersteps.

A superstep consists of three components:

1.    Concurrent computation
Several computations take place on every participating processor, each using only values stored in the processor's local memory. The computations are independent in the sense that they occur asynchronously of one another.

2.    Communication
The processes exchange data between themselves. This exchange takes the form of one-sided put and get calls, rather than two-sided send and receive calls.

3.    Barrier synchronization
When a process reaches this point (the barrier), it waits until all other processes have finished their communication actions.

                                                    Fig 1 - A BSP Superstep
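
These three phases map directly onto code. Here is a minimal, hypothetical superstep written against Apache HAMA's Java API (method names follow the HAMA 0.6 BSPPeer interface as I understand it; HAMA exposes message passing rather than literal put/get calls) that computes a global maximum:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

public class GlobalMaxBSP extends
    BSP<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> {

  @Override
  public void bsp(
      BSPPeer<NullWritable, NullWritable, NullWritable, NullWritable, DoubleWritable> peer)
      throws IOException, SyncException, InterruptedException {
    // 1. Concurrent computation: purely local work on this peer's data.
    double localMax = Math.random(); // placeholder for real local computation

    // 2. Communication: sends are non-blocking; nobody waits for a receive.
    for (String other : peer.getAllPeerNames()) {
      peer.send(other, new DoubleWritable(localMax));
    }

    // 3. Barrier synchronization: proceed only when every peer has arrived.
    peer.sync();

    // The next superstep starts here; messages sent above are now readable.
    double globalMax = Double.NEGATIVE_INFINITY;
    DoubleWritable msg;
    while ((msg = peer.getCurrentMessage()) != null) {
      globalMax = Math.max(globalMax, msg.get());
    }
  }
}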



BSP Frameworks:

Several open source options derived from Pregel are available; the two most prominent are Apache HAMA and Apache Giraph.

Apache HAMA vs GIRAPH:

Here, let's focus on the two open source Apache projects that implement the BSP programming model.

HAMA                                                  | GIRAPH
Apache Top Level Project (from May 2012)              | Apache Top Level Project (from May 2012)
Latest version 0.6.0, released 28-Nov-2012            | Latest version 0.1.0, released 08-Feb-2012
Pure BSP engine                                       | Uses BSP, but the BSP API is not exposed
General iterative processing (e.g., matrix and graph) | Graph processing only
Jobs run as BSP jobs on HDFS                          | Jobs run as map-only jobs on Hadoop




                                                  Fig 2 – Apache Hama vs Giraph



HAMA's fault tolerance capability is immature at the moment. The project's current focus is on messaging scalability, then on YARN support and fault tolerance.

In my next post, I will explain why MapReduce is not suited for iterative processing.


Still, it's time to think beyond MapReduce, and Apache HAMA looks promising.

Saturday, March 23, 2013

Running GATE on Hadoop

In this post I would like to describe running GATE on Hadoop.

GATE (General Architecture for Text Engineering) is open source software capable of solving almost any text processing problem.

Hadoop GATE is a GitHub project containing a simple Hadoop job that runs a GATE application.
The job runs an archived GATE application on text files with one document per line and produces sequence files containing XML representations of the document annotations. The GATE application is an archive file with an application .xgapp file in its root directory; the application is copied to HDFS and placed into the distributed cache.

Prerequisite:
This project uses the new Hadoop API. It is built with Maven and demonstrates how to use Maven to package all the GATE dependencies into a single jar file.

Maven Build:
The mvn package command failed with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22.394s
[INFO] Finished at: Sat Mar 23 07:51:19 UTC 2013
[INFO] Final Memory: 8M/25M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project Hadoop-GATE: Could not resolve dependencies for project Hadoop-GATE:Hadoop-GATE:jar:1.0: Could not find artifact gate:gate-compiler-jdt:jar:1.0 in central (http://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException



To fix this, modify the pom.xml as shown below:
Change the value gate to uk.ac.gate in the groupId tag of the following dependency snippet.


<dependency>
    <groupId>uk.ac.gate</groupId>
    <artifactId>gate-compiler-jdt</artifactId>
    <version>1</version>
</dependency>
           
After this modification, the Maven build succeeds.

The packaged jar, Hadoop-GATE-1.0.jar, can be found under the target folder.

Hadoop Job Execution:
$hadoop jar Hadoop-GATE-1.0.jar wpmcn.gate.hadoop.HadoopGATE ANNIE.zip input output

Here ANNIE.zip is the archived GATE application (with the .xgapp file in its root directory), input is the HDFS directory of one-document-per-line text files, and output is where the sequence files are written.

Behemoth Project
There is also a different GitHub project, Behemoth, that runs GATE on Hadoop.
 
Behemoth is an open source platform for large scale document processing based on Apache Hadoop.

Note that Behemoth does not implement any NLP or machine learning components as such, but serves as 'large-scale glueware' for existing resources. Being Hadoop-based, it benefits from all of Hadoop's features, namely scalability, fault tolerance, and, most notably, the backing of a thriving open source community.

Saturday, March 16, 2013

Hadoop ClassNotFoundException


Java requires third-party and user-defined classes to be on the classpath (the command line's "-classpath" option) when the JVM is launched.

MapReduce tasks are executed in separate JVMs on the TaskTrackers, and sometimes you need to use third-party libraries in the map/reduce task attempts.

A ClassNotFoundException occurs when a required library is not on the classpath of the nodes running the map/reduce tasks.


Below are the different ways to avoid a ClassNotFoundException in Hadoop.

1. Include the JAR in the “-libjars” command line option of the "hadoop jar ..." command

The JAR will be placed in the distributed cache and made available to all of the job's task attempts.
More specifically, you will find the JAR in one of the ${mapred.local.dir}/taskTracker/archive/${user.name}/distcache/… subdirectories on the local nodes.
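
For example (the jar names, driver class, and paths here are illustrative):

$hadoop jar my-job.jar com.example.MyDriver -libjars /local/path/third-party.jar input output

One caveat: -libjars is interpreted by Hadoop's GenericOptionsParser, so it only takes effect if the driver runs through ToolRunner (or parses the generic options itself).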

2. Include the referenced JAR in the lib subdirectory of the submittable JAR.

Points 1 & 2 are preferred when the JARs
  • are small,
  • change often, and
  • are job-specific.

3. Install the JAR on the cluster nodes.
The easiest way is to place the JAR into the $HADOOP_HOME/lib directory, as everything in this directory is included when a Hadoop daemon starts.


The same guiding principles apply to native code libraries that need to run on the nodes (JNI or C++ pipes).

  1. You can put them into the distributed cache with the “-files” option,
  2. include them in archive files specified with the “-archives” option, or
  3. install them on the cluster nodes.
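
For example, to ship a native library through the distributed cache (the library name and paths are illustrative):

$hadoop jar my-job.jar com.example.MyDriver -files /local/path/libmylib.so input output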


If the dynamic library linker is configured properly, the native code will be available to your task attempts. You can also modify the environment of the job's running task attempts explicitly by setting the JAVA_LIBRARY_PATH or LD_LIBRARY_PATH variables:


hadoop jar <your jar> [main class]
      -D mapred.child.env="LD_LIBRARY_PATH=/path/to/your/libs" ...