Saturday, March 23, 2013

Running GATE on Hadoop

In this post I describe how to run GATE on Hadoop.

GATE (General Architecture for Text Engineering) is open source software capable of solving almost any text processing problem.

Hadoop GATE is a GitHub project that contains a simple Hadoop job that runs a GATE application.
The job runs an archived GATE application on text files that contain one document per line. It produces sequence files containing XML representations of the document annotations. The GATE application is an archive file with an application .xgapp file in its root directory. The archive is copied to HDFS and placed in the distributed cache.
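Since the job expects one document per line, documents that span several lines have to be flattened first. The following is only a minimal sketch of that preparation step, not part of the Hadoop GATE project: the class name OneDocPerLine, the assumption that the source documents are *.txt files in a single directory, and the choice to collapse all whitespace into single spaces are mine.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: merges a directory of text files into one file
// with one document per line.
public class OneDocPerLine {

    // Collapse every run of whitespace (including line breaks) into a
    // single space so the whole document fits on one line.
    public static String flatten(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: OneDocPerLine <input-dir> <output-file>");
            System.exit(1);
        }
        Path inputDir = Paths.get(args[0]);
        // Assumption: the output file is written outside the input directory.
        Path outputFile = Paths.get(args[1]);

        try (DirectoryStream<Path> docs = Files.newDirectoryStream(inputDir, "*.txt");
             BufferedWriter out = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8)) {
            for (Path doc : docs) {
                String text = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
                String line = flatten(text);
                if (!line.isEmpty()) {   // skip empty documents
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}
```

The resulting file can then be copied to HDFS and used as the job input. Collapsing whitespace changes character offsets relative to the original files, so this only suits cases where the annotations do not need to be mapped back to the original layout.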

Prerequisite:
This project uses the new Hadoop API. It is built with Maven and demonstrates how to use Maven to package all the GATE dependencies into a single jar file.

Maven Build:
Running mvn package failed with the following error:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 22.394s
[INFO] Finished at: Sat Mar 23 07:51:19 UTC 2013
[INFO] Final Memory: 8M/25M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project Hadoop-GATE: Could not resolve dependencies for project Hadoop-GATE:Hadoop-GATE:jar:1.0: Could not find artifact gate:gate-compiler-jdt:jar:1.0 in central (http://repo.maven.apache.org/maven2) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException



To fix this, modify pom.xml: in the gate-compiler-jdt dependency, change the groupId value from gate to uk.ac.gate, as in the snippet below.


<dependency>
    <groupId>uk.ac.gate</groupId>
    <artifactId>gate-compiler-jdt</artifactId>
    <version>1</version>
</dependency>
           
After this change, the Maven build succeeds.

The packaged jar, Hadoop-GATE-1.0.jar, can be found in the target folder.

Hadoop Job Execution:
$ hadoop jar Hadoop-GATE-1.0.jar wpmcn.gate.hadoop.HadoopGATE ANNIE.zip input output

Behemoth Project
Behemoth is another GitHub project that runs GATE on Hadoop.
 
Behemoth is an open source platform for large scale document processing based on Apache Hadoop.

Note that Behemoth does not implement any NLP or Machine Learning components as such but serves as a 'large-scale glueware' for existing resources. Being Hadoop-based, it benefits from all of Hadoop's features, namely scalability, fault tolerance and, most notably, the backing of a thriving open source community.
