Command line for hadoop streaming

2014-09-23T01:37:21

I am trying to use Hadoop streaming with a Java class as the mapper. To keep the problem simple, assume the Java code is the following:

import java.io.* ;

class Test {

    public static void main(String args[]) {
        try {
            BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
            String input ;
            while ((input = br.readLine()) != null) {
                  System.out.println(input) ;
            }
        } catch (IOException io) {
            io.printStackTrace() ;
        }
    }
}

I can compile it with "javac Test.java" and run it from the command line as follows:

[abhattac@eat1-hcl4014 java]$ cat a.dat
abc
[abhattac@eat1-hcl4014 java]$ cat a.dat | java Test
abc
[abhattac@eat1-hcl4014 java]$

Assume that I have a file, a.dat, in HDFS:

[abhattac@eat1-hcl4014 java]$ hadoop fs -cat /user/abhattac/a.dat
Abc

I then package the class into a jar:

[abhattac@eat1-hcl4014 java]$ jar cvf Test.jar Test.class
added manifest
adding: Test.class(in = 769) (out= 485)(deflated 36%)
[abhattac@eat1-hcl4014 java]$

Now I want to use Test.java as the mapper in Hadoop streaming. What do I provide for the following command line options?

[1] -mapper: Should it be like the command below?
[2] -file: Do I need to make a jar file out of Test.class? If so, do I need to include a MANIFEST.MF file to indicate the main class?

I tried all of these options, but none of them seems to work. Any help would be appreciated.

hadoop jar /export/apps/hadoop/latest/contrib/streaming/hadoop-streaming-1.2.1.45.jar -file Test.jar -mapper 'java Test' -input /user/abhattac/a.dat -output /user/abhattac/output

The command above doesn't work. The error message in the task log is:

stderr logs

Exception in thread "main" java.lang.NoClassDefFoundError: Test
Caused by: java.lang.ClassNotFoundException: Test
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
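
One likely reading of this stack trace, offered as a sketch rather than a verified fix: "java Test" searches the default classpath (the current directory) for Test.class, but the job ships only Test.jar, so the JVM never looks inside the jar. Assuming -file places Test.jar in the task's working directory, naming the jar with -cp should let the JVM find the class. The paths below are copied from the command above, and the fallback branch is only there so the script can run on a machine without a Hadoop client.

```shell
#!/usr/bin/env bash
# Sketch only: paths are copied from the question; adjust them for your cluster.
STREAMING_JAR=/export/apps/hadoop/latest/contrib/streaming/hadoop-streaming-1.2.1.45.jar

# -file ships Test.jar with the job; -cp Test.jar tells the JVM on the
# task node to look inside that jar for the Test class.
# Assumption: the shipped jar ends up in the task's working directory.
MAPPER_CMD='java -cp Test.jar Test'

if command -v hadoop >/dev/null 2>&1; then
    hadoop jar "$STREAMING_JAR" \
        -file Test.jar \
        -mapper "$MAPPER_CMD" \
        -input /user/abhattac/a.dat \
        -output /user/abhattac/output
else
    # No Hadoop client on this machine: just show the mapper command.
    echo "hadoop not found; mapper command would be: $MAPPER_CMD"
fi
```

With the class named explicitly after -cp, a Main-Class entry in MANIFEST.MF is not needed; the manifest entry only matters when the jar is started with "java -jar". An alternative under the same assumption is to ship the class file itself with "-file Test.class" and keep "java Test" as the mapper.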

Copyright License:
Author:「user3138594」,Reproduced under the CC 4.0 BY-SA copyright license with link to original source & disclaimer.
Link to:https://stackoverflow.com/questions/25979956/command-line-for-hadoop-streaming
