Why is Hadoop MapReduce with Python failing, when the scripts work on the command line?

2016-01-07T04:59:29

I'm trying to implement a simple Hadoop MapReduce example using Cloudera 5.5.0. The map and reduce steps should be implemented using Python 2.6.6.

Problem:

  • If the scripts are executed on the Unix command line, they work perfectly fine and produce the expected output:

cat join2*.txt | ./join3_mapper.py | sort | ./join3_reducer.py

  • But executing the scripts as a Hadoop task fails terribly:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -input /user/cloudera/inputTV/join2_gen*.txt -output /user/cloudera/output_tv -mapper /home/cloudera/join3_mapper.py -reducer /home/cloudera/join3_reducer.py -numReduceTasks 1

16/01/06 12:32:32 INFO mapreduce.Job: Task Id : attempt_1452069211060_0026_r_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
    at org.apache.hadoop.streaming.PipeReducer.close(PipeReducer.java:134)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)

  • The mapper works: if the Hadoop command is executed with -numReduceTasks 0, the job runs only the map step, finishes successfully, and the output directory contains the result files from the map step.

  • So I guess there must be something wrong with the reduce step?

  • The stderr logs in Hue show nothing relevant:

Log Upload Time: Wed Jan 06 12:33:10 -0800 2016
Log Length: 222
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Code of the scripts. 1st file, join3_mapper.py:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()        # strip out carriage return
    tuple2 = line.split(",")   # split line into key and value; returns a list

    if len(tuple2) == 2:
        key = tuple2[0]
        value = tuple2[1]
        if value == 'ABC':
            print('%s\t%s' % (key, value))
        elif value.isdigit():
            print('%s\t%s' % (key, value))
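(For anyone wanting to poke at this without a cluster: the mapper's per-line filtering can be exercised in isolation. This is just a sketch that mirrors the script above; the sample lines are invented.)

```python
# Mirrors join3_mapper.py's per-line logic so it can be tested without Hadoop.
def map_line(line):
    tuple2 = line.strip().split(",")       # split into key and value
    if len(tuple2) == 2:
        key, value = tuple2
        if value == 'ABC' or value.isdigit():
            return '%s\t%s' % (key, value)  # emit tab-separated key/value
    return None                             # line is filtered out

# Invented sample lines: two that should pass, two that should be dropped.
for line in ["show1,ABC", "show1,100", "show2,xyz", "badline"]:
    print(map_line(line))
# → show1	ABC / show1	100 / None / None
```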

The 2nd file, join3_reducer.py:

#!/usr/bin/env python
import sys

last_key      = None    # initialize these variables
running_total = 0
abcFound      = False
this_key      = None

# -----------------------------------
# Loop the file
#  --------------------------------
for input_line in sys.stdin:
    input_line = input_line.strip()

    # --------------------------------
    # Get Next Key value pair, splitting at tab
    # --------------------------------
    tuple2 = input_line.split("\t")

    this_key = tuple2[0]
    value = tuple2[1]
    if value.isdigit():
        value = int(value)

    # ---------------------------------
    # Key Check part
    #    if this current key is same 
    #          as the last one Consolidate
    #    otherwise  Emit
    # ---------------------------------
    if last_key == this_key:
        if value == 'ABC':        # filter for only ABC in TV shows
            abcFound = True
        else:
            if isinstance(value, (int, long)):
                running_total += value

    else:
        if last_key:              # if this key is different from the last key,
                                  # and the previous (i.e. last) key is not empty,
                                  # then output the previous <key running-count>
            if abcFound:
                print('%s\t%s' % (last_key, running_total))
                abcFound = False

        running_total = value     # reset values
        last_key = this_key

if last_key == this_key:
    print('%s\t%s' % (last_key, running_total) )
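To make the reducer's behavior easier to inspect off-cluster, its accumulation loop can be replayed on already-grouped (key, value) pairs. This sketch mirrors the logic of join3_reducer.py as written (including the unconditional final emit); the pairs are invented sample data, and `int` stands in for `(int, long)` so it also runs under Python 3:

```python
# Replays join3_reducer.py's grouping/accumulation logic on sorted pairs.
def reduce_pairs(pairs):
    results = []
    last_key = None
    running_total = 0
    abc_found = False
    this_key = None
    for this_key, value in pairs:
        if value.isdigit():
            value = int(value)
        if last_key == this_key:            # same key: consolidate
            if value == 'ABC':
                abc_found = True
            elif isinstance(value, int):
                running_total += value
        else:                               # new key: emit previous, then reset
            if last_key and abc_found:
                results.append('%s\t%s' % (last_key, running_total))
                abc_found = False
            running_total = value
            last_key = this_key
    if last_key == this_key:                # emit the final key unconditionally
        results.append('%s\t%s' % (last_key, running_total))
    return results

pairs = [('show1', '10'), ('show1', 'ABC'), ('show1', '5'),
         ('show2', '7'), ('show2', '3')]
print(reduce_pairs(pairs))
# → ['show1\t15', 'show2\t10']
```

Feeding this the mapper output in different orders (Hadoop sorts by key only, while a plain Unix `sort` orders the whole line) is one way to probe where the two environments diverge.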

I have tried various ways of declaring the input file to the hadoop command; no difference, no success.

What am I doing wrong? Hints and ideas are very much appreciated. Thank you!

Copyright License:
Author: Marco P. Reproduced under the CC 4.0 BY-SA copyright license with link to original source & disclaimer.
Link: https://stackoverflow.com/questions/34642659/why-is-hadoop-mapreduce-with-python-failing-but-the-scripts-are-working-on-comma
