Hadoop streaming with Python and python subprocess

2012-03-07T15:04:03

I have set up a basic Hadoop master-slave cluster and am able to run MapReduce programs (including Python) on it.

Now I am trying to run a Python script that calls a C binary, so I am using the subprocess module. Hadoop streaming works fine for a plain Python script, but as soon as I use subprocess to invoke the binary, the job fails.

As you can see in the logs below, the hello executable is picked up for packaging, but the job still fails to run.

. . packageJobJar: [/tmp/hello/hello, /app/hadoop/tmp/hadoop-unjar5030080067721998885/] [] /tmp/streamjob7446402517274720868.jar tmpDir=null

JarBuilder.addNamedStream hello
.
.
12/03/07 22:31:32 INFO mapred.FileInputFormat: Total input paths to process : 1
12/03/07 22:31:32 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/03/07 22:31:32 INFO streaming.StreamJob: Running job: job_201203062329_0057
12/03/07 22:31:32 INFO streaming.StreamJob: To kill this job, run:
12/03/07 22:31:32 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_201203062329_0057
12/03/07 22:31:32 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201203062329_0057
12/03/07 22:31:33 INFO streaming.StreamJob:  map 0%  reduce 0%
12/03/07 22:32:05 INFO streaming.StreamJob:  map 100%  reduce 100%
12/03/07 22:32:05 INFO streaming.StreamJob: To kill this job, run:
12/03/07 22:32:05 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=master:54311 -kill job_201203062329_0057

12/03/07 22:32:05 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201203062329_0057
12/03/07 22:32:05 ERROR streaming.StreamJob: Job not Successful!

12/03/07 22:32:05 INFO streaming.StreamJob: killJob...
Streaming Job Failed!

The command I am running is:

hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/MARS.py -reducer /home/hduser/MARS_red.py -input /user/hduser/mars_inputt -output /user/hduser/mars-output -file /tmp/hello/hello -verbose

where hello is the C executable, a simple hello-world program I am using to check basic functioning.

My Python code is :

#!/usr/bin/env python
import subprocess

# Call the C binary shipped with the job via -file; it should land in
# the task's working directory.
subprocess.call(["./hello"])
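For context, this is the direction I was heading: a minimal streaming-style mapper that wraps the binary with subprocess. This is only a sketch under my assumptions (the binary is shipped into the task's working directory via -file, and its stdout is what should be emitted); `run_binary` and `mapper` are just illustrative names, not part of any Hadoop API:

```python
#!/usr/bin/env python
import subprocess
import sys

def run_binary(cmd):
    """Run an external command and capture its output, so the child's
    stdout is not mixed uncontrolled into the streaming pipe."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # Surface the child's stderr in the task logs for debugging.
        sys.stderr.write(err.decode("utf-8", "replace"))
    return proc.returncode, out

def mapper(lines, cmd):
    """For each input record, invoke the binary and emit its output."""
    for line in lines:
        code, out = run_binary(cmd)
        if code == 0:
            sys.stdout.write(out.decode("utf-8", "replace"))
```

A real mapper would call `mapper(sys.stdin, ["./hello"])`; capturing stderr is mainly so that failures show up in the task attempt logs rather than vanishing.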

Any help with getting the executable to run from Python under Hadoop streaming, or with debugging this failure, would move me forward.

Thanks,

Ganesh

Copyright License:
Author: Ganesh. Reproduced under the CC 4.0 BY-SA copyright license with link to original source & disclaimer.
Link to: https://stackoverflow.com/questions/9597122/hadoop-streaming-with-python-and-python-subprocess

