Python code is valid but Hadoop Streaming produces part-00000 "Empty file"

2012-11-19T06:07:57

On an Ubuntu virtual machine I have set up a single-node cluster as per Michael Noll's tutorial, and this has been my starting point for writing a Hadoop program.

Also, for reference, this.

My program is in Python and uses Hadoop Streaming.

I have written a simple vector multiplication program where mapper.py takes input files v1 and v2, each containing a vector in the form 12,33,10, and emits the element-wise products. reducer.py then returns the sum of those products, i.e.:

mapper: map(mult,v1,v2)

reducer: sum(p1,p2,p3,...,pn)

mapper.py :

import sys

def mult(x, y):
    return int(x)*int(y)

# Input comes from STDIN (standard input); each line holds one
# comma-separated vector.

inputvec = tuple()

for i in sys.stdin:
    i = i.strip()
    inputvec += (tuple(i.split(",")),)

v1 = inputvec[0]
v2 = inputvec[1]

results = map(mult, v1, v2)

# Simply printing the results variable would print the whole list,
# brackets included. That would be a problem because reducer.py takes
# all of this output as its STDIN, brackets and all.

# Cleaning the output ready to be input for the Reduce step:

for o in results:
    print ' %s' % o,

reducer.py:

import sys

result = int()

for a in sys.stdin:
    a = a.strip()
    a = a.split()

    # Accumulate every product emitted by the mapper
    for r in range(len(a)):
        result += int(a[r])

print result
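
For reference, the same reduction could also be written more compactly (a minimal Python 2 sketch, assumed equivalent to the loop above):

import sys

# Sum every whitespace-separated token read from STDIN
print sum(int(t) for line in sys.stdin for t in line.split())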

In the in subdirectory I have v1 containing 5,12,20 and v2 containing 14,11,3.
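
For completeness, the input files can be recreated like so (assuming the in directory already exists):

echo "5,12,20" > in/v1
echo "14,11,3" > in/v2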

Testing locally, things work as expected:

hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py
 70  132  60

hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort
 70  132  60

hduser@ubuntu:~/VectMult$ cat in/* | python ./mapper.py | sort | python ./reducer.py
262

When I run it in Hadoop, it appears to run successfully and doesn't throw any exceptions:

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper python /home/hduser/VectMult3/mapper.py -reducer python /home/hduser/VectMult3/reducer.py -input /home/hduser/VectMult3/in -output /home/hduser/VectMult3/out4
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/app/hadoop/tmp/hadoop-unjar2168776605822419867/] [] /tmp/streamjob6920304075078514767.jar tmpDir=null
12/11/18 21:20:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/11/18 21:20:09 WARN snappy.LoadSnappy: Snappy native library not loaded
12/11/18 21:20:09 INFO mapred.FileInputFormat: Total input paths to process : 2
12/11/18 21:20:09 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
12/11/18 21:20:09 INFO streaming.StreamJob: Running job: job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: To kill this job, run:
12/11/18 21:20:09 INFO streaming.StreamJob: /usr/local/hadoop/libexec/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201211181903_0009
12/11/18 21:20:09 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201211181903_0009
12/11/18 21:20:10 INFO streaming.StreamJob:  map 0%  reduce 0%
12/11/18 21:20:24 INFO streaming.StreamJob:  map 67%  reduce 0%
12/11/18 21:20:33 INFO streaming.StreamJob:  map 100%  reduce 0%
12/11/18 21:20:36 INFO streaming.StreamJob:  map 100%  reduce 22%
12/11/18 21:20:45 INFO streaming.StreamJob:  map 100%  reduce 100%
12/11/18 21:20:51 INFO streaming.StreamJob: Job complete: job_201211181903_0009
12/11/18 21:20:51 INFO streaming.StreamJob: Output: /home/hduser/VectMult3/out4

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat /home/hduser/VectMult3/out4/part-00000
Warning: $HADOOP_HOME is deprecated.

hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls /home/hduser/VectMult3/out4/
Warning: $HADOOP_HOME is deprecated.

Found 3 items
-rw-r--r--   1 hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_SUCCESS
drwxr-xr-x   - hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/_logs
-rw-r--r--   1 hduser supergroup          0 2012-11-18 22:05 /home/hduser/VectMult3/out4/part-00000

But when I check the output, all I find is a 0-byte empty file.

I can't work out what's gone wrong. Can anyone help?
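
One thing I notice: the Streaming examples in the Hadoop docs usually quote the interpreter and script together as a single -mapper/-reducer argument and ship the scripts to the task nodes with -file. A sketch of that form with the same paths (untested, so I don't know whether this is the issue):

bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file /home/hduser/VectMult3/mapper.py \
    -file /home/hduser/VectMult3/reducer.py \
    -input /home/hduser/VectMult3/in \
    -output /home/hduser/VectMult3/out4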


Edit: Response to @DiJuMx

"One way to fix this would be to output to a temporary file from map, and then use the temporary file in reduce."

I'm not sure Hadoop allows this; hopefully someone who knows better can correct me.

"Before attempting this, try writing a simpler version which just passes the data straight through with no processing."

I thought this was a good idea, just to check that the data is flowing through correctly. I used the following for this:

Both mapper.py and reducer.py:

import sys

for i in sys.stdin:
    print i,
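
Checked locally the same way as before; with the v1 and v2 inputs above, this passthrough should simply print the two input lines back (sorted):

cat in/* | python ./mapper.py | sort | python ./reducer.py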

What comes out should be exactly what went in, but it still produces an empty file.

"Alternatively, edit your existing code in reduce to output an (error) message to the output file if the input was blank."

mapper.py

import sys

for i in sys.stdin:
    print "mapped",

print "mapper",

reducer.py

import sys

for i in sys.stdin:
    print "reduced",

print "reducer",  

If any input is received, it should ultimately output reduced. Either way, it should at least output reducer. The actual output is still an empty file.

Copyright License:
Author: 「dafuloth」, reproduced under the CC BY-SA 4.0 license with link to the original source & disclaimer.
Link: https://stackoverflow.com/questions/13445126/python-code-is-valid-but-hadoop-streaming-produces-part-00000-empty-file
