empty output collection using hadoop streaming with mongo-hadoop and python

2015-09-24T22:03:40

I'm trying to use Hadoop Streaming with mongo-hadoop and Python. Reading from a MongoDB collection works, but writing does not: as seen below, the job runs successfully, yet the output collection stays empty.

I tried both the prebuilt 1.4.0 jars and the latest git snapshot (1.4.1) of mongo-hadoop. The Hadoop distribution is the Hortonworks Sandbox with HDP 2.2.4.2; HDP 2.3 doesn't work either.

The mongo-hadoop wiki is slightly outdated, so I'm not sure whether I'm using the right arguments, missing something, or observing a bug.

$ cat run_python.sh

#!/bin/bash
set -x
export LIBJARS="/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar","/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar","/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar"

su hdfs - -m -c "hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar \
-files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py \
-D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
-D mongo.auth.uri=mongodb://hadoop:[email protected]:27017/admin \
-D mongo.input.uri=mongodb://hadoop:[email protected]:27017/hadoop.in \
-D mongo.output.uri=mongodb://hadoop:[email protected]:27017/hadoop.out \
-D mongo.job.verbose=true \
-libjars ${LIBJARS} \
-input /tmp/in \
-output /tmp/out \
-io mongodb \
-inputformat com.mongodb.hadoop.mapred.MongoInputFormat \
-outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
-mapper mapper.py \
-reducer reducer.py"

Output:

[root@sandbox python]# ./run_python.sh 
+ export LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ su hdfs - -m -c 'hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar     -files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py     -D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver     -D mongo.auth.uri=mongodb://hadoop:[email protected]:27017/admin     -D mongo.input.uri=mongodb://hadoop:[email protected]:27017/hadoop.in     -D mongo.output.uri=mongodb://hadoop:[email protected]:27017/hadoop.out     -D mongo.job.verbose=true     -libjars /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar     -input /tmp/in     -output /tmp/out     -io mongodb     -inputformat com.mongodb.hadoop.mapred.MongoInputFormat     -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat     -mapper mapper.py     -reducer reducer.py'
packageJobJar: [] [/usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.2-2.jar] /tmp/streamjob7732112681113565020.jar tmpDir=null
15/09/24 13:38:38 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:38 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:39 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:39 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:41 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:41 INFO driver.cluster: No server chosen by PrimaryServerSelector from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:41 INFO driver.connection: Opened connection [connectionId{localValue:1, serverValue:1358}] to 127.0.0.1:27017
15/09/24 13:38:41 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=28894677}
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:2, serverValue:1359}] to 127.0.0.1:27017
15/09/24 13:38:42 INFO splitter.MongoSplitterFactory: Retrieved Collection stats:{ "ns" : "hadoop.in" , "count" : 100 , "size" : 148928 , "avgObjSize" : 1489 , "numExtents" : 3 , "storageSize" : 172032 , "lastExtentSize" : 131072.0 , "paddingFactor" : 1.0 , "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only." , "userFlags" : 1 , "capped" : false , "nindexes" : 1 , "indexDetails" : { } , "totalIndexSize" : 8176 , "indexSizes" : { "_id_" : 8176} , "ok" : 1.0}
15/09/24 13:38:42 INFO driver.connection: Closed connection [connectionId{localValue:2, serverValue:1359}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:42 INFO mapred.MongoInputFormat: Using com.mongodb.hadoop.splitter.StandaloneMongoSplitter@1a43c7a0 to calculate splits. (old mapreduce API)
15/09/24 13:38:42 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:42 INFO splitter.StandaloneMongoSplitter: Running splitvector to check splits against mongodb://hadoop:[email protected]:27017/hadoop.in
15/09/24 13:38:42 INFO driver.cluster: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:3, serverValue:1360}] to 127.0.0.1:27017
15/09/24 13:38:42 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=27903847}
15/09/24 13:38:42 INFO driver.connection: Opened connection [connectionId{localValue:4, serverValue:1361}] to 127.0.0.1:27017
15/09/24 13:38:42 WARN splitter.StandaloneMongoSplitter: WARNING: No Input Splits were calculated by the split code. Proceeding with a *single* split. Data may be too small, try lowering 'mongo.input.split_size' if this is undesirable.
15/09/24 13:38:42 INFO splitter.MongoCollectionSplitter: Created split: min=null, max= null
15/09/24 13:38:42 INFO driver.connection: Closed connection [connectionId{localValue:4, serverValue:1361}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:43 INFO mapreduce.JobSubmitter: number of splits:1
15/09/24 13:38:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443100485659_0008
15/09/24 13:38:44 INFO impl.YarnClientImpl: Submitted application application_1443100485659_0008
15/09/24 13:38:44 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1443100485659_0008/
15/09/24 13:38:44 INFO mapreduce.Job: Running job: job_1443100485659_0008
15/09/24 13:38:52 INFO mapreduce.Job: Job job_1443100485659_0008 running in uber mode : false
15/09/24 13:38:52 INFO mapreduce.Job:  map 0% reduce 0%
15/09/24 13:39:01 INFO mapreduce.Job:  map 100% reduce 0%
15/09/24 13:39:09 INFO mapreduce.Job:  map 100% reduce 100%
15/09/24 13:39:09 INFO mapreduce.Job: Job job_1443100485659_0008 completed successfully
15/09/24 13:39:10 INFO mapreduce.Job: Counters: 49
File System Counters
    FILE: Number of bytes read=6506
    FILE: Number of bytes written=257301
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=376
    HDFS: Number of bytes written=3000
    HDFS: Number of read operations=3
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=1
Job Counters 
    Launched map tasks=1
    Launched reduce tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=5865
    Total time spent by all reduces in occupied slots (ms)=5166
    Total time spent by all map tasks (ms)=5865
    Total time spent by all reduce tasks (ms)=5166
    Total vcore-seconds taken by all map tasks=5865
    Total vcore-seconds taken by all reduce tasks=5166
    Total megabyte-seconds taken by all map tasks=1466250
    Total megabyte-seconds taken by all reduce tasks=1291500
Map-Reduce Framework
    Map input records=100
    Map output records=100
    Map output bytes=6300
    Map output materialized bytes=6506
    Input split bytes=376
    Combine input records=0
    Combine output records=0
    Reduce input groups=100
    Reduce shuffle bytes=6506
    Reduce input records=100
    Reduce output records=100
    Spilled Records=200
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=152
    CPU time spent (ms)=2150
    Physical memory (bytes) snapshot=295743488
    Virtual memory (bytes) snapshot=1995943936
    Total committed heap usage (bytes)=262909952
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters 
    Bytes Read=0
File Output Format Counters 
    Bytes Written=0
15/09/24 13:39:10 INFO streaming.StreamJob: Output directory: /tmp/out
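
Checking the output collection afterwards confirms it is empty; a quick count with pymongo (a sketch, using the same credentials as the job):

# check_out.py -- sketch: count the documents in the output collection,
# authenticating against the same 'admin' database as the job above.
from pymongo import MongoClient

client = MongoClient('mongodb://hadoop:[email protected]:27017/admin')
print(client['hadoop']['out'].count())  # prints 0 here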

Using essentially the same script, but writing the output as BSON via BSONFileOutputFormat instead, works though:

[root@sandbox python]# ./run_python_bson_output.sh
+ export LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ LIBJARS=/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar
+ su hdfs - -m -c 'hadoop jar /usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming.jar     -files /home/hdfs/example/video/python/mapper.py,/home/hdfs/example/video/python/reducer.py     -D stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver     -D mongo.auth.uri=mongodb://hadoop:[email protected]:27017/admin     -D mongo.input.uri=mongodb://127.0.0.1:27017/hadoop.in     -D mongo.job.verbose=true     -D mapreduce.output.fileoutputformat.outputdir=/tmp/output.bson     -libjars /usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-core-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-hadoop-streaming-1.4.0.jar,/usr/hdp/2.2.4.2-2/hadoop/lib/mongo-java-driver-3.0.2.jar     -input /tmp/in     -output /tmp/videos_streaming     -io mongodb     -inputformat com.mongodb.hadoop.mapred.MongoInputFormat     -outputformat com.mongodb.hadoop.mapred.BSONFileOutputFormat     -mapper mapper.py     -reducer reducer.py'
packageJobJar: [] [/usr/hdp/2.2.4.2-2/hadoop-mapreduce/hadoop-streaming-2.6.0.2.2.4.2-2.jar] /tmp/streamjob3257949526000997018.jar tmpDir=null
15/09/24 13:38:00 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:00 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:00 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/24 13:38:00 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/24 13:38:01 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:01 INFO driver.cluster: No server chosen by PrimaryServerSelector from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:1, serverValue:1352}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=24906864}
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:2, serverValue:1353}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO splitter.MongoSplitterFactory: Retrieved Collection stats:{ "ns" : "hadoop.in" , "count" : 100 , "size" : 148928 , "avgObjSize" : 1489 , "numExtents" : 3 , "storageSize" : 172032 , "lastExtentSize" : 131072.0 , "paddingFactor" : 1.0 , "paddingFactorNote" : "paddingFactor is unused and unmaintained in 3.0. It remains hard coded to 1.0 for compatibility only." , "userFlags" : 1 , "capped" : false , "nindexes" : 1 , "indexDetails" : { } , "totalIndexSize" : 8176 , "indexSizes" : { "_id_" : 8176} , "ok" : 1.0}
15/09/24 13:38:02 INFO driver.connection: Closed connection [connectionId{localValue:2, serverValue:1353}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:02 INFO mapred.MongoInputFormat: Using com.mongodb.hadoop.splitter.StandaloneMongoSplitter@6e2cc310 to calculate splits. (old mapreduce API)
15/09/24 13:38:02 INFO driver.cluster: Cluster created with settings {hosts=[127.0.0.1:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
15/09/24 13:38:02 INFO splitter.StandaloneMongoSplitter: Running splitvector to check splits against mongodb://127.0.0.1:27017/hadoop.in
15/09/24 13:38:02 INFO driver.cluster: No server chosen by ReadPreferenceServerSelector{readPreference=primary} from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, all=[ServerDescription{address=127.0.0.1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
15/09/24 13:38:02 INFO driver.connection: Opened connection [connectionId{localValue:3, serverValue:1354}] to 127.0.0.1:27017
15/09/24 13:38:02 INFO driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=127.0.0.1:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=32114805}
15/09/24 13:38:03 INFO driver.connection: Opened connection [connectionId{localValue:4, serverValue:1355}] to 127.0.0.1:27017
15/09/24 13:38:03 WARN splitter.StandaloneMongoSplitter: WARNING: No Input Splits were calculated by the split code. Proceeding with a *single* split. Data may be too small, try lowering 'mongo.input.split_size' if this is undesirable.
15/09/24 13:38:03 INFO splitter.MongoCollectionSplitter: Created split: min=null, max= null
15/09/24 13:38:03 INFO driver.connection: Closed connection [connectionId{localValue:4, serverValue:1355}] to 127.0.0.1:27017 because the pool has been closed.
15/09/24 13:38:03 INFO mapreduce.JobSubmitter: number of splits:1
15/09/24 13:38:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1443100485659_0007
15/09/24 13:38:03 INFO impl.YarnClientImpl: Submitted application application_1443100485659_0007
15/09/24 13:38:03 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1443100485659_0007/
15/09/24 13:38:03 INFO mapreduce.Job: Running job: job_1443100485659_0007
15/09/24 13:38:12 INFO mapreduce.Job: Job job_1443100485659_0007 running in uber mode : false
15/09/24 13:38:12 INFO mapreduce.Job:  map 0% reduce 0%
15/09/24 13:38:20 INFO mapreduce.Job:  map 100% reduce 0%
15/09/24 13:38:28 INFO mapreduce.Job:  map 100% reduce 100%
15/09/24 13:38:28 INFO mapreduce.Job: Job job_1443100485659_0007 completed successfully
15/09/24 13:38:28 INFO mapreduce.Job: Counters: 49
File System Counters
    FILE: Number of bytes read=6506
    FILE: Number of bytes written=256757
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=336
    HDFS: Number of bytes written=3600
    HDFS: Number of read operations=5
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters 
    Launched map tasks=1
    Launched reduce tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=6144
    Total time spent by all reduces in occupied slots (ms)=5032
    Total time spent by all map tasks (ms)=6144
    Total time spent by all reduce tasks (ms)=5032
    Total vcore-seconds taken by all map tasks=6144
    Total vcore-seconds taken by all reduce tasks=5032
    Total megabyte-seconds taken by all map tasks=1536000
    Total megabyte-seconds taken by all reduce tasks=1258000
Map-Reduce Framework
    Map input records=100
    Map output records=100
    Map output bytes=6300
    Map output materialized bytes=6506
    Input split bytes=336
    Combine input records=0
    Combine output records=0
    Reduce input groups=100
    Reduce shuffle bytes=6506
    Reduce input records=100
    Reduce output records=100
    Spilled Records=200
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=177
    CPU time spent (ms)=2220
    Physical memory (bytes) snapshot=296923136
    Virtual memory (bytes) snapshot=1996275712
    Total committed heap usage (bytes)=262746112
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters 
    Bytes Read=0
File Output Format Counters 
    Bytes Written=3600
15/09/24 13:38:28 INFO streaming.StreamJob: Output directory: /tmp/videos_streaming

Even restoring the resulting BSON back into MongoDB works.
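
The restore can also be sketched with pymongo's bson module (illustrative file and collection names; assumes pymongo >= 3.0 for insert_many):

# restore_bson.py -- sketch: load the job's BSON output back into MongoDB,
# after copying a part file out of /tmp/videos_streaming on HDFS.
from bson import decode_file_iter
from pymongo import MongoClient

client = MongoClient('mongodb://hadoop:[email protected]:27017/admin')
coll = client['hadoop']['out']
with open('part-00000.bson', 'rb') as f:
    coll.insert_many(decode_file_iter(f))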
