Hadoop streaming with mongo-hadoop connector fails

2016-06-14T00:32:46

I created a job that reads a bunch of JSON files from HDFS and tries to load them into MongoDB using the mongo-hadoop connector. It's a map-only job, since I don't need any additional processing in the reduce step.

The script is written in Perl and deployed to all the nodes in the cluster along with its dependencies. For each input file, it writes a BSON-serialized version of the original JSON document to stdout in binary mode.
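The emission logic boils down to length-prefixed BSON documents on binary stdout. Here's a minimal self-contained sketch of that shape (not my actual script, which uses a proper BSON library; this hand-rolled encoder handles string fields only, just to make the framing explicit):

```python
import json
import struct
import sys

def bson_string_element(name, value):
    # BSON string element: 0x02 type byte, cstring key, int32 byte length
    # of the value (including its trailing NUL), UTF-8 bytes, trailing NUL.
    data = value.encode("utf-8") + b"\x00"
    return b"\x02" + name.encode("utf-8") + b"\x00" + struct.pack("<i", len(data)) + data

def bson_document(fields):
    # A BSON document is: int32 total size (the "length header" the
    # connector looks for), the elements, and a terminating NUL byte.
    body = b"".join(bson_string_element(k, v) for k, v in fields) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body

def main():
    # Entry point when piped input as a streaming mapper:
    # one JSON document per input line, BSON out on binary stdout.
    for line in sys.stdin:
        record = json.loads(line)
        fields = [(k, str(v)) for k, v in record.items()]
        sys.stdout.buffer.write(bson_document(fields))

# e.g. a one-field document framed correctly:
doc = bson_document([("body", '{"some": "json"}')])
assert struct.unpack("<i", doc[:4])[0] == len(doc)
```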

For some reason, the job fails with the following error:

Error: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to com.mongodb.hadoop.io.BSONWritable
    at com.mongodb.hadoop.streaming.io.MongoInputWriter.writeValue(MongoInputWriter.java:10)
    at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:106)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)

On top of that, I created a version of the same script in Python using the pymongo-hadoop package, and the job fails with the same error.

After digging a little deeper into the logs for the failed tasks, I discovered that the actual error is:

2016-06-13 16:13:11,778 INFO [Thread-12] com.mongodb.hadoop.io.BSONWritable: No Length Header available.java.io.EOFException
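My reading of that EOFException is that every BSON document starts with a 4-byte little-endian size prefix, and the reader on the connector side can't find one. A rough Python model of such a read loop (my reconstruction of the idea, not the connector's actual code):

```python
import struct
from io import BytesIO

def read_bson_docs(stream):
    # Rough model of how a BSON reader consumes a stream: each document
    # starts with a 4-byte little-endian int32 giving its total size.
    # If those 4 bytes can't be read, there is "no length header".
    docs = []
    while True:
        header = stream.read(4)
        if not header:
            break  # clean end of stream
        if len(header) < 4:
            raise EOFError("no length header available")
        (size,) = struct.unpack("<i", header)
        body = stream.read(size - 4)
        if len(body) < size - 4:
            raise EOFError("truncated BSON document")
        docs.append(header + body)
    return docs

# minimal valid document: 5-byte size prefix plus terminating NUL
empty_doc = struct.pack("<i", 5) + b"\x00"
assert read_bson_docs(BytesIO(empty_doc * 2)) == [empty_doc, empty_doc]
```

Feeding such a reader anything that isn't framed this way (plain text, for instance) fails almost immediately, which would match the symptom.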

The problem is that it fails silently: I've added some logging to the mapper, but from what I can tell the mapper never even gets called. This is how I'm invoking the job:

yarn jar /usr/hdp/2.4.0.0-169/hadoop-mapreduce/hadoop-streaming.jar \
    -libjars /opt/mongo-hadoop/mongo-hadoop-core-1.5.2.jar,/opt/mongo-hadoop/mongo-hadoop-streaming-1.5.2.jar,/opt/mongo-hadoop/mongodb-driver-3.2.2.jar \
    -D mongo.output.uri="${MONGODB}" \
    -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat \
    -jobconf stream.io.identifier.resolver.class=com.mongodb.hadoop.streaming.io.MongoIdentifierResolver \
    -io mongodb \
    -input "${INPUT_PATH}" \
    -output "${OUTPUT_PATH}" \
    -mapper "/opt/mongo/mongo_mapper.py"
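For what it's worth, the framing can be sanity-checked outside Hadoop by piping a line through the mapper and counting bytes. Here the inline python3 is a stand-in for mongo_mapper.py, emitting one minimal (empty, 5-byte) BSON document per input line, just to show the shape of the check:

```shell
# Stand-in for mongo_mapper.py: one minimal BSON document per line.
# With the real mapper, the first 4 bytes of the output should be a
# little-endian int32 equal to the emitted document's total size.
printf '{"a": 1}\n' |
python3 -c '
import struct, sys
for _ in sys.stdin:
    sys.stdout.buffer.write(struct.pack("<i", 5) + b"\x00")
' |
wc -c   # total bytes emitted; 5 for a single empty document
```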

What am I doing wrong? It seems there's no other way to get data from HDFS into MongoDB...

Copyright License:
Author: Tudor Marghidanu. Reproduced under the CC BY-SA 4.0 license with link to original source & disclaimer.
Link: https://stackoverflow.com/questions/37794896/hadoop-streaming-with-mongo-hadoop-connector-fails
