Hadoop streaming to Python using mongo-hadoop

2013-04-10T00:35:29

I'm trying to get map-reduce working in Python via mongo-hadoop. Hadoop itself is working, Hadoop streaming is working with Python, and the mongo-hadoop adaptor is working. However, the mongo-hadoop streaming examples with Python aren't. When I try to run the example in streaming/examples/treasury, I get the following error:

    $user@host: ~/git/mongo-hadoop/streaming$ hadoop jar target/mongo-hadoop-streaming-assembly-1.0.1.jar -mapper examples/treasury/mapper.py -reducer examples/treasury/reducer.py -inputformat com.mongodb.hadoop.mapred.MongoInputFormat -outputformat com.mongodb.hadoop.mapred.MongoOutputFormat -inputURI mongodb://127.0.0.1/mongo_hadoop.yield_historical.in -outputURI mongodb://127.0.0.1/mongo_hadoop.yield_historical.streaming.out

    13/04/09 11:54:34 INFO streaming.MongoStreamJob: Running
    13/04/09 11:54:34 INFO streaming.MongoStreamJob: Init
    13/04/09 11:54:34 INFO streaming.MongoStreamJob: Process Args
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Setup Options'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: PreProcess Args
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Parse Options
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-mapper'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'examples/treasury/mapper.py'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-reducer'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'examples/treasury/reducer.py'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-inputformat'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'com.mongodb.hadoop.mapred.MongoInputFormat'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-outputformat'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'com.mongodb.hadoop.mapred.MongoOutputFormat'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-inputURI'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/mongo_hadoop.yield_historical.in'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: '-outputURI'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Arg: 'mongodb://127.0.0.1/mongo_hadoop.yield_historical.streaming.out'
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Add InputSpecs
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Setup output_
    13/04/09 11:54:34 INFO streaming.StreamJobPatch: Post Process Args
    13/04/09 11:54:34 INFO streaming.MongoStreamJob: Args processed.
    13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson
    13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson
    13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson
    13/04/09 11:54:36 INFO io.MongoIdentifierResolver: Resolving: bson
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/mapreduce/filecache/DistributedCache
        at org.apache.hadoop.streaming.StreamJob.setJobConf(StreamJob.java:959)
        at com.mongodb.hadoop.streaming.MongoStreamJob.run(MongoStreamJob.java:36)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
        at com.mongodb.hadoop.streaming.MongoStreamJob.main(MongoStreamJob.java:63)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.mapreduce.filecache.DistributedCache
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 10 more
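
The failing class looks telling: as far as I can tell, org.apache.hadoop.mapreduce.filecache.DistributedCache only exists in Hadoop 2.x (MR2) jars, while in Hadoop 1.x/MRv1 the class lives at org.apache.hadoop.filecache.DistributedCache. So my best guess is a mismatch between the Hadoop version the streaming assembly was built against and the one on my cluster's classpath. A quick check I can run (my own sketch, not from the mongo-hadoop docs) to see which variants the assembly jar actually bundles:

    #!/usr/bin/env python
    # check_jar.py -- list the DistributedCache classes bundled in the
    # streaming assembly, to see which Hadoop line it was built against.
    # A jar is just a zip archive, so the stdlib zipfile module can read it.
    import zipfile

    jar = zipfile.ZipFile('target/mongo-hadoop-streaming-assembly-1.0.1.jar')
    hits = [n for n in jar.namelist() if 'DistributedCache' in n]
    print('\n'.join(hits) if hits
          else 'no DistributedCache bundled; resolved from the cluster classpath')

If the only entry is org/apache/hadoop/filecache/DistributedCache (the Hadoop 1.x location), or there is none at all and the cluster classpath is MRv1, that would explain the NoClassDefFoundError above.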

If anyone could shed some light on this, it would be a big help.

Full info:

As far as I can tell, I needed to get the following four things working:

  1. Install and test Hadoop
  2. Install and test Hadoop streaming with Python (a word-count sketch of that test follows this list)
  3. Install and test mongo-hadoop
  4. Install and test mongo-hadoop streaming with Python
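
For step 2, "streaming with Python" just means plain scripts that read stdin and write tab-separated key/value pairs to stdout. What I tested is roughly the word-count pair from Michael Noll's tutorial (paraphrased from memory, not copied verbatim):

    #!/usr/bin/env python
    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print('%s\t%s' % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sum the counts per word (streaming delivers input sorted by key)
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip('\n').split('\t', 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))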

The short of it is that I've got everything working up to the fourth step. Using the Chef recipe at https://github.com/danielpoe/cloudera, I've got Cloudera 4 installed:

  1. Using the Chef recipe, Cloudera 4 has been installed, is up and running, and has been tested
  2. Using Michael Noll's blog tutorials, I tested Hadoop streaming with Python successfully
  3. Using the docs at mongodb.org, I was able to run both the treasury and UFO examples (with cdh4 set as the build target in build.sbt)
  4. I downloaded 1.5 hours' worth of Twitter data using the README for the twitter example in streaming/examples, and also tried the treasury example (the mapper and reducer it invokes are sketched below)
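
For reference, the mapper and reducer being passed to the job are the stock treasury example scripts. As I understand them from the repo (sketched here; the field names assume the yield_historical.in documents have a date _id and a bc10Year value), they use the pymongo_hadoop package's BSONMapper/BSONReducer wrappers, which handle the BSON framing over stdin/stdout:

    #!/usr/bin/env python
    # examples/treasury/mapper.py (approximately): group 10-year yields by year
    from pymongo_hadoop import BSONMapper

    def mapper(documents):
        for doc in documents:
            yield {'_id': doc['_id'].year, 'bc10Year': doc['bc10Year']}

    BSONMapper(mapper)

    #!/usr/bin/env python
    # examples/treasury/reducer.py (approximately): average the yields per year
    from pymongo_hadoop import BSONReducer

    def reducer(key, values):
        count, total = 0, 0.0
        for v in values:
            count += 1
            total += v['bc10Year']
        return {'_id': key, 'avg': total / count}

    BSONReducer(reducer)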

