How to control the number of hadoop streaming output files

2013-10-11T16:51:03

Here are the details:

The input files are in the HDFS path /user/rd/input, and the HDFS output path is /user/rd/output. The input path contains 20,000 files, part-00000 through part-19999, each about 64 MB. What I want to do is write a Hadoop streaming job that merges these 20,000 files into 10,000 files.

Is there a way to merge these 20,000 files into 10,000 files using a Hadoop streaming job? Or, in other words, is there a way to control the number of output files a Hadoop streaming job produces?

Thanks in advance!
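For reference, the number of reduce-side output files equals the number of reducers, so an identity map/reduce job with the reducer count set to 10,000 should perform the merge. A minimal sketch, assuming the standard streaming jar (its exact path varies by installation) and the HDFS paths from the question:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=10000 \
        -input /user/rd/input \
        -output /user/rd/output \
        -mapper cat \
        -reducer cat

One caveat: the shuffle sorts lines by the portion of each line before the first tab, so line order within the merged output files will not match the original inputs.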

Copyright License:
Author: Charlie Lin. Reproduced under the CC BY-SA 4.0 license with a link to the original source & disclaimer.
Link: https://stackoverflow.com/questions/19313998/how-to-control-the-number-of-hadoop-streaming-output-files
