hadoop streaming produces uncompressed files despite mapred.output.compress=true

2014-05-21T02:56:15

I run a hadoop streaming job like this:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
       -Dmapred.reduce.tasks=16 \
       -Dmapred.output.compres=true \
       -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
       -input foo \
       -output bar \
       -mapper "python zot.py" \
       -reducer /bin/cat

I do get 16 files in the output directory which contain the correct data, but the files are not compressed:

$ hadoop fs -get bar/part-00012
$ file part-00012
part-00012: ASCII text, with very long lines
  1. Why is part-00012 not compressed?
  2. How do I get my data split into a small number (say, 16) of gzip-compressed files? (See the sketch below.)
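
One thing worth noting: in the command above the property is spelled mapred.output.compres (missing the final "s"), so it would not be recognized as the output-compression switch. Below is a sketch of the same invocation with the name spelled out in full, plus the newer mapreduce.output.fileoutputformat.* property names that Hadoop 2.x prefers; which spelling is actually honored depends on the Hadoop/CDH version, so treat this as a starting point rather than a verified fix:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar \
       -Dmapred.reduce.tasks=16 \
       -Dmapred.output.compress=true \
       -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
       -Dmapreduce.output.fileoutputformat.compress=true \
       -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
       -input foo \
       -output bar \
       -mapper "python zot.py" \
       -reducer /bin/cat

If compression does take effect, the 16 reducer outputs should show up as bar/part-00000.gz through bar/part-00015.gz, and hadoop fs -text (unlike hadoop fs -get followed by file) will decompress them transparently when reading.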

PS. See also "Using gzip as a reducer produces corrupt data"

PPS. This is for vw.

PPPS. I guess I can do hadoop fs -get, gzip, hadoop fs -put, hadoop fs -rm 16 times, but this seems like a very non-hadoopic way.
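
For completeness, here is that get/gzip/put/rm detour written out as a small shell loop. This is only a sketch: it assumes the 16 reducer outputs are named part-00000 through part-00015 under bar/, that one part file at a time fits on local disk, and it is precisely the non-hadoopic route the question is trying to avoid:

for i in $(seq -w 0 15); do
    f=part-000$i
    hadoop fs -get bar/$f .        # copy the uncompressed part file to local disk
    gzip $f                        # compress it locally, producing $f.gz
    hadoop fs -put $f.gz bar/      # upload the compressed copy next to the others
    hadoop fs -rm bar/$f           # delete the uncompressed original from HDFS
    rm -f $f.gz                    # clean up the local copy
done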

Copyright License:
Author: sds. Reproduced under the CC BY-SA 4.0 license, with a link to the original source and disclaimer.
Link: https://stackoverflow.com/questions/23767799/hadoop-streaming-produces-uncompressed-files-despite-mapred-output-compress-true
