Read Hadoop SequenceFile: weird hex number stream

2013-03-14T16:45:25

I am trying to convert part of a Hadoop SequenceFile into plain text with the following code:

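    import java.io.BufferedWriter;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // Body of a method that declares "throws IOException";
    // inputPath and outputPath are Strings.
    Configuration config = new Configuration();
    Path path = new Path(inputPath);

    // try-with-resources closes the reader and the writer even if a read fails.
    // FileOutputStream creates the output file if it does not exist yet.
    try (SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
         BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8))) {

        // Instantiate the key and value types recorded in the file header.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), config);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), config);

        // Dump at most the first 1000 records, using each Writable's toString().
        int count = 0;
        while (reader.next(key, value) && count < 1000) {
            bw.write("Key::: " + key);
            bw.newLine();
            bw.write("Value::: " + value);
            bw.newLine();
            bw.newLine();
            count++;
        }
    }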

The keys are converted properly. However, the values come out as a weird stream of hex numbers. A sample:

Value::: 1f 8b 08 00 00 00 00 00 00 03 e5 bd f9 7b 13 47 d6 28 fc 73 e6 79 e6 7f e8 28 17 6c 5f bc 68 5f 6c e4 5c 96 64 26 33 c9 24 37 cb bc ef 3b 0c 9f 9f 56 77 cb ee 58 96 34 5a 20 8e e3 3f 46 56 c2 10 30 c4 8b e4 4d 5e b1 6c 4b f2 22 59 b2 65 63 48 08 04 42 12 c2 9e 00 21 cb f3 9d 53 d5 2d b5 64 4b 16 33

The real stream is much longer than this. I know that the keys are stored as Hadoop Text and the values as Hadoop BytesWritable. The values might contain Chinese text, but I am not sure about this.

Does anybody know what is going on?
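
One observation about the sample, offered as a sketch and not as a confirmed diagnosis: `BytesWritable.toString()` prints the raw bytes as space-separated hex, and the sample starts with `1f 8b 08`, which is the gzip magic number followed by the deflate method byte. That suggests each value may be a gzip-compressed payload. The helper below uses only `java.util.zip`; the class name `GzipValueDecoder` and its methods are hypothetical, and decoding the result as UTF-8 is an assumption (Chinese text could equally be stored in another encoding such as GBK).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Hadoop.
final class GzipValueDecoder {

    /** True if the first two valid bytes are the gzip magic number 0x1f 0x8b. */
    static boolean looksLikeGzip(byte[] data, int length) {
        return length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    /**
     * Decodes the first {@code length} bytes of {@code data} as text.
     * Gzip payloads are decompressed first; anything else is decoded as-is.
     * UTF-8 is an assumption about the stored encoding.
     */
    static String decode(byte[] data, int length) {
        if (!looksLikeGzip(data, length)) {
            return new String(data, 0, length, StandardCharsets.UTF_8);
        }
        try (GZIPInputStream in =
                     new GZIPInputStream(new ByteArrayInputStream(data, 0, length))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo-only helper: gzip a string so the decoder can be tried without Hadoop. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Inside the read loop, the value could then be written with something like `GzipValueDecoder.decode(bytes.getBytes(), bytes.getLength())`, where `bytes` is the value cast to `BytesWritable`. The explicit length matters because `getBytes()` returns the backing array, which can be longer than the valid data.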

Copyright License:
Author: Yuhao. Reproduced under the CC BY-SA 4.0 license with a link to the original source and disclaimer.
Link to: https://stackoverflow.com/questions/15404622/read-hadoop-sequencefile-weird-hex-number-stream
