Is Hadoop SequenceFile binary safe?

2013-04-27T18:41:17

I read SequenceFile.java in the hadoop-1.0.4 source code and found the sync(long) method, which is used to find a "sync marker" (a 16-byte MD5 hash generated at file creation time) when a SequenceFile is divided into file splits in MapReduce.

/** Seek to the next sync mark past a given position.*/
public synchronized void sync(long position) throws IOException {
  if (position+SYNC_SIZE >= end) {
    seek(end);
    return;
  }

  try {
    seek(position+4);                         // skip escape
    in.readFully(syncCheck);
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        if (sync[j] != syncCheck[(i+j)%syncLen])
          break;
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE);     // position before sync
        return;
      }
      syncCheck[i%syncLen] = in.readByte();
    }
  } catch (ChecksumException e) {             // checksum failure
    handleChecksumException(e);
  }
}

This code simply looks for a byte sequence identical to the sync marker.

My concern:
Suppose the data in a SequenceFile happens to contain a 16-byte sequence identical to the sync marker. Won't the code above mistake those 16 bytes for a sync marker, so that the SequenceFile is parsed incorrectly?

I can't find any "escape" operation applied to the data or the sync marker. How can SequenceFile be binary safe? Am I missing something?
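
To make the concern concrete, here is a minimal standalone sketch of the same kind of scan over an in-memory byte array. It is not Hadoop code: the class name SyncScanSketch, the helper findMarker, the seed string, and the 64-byte buffer are all hypothetical, chosen only for illustration. It mirrors the ring-buffer comparison used in sync(long) and shows that a plain byte-wise search cannot tell a real marker from payload bytes that happen to equal it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; none of these names come from Hadoop.
class SyncScanSketch {

    /**
     * Returns the offset of the first occurrence of marker in data at or
     * after from, or -1 if there is none. The window array is used as a
     * ring buffer, similar to syncCheck in the quoted sync(long) method.
     */
    static int findMarker(byte[] data, byte[] marker, int from) {
        int n = marker.length;
        if (from < 0 || from + n > data.length) {
            return -1;
        }
        byte[] window = new byte[n];
        System.arraycopy(data, from, window, 0, n);
        int pos = from + n;                       // index of the next unread byte
        for (int i = 0; ; i++) {
            int j = 0;
            for (; j < n; j++) {
                if (marker[j] != window[(i + j) % n]) {
                    break;
                }
            }
            if (j == n) {
                return pos - n;                   // start of the matching bytes
            }
            if (pos >= data.length) {
                return -1;                        // ran out of input
            }
            window[i % n] = data[pos++];          // overwrite the oldest byte
        }
    }

    /** Hypothetical stand-in for the 16-byte MD5 marker created with the file. */
    static byte[] md5(String seed) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(seed.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] marker = md5("hypothetical-seed");
        // Pretend this buffer is record payload that happens to contain
        // the same 16 bytes as the marker, starting at offset 20.
        byte[] payload = new byte[64];
        System.arraycopy(marker, 0, payload, 20, marker.length);
        System.out.println(findMarker(payload, marker, 0));
    }
}
```

Running main reports offset 20, which is where the payload copy of the marker bytes begins. The scan has no way of knowing that those bytes were written as data rather than as a marker.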

Copyright License:
Author: Shawn H. Reproduced under the CC BY-SA 4.0 license, with a link to the original source and disclaimer.
Link to: https://stackoverflow.com/questions/16251110/is-hadoop-sequencefile-binary-safe
