Hadoop SequenceFile size

2012-10-10T18:31:08

I'm creating a HashMap from the key/value pairs of a Hadoop Vector that is stored inside a SequenceFile. For efficiency, I want to know how many key/value pairs the Vector holds so that I can initialise the HashMap with the proper capacity.
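To make the goal concrete, here is a minimal sketch of how the HashMap could be pre-sized once the number of entries is known. It assumes the count is already available, which is exactly the piece this question asks how to obtain; the class name MapSizing and its methods are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class MapSizing {
    // HashMap's default load factor: the table grows once size exceeds capacity * loadFactor.
    static final float LOAD_FACTOR = 0.75f;

    // Smallest initial capacity that holds expectedEntries without triggering a resize.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / (double) LOAD_FACTOR);
    }

    // Hypothetical helper: an empty key-number -> URI map sized for expectedEntries.
    static Map<Integer, String> newKeyToUriMap(int expectedEntries) {
        return new HashMap<>(capacityFor(expectedEntries), LOAD_FACTOR);
    }

    public static void main(String[] args) {
        // Assumption: the entry count is known up front (46599 is the count seqdumper reports).
        int count = 46599;
        Map<Integer, String> keyToUri = newKeyToUriMap(count);
        System.out.println("requested capacity: " + capacityFor(count) + ", entries so far: " + keyToUri.size());
    }
}
```

Passing the raw entry count as the initial capacity would still cause one resize, because HashMap grows when it is 75% full by default; dividing by the load factor avoids that.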

I have used Mahout's seqdumper, which appends a Count at the end of each dumped Vector. I looked into its code, but it uses a simple iterative counter (counter++ for each row), so it isn't what I'm looking for.

SequenceFile.Metadata also looked promising, so I looked into it, but the debugger shows that it contains no entries.

Is there some other way to quickly get something like a .size() method for a Hadoop Vector inside a SequenceFile?

Edit: Here is the seqdumper output for the data I'm turning into a Map. Specifically, each key/value pair is an IntWritable / NamedVector pair. I wish to create a mapping from the key number to the URI String. There are 46599 key/value pairs in total, as reported by seqdumper at the end of the file.

Input Path: luceneVectors
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: http://data.artsholland.com/production/73adae07-78c6-4180-93a4-34802090b5f1:{22118:0.18376858424635545,20381:0.40144184831236357,53753:0.2605347739121081,51569:0.2578896608715637,21930:0.2277873354603338,63035:0.27765920678967304,36979:0.2709104089668357,68351:0.15788776111071648,19436:0.2988119565549418,17991:0.12435264873296237,10356:0.3276902508762499,3410:0.27239123806574506,62942:0.18961849195965186,32527:0.24827631823639457,69909:0.11723303910369048,19832:0.2138117449778048}
Key: 1: Value: http://data.artsholland.com/production/c9fcc92b-18bb-4bfb-af52-380707f8d0d7:{41167:0.07191351238480857,61391:0.07496730342220936,[...]
[...],19156:0.0687215948604245}
Count: 46599
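For illustration only, the desired key-number-to-URI mapping can be sketched against the text dump above. This works on seqdumper's text output rather than on the SequenceFile itself, so it does not answer the question; it only shows the shape of the result. The class DumpToMap, its methods, and the example.org URIs are hypothetical, and the line format is assumed to match the dump shown here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DumpToMap {
    // Matches dump lines such as "Key: 0: Value: http://host/path:{22118:0.18,...}".
    private static final Pattern ENTRY = Pattern.compile("^Key: (\\d+): Value: (.*?):\\{");
    // Matches the trailing "Count: N" line.
    private static final Pattern COUNT = Pattern.compile("^Count: (\\d+)$");

    // Returns the number on the last "Count:" line, or -1 if the dump has none.
    static int readCount(List<String> dumpLines) {
        for (int i = dumpLines.size() - 1; i >= 0; i--) {
            Matcher m = COUNT.matcher(dumpLines.get(i).trim());
            if (m.matches()) {
                return Integer.parseInt(m.group(1));
            }
        }
        return -1;
    }

    // Builds key number -> URI, pre-sizing the map from the Count line when present.
    static Map<Integer, String> keyToUri(List<String> dumpLines) {
        int count = readCount(dumpLines);
        int capacity = count > 0 ? (int) Math.ceil(count / 0.75) : 16;
        Map<Integer, String> result = new HashMap<>(capacity);
        for (String line : dumpLines) {
            Matcher m = ENTRY.matcher(line);
            if (m.find()) {
                result.put(Integer.parseInt(m.group(1)), m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the same format as the dump above.
        List<String> dump = Arrays.asList(
            "Key: 0: Value: http://example.org/a:{1:0.5,2:0.25}",
            "Key: 1: Value: http://example.org/b:{3:0.1}",
            "Count: 2");
        System.out.println(keyToUri(dump));
    }
}
```

Because the Count line comes last, the sketch needs the whole dump in memory before it can size the map, which is the same limitation that motivates asking for a size stored in the SequenceFile itself.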

Copyright License:
Author:「Calavoow」,Reproduced under the CC 4.0 BY-SA copyright license with link to original source & disclaimer.
Link to:https://stackoverflow.com/questions/12817252/hadoop-sequencefile-size

