[ome-devel] Simple flat binary file format

Simone Leo simleo at crs4.it
Fri Mar 27 08:23:49 GMT 2015


Hi there,

are you sure you need to design a new file format from scratch?  To be 
compliant with the Hadoop ecosystem you'd probably be better off using 
Avro (http://avro.apache.org).  That way, you'd only have to write the 
schema, while taking advantage of an already optimized (and 
Hadoop-ready) generic binary format.
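
As a rough illustration of what "writing the schema" could amount to, 
here is a minimal sketch of an Avro schema for a keyed array record, 
built as a plain Python dict and dumped to JSON.  The record name, 
namespace and field names are hypothetical and not taken from any 
existing project.  Avro has no int16 or uint8 primitives, so this sketch 
assumes the keys are stored as Avro "int" and the pixel data as raw 
"bytes", with the element type kept in a separate string field.

```python
import json

# Hypothetical schema: record, namespace and field names are illustrative only.
SCHEMA = {
    "type": "record",
    "name": "KeyedArray",
    "namespace": "example.flatbinary",
    "fields": [
        # Avro "int" is 32-bit; int16 keys are widened (assumption).
        {"name": "key", "type": {"type": "array", "items": "int"}},
        {"name": "dims", "type": {"type": "array", "items": "int"}},
        # Element type of the raw buffer, e.g. "uint8" (assumption).
        {"name": "valuetype", "type": "string"},
        # Array contents as an opaque byte buffer.
        {"name": "value", "type": "bytes"},
    ],
}

# Avro schemas are plain JSON, so this string can be saved as an .avsc file.
schema_json = json.dumps(SCHEMA, indent=2)
print(schema_json)
```

The resulting JSON could then be handed to whichever Avro library you 
use to write the container files.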

Regarding Python: Avro containers can be accessed with the official API; 
in addition, I've recently finished adding Avro support to Pydoop 
(http://crs4.github.io/pydoop), so now you can easily write Python 
MapReduce applications that read and write Avro records 
(http://crs4.github.io/pydoop/examples/avro.html).

I'm also working on something that's probably related to what you're 
trying to achieve: Avro serialization/deserialization for Bio-Formats 
images.  Take a look at https://github.com/simleo/pydoop-features (work 
in progress).

Hope this helps!

Simone

On 03/26/2015 06:14 PM, Kevin Mader wrote:
> So I am trying to come up with a good, simple binary file-format that
> works well with 'Big Data' platforms like Hadoop, Spark, and S3 (see
> issue https://github.com/scifio/scifio/issues/265). The idea is to keep
> the storage as simple as possible and the first implementation of such a
> format is shown here
> https://github.com/thunder-project/thunder/tree/master/python/thunder/utils/data/fish/series
> It consists of the binary file accompanied by a conf.json file with the
> following contents
>
> {
>    "valuetype": "uint8",
>    "nkeys": 3,
>    "keytype": "int16",
>    "dims": [
>      76,
>      87,
>      2
>    ],
>    "nvalues": 240,
>    "input": "key02_00000-key01_00000-key00_00000.bin"
> }
>
> Since Big Data platforms normally work with key-value pairs, the idea
> would be to have a key consisting of several numbers (nkeys) of type
> (keytype) and a value stored as an array of type (valuetype) with
> dimensions (dims). All of this would be spread over multiple files so
> they can be easily written and read in parallel (or from different
> machines on a shared file system).
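
A dependency-free reader for a layout like this could be sketched with 
the standard struct module.  The record layout used here is an 
assumption, not something stated in the message: each record is taken 
to be nkeys keys of type keytype immediately followed by nvalues values 
of type valuetype, little-endian and without padding.  The function 
names and the type-name table are hypothetical.

```python
import struct

# Assumed mapping from conf.json type names to struct format characters.
STRUCT_CODES = {"uint8": "B", "int16": "h", "uint16": "H",
                "int32": "i", "float32": "f", "float64": "d"}

def record_struct(conf, byteorder="<"):
    """Build a Struct for one record: nkeys keys, then nvalues values.

    The keys-then-values layout and little-endian order are assumptions.
    """
    fmt = "%s%d%s%d%s" % (
        byteorder,
        conf["nkeys"], STRUCT_CODES[conf["keytype"]],
        conf["nvalues"], STRUCT_CODES[conf["valuetype"]],
    )
    return struct.Struct(fmt)

def iter_records(data, conf):
    """Yield (key, values) tuples from a bytes object holding whole records."""
    rec = record_struct(conf)
    nkeys = conf["nkeys"]
    for fields in rec.iter_unpack(data):
        yield fields[:nkeys], fields[nkeys:]

# The conf.json contents from the message, minus the "input" file name.
conf = {"valuetype": "uint8", "nkeys": 3, "keytype": "int16",
        "dims": [76, 87, 2], "nvalues": 240}

# Build two synthetic records in memory instead of reading a real .bin file.
rec = record_struct(conf)
payload = (rec.pack(0, 0, 0, *range(240)) +
           rec.pack(1, 0, 0, *([7] * 240)))

records = list(iter_records(payload, conf))
print(len(records), records[1][0])
```

For a real file, the bytes passed to iter_records would come from 
reading the .bin file named in conf.json in binary mode.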
>
> Does anyone have any suggestions for making a simple format around this?
> The best case would be to have something that could be easily read into
> or written from ImageJ, Matlab, Python, or whatever other tool is around
> with just a few lines of code and no dependencies.
>
> Thanks
> Kevin
>
>
> --
> ----
> Kevin Mader
> Mobile : +41 (0)78 755 14 38
> Office (PSI) : +41 (0)56 310 58 53
> Office (ETH) : +41 (0)44 633 61 86
> Home : +1 (503) 610-8754
> WBBA 213
> 5232 Villigen PSI
>
>

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo at crs4.it
http://www.crs4.it

