[ome-devel] Simple flat binary file format
Simone Leo
simleo at crs4.it
Fri Mar 27 08:23:49 GMT 2015
Hi there,
are you sure you need to design a new file format from scratch? To be
compliant with the Hadoop ecosystem you'd probably be better off using
Avro (http://avro.apache.org). That way, you'd only have to write the
schema, while taking advantage of an already optimized (and
Hadoop-ready) generic binary format.
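
To give an idea of how little is needed, here is a minimal sketch of what a schema for the key/value records described in the quoted message might look like. The record name, namespace and field names are made up for illustration. Note that Avro has no 16-bit or 8-bit integer types, so in this sketch the keys are stored as an array of "int" and the values as raw "bytes"; other mappings are possible.

```python
import json

# Hypothetical schema: the record name, namespace and field names are
# illustrative only and do not come from any existing project.
SCHEMA = {
    "type": "record",
    "name": "KeyedSeries",
    "namespace": "example.flatbinary",
    "fields": [
        # Avro has no int16 type, so each key component is widened to "int".
        {"name": "key", "type": {"type": "array", "items": "int"}},
        # The value array is stored as raw bytes; its element type and
        # shape would have to be recorded elsewhere (an assumption).
        {"name": "value", "type": "bytes"},
    ],
}

# Avro schemas are plain JSON, usually saved in a .avsc file.
schema_json = json.dumps(SCHEMA, indent=2)
print(schema_json)
```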
Regarding Python: Avro containers can be accessed with the official API;
in addition, I've recently finished adding Avro support to Pydoop
(http://crs4.github.io/pydoop), so now you can easily write Python
MapReduce applications that read and write Avro records
(http://crs4.github.io/pydoop/examples/avro.html).
I'm also working on something that's probably related to what you're
trying to achieve: Avro serialization/deserialization for Bio-Formats
images. Take a look at https://github.com/simleo/pydoop-features (work
in progress).
Hope this helps!
Simone
On 03/26/2015 06:14 PM, Kevin Mader wrote:
> So I am trying to come up with a good, simple binary file-format that
> works well with 'Big Data' platforms like Hadoop, Spark, and S3 (see
> issue https://github.com/scifio/scifio/issues/265). The idea is to keep
> the storage as simple as possible. A first implementation of such a
> format is shown here:
> https://github.com/thunder-project/thunder/tree/master/python/thunder/utils/data/fish/series
> It consists of a binary file accompanied by a conf.json file with the
> following contents:
>
> {
>   "valuetype": "uint8",
>   "nkeys": 3,
>   "keytype": "int16",
>   "dims": [
>     76,
>     87,
>     2
>   ],
>   "nvalues": 240,
>   "input": "key02_00000-key01_00000-key00_00000.bin"
> }
>
> Since Big Data platforms normally work with key-value pairs, the idea
> is to have a key consisting of several numbers (nkeys) of type
> (keytype) and then a value stored as an array of type (valuetype) with
> dimensions (dims). All of this is spread across multiple files so that
> they can easily be written and read in parallel (or from different
> machines on a shared file system).
>
> Does anyone have any suggestions for making a simple format around this?
> The best case would be to have something that could be easily read into
> or written from ImageJ, Matlab, Python, or whatever other tool is around
> with just a few lines of code and no dependencies.
>
> Thanks
> Kevin
>
>
> --
> ----
> Kevin Mader
> Mobile : +41 (0)78 755 14 38
> Office (PSI) : +41 (0)56 310 58 53
> Office (ETH) : +41 (0)44 633 61 86
> Home : +1 (503) 610-8754
> WBBA 213
> 5232 Villigen PSI
>
>
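
For comparison, the flat layout described above can be read with nothing but the Python standard library. This is only a rough sketch under assumptions that the quoted message does not confirm: each record is taken to be nkeys keys immediately followed by nvalues values, little-endian, with no padding. The function name and the type-code table are hypothetical.

```python
import struct

# Maps type names as they appear in conf.json to struct format characters.
# Only a handful of types are covered; this table is an assumption.
STRUCT_CODES = {
    "uint8": "B", "int16": "h", "uint16": "H",
    "int32": "i", "float32": "f", "float64": "d",
}


def read_records(conf, raw):
    """Yield (keys, values) tuples from a bytes object.

    Assumes each record is nkeys keys followed by nvalues values,
    little-endian and unpadded. The real layout may differ.
    """
    nkeys = conf["nkeys"]
    fmt = "<%d%s%d%s" % (
        nkeys, STRUCT_CODES[conf["keytype"]],
        conf["nvalues"], STRUCT_CODES[conf["valuetype"]],
    )
    record = struct.Struct(fmt)
    # iter_unpack raises struct.error if len(raw) is not a whole
    # number of records.
    for fields in record.iter_unpack(raw):
        yield fields[:nkeys], fields[nkeys:]


# Tiny synthetic example: two records with 3 int16 keys and 4 uint8 values.
demo_conf = {"valuetype": "uint8", "nkeys": 3, "keytype": "int16", "nvalues": 4}
demo_raw = (struct.pack("<3h4B", 0, 0, 0, 1, 2, 3, 4)
            + struct.pack("<3h4B", 1, 0, 0, 5, 6, 7, 8))
records = list(read_records(demo_conf, demo_raw))
```

With a real data set, demo_conf would come from json.load() on conf.json and demo_raw from reading the .bin file in binary mode. Byte order is not stated in conf.json, so it would have to be agreed on separately.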
--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo at crs4.it
http://www.crs4.it