[ome-devel] Fwd: File format for large data sets stored at multiple resolutions

Mark Tsuchida marktsuchida at gmail.com
Wed May 13 21:08:30 BST 2015


Hi Jason et al.

> On Tue, May 12, 2015 at 4:17 AM, Jason Swedlow (Staff) <j.r.swedlow at dundee.ac.uk> wrote:
>>  Johan’s
>> proposed pointer-based solution is, if I have things correct, already
>> implemented in Micro-Manager’s OME-TIFF, for exactly the reason described
>> (apologies to Nico and the MM team if I have this incorrect— please do
>> provide the accurate description of what is going on there if I have
>> failed).

This is correct - although the way Micro-Manager currently does this
is not really extensible.

>> Mindful of all this, OME's priority is building tools that are as useful,
>> generic and performant as possible.  In this case, that means developing a
>> format that works in many different domains and that includes support for
>> multi-res pixel sets, acquisition metadata, ROIs, trajectories, etc.

Since I'm on the writing side of things, I just wanted to bring up the
point that the optimizations needed for writing raw, acquired data to
files can be quite different from (and sometimes in conflict with) the
optimizations for efficient reading.

This might slightly veer off the topic of optimization for fast
visualization, but is probably relevant to the ideal of minimizing
format conversion of large datasets.

1. With sCMOS cameras producing >1 GB/s of data (and counting), there
is the issue of raw throughput. Formats that allow mostly sequential
writing (e.g. TIFF) are good, although to what extent that really
matters with modern hardware (SSDs, RAIDs, and RAIDs of SSDs) and
operating systems needs to be determined empirically. Formats that
transform the data before writing (e.g. converting image planes into
tiles for optimized reading) can become a bottleneck.
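
To make this concrete, here is a minimal sketch (plain Python with a
made-up frame size, frame count, and file name - not Micro-Manager's
actual writer) of the append-only pattern that keeps the disk
streaming: every write goes to the end of the file, and no plane is
ever revisited:

    import numpy as np

    FRAME_SHAPE = (2048, 2048)   # hypothetical 16-bit sCMOS frame
    N_FRAMES = 100

    def acquire_frame():
        # Stand-in for a real camera callback.
        return np.zeros(FRAME_SHAPE, dtype=np.uint16)

    offsets = []
    with open("acquisition.raw", "ab") as f:
        for _ in range(N_FRAMES):
            frame = acquire_frame()
            offsets.append(f.tell())   # remember where this plane starts
            f.write(frame.tobytes())   # purely sequential append; no seeks

A format that wants tiles on disk would instead have to buffer and
reorder each plane before writing, which is exactly the kind of
transformation that can fall behind the camera.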

2. Even if the final size of the dataset is known ahead of time (and
there are cases where requiring that is itself not ideal), the actual
experiment may skip data points or terminate prematurely. This means
that the format should not require the data description to be fixed
when the file is prepared for writing, which needs to be kept in mind
when designing data structures for indexing and metadata.
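
One way to satisfy this (a sketch of a hypothetical record layout, not
an existing format) is to make every plane self-describing, so the file
never has to declare its final shape up front, and skipped or missing
time points simply never appear:

    import struct

    RECORD_MAGIC = 0x504C4E45  # arbitrary invented marker ("PLNE")

    def write_plane(f, c, z, t, pixels: bytes):
        # Five little-endian uint32s: marker, channel, slice, time
        # point, payload size - followed by the raw pixel bytes.
        header = struct.pack("<IIIII", RECORD_MAGIC, c, z, t, len(pixels))
        f.write(header + pixels)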

3. And finally, robustness is especially important when writing live
data (as opposed to converting from another format), because any
corruption means permanent loss of data if your 48-hour experiment
crashes after 36 hours. This means that pixels and metadata should be
written as they are acquired, avoiding any essential
(non-reconstructible) data structure that needs to be finalized and
written at the end (XML containing metadata for all image planes is an
example of this, if the same data is not also recorded along the way
by other means). Built-in redundancies for important items (e.g.
offsets of the pixels and metadata) can be a big help. Simpler formats
are probably easier to repair when data corruption occurs, although
there are techniques (such as journaling) to facilitate recovery of
complex database-like formats.
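
Continuing the hypothetical record layout from the previous sketch:
because each plane carries its own marker and coordinates, the index
can be rebuilt by a single scan after a crash, so no end-of-file master
index is essential for recovery:

    import struct

    RECORD_MAGIC = 0x504C4E45  # same invented marker as above

    def recover_index(path):
        """Rebuild a {(c, z, t): pixel_offset} map from the records."""
        index = {}
        with open(path, "rb") as f:
            while True:
                header = f.read(20)
                if len(header) < 20:
                    break  # truncated tail from the crash; discard it
                magic, c, z, t, size = struct.unpack("<IIIII", header)
                if magic != RECORD_MAGIC:
                    break  # corrupted record; a real tool might resync
                index[(c, z, t)] = f.tell()  # pixel data starts here
                f.seek(size, 1)              # skip over the payload
        return index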

Of course, not all of these points are important for all experiments,
and it is also a good question whether it is better to try to design a
single format that is simultaneously optimized for recording and
retrieval, or whether separate formats (or subformats) are in order
(hopefully with just one conversion step between acquisition and
inspection/analysis).

Best,
Mark


-- 
Mark Tsuchida
Micro-Manager Team
UCSF Vale Lab





On Tue, May 12, 2015 at 5:44 AM, Lee Kamentsky <leek at broadinstitute.org> wrote:
> Chipping in my 2 cents since I will not be there. Our next release supports
> CellH5, and we have had very good luck with HDF5 in general. Pyramidal isn't
> quite free, but it is easily realizable through slicing, and perhaps CellH5
> can evolve to include pyramidal levels. Regarding the database vs.
> file-format debate, I have been considering how to make it a little more
> agnostic, especially as we enter the web-services era. It would be really
> useful for us to have wire-format standards for our community's data and
> possibly some standardization of web interfaces - we're going to be moving
> CellProfiler in that direction in the next couple of years.
>
> Some more notes here:
> https://github.com/imagej/imagej-server/issues/1
>
> --Lee
>
> On Tue, May 12, 2015 at 4:17 AM, Jason Swedlow (Staff)
> <j.r.swedlow at dundee.ac.uk> wrote:
>>
>> Hi Nico, Johan et al-
>>
>> Apologies for the delay in responding.  Nico’s email started a lot of
>> discussion in the OME team.  I saw Nico and Curtis Rueden a couple of days
>> after this was sent at AQLM
>> (http://www.mbl.edu/education/special-topics-courses/analytical-quantitative-light-microscopy/),
>> and we had a long discussion about this issue over chips, salsa, and PSFs.
>>
>> As many of you know, this has been a hot topic of late.  Several new and
>> some established technologies (LSFM, DigPath, HCS and others) are now
>> routinely generating heterogeneous (binary pixel data, metadata, and
>> analytics) multi-dimensional, multi-TB datasets.  Within OME, we’ve been
>> discussing how we approach this trend— whether we amend OME-TIFF, define a
>> new format (mindful of a lot of work by others, in particular CellH5
>> (http://www.cellh5.org/), BDV (http://fiji.sc/BigDataViewer), OpenSlide
>> (http://openslide.org/) and others), or just wait for someone else to
>> generate yet another file format (YAFF®) or in all likelihood several new
>> file formats (SNFFs®) and doggedly support them all in Bio-Formats.  Johan’s
>> proposed pointer-based solution is, if I have things correct, already
>> implemented in Micro-Manager’s OME-TIFF, for exactly the reason described
>> (apologies to Nico and the MM team if I have this incorrect— please do
>> provide the accurate description of what is going on there if I have
>> failed).
>>
>> AFAICT, there is a rather old, well-worn debate between the filesystem and
>> database camps on this issue.  To my mind, Jim Gray and colleagues captured
>> this tension most clearly and accurately in 2005
>> (http://research.microsoft.com/pubs/64537/tr-2005-10.pdf).  It’s almost
>> certainly true that both approaches will evolve side by side, and we (where
>> we = the community) should try to develop both solutions— they each have
>> their place and utility.  OME’s version of this is to develop a data spec
>> (e.g., OME-TIFF), an I/O library (Bio-Formats), and applications that use
>> the format (OMERO).  We are committed to this strategy for any spec we
>> develop.  That might explain, but not excuse, our rather slow approach on
>> this issue.  We insist on the spec and its many implementations.
>>
>> Mindful of all this, OME's priority is building tools that are as useful,
>> generic and performant as possible.  In this case, that means developing a
>> format that works in many different domains and that includes support for
>> multi-res pixel sets, acquisition metadata, ROIs, trajectories, etc.  In
>> discussing this with Nico and Curtis, we agreed the obvious— anything we
>> build has to be done in steps.  The overriding immediate problem that
>> several people face is support for large, multi-res, multi-D pixel sets.
>> The existing (partial) solutions are worth considering, but anything we do
>> must also support both Java and native environments— OME is committed to at
>> least bypassing, if not removing, the barriers between the Java and native
>> worlds, where we have the resources to do so.
>>
>> We’ve added this topic to the workshops at our upcoming Users meeting and
>> welcome input there
>> (https://www.openmicroscopy.org/site/community/minutes/meetings/10th-annual-users-meeting-june-2015).
>> It looks like we will have a very strong turnout for the meeting.  We’d
>> encourage anyone interested to join us there, but obviously also welcome
>> input on this list, Forums, etc.  We’ll report back with our plan for
>> addressing this very important point.
>>
>> As always, thanks for your support.
>>
>> Cheers,
>>
>> Jason
>>
>> --------------------
>>
>> Centre for Gene Regulation & Expression | Open Microscopy Environment |
>> University of Dundee
>>
>>
>>
>> Phone:  +44 (0) 1382 385819
>>
>> email: j.swedlow at dundee.ac.uk
>>
>>
>>
>> Web: http://www.lifesci.dundee.ac.uk/people/jason-swedlow
>>
>> Open Microscopy Environment: http://openmicroscopy.org
>>
>>
>>
>>
>>
>> From: Johan Henriksson <mahogny at areta.org>
>> Reply-To: OME Development <ome-devel at lists.openmicroscopy.org.uk>
>> Date: Wednesday, 6 May 2015 22:45
>> To: OME Development <ome-devel at lists.openmicroscopy.org.uk>
>> Subject: Re: [ome-devel] Fwd: File format for large data sets stored at
>> multiple resolutions
>>
>> Hi Nico!
>>
>> First of all, have you tried the pyramid compression of jpeg2000? (I have
>> not!)
>>
>> Second, the last time I tried large datasets in OME-TIFF it was a huge
>> issue. I tried to convert our 40 GB+ recordings to OME-TIFF and got
>> indexing times from hell (up to 10 minutes). I never had time to properly
>> investigate this. Part of the problem might be that I changed the
>> OME-writer to add JPEG-compressed data (since we already had a lot of
>> JPEGs, and I did not feel like converting those to PNGs).
>>
>> JPEG2000 pyramids would help you with huge 2D images but not with huge 5D
>> datasets. I believe the problem (anyone please correct me here) is that the
>> OME reader first indexes the TIFF file, but in the worst case does so by
>> going through the entire file to find where each plane is(?). This is in no
>> way fast in the current implementation, as I suspect it jumps through the
>> entire dataset: TIFFs are essentially a linked list of planes, so if your
>> output compressed file is 5 GB+ then this step alone is really slow.
>>
>> My solution to this would have been an extension with a special data
>> object containing pointers to all of the planes, in a single place, so
>> that very few reads would be needed to map all planes. But then I moved to
>> another lab and never had time to return to this. I still think it's a
>> problem/solution worth reconsidering: if a TIFF reader does not understand
>> such a special data object it would just ignore it, but specialized readers
>> could gain a lot of speed from reading it.
>>
>> cheers,
>> Johan
>>
>>
>>
>>
>>
>>
>>
>> On Mon, May 4, 2015 at 2:08 AM, Nico Stuurman <nico.stuurman at ucsf.edu>
>> wrote:
>>>
>>>
>>> Dear all,
>>>
>>> I have been running into more and more individual efforts to create new
>>> file formats to deal with large datasets that need to be stored at
>>> multiple resolutions to enable fast feedback to the user. Examples are
>>> the HDF5 format used by the BigDataViewer plugin by Tobias Pietzsch and
>>> Stephan Preibisch, the HDF5 format used by Chimera (a UCSF-based package,
>>> primarily for crystallography and EM, that also has amazing capabilities
>>> for 3D visualization of light microscopy data), the Micro-Manager
>>> SlideExplorer plugin, for which Arthur Edelstein developed his own
>>> storage system, and the Micro-Manager plugin "Magellan" that Henry
>>> Pinkard is developing right now, which also stores multiple-resolution
>>> versions of the data on disk. Doubtless, there are many more examples.
>>>
>>> Even when conversion between these formats is possible (as long as they
>>> are reasonably documented), conversion becomes time-consuming and takes
>>> up large amounts of disk space, simply because the data sets have become
>>> gigantic.  The reason why everyone designs their own format is also
>>> clear: there simply is no standard (at least that I am aware of; if
>>> there is, please do let me know!) that lets one store gigantic datasets
>>> with fast access to the data at multiple resolutions.
>>>
>>> Since you guys have created the standard in light microscopy with
>>> OME-TIFF, I assume that you have thoughts on what a new standard (HDF5
>>> based?) should look like.  In any case, I am very much looking forward
>>> to hearing your thoughts, and I will be happy to help avoid a wild growth
>>> of different formats that we will have to live with for years to come if
>>> we do not take action soon.
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>> --
>> -----------------------------------------------------------
>> Johan Henriksson, PhD
>> Karolinska Institutet / European Bioinformatics Institute (EMBL-EBI)
>> Labstory - Integrated laboratory documentation and databases
>> (www.labstory.se)
>> http://mahogny.areta.org  http://www.endrov.net
>>
>
>
>

