[ome-devel] Storing single plane OME-TIFF files (Was: Zeiss 710)

Rubén Muñoz ruben.munoz at embl.de
Thu Oct 21 10:57:28 BST 2010


Hello Andrew, 

I appreciate that user considerations have a big influence on the project's evolution. That is a key to success.
As you describe well, HCS is one among the wide variety of OME-TIFF applications, and the circumstances that I reported to this list are occasional. The reason is that many of us use the list as a place for bug reports and enhancement requests.

For one year I have used the format and tools with admiration. I find that they save us time and serve as a reference for those who plan to generate digital images and metadata.
With the current specification, and as imaging systems evolve, it will become more likely that users run into storage issues. But my motivation as part of the community is to help develop a general solution, and not to fall into the temptation of making distinctions between different throughputs.

Having understood the severity of metadata loss, some of us are willing to take the risk. I mean that the master/slave approach is a fair user choice: the more you risk, the more space you save.
And I can't deny that I have been removing the metadata from some files. One OME-TIFF with metadata is enough to open the dataset, but this has many disadvantages and I do not recommend it to anyone.

Having said that, I offer the best of my time for testing and reporting, because the resources that we have for development are not comparable to the ones that we have for generating data.

Best regards,

Rubén

On Oct 20, 2010, at 2:52 PM, Andrew Patterson wrote:

> Hello Ruben,
> 
> Thanks for your work with us and all the sample data you have provided over the past few months. It is always very useful to see the kind of data people wish to store: while we can think about what data people will have, there is nothing like real data to test our model with.
> 
> In my reply I hope you do not mind if I open out the discussion to the general case. Some of this will reply to your problem; other areas, I believe from your mails with Melissa and Curtis, are solutions you have already rejected for your own reasons. This is understandable, as we are aiming to provide a general solution, so it cannot be the ideal solution in all cases.
> 
> The growth of imaging systems and dataset sizes since our first OME-XML model in 2003 has been impressive and we have been expanding OME-XML as we have gone along to encompass new meta-data like the Screen/Plate/Well extension. This has in turn allowed the image collections to grow and grow. In theory the size of an OME-XML file is mainly limited by the size of the underlying file system it is stored on, but OME-XML is not the most efficient way of representing binary data. 
> 
> An alternative storage solution for the binary part of the data is desirable, and in this case our solution was OME-TIFF. This had two key advantages. The first is that it can be viewed as "just a TIFF", making it familiar, useful and acceptable to many people. The second is that, like OME-XML, a TIFF file can contain multiple image planes in one file. This allowed us to continue the "everything in one place" approach of keeping an image and its metadata together. 
> 
> OME-TIFF does have a problem, however: the TIFF file format uses 32-bit offsets and, as such, a file is limited to 4 gigabytes. That was a massive file when the format was designed in the late 1980s. 
> 
> To get round this limitation we introduced the idea of multi-part OME-TIFFs. This allows the binary image plane data to be stored in 2 or more OME-TIFF files. Each of these files contains the full metadata and pointers to the location of all the files containing the rest of the image planes. This allows any file to be opened, the completeness of the data to be detected, and absent parts hopefully to be located.
> 
> (Another solution to the 4 gigabytes limit is a BigTIFF variant of OME-TIFF but I will not deal with that here.)
> 
> As is often the case, as soon as you introduce flexibility into a solution, in this case the number of files, different use cases will pull the best solution in different directions. The way our solution was designed, we favour multiple image planes stored in each OME-TIFF, producing a smaller number of larger files. The reasons for this are varied but are based on our experience and the problems people can have managing vast numbers of files on the file system. 
> 
> While it is proper and VALID to store a single plane in each OME-TIFF, it can make for an unwieldy dataset and, as you have noticed, is not the most space-efficient way to store the data. The reason for this is two-fold. Firstly, if a file is used for each plane, the location of every single plane has to be listed individually in a TiffData node. The way this node is designed to work is to point to the location of the first image plane and use the "PlaneCount" attribute (old name "NumPlanes") to say how many planes to read in sequence from that starting point in the TIFF structure. The use of multi-plane TIFFs allows for a much smaller number of TiffData nodes. Secondly, because of our approach of always keeping metadata with its image data, this metadata gets duplicated in each file. This was a decision we have taken as a general principle; of course anything can be up for review.
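To make the first point concrete, here is a minimal Python sketch that builds the TiffData nodes for one stack split either one plane per file or many planes per file, and estimates how many bytes of TiffData XML end up on disk when every file carries the full list. The plane count of 260, the 65 planes per file, the file names and the random UUIDs are all hypothetical, and the sketch only varies Z; it is an illustration of the bookkeeping, not how any OME writer is implemented.

```python
import uuid
import xml.etree.ElementTree as ET


def tiffdata_nodes(n_planes, planes_per_file):
    """Build one TiffData node per file for n_planes consecutive Z planes."""
    nodes = []
    for file_index, first_z in enumerate(range(0, n_planes, planes_per_file)):
        # PlaneCount says how many planes to read in sequence from IFD 0.
        count = min(planes_per_file, n_planes - first_z)
        node = ET.Element(
            "TiffData",
            {"IFD": "0", "FirstZ": str(first_z), "PlaneCount": str(count)},
        )
        # Hypothetical file name and a freshly generated UUID per file.
        ref = ET.SubElement(
            node, "UUID", {"FileName": f"part_{file_index:04d}.ome.tiff"}
        )
        ref.text = f"urn:uuid:{uuid.uuid4()}"
        nodes.append(node)
    return nodes


def duplicated_bytes(nodes):
    """Bytes of TiffData XML on disk when every file repeats the full list."""
    per_file = sum(len(ET.tostring(node)) for node in nodes)
    return per_file * len(nodes)


single_plane = tiffdata_nodes(260, 1)   # one plane per file
multi_plane = tiffdata_nodes(260, 65)   # 65 planes per file

print(len(single_plane), duplicated_bytes(single_plane))
print(len(multi_plane), duplicated_bytes(multi_plane))
```

Because the list of nodes grows with the number of files and is also repeated in every file, the total grows roughly with the square of the file count in the single-plane layout.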
> 
> So what can be done to reduce the size of your data if you need for some reason to use single plane TIFFs? 
> Well, under the current schema your options are limited. You can tinker with the exact structure of the "TiffData" nodes and "UUID" nodes. For example, instead of having 'IFD="0"' in every "TiffData" node, it can be omitted as the attribute defaults to 0. A more drastic change is to omit the FileName attribute from the "UUID" node. This still produces valid files, as the key value for piecing a file set together on import should be the UUID; the FileName is just a hint where to look first. An importer should scan the other files in the same folder looking for missing parts based on the UUID. This approach allows file sets to survive renaming. This file size optimisation does of course make import of the dataset less efficient.
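The lookup described above could be sketched in Python as follows. This is an assumption-laden illustration, not the behaviour of any particular importer: the `headers` dictionary stands in for "the OME-XML already read from each TIFF in the folder" (reading the TIFF itself is left out), namespaces are omitted for brevity, and the file names and short UUIDs are made up.

```python
import xml.etree.ElementTree as ET


def file_uuid(ome_xml):
    """Return the UUID attribute of the root OME node, or None if absent."""
    return ET.fromstring(ome_xml).get("UUID")


def locate(uuid_node, headers):
    """Find the file a UUID node points to.

    The FileName attribute is only a hint: it is tried first, and if it is
    missing or stale every header in the folder is scanned for the UUID.
    """
    wanted = uuid_node.text.strip()
    hint = uuid_node.get("FileName")
    if hint in headers and file_uuid(headers[hint]) == wanted:
        return hint  # cheap path: the hint was correct
    for filename, ome_xml in headers.items():
        if file_uuid(ome_xml) == wanted:
            return filename  # slower path: found by scanning
    return None  # the part is absent from this folder


# Hypothetical folder contents after the files were renamed.
headers = {
    "renamed_a.ome.tiff": '<OME UUID="urn:uuid:aaaa"/>',
    "renamed_b.ome.tiff": '<OME UUID="urn:uuid:bbbb"/>',
}

stale_hint = ET.fromstring('<UUID FileName="b.ome.tiff">urn:uuid:bbbb</UUID>')
no_hint = ET.fromstring("<UUID>urn:uuid:aaaa</UUID>")
absent = ET.fromstring("<UUID>urn:uuid:cccc</UUID>")

print(locate(stale_hint, headers))
print(locate(no_hint, headers))
print(locate(absent, headers))
```

The scan is what makes the import less efficient when FileName is dropped: without a usable hint, every file in the folder may have to be opened to find one part.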
> 
> So what changes could be made to the schema to increase the efficiency of storage using only single plane OME-TIFF files?
> First I must say these suggestions largely go against our goal of always keeping the metadata with the binary image data. But this is something that may need to be revised.
> 
> One possible solution is to add a MultiPart attribute to the top level OME node and strip almost all of the OME-XML from all the single plane TIFFs. This metadata would then reside in a master file and look much as it does now. The new attribute MultiPart, if set, would hold the UUID of the master file for the set the file is part of.
> 
> This would reduce the OME-XML in the TIFF header of each single plane file to a single empty OME node:
>    <OME xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>    UUID="urn:uuid:707e7b82-155a-43db-9f3d-1af4e6a212f8"
>    MultiPart="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549"
>    xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/20??-?? http://www.openmicroscopy.org/Schemas/OME/20??-??/ome.xsd"
>    xmlns="http://www.openmicroscopy.org/Schemas/OME/20??-??"/>
> 
> The Master file would start:
>    <OME xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>    UUID="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549"
>    MultiPart="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549"
>    xsi:schemaLocation="http://www.openmicroscopy.org/Schemas/OME/20??-?? http://www.openmicroscopy.org/Schemas/OME/20??-??/ome.xsd"
>    xmlns="http://www.openmicroscopy.org/Schemas/OME/20??-??">
>      <Plate...  
> Followed by the rest of the metadata as it is now. You can tell it is the master as the UUID of the file and the UUID of the MultiPart are the same.
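As a rough illustration of the rule just described, the following Python sketch classifies a file from its root OME node. MultiPart is the attribute proposed in this message and is not part of any released schema; the function name and the "standalone" label are my own, and the namespace declarations are left out of the sample strings for brevity. The UUIDs are the ones from the example above.

```python
import xml.etree.ElementTree as ET


def multipart_role(ome_xml):
    """Classify a file as 'master', 'part' or 'standalone'.

    Uses the proposed (hypothetical) MultiPart attribute: a file is the
    master when its own UUID equals the UUID stored in MultiPart.
    """
    root = ET.fromstring(ome_xml)
    own = root.get("UUID")
    master = root.get("MultiPart")
    if master is None:
        return "standalone"
    return "master" if own == master else "part"


part_xml = (
    '<OME UUID="urn:uuid:707e7b82-155a-43db-9f3d-1af4e6a212f8" '
    'MultiPart="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549"/>'
)
master_xml = (
    '<OME UUID="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549" '
    'MultiPart="urn:uuid:4062f7ac-dc41-11df-abaf-774b41a01549">'
    "<Plate/></OME>"
)

print(multipart_role(part_xml))
print(multipart_role(master_xml))
```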
> 
> This is a clean and ruthless culling of all the metadata in the individual files. The resulting single files in isolation do not have even enough metadata to display the image properly.
> 
> 
> A second possible solution is to split the data on an Image basis. This would not be as space efficient as the above solution but would at least leave some of the metadata intact in the files.
> 
> Think of the metadata as belonging to a few distinct types. 
>    Screen/Plate/Well description
>    Project/Dataset description
>    Instrument description
>    Image description
> 
> In the sample Plate data you sent there are 260 Images. The data for all of these images is in each file in the set. What we could do is alter the structure so there is a master file containing:
>    Screen/Plate/Well description
>    Project/Dataset description
>    Instrument description
>    Image description
> This master file would be largely identical to the example above.
> 
> Each single plane TIFF file would be marked as MultiPart as above but would have some metadata. It would contain the 'Image description' metadata for only the Image it is part of. This is of course a more complex set of files for an application to read and write. You would also need to have consistent IDs across all the files, and the metadata in individual single plane TIFFs may refer to IDs that are not present in their own metadata, for example, a light source only defined in the Master file.
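The dangling-ID problem can be sketched with a small Python check. It rests on a simplifying assumption that is mine, not the schema's: any element whose name ends in "Ref" and carries an ID attribute is treated as a reference, and any other element with an ID is treated as a definition. The sample XML is hypothetical and stripped of namespaces and most content.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def split_ids(ome_xml):
    """Return (defined IDs, referenced IDs) found in one OME-XML block."""
    defined, referenced = set(), set()
    for element in ET.fromstring(ome_xml).iter():
        identifier = element.get("ID")
        if identifier is None:
            continue
        # Simplification: '...Ref' elements are references, the rest define.
        if local_name(element.tag).endswith("Ref"):
            referenced.add(identifier)
        else:
            defined.add(identifier)
    return defined, referenced


def unresolved(part_xml, master_xml):
    """Report references in a part file that it cannot resolve by itself."""
    part_defined, part_refs = split_ids(part_xml)
    master_defined, _ = split_ids(master_xml)
    dangling = part_refs - part_defined
    return {
        "needs_master": dangling & master_defined,  # resolvable via master
        "missing": dangling - master_defined,       # defined nowhere
    }


master_xml = '<OME><Instrument ID="Instrument:0"/><Image ID="Image:0"/></OME>'
part_xml = (
    '<OME><Image ID="Image:0">'
    '<InstrumentRef ID="Instrument:0"/><DatasetRef ID="Dataset:9"/>'
    "</Image></OME>"
)

print(unresolved(part_xml, master_xml))
```

A reader of the part file alone would see only the dangling set; it takes the master file to tell which of those references are merely defined elsewhere and which are genuinely broken.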
> 
> These are just two possible suggestions. I would like to gauge the feeling of the community on this matter.
> It is a departure from our current position. While we have had the idea of a separate MetadataOnly file, it was not something we have encouraged.
> 
> How important is this kind of storage across a vast number of single plane files?
> 
> Thanks again for letting us look at your data and raising your use case with us. I would love to hear other people's thoughts and suggestions on this.
> 
> Cheers,
> 
> Andrew
> 
> 
> 
> 


