[ome-devel] on the use of PIDs and OMERO

Johan Henriksson mahogny at areta.org
Wed Feb 12 17:50:24 GMT 2014


>>> Second, and this is where I need to pick your brain a bit: for
>>> documentation purposes a PID is not sufficient. We also need to cite a
>>> cryptographic hash value. We are proposing an extension to CSL-JSON
>>> that lets you store a list of 3-tuples:
>>>
>>> (hash algorithm, type of hash, value)
>>> e.g.
>>> (sha512, image, XXX)
>>> (sha256, rawfile, XXX)
>>>
>>> We are working on specialized *file format independent* hashes that
>>> will allow for re-compression of data in the future, something that I
>>> expect/hope will happen one day. FASTQ is not a very efficient format,
>>> and nor are many image file formats... Or we may just want to convert
>>> to OME-TIFF. But here is the thing: there are algorithms for 2D images,
>>> and there it is rather straightforward, but is there a method for nD
>>> microscopy images? I think creating and promoting one would be a task
>>> that fits OME very well, given that you should have the best idea of
>>> what an image contains these days. Maybe one would drop much of the
>>> metadata, since it is hard to deal with, and just focus on pixel
>>> data... But even then, coming up with a hash method is a bit of work
>>> given SPIM and all the upcoming methods. I have contemplated whether
>>> one should extend the above to (hash algorithm, type of hash, hash
>>> parameters, value), so that one could additionally record the order in
>>> which dimensions are considered for the hash. But here I'm all ears!
>>>
>>
>> I'll leave this for the moment in case anyone else has concrete
>> suggestions. In general, though, yes, hashing and, more broadly, data
>> slice referencing is an interesting problem that's getting more
>> complicated every day!
>>
>
> Hashing the hypervolume shouldn't be any more difficult than hashing an
> image plane so long as you do it in the specified dimension order and a
> consistent direction (y *cough*).  It's still effectively a linear byte
> array, with the caveat that it needs support for higher dimensions for
> SPIM etc.  This is doable right now with the modulo annotations, and
> with proper NDIM support it would become even simpler.  Lossy formats
> such as JPEG may be a bit more challenging, since we would need to hash
> the compressed data rather than the logical uncompressed pixels: we
> can't guarantee the exact values.
>


Hi!

Thanks for pointing out JPEG - that is truly a headache. But I realize the
solution here may be easier than it first appears: it is *unlikely* that
anyone would take lossy-compressed data and re-encode it in another format,
since that would incur further losses. Thus just using the raw-file hash is
the solution in this case, entirely ignoring the fact that it is image data
(the library already supports this, of course).
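
To make the raw-file fallback concrete, here is a minimal sketch of how
one of the proposed (hash algorithm, type of hash, value) 3-tuples could
be produced for a lossy file; the function name is mine, not from any
existing library:

    import hashlib

    def rawfile_hash_tuple(path, algorithm="sha256"):
        """Hash the raw on-disk bytes and return a
        (hash algorithm, type of hash, value) 3-tuple.
        For lossy formats such as JPEG this is the fallback:
        hash the compressed file, not the decoded pixels."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return (algorithm, "rawfile", h.hexdigest())

    # e.g. rawfile_hash_tuple("plate1.jpg")
    #   -> ("sha256", "rawfile", "ab12...")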


>
> How would you cope with multi-series files?  Just concatenate all the
> series' pixel data together, as for the higher dimensions?
>


Indeed, this is where the headache begins, and I was hoping you could lend
a hand. For the dimensions you could otherwise use the "extra signing
information" to record that you hashed in, say, XYZWF order (whatever
dimensions you have). But if you have a loose set of series in a
multi-series file, then you need to come up with a "canonical order" -
unless the order you have actually matters. Quite frankly, you could hash
each series, sort the hashes, then feed them into another round of hashing.
This is partly the idea I have for handling FASTA files, where you cannot
assume that the order of the sequences will stay intact after re-storage
(e.g. there is no point in keeping the order of RNA-seq reads). In the
FASTA case, however, it is also a matter of performance: if one had to sort
the original sequences before hashing, one would quickly exhaust the RAM.
Likewise for image data: by being able to specify the processed pixel
order, one can vastly increase the hashing throughput. If someone
"re-stores" the images in another file format with a different pixel order
- well, only then will performance degrade, and that is the less common
scenario.
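
Roughly what I have in mind, as a sketch only - numpy arrays stand in here
for whatever pixel access the reader actually provides, and all names are
made up:

    import hashlib
    import numpy as np

    def series_digest(pixels, dim_order):
        """Digest of one series' pixels, hashed in a declared
        dimension order (e.g. "XYZCT"). The order string is
        folded into the hash so that it acts as the "extra
        signing information"."""
        h = hashlib.sha256()
        h.update(dim_order.encode("ascii"))
        # Force a well-defined row-major byte layout before hashing.
        h.update(np.ascontiguousarray(pixels).tobytes())
        return h.digest()

    def multiseries_digest(series_list, dim_order):
        """Order-independent digest over a set of series: hash
        each series, sort the digests, then feed them into one
        more round of hashing. Shuffling the series on
        re-storage yields the same final value."""
        digests = sorted(series_digest(s, dim_order) for s in series_list)
        h = hashlib.sha256()
        for d in digests:
            h.update(d)
        return h.hexdigest()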



> From the data provenance perspective, none of the LSID or hash features
> in the model guarantee anything, really.  They can be recomputed on
> demand, so they only ensure data integrity.  It would be useful to
> support some level of secure signature of the metadata which, if it
> included pixel data hashes, would allow the original data to be verified
> both for integrity and for being unchanged since initial writing.  In
> OME-TIFF, we could support a detached signature of the whole OME-XML
> block, stored in a separate TIFF tag and made with a suitable secure
> certificate/key on the originating system, with optional validation and
> verification on other systems at read time.
>

Metadata is one case where I think we may simply have to give up on finding
a data-format-independent hash - it boils down to creating a canonical
metadata formatting, and even you at OME have a tough time keeping up with
all the attributes people invent in the wild. Thus if one wants that level
of integrity, one might as well just use the raw-file hash. That said, I
would not be too afraid of excluding the metadata from the hash (the one
used for proving your work, not the one for data integrity) - if there is
anything people would be accused of tampering with, it is the pixels. Thus
we can push this problem into the future and get something that works well
enough today.
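
For completeness, the detached-signature idea above in miniature - signing
a pixel-hash manifest (metadata deliberately left out, as argued) with an
Ed25519 key via the third-party "cryptography" package, which is just one
possible choice:

    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
    )

    # The manifest: just the pixel-data hash tuples, serialized.
    manifest = b"sha256 image ab12...\nsha256 rawfile cd34...\n"

    private_key = Ed25519PrivateKey.generate()  # originating system
    signature = private_key.sign(manifest)      # detached signature

    # On a reading system, given the public key:
    public_key = private_key.public_key()
    public_key.verify(signature, manifest)      # raises on mismatch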

/Johan


-- 
-----------------------------------------------------------
Johan Henriksson, PhD
Karolinska Institutet
Ecobima AB - Custom solutions for life sciences
http://www.ecobima.com  http://mahogny.areta.org  http://www.endrov.net
