[ome-devel] on the use PIDs and OMERO

Roger Leigh rleigh at dundee.ac.uk
Wed Feb 12 16:03:25 GMT 2014


On 12/02/14 11:06, Josh Moore wrote:
>> In the library we are building in services to let it automatically
>> request/generate IDs of various kinds - I know you have LSID generation
>> already, but disabled, however this library could be a common place for all
>> the upcoming PID initiatives. And it would outsource the logic from OMERO,
>> which might be a win. Long-term the library will also support in publishing
>> links to omero into the various relevant PID archives, for each image. For
>> example omero://ip/imageID , which will help us connect to the omero client
>> from other programs. No need for DBUS or similar!
>
> We tend to just rely on UUIDs for most of the naming logic. Of course,
> something more advanced would be interesting, but the concrete benefit (in the
> face of all the cost) just hasn't yet materialized for us.

From the data model point of view, it might be useful for end users to
drop LSIDs in favour of UUIDs, given the poor use and support.  Would it
be reasonable to tweak the specification so that an ID merely has to be
a unique string within the document (which in practice is the case right
now)?  This would be backward compatible with existing LSIDs, and
would work with all existing numbers as well as UUIDs or any other
site-defined identifier strategy.  This would also mean the model could
interface with any particular ID system, rather than mandating a single
implementation.
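
To make the "unique string within the document" rule concrete, here is a
minimal sketch of the only check it would require.  The function name
and the sample IDs (including the LSID authority) are made up for
illustration; nothing here is existing OME code.

```python
from collections import Counter


def duplicate_ids(ids):
    """Return the IDs that occur more than once in one document.

    The IDs are treated as opaque strings, so LSIDs, UUIDs and plain
    "Image:0" style IDs can be mixed freely (hypothetical helper).
    """
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)


# Hypothetical IDs using three different naming strategies.
sample = [
    "urn:lsid:example.org:Image:1",
    "Image:0",
    "3f2b8c1e-5d4a-4c7b-9e21-0a1b2c3d4e5f",
]
print(duplicate_ids(sample))                # no duplicates
print(duplicate_ids(sample + ["Image:0"]))  # "Image:0" is repeated
```

The point is only that the check never needs to parse the ID, so any
site-defined scheme passes as long as it is unique in the document.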

>> Second, and this is where I need to pick your brain a bit, is that for
>> documentation purposes a PID is not sufficient. We also need to cite a
>> cryptographic hash value. We are proposing an extension to CSL-JSON which
>> lets you store a list of 3-tuples:
>>
>> (hash algorithm, type of hash, value)
>> e.g.
>> (sha512, image, XXX)
>> (sha256, rawfile, XXX)
>>
>> We are working on specialized *file format independent* hashes that will
>> allow for re-compression of data in the future, something that I
>> expect/hope will happen one day. FASTQ is not a very efficient format, and
>> nor are many image file formats... Or we just want to convert to ome-tiff.
>> But here is the thing: There are algorithms for 2d images, and there it is
>> rather straight forward, but is there a method for nD microscopy images? I
>> think creating and promoting one would be a task that fits OME very well,
>> given that you should have the best idea of what an image contains these
>> days. Maybe one would drop much of the metadata since it is hard to deal
>> with, and just focus on pixel data... But even then, coming up with a hash
>> method is a bit of work given SPIM and all upcoming methods. I have
>> contemplated if one should extend the above to (hash algorithm, type of
>> hash, hash parameters, value), such that one could in addition store down
>> the order in which dimensions are considered for the hash. But here I'm all
>> ears!
>
> I'll leave this for the moment in case anyone else has concrete suggestions.
> In general, though, yes, hashing and more broadly data slice referencing is
> an interesting problem that's getting more complicated every day!

Hashing the hypervolume shouldn't be any more difficult than hashing an
image plane, so long as you do it in the specified dimension order and
in a consistent direction (y *cough*).  It's still effectively a linear
byte array.  The caveat is that it needs support for higher dimensions
for SPIM etc.  This is doable right now with the modulo annotations, and
with proper NDIM support would become even simpler.  Lossy formats such
as JPEG may be a bit more challenging: we would need to hash the
compressed data rather than the logical uncompressed pixels, since we
can't guarantee the exact decoded values.
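
As a rough sketch of what hashing in a fixed dimension order could look
like: the function below feeds every plane into one running hash, with
the first letter of the order string varying fastest.  The get_plane
callback, the sizes dict and the convention for the order string are
all assumptions for illustration, not an existing Bio-Formats or OMERO
API, and each plane is assumed to arrive as raw bytes with its rows
already in a consistent direction.

```python
import hashlib
from itertools import product


def hash_hypervolume(get_plane, sizes, order="ZCT", algorithm="sha256"):
    """Hash every plane of an nD image in one fixed dimension order.

    get_plane(index) -> bytes of one 2D plane (hypothetical callback);
    index maps a dimension letter to a position, e.g. {"Z": 0, "C": 0, "T": 1}.
    sizes maps each dimension letter to its extent.
    order lists the dimensions, first letter varying fastest (assumption).
    """
    digest = hashlib.new(algorithm)
    # itertools.product varies its last argument fastest, so reverse the order.
    slowest_first = order[::-1]
    for position in product(*(range(sizes[d]) for d in slowest_first)):
        digest.update(get_plane(dict(zip(slowest_first, position))))
    return digest.hexdigest()


# Toy image: 2 Z-sections, 1 channel, 2 timepoints, 3 bytes per "plane".
planes = {(z, 0, t): bytes([z, 0, t]) for z in range(2) for t in range(2)}
print(hash_hypervolume(lambda i: planes[(i["Z"], i["C"], i["T"])],
                       {"Z": 2, "C": 1, "T": 2}))
```

The same pixels hashed with a different order string give a different
digest, so the order would have to be recorded alongside the value, for
instance in the "hash parameters" field suggested above.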

How would you cope with multi-series files?  Just concatenate all the
series' pixel data together as for the higher dimensions?
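
For illustration, plain concatenation of the series could be sketched as
below.  The function name and the idea of passing each series as one
bytes object are assumptions made for the example only.

```python
import hashlib


def hash_series_concatenated(series_pixels, algorithm="sha256"):
    """Hash several series as if their pixel data were one byte stream.

    series_pixels: iterable of bytes objects, one per series, already in
    series index order (hypothetical input shape).
    """
    digest = hashlib.new(algorithm)
    for pixels in series_pixels:
        digest.update(pixels)
    return digest.hexdigest()


# Two different ways of splitting the same bytes into series.
print(hash_series_concatenated([b"ab", b"c"]))
print(hash_series_concatenated([b"a", b"bc"]))
```

Both calls print the same digest, which shows one consequence of pure
concatenation: the hash says nothing about where one series ends and
the next begins.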

While we currently support SHA1 hashing of planes, it would be useful to
extend this to multiple dimensions, and to also include support for
other hash algorithms.  If we can hash all the dimensions at once, this
might be as simple as moving the hash to Image.

From the data provenance perspective, none of the LSID or hash features
in the model guarantee anything, really.  They can be recomputed on
demand, so they only ensure data integrity.  It would be useful to be
able to support some level of secure signature of the metadata which, if
it included pixel data hashes, would allow the original data to be
verified both for integrity and for being unchanged since it was first
written.  In OME-TIFF, we could support a detached signature of the
whole OME-XML block, stored in a separate TIFF tag, with a suitable
secure certificate/key present on the originating system, and optional
validation and verification on other systems on reading.
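
A very rough sketch of the sign/verify flow for a detached signature
over the OME-XML bytes follows.  Note the substitution: it uses
HMAC-SHA256 from the Python standard library, which is a shared-secret
scheme, purely as a stand-in.  A real design would presumably use a
public-key signature so that readers can verify without being able to
sign.  The function names, key and XML fragment are invented for the
example.

```python
import hashlib
import hmac


def sign_detached(ome_xml: bytes, key: bytes) -> bytes:
    """Return a signature to store separately from the XML block.

    HMAC-SHA256 stand-in for a real certificate-based signature.
    """
    return hmac.new(key, ome_xml, hashlib.sha256).digest()


def verify_detached(ome_xml: bytes, signature: bytes, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = hmac.new(key, ome_xml, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)


# Hypothetical metadata block that embeds a pixel data hash.
xml = b'<OME><Image ID="Image:0"><!-- pixel hash goes here --></Image></OME>'
key = b"key held on the originating system"
sig = sign_detached(xml, key)
print(verify_detached(xml, sig, key))                # unchanged metadata
print(verify_detached(xml + b"<!-- x -->", sig, key))  # modified metadata
```

Because the signed bytes include the pixel hashes, changing either the
metadata or the pixels would cause verification to fail.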


Regards,
Roger

--
Dr Roger Leigh -- Open Microscopy Environment
Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences, University of Dundee, Dow Street,
Dundee DD1 5EH Scotland UK   Tel: (01382) 386364

The University of Dundee is a registered Scottish Charity, No: SC015096

