[ome-devel] on the use of PIDs and OMERO

Wed Feb 12 11:06:05 GMT 2014

On Feb 11, 2014, at 8:32 PM, Johan Henriksson wrote:

> Hello everyone!

Hi Johan,

...snip...

> The license and size should allow it to be integrated in many systems. Now,
> here is where OMERO enters. How can references to OMERO files be spread? I
> think it would be great if you would adopt the CSL-JSON (using this library
> or from scratch):
> 
> https://github.com/citation-style-language/schema/blob/master/csl-data.json
> 
> The quickest way forward, if you think this sounds promising:
> 
> * Add a csl-json string attribute to the database wherever there is an lsid
> attribute, but primarily for the images (or maybe only there for now). Keep
> the lsid attribute as-is for now

This seems like the most immediate win, but I'm not sure that it requires
a new DB field. Using a clear namespace ("csl-json", etc) you could attach
this information today with annotations.
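
As a rough illustration of the annotation route, here is a minimal Python sketch that serializes a CSL-JSON item into a namespaced text payload. The namespace string, the example field values, and the `make_csl_annotation` helper are assumptions for illustration only. No OMERO API call is made; the resulting text would still have to be attached to the image with whichever annotation type fits.

```python
import json

# Hypothetical namespace; the thread only suggests something like "csl-json".
CSL_NS = "csl-json"


def make_csl_annotation(csl_item, ns=CSL_NS):
    """Return a (namespace, text) pair suitable for a text annotation."""
    # sort_keys gives a stable serialization, which helps when the same
    # record is exported, re-imported, and the two copies are compared.
    text = json.dumps(csl_item, sort_keys=True)
    return ns, text


# Illustrative record; the id, type and title are made up for the example.
item = {
    "id": "image-1",
    "type": "dataset",
    "title": "Example image",
}

ns, text = make_csl_annotation(item)
```

Keeping the payload as an opaque string under a namespace means nothing in the DB schema has to change, at the price of the server not being able to query individual CSL fields.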

The key criterion, which is one that we have on our plate for 2014 anyway,
is that when the image is exported and then re-imported, this information
is collated and its provenance is clear.

That would certainly be a workflow that it would be good to have you validate.

> * Add FILEID support to scifio/bio-formats, or right next to the omero
> importer if you want to skip one dependency (the library also has a
> command-line utility for making FILEIDs)

Are you thinking of this as a separate file format or as an extension of
each file format? If the latter, that's fairly invasive from an architecture
point of view. From our point-of-view with the workflow above, embedding
the FILEID info directly in the TIFF comment would be the fastest way to
get started with a prototype.
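
A prototype of the TIFF-comment approach could look roughly like the following Python sketch, which only manipulates the comment string and leaves reading and writing the TIFF itself to whatever library is in use. The `BEGIN`/`END` delimiters and both helper names are hypothetical; nothing in this thread defines such a convention.

```python
import json

# Hypothetical delimiters; no such convention is defined in this thread.
BEGIN = "-----BEGIN FILEID-----"
END = "-----END FILEID-----"


def embed_fileid(comment, record):
    """Append a delimited JSON block to an existing TIFF comment string."""
    block = "\n".join([BEGIN, json.dumps(record, sort_keys=True), END])
    return comment + "\n" + block if comment else block


def extract_fileid(comment):
    """Return the embedded record, or None if the comment has no block."""
    start = comment.find(BEGIN)
    end = comment.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        return None
    return json.loads(comment[start + len(BEGIN):end])


original = "Acquired on scope 3"
tagged = embed_fileid(original, {"id": "image-1", "type": "dataset"})
```

Appending rather than replacing keeps any existing comment intact, so readers that know nothing about the block still see the original text first.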

> * Add a drag-n-drop MIME from the OMERO client, same schema. Thus one could
> cite an image by just dragging it into whatever other citation program you
> have. Here I am discussing with others if we can agree on a MIME format (no
> common standard yet but no doubt it will be csl-json in some form)

The only work we currently have on this front is "Open view", but any updates
to support Drag-n-drop would/could use the same mechanism.

> * Likewise, drag and drop from a citation program could let the omero
> client look up the corresponding image

> A more challenging option for the future is to try and include all the CSL
> data in the OMERO schema directly, and generate the JSON when needed - but
> this involves many many more changes. I would rather start simple

Agreed. There will always have to be a line drawn between what can and can't
be in any one spec. If we can work to make the linkages, the exporting, the
re-import, and the interpretation (DnD, etc) of the info as seamless as possible
that would benefit everyone.

> In the library we are building in services to let it automatically
> request/generate IDs of various kinds - I know you have LSID generation
> already, but disabled, however this library could be a common place for all
> the upcoming PID initiatives. And it would outsource the logic from OMERO,
> which might be a win. Long-term the library will also support publishing
> links to omero into the various relevant PID archives, for each image. For
> example omero://ip/imageID , which will help us connect to the omero client
> from other programs. No need for DBUS or similar!

We tend to just rely on UUIDs for most of the naming logic. Of course,
something more advanced would be interesting, but the concrete benefit (in the
face of all the cost) just hasn't yet materialized for us.
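
For concreteness, the omero://ip/imageID form quoted above could be taken apart with the Python standard library as in this sketch. The `parse_omero_link` helper, the example host name, and the assumption that the image ID is a plain decimal integer are all illustrative; no such URI scheme is registered or supported by OMERO as far as this thread states.

```python
from urllib.parse import urlsplit


def parse_omero_link(link):
    """Split an omero://host/imageID link into (host, image_id)."""
    parts = urlsplit(link)
    if parts.scheme != "omero":
        raise ValueError("not an omero:// link: %r" % link)
    # Assumption: the path holds nothing but a decimal image ID.
    image_id = parts.path.lstrip("/")
    if not parts.hostname or not image_id.isdigit():
        raise ValueError("expected omero://host/imageID, got %r" % link)
    return parts.hostname, int(image_id)


host, image_id = parse_omero_link("omero://omero.example.org/42")
```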
> 
> It should be said that we are very willing to lend a hand here in adding
> support to OMERO for this once we have also cleared up with others how
> they would want to proceed. But for a start I would just like to know
> your opinion on this, especially since it will affect the schema of the
> OMERO database?

The next major version of OMERO (5.0) is slotted to come out very, very
(, very) soon. So if you're interested in making this work in the short
term, we'd need to find a way to interoperate without DB changes.

> Second, and this is where I need to pick your brain a bit, is that for
> documentation purposes a PID is not sufficient. We also need to cite a
> cryptographic hash value. We are proposing an extension to CSL-JSON which
> lets you store a list of 3-tuples:
> 
> (hash algorithm, type of hash, value)
> e.g.
> (sha512, image, XXX)
> (sha256, rawfile, XXX)
> 
> We are working on specialized *file format independent* hashes that will
> allow for re-compression of data in the future, something that I
> expect/hope will happen one day. FASTQ is not a very efficient format, and
> nor are many image file formats... Or we just want to convert to ome-tiff.
> But here is the thing: There are algorithms for 2d images, and there it is
> rather straightforward, but is there a method for nD microscopy images? I
> think creating and promoting one would be a task that fits OME very well,
> given that you should have the best idea of what an image contains these
> days. Maybe one would drop much of the metadata since it is hard to deal
> with, and just focus on pixel data... But even then, coming up with a hash
> method is a bit of work given SPIM and all upcoming methods. I have
> contemplated whether one should extend the above to (hash algorithm, type of
> hash, hash parameters, value), such that one could in addition store
> the order in which dimensions are considered for the hash. But here I'm all
> ears!

I'll leave this for the moment in case anyone else has concrete suggestions.
In general, though, yes, hashing and more broadly data slice referencing is
an interesting problem that's getting more complicated every day!
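
To make the proposed (hash algorithm, type of hash, hash parameters, value) tuple concrete, here is a minimal Python sketch of a pixel-only hash that does not depend on the order in which dimensions are stored. It is an illustration under stated assumptions, not an established method: the canonical order `XYZCT`, the `pixel_hash` helper, and the `order=...` parameter string are all hypothetical, metadata is ignored entirely, and pixel type and byte order are not normalized.

```python
import hashlib
from itertools import product

CANONICAL = "XYZCT"  # assumed canonical order, first letter varies fastest


def pixel_hash(data, sizes, order, bytes_per_pixel=1, algorithm="sha256"):
    """Hash pixels in CANONICAL order regardless of the stored order.

    data  -- flat bytes, laid out so that order[0] varies fastest
    sizes -- dict mapping each of X, Y, Z, C, T to its extent
    """
    # Stride (in pixels) of each dimension in the stored layout.
    strides, step = {}, 1
    for dim in order:
        strides[dim] = step
        step *= sizes[dim]
    if step * bytes_per_pixel != len(data):
        raise ValueError("data length does not match sizes")

    h = hashlib.new(algorithm)
    # Commit to the shape so that reshaped data hashes differently.
    h.update(",".join("%s=%d" % (d, sizes[d]) for d in CANONICAL).encode())
    # product() varies its last argument fastest, so feed dimensions reversed.
    slow_to_fast = CANONICAL[::-1]
    for index in product(*(range(sizes[d]) for d in slow_to_fast)):
        offset = sum(i * strides[d] for i, d in zip(index, slow_to_fast))
        offset *= bytes_per_pixel
        h.update(data[offset:offset + bytes_per_pixel])
    # "image" is borrowed from the example tuples above as the hash type.
    return (algorithm, "image", "order=%s" % CANONICAL, h.hexdigest())


# The same 2x2 two-channel image stored planar and channel-interleaved.
sizes = {"X": 2, "Y": 2, "Z": 1, "C": 2, "T": 1}
planar = bytes([0, 1, 2, 3, 4, 5, 6, 7])
interleaved = bytes([0, 4, 1, 5, 2, 6, 3, 7])

planar_hash = pixel_hash(planar, sizes, "XYZCT")
interleaved_hash = pixel_hash(interleaved, sizes, "CXYZT")
```

Both calls yield the same tuple because the pixels are always fed to the hash in the canonical order. Recording that order in the parameters field is what would let a reader reproduce the value later.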

> Yours sincerely,
> Johan Henriksson

Cheers,
~Josh.