[ome-devel] on the use of PIDs and OMERO

Tue Feb 11 19:32:05 GMT 2014

Hello everyone!

A while ago I emailed and asked about the status of LSIDs. After some
reading, the standard appears to be quite dead, and other alternatives are
being developed.

Now the reason I ask is because we need to get persistent ID (PID) support
into our documentation system Labstory (www.labstory.se). The requirements
of the PID records that I have sort of come up with are:

* persistence of the IDs
* offline generation of IDs. For example, most microscopy computers do not
have internet access, yet acquisition is the best time to generate an ID
* data format independent hashing
* possibly re-hashing with future algorithms (new hashes added, old ones
kept on the side)
* simple resolving, with storage of data in multiple locations, and the
ability to pick the closest one (critical for image/sequencing data)
* critical: handling of IDs "offline", that is, one must be able to
generate and use IDs right at the creation of the data, which can be years
before public dissemination
* ease of citation of data, e.g. it should be possible to implement drag
and drop of a raw file into labstory/mendeley/similar, which then figures
out how to cite it
* it must be possible for one program to figure out which other program it
can use to open the data, e.g. for images in an omero database, maybe one
could build an omero://... url and have the OS figure out the association
* the protocol above may change, and there may be multiple alternatives
(e.g. alternative image storage servers)

Currently it's a bit of a challenge finding something that fulfils all the
requirements, but I think we are getting there, one way or another. What we
are putting into Labstory in particular is a new file format called FILEID
(until someone comes up with a better name).

For a file called
mybig.tiff
you have
mybig.tiff.FILEID

This file contains all IDs of the file (DOI, LSID, EPIC ID etc). We have
chosen to use CSL-JSON as the file format, as this means that the file can
contain any citation information that also would be included for a regular
article. This is also the most widely used schema by reference management
software so it could be added with minimal work to several programs, or
might already work, e.g. mendeley/readcube/labstory.
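
As an illustration of the sidecar idea, here is a minimal Python sketch that
writes and reads a mybig.tiff.FILEID file. It is not the citeproc-light code.
The helper names, the choice to store a JSON array with one item, and every
value in EXAMPLE_RECORD (including the DOI) are assumptions made up for the
example; only the ".FILEID" suffix and the use of CSL-JSON come from the
description above.

```python
import json
from pathlib import Path

# Hypothetical CSL-JSON item; all values are placeholders, not real identifiers.
EXAMPLE_RECORD = {
    "id": "mybig.tiff",
    "type": "dataset",
    "title": "Example confocal stack",
    "DOI": "10.1234/placeholder",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2014, 2, 11]]},
}


def fileid_path(data_file):
    """Return the sidecar path: the data file name with '.FILEID' appended."""
    data_file = Path(data_file)
    return data_file.with_name(data_file.name + ".FILEID")


def write_fileid(data_file, record):
    """Write one CSL-JSON item next to the data file (assumed: a JSON array)."""
    path = fileid_path(data_file)
    path.write_text(json.dumps([record], indent=2), encoding="utf-8")
    return path


def read_fileid(data_file):
    """Read the CSL-JSON items back from the sidecar file."""
    return json.loads(fileid_path(data_file).read_text(encoding="utf-8"))
```

Calling write_fileid("mybig.tiff", EXAMPLE_RECORD) would create
mybig.tiff.FILEID in the same directory, and read_fileid("mybig.tiff") would
return the list of items again.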

The FILEID support is probably going into Labstory 0.0.21, as a
proof of principle, and we are discussing with others how to bring this
forward. That the format is PID-vendor neutral means that we really only
need to worry about getting other programs to read/write them, and which
PID standard wins is also of secondary concern. I just released our code
into the open here for further comments (it is not complete yet, but we do
use it in our latest release):

https://github.com/mahogny/citeproc-light

The license and size should allow it to be integrated in many systems. Now,
here is where OMERO enters. How can references to OMERO files be spread? I
think it would be great if you would adopt the CSL-JSON (using this library
or from scratch):

https://github.com/citation-style-language/schema/blob/master/csl-data.json

The quickest way forward, if you think this sounds promising:

* Add a csl-json string attribute to the database wherever there is an lsid
attribute, but primarily for the images (or maybe only there for now). Keep
the lsid attribute as-is for now
* Add FILEID support to scifio/bio-formats, or right next to the omero
importer if you want to skip one dependency (the library also has a
command-line utility for making FILEIDs)
* Add a drag-n-drop MIME from the OMERO client, same schema. Thus one could
cite an image by just dragging it into whatever other citation program you
have. Here I am discussing with others if we can agree on a MIME format (no
common standard yet but no doubt it will be csl-json in some form)
* Likewise, drag and drop from a citation program could let the omero
client look up the corresponding image

A more challenging option for the future is to try and include all the CSL
data in the OMERO schema directly, and generate the JSON when needed - but
this involves many, many more changes. I would rather start simple.

In the library we are building in services to let it automatically
request/generate IDs of various kinds. I know you already have LSID
generation, although disabled, but this library could be a common place for
all the upcoming PID initiatives. It would also move that logic out of
OMERO, which might be a win. Long-term the library will also support
publishing links to omero in the various relevant PID archives, for each
image. For example omero://ip/imageID, which will help us connect to the
omero client from other programs. No need for DBUS or similar!
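
To show what handling such a link could look like, here is a small Python
sketch that builds and parses URLs of the form omero://host/imageID. The
omero:// scheme is only the proposal from this email, not an existing OMERO
feature, and the function names and the rule that the image ID is a plain
integer are assumptions for the example.

```python
from urllib.parse import urlsplit


def build_omero_url(host, image_id):
    """Build a link of the proposed form omero://host/imageID."""
    return f"omero://{host}/{int(image_id)}"


def parse_omero_url(url):
    """Split a proposed omero:// link into (host, image_id)."""
    parts = urlsplit(url)
    if parts.scheme != "omero" or not parts.netloc:
        raise ValueError(f"not an omero:// link: {url!r}")
    try:
        # Assumption: the path holds nothing but a numeric image ID.
        image_id = int(parts.path.lstrip("/"))
    except ValueError:
        raise ValueError(f"no numeric image ID in: {url!r}") from None
    return parts.netloc, image_id
```

A client registered with the OS for the omero scheme could call
parse_omero_url on the link it receives and then open the matching image.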

(Big side-note: I am tempted to write a local resolving service that can be
used until the data is sent to any big archive)

It should be said that we are very willing to lend a hand in adding
support to OMERO for this, once we have also cleared with others how
they would want to proceed. But for a start I would just like to know
your opinion on this, especially since it will affect the schema of the
OMERO database.

Second, and this is where I need to pick your brain a bit, is that for
documentation purposes a PID is not sufficient. We also need to cite a
cryptographic hash value. We are proposing an extension to CSL-JSON which
lets you store a list of 3-tuples:

(hash algorithm, type of hash, value)
e.g.
(sha512, image, XXX)
(sha256, rawfile, XXX)
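
A minimal Python sketch of the rawfile case: hash the bytes of a file and
attach the 3-tuple to a CSL-JSON item. The "hashes" key and the field names
"algorithm", "type" and "value" are hypothetical; the extension has not been
specified beyond the tuple shown above.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash the raw bytes of a file in chunks, so large images fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_hash(record, algorithm, hash_type, value):
    """Append one (algorithm, type, value) entry under a hypothetical key."""
    record.setdefault("hashes", []).append(
        {"algorithm": algorithm, "type": hash_type, "value": value}
    )
    return record
```

For example, add_hash(record, "sha256", "rawfile", file_digest("mybig.tiff"))
would store the second tuple from the list above. Because entries are
appended, hashes from newer algorithms can be added later while the old ones
are kept.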

We are working on specialized *file format independent* hashes that will
allow for re-compression of data in the future, something that I
expect/hope will happen one day. FASTQ is not a very efficient format, and
neither are many image file formats... Or we may just want to convert to
ome-tiff. But here is the thing: there are algorithms for 2d images, and
there it is rather straightforward, but is there a method for nD microscopy
images? I think creating and promoting one would be a task that fits OME
very well, given that you should have the best idea of what an image
contains these days. Maybe one would drop much of the metadata since it is
hard to deal with, and just focus on pixel data... But even then, coming up
with a hash method is a bit of work given SPIM and all upcoming methods. I
have contemplated whether one should extend the above to (hash algorithm,
type of hash, hash parameters, value), such that one could in addition
store the order in which dimensions are considered for the hash. But here
I'm all ears!
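
To make the question concrete, here is one possible sketch in Python of a
pixel-only hash that does not depend on how the dimensions are laid out in
the file. It is a strawman, not an existing OME or Labstory algorithm. The
canonical order "TCZYX", the big-endian unsigned 16-bit encoding and the
shape header are all assumptions, and they are exactly the kind of thing
that would go into the "hash parameters" field.

```python
import hashlib
import itertools
import struct


def pixel_hash(pixels, shape, dim_order, canonical_order="TCZYX",
               algorithm="sha256"):
    """Hash pixel values in a fixed dimension order.

    pixels    -- flat sequence of ints, last dimension of dim_order fastest
    shape     -- size of each dimension, in the order given by dim_order
    dim_order -- stored dimension order, e.g. "YX" or "ZCYX"
    """
    sizes = dict(zip(dim_order, shape))
    # Stride of each dimension in the stored layout.
    strides = {}
    step = 1
    for dim in reversed(dim_order):
        strides[dim] = step
        step *= sizes[dim]
    if step != len(pixels):
        raise ValueError("pixel count does not match shape")
    dims = [d for d in canonical_order if d in sizes]
    if len(dims) != len(sizes):
        raise ValueError("dimension missing from canonical order")

    digest = hashlib.new(algorithm)
    # Assumption: hash the canonical shape first, so images with the same
    # values but different shapes get different hashes.
    for dim in dims:
        digest.update(struct.pack(">cQ", dim.encode("ascii"), sizes[dim]))
    # Walk the pixels in canonical order, whatever the stored layout is.
    for index in itertools.product(*(range(sizes[d]) for d in dims)):
        offset = sum(i * strides[d] for i, d in zip(index, dims))
        # Assumption: every value is encoded as big-endian unsigned 16-bit.
        digest.update(struct.pack(">H", pixels[offset]))
    return digest.hexdigest()
```

With this sketch a 2x3 image stored row by row, pixel_hash([1, 2, 3, 4, 5, 6],
(2, 3), "YX"), and the same image stored column by column,
pixel_hash([1, 4, 2, 5, 3, 6], (3, 2), "XY"), give the same value. It says
nothing yet about tiles, floating-point pixels or dimensions of size 1 that
one file format records and another omits.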

Yours sincerely,
Johan Henriksson

-- 
-----------------------------------------------------------
Johan Henriksson, PhD
Karolinska Institutet
Ecobima AB - Custom solutions for life sciences
http://www.ecobima.com  http://mahogny.areta.org  http://www.endrov.net
