<div dir="ltr"><div><div><div>Hello everyone!<br><br></div>A while ago I emailed to ask about the status of LSIDs. After some reading, LSID appears to be quite dead, and alternatives are being developed.<br>
<br></div>The reason I ask is that we need to get persistent ID (PID) support into our documentation system Labstory (<a href="http://www.labstory.se">www.labstory.se</a>). The requirements for PID records that I have come up with so far are:<br>
<br><div>* persistence<br></div>*
offline generation of IDs. For example, most microscopy computers do not
have internet access, yet acquisition is the best time of all to generate an ID<br>
<div><div>* data-format-independent hashing<br></div>* possibly re-hashing with future algorithms (new hashes added, old ones kept alongside)<br></div><div>*
simple resolving, with data stored in multiple locations and the
ability to pick the closest one (critical for image/sequencing data)<br></div>
<div>* critical: handling of IDs "offline"; that is, one must be able to
generate and use IDs right when the data is created, which can be
years before public dissemination<br></div><div>* ease of citing
data, e.g. it should be possible to implement drag and drop of a raw file into
Labstory/Mendeley/similar, which then figures out how to cite it<br>
</div><div>* it must be possible for one program to figure out which
other program can open the data, e.g. for images in an OMERO
database, maybe one could build an omero://... URL and have the OS figure
out the association<br>
</div>* the protocol above may change, and there may be multiple alternatives (e.g. alternative image storage servers)<br><div><br><br></div><div>Currently it's a bit of a challenge to find something that fulfils all the requirements, but I think we are getting there, one way or another. Specifically, what we are putting into Labstory is a new file format called FILEID (until someone comes up with a better name).<br>
<br>For a file called<br></div><div>mybig.tiff<br>you have<br>mybig.tiff.FILEID<br><br></div><div>This file contains all IDs of the file (DOI, LSID, EPIC ID etc.). We have chosen CSL-JSON as the file format, which means the file can contain any citation information that would also be included for a regular article. It is also the schema most widely used by reference management software, so support could be added to several programs with minimal work, or might already work, e.g. Mendeley/ReadCube/Labstory.<br>
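<br>As a minimal sketch of the sidecar idea (not the citeproc-light API): the helper names, the example record, and the choice to store a JSON array with a single item are assumptions for illustration. The DOI is a placeholder, not a real identifier.<br>

```python
import json
from pathlib import Path


def fileid_path(data_path):
    """Return the sidecar path: mybig.tiff -> mybig.tiff.FILEID."""
    data_path = Path(data_path)
    return data_path.with_name(data_path.name + ".FILEID")


def write_fileid(data_path, record):
    """Write one CSL-JSON item next to the data file (hypothetical helper)."""
    sidecar = fileid_path(data_path)
    # CSL-JSON data is a JSON array of items; each item needs "id" and "type".
    sidecar.write_text(json.dumps([record], indent=2), encoding="utf-8")
    return sidecar


def read_fileid(data_path):
    """Read the CSL-JSON items stored for a data file (hypothetical helper)."""
    return json.loads(fileid_path(data_path).read_text(encoding="utf-8"))


# Illustrative record only; every value here is a placeholder.
example_record = {
    "id": "mybig-tiff",
    "type": "dataset",
    "title": "mybig.tiff",
    "author": [{"family": "Doe", "given": "Jane"}],
    "issued": {"date-parts": [[2014, 1, 1]]},
    "DOI": "10.1234/placeholder",
}
```

Calling write_fileid("mybig.tiff", example_record) would create mybig.tiff.FILEID in the same directory, and any program that understands CSL-JSON could then read the citation back with read_fileid.<br>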
<br></div><div>The FILEID support is probably going into Labstory 0.0.21 as a proof of principle, and we are discussing with others how to bring this forward. Because the format is PID-vendor neutral, we really only need to worry about getting other programs to read and write it; which PID standard wins is of secondary concern. I just released our code into the open here for further comments (it is not complete yet, but we do use it in our latest release):<br>
<br><a href="https://github.com/mahogny/citeproc-light">https://github.com/mahogny/citeproc-light</a><br></div><div><br></div><div>The license and size should allow it to be integrated into many systems. Now, here is where OMERO enters. How can references to OMERO files be spread? I think it would be great if you would adopt CSL-JSON (using this library or from scratch):<br>
<br><a href="https://github.com/citation-style-language/schema/blob/master/csl-data.json">https://github.com/citation-style-language/schema/blob/master/csl-data.json</a><br></div><div><br></div><div>The quickest way forward, if you think this sounds promising:<br>
<br>* Add a csl-json string attribute to the database wherever there is an lsid attribute, but primarily for the images (or maybe only there for now). Keep the lsid attribute as-is for now<br>* Add FILEID support to scifio/bio-formats, or right next to the OMERO importer if you want to skip one dependency (the library also has a command-line utility for making FILEIDs)<br>
</div><div>* Add a drag-and-drop MIME type to the OMERO client, using the same schema. One could then cite an image by just dragging it into whatever citation program you have. I am discussing with others whether we can agree on a MIME format (there is no common standard yet, but no doubt it will be csl-json in some form)<br>
</div><div>* Likewise, drag and drop from a citation program could let the OMERO client look up the corresponding image<br></div><br></div>A more challenging option for the future is to include all the CSL data in the OMERO schema directly and generate the JSON when needed, but this involves many more changes. I would rather start simple.<br>
<div><br><div>In the library we are building in services that let it automatically request/generate IDs of various kinds. I know you already have LSID generation, although it is disabled, but this library could be a common place for all the upcoming PID initiatives. It would also move that logic out of OMERO, which might be a win. Long-term, the library will also support publishing links to OMERO in the various relevant PID archives, for each image, for example omero://ip/imageID, which will help us connect to the OMERO client from other programs. No need for DBUS or similar!<br>
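<br>A rough sketch of how another program might take such a link apart. The function name parse_omero_url is hypothetical, and treating imageID as a plain integer is my assumption; no omero:// URL format has been agreed on yet.<br>

```python
from urllib.parse import urlparse


def parse_omero_url(url):
    """Split an omero://host/imageID link into (host, image_id).

    Hypothetical helper: assumes the path is a single numeric image ID.
    """
    parts = urlparse(url)
    if parts.scheme != "omero":
        raise ValueError("not an omero:// URL: " + url)
    image_id = parts.path.lstrip("/")
    if not parts.hostname or not image_id.isdigit():
        raise ValueError("expected omero://host/imageID, got: " + url)
    return parts.hostname, int(image_id)
```

A client registered with the OS for the omero scheme could call this on the incoming link and then open the corresponding image on that server.<br>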
<br></div><div>(Big side-note: I am tempted to write a local resolving service that can be used until the data is sent to a big archive)<br></div><div><br></div><div>It should be said that we are very willing to lend a hand in adding support for this to OMERO, once we have also cleared up with others how they want to proceed. But for a start I would just like to know your opinion on this, especially since it will affect the schema of the OMERO database.<br>
<br></div><div>Second, and this is where I need to pick your brains a bit: for documentation purposes a PID is not sufficient. We also need to cite a cryptographic hash value. We are proposing an extension to CSL-JSON that lets you store a list of 3-tuples:<br>
<br></div><div>(hash algorithm, type of hash, value)<br></div><div>e.g.<br></div><div>(sha512, image, XXX)<br></div><div>(sha256, rawfile, XXX)<br><br></div><div>We are working on specialized *file format independent* hashes that will allow for re-compression of data in the future, something that I expect/hope will happen one day. FASTQ is not a very efficient format, nor are many image file formats... Or we may just want to convert to ome-tiff. But here is the thing: there are algorithms for 2D images, where it is rather straightforward, but is there a method for nD microscopy images? I think creating and promoting one would be a task that fits OME very well, given that you should have the best idea of what an image contains these days. Maybe one would drop much of the metadata, since it is hard to deal with, and just focus on pixel data... But even then, coming up with a hash method is a bit of work given SPIM and all the upcoming methods. I have contemplated whether one should extend the above to (hash algorithm, type of hash, hash parameters, value), so that one could also store the order in which dimensions are considered for the hash. But here I'm all ears!<br>
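<br>To make the 4-tuple idea concrete, here is a minimal sketch of one possible pixel-only hash. This is not an existing standard: the canonical axis order TCZYX, the encoding of each pixel as a big-endian unsigned 64-bit integer, and the function names are all assumptions. Metadata is ignored entirely.<br>

```python
import hashlib
import itertools
import struct

CANONICAL_AXES = "TCZYX"  # assumed canonical order; the last axis varies fastest


def image_hash(pixels, axes, shape, algorithm="sha256"):
    """Hash pixel values independently of the axis order used for storage.

    pixels: flat sequence of non-negative ints, row-major in `axes` order
    axes:   permutation of CANONICAL_AXES, e.g. "TCZXY"
    shape:  dict mapping each axis letter to its size
    """
    if sorted(axes) != sorted(CANONICAL_AXES):
        raise ValueError("axes must be a permutation of " + CANONICAL_AXES)
    # Stride of each axis in the stored layout (last axis in `axes` is fastest).
    strides, step = {}, 1
    for axis in reversed(axes):
        strides[axis] = step
        step *= shape[axis]
    if step != len(pixels):
        raise ValueError("shape does not match the number of pixels")
    digest = hashlib.new(algorithm)
    # Include the dimensions so that differently shaped images differ.
    for axis in CANONICAL_AXES:
        digest.update(struct.pack(">Q", shape[axis]))
    # Visit the pixels in canonical order, whatever the stored order was.
    ranges = [range(shape[axis]) for axis in CANONICAL_AXES]
    for index in itertools.product(*ranges):
        offset = sum(i * strides[a] for i, a in zip(index, CANONICAL_AXES))
        digest.update(struct.pack(">Q", pixels[offset]))
    return (algorithm, "image", {"axes": CANONICAL_AXES}, digest.hexdigest())


def rawfile_hash(data, algorithm="sha256"):
    """Plain hash of the file bytes; changes whenever the file is re-encoded."""
    return (algorithm, "rawfile", {}, hashlib.new(algorithm, data).hexdigest())


# The same 3x2 image stored in two different axis orders.
shape = {"T": 1, "C": 1, "Z": 1, "Y": 2, "X": 3}
row_major = [0, 1, 2, 3, 4, 5]      # stored as TCZYX (X fastest)
column_major = [0, 3, 1, 4, 2, 5]   # stored as TCZXY (Y fastest)
```

With these definitions, image_hash(row_major, "TCZYX", shape) and image_hash(column_major, "TCZXY", shape) give the same tuple, while a rawfile hash of the two files would differ. Floating-point pixels, signedness and bit depth are exactly the kind of thing the hash parameters would still need to pin down.<br>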
<br></div><div>Yours sincerely,<br></div><div>Johan Henriksson<br><br clear="all"><div><div><div><br>-- <br>-- <br>-----------------------------------------------------------<br>Johan Henriksson, PhD<br>Karolinska Institutet<br>
Ecobima AB - Custom solutions for life sciences<br><a href="http://www.ecobima.com" target="_blank">http://www.ecobima.com</a> <a href="http://mahogny.areta.org" target="_blank">http://mahogny.areta.org</a> <a href="http://www.endrov.net" target="_blank">http://www.endrov.net</a><br>
</div></div></div></div></div></div>