[ome-devel] Archiving from OMERO

Josh Moore josh at glencoesoftware.com
Mon Oct 12 14:29:12 BST 2015


On Fri, Oct 2, 2015 at 11:23 AM, Alex Herbert <A.Herbert at sussex.ac.uk> wrote:
> Hi Jason, Josh,

Hi Alex,


> Here at the University of Sussex we are encouraging everyone to put their data into OMERO. Currently the storage for this is more expensive than the university's chosen data archiving vendor Arkivum. The university is keen for people to use this archiving service. So we would like to try moving data from OMERO into this archive.
>
> I have had an introduction talk with the Arkivum people about how the system works. They provide a file system that can be mounted on your server of choice. You can then copy files to this file system and when they are 'safe' the original copy can be removed and optionally symbolically linked to the archive file system.
>
> Symbolic links would allow any existing applications that use local files to carry on as normal. However the file may not actually be present on the archive file system so access can be fast (if it is locally cached) or slow (if the file system performs data retrieval from the archive off site). It is still possible to stat the files to get file sizes, last accessed time, etc.
>
> The archiving process and information about the archived files is all available through a REST API.
>
> I am aware that OMERO can perform in-place import for images by creating symbolic links to the original data location. Thus the system works using symbolic links to files. I would like to try and set up an intermediate file access layer in OMERO to support archived files.
>
> Basically all access to image data should go through this layer. The first task for the layer is to check if the image data is archived. If not then it can return the data as normal. If it is archived then it must discover if the archived file is available (i.e. locally cached), if so it can return as normal. If not it can error to the request.
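Sketched roughly in Python (all class and method names here are hypothetical, not existing OMERO code), such a layer might look like:

```python
class ArchiveAwareFileAccess:
    """Sketch of the intermediate access layer described above.

    ``archive_service`` is any object exposing ``is_archived(id)`` and
    ``is_cached(id)``; ``read_local`` stands in for the normal
    repository read path.
    """

    def __init__(self, archive_service, read_local):
        self.archive = archive_service
        self.read_local = read_local

    def read(self, file_id):
        if not self.archive.is_archived(file_id):
            return self.read_local(file_id)   # not archived: as normal
        if self.archive.is_cached(file_id):
            return self.read_local(file_id)   # staged copy available
        # archived and not locally cached: error to the request
        raise IOError("file %r is archived and not currently staged"
                      % file_id)
```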

We've discussed such an intermediate access layer a number of times
for this and other use cases, so there are certainly no objections
from our side. It is, however, a significant amount of work. Are there
concrete code paths or exceptions that you've run into? An
intermediate solution might be to either handle specific bugs that
have been encountered or, slightly more adventurously, to replace key
classes with interfaces which could be configured by third parties
such as Arkivum.

In general, we're more than happy to get (and act on) bug reports or
feature requests at a more specific/detailed level. The more
comprehensive access layer will require some time from our side.



> We would propose that image archiving is only performed by an administrator. Users will be able to request images to be archived. We would then need to build a list of all the files associated with the images. Then the files can be copied to the archive. When the archive process is complete (verified using checksums) the originals can be removed and links created. This should be done when OMERO is off-line, e.g. during a server restart. All the images that correspond to the files are marked as archived. This requires a reverse method to find all image IDs associated with data files, since we need to support all image formats, i.e. one/many images-to-one/many files.
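The reverse image-to-file lookup you mention is a many-to-many inversion; a minimal sketch (plain dicts standing in for the real OMERO Image/OriginalFile tables) would be:

```python
from collections import defaultdict


def invert(image_to_files):
    """Build the reverse map (file -> set of image ids) needed to mark
    every image touched by an archived file.

    ``image_to_files`` maps each image id to the list of files backing
    it; a single file may back several images and vice versa.
    """
    file_to_images = defaultdict(set)
    for image_id, files in image_to_files.items():
        for f in files:
            file_to_images[f].add(image_id)
    return file_to_images
```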
>
> The archiving could be added as functionality to the omero admin command-line utility, i.e. it runs on the server. To support multiple archiving strategies would require a plugin system where an archive service is created for the archiving system of choice. This would need to be pre-configured with all its requirements. The admin script just provides a list of currently supported archiving strategies. The admin script can then use the chosen service to execute common commands: stage for archiving, request status, remove original files, etc.
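A plugin system along those lines could be as simple as a registry of strategy classes sharing one interface; everything below (names, the toy Arkivum stand-in) is illustrative only, not a real Arkivum client:

```python
import abc


class ArchiveStrategy(abc.ABC):
    """Hypothetical plugin interface for ``omero admin`` archive
    subcommands; method names are illustrative only."""

    @abc.abstractmethod
    def stage(self, paths):
        """Copy files into the archive; return a request id."""

    @abc.abstractmethod
    def status(self, request_id):
        """Return e.g. 'archived' or 'unknown' for a staged request."""

    @abc.abstractmethod
    def remove_originals(self, paths):
        """Delete originals and create links once safely archived."""


STRATEGIES = {}


def register(name):
    """Class decorator adding a strategy to the registry the admin
    script would list."""
    def wrap(cls):
        STRATEGIES[name] = cls
        return cls
    return wrap


@register("arkivum")
class ArkivumStrategy(ArchiveStrategy):
    """Toy in-memory stand-in for a real Arkivum REST client."""

    def __init__(self):
        self.staged = {}
        self.next_id = 0

    def stage(self, paths):
        self.next_id += 1
        self.staged[self.next_id] = list(paths)
        return self.next_id

    def status(self, request_id):
        return "archived" if request_id in self.staged else "unknown"

    def remove_originals(self, paths):
        return ["%s -> archive" % p for p in paths]
```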
>
> The workflow for using our vendor (Arkivum) would be:
>
> 1. MD5 checksum all files to be archived
> 2. Copy file to the archive
> 3. Verify file transfer using checksum
> 4. Verify file has been successfully archived (on tape back-up)
> 5. When OMERO is offline, remove original file and mark it as archived in the OMERO database
> 6. Generate symbolic links to an archive restore location

This workflow certainly makes sense, though from our side, we would
hope that #5 wouldn't require OMERO to be offline. If we can help with
that, let us know.
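For what it's worth, steps 1-3 could be sketched roughly as follows (the streaming MD5 and copy-then-verify pattern is standard; the function names are made up):

```python
import hashlib
import shutil


def md5sum(path, chunk=1 << 20):
    """Stream an MD5 over a file without loading it into memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def archive_copy(src, dst):
    """Steps 1-3: checksum the source, copy it to the archive mount,
    then re-checksum the copy to verify the transfer."""
    expected = md5sum(src)
    shutil.copy2(src, dst)
    if md5sum(dst) != expected:
        raise IOError("checksum mismatch copying %s" % src)
    return expected
```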


> There may be a large gap between steps 3 and 4, since step 4 requires that the archive has produced an off-site tape back-up. So the workflow will need to be split into stages whose state is stored, such that the archiving progress of each file can be queried; when archiving is complete the admin can shut down OMERO and run the archive command to remove the files and mark them as archived.
>
> Step 6 could be used even if the source for the link does not exist using a directory structure such as:
>
> /OMERO/Files/... -> /OMERO/Archive/...
>
> The /OMERO/Archive/... folder can be mounted on another server or local. It would provide a way for archived files to be temporarily restored allowing access to them from OMERO. This could be supported by adding methods to the archive service to be able to query if archived files are accessible.
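One cheap way such a query could work locally (on Unix, at least): `os.lstat` succeeds even on a dangling symlink, while `os.path.exists` follows the link, so the pair distinguishes "link present but data not restored" from "data restored". The archive's own REST API would remain the authoritative answer; the helper name is hypothetical:

```python
import os


def is_restored(link_path):
    """True if the archive restore target behind ``link_path`` exists.

    Raises OSError if the link itself is missing; returns False for a
    dangling link (file archived but not currently restored).
    """
    os.lstat(link_path)              # the link itself must be intact
    return os.path.exists(link_path)  # follows the link to the target
```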
>
> Although we do not currently require this, it is possible that the whole process could be reversed so that images become unarchived. This could just involve restoring the files to the location pointed to by the symbolic link. When they are restored, the admin can shut down OMERO and run an unarchive command to remove the links and copy the files back to the original location.
>
> We would not archive thumbnails so that browsing through the clients is still possible. Initially the clients may not require modification assuming that they will just fail to show images that are unavailable. However it would be preferable that the clients will be archive aware and present unavailable files differently.
>
> My lack of knowledge of the OMERO server means I may have missed many important points. For example, support for tiled images/pyramids,

Currently, pyramids are stored under /OMERO/Files rather than
/OMERO/ManagedRepository as you would likely want them to be.


> or how the new system reads metadata directly from file using BioFormats. Is access to the files required to rebuild metadata after the initial import, e.g. when BioFormats is updated or the metadata removed from a cache?

No metadata is *needed* from the original file after import, but it is
accessed. This is used in the right-hand panel to display "Original
metadata". You would likely want this to show something simple like
"archived=true".


> Or implications for support for the OMERO 4 raw pixel data format.

There would only be implications if files under /OMERO/Pixels were
also archived, but I would assume that any access layer would be able
to handle it, and there are at least no complications re: metadata.


> Initially I am trying to do a feasibility assessment since I can dedicate some time to working on a branch of OMERO but do not want to commit to maintaining a branch forever. It would be preferable that any changes I make do not break anything and can be integrated into the main OMERO branch. If the archive aware layer is created to be flexible then the changes would form the basis of supporting any archiving strategy.
>
> I would welcome any suggestions.
>
> Thank you in advance.
>
> Regards,
> Alex

Cheers,
~Josh.

