[ome-devel] In-place Imports

Thu Dec 19 11:21:06 GMT 2013

Hi Douglas,

On Dec 18, 2013, at 5:28 PM, Douglas Russell wrote:

> As I understand it, once we have OMERO5, we should be able to start working
> towards having in-place imports of data. This is pretty important for a
> number of reasons, not least of which is that all our users currently keep
> copies of everything on a network drive as well as OMERO which is total
> duplication. Also, it will drastically reduce import times and network
> traffic. I'm interested in understanding how this is going to work.

I went ahead and tried to put together a script which will do the in-place import:

  https://github.com/joshmoore/omero-user-scripts/blob/in-place-import/InPlaceImport.py

What's missing at the moment is knowing the location where the symlink should
be created. Most likely this will need an addition to the API, which will have
to go in for 5.0.0 to make this possible.

> I assume that this will require a special import procedure in which the
> OMERO server is given a path to a file on the local machine instead of data
> for upload.

Yup. Doing this via scripts seems to be a good way forward so that we can improve
the in-place work without requiring a new server. Open to opinions though.

> Obviously there is no upload step per se although there might be some
> symlinking or something, then the server would move on directly to doing
> the metadata import.
> 
> I have some questions.
> 
> *Linkage*
> How would the linkage within OMERO work? An obvious solution is to symlink
> the file from the managed repository. I don't think that will work very
> well because the user will inevitably want to get the path to their data
> back out of OMERO in the future. If they are given an address in the
> managed repository then this will mean little to them compared to the
> original path of their image.

There's a field in the DB for storing the original location, the "client path".
In the case of in-place imports, this will be the actual location, and under
the managed repository there will be symlinks.
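
To make the relationship concrete, here is a minimal sketch of the idea, assuming
Python 3 on a POSIX filesystem. The function name and the directory layout are
hypothetical and are not taken from the OMERO API or from the script linked above;
it only shows a symlink under a repository directory whose target is the original
location (the "client path").

```python
import os


def link_into_repository(original_path, repo_dir):
    """Create a symlink inside repo_dir that points at original_path.

    Hypothetical helper for illustration only. Returns a tuple
    (link_path, client_path), where client_path is the absolute original
    location that would be recorded alongside the import.
    """
    client_path = os.path.abspath(original_path)
    os.makedirs(repo_dir, exist_ok=True)
    # Simplification: the link reuses the original file name, so two files
    # with the same name would collide. A real layout would need unique
    # directories per import.
    link_path = os.path.join(repo_dir, os.path.basename(client_path))
    os.symlink(client_path, link_path)
    return link_path, client_path
```

Reading the link back with os.readlink(link_path) returns the client path, so the
original location stays recoverable from the link itself as well as from the DB field.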

> The idea would be that they get this path and can then navigate some
> proprietary image analysis tool to that location on their network drive and
> open it. Probably there would have to be config on the server to map the
> local paths that OMERO uses into a sensible path for the user. E.g.*
> /mnt/fileserver1/dpwrussell/data/omero/foo/bar/myimage.dv* might be what
> the OMERO server sees as the image path, but the user would want
> *data/omero/foo/bar/myimage.dv.*
> 
> We would not want to make all the managed repository available to all users
> because this would violate the privacy of all. I guess an alternative would
> be to map sections of the managed repository to network user accounts, but
> this would lose the niceness of being given a path they recognise from
> their actual data hierarchy.

Open to this, but it's going to require more work. Can you outline other benefits you see?
Otherwise, I think in the short-term, symlinks are going to be the best bang for the buck.
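
For reference, the server-side mapping described in the quoted text could look
roughly like the sketch below, assuming Python 3. The PREFIX_MAP table and the
to_user_path name are hypothetical; nothing like this exists in the server
configuration today. Paths without a configured prefix return None, so locations
inside the managed repository are never exposed.

```python
# Hypothetical table: server-side mount prefix -> prefix shown to the user.
PREFIX_MAP = {
    "/mnt/fileserver1/dpwrussell/": "",
}


def to_user_path(server_path, prefix_map=PREFIX_MAP):
    """Translate a path as seen by the server into a user-facing path.

    Returns None when no configured prefix matches, so that unmapped
    locations are not shown to the user at all.
    """
    # Try the longest prefixes first so that nested mounts resolve correctly.
    for prefix in sorted(prefix_map, key=len, reverse=True):
        if server_path.startswith(prefix):
            return prefix_map[prefix] + server_path[len(prefix):]
    return None
```

With the example from the quoted text,
to_user_path("/mnt/fileserver1/dpwrussell/data/omero/foo/bar/myimage.dv")
gives "data/omero/foo/bar/myimage.dv".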

> *Permissions*
> Would OMERO deal with making the permissions (or maybe ownership) changes
> to the files to reduce the risk of deletion/overwriting or should this be
> done by whatever person/process is marking data for in-place import? I
> think probably it would make more sense that the person/process do it
> because they may have to make permissions/ownership changes to make
> whatever user OMERO is running as be able to see the data at all.

I'd tend to agree. I certainly don't mind trying to add an option into the
script to do this, but I don't think one solution is going to work for
every site.

In general, though, the rule will be "Fail to change permissions at your own risk".
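
As an illustration of what such an option might do, here is a minimal sketch,
assuming Python 3 on a POSIX filesystem. The make_read_only name is hypothetical
and is not part of the linked script; it simply clears the write bits on a file
that has been marked for in-place import.

```python
import os
import stat


def make_read_only(path):
    """Clear the user, group and other write bits on path.

    All other permission bits are left as they were. Returns the
    resulting permission bits.
    """
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Note that this only guards against overwriting. On POSIX systems, deleting or
renaming a file is governed by the permissions of the containing directory, so
a site would need to handle that separately.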

> *Deletion/Overwrite (Worst Case Scenario)*
> What would happen if a user was to delete/overwrite some data somehow?

Then you'd no longer be able to view the affected images in OMERO.
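
A site could at least detect this case by looking for symlinks whose targets have
disappeared. The following is a minimal sketch, assuming Python 3 and a
symlink-based layout; find_broken_links is a hypothetical helper, not an
existing OMERO tool.

```python
import os


def find_broken_links(root):
    """Return a sorted list of symlinks under root whose target is missing."""
    broken = []
    # os.walk does not follow symlinked directories by default, so every
    # link is inspected once, as an entry of its parent directory.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself; exists() follows it and
            # returns False when the target is gone.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```

Running this periodically over the directory holding the links would report
imports whose original files were deleted or moved.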

> *Multiple Data Stores*
> Given that we already have a repository of data, will we be able to add
> multiple OMERO data directories? One would be all the data that's already
> been uploaded and subsequent upload imports. The other would be (in the
> solution I am envisaging) a mounted filesystem on another server which has
> a directory per user (it's basically our current fileserver), this is where
> in-place imports would be done from so the data doesn't have to move at all.

There's rudimentary support for this, but it would need working out. If you're interested,
we can certainly try, but there will likely be a number of bugs as well as changes
that need to be made in the UIs.

> *Moving Files*
> How about a mechanism for moving a file? I can imagine users having this
> requirement. Obviously they'd have to go through some special process to
> allow it to happen. What about moving a file from already uploaded data to
> the users repository, I know for sure they are going to want this if the
> above scenario would become a reality.

As with chmod'ing, it seems straightforward to choose between "ln -s" and
"mv" as the implementation of the in-place import.
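
A minimal sketch of that choice, assuming Python 3 on a POSIX filesystem. The
transfer function and its mode names are hypothetical and only mirror the two
shell commands; they are not options of the linked script.

```python
import os
import shutil


def transfer(source, target, mode="ln_s"):
    """Place source at target either by symlinking or by moving.

    mode="ln_s" leaves the data where it is and creates a symlink,
    mode="mv" moves the file itself. Returns the target path.
    """
    parent = os.path.dirname(os.path.abspath(target))
    os.makedirs(parent, exist_ok=True)
    if mode == "ln_s":
        # Link to the absolute path so the link does not depend on the
        # current working directory.
        os.symlink(os.path.abspath(source), target)
    elif mode == "mv":
        shutil.move(source, target)
    else:
        raise ValueError("unknown transfer mode: %r" % (mode,))
    return target
```

With "ln_s" the original stays under the user's control, with the deletion risk
that implies; with "mv" the file ends up at the target and the original path no
longer exists.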

> *Deduplication*
> Finally, on a side-note, are there any plans to have a deduplication
> process? Any file that was previously uploaded with the archive option
> could presumably now have the pixels removed and perhaps be moved to the
> managed repository at the same time?

This was mentioned, but I can't remember where (Paris?). In general, there won't
be a vanilla solution to this, no. It's too dangerous. If you or another site
manager know that it's possible locally, we'll certainly help you come up with
a script and/or process for doing the conversion. But it will require human
intervention and validation.

> Thanks all,
> Douglas

and it was enough! :)
Cheers,
~Josh.