[ome-devel] of graphs and message queues

Tue May 21 12:11:28 BST 2013

On May 21, 2013, at 12:49 PM, Simone Leo wrote:

> We started a discussion on the graph issue here:
> 
> https://trac.openmicroscopy.org.uk/ome/ticket/10940

Thanks, Simone!

Either here or on the ticket, could you add a pointer to your use of the neo4j graph API? I.e. what functionality/queries/etc. would be critical for you to be able to migrate?

Cheers,
~Josh

> Simone
> 
> On 05/15/2013 03:03 PM, Josh Moore wrote:
>> 
>> On May 14, 2013, at 4:31 PM, Simone Leo wrote:
>> 
>>> SYNOPSIS: we are using a graph-oriented db and an MQ for our OMERO-based framework and would like to get these into OMERO
>> 
>> +1
>> 
>>> Read on for the full description :)
>>> 
>>> Hello,
>> 
>> Hi Simone!
>> 
>>> recently we have been working on a major upgrade of our "biobank" Python package. For those who don't know, the biobank is a framework for managing biomedical data built on top of a customized (i.e., we developed our own models) version of OMERO.
>>> 
>>> Our data model focuses on the "chain of custody" concept, where (almost) every object is linked to some other object that created it as the result of an operation that we call simply "action". The chain of events that leads to the creation of an object is dynamic and unpredictable. In some cases, we need to reconstruct this chain of events for an arbitrary object, e.g., retrieving all genotyping information for a given experimental subject.
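As a minimal sketch of the kind of lookup described above (all genotyping objects derived from one subject), assume each "action" is reduced to a (source, target) pair and each object has a kind label. The function names, the `kinds` mapping, and the sample identifiers are hypothetical and are not part of biobank or OMERO.

```python
from collections import defaultdict, deque


def build_children_index(actions):
    """Index (source_id, target_id) pairs, one per 'action', by source."""
    children = defaultdict(list)
    for source, target in actions:
        children[source].append(target)
    return children


def descendants_of_type(children, kinds, start, wanted_kind):
    """Breadth-first walk from start, collecting reachable objects of one kind."""
    found, seen, todo = [], {start}, deque([start])
    while todo:
        current = todo.popleft()
        for child in children.get(current, ()):
            if child in seen:
                continue  # already visited through another action
            seen.add(child)
            if kinds.get(child) == wanted_kind:
                found.append(child)
            todo.append(child)
    return found


# Hypothetical data: a subject, a blood sample, a DNA sample, two genotypings.
actions = [
    ("subject1", "blood1"),
    ("blood1", "dna1"),
    ("dna1", "geno1"),
    ("dna1", "geno2"),
    ("subject2", "blood2"),
]
kinds = {
    "blood1": "sample", "blood2": "sample", "dna1": "sample",
    "geno1": "genotype", "geno2": "genotype",
}
genotypes = descendants_of_type(
    build_children_index(actions), kinds, "subject1", "genotype")
```

The walk never needs to know in advance which object types sit between the subject and the genotyping data, which is the property that is hard to get from fixed table joins.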
>>> 
>>> Doing this by directly querying the OMERO db does not scale due to the large number of queries and the fact that you can't perform joins since you don't know in advance which tables are involved. This led us to the conclusion that we needed to complement the OMERO-based object repository with a fast graph traversal system which, at first, we implemented in-memory with pygraph. Needless to say, this is not much more scalable than the direct approach, mainly because of the memory requirements (we have to store almost all objects in advance).
>>> 
>>> Recently, we experimented with a scalable solution that uses a graph-oriented db (neo4j) and a message queue (rabbitmq). Essentially, it works as follows:
>>> 
>>> WRITE
>>> -----
>>> 1. biobank sends the add/delete/update command to the OMERO server
>>> 2. the OMERO server returns the created/updated object or confirms the object's deletion
>>> 3. biobank sends an appropriate message for the event to the MQ
>> 
>> The server already does something similar here in that it writes such actions to EventLog. Plugging in an MQ event at that location would be ideal and prevent the client (biobank) from needing to get involved.
>> 
>> 
>>> 4. a consumer daemon periodically polls the queue for new messages and updates the graph db according to the retrieved message
>>> 5.1. if everything goes fine, the daemon sends an ack message to the MQ, which deletes all consumed messages
>>> 5.2. if something goes wrong, the daemon sends a not-ack message to the MQ and logs the error; the message will remain in the queue and the daemon will try to process it again at the next poll
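Steps 4 to 5.2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryQueue` is a hypothetical stand-in for the real MQ, the graph db is reduced to a dict mapping each object to its source, and messages are acked one at a time rather than in a batch. None of these names come from biobank, RabbitMQ, or neo4j.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical stand-in for the MQ: a message stays queued until acked."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def peek(self):
        return self._messages[0] if self._messages else None

    def ack(self):
        # Acknowledging removes the oldest (just consumed) message.
        self._messages.popleft()

    def is_empty(self):
        return not self._messages


def apply_to_graph(graph, message):
    """Apply one add/update/delete event to a dict-based graph (node -> source)."""
    op, node = message["op"], message["id"]
    if op in ("add", "update"):
        graph[node] = message.get("source")
    elif op == "delete":
        del graph[node]  # raises KeyError for an unknown node
    else:
        raise ValueError("unknown operation: %r" % op)


def poll_once(queue, graph, log):
    """Consume queued messages in order; stop at the first failure (not-ack)."""
    while not queue.is_empty():
        message = queue.peek()
        try:
            apply_to_graph(graph, message)
        except Exception as exc:
            # Not-ack: log the error and leave the message for the next poll.
            log.append("failed %r: %s" % (message, exc))
            return False
        queue.ack()
    return True
```

Stopping at the first failure keeps events in order, at the cost of one bad message blocking everything queued behind it.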
>>> 
>>> READ
>>> ----
>>> when we need to traverse an object's chain of custody, biobank checks to see if the queue is empty: if it is, then OMERO and the graph db are synced, so we can interrogate the latter to get what we need; if it is not, the calling process retries after a suitable amount of time.
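The read path above amounts to "check that the queue is empty, otherwise wait and retry". A minimal sketch, assuming the graph is a dict mapping each object to the object it was derived from, and that `queue_is_empty` is a caller-supplied callable (both are illustrative assumptions, as are the `retries` and `delay` parameters):

```python
import time


def chain_of_custody(graph, node):
    """Follow source links from node back to the root; graph maps node -> source."""
    chain = [node]
    seen = {node}
    while graph.get(chain[-1]) is not None:
        parent = graph[chain[-1]]
        if parent in seen:
            raise ValueError("cycle detected at %r" % parent)
        chain.append(parent)
        seen.add(parent)
    return chain


def read_when_synced(queue_is_empty, graph, node, retries=5, delay=0.01):
    """Traverse only once the queue is drained; otherwise wait and retry."""
    for _ in range(retries):
        if queue_is_empty():
            return chain_of_custody(graph, node)
        time.sleep(delay)
    raise RuntimeError("graph still out of sync after %d attempts" % retries)
```

Bounding the number of retries turns "how out of sync may the stores be" into an explicit parameter of the caller rather than an open-ended wait.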
>> 
>> This is interesting. For search, I've always assumed that having something in the queue is ok, i.e. the server & index can be (hopefully) slightly out of sync and not impact users. But clearly, callers will need/want to specify an SLA on how synced the various datastores are.
>> 
>> 
>>> Both the graph db and the MQ are currently external entities. What we would like to do is work (possibly together with other devs) to get this kind of functionality integrated into OMERO. What's the best way to do so? Regarding MQs, a ticket is already active:
>>> 
>>> http://trac.openmicroscopy.org.uk/ome/ticket/7902
>>> 
>>> Has any work already been done on this?
>> 
>> Only discussions so far. The primary TODO here is to find a deployment scenario that doesn't require Erlang for all OMERO installations. ;) HornetQ is what I've planned on using, but this is quite open. Once the MQ is available, then various pieces of the backend will need to be updated to make use of it.
>> 
>> 
>>> Is the description of the above ticket compatible with what we're doing?
>> 
>> Definitely.
>> 
>> 
>>> What about graphs?
>> 
>> There's no graph-db ticket (that I can remember). Before making any use of this, we would need to agree on an API for graphs within OMERO, taking other requirements from the list into account. If you'd like to launch that discussion, that would be great.
>> 
>> 
>>> Simone & Luca
>> 
>> Cheers,
>> ~Josh
>> 
> 
> -- 
> Simone Leo
> Data Fusion - Distributed Computing
> CRS4
> POLARIS - Building #1
> Piscina Manna
> I-09010 Pula (CA) - Italy
> e-mail: simone.leo at crs4.it
> http://www.crs4.it