[ome-devel] of graphs and message queues

Tue May 14 15:31:37 BST 2013

SYNOPSIS: we are using a graph-oriented db and an MQ for our OMERO-based 
framework and would like to get these into OMERO

Read on for the full description :)

Hello,

Recently we have been working on a major upgrade of our "biobank" Python
package. For those who don't know, the biobank is a framework for 
managing biomedical data built on top of a customized (i.e., we 
developed our own models) version of OMERO.

Our data model focuses on the "chain of custody" concept, where (almost) 
every object is linked to some other object that created it as the 
result of an operation that we call simply "action". The chain of events 
that leads to the creation of an object is dynamic and unpredictable. In 
some cases, we need to reconstruct this chain of events for an arbitrary 
object, e.g., retrieving all genotyping information for a given 
experimental subject.

Doing this by directly querying the OMERO db does not scale due to the 
large number of queries and the fact that you can't perform joins since 
you don't know in advance which tables are involved. This led us to the
conclusion that we needed to complement the OMERO-based object 
repository with a fast graph traversal system which, at first, we 
implemented in memory with pygraph. Needless to say, this is not much more
scalable than the direct approach, mainly because of the memory 
requirements (we have to store almost all objects in advance).
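To make the traversal concrete, here is a minimal sketch of the kind of
query we mean, using a plain dict instead of pygraph or a graph db. The
object ids, the action names and the "kind:number" id convention are made
up for illustration; they are not the actual biobank model.

```python
from collections import deque

# Hypothetical sample data: each edge reads "source --action--> target".
EDGES = [
    ("subject:1", "draw_blood", "blood_sample:7"),
    ("blood_sample:7", "extract_dna", "dna_sample:12"),
    ("dna_sample:12", "genotype", "genotype_data:40"),
    ("dna_sample:12", "genotype", "genotype_data:41"),
    ("subject:2", "draw_blood", "blood_sample:8"),
]


def build_graph(edges):
    """Map each object to the list of objects derived from it."""
    graph = {}
    for source, _action, target in edges:
        graph.setdefault(source, []).append(target)
    return graph


def descendants(graph, start, kind=None):
    """Breadth-first walk over everything derived from `start`.

    If `kind` is given, keep only objects whose id starts with "<kind>:".
    """
    seen = {start}
    todo = deque([start])
    found = []
    while todo:
        node = todo.popleft()
        for child in graph.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            todo.append(child)
            if kind is None or child.startswith(kind + ":"):
                found.append(child)
    return found


graph = build_graph(EDGES)
print(descendants(graph, "subject:1", kind="genotype_data"))
# prints ['genotype_data:40', 'genotype_data:41']
```

The walk itself is cheap; the cost in the in-memory approach is that the
whole edge list has to be loaded before the first query can run.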

Recently, we experimented with a scalable solution that uses a 
graph-oriented db (neo4j) and a message queue (rabbitmq). Essentially, 
it works as follows:

WRITE
-----
1. biobank sends the add/delete/update command to the OMERO server
2. the OMERO server returns the created/updated object or confirms the
object's deletion
3. biobank sends an appropriate message for the event to the MQ
4. a consumer daemon periodically polls the queue for new messages and 
updates the graph db according to the retrieved message
5.1. if everything goes fine, the daemon sends an ack message to the MQ, 
which deletes all consumed messages
5.2. if something goes wrong, the daemon sends a not-ack message to the 
MQ and logs the error; the message will remain in the queue and the 
daemon will try to process it again at the next poll
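
Steps 3 to 5 can be sketched roughly as follows. This is not the biobank
code: InMemoryQueue is a stand-in for rabbitmq, the graph is a plain dict
instead of neo4j, and the message layout ("op", "id", "source") is an
assumption made for the example. Unlike the description above, the sketch
acks one message at a time rather than a whole batch.

```python
import logging
from collections import deque

log = logging.getLogger("graph_sync")


class InMemoryQueue:
    """Stand-in for the MQ: a message stays queued until it is acked."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def peek(self):
        return self._messages[0] if self._messages else None

    def ack(self):
        self._messages.popleft()

    def nack(self):
        pass  # leave the message at the head for the next poll

    def __len__(self):
        return len(self._messages)


def apply_to_graph(graph, message):
    """Apply one event to the graph, a dict of object id -> set of children."""
    op, obj, source = message["op"], message["id"], message.get("source")
    if op == "add":
        if source is not None:
            graph[source].add(obj)  # KeyError if the source is unknown
        graph.setdefault(obj, set())
    elif op == "update":
        pass  # this sketch tracks links only, so there is nothing to change
    elif op == "delete":
        graph.pop(obj, None)
        for children in graph.values():
            children.discard(obj)
    else:
        raise ValueError("unsupported operation: %r" % op)


def poll_once(mq, graph):
    """Steps 4 and 5: drain the queue, stopping at the first failure."""
    while len(mq):
        message = mq.peek()
        try:
            apply_to_graph(graph, message)
        except Exception:
            log.exception("could not apply %r", message)
            mq.nack()
            return False
        mq.ack()
    return True


mq, graph = InMemoryQueue(), {}
mq.publish({"op": "add", "id": "subject:1"})
mq.publish({"op": "add", "id": "blood_sample:7", "source": "subject:1"})
print(poll_once(mq, graph), len(mq))
# prints True 0
```

Since a failed message is delivered again at the next poll, the graph
update should be safe to repeat; in the sketch, adding or deleting the
same object twice leaves the graph unchanged.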

READ
----
When we need to traverse an object's chain of custody, biobank checks
whether the queue is empty. If it is, OMERO and the graph db are in
sync, so we can query the latter to get what we need; if it is not, the
calling process retries after a suitable amount of time.
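
A minimal sketch of that read path, assuming two hypothetical callables:
queue_is_empty() asks the MQ whether anything is still pending, and
run_query() performs the actual graph db traversal. The retry count and
delay are arbitrary defaults, not values we use in production.

```python
import time


def traverse_when_synced(queue_is_empty, run_query, retries=5, delay=0.5):
    """Run `run_query` once the queue is empty; retry while it is not."""
    for attempt in range(retries):
        if queue_is_empty():
            return run_query()
        if attempt < retries - 1:
            time.sleep(delay)
    raise TimeoutError("graph db still behind OMERO after %d checks" % retries)


pending = [2]  # pretend two messages are still queued


def queue_is_empty():
    if pending[0] == 0:
        return True
    pending[0] -= 1  # pretend the daemon consumed one message
    return False


print(traverse_when_synced(queue_is_empty, lambda: ["genotype_data:40"], delay=0.01))
# prints ['genotype_data:40']
```

Note that the check and the query are two separate steps, so a write that
arrives between them can still be missed by the traversal.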

Both the graph db and the MQ are currently external entities. What we 
would like to do is work (possibly together with other devs) to get this 
kind of functionality integrated into OMERO. What's the best way to do 
so? Regarding MQs, a ticket is already active:

http://trac.openmicroscopy.org.uk/ome/ticket/7902

Has any work already been done on this? Is the description of the above 
ticket compatible with what we're doing? What about graphs?

Simone & Luca

-- 
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo at crs4.it
http://www.crs4.it
