[ome-devel] of graphs and message queues
Simone Leo
simleo at crs4.it
Tue May 14 15:31:37 BST 2013
SYNOPSIS: we are using a graph-oriented db and an MQ for our OMERO-based
framework and would like to get these into OMERO
Read on for the full description :)
Hello,
recently we have been working on a major upgrade of our "biobank" Python
package. For those who don't know, the biobank is a framework for
managing biomedical data built on top of a customized (i.e., we
developed our own models) version of OMERO.
Our data model focuses on the "chain of custody" concept, where (almost)
every object is linked to some other object that created it as the
result of an operation that we call simply "action". The chain of events
that leads to the creation of an object is dynamic and unpredictable. In
some cases, we need to reconstruct this chain of events for an arbitrary
object, e.g., retrieving all genotyping information for a given
experimental subject.
Doing this by directly querying the OMERO db does not scale due to the
large number of queries and the fact that you can't perform joins since
you don't know in advance which tables are involved. This led us to the
conclusion that we needed to complement the OMERO-based object
repository with a fast graph traversal system which, at first, we
implemented in-memory with pygraph. Needless to say, this is not much more
scalable than the direct approach, mainly because of the memory
requirements (we have to store almost all objects in advance).
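
To make the traversal concrete, here is a minimal sketch of the in-memory
idea. It uses plain dicts rather than pygraph, and the object names and the
DERIVED_FROM mapping are invented for illustration; they are not the actual
biobank models.

```python
from collections import deque

# Hypothetical toy data: each key is an object, each value lists the objects
# created from it by some action. Names are made up for this example.
DERIVED_FROM = {
    "individual:1": ["blood_sample:7"],
    "blood_sample:7": ["dna_sample:3"],
    "dna_sample:3": ["genotype:21", "genotype:22"],
}


def descendants(graph, start):
    """Return every object reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    todo = deque([start])
    while todo:
        node = todo.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                todo.append(child)
    return order


def genotypes_for(graph, subject):
    """Keep only the genotyping results among a subject's descendants."""
    return [n for n in descendants(graph, subject) if n.startswith("genotype:")]


print(genotypes_for(DERIVED_FROM, "individual:1"))
```

The whole graph has to be resident before the first query, which is where
the memory cost mentioned above comes from.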
Recently, we experimented with a scalable solution that uses a
graph-oriented db (neo4j) and a message queue (rabbitmq). Essentially,
it works as follows:
WRITE
-----
1. biobank sends the add/delete/update command to the OMERO server
2. the OMERO server returns the created/updated object or confirms the
object's deletion
3. biobank sends an appropriate message for the event to the MQ
4. a consumer daemon periodically polls the queue for new messages and
updates the graph db according to the retrieved message
5.1. if everything goes fine, the daemon sends an ack message to the MQ,
which deletes all consumed messages
5.2. if something goes wrong, the daemon sends a not-ack message to the
MQ and logs the error; the message will remain in the queue and the
daemon will try to process it again at the next poll
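
Steps 3 to 5 can be sketched roughly as follows. This is only an in-process
toy: ToyQueue stands in for the MQ and a dict stands in for the graph db, so
none of the names below are RabbitMQ, neo4j or OMERO APIs, and the message
layout ("op", "id", "source") is an assumption made for the example.

```python
from collections import deque


class ToyQueue:
    """In-process stand-in for the MQ (hypothetical, not the RabbitMQ API)."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        # Step 3: biobank enqueues one message per OMERO event.
        self._messages.append(message)

    def peek(self):
        return self._messages[0] if self._messages else None

    def ack(self):
        # Step 5.1: an acknowledged message is removed from the queue.
        self._messages.popleft()

    def __len__(self):
        return len(self._messages)


def apply_to_graph(graph, message):
    """Step 4: mirror one event into the toy graph (object -> its source)."""
    op, obj_id = message["op"], message["id"]
    if op in ("add", "update"):
        graph[obj_id] = message.get("source")
    elif op == "delete":
        del graph[obj_id]  # raises KeyError if the object is unknown
    else:
        raise ValueError("unknown operation: %r" % op)


def poll_once(mq, graph, errors):
    """Consume messages until the queue is empty or one of them fails."""
    while len(mq):
        message = mq.peek()
        try:
            apply_to_graph(graph, message)
        except (KeyError, ValueError) as exc:
            # Step 5.2: no ack, so the message stays queued for the next poll.
            errors.append((message, repr(exc)))
            return False
        mq.ack()
    return True


mq, graph, errors = ToyQueue(), {}, []
mq.publish({"op": "add", "id": "dna_sample:3", "source": "blood_sample:7"})
mq.publish({"op": "delete", "id": "missing:0"})
ok = poll_once(mq, graph, errors)
```

In this sketch a failing message stays at the head of the queue and blocks
the ones behind it until it succeeds; whether that is desirable depends on
how strictly event order has to be preserved.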
READ
----
When we need to traverse an object's chain of custody, biobank checks to
see if the queue is empty: if it is, then OMERO and the graph db are
synced, so we can interrogate the latter to get what we need; if it is
not, the calling process retries after a suitable amount of time.
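
The read path amounts to a check-and-retry loop. A minimal sketch, assuming
two caller-supplied callables (queue_depth and traverse, both hypothetical
names) and arbitrary retry settings:

```python
import time


def traverse_when_synced(queue_depth, traverse, retries=5, delay=0.01):
    """Run `traverse()` once `queue_depth()` reports an empty queue.

    Both callables are placeholders supplied by the caller. Returns the
    traversal result, or raises TimeoutError if the queue never drains.
    """
    for _ in range(retries):
        if queue_depth() == 0:
            # Empty queue: OMERO and the graph db are taken to be in sync.
            return traverse()
        time.sleep(delay)  # wait "a suitable amount of time", then retry
    raise TimeoutError("queue still not empty after %d checks" % retries)


# Toy usage: the pending count drops by one on each check.
pending = [2]


def fake_depth():
    depth = pending[0]
    pending[0] = max(0, depth - 1)
    return depth


result = traverse_when_synced(fake_depth, lambda: ["genotype:21"])
```

Bounding the number of retries keeps a reader from waiting forever when the
consumer daemon is down or a message keeps failing.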
Both the graph db and the MQ are currently external entities. What we
would like to do is work (possibly together with other devs) to get this
kind of functionality integrated into OMERO. What's the best way to do
so? Regarding MQs, a ticket is already active:
http://trac.openmicroscopy.org.uk/ome/ticket/7902
Has any work already been done on this? Is the description of the above
ticket compatible with what we're doing? What about graphs?
Simone & Luca
--
Simone Leo
Data Fusion - Distributed Computing
CRS4
POLARIS - Building #1
Piscina Manna
I-09010 Pula (CA) - Italy
e-mail: simone.leo at crs4.it
http://www.crs4.it