[ome-devel] I like apples. (or "Parallelizing the AnalysisEngine")

Joshua Moore j.moore at dkfz-heidelberg.de
Fri Aug 27 10:30:08 BST 2004


Be it apples, oranges, kiwis, or star fruit, the sign over my door 
should read "produce stand," because you can get it all here.

Ok. Let me take a step back from your concrete problem and look at 
workflows in general. That is of course what the AnalysisEngine 
does: workflows, or call them what you will. There's a lot of work on 
this flying around at the moment, especially in the grid community, so 
it's likely we can find something useful that already exists (this is 
my motto; heard it before?).

I think I can reduce this to three questions we need to answer:

(1) What infrastructure are you willing to support?
(2) Do you want to submit a whole chain or just steps?
(3) Should data or pointers-to-data be sent?

---

(1) basically means that you have to (or _really, really_ want to) have 
some scheduling software.  This is the "black box." I'm not sure what 
you're running on your cluster, but it's probably one of PBS, LSF, SGE, 
Condor, or LoadLeveler, roughly in order of likelihood. For most of 
these there are independent Perl modules for interfacing with them 
[PBS, LSF, ...]. Another choice is [DRMAA], which also has Perl 
bindings (Perl-->C-->scheduler) [DRMAAc].
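
Just to make the shape of that interface concrete, here's roughly what 
the thinnest possible Perl wrapper boils down to. This is only a 
sketch, assuming a PBS-style qsub on the PATH; the queue name and 
script path are made up:

  #!/usr/bin/perl
  # Sketch: hand a wrapper script to a PBS-style scheduler.
  # Assumes qsub is on the PATH; queue and script path are placeholders.
  use strict;
  use warnings;

  my $wrapper = '/tmp/run_module.sh';      # shell script wrapping the module
  my $jobid   = `qsub -q batch $wrapper`;  # qsub prints the job id on stdout
  chomp $jobid;
  die "qsub submission failed\n" unless $jobid;
  print "submitted job $jobid\n";

The PBS/LSF modules linked below essentially wrap this sort of call 
(plus status queries) in a nicer interface.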

That's the basic setup. I know Chris will roll his eyes and possibly go 
into a conniption fit, but an extended possibility is Globus/[GRAM]. 
This provides an entire API (CLI/Java/web services) for interacting 
with jobs. The bonus with DRMAA and Globus is that you don't have to 
implement 15 different Executors; that's the business of the Globus 
folks. The biggest problem--as Chris has mentioned--is that it's not 
"entirely stable" and it's a bit rough to maintain.

(2) Ilya mentioned both possibilities in his apples email: should the 
engine be inside or outside the black box? Having the AE outside the 
black box makes this simple: you add a new Executor, and the AE calls 
it like any other Executor (see the toy sketch below). Voila. This also 
opens up the possibility of a client-side Engine that makes calls into 
the scheduler. All neat stuff.
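
The toy version of that "calls it like any other Executor" point looks 
something like the following. The package names and bodies are 
invented for illustration; only the executeModule call is the engine's:

  #!/usr/bin/perl
  # Toy sketch: the AE makes the same call whether the executor runs
  # the module in-process or hands it to a scheduler.
  use strict;
  use warnings;

  package LocalExecutor;
  sub new           { return bless {}, shift }
  sub executeModule { print "running module '$_[1]' in-process\n" }

  package SchedulerExecutor;
  sub new           { return bless {}, shift }
  sub executeModule { print "submitting module '$_[1]' to the scheduler\n" }

  package main;
  my $use_cluster = 1;   # the one switch users would flip
  my $class    = $use_cluster ? 'SchedulerExecutor' : 'LocalExecutor';
  my $executor = $class->new();
  $executor->executeModule('FindSpots');  # same call either way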

The benefit of having the executor inside the black box is that you 
have one less server process flying about. I think it would take a good 
deal of rewriting, though. On the other hand, depending on how you 
implement it, the scheduling software could _become_ the AE, which 
would take the development off our hands.

(3) Finally, in general you want to reduce data flow, so the engine 
should _not_ be passing around data _by default_ but rather pointers to 
data (LSIDs?).  If the engine knows that it can save a round-trip DB 
call (new results-->store in DB; new process-->retrieve results) by 
storing data in a shared file, etc. (~caching), then that's a 
possibility. This, however, is all pretty touchy stuff, since you have 
to code for failures and distribution. The safest bet is saving to the 
database in a transaction; if it fails, the chain fails.
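
For that safest-bet case, a module's writes just get wrapped in one 
DBI transaction -- something like this sketch (the DSN, table, and 
columns are placeholders, not our schema):

  #!/usr/bin/perl
  # Sketch: store a module's results inside one transaction, so a
  # failure rolls everything back and the chain can be marked failed.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect( 'dbi:Pg:dbname=ome', 'ome', '',
                          { RaiseError => 1, AutoCommit => 1 } );

  eval {
      $dbh->begin_work();
      $dbh->do( 'INSERT INTO module_results (analysis_id, value) VALUES (?, ?)',
                undef, 42, 3.14 );
      # ...more inserts for the module's other outputs...
      $dbh->commit();
  };
  if ($@) {
      warn "storing results failed, rolling back: $@";
      $dbh->rollback();
      # the engine would then mark this node -- and the chain -- as failed
  }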

The best way to get started is to decide on a scheduler and a Perl 
module for it. I would also suggest that all calls be by pointer to 
begin with. The Engine then creates a SchedulerExecutor (which should 
probably be an abstract class) and calls executeModule. The Executor 
submits a job, which is typically a shell script, and then polls for 
the results. (Polling could be avoided with a notification scheme, but 
that requires more rewriting.)
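
A rough sketch of such a SchedulerExecutor, again assuming a PBS-style 
qsub/qstat command line; everything except the executeModule name is 
invented for illustration:

  #!/usr/bin/perl
  # Sketch: submit a wrapper script with qsub, then poll qstat until
  # the scheduler no longer knows about the job.
  use strict;
  use warnings;

  package SchedulerExecutor;

  sub new { return bless { poll_interval => 30 }, shift }

  sub executeModule {
      my ( $self, $module_script ) = @_;

      # Submit the wrapper script; qsub prints the job id on stdout.
      my $jobid = `qsub $module_script`;
      chomp $jobid;
      die "submission failed for $module_script\n" unless $jobid;

      # Poll until qstat stops reporting the job.
      while ( system("qstat $jobid > /dev/null 2>&1") == 0 ) {
          sleep $self->{poll_interval};
      }

      # At this point the engine would collect the results (from the DB,
      # or wherever the wrapper script was told to put them).
      return $jobid;
  }

  package main;
  my $executor = SchedulerExecutor->new();
  # $executor->executeModule('/path/to/wrapper.sh');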

Users should only have to configure which scheduler they are using; the 
SchedulerExecutors should do their best to find the scheduler 
executables on their own. All other configuration is done outside of 
OME.
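
So the configuration could be as small as a single key naming the 
scheduler, with the executor doing a PATH scan for the submit command. 
Roughly (the key, the map, and the per-scheduler commands below are 
illustrative, not a spec):

  #!/usr/bin/perl
  # Sketch: map the one configured value (scheduler name) to a submit
  # command and check that it is actually on the PATH.
  use strict;
  use warnings;

  my %submit_command = (
      pbs    => 'qsub',
      lsf    => 'bsub',
      sge    => 'qsub',
      condor => 'condor_submit',
  );

  my $scheduler = 'pbs';                    # the single configured value
  my $cmd       = $submit_command{$scheduler}
      or die "unknown scheduler '$scheduler'\n";

  my ($found) = grep { -x "$_/$cmd" } split /:/, $ENV{PATH};
  die "$cmd not found on PATH\n" unless $found;
  print "will submit via $found/$cmd\n";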

Thoughts?
  -Josh.

PS to explain the fruit metaphor: clients are also on my plate, because 
I'm charged with providing infrastructure to the biologists. This means 
not just the database infrastructure but also algorithms, filters, 
visualization, etc.

---LINKS---
[DRMAA] http://www.drmaa.org/
[DRMAAc] http://search.cpan.org/~tharsch/Schedule-DRMAAc-0.81/
[GRAM] http://www-unix.globus.org/toolkit/docs/3.2/gram/ws/index.html
[PBS] http://search.cpan.org/author/TMERRITT/PBS-0.02/lib/PBS.pm
[LSF] http://search.cpan.org/author/MSOUTHERN/LSF-0.9/LSF.pm


Ilya Goldberg wrote:

> Well, I'll have another go at you then.
> The sign over your door says "Gridcomputing for the Lifesciences", so 
> why all this talk about clients?
> Here's something meatier to think about:
> The OME analysis engine is built to execute analysis chains - a DAG of 
> analysis modules.  The nodes in the DAG are the analysis modules.  The 
> edges connect outputs to inputs.  Each analysis module is effectively 
> an XML wrapper around some code.  The code can be a command-line 
> executable, a matlab module, a perl class, etc.  Each kind of code has 
> its own "AnalysisHandler", so there's one for Matlab, one for 
> command-line executables, etc.  Between the XML wrapper and the 
> implementation-specific handler, the analysis engine can execute an 
> analysis implemented any which way you want - passing information into 
> the module from the DB and collecting it from the module to put back 
> into the DB, and pass it on down the edge to the next node.
>
> An almost magical aspect of the Analysis Engine is that it knows all 
> of the data dependency.  It can generate lists of tasks that can 
> execute concurrently because it knows they are independent.  It then 
> knows enough to wait for the results, then generate a new list of 
> tasks, etc., until it's done executing the chain.  So the key thing 
> here is that right now it just executes them all on the same box.  In 
> fact a very silly implementation of this knowledge of dependency can 
> be turned on and off with a flag, which will cause the AE to fork as 
> many times as necessary to execute all independent tasks 
> concurrently.  What would be really cool is to break this linkage 
> somewhere between the analysis engine and the various handlers so that 
> all it has to do is issue commands into some computational black box 
> and wait for the black box to return the results.  None of us know 
> enough about grid computing to think intelligently about where it 
> would be appropriate to make this break, what kind of information we 
> send across it, and how we send it.  It could be that the black box 
> should talk to the DB directly (using our Perl APIs to do so - or 
> not), or it could be more appropriate to send the inputs and outputs 
> back and forth and have the analysis engine talk to the DB.  It could 
> be that the entire analysis engine should live in the black box.  Or 
> all this might sound completely cuckoo because my knowledge of grid 
> computing can be summarized on the back of a postage stamp with a 
> felt-tip marker.  However it's done, it would be very nice to preserve 
> this agnosticism about the implementation of the module - so that we 
> could still write modules in our favorite way.  Of course, since 
> determining data dependency is probably the most difficult task in 
> figuring out the granularity of the parallelism, the fact that this is 
> already done should obviously be taken advantage of.
>
> What do you think of them apples?
> -Ilya
>


