[ome-devel] cluster support
Ilya Goldberg
igg at nih.gov
Mon Dec 4 19:21:37 GMT 2006
What Chris said is basically right. In our case we have a couple of
dozen algorithms that have to run on each image. For various
reasons, but mainly for modularity, each of these algorithms is an
independent "job". Run these on a screen-full of images, and you
have several hundred thousand small jobs that would run for a second
or less each.
Couple that with a second or more of startup time just to get the
algorithm's run-time up, and it becomes a big problem.
One way out of this is to "box up" the little jobs into a bigger
job. For example, instead of running a couple dozen algorithms on
each image, make it more monolithic by combining the algorithms
together in a chunkier way. That way, each "job" is bigger and
will take several seconds to run, so the second or two of run-time
startup is less of a penalty. You lose modularity that way though,
along with a lot of flexibility and potentially some execution-time
efficiency on heterogeneous clusters. On the other hand, this is
pretty easy to do.
Take a look at the interface specified in the
OME::Analysis::Engine::Executor abstract class:
http://openmicroscopy.org/APIdocs/OME/Analysis/Engine/Executor.html
This interface is implemented by
OME::Analysis::Engine::SimpleWorkerExecutor, which makes job requests
to worker nodes. The worker nodes process the request using the
OME/Analysis/Engine/NonblockingSlaveWorkerCGI.pl script.
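To make the shape of that interface concrete, here is roughly what a
cluster-aware Executor would have to provide. This is only a sketch:
the package name is made up, and the method bodies are filled in
further below.

package OME::Analysis::Engine::MyClusterExecutor;
use strict;
use warnings;
use base qw(OME::Analysis::Engine::Executor);

# Called once per job by the Analysis Engine; the three parameters
# are the whole job specification.
sub executeModule {
    my ($self, $mex, $dependence, $target) = @_;
    # hand the job to your cluster manager (see below)
}

# Called when the AE has to wait for results before it can go on;
# block here until at least one / all outstanding jobs finish.
sub waitForAnyModules { }
sub waitForAllModules { }

1;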
If you wanted to do something completely different, you would
probably start with OME::Analysis::Engine::SimpleWorkerExecutor, and
change the way it communicates with an altered
NonblockingSlaveWorkerCGI.pl. The job specification is very simple:
just the MEX (module execution ID), the dependence ('G', 'D', 'I',
or 'F'), and the target ID (the ID of the Dataset, Image, or Feature).
These are all supplied by the Analysis Engine when it calls the
Executor, and these are all passed on to
OME::Analysis::Engine::UnthreadedPerlExecutor to do the actual
execution.
For example, you could make a new derivative of SimpleWorkerExecutor
that sets up a job request using your cluster management software
instead of what it does now. The job request would be picked up
(immediately or later) by a derivative of
NonblockingSlaveWorkerCGI.pl running on one of the nodes, which would
execute the requested job by calling:
my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
$executor->executeModule($mex, $dependence, $target);
The whole job specification is the three parameters to
executeModule. This is what you get when the AE calls your executor,
and this is what you need to put in the queue. To execute the job
(on a node), you take the job specification off the queue and pass
these three parameters to the blocking executor UnthreadedPerlExecutor.
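As a sketch of both halves of that hand-off (enqueue_job() and
dequeue_job() are placeholders for whatever your cluster manager
actually provides, not existing OME calls):

# Master side, inside the hypothetical MyClusterExecutor above.
# enqueue_job() stands in for your site's submission mechanism
# (bsub, qsub, a job-file entry, an XML-RPC call, ...).
sub executeModule {
    my ($self, $mex, $dependence, $target) = @_;
    enqueue_job({ mex => $mex, dependence => $dependence, target => $target });
}

# Node side, e.g. in a derivative of NonblockingSlaveWorkerCGI.pl.
# dequeue_job() is likewise a placeholder for the pickup mechanism.
use OME::Analysis::Engine::UnthreadedPerlExecutor;

while (my $job = dequeue_job()) {
    # Depending on the API, you may need to look the MEX and target
    # objects up from their IDs before handing them to the executor.
    my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
    $executor->executeModule($job->{mex}, $job->{dependence}, $job->{target});
}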
The job specification (what you put in the cluster manager queue) can
be command-line parameters, entries in a job-file, whatever the
mechanism is. If you use XML-RPC or some other HTTP-based mechanism
(like SOAP), the nodes would have to run an HTTP server in order to
process the request. They don't need one if you use a non-HTTP method
to make the job request.
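For the command-line flavor, the job spec can literally be three
arguments to a small node-side wrapper; the script name and the bsub
line below are made up purely for illustration:

#!/usr/bin/perl
# ome-run-module.pl -- hypothetical node-side wrapper, not part of OME.
# A cluster submission could look something like:
#   bsub perl ome-run-module.pl <MEX> <G|D|I|F> <target-id>
use strict;
use warnings;
use OME::Analysis::Engine::UnthreadedPerlExecutor;

my ($mex, $dependence, $target) = @ARGV;
die "usage: ome-run-module.pl MEX DEPENDENCE TARGET\n" unless defined $target;

# Again, the IDs may need to be turned back into objects from the OME
# database before execution, depending on what executeModule() expects.
my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
$executor->executeModule($mex, $dependence, $target);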
You could implement your Executor to collect all of the jobs for a
single Analysis Engine "Pass", and submit the whole thing as a big
job to the cluster. Or you could negotiate node-allocation with your
cluster manager and distribute the small jobs between the nodes. Or
whatever. A Pass is an attempt by the AE to execute everything
possible without having to wait for any results to come back, so all
of the individual jobs in a Pass can run concurrently. When the AE
reaches a state where it needs to wait for results before it can go
forward, it calls the Executor's waitForAllModules() /
waitForAnyModules() methods. This is where your Executor will need to
block until your job(s) are finished.
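Continuing the sketch from above, the simplest way to block is to poll
the cluster manager until nothing is left in flight.
cluster_jobs_remaining() is a stand-in for whatever status query you
have available (bjobs, qstat, an API call, ...):

sub waitForAllModules {
    my ($self) = @_;
    while (cluster_jobs_remaining() > 0) {
        sleep 5;    # poll interval -- tune to taste
    }
}

sub waitForAnyModules {
    my ($self) = @_;
    my $before = cluster_jobs_remaining();
    while ($before > 0 && cluster_jobs_remaining() >= $before) {
        sleep 5;
    }
}

A smarter Executor could of course register a completion callback with
the cluster manager instead of polling, if it offers one.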
Technically, the job specification also includes remote DB connection
info. This is essentially the DB connection string used by DBI and
the SessionKey from the OME session. This gets passed to the worker
because the worker has to establish its own connection to the remote
OME DB. If your workers are set up to only access one particular DB,
then these can be part of the worker configuration and won't have to
be in the job spec.
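On the worker, that boils down to an ordinary DBI connection; the DSN
and credentials below are placeholders only:

use DBI;

# Connection info delivered with the job spec (or baked into the
# worker configuration); the DSN and credentials are only examples.
my $dsn         = 'dbi:Pg:dbname=ome;host=ome-master;port=5432';
my $session_key = 'SESSION-KEY-FROM-THE-JOB-SPEC';

# A standard Postgres client/server connection from the worker node.
my $dbh = DBI->connect($dsn, 'ome_user', 'secret',
                       { RaiseError => 1, AutoCommit => 1 })
    or die "Cannot connect to the OME database: $DBI::errstr";

# How the SessionKey is turned back into an OME session is up to the
# OME session API; it just travels along with the job spec here.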
Nothing is returned by the executeModule() method other than a
possible exception. The data is written to the DB internally in this
call. Can a cluster node establish a connection with an outside DB,
or must all information pass out of the cluster nodes in the form of
files? The connection is a standard Postgres client/server
connection on port 5432.
As far as OME Perl being a "pig" - I suppose that is a matter of
opinion. Is the JRE less of a pig? MATLAB? All three are more or
less of the same ilk, I would say (though OME Perl is measurably
dwarfed by the other two). All three execute byte code in a run-time.
All are also more or less unavoidable depending on what you're
trying to do.
What people are willing to do with their clusters depends a lot on
how much idle time they're seeing, or time spent doing things deemed
"not so important". The last thing the manager of an over-subscribed
cluster is going to let you do is install a bunch of software and
commit his cluster to yet another major application. On the other
hand, if there's a lot of hardware sitting idle or essentially
spinning its wheels, you'd be surprised what people will let you do.
Considering that a very adequate application-specific cluster for OME
can be had pretty cheap (something under $50k could adequately handle
a million dollars' worth of acquisition hardware), my philosophy would
be to deploy application-specific mini-clusters rather than trying to
squeeze it all under "one big general cluster". Then you get to
decide whether or not to let people run BLAST on your mini-cluster
during its idle time. Or maybe you'd rather spin your cluster's
wheels running all-against-all image similarity searches
(image-blast)? From personal experience, with a large cluster
available to my lab from NIH (i.e. for running MATLAB without OME),
we get much better throughput, more cheaply, having bought a dozen or
so of our own nodes which we can use for whatever we want. In fact,
our throughput with a single dedicated node beat what we could get
out of the shared cluster. This is a very real scenario for an over-subscribed
cluster. The best of both worlds is if you can squeeze your
dedicated nodes into someone else's rack and have them deal with
power supply/network/backup/air conditioning while you get to use
your nodes for whatever you want. This is basically how our setup
works.
There's nothing in the design to prevent you from doing whatever's
necessary to cooperate with your cluster. There are several options
open, but unimplemented. The design of the Analysis Engine lets you
implement what you need to do very easily - essentially by
implementing the interface between OME and your cluster management
software. The OME interface to specify and execute the job consists
of two classes (one existing and one dependent on the cluster
management software), three methods and three parameters. Everything
else is dependent on the cluster management software.
-Ilya
On Dec 4, 2006, at 4:58 AM, Chris Allan wrote:
> Hi guys,
>
> Just going to plop my 2 cents in here as we're doing quite a bit of
> this sort of cluster integration (albeit not directly with OME
> recently) in Dundee at the moment. If I understand things correctly,
> your run time is going to be in minutes (potentially hours?) per job.
> You haven't mentioned much about the scheduling algorithms that your
> LSF install uses though I think that's probably irrelevant for the
> purposes of this discussion. Further to that, how big is the actual
> cluster and how homogeneous is it? Please correct me if I've
> misconstrued anything.
>
> All that context understood, I think it's pretty safe to say that any
> cluster administrator who's halfway sane wouldn't allow you to run
> Apache instances on his/her nodes for a multitude of reasons, not the
> least of which being that you could use it to circumvent the
> scheduler quite easily and that's an administrative nightmare. The
> cleanup phase of job scheduling on LSF may even kill all processes
> spawned during a given job execution that are still in the running
> state upon completion, I'm not sure. That said, if your job run time
> is in fact in the minute time scale it's not going to matter if it
> took seconds to begin execution anyway.
>
> Much of the greater execution state is stored in the OME database, so
> your best bet is to probably write an OME analysis handler whose sole
> responsibility it is to submit a job to the LSF scheduler. The
> submittee should probably be a reasonably intelligent script that can,
> from the MEX and/or other metadata, run the actual job that the
> analysis chain has asked it to. You've got someone else maintaining
> the scheduler for you after all, there's really no need to burden
> yourself with maintaining that sort of infrastructure within OME.
>
> Installing OME's Perl library and dependencies on every single node
> could be a bit of a pig (especially if you have a heterogeneous
> cluster) so you might decide to use the XML-RPC based remote
> framework to communicate with the server depending on the data
> volumes you've got to handle. I guess I'd have to understand exactly
> what sorts of analysis chains you'd want to schedule to run on the
> cluster. All of them? Just some of them?
>
> Given that you want to integrate with an existing cluster whose
> operating system and deployment you may or may not control and an
> existing scheduler I'd suggest that Ilya's group's cluster
> integration tools probably don't give you exactly what you want but
> would give you some sort of idea as to how to write your own and that
> "shouldn't be too hard" (tm).
>
> Ciao.
>
> -Chris
>
> On 2 Dec 2006, at 00:10, Jeremy Muhlich wrote:
>
>> The Harvard Medical School cluster uses LSF.
>>
>> http://www.platform.com/Products/Platform.LSF.Family/
>>
>> All nodes can make network connections to all other nodes, and they
>> all
>> mount a massive shared filesystem in addition to maybe 60GB of local
>> scratch space. The interconnect is gigabit ethernet.
>>
>>
>>
>> On Fri, 2006-12-01 at 17:17 -0500, Ilya Goldberg wrote:
>>> What's the cluster management software being used?
>>> As there is growing interest in cluster computing for this, it would
>>> definitely be worth-while to commit to some scheme that everyone
>>> could be happy with. I think it would be trivial to do this with a
>>> PBS-type manager which essentially runs command-line programs -
>>> assuming you're willing to take the startup hit. In our case, this
>>> was approaching 50% of the total execution time, so seemed very
>>> wasteful. I don't know much about Grid Engine, but people who do
>>> tell me that it is possible to maintain state using this system.
>>> Apache is nothing but a container to persist a MATLAB instance. It
>>> doesn't really matter how that gets done as long as it gets done
>>> somehow.
>>>
>>> Can each node make arbitrary TCP/IP connections to the master?
>>> To an
>>> arbitrary IP address? At the very least it would need to make
>>> client-style HTTP and Postgres connections, and possibly outside of the
>>> cluster unless the database and image servers are running on the
>>> master node (most likely not). Some cluster managers insist on
>>> doing
>>> all communication with files only. That would be a pretty
>>> significant burden.
>>>
>>> My knowledge of Grid Engine can probably be summarized on the
>>> back of
>>> a postage stamp with a felt-tip marker. It seems to me to have the
>>> right bits to do what we want, and it certainly has the shiny Sun
>>> marketing juggernaut behind it, so presumably one would be able to
>>> talk a cluster manager into supporting it - no?
>>> -Ilya
>>>
>>> On Dec 1, 2006, at 12:14 PM, Jeremy Muhlich wrote:
>>>
>>>> On Thu, 2006-11-23 at 14:22 -0500, Ilya Goldberg wrote:
>>>>> So the way the OME cluster is set up is that every node is running
>>>>> Apache. The master node issues requests that include remote DB
>>>>> connection info and job info. The worker node establishes a DB
>>>>> connection, returns an OK message (to unblock the master), then
>>>>> continues processing the request. When it's done, it's supposed to
>>>>> issue an IPC message using the DB driver, but this bit hasn't been
>>>>> working well recently. Anyway, the master doesn't wait around
>>>>> forever for the IPC "finished" message, so things continue
>>>>> cranking
>>>>> along fairly well. The only effect seems to be that the master
>>>>> gets
>>>>> loaded a little more than it should be.
>>>>
>>>> Hmmm. This is a shared cluster with time-limited job queues. For
>>>> example the 15m queue has the highest priority but will kill your
>>>> job
>>>> after 15 minutes. The complete list of queues in priority order is
>>>> 15m,
>>>> 2h, 12h, 1d, 7d, and unlimited. It could be difficult to employ
>>>> your
>>>> apache-everywhere scheme on this sort of system. However, a
>>>> group who
>>>> contributes a node gets top priority on it, so that might be the
>>>> way to
>>>> go.
>>>>
>>>>>>
>>>>>> Also, is the image server more cpu bound or I/O bound?
>>>>>
>>>>> Definitely IO bound. It could start hitting the CPU if you
>>>>> request
>>>>> lots and lots of rendered planes rather than raw data for
>>>>> analysis,
>>>>> but it's probably IO bound even then.
>>>>
>>>> Thanks, that's helpful to know.
>>>>
>>>>
>>>> -- Jeremy
>>>>
>>>
>>
>