[ome-devel] cluster support
Ilya Goldberg
igg at nih.gov
Mon Dec 4 19:21:37 GMT 2006
What Chris said is basically right. In our case we have a couple of
dozen algorithms that have to run on each image. For various
reasons, but mainly for modularity, each of these algorithms is an
independent "job". Run these on a screen-full of images, and you
have several hundred thousand small jobs that would run for a second
or less each.
Couple that with a second or more of startup time just to get the
algorithm's run-time up, and it becomes a big problem.
One way out of this is to "box up" the little jobs into a bigger
job. For example, instead of running a couple dozen algorithms on
each image, make it more monolithic by combining the algorithms
together in a chunkier way. That way, each "job" is bigger and
will take several seconds to run, so the second or two of run-time
startup is less of a penalty. You lose modularity that way though,
along with a lot of flexibility and potentially some execution-time
efficiency on heterogeneous clusters. On the other hand, this is
pretty easy to do.
Take a look at the interface specified in the
OME::Analysis::Engine::Executor abstract class:
http://openmicroscopy.org/APIdocs/OME/Analysis/Engine/Executor.html
This interface is implemented by
OME::Analysis::Engine::SimpleWorkerExecutor, which makes job requests
to worker nodes. The worker nodes process the request using the
OME/Analysis/Engine/NonblockingSlaveWorkerCGI.pl script.
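To make the shape of that interface concrete, here is roughly what a
cluster-aware Executor would have to provide. This is only a sketch:
the package name is made up, and the method bodies are filled in
further below.

package OME::Analysis::Engine::MyClusterExecutor;
use strict;
use warnings;
use base qw(OME::Analysis::Engine::Executor);

# Called once per job by the Analysis Engine; the three parameters
# are the whole job specification.
sub executeModule {
    my ($self, $mex, $dependence, $target) = @_;
    # hand the job to your cluster manager (see below)
}

# Called when the AE has to wait for results before it can go on;
# block here until at least one / all outstanding jobs finish.
sub waitForAnyModules { }
sub waitForAllModules { }

1;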
If you wanted to do something completely different, you would
probably start with OME::Analysis::Engine::SimpleWorkerExecutor, and
change the way it communicates with an altered
NonblockingSlaveWorkerCGI.pl. The job specification is very simple:
just the MEX (module execution ID), the dependence ('G', 'D', 'I',
or 'F'), and the target ID (the ID of the Dataset, Image, or Feature).
These are all supplied by the Analysis Engine when it calls the
Executor, and these are all passed on to
OME::Analysis::Engine::UnthreadedPerlExecutor to do the actual
execution.
For example, you could make a new derivative of SimpleWorkerExecutor
that sets up a job request using your cluster management software
instead of what it does now. The job request would be picked up
(immediately or later) by a derivative of
NonblockingSlaveWorkerCGI.pl running on one of the nodes, which would
execute the requested job by calling:
my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
$executor->executeModule($mex, $dependence, $target);
The whole job specification is the three parameters to
executeModule. This is what you get when the AE calls your executor,
and this is what you need to put in the queue. To execute the job
(on a node), you take the job specification off the queue and pass
these three parameters to the blocking executor UnthreadedPerlExecutor.
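As a sketch of both halves of that hand-off (enqueue_job() and
dequeue_job() are placeholders for whatever your cluster manager
actually provides, not existing OME calls):

# Master side, inside the hypothetical MyClusterExecutor above.
# enqueue_job() stands in for your site's submission mechanism
# (bsub, qsub, a job-file entry, an XML-RPC call, ...).
sub executeModule {
    my ($self, $mex, $dependence, $target) = @_;
    enqueue_job({ mex => $mex, dependence => $dependence, target => $target });
}

# Node side, e.g. in a derivative of NonblockingSlaveWorkerCGI.pl.
# dequeue_job() is likewise a placeholder for the pickup mechanism.
use OME::Analysis::Engine::UnthreadedPerlExecutor;

while (my $job = dequeue_job()) {
    # Depending on the API, you may need to look the MEX and target
    # objects up from their IDs before handing them to the executor.
    my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
    $executor->executeModule($job->{mex}, $job->{dependence}, $job->{target});
}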
The job specification (what you put in the cluster manager queue) can
be command-line parameters, entries in a job-file, whatever the
mechanism is. If you use XML-RPC or some other HTTP-based mechanism
(like SOAP), the nodes would have to run an HTTP server in order to
process the request. They don't need one if you use a non-HTTP method
to make the job request.
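For the command-line flavor, the job spec can literally be three
arguments to a small node-side wrapper; the script name and the bsub
line below are made up purely for illustration:

#!/usr/bin/perl
# ome-run-module.pl -- hypothetical node-side wrapper, not part of OME.
# A cluster submission could look something like:
#   bsub perl ome-run-module.pl <MEX> <G|D|I|F> <target-id>
use strict;
use warnings;
use OME::Analysis::Engine::UnthreadedPerlExecutor;

my ($mex, $dependence, $target) = @ARGV;
die "usage: ome-run-module.pl MEX DEPENDENCE TARGET\n" unless defined $target;

# Again, the IDs may need to be turned back into objects from the OME
# database before execution, depending on what executeModule() expects.
my $executor = OME::Analysis::Engine::UnthreadedPerlExecutor->new();
$executor->executeModule($mex, $dependence, $target);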
You could implement your Executor to collect all of the jobs for a
single Analysis Engine "Pass", and submit the whole thing as a big
job to the cluster. Or you could negotiate node-allocation with your
cluster manager and distribute the small jobs between the nodes. Or
whatever. A Pass is an attempt by the AE to execute everything
possible without having to wait for any results to come back, so all
of the individual jobs in a Pass can run concurrently. When the AE
reaches a state where it needs to wait for results before it can go
forward, it calls the Executor's waitForAllModules() /
waitForAnyModules() methods. This is where your Executor will need to
block until your job(s) are finished.
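Continuing the sketch from above, the simplest way to block is to poll
the cluster manager until nothing is left in flight.
cluster_jobs_remaining() is a stand-in for whatever status query you
have available (bjobs, qstat, an API call, ...):

sub waitForAllModules {
    my ($self) = @_;
    while (cluster_jobs_remaining() > 0) {
        sleep 5;    # poll interval -- tune to taste
    }
}

sub waitForAnyModules {
    my ($self) = @_;
    my $before = cluster_jobs_remaining();
    while ($before > 0 && cluster_jobs_remaining() >= $before) {
        sleep 5;
    }
}

A smarter Executor could of course register a completion callback with
the cluster manager instead of polling, if it offers one.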
Technically, the job specification also includes remote DB connection
info. This is essentially the DB connection string used by DBI and
the SessionKey from the OME session. This gets passed to the worker
because the worker has to establish its own connection to the remote
OME DB. If your workers are set up to only access one particular DB,
then these can be part of the worker configuration and won't have to
be in the job spec.
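On the worker, that boils down to an ordinary DBI connection; the DSN
and credentials below are placeholders only:

use DBI;

# Connection info delivered with the job spec (or baked into the
# worker configuration); the DSN and credentials are only examples.
my $dsn         = 'dbi:Pg:dbname=ome;host=ome-master;port=5432';
my $session_key = 'SESSION-KEY-FROM-THE-JOB-SPEC';

# A standard Postgres client/server connection from the worker node.
my $dbh = DBI->connect($dsn, 'ome_user', 'secret',
                       { RaiseError => 1, AutoCommit => 1 })
    or die "Cannot connect to the OME database: $DBI::errstr";

# How the SessionKey is turned back into an OME session is up to the
# OME session API; it just travels along with the job spec here.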
Nothing is returned by the executeModule() method other than a
possible exception. The data is written to the DB internally in this
call. Can a cluster node establish a connection with an outside DB,
or must all information pass out of the cluster nodes in the form of
files? The connection is a standard Postgres client/server
connection on port 5432.
As far as OME Perl being a "pig" - I suppose that is a matter of
opinion. Is the JRE less of a pig? MATLAB? All three are more or
less of the same ilk, I would say (though OME Perl is measurably
dwarfed by the other two). All three execute byte code in a run-time.
All are also more or less unavoidable depending on what you're
trying to do.
What people are willing to do with their clusters depends a lot on
how much idle time they're seeing, or time spent doing things deemed
"not so important". The last thing the manager of an over-subscribed
cluster is going to let you do is install a bunch of software and
commit his cluster to yet another major application. On the other
hand, if there's a lot of hardware sitting idle or essentially
spinning its wheels, you'd be surprised what people will let you do.
Considering that a very adequate application-specific cluster for OME
can be had pretty cheap (something under $50k could adequately handle
a million dollars' worth of acquisition hardware), my philosophy would
be to deploy application-specific mini-clusters rather than trying to
squeeze it all under "one big general cluster". Then you get to
decide whether or not to let people run BLAST on your mini-cluster
during its idle time. Or maybe you'd rather spin your cluster's
wheels running all-against-all image similarity searches
(image-blast)? From personal experience, with a large cluster
available to my lab from NIH (i.e. for running MATLAB without OME),
we get much better throughput, more cheaply, having bought a dozen or
so of our own nodes which we can use for whatever we want. In fact,
our throughput with a single dedicated node beat what we could get
out of the shared cluster. This is a very real scenario for an over-subscribed
cluster. The best of both worlds is if you can squeeze your
dedicated nodes into someone else's rack and have them deal with
power supply/network/backup/air conditioning while you get to use
your nodes for whatever you want. This is basically how our setup
works.
There's nothing in the design to prevent you from doing whatever's
necessary to cooperate with your cluster. There are several options
open, but unimplemented. The design of the Analysis Engine lets you
implement what you need to do very easily - essentially by
implementing the interface between OME and your cluster management
software. The OME interface to specify and execute the job consists
of two classes (one existing and one dependent on the cluster
management software), three methods and three parameters. Everything
else is dependent on the cluster management software.
-Ilya
On Dec 4, 2006, at 4:58 AM, Chris Allan wrote:
> Hi guys,
>
> Just going to plop my 2 cents in here as we're doing quite a bit of
> this sort of cluster integration (albeit not directly with OME
> recently) in Dundee at the moment. If I understand things correctly,
> your run time is going to be in minutes (potentially hours?) per job.
> You haven't mentioned much about the scheduling algorithms that your
> LSF install uses though I think that's probably irrelevant for the
> purposes of this discussion. Further to that, how big is the actual
> cluster and how homogeneous is it? Please correct me if I've
> misconstrued anything.
>
> All that context understood, I think it's pretty safe to say that any
> cluster administrator who's halfway sane wouldn't allow you to run
> Apache instances on his/her nodes for a multitude of reasons, not the
> least of which being that you could use it to circumvent the
> scheduler quite easily and that's an administrative nightmare. The
> cleanup phase of job scheduling on LSF may even kill all processes
> spawned during a given job execution that are still in the running
> state upon completion, I'm not sure. That said, if your job run time
> is in fact in the minute time scale it's not going to matter if it
> took seconds to begin execution anyway.
>
> Much of the greater execution state is stored in the OME database, so
> your best bet is to probably write an OME analysis handler whose sole
> responsibility it is to submit a job to the LSF scheduler. The
> submittee should probably be a reasonably intelligent script that can,
> from the MEX and/or other metadata, run the actual job that the
> analysis chain has asked it to. You've got someone else maintaining
> the scheduler for you after all, there's really no need to burden
> yourself with maintaining that sort of infrastructure within OME.
>
> Installing OME's Perl library and dependencies on every single node
> could be a bit of a pig (especially if you have a heterogeneous
> cluster) so you might decide to use the XML-RPC based remote
> framework to communicate with the server depending on the data
> volumes you've got to handle. I guess I'd have to understand exactly
> what sorts of analysis chains you'd want to schedule to run on the
> cluster. All of them? Just some of them?
>
> Given that you want to integrate with an existing cluster whose
> operating system and deployment you may or may not control and an
> existing scheduler I'd suggest that Ilya's group's cluster
> integration tools probably don't give you exactly what you want but
> would give you some sort of idea as to how to write your own and that
> "shouldn't be too hard" (tm).
>
> Ciao.
>
> -Chris
>
> On 2 Dec 2006, at 00:10, Jeremy Muhlich wrote:
>
>> The Harvard Medical School cluster uses LSF.
>>
>> http://www.platform.com/Products/Platform.LSF.Family/
>>
>> All nodes can make network connections to all other nodes, and they
>> all
>> mount a massive shared filesystem in addition to maybe 60GB of local
>> scratch space. The interconnect is gigabit ethernet.
>>
>>
>>
>> On Fri, 2006-12-01 at 17:17 -0500, Ilya Goldberg wrote:
>>> What's the cluster management software being used?
>>> As there is growing interest in cluster computing for this, it would
>>> definitely be worth-while to commit to some scheme that everyone
>>> could be happy with. I think it would be trivial to do this with a
>>> PBS-type manager which essentially runs command-line programs -
>>> assuming you're willing to take the startup hit. In our case, this
>>> was approaching 50% of the total execution time, so seemed very
>>> wasteful. I don't know much about Grid Engine, but people who do
>>> tell me that it is possible to maintain state using this system.
>>> Apache is nothing but a container to persist a MATLAB instance. It
>>> doesn't really matter how that gets done as long as it gets done
>>> somehow.
>>>
>>> Can each node make arbitrary TCP/IP connections to the master?
>>> To an
>>> arbitrary IP address? At the very least it would need to make
>>> client-style HTTP and Postgres connections, and possibly outside of the
>>> cluster unless the database and image servers are running on the
>>> master node (most likely not). Some cluster managers insist on
>>> doing
>>> all communication with files only. That would be a pretty
>>> significant burden.
>>>
>>> My knowledge of Grid Engine can probably be summarized on the
>>> back of
>>> a postage stamp with a felt-tip marker. It seems to me to have the
>>> right bits to do what we want, and it certainly has the shiny Sun
>>> marketing juggernaut behind it, so presumably one would be able to
>>> talk a cluster manager into supporting it - no?
>>> -Ilya
>>>
>>> On Dec 1, 2006, at 12:14 PM, Jeremy Muhlich wrote:
>>>
>>>> On Thu, 2006-11-23 at 14:22 -0500, Ilya Goldberg wrote:
>>>>> So the way the OME cluster is set up is that every node is running
>>>>> Apache. The master node issues requests that include remote DB
>>>>> connection info and job info. The worker node establishes a DB
>>>>> connection, returns an OK message (to unblock the master), then
>>>>> continues processing the request. When it's done, it's supposed to
>>>>> issue an IPC message using the DB driver, but this bit hasn't been
>>>>> working well recently. Anyway, the master doesn't wait around
>>>>> forever for the IPC "finished" message, so things continue
>>>>> cranking
>>>>> along fairly well. The only effect seems to be that the master
>>>>> gets
>>>>> loaded a little more than it should be.
>>>>
>>>> Hmmm. This is a shared cluster with time-limited job queues. For
>>>> example the 15m queue has the highest priority but will kill your
>>>> job
>>>> after 15 minutes. The complete list of queues in priority order is
>>>> 15m,
>>>> 2h, 12h, 1d, 7d, and unlimited. It could be difficult to employ
>>>> your
>>>> apache-everywhere scheme on this sort of system. However, a
>>>> group who
>>>> contributes a node gets top priority on it, so that might be the
>>>> way to
>>>> go.
>>>>
>>>>>>
>>>>>> Also, is the image server more cpu bound or I/O bound?
>>>>>
>>>>> Definitely IO bound. It could start hitting the CPU if you
>>>>> request
>>>>> lots and lots of rendered planes rather than raw data for
>>>>> analysis,
>>>>> but it's probably IO bound even then.
>>>>
>>>> Thanks, that's helpful to know.
>>>>
>>>>
>>>> -- Jeremy
>>>>
>>>
>>
>