[ome-devel] cluster support

Sat Dec 2 00:10:55 GMT 2006

The Harvard Medical School cluster uses LSF.

http://www.platform.com/Products/Platform.LSF.Family/

All nodes can make network connections to all other nodes, and they all
mount a massive shared filesystem in addition to maybe 60GB of local
scratch space.  The interconnect is gigabit ethernet.
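
Given the time-limited queues described in the quoted message below, a job
launcher would have to choose a queue from an estimate of how long the job
will run. A minimal sketch of that choice follows. It assumes the queue
names listed below are the literal LSF queue names, that each limit is
wall-clock time, and that jobs go in through `bsub -q`. The names
`pick_queue`, `bsub_command`, and `safety_factor` are hypothetical and not
part of OME.

```python
import math

# Queues in priority order, as listed in the quoted message.
# Assumption: these are the literal LSF queue names and the limits
# are wall-clock minutes.
QUEUES = [
    ("15m", 15),
    ("2h", 2 * 60),
    ("12h", 12 * 60),
    ("1d", 24 * 60),
    ("7d", 7 * 24 * 60),
    ("unlimited", math.inf),
]


def pick_queue(estimated_minutes, safety_factor=1.5):
    """Return the highest-priority queue whose limit covers the padded estimate."""
    if estimated_minutes < 0:
        raise ValueError("estimated_minutes must be non-negative")
    needed = estimated_minutes * safety_factor
    for name, limit in QUEUES:
        if needed <= limit:
            return name
    return "unlimited"  # not reached: the last queue has no limit


def bsub_command(estimated_minutes, argv):
    """Build (but do not run) a bsub argument list for the given command."""
    return ["bsub", "-q", pick_queue(estimated_minutes), *argv]
```

Padding the estimate trades some priority for a lower chance of the job
being killed at the queue limit. The command list could be passed to
`subprocess.run` on a submit host.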

On Fri, 2006-12-01 at 17:17 -0500, Ilya Goldberg wrote:
> What's the cluster management software being used?
> As there is growing interest in cluster computing for this, it would  
> definitely be worthwhile to commit to some scheme that everyone  
> could be happy with.  I think it would be trivial to do this with a  
> PBS-type manager which essentially runs command-line programs -  
> assuming you're willing to take the startup hit.  In our case, this  
> was approaching 50% of the total execution time, so seemed very  
> wasteful.  I don't know much about Grid Engine, but people who do  
> tell me that it is possible to maintain state using this system.   
> Apache is nothing but a container to persist a MATLAB instance.  It  
> doesn't really matter how that gets done as long as it gets done  
> somehow.
> 
> Can each node make arbitrary TCP/IP connections to the master?  To an
> arbitrary IP address?  At the very least it would need to make
> client-style HTTP and Postgres connections, possibly to hosts outside
> the cluster, unless the database and image servers are running on the
> master node (most likely not).  Some cluster managers insist on doing
> all communication with files only.  That would be a pretty
> significant burden.
> 
> My knowledge of Grid Engine can probably be summarized on the back of  
> a postage stamp with a felt-tip marker.  It seems to me to have the  
> right bits to do what we want, and it certainly has the shiny Sun  
> marketing juggernaut behind it, so presumably one would be able to  
> talk a cluster manager into supporting it - no?
> -Ilya
> 
> On Dec 1, 2006, at 12:14 PM, Jeremy Muhlich wrote:
> 
> > On Thu, 2006-11-23 at 14:22 -0500, Ilya Goldberg wrote:
> >> So the way the OME cluster is set up is that every node is running
> >> Apache.  The master node issues requests that include remote DB
> >> connection info and job info.  The worker node establishes a DB
> >> connection, returns an OK message (to unblock the master), then
> >> continues processing the request.  When it's done, it's supposed to
> >> issue an IPC message using the DB driver, but this bit hasn't been
> >> working well recently.  Anyway, the master doesn't wait around
> >> forever for the IPC "finished" message, so things continue cranking
> >> along fairly well.  The only effect seems to be that the master gets
> >> loaded a little more than it should be.
> >
> > Hmmm.  This is a shared cluster with time-limited job queues.  For
> > example the 15m queue has the highest priority but will kill your job
> > after 15 minutes.  The complete list of queues in priority order is
> > 15m, 2h, 12h, 1d, 7d, and unlimited.  It could be difficult to employ
> > your apache-everywhere scheme on this sort of system.  However, a
> > group who contributes a node gets top priority on it, so that might be
> > the way to go.
> >
> >>>
> >>> Also, is the image server more cpu bound or I/O bound?
> >>
> >> Definitely IO bound.  It could start hitting the CPU if you request
> >> lots and lots of rendered planes rather than raw data for analysis,
> >> but it's probably IO bound even then.
> >
> > Thanks, that's helpful to know.
> >
> >
> >  -- Jeremy
> >
> > _______________________________________________
> > ome-devel mailing list
> > ome-devel at lists.openmicroscopy.org.uk
> > http://lists.openmicroscopy.org.uk/mailman/listinfo/ome-devel
> >
> 