[ome-devel] Reliability issues

Wed Sep 8 22:04:38 BST 2004

On Sep 8, 2004, at 7:51 AM, Joshua Moore wrote:

> Does anyone have reliability figures (whatever that may mean)?
>
> What I'm looking for is "proof"--chances of lost data, time gone 
> without production failure, etc.--and perhaps also subjective 
> opinions. I'm getting some visitors tomorrow to look at the system; 
> I'd like to have some quips.

You're asking us to venture into very dangerous territory on a public 
list, so I'll say this up front:  We do not make any guarantees about 
the reliability of this software.  Everyone who uses this software does 
so entirely at their own risk and peril.  We do not make any statements 
or exclusions about the performance of this software other than what is 
outlined in our License (LGPL - in OME/doc/LICENSE), specifically under 
clauses 15 and 16.

I can make some comments on the design because as users of this 
software, reliability and data integrity are very important to us.

Our system is built from components that are designed for reliability:  
Unix, Apache, Perl, Postgres.  None of these groups will make any 
statements about reliability either, but these components are used in 
many high-availability systems.  Some would call it foolhardy to use 
anything else.

We have taken great pains (it's almost a religion) to prevent data loss 
and maintain data integrity.  The data is protected at the lowest level 
by the RDBMS.  We make full use of transactional isolation and 
referential integrity constraints, and have done so from day one.  In 
principle, you can pull the plug on your back-end at any point, and the 
database will not be corrupt or be left in an invalid state (a partial 
analysis, for example).  Is this guaranteed?  No!
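
To make the "no partial analysis" point concrete, here is a minimal sketch in Python. It is not OME code: SQLite stands in for Postgres only so the example is self-contained, and the table names (images, results) and the run_analysis helper are invented for illustration. It shows one transaction plus a foreign-key constraint rejecting a half-valid batch as a whole.

```python
import sqlite3

# SQLite stands in for Postgres here only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE results ("
    " id INTEGER PRIMARY KEY,"
    " image_id INTEGER NOT NULL REFERENCES images(id),"
    " value REAL NOT NULL)"
)
conn.execute("INSERT INTO images (id, name) VALUES (1, 'example')")
conn.commit()


def run_analysis(conn, rows):
    """Insert all rows in one transaction; any failure undoes every row."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for image_id, value in rows:
                conn.execute(
                    "INSERT INTO results (image_id, value) VALUES (?, ?)",
                    (image_id, value),
                )
        return True
    except sqlite3.IntegrityError:
        return False


def result_count(conn):
    return conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


# The second row points at an image that does not exist, so the whole
# batch is rejected, including the first row, which was valid on its own.
failed = run_analysis(conn, [(1, 0.5), (99, 0.7)])
count_after_failure = result_count(conn)

ok = run_analysis(conn, [(1, 0.5), (1, 0.7)])
count_after_success = result_count(conn)
```

After the failed call the results table is still empty; only the second call, where every row is valid, leaves anything behind.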

The philosophy of the design is the opposite of a monolithic system:  
It's a swarm of small processes whose individual death is 
inconsequential.  The little process you've spawned to do whatever it is 
you requested will terminate if there is an error, its transaction will 
be rolled back, and it will be as if you've never run it.  The only 
non-recoverable state that exists is maintained in the DB.  This is all 
set up so that it's safe to bail on any request at any moment.  We have 
seen segfaults in Perl (a couple), many segfaults in OMEIS (which we fix 
as soon as we see them), but never in the main Apache process or in 
Postgres.  As long as the parent Apache and Postgres processes are able 
to spawn children, the system will remain functional and will continue 
servicing requests - even if the children die or segfault.
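
The "swarm of small processes" idea can be sketched roughly as follows. This is an illustration in Python, not how Apache or the OME back-end is implemented: the REQUESTS list and the serve helper are hypothetical, and a child that exits abruptly stands in for a segfault. The point is only that the parent records the failure and moves on to the next request.

```python
import subprocess
import sys

# Hypothetical request handlers: each string is run in its own short-lived
# Python process. The second one exits abruptly, without any cleanup, to
# mimic a crashing child.
REQUESTS = [
    "print('ok: thumbnail')",
    "import os; os._exit(1)",
    "print('ok: metadata')",
]


def serve(code):
    """Run one request in a child process and report how it ended."""
    child = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    if child.returncode != 0:
        # The child died; nothing it held in memory matters to the parent.
        return ("failed", child.returncode)
    return ("done", child.stdout.strip())


# The parent loop keeps servicing requests no matter how a child ends.
outcomes = [serve(code) for code in REQUESTS]
for outcome in outcomes:
    print(outcome)
```

Because each child holds no state that outlives it, the only cleanup a failure needs is the database rollback described above.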

Obviously we have no control over how people implement their 
algorithms.  However, our use of STs allows non-experts to use 
algorithms "safely", ensuring that the algorithm's designer and its 
ultimate user agree on the semantics of the inputs and outputs.  This 
is an attempt to prevent garbage in, garbage out, but there is no 
protection from faulty algorithms.
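
As a loose illustration of what agreeing on inputs and outputs buys you, here is a small Python sketch. It assumes "STs" refers to OME's semantic types, and it is not the OME implementation: the Module class, the check_chain function, and the type names (Pixels, SpotLocation, Trajectory) are all made up for the example. Each step declares the named types it needs and produces, and a chain is refused before it runs if a step's inputs would be missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """A hypothetical analysis step with declared input and output types."""
    name: str
    needs: frozenset
    produces: frozenset


def check_chain(modules, available):
    """Return the types present after the chain; raise if an input is missing."""
    have = set(available)
    for module in modules:
        missing = module.needs - have
        if missing:
            raise TypeError(f"{module.name} is missing inputs: {sorted(missing)}")
        have |= module.produces
    return have


find_spots = Module("FindSpots", frozenset({"Pixels"}), frozenset({"SpotLocation"}))
track_spots = Module("TrackSpots", frozenset({"SpotLocation"}), frozenset({"Trajectory"}))

# A well-formed chain: every step's inputs exist by the time it runs.
final_types = check_chain([find_spots, track_spots], {"Pixels"})

# Skipping the first step leaves TrackSpots without its declared input.
try:
    check_chain([track_spots], {"Pixels"})
    rejected = False
except TypeError:
    rejected = True
```

A check like this catches mismatched plumbing only. It says nothing about whether FindSpots computes anything sensible, which is the "faulty algorithms" caveat above.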

So data integrity was in the design from day one, but we're not 
perfect, we make mistakes, and we make no guarantees.  Only experience 
over many years with millions of images will give us a feel for 
reliability, but there will never be "proof".  For our own good as 
developers, I don't think we should make any statements about 
reliability.  Any non-developers of this system are of course free to 
make any statements they wish, and we hope that they do.

>
> -Josh.
>
> PS I'm going to throw another question in here because it roughly 
> pertains:
>
>            So what happens when OME hits a full-disk error?

By design, I/O errors are fatal, so the transaction will be rolled 
back.  Although OMEIS does not use transactions per se, it does 
pre-allocate disk space before doing any writing, and will return an 
error early (UploadFile and NewPixels) if it's unable to do so.  Any 
error from OMEIS is always fatal to the back-end client that initiated 
it.  OMEIS is also highly intolerant of any errors, and is likewise 
designed as a swarm of small processes without state whose individual 
death is of no consequence.
Sometimes it seems that 1/2 our code is checking error conditions.  But 
again, we know we're not perfect.  There are probably unchecked I/O 
errors in the code, and we haven't tested the system under full-disk 
conditions.
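
The pre-allocation idea can be sketched like this in Python. It is not OMEIS code, and the write_preallocated and reserve helpers are hypothetical names. The sketch reserves space for the whole payload first and gives up, leaving no partial file behind, if the reservation fails. It uses os.posix_fallocate where the platform provides it; the fallback only sets the file length and may not actually reserve disk blocks.

```python
import os
import tempfile


def reserve(fd, size):
    """Ask the OS for `size` bytes up front; raises OSError on failure."""
    if size == 0:
        return
    if hasattr(os, "posix_fallocate"):
        # Asks the filesystem to really allocate the blocks.
        os.posix_fallocate(fd, 0, size)
    else:
        # Fallback: sets the length but may not reserve blocks.
        os.ftruncate(fd, size)


def write_preallocated(path, payload):
    """Reserve space for the payload first; write only if that succeeds."""
    size = len(payload)
    flags = os.O_RDWR | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o600)
    try:
        reserve(fd, size)
    except OSError:
        # E.g. ENOSPC on a full disk: fail before any data is written.
        os.close(fd)
        os.unlink(path)
        raise
    try:
        written = 0
        while written < size:
            written += os.write(fd, payload[written:])
    finally:
        os.close(fd)
    return size


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "pixels.bin")
    write_preallocated(target, b"\x00\x01" * 512)
    stored = os.path.getsize(target)
```

Failing at the reservation step means the caller sees the error before any bytes are written, so there is never a truncated file to clean up later.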

Please note this very carefully:
Although most of us feel that the quality of this code is superb, we do 
not make any guarantees whatsoever.  We do not offer proof of 
reliability or data integrity.  As stated in our license, everyone who 
uses this system does so entirely at their own risk and peril.  We do 
not even imply that this software is fit for any purpose 
whatsoever.  We make no exclusions or additional statements other than 
those spelled out in our license (OME/doc/LICENSE) in clauses 15 and 16.

IANAL, but God I hope that's enough.

-Ilya
