[ome-devel] Reliability issues
Ilya Goldberg
igg at nih.gov
Wed Sep 8 22:04:38 BST 2004
On Sep 8, 2004, at 7:51 AM, Joshua Moore wrote:
> Does anyone have reliability figures (whatever that may mean)?
>
> What I'm looking for is "proof"--chances of lost data, time gone
> without production failure, etc.--and perhaps also subjective
> opinions. I'm getting some visitors tomorrow to look at the system;
> I'd like to have some quips.
You're asking us to venture into very dangerous territory on a public
list, so I'll say this up front: We do not make any guarantees about
the reliability of this software. Everyone who uses this software does
so entirely at their own risk and peril. We do not make any statements
or exclusions about the performance of this software other than what is
outlined in our License (LGPL - in OME/doc/LICENSE), specifically under
clauses 15 and 16.
I can make some comments on the design because as users of this
software, reliability and data integrity are very important to us.
Our system is built from components that are designed for reliability:
Unix, Apache, Perl, Postgres. None of these groups will make any
statements about reliability either, but these components are used in
many high-availability systems. Some would call it foolhardy to use
anything else.
We have taken great pains (it's almost a religion) to prevent data loss
and maintain data integrity. The data is protected at the lowest level
by the RDBMS. We make full use of transactional isolation and
referential integrity constraints, and have done so from day one. In
principle, you can pull the plug on your back-end at any point, and the
database will not be corrupt or be left in an invalid state (a partial
analysis, for example). Is this guaranteed? No!
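To make the "pull the plug" claim concrete, here is a minimal sketch of the pattern, not OME's actual code. It assumes Python's built-in sqlite3 as a stand-in for Postgres, and the table names (analyses, results) and the run_analysis helper are hypothetical. An analysis that dies part-way leaves no partial rows behind, and the foreign key refuses results that belong to no analysis.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a referential integrity constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this switched on
    conn.execute("CREATE TABLE analyses (image_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE results ("
        " image_id INTEGER NOT NULL REFERENCES analyses(image_id),"
        " value REAL NOT NULL)"
    )
    conn.commit()
    return conn


def run_analysis(conn, image_id, values, fail_after=None):
    """Store one analysis and all of its results in a single transaction.

    fail_after simulates a crash after that many results were written.
    Returns True if the analysis was committed, False if it was rolled back.
    """
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO analyses (image_id) VALUES (?)", (image_id,))
            for i, value in enumerate(values):
                if fail_after is not None and i == fail_after:
                    raise RuntimeError("simulated failure mid-analysis")
                conn.execute(
                    "INSERT INTO results (image_id, value) VALUES (?, ?)",
                    (image_id, value),
                )
        return True
    except RuntimeError:
        return False


def count_rows(conn, table):
    """Count rows in one of the two fixed tables above."""
    if table not in ("analyses", "results"):
        raise ValueError("unknown table")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Calling run_analysis(conn, 1, [0.1, 0.2, 0.3], fail_after=2) returns False and leaves both tables empty, so the same analysis can simply be run again. None of this is a guarantee, of course; it only shows what the RDBMS is being asked to do.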
The philosophy of the design is the opposite of a monolithic system:
It's a swarm of small processes whose individual death is
inconsequential. The little process you've spawned to do whatever it is
you requested will terminate if there is an error, its transaction will
be rolled back, and it will be as if you've never run it. The only
non-recoverable state that exists is maintained in the DB. This is all
set up so that it's safe to bail on any request at any moment. We have
seen segfaults in Perl (a couple), many segfaults in omeis (which we
fix as soon as we see them), but never in the main Apache process or in
Postgres. As long as the parent Apache and Postgres processes are able
to spawn children, the system will remain functional and will continue
servicing requests - even if the children die or segfault.
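The swarm idea can be sketched in a few lines. This is an illustration under stated assumptions, not how Apache or OME actually dispatch work: the worker script, the request strings and the service/serve_all helpers are all hypothetical. The parent hands each request to a short-lived child process; a child that errors out or dies hard costs only that one request.

```python
import subprocess
import sys

# Hypothetical worker: a tiny script run in its own process per request.
WORKER = """
import os, sys
request = sys.argv[1]
if request == "crash":
    os.abort()        # hard death, comparable to a segfault
if request == "error":
    sys.exit(1)       # ordinary fatal error
print("done:" + request)
"""


def service(request):
    """Run one request in a child process; return its output or None on failure."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, request],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # The child is gone and held no state worth saving.
        return None
    return proc.stdout.strip()


def serve_all(requests):
    """Keep servicing requests regardless of how earlier children ended."""
    return [service(r) for r in requests]
```

Here serve_all(["a", "crash", "error", "b"]) still answers the last request after two children have died, because the parent holds no per-request state of its own.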
Obviously we have no control over how people implement their
algorithms. However, our use of STs allows non-experts to use
algorithms "safely", ensuring that the algorithm's designer and its
ultimate user agree on the semantics of the inputs and outputs. This
is an attempt to prevent garbage in, garbage out, but there is no
protection from faulty algorithms.
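As a rough illustration of that agreement, here is a sketch that assumes "ST" refers to OME's semantic types. The classes, field names and example types below are invented for the illustration and are not OME's schema or API. Each algorithm declares the type of every input and output, and an output may only be wired to an input that declares the same type.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SemanticType:
    """A named meaning for a piece of data (hypothetical, simplified)."""
    name: str
    unit: str


@dataclass
class Module:
    """An algorithm with declared inputs and outputs (hypothetical)."""
    name: str
    inputs: dict = field(default_factory=dict)   # input name -> SemanticType
    outputs: dict = field(default_factory=dict)  # output name -> SemanticType


def connect(producer, output_name, consumer, input_name):
    """Wire an output to an input only if both sides declare the same type."""
    produced = producer.outputs[output_name]
    expected = consumer.inputs[input_name]
    if produced != expected:
        raise TypeError(
            f"{consumer.name}.{input_name} expects {expected.name}, "
            f"but {producer.name}.{output_name} provides {produced.name}"
        )
    return (producer.name, output_name, consumer.name, input_name)


# Example declarations, all invented for this sketch.
MEAN_INTENSITY = SemanticType("MeanIntensity", "gray levels")
CELL_COUNT = SemanticType("CellCount", "cells")

stats = Module("stats", outputs={"mean": MEAN_INTENSITY})
threshold = Module("threshold", inputs={"level": MEAN_INTENSITY})
report = Module("report", inputs={"count": CELL_COUNT})
```

With these declarations, connect(stats, "mean", threshold, "level") succeeds, while connect(stats, "mean", report, "count") raises TypeError. The check says nothing about whether the algorithm inside either module is correct.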
So data integrity was in the design from day one, but we're not
perfect, we make mistakes, and we make no guarantees. Only experience
over many years with millions of images will give us a feel for
reliability, but there will never be "proof". For our own good as
developers, I don't think we should make any statements about
reliability. Any non-developers of this system are of course free to
make any statements they wish, and we hope that they do.
>
> -Josh.
>
> PS I'm going to throw another question in here because it roughly
> pertains:
>
> So what happens when OME hits a full-disk error?
By design, I/O errors are fatal, so the transaction will be rolled
back. Although OMEIS does not use transactions per se, it does
pre-allocate disk space before doing any writing, and will return an
error early (UploadFile and NewPixels) if it's unable to do so. Any
error from omeis is always fatal to the back-end client that initiated
it. OMEIS is also highly intolerant of any errors, and is also
designed as a swarm of small processes without state whose individual
death is of no consequence.
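The pre-allocation idea looks roughly like the sketch below. It is not OMEIS code: the preallocate and write_at helpers are hypothetical, and it assumes os.posix_fallocate where the platform offers it, falling back to os.ftruncate, which only sets the file length and may leave a sparse file. The point is that a full disk is reported before any data has been written.

```python
import os


def preallocate(path, n_bytes):
    """Create path and reserve n_bytes for it before any data is written.

    Returns an open file descriptor. Raises OSError (for example ENOSPC on a
    full disk) without leaving a partial file behind.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        if hasattr(os, "posix_fallocate"):
            os.posix_fallocate(fd, 0, n_bytes)  # really reserves the blocks
        else:
            os.ftruncate(fd, n_bytes)  # weaker fallback: length only
    except OSError:
        os.close(fd)
        os.unlink(path)  # fail early and leave nothing half-made
        raise
    return fd


def write_at(fd, offset, data):
    """Write data at offset inside the already reserved region."""
    os.lseek(fd, offset, os.SEEK_SET)
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]
```

A caller would reserve the full size first, for example fd = preallocate(path, 4096), and only then start calling write_at. Any OSError from either step would be treated as fatal by the caller.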
Sometimes it seems that half our code is checking error conditions. But
again, we know we're not perfect. There are probably unchecked I/O
errors in the code, and we haven't tested the system under full-disk
conditions.
Please note this very carefully:
Although most of us feel that the quality of this code is superb, we do
not make any guarantees whatsoever. We do not offer proof of
reliability or data integrity. As stated in our license, everyone who
uses this system does so entirely at their own risk and peril. We do
not even imply a fitness of this software's use for any purpose
whatsoever. We make no exclusions or additional statements other than
those spelled out in our license (OME/doc/LICENSE) in clauses 15 and 16.
IANAL, but God I hope that's enough.
-Ilya