[ome-users] Import error

Douglas Russell douglas_russell at hms.harvard.edu
Mon Feb 5 16:19:49 GMT 2018


So after more exploration, I have found the problem to be with the
connection from the OMERO server to the database. Some elements of the
cloud infrastructure make certain assumptions, and one of these appears to
have been causing the problem: in this case, a limit of 1 hour on the
maximum connection length through a load balancer (a commonly used pattern
when using ECS to "solve" a service discovery problem). Ordinarily this
would not have been a problem, but given the size of this data, database
transactions seem to be lasting for several hours, even though there is not
a lot of throughput in that time. The import process is CPU-bound in a
single process.
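
For anyone who hits the same thing, here is a minimal sketch of how the
limit can be confirmed with boto3. It assumes a classic ELB sits between
the OMERO server and postgres; the load balancer name below is made up:

    import boto3

    # Inspect the idle timeout of the (hypothetical) load balancer that sits
    # between the OMERO server and the postgres container.
    elb = boto3.client("elb")
    attrs = elb.describe_load_balancer_attributes(
        LoadBalancerName="omero-postgres-elb"
    )["LoadBalancerAttributes"]

    # Classic ELB idle timeouts are capped at 3600 seconds (1 hour), which is
    # shorter than a multi-hour import transaction with little traffic on it.
    print(attrs["ConnectionSettings"]["IdleTimeout"])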

I have worked around this by provisioning my database differently (I now
use RDS Postgres instead of a Postgres container in ECS, which was actually
my plan all along), but fundamentally, unlimited-duration transactions do
not seem like a good idea.
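
If it's useful to anyone, this is roughly how the offending transactions
can be spotted from the postgres side; a minimal psycopg2 sketch where the
connection details are placeholders:

    import psycopg2

    # List sessions whose transaction has been open for longer than the
    # 1-hour load-balancer limit. Connection details below are placeholders.
    conn = psycopg2.connect(host="my-rds-endpoint", dbname="omero", user="omero")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, now() - xact_start AS xact_age, state, left(query, 80)
              FROM pg_stat_activity
             WHERE xact_start IS NOT NULL
               AND now() - xact_start > interval '1 hour'
             ORDER BY xact_age DESC
        """)
        for pid, age, state, query in cur.fetchall():
            print(pid, age, state, query)
    conn.close()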

Cheers,

Douglas

On Thu, 11 Jan 2018 at 12:58 Douglas Russell <
douglas_russell at hms.harvard.edu> wrote:

> If it helps, Jay was previously able to successfully import this dataset
> on our quad OMERO+, but I don't know if that is the critical difference
> between my scenario and the one that worked.
>
> Happy to try any suggestions. Also, this data is on S3 if you want to play
> with it, just let me know and I can grant your account access. I'd
> recommend playing with it within AWS as it's pretty large!
>
> Cheers,
>
> Douglas
>
> On Wed, 10 Jan 2018 at 18:00 Josh Moore <josh at glencoesoftware.com> wrote:
>
>> On Tue, Jan 9, 2018 at 2:49 PM, Douglas Russell
>> <douglas_russell at hms.harvard.edu> wrote:
>> > FYI: Just the latter three postgres logs relate to the most recent
>> > attempt.
>> >
>> > On Tue, 9 Jan 2018 at 08:35 Douglas Russell
>> > <douglas_russell at hms.harvard.edu> wrote:
>> >>
>> >> And this was all there was in the postgres logs:
>> >>
>> >> 01:33:37 LOG: unexpected EOF on client connection with an open transaction
>> >> 02:13:52 LOG: checkpoints are occurring too frequently (21 seconds apart)
>> >> 02:13:52 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>> >> 07:52:19 LOG: checkpoints are occurring too frequently (10 seconds apart)
>> >> 07:52:19 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>> >> 08:50:05 LOG: unexpected EOF on client connection with an open transaction
>> >>
>> >> My gut feeling is that the database update fails, causing the whole
>> >> import to fail, but it's hard to know what is going on.
>>
>> Sounds plausible. The exception that the server saw:
>>
>>   An I/O error occurred while sending to the backend.
>>
>> rang some bells:
>>
>>  * https://trac.openmicroscopy.org/ome/ticket/2977
>>  * https://trac.openmicroscopy.org/ome/ticket/5858
>>
>> both of which are _query_ issues where an argument (specifically an
>> :id: array) had been passed in which was larger than an int. In this
>> case, perhaps something similar is happening during the flush of the
>> transaction, or more generally, something is just quite big. If it's
>> the latter case, _perhaps_ there's a configuration option on the PG
>> side to permit larger transactions. Obviously, that's only a
>> workaround until the transactions can be broken up appropriately.
>>
>> I'll be in transit tomorrow but can help in the search for such a
>> property afterwards.
>>
>> ~Josh
>>
>>
>>
>>
>> >> D
>> >>
>> >> On Tue, 9 Jan 2018 at 08:20 Douglas Russell
>> >> <douglas_russell at hms.harvard.edu> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> Sorry for delay following this up.
>> >>>
>> >>> These OMERO instances are in Docker, yes, but otherwise I don't think
>> >>> there is anything remarkable about the configuration. I have allocated
>> >>> postgres 5GBs of RAM and am not seeing any messages about that running
>> >>> out of memory. The OMERO server has 20GBs of RAM.
>> >>>
>> >>> The only errors in the Blitz log are:
>> >>>
>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:32,910 ERROR [        ome.services.util.ServiceHandler] (l.Server-7) Method interface ome.api.ThumbnailStore.createThumbnailsByLongestSideSet invocation took 26125
>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [o.s.t.interceptor.TransactionInterceptor] (2-thread-4) Application exception overridden by rollback exception
>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [        ome.services.util.ServiceHandler] (2-thread-4) Method interface ome.services.util.Executor$Work.doWork invocation took 17514887
>> >>>
>> >>> The only thing I haven't yet tried is moving postgres into the same
>> >>> container as OMERO. I can try that if it would help, but I highly
>> >>> doubt it will make any difference, as in this setup there is only one
>> >>> t2.2xlarge instance running everything. It was using a load balancer
>> >>> (the easiest way to connect things up should they actually be on
>> >>> different hosts), but I also tried it without that, giving the IP of
>> >>> the postgres Docker container directly to the OMERO instance
>> >>> configuration, and I got the same result, so it's not the timeout of
>> >>> the load balancer at fault.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Douglas
>> >>>
>> >>> On Wed, 3 Jan 2018 at 06:56 Mark Carroll <m.t.b.carroll at dundee.ac.uk>
>> >>> wrote:
>> >>>>
>> >>>>
>> >>>> On 12/23/2017 12:32 PM, Douglas Russell wrote:
>> >>>> > I'd checked the master log files and there was nothing of interest
>> >>>> > in there. dmesg is more promising though, good idea. It looks like
>> >>>> > a memory issue. I've increased the amount of memory available to
>> >>>> > 20GBs from 4GBs and now it does not fail in the same way. Not sure
>> >>>> > why so much RAM is needed when each image in the screen is only
>> >>>> > 2.6MBs. Now there is a nice new error.
>> >>>>
>> >>>> You have me wondering if the server does the whole plate import in
>> >>>> only one transaction. Also, if memory issues could be due to
>> >>>> PostgreSQL or instead Java (e.g., Hibernate) and, assuming Java-side,
>> >>>> if the issue is pixel data size (do the TIFF files use compression?)
>> >>>> or metadata (e.g., tons of ROIs?). Scalability has been an ongoing
>> >>>> focus for us: we have done much but there is much more yet to be done.
>> >>>>
>> >>>> > Going by the error that I see when the database tries to roll back,
>> >>>> > I think it is timeout related.
>> >>>>
>> >>>> I'm not seeing an obvious timeout issue here but I may well be missing
>> >>>> something, and maybe over the holiday period you have noticed more
>> >>>> clues yourself too?
>> >>>>
>> >>>> > The import log: https://s3.amazonaws.com/dpwr/pat/import_log.txt
>> >>>> > The server logs (I tried the import twice):
>> >>>> > https://s3.amazonaws.com/dpwr/pat/omero_logs.zip
>> >>>> >
>> >>>> > There are a couple of these in the database logs as you'd expect
>> >>>> > for the two import attempts, but nothing else of interest.
>> >>>> >
>> >>>> > LOG: unexpected EOF on client connection with an open transaction
>> >>>>
>> >>>> Mmmm, late in the import process the EOFException from
>> >>>> PGStream.ReceiveChar looks key. I'm trying to think what in
>> >>>> PostgreSQL's pg_* tables might give some hint as to relevant activity
>> >>>> or locks at the time (if it's a timeout, maybe a deadlock?). I guess
>> >>>> there's nothing particularly exciting about how your OMERO server
>> >>>> connects to PostgreSQL? It's simply across a LAN, perhaps via Docker
>> >>>> or somesuch?
>> >>>>
>> >>>> How large is the plate? Given the 5.4 database changes I am wondering
>> >>>> if this could possibly be a regression since 5.3.5 and how easy the
>> >>>> error might be to reproduce in a test environment.
>> >>>>
>> >>>> Now that the holiday season is behind us, at OME we're starting to
>> >>>> return to the office. Happy New Year! With luck we'll get this issue
>> >>>> figured out promptly. My apologies if I missed some existing context
>> >>>> from the thread that I didn't realize already bears on some of my
>> >>>> questions.
>> >>>>
>> >>>> -- Mark
>>
>