[ome-users] Import error

Josh Moore josh at glencoesoftware.com
Tue Feb 6 11:06:49 GMT 2018


Hey Douglas,

On Mon, Feb 5, 2018 at 5:19 PM, Douglas Russell
<douglas_russell at hms.harvard.edu> wrote:
> So after more exploration, I have found the problem to be with the
> connection to the database from the OMERO server. ... but fundamentally,
> unlimited duration transactions do not seem like a good idea.

Thanks for the further digging & glad to hear you have a workaround.
This is certainly a known area that needs work; I've added this thread to
https://trello.com/c/HRxgTg4e/202-large-imports-must-be-broken-up
for the record.
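
In case it is useful to anyone following along, one way to keep an eye on
long-open transactions during an import is to poll pg_stat_activity on the
PostgreSQL side. A minimal sketch, assuming psycopg2 and placeholder
connection details:

    # Sketch: list sessions whose transaction has been open for over a minute.
    # The host/dbname/user values are placeholders; adjust to your setup.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero", user="omero")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, state, now() - xact_start AS xact_age, query
            FROM pg_stat_activity
            WHERE xact_start IS NOT NULL
              AND now() - xact_start > interval '1 minute'
            ORDER BY xact_age DESC
        """)
        for pid, state, age, query in cur.fetchall():
            print(pid, state, age, (query or "")[:80])
    conn.close()

A session sitting "idle in transaction" for the whole import is the one
holding everything open until the import finishes or the connection drops.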

All the best,
~Josh


> Cheers,
>
> Douglas
>
> On Thu, 11 Jan 2018 at 12:58 Douglas Russell
> <douglas_russell at hms.harvard.edu> wrote:
>>
>> If it helps, Jay was previously able to successfully import this dataset
>> on our quad OMERO+, but I don't know if that is the critical difference
>> between my scenario and the one that worked.
>>
>> Happy to try any suggestions. Also, this data is on S3 if you want to play
>> with it, just let me know and I can grant your account access. I'd recommend
>> playing with it within AWS as it's pretty large!
>>
>> Cheers,
>>
>> Douglas
>>
>> On Wed, 10 Jan 2018 at 18:00 Josh Moore <josh at glencoesoftware.com> wrote:
>>>
>>> On Tue, Jan 9, 2018 at 2:49 PM, Douglas Russell
>>> <douglas_russell at hms.harvard.edu> wrote:
>>> > FYI: Just the latter three postgres logs relate to the most recent attempt.
>>> >
>>> > On Tue, 9 Jan 2018 at 08:35 Douglas Russell
>>> > <douglas_russell at hms.harvard.edu> wrote:
>>> >>
>>> >> And this was all there was in the postgres logs:
>>> >>
>>> >> 01:33:37 LOG: unexpected EOF on client connection with an open transaction
>>> >> 02:13:52 LOG: checkpoints are occurring too frequently (21 seconds apart)
>>> >> 02:13:52 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>>> >> 07:52:19 LOG: checkpoints are occurring too frequently (10 seconds apart)
>>> >> 07:52:19 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>>> >> 08:50:05 LOG: unexpected EOF on client connection with an open transaction
>>> >>
>>> >> My gut feeling is that the database update fails, causing the whole
>>> >> import to fail, but it's hard to know what is going on.
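
As an aside on the checkpoint warnings quoted above: the settings behind the
HINT can be inspected directly. A minimal sketch, assuming psycopg2 and
placeholder connection details; note that checkpoint_segments only exists up
to PostgreSQL 9.4 and was replaced by max_wal_size in 9.5+:

    # Sketch: show the checkpoint/WAL settings the HINT refers to. Names that
    # do not exist on your PostgreSQL version are simply not returned.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero", user="omero")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT name, setting, unit
            FROM pg_settings
            WHERE name IN ('checkpoint_segments', 'max_wal_size',
                           'checkpoint_timeout', 'checkpoint_completion_target')
        """)
        for name, setting, unit in cur.fetchall():
            print(name, setting, unit or "")
    conn.close()

Raising checkpoint_segments (or max_wal_size) in postgresql.conf quiets the
"checkpoints are occurring too frequently" messages during a bulk import,
though it will not by itself explain the dropped connection.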
>>>
>>> Sounds plausible. The exception that the server saw:
>>>
>>>   An I/O error occurred while sending to the backend.
>>>
>>> rang some bells:
>>>
>>>  * https://trac.openmicroscopy.org/ome/ticket/2977
>>>  * https://trac.openmicroscopy.org/ome/ticket/5858
>>>
>>> both of which are _query_ issues where an argument (specifically an
>>> :id: array) larger than an int had been passed in. In this
>>> case, perhaps something similar is happening during the flush of the
>>> transaction, or more generally, something is just quite big. If it's
>>> the latter case, _perhaps_ there's a configuration option on the PG
>>> side to permit larger transactions. Obviously, that's only a
>>> workaround until the transactions can be broken up appropriately.
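
A note while searching: as far as I know, PostgreSQL has no single "maximum
transaction size" setting as such, so the first things worth checking are the
timeout and TCP keepalive settings that can sever a connection which holds a
transaction open for hours. A minimal sketch, assuming psycopg2 and
placeholder connection details:

    # Sketch: print the server-side timeout/keepalive settings that can cut
    # off a long-lived connection. Connection details are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero", user="omero")
    with conn.cursor() as cur:
        for name in ("statement_timeout", "tcp_keepalives_idle",
                     "tcp_keepalives_interval", "tcp_keepalives_count"):
            cur.execute("SHOW " + name)
            print(name, "=", cur.fetchone()[0])
    conn.close()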
>>>
>>> I'll be in transit tomorrow but can help in the search for such a
>>> property afterwards.
>>>
>>> ~Josh
>>>
>>>
>>>
>>>
>>> >> D
>>> >>
>>> >> On Tue, 9 Jan 2018 at 08:20 Douglas Russell
>>> >> <douglas_russell at hms.harvard.edu> wrote:
>>> >>>
>>> >>> Hi,
>>> >>>
>>> >>> Sorry for delay following this up.
>>> >>>
>>> >>> These OMERO instances are in Docker, yes, but otherwise I don't think
>>> >>> there is anything remarkable about the configuration. I have allocated
>>> >>> postgres 5GBs of RAM and am not seeing any messages about that running
>>> >>> out of memory. The OMERO server has 20GBs of RAM.
>>> >>>
>>> >>> The only errors in the Blitz log are:
>>> >>>
>>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:32,910 ERROR [        ome.services.util.ServiceHandler] (l.Server-7) Method interface ome.api.ThumbnailStore.createThumbnailsByLongestSideSet invocation took 26125
>>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [o.s.t.interceptor.TransactionInterceptor] (2-thread-4) Application exception overridden by rollback exception
>>> >>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [        ome.services.util.ServiceHandler] (2-thread-4) Method interface ome.services.util.Executor$Work.doWork invocation took 17514887
>>> >>>
>>> >>> The only thing I haven't yet tried is moving postgres into the same
>>> >>> container as OMERO. I can try that if it would help, but I highly doubt
>>> >>> it will make any difference as in this setup there is only one
>>> >>> t2.2xlarge instance running everything. It was using a load balancer
>>> >>> (the easiest way to connect things up should they actually be on
>>> >>> different hosts), but I also tried it without that, giving the IP of
>>> >>> the postgres docker container directly to the OMERO instance
>>> >>> configuration, and I got the same result, so it's not the timeout of
>>> >>> the load balancer at fault.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Douglas
>>> >>>
>>> >>> On Wed, 3 Jan 2018 at 06:56 Mark Carroll <m.t.b.carroll at dundee.ac.uk>
>>> >>> wrote:
>>> >>>>
>>> >>>>
>>> >>>> On 12/23/2017 12:32 PM, Douglas Russell wrote:
>>> >>>> > I'd checked the master log files and there was nothing of interest
>>> >>>> > in there. dmesg is more promising though, good idea. It looks like a
>>> >>>> > memory issue. I've increased the amount of memory available to 20GBs
>>> >>>> > from 4GBs and now it does not fail in the same way. Not sure why so
>>> >>>> > much RAM is needed when each image in the screen is only 2.6MBs. Now
>>> >>>> > there is a nice new error.
>>> >>>>
>>> >>>> You have me wondering if the server does the whole plate import in
>>> >>>> only one transaction. I also wonder whether the memory issues could be
>>> >>>> due to PostgreSQL or instead to Java (e.g., Hibernate) and, assuming
>>> >>>> the Java side, whether the issue is pixel data size (do the TIFF files
>>> >>>> use compression?) or metadata (e.g., tons of ROIs?). Scalability has
>>> >>>> been an ongoing focus for us: we have done much but there is much more
>>> >>>> yet to be done.
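
On the "do the TIFF files use compression?" question, a quick way to check a
sample file, assuming the tifffile package is available (the path below is a
placeholder):

    # Sketch: report the compression scheme and dimensions of a TIFF's first
    # page. Replace the path with one of the plate's image files.
    import tifffile

    path = "/path/to/one/plate/image.tiff"
    with tifffile.TiffFile(path) as tif:
        page = tif.pages[0]
        print("compression:", page.compression)
        print("shape:", page.shape, "dtype:", page.dtype)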
>>> >>>>
>>> >>>> > Going by the error that I see when the database tries to rollback,
>>> >>>> > I think it is timeout related.
>>> >>>>
>>> >>>> I'm not seeing an obvious timeout issue here, but I may well be
>>> >>>> missing something. Maybe over the holiday period you have noticed
>>> >>>> more clues yourself too?
>>> >>>>
>>> >>>> > The import log: https://s3.amazonaws.com/dpwr/pat/import_log.txt
>>> >>>> > The server logs (I tried the import twice):
>>> >>>> > https://s3.amazonaws.com/dpwr/pat/omero_logs.zip
>>> >>>> >
>>> >>>> > There are a couple of these in the database logs as you'd expect
>>> >>>> > for the two import attempts, but nothing else of interest.
>>> >>>> >
>>> >>>> > LOG: unexpected EOF on client connection with an open transaction
>>> >>>>
>>> >>>> Mmmm, late in the import process the EOFException from
>>> >>>> PGStream.ReceiveChar looks key. I'm trying to think what in
>>> >>>> PostgreSQL's pg_* tables might give some hint as to relevant activity
>>> >>>> or locks at the time (if it's a timeout, maybe a deadlock?). I guess
>>> >>>> there's nothing particularly exciting about how your OMERO server
>>> >>>> connects to PostgreSQL? It's simply across a LAN, perhaps via Docker
>>> >>>> or somesuch?
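
On the pg_* question, the usual starting point is pg_locks joined to
pg_stat_activity while the import is in flight, looking for requests that are
not being granted. A minimal sketch, assuming psycopg2 and placeholder
connection details:

    # Sketch: list ungranted lock requests together with the waiting session's
    # query. Connection details are placeholders; run this during the import.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero", user="omero")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT a.pid, l.locktype, l.mode, a.state, a.query
            FROM pg_locks l
            JOIN pg_stat_activity a ON a.pid = l.pid
            WHERE NOT l.granted
        """)
        rows = cur.fetchall()
        if not rows:
            print("no ungranted lock requests right now")
        for pid, locktype, mode, state, query in rows:
            print(pid, locktype, mode, state, (query or "")[:80])
    conn.close()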
>>> >>>>
>>> >>>> How large is the plate? Given the 5.4 database changes I am wondering
>>> >>>> if this could possibly be a regression since 5.3.5 and how easy the
>>> >>>> error might be to reproduce in a test environment.
>>> >>>>
>>> >>>> Now that the holiday season is behind us, at OME we're starting to
>>> >>>> return to the office. Happy New Year! With luck we'll get this issue
>>> >>>> figured out promptly. My apologies if I missed some existing context
>>> >>>> from the thread that already bears on some of my questions.
>>> >>>>
>>> >>>> -- Mark

