<div dir="ltr">So after more exploration, I have found the problem to be with the connection to the database from the OMERO server. Certain assumptions are made by some elements of the cloud infrastructure, and one of these appears to have been causing the problem. Namely a limitation of 1 hour as a maximum connection length through a load balancer (a commonly used pattern when using ECS to "solve" a service discovery problem) in this case. Ordinarily this would not have been a problem, but given the size of this data, database transactions seem be lasting for several hours, even though there is not a lot of throughput in that time. The import process is CPU bound in a single process.<br><div><br></div><div>I have worked around this by provisioning my database in a different way (now I use an RDS postgres instead of a postgres container in ECS which was actually in my plan to do all along anyway), but fundamentally, unlimited duration transactions do not seem like a good idea.</div><div><br></div><div>Cheers,</div><div><br></div><div>Douglas</div><br><div class="gmail_quote"><div dir="ltr">On Thu, 11 Jan 2018 at 12:58 Douglas Russell <<a href="mailto:douglas_russell@hms.harvard.edu">douglas_russell@hms.harvard.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>If it helps, Jay was previously able to successfully import this dataset on our quad OMERO+, but I don't know if that is the critical difference between my scenario and the one that worked.<br></div><div><br></div><div>Happy to try any suggestions. Also, this data is on S3 if you want to play with it, just let me know and I can grant your account access. I'd recommend playing with it within AWS as it's pretty large!<br class="m_-5682013227462259600inbox-inbox-Apple-interchange-newline"></div><div><br></div><div>Cheers,</div><div><br></div><div>Douglas</div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, 10 Jan 2018 at 18:00 Josh Moore <<a href="mailto:josh@glencoesoftware.com" target="_blank">josh@glencoesoftware.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Tue, Jan 9, 2018 at 2:49 PM, Douglas Russell<br>
<<a href="mailto:douglas_russell@hms.harvard.edu" target="_blank">douglas_russell@hms.harvard.edu</a>> wrote:<br>
> FYI: Just the latter three postgres logs relate to the most recent attempt.
>
> On Tue, 9 Jan 2018 at 08:35 Douglas Russell <douglas_russell@hms.harvard.edu> wrote:
>>
>> And this was all there was in the postgres logs:
>>
>> 01:33:37 LOG: unexpected EOF on client connection with an open transaction
>> 02:13:52 LOG: checkpoints are occurring too frequently (21 seconds apart)
>> 02:13:52 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>> 07:52:19 LOG: checkpoints are occurring too frequently (10 seconds apart)
>> 07:52:19 HINT: Consider increasing the configuration parameter "checkpoint_segments".
>> 08:50:05 LOG: unexpected EOF on client connection with an open transaction
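(The checkpoint warnings above concern WAL configuration rather than the failure itself, but they are easy to confirm. A minimal psycopg2 sketch for checking the settings the HINT refers to; connection details are placeholders, and on PostgreSQL 9.5+ checkpoint_segments was replaced by max_wal_size.)

    # Sketch: inspect the checkpoint-related settings mentioned in the HINT.
    # Connection details are placeholders.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero",
                            user="omero", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            """SELECT name, setting, unit
                 FROM pg_settings
                WHERE name IN ('checkpoint_segments',           -- PostgreSQL <= 9.4
                               'max_wal_size',                  -- PostgreSQL >= 9.5
                               'checkpoint_timeout',
                               'checkpoint_completion_target')"""
        )
        for name, setting, unit in cur.fetchall():
            print(name, setting, unit)
    conn.close()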
>>
>> My gut feeling is that the database update fails, causing the whole import to fail, but it's hard to know what is going on.

Sounds plausible. The exception that the server saw:

An I/O error occurred while sending to the backend.

rang some bells:

* https://trac.openmicroscopy.org/ome/ticket/2977
* https://trac.openmicroscopy.org/ome/ticket/5858

both of which are _query_ issues where an argument (specifically an :id: array) was passed in that was larger than an int. In this case, perhaps something similar is happening during the flush of the transaction, or, more generally, something is just quite big. If it's the latter, _perhaps_ there's a configuration option on the PG side to permit larger transactions. Obviously, that would only be a workaround until the transactions can be broken up appropriately.
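One way to see whether a single long or large transaction is involved is to watch pg_stat_activity while the import runs; the transaction start time and the state ("active" vs. "idle in transaction") are usually telling. A rough psycopg2 sketch, with placeholder connection details:

    # Sketch: list open transactions and how long they have been running.
    # Connection details are placeholders; run this while the import is in progress.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero",
                            user="omero", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            """SELECT pid, state, usename,
                      now() - xact_start AS xact_age,
                      left(query, 60)    AS current_query
                 FROM pg_stat_activity
                WHERE xact_start IS NOT NULL
                ORDER BY xact_start"""
        )
        for row in cur.fetchall():
            print(row)
    conn.close()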

I'll be in transit tomorrow but can help in the search for such a property afterwards.

~Josh

>> D
>>
>> On Tue, 9 Jan 2018 at 08:20 Douglas Russell <douglas_russell@hms.harvard.edu> wrote:
>>>
>>> Hi,
>>>
>>> Sorry for the delay in following this up.
>>>
>>> These OMERO instances are in Docker, yes, but otherwise I don't think there is anything remarkable about the configuration. I have allocated postgres 5 GB of RAM and am not seeing any messages about it running out of memory. The OMERO server has 20 GB of RAM.
>>>
>>> The only errors in the Blitz log are:
>>>
>>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:32,910 ERROR [ ome.services.util.ServiceHandler] (l.Server-7) Method interface ome.api.ThumbnailStore.createThumbnailsByLongestSideSet invocation took 26125
>>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [o.s.t.interceptor.TransactionInterceptor] (2-thread-4) Application exception overridden by rollback exception
>>> /opt/omero/server/OMERO.server/var/log/Blitz-0.log:2018-01-09 00:15:33,090 ERROR [ ome.services.util.ServiceHandler] (2-thread-4) Method interface ome.services.util.Executor$Work.doWork invocation took 17514887
>>>
>>> The only thing I haven't yet tried is moving postgres into the same container as OMERO. I can try that if it would help, but I highly doubt it will make any difference, since in this setup there is only one t2.2xlarge instance running everything. The setup was using a load balancer (the easiest way to connect things up should they actually be on different hosts), but I also tried it without that, giving the OMERO instance configuration the IP of the postgres Docker container directly, and got the same result, so it's not the timeout of the load balancer at fault.
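(One way to test that directly is to open a transaction over each path, first the load balancer and then the container IP, leave it idle, and see how long it survives. A rough psycopg2 sketch, with placeholder connection details.)

    # Sketch: hold a transaction open and idle, probing periodically to see
    # when (if ever) the connection is cut. Point host at the load balancer
    # or at the container IP to compare the two paths. Details are placeholders.
    import time
    import psycopg2

    conn = psycopg2.connect(host="postgres.example.internal",
                            dbname="omero", user="omero", password="secret")
    cur = conn.cursor()
    cur.execute("SELECT txid_current()")  # starts a transaction
    start = time.time()
    try:
        while True:
            time.sleep(300)          # stay idle inside the open transaction
            cur.execute("SELECT 1")  # probe the connection
            print("still alive after %.0f minutes" % ((time.time() - start) / 60))
    except psycopg2.Error as err:
        print("connection dropped after %.0f minutes: %s"
              % ((time.time() - start) / 60, err))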
>>>
>>> Thanks,
>>>
>>> Douglas
>>>
>>> On Wed, 3 Jan 2018 at 06:56 Mark Carroll <m.t.b.carroll@dundee.ac.uk> wrote:
>>>>
>>>> On 12/23/2017 12:32 PM, Douglas Russell wrote:
>>>> > I'd checked the master log files and there was nothing of interest in there. dmesg is more promising though, good idea. It looks like a memory issue. I've increased the amount of memory available from 4 GB to 20 GB and now it does not fail in the same way. Not sure why so much RAM is needed when each image in the screen is only 2.6 MB. Now there is a nice new error.
>>>>
>>>> You have me wondering whether the server does the whole plate import in only one transaction. Also, whether the memory issues could be due to PostgreSQL or instead to Java (e.g., Hibernate) and, assuming the Java side, whether the issue is pixel data size (do the TIFF files use compression?) or metadata (e.g., tons of ROIs?). Scalability has been an ongoing focus for us: we have done much, but there is much more yet to be done.
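(On the compression question, a quick look at a few of the TIFFs would answer it. A minimal sketch using the tifffile package; the file path is a hypothetical example.)

    # Sketch: report the compression scheme and size of one sample TIFF.
    # The path below is a placeholder.
    import os
    import tifffile

    path = "/data/plate/A01_field1.tif"  # hypothetical example file
    with tifffile.TiffFile(path) as tif:
        page = tif.pages[0]
        print("compression:", page.compression)
        print("shape:", page.shape, "dtype:", page.dtype)
    print("file size (MB): %.1f" % (os.path.getsize(path) / 1e6))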
>>>>
>>>> > Going by the error that I see when the database tries to roll back, I think it is timeout related.
>>>>
>>>> I'm not seeing an obvious timeout issue here, but I may well be missing something; maybe over the holiday period you have noticed more clues yourself too?
>>>>
>>>> > The import log: https://s3.amazonaws.com/dpwr/pat/import_log.txt
>>>> > The server logs (I tried the import twice): https://s3.amazonaws.com/dpwr/pat/omero_logs.zip
>>>> >
>>>> > There are a couple of these in the database logs as you'd expect for the two import attempts, but nothing else of interest.
>>>> >
>>>> > LOG: unexpected EOF on client connection with an open transaction
>>>>
>>>> Mmmm, late in the import process the EOFException from PGStream.ReceiveChar looks key. I'm trying to think what in PostgreSQL's pg_* tables might give some hint as to relevant activity or locks at the time (if it's a timeout, maybe a deadlock?). I guess there's nothing particularly exciting about how your OMERO server connects to PostgreSQL? It's simply across a LAN, perhaps via Docker or somesuch?
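(For the pg_* tables, something along these lines, run while an import is hanging, would show whether anything is waiting on a lock. A rough psycopg2 sketch of the usual pg_locks/pg_stat_activity join, with placeholder connection details.)

    # Sketch: show ungranted lock requests, i.e. sessions waiting on a lock.
    # Connection details are placeholders; run while the import appears stuck.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="omero",
                            user="omero", password="secret")
    with conn.cursor() as cur:
        cur.execute(
            """SELECT a.pid, a.state, l.locktype, l.mode, l.granted,
                      now() - a.xact_start AS xact_age,
                      left(a.query, 60)    AS current_query
                 FROM pg_locks l
                 JOIN pg_stat_activity a ON a.pid = l.pid
                WHERE NOT l.granted
                ORDER BY a.xact_start"""
        )
        for row in cur.fetchall():
            print(row)
    conn.close()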
>>>>
>>>> How large is the plate? Given the 5.4 database changes, I am wondering if this could possibly be a regression since 5.3.5, and how easy the error might be to reproduce in a test environment.
>>>>
>>>> Now that the holiday season is behind us, at OME we're starting to return to the office. Happy New Year! With luck we'll get this issue figured out promptly. My apologies if I missed some existing context from the thread that already bears on some of my questions.
>>>>
>>>> -- Mark