[ome-users] Omero CLI error - NonZeroReturnCode : assert failed
Josh Moore
josh at glencoesoftware.com
Wed Jul 9 10:20:11 BST 2014
On Jul 9, 2014, at 10:43 AM, John Webber (NBI) wrote:
> Hi Josh,
>
> Thanks again for the response, and sorry that I've taken a few days to respond - I've had a few days of holiday.
No worries. Sure you've earned it. :)
> Have you had any more opportunity to look through my code to identify any obvious errors on my part? I do wonder if I have made some incorrect assumptions or code errors.
I haven't spent more time looking through it, no. But I really don't think *any* of your code should be able to cause what you're seeing, especially after just two images as you describe below. Can you remind me: how big are these files?
> I have added "conn.c.enableKeepAlive(600)" to two different places within my code, following on from your advice below. I have added this line to the following places:
>
> (1) In the "connect()" function, immediately following the "conn.connect()" call. Is this the correct place for it?
Yes. It starts an internal thread (if none is already present) which checks every 10 minutes, though you might want to reduce that to 5 minutes.
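
For reference, here is a minimal sketch of what such a connect() helper could look like (the host/credential parameter names are hypothetical, not taken from your script):

    from omero.gateway import BlitzGateway

    def connect(host, username, password):
        conn = BlitzGateway(username, password, host=host, port=4064)
        conn.connect()
        # Ping the server every 5 minutes (300 s) so the session is not
        # closed for idleness during a long bulk upload.
        conn.c.enableKeepAlive(300)
        return conn
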
> (2) At the start of the "addImage()" function. As this part of the code is looped several times to add multiple images, I thought it made sense to put it here as well.
It probably can't hurt. It should only make a difference after some silent background re-connection. (I don't know if the "enableKeepAlive" property gets lost in that case, but it's possible).
> When I attempted to run a bulk upload with these changes, it only successfully uploaded two images before I saw the same segfault:
>
> [root at v0246 omero-upload]# python UploadControl.py
> Previous session expired for webberj on localhost:4064
> Created session 84811516-1155-4e2c-9e84-724fb2e666c7 (webberj at localhost:4064). Idle timeout: 10.0 min. Current group: system
> Using session 84811516-1155-4e2c-9e84-724fb2e666c7 (webberj at localhost:4064). Idle timeout: 10.0 min. Current group: system
> -! 07/08/14 15:12:48.487 warning: Proxy keep alive failed.
> -! 07/08/14 15:17:47.995 warning: Proxy keep alive failed.
>
>
> I therefore removed the second "conn.c.enableKeepAlive(600)" line in the "addImage" function and amended the entry in the connect() function to be 3600 instead: "conn.c.enableKeepAlive(3600)". I then ran the bulk upload again.
>
> This time the code uploaded half a dozen images before I saw the "NonZeroReturnCode" three times and my code stopped, as follows:
>
> [root at v0246 omero-upload]# python UploadControl.py
> ...
> Previously logged in to localhost:4064 as webberj
> InternalException: Failed to connect: exception ::Glacier2::CannotCreateSessionException
> {
> reason = Server not fully initialized
> }
>
> I have noticed that, on the latest version of Omero, when we see a "NonZeroReturnCode", an image IS still left in Omero! When I log into Omero I see an empty image placeholder, sometimes with a clock icon, sometimes with a "no image" icon. This failure has, however, left the Omero database in a strange state: when I attempt to remove (delete) any of the images from this upload, I see either a "server error 500" or I am simply thrown out of Omero.
Was there an "OutOfMemory" printed to any of the var/log files? This is very strange. That exception is usually only ever seen when the server is starting up. Did it segfault as well?!
> Any advice would be very gratefully received!
Same here!
~J
> Thanks
> John
>
>
>
> -----Original Message-----
> From: Josh Moore [mailto:josh at glencoesoftware.com]
> Sent: 02 July 2014 20:51
> To: John Webber (NBI)
> Cc: Simon Li; ome-users at lists.openmicroscopy.org.uk
> Subject: Re: [ome-users] Omero CLI error - NonZeroReturnCode : assert failed
>
>
> On Jul 1, 2014, at 5:45 PM, John Webber (NBI) wrote:
>
>> Hi Josh,
>>
>> How are you?
>
> Well, thanks, though I'd be better if we could figure out this problem!
>
>
>>> 1) "file checksum mismatch on upload" points to either the upload itself having corrupted the file or the files having changed on disk. Is this a result of a subprocess segfaul'ting and then something else continuing?
>>
>> I can be confident in saying that the files being uploaded are not being changed on disk, so I believe this is an issue with the upload corrupting the file. Is there something I can specifically check for with regard to a segfaulting subprocess? I have not seen any other signs of this issue, apart from the log files that I have previously uploaded.
>
> I don't think there's anything to check for, if nothing's clear in the logs. Just talking myself through this: corrupt uploads should only be able to happen if there's a burp as the data goes over the wire. In a loop, the importer:
>
> * reads bytes
> * adds them to the hash
> * sends them
> * repeats.
>
> After all files are uploaded like that, all the hashes are sent at once. If anything segfaulted along the way, that would lead to the upload not completing, not to some hash mismatch.
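>
> In rough pseudo-Python, the idea is something like this (a simplified sketch only; the real importer is Java, so the store object and method names here are purely illustrative):
>
>     import hashlib
>
>     def upload_file(path, store, chunk_size=1024 * 1024):
>         hasher = hashlib.sha1()
>         offset = 0
>         with open(path, "rb") as f:
>             while True:
>                 block = f.read(chunk_size)
>                 if not block:
>                     break
>                 hasher.update(block)                    # hash exactly the bytes read
>                 store.write(block, offset, len(block))  # ...and send those same bytes
>                 offset += len(block)
>         return hasher.hexdigest()                       # compared against the server's hash at the end
>
> So a mismatch means the bytes the server received differ from the bytes that were read and hashed on the client.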
>
>
>>> * Is this code viewable somewhere?
>>
>> I'm afraid it isn't yet, but we can share the code at some stage. The plan will be to share this bulk upload process back with anyone who would find it useful.
>>
>> I did attach the code to one of my emails the other day - the one dated 27/06 at 11:22. The code is in a tarball called "omero-upload.tar.gz".
>
> Went back and found the code; I'll do some reading. Glancing over the code, nothing seems to wave a red flag at me.
> I do wonder if you have any statistics along the lines of "this falls over after N imports", though.
>
>
>>> * What are you doing in the way of a "keep alive"?
>>
>> This might be the issue - the answer is "Nothing"! What **should** I be doing in the way of keep alive? I think I might need to do something that I am currently not doing.
>
> In your connect() method, you could call:
>
> conn.c.enableKeepAlive(seconds)
>
> and let us know if that improves the situation. It might be that BlitzGateway is retrying connections, and we're not seeing the exact situation in the logs.
>
>> Thanks again for your assistance!
>
> Gladly!
> ~Josh.
>
>> Thanks
>> John
>>
>>
>> -----Original Message-----
>> From: Josh Moore [mailto:josh at glencoesoftware.com]
>> Sent: 30 June 2014 13:20
>> To: John Webber (NBI)
>> Cc: Simon Li; ome-users at lists.openmicroscopy.org.uk
>> Subject: Re: [ome-users] Omero CLI error - NonZeroReturnCode : assert
>> failed
>>
>> Hi John,
>>
>>
>> there are a couple of interesting things here.
>>
>> 1) "file checksum mismatch on upload" points to either the upload itself having corrupted the file or the files having changed on disk. Is this a result of a subprocess segfaul'ting and then something else continuing?
>>
>> 2) from the upload log, "ObjectNotExistException" points to a session being closed. This is either related to or similar to:
>> WARNING:omero.gateway:ConnectionLostException on <class
>> 'omero.gateway.OmeroGatewaySafeCallWrapper'>
>>
>> 3) "ERROR:root:UploadControl.py: Lockfile(s) ['/tmp/UploadControl.py.7485'] found, but no process identified" -- I assume this is your output?
>>
>> A couple of questions:
>> * Is this code viewable somewhere?
>> * What are you doing in the way of a "keep alive"?
>>
>> Cheers,
>> ~Josh.
>>
>>
>> On Jun 27, 2014, at 5:42 PM, John Webber (NBI) wrote:
>>
>>> Hi Simon (et al)
>>>
>>> As per the second part of the email below, please see attached a concatenated cli.err file, which contains a snapshot of the cli.err files from the bulk upload, together with a log file called omero-upload.log, which is the log file produced by the bulk upload process.
>>>
>>> It looks like, when we see a "NonZeroReturnCode - assert failed" error, the following is logged in the cli.err file:
>>>
>>> 2014-06-27 15:24:53,512 23308 [ main] ERROR ome.formats.importer.cli.ErrorHandler - FILE_EXCEPTION: /tmp/tmpT858EL/13_06_13_Ler_2-1/13_06_13_Ler_2-1_split.inr.gz
>>> java.io.IOException: file checksum mismatch on upload: /root/omero/tmp/omero_root/4921 at v0362/ORIGINAL_METADATA_TXT3494584423817398813f9d905f3-97a6-468a-8875-91e5634f4c3c (client has 346e20d92d95c9157b1fe7c91310e4b778368f4c, server has fb32d4c023d6a9155b6ab14266363d31f57fb77d)
>>> at ome.formats.OMEROMetadataStoreClient.writeFilesToFileStore(OMEROMetadataStoreClient.java:1863)
>>> at ome.formats.importer.ImportLibrary.importImage(ImportLibrary.java:694)
>>> at ome.formats.importer.ImportLibrary.importCandidates(ImportLibrary.java:298)
>>> at ome.formats.importer.cli.CommandLineImporter.start(CommandLineImporter.java:144)
>>> at ome.formats.importer.cli.CommandLineImporter.main(CommandLineImporter.java:463)
>>> 2014-06-27 15:24:53,560 23356 [ main] ERROR ome.formats.importer.cli.ErrorHandler - INTERNAL_EXCEPTION: /tmp/tmpT858EL/13_06_13_Ler_2-1/13_06_13_Ler_2-1_split.inr.gz
>>> java.lang.RuntimeException: java.io.IOException: file checksum mismatch on upload: /root/omero/tmp/omero_root/4921 at v0362/ORIGINAL_METADATA_TXT3494584423817398813f9d905f3-97a6-468a-8875-91e5634f4c3c (client has 346e20d92d95c9157b1fe7c91310e4b778368f4c, server has fb32d4c023d6a9155b6ab14266363d31f57fb77d)
>>> at ome.formats.importer.ImportLibrary.importCandidates(ImportLibrary.java:301)
>>> at ome.formats.importer.cli.CommandLineImporter.start(CommandLineImporter.java:144)
>>> at ome.formats.importer.cli.CommandLineImporter.main(CommandLineImporter.java:463)
>>> Caused by: java.io.IOException: file checksum mismatch on upload: /root/omero/tmp/omero_root/4921 at v0362/ORIGINAL_METADATA_TXT3494584423817398813f9d905f3-97a6-468a-8875-91e5634f4c3c (client has 346e20d92d95c9157b1fe7c91310e4b778368f4c, server has fb32d4c023d6a9155b6ab14266363d31f57fb77d)
>>> at ome.formats.OMEROMetadataStoreClient.writeFilesToFileStore(OMEROMetadataStoreClient.java:1863)
>>> at ome.formats.importer.ImportLibrary.importImage(ImportLibrary.java:694)
>>> at ome.formats.importer.ImportLibrary.importCandidates(ImportLibrary.java:298)
>>> ... 2 more
>>>
>>> java.lang.RuntimeException: java.io.IOException: file checksum mismatch on upload: /root/omero/tmp/omero_root/4921 at v0362/ORIGINAL_METADATA_TXT3494584423817398813f9d905f3-97a6-468a-8875-91e5634f4c3c (client has 346e20d92d95c9157b1fe7c91310e4b778368f4c, server has fb32d4c023d6a9155b6ab14266363d31f57fb77d)
>>> at ome.formats.importer.ImportLibrary.importCandidates(ImportLibrary.java:301)
>>> at ome.formats.importer.cli.CommandLineImporter.start(CommandLineImporter.java:144)
>>> at ome.formats.importer.cli.CommandLineImporter.main(CommandLineImporter.java:463)
>>> Caused by: java.io.IOException: file checksum mismatch on upload: /root/omero/tmp/omero_root/4921 at v0362/ORIGINAL_METADATA_TXT3494584423817398813f9d905f3-97a6-468a-8875-91e5634f4c3c (client has 346e20d92d95c9157b1fe7c91310e4b778368f4c, server has fb32d4c023d6a9155b6ab14266363d31f57fb77d)
>>> at ome.formats.OMEROMetadataStoreClient.writeFilesToFileStore(OMEROMetadataStoreClient.java:1863)
>>> at ome.formats.importer.ImportLibrary.importImage(ImportLibrary.java:694)
>>> at ome.formats.importer.ImportLibrary.importCandidates(ImportLibrary.java:298)
>>> ... 2 more
>>>
>>> Thanks
>>> John
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Simon Li [mailto:s.p.li at dundee.ac.uk]
>>> Sent: 27 June 2014 10:08
>>> To: John Webber (NBI)
>>> Cc: ome-users at lists.openmicroscopy.org.uk
>>> Subject: Re: [ome-users] Core dump attempting to perform bulk upload to Omero
>>>
>>> Hi John
>>>
>>> If it's not too much work for you and your colleague to gather the setup instructions and test scripts together, it'd be useful to see them, though we probably won't be able to set up a test system straight away. In addition to your instructions, a full package list would be useful so we can verify the versions of everything (e.g. `rpm -qa`, or `pip freeze` if you're using Python/pip, etc.).
>>>
>>> For the CLI error, if you send us the output we can take a look.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>>
>>> On 26 Jun 2014, at 17:56, John Webber (NBI) <John.Webber at nbi.ac.uk> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Thank you for your response today.
>>>>
>>>> The testing that I have done up until now has just been on a single system. When I first encountered the core dump, I tried different software versions and found that, just by changing the Omero version, the core dumps stop.
>>>>
>>>> Following on from your email below I have, however, used another Omero 5 server to test this process. This separate Omero server was built by a colleague (Matthew Hartley) following a slightly different build procedure. I have found that my Omero bulk upload process is core dumping (after about 4 uploads) on that server running Omero 5 as well. This process does not core dump on any of our Omero 4 builds.
>>>>
>>>> I can provide build instructions for the servers that I have built, and I can also provide my bulk upload scripts for you to use to try to replicate the issue. Would you like me to send all of this to you?
>>>>
>>>> With regard to the second issue, the "NonZeroReturnCode" from the Omero CLI with the error "assert failed", I don't currently have the stdout or stderr output that accompanies these, but I can send them to you tomorrow. I have just checked the ulimit on the server where I am seeing this the most, and the ulimit is set to "unlimited".
>>>>
>>>> Thanks again for your help!
>>>> John
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Josh Moore [mailto:josh at glencoesoftware.com]
>>>> Sent: 25 June 2014 20:36
>>>> To: John Webber (NBI)
>>>> Cc: Roger Leigh; Roger Leigh; OME Users
>>>> Subject: Re: [ome-users] FW: Core dump attempting to perform bulk upload to Omero
>>>>
>>>>
>>>> On Jun 24, 2014, at 4:55 PM, John Webber (NBI) wrote:
>>>>
>>>>> Hi Roger (et al)
>>>>
>>>> Hi John,
>>>>
>>>>> Sorry for the delay in coming back to you on this issue - I have been doing some further testing to see if I can narrow down the issue.
>>>>> ...
>>>>> With Omero 5, using both Ice 3.4 and Ice 3.5, I see the core dumps, as reported previously. With the other versions of Omero I do not see the core dump.
>>>>
>>>> Thanks for the detailed analysis. Unfortunately, everyone on the team is still quite perplexed.
>>>>
>>>>
>>>>> I have kept ALL other components the same, but only see these core dumps with Omero 5.
>>>>>
>>>>> I have attached to this email the following log files, which contain additional information produced when the code dumped core:
>>>>
>>>> Is it just the one system, or do you have other systems with a similar configuration that you could try? Do you have an installation script or a VM-like image that we could try to reproduce with?
>>>>
>>>>
>>>>> As stated, exactly the same test data was used, on the same server, with all other versions remaining the same, but I did not see the core dumps. The core dumps do not always occur on the same upload, but normally do occur after just a few uploads.
>>>>>
>>>>> For the versions of Omero that do not core dump, I am seeing some ad-hoc issues as well: sometimes (again seemingly randomly and not always for the same upload), I see a "NonZeroReturnCode" from the Omero CLI with the error "assert failed". This appears to happen much more frequently when testing with Ice version 3.5.
>>>>
>>>> Is there any stdout or stderr output that accompanies these? Could you be running into a ulimit constraint? Have we already touched on that? What is your ulimit currently set to?
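>>>>
>>>> (For what it's worth, one quick way to check the limits from inside the Python process itself -- a small sketch using the standard resource module, nothing Omero-specific:)
>>>>
>>>>     import resource
>>>>     # Each call returns the (soft, hard) limits; resource.RLIM_INFINITY means "unlimited".
>>>>     print(resource.getrlimit(resource.RLIMIT_CORE))    # max core dump size
>>>>     print(resource.getrlimit(resource.RLIMIT_NOFILE))  # max open file descriptors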
>>>>
>>>> Cheers,
>>>> ~Josh.
>>>>
>>>>
>>>>> Can anyone give me any pointers on what is going wrong here?
>>>>>
>>>>> Thanks
>>>>> John
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Roger Leigh [mailto:r.leigh at dundee.ac.uk]
>>>>> Sent: 13 June 2014 16:57
>>>>> To: John Webber (NBI); Roger Leigh;
>>>>> ome-users at lists.openmicroscopy.org.uk
>>>>> Subject: Re: FW: [ome-users] Core dump attempting to perfrom bulk
>>>>> upload to Omero
>>>>>
>>>>> On 13/06/2014 15:25, John Webber (NBI) wrote:
>>>>>> Not sure I am getting very far with this error!
>>>>>>
>>>>>> I have re-run the process a few times today for the purpose of testing.
>>>>>>
>>>>>> I have a number of images that are being uploaded to Omero as part of this "bulk upload process". This job is failing at DIFFERENT stages each time I run it - sometimes it uploads 1 or 2 images before it fails, sometimes it will upload 5 or 6 before it fails. It is therefore extremely hard to track down the issue.
>>>>>>
>>>>>> When I ran the job, I saw the following error:
>>>>>>
>>>>>> -! 06/13/14 12:33:54.390 warning: Proxy keep alive failed.
>>>>>>
>>>>>> Which appears after the session information, i.e.
>>>>>>
>>>>>> Using session c4bef719-ae6f-4d85-b4bb-ffc09ee189b9 (webberj at localhost:4064). Idle timeout: 10.0 min. Current group: system
>>>>>> -! 06/13/14 12:33:54.390 warning: Proxy keep alive failed.
>>>>>>
>>>>>> At this point, a core dump is generated.
>>>>>>
>>>>>> As per Roger's email below, I have run an strace, but the resulting logfile is too big to send! I've looked through the logfile but am not sure what I am looking for in order to just snip a section out of it! Is there something specific in the log that I should look for, or can I transfer this file another way?
>>>>>
>>>>> Looking through the log, the first SIGSEGV is in thread 29086.
>>>>> Immediately before, it's using fd=38 (client/logback-classic.jar) and a bit further back fd=5 (ice-glacier2.jar). After this point, it repeatedly segfaults, presumably inside its own SEGV handler.
>>>>>
>>>>> It also spawns threads 29101 and 29105, both of which also subsequently segfault; it's not clear why, but the JVM may be in a horrible mess by this point.
>>>>>
>>>>> This /might/ point to an issue in client-logback.jar, but that's not certain.
>>>>>
>>>>>> The /var/log/messages file just has the following lines:
>>>>>>
>>>>>> Jun 13 12:29:42 v0246 abrt[29178]: Saved core dump of pid 28122 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre/bin/java) to /var/spool/abrt/ccpp-2014-06-13-12:29:35-28122 (750710784 bytes)
>>>>>> Jun 13 12:29:42 v0246 abrtd: Directory 'ccpp-2014-06-13-12:29:35-28122' creation detected
>>>>>> Jun 13 12:29:42 v0246 abrt[29178]: /var/spool/abrt is 1527378116 bytes (more than 1279MiB), deleting 'ccpp-2014-06-13-11:58:33-27089'
>>>>>> Jun 13 12:29:51 v0246 kernel: end_request: I/O error, dev fd0, sector 0
>>>>>> Jun 13 12:29:51 v0246 kernel: end_request: I/O error, dev fd0, sector 0
>>>>>
>>>>> The latter looks erroneous (just no floppy disc present; not sure what's trying to access it). The former may be useful for getting a stacktrace; if it's different from the one you posted, it might be useful to get a backtrace of each thread.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Roger
>>>>>
>>>>> --
>>>>> Dr Roger Leigh -- Open Microscopy Environment
>>>>> Wellcome Trust Centre for Gene Regulation and Expression,
>>>>> College of Life Sciences, University of Dundee, Dow Street,
>>>>> Dundee DD1 5EH Scotland UK  Tel: (01382) 386364
>>>>>
>>>>> The University of Dundee is a registered Scottish Charity, No:
>>>>> SC015096
>>>>> <files.when.core.dump.tar.gz>