[ome-users] FW: Core dump attempting to perform bulk upload to Omero

John Webber (NBI) John.Webber at nbi.ac.uk
Tue Jun 24 15:55:14 BST 2014


Hi Roger (et al)

Sorry for the delay in coming back to you on this issue - I have been doing some further testing to see if I can narrow it down.

I have tested my bulk upload process with the following versions of Omero and Ice:

	OMERO.server-5.0.2-ice34-b26		(ice 3.4)
	OMERO.server-5.0.2-ice35-b26		(ice 3.5)
	OMERO.server-4.4.11-ice35-b114	(ice 3.5)
	OMERO.server-4.4.11-ice34-b114	(ice 3.4)

With Omero 5, using both Ice 3.4 and Ice 3.5, I see the core dumps, as reported previously.  With the other versions of Omero I do not see the core dump.

I have kept ALL other components the same, but only see these core dumps with Omero 5.

I have attached to this email the following log files, which contain additional information produced when the core dump occurred:

	cli.err				named "cli.error.when.core.dump"
	The command line output	named "cmd.line.when.core.dump"
	The bulk upload log		named "var.log.upload.log.when.core.dump"
	The system log file		named "var.log.message.when.core.dump"

I also have a copy of the core dump file, which I can provide if that would be useful.

Each time the core dump was generated, I had called "omero.cli".  The following are the typical arguments that were used:

['import', '-s', 'localhost', '-u', 'shared', '-w', '<password removed>', '-p', '4064', '-d', '101', '-n', '13_06_13_Ler_2-1_new_walls_rotated.inr.gz', '---errs=cli.err', '---file=cli.out', '--', '/tmp/tmpVnY9vZ/13_06_13_Ler_2-1/13_06_13_Ler_2-1_new_walls_rotated.inr.gz']

And 

strict=True

As stated, exactly the same test data was used, on the same server, with all other components remaining the same, and with the other versions of Omero I did not see the core dumps.  The core dumps do not always occur on the same upload, but normally occur after just a few uploads.
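For reference, the invocation above can be assembled programmatically.  The following is a minimal sketch of how the failing import is driven; `build_import_args` is a hypothetical helper name, and the password, dataset id, and paths below are illustrative placeholders, not the real values from the test run:

```python
def build_import_args(server, user, password, port, dataset_id,
                      name, errs, out, path):
    """Assemble the argument list passed to omero.cli for one import."""
    return ['import',
            '-s', server, '-u', user, '-w', password,
            '-p', str(port), '-d', str(dataset_id),
            '-n', name,
            '---errs=%s' % errs, '---file=%s' % out,
            '--', path]

args = build_import_args('localhost', 'shared', 'secret', 4064, 101,
                         'example.inr.gz', 'cli.err', 'cli.out',
                         '/tmp/example/example.inr.gz')

# In the bulk-upload script this list is then handed to the CLI,
# e.g. (requires omero-py; sketch only):
#   from omero.cli import CLI
#   cli = CLI()
#   cli.invoke(args, strict=True)
```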

For the versions of Omero that do not core dump, I am seeing some ad-hoc issues as well: sometimes (again seemingly randomly and not always for the same upload), I see a "NonZeroReturnCode" from the Omero CLI with the error "assert failed".  This appears to happen much more frequently when testing with Ice 3.5.

Can anyone give me any pointers on what is going wrong here?

Thanks
John



-----Original Message-----
From: Roger Leigh [mailto:r.leigh at dundee.ac.uk] 
Sent: 13 June 2014 16:57
To: John Webber (NBI); Roger Leigh; ome-users at lists.openmicroscopy.org.uk
Subject: Re: FW: [ome-users] Core dump attempting to perform bulk upload to Omero

On 13/06/2014 15:25, John Webber (NBI) wrote:
> Not sure I am getting very far with this error!
>
> I have re-run the process a few times today for the purpose of testing.
>
> I have a number of images that are being uploaded to Omero as part of this "bulk upload process".  This job is failing at DIFFERENT stages each time I run it - sometimes it uploads 1 or 2 images before it fails, sometimes it will upload 5 or 6 before it fails.  It is therefore extremely hard to track down the issue.
>
> When I ran the job, I saw the following error:
>
>       -! 06/13/14 12:33:54.390 warning: Proxy keep alive failed.
>
> Which appears after the session information, i.e.
>
>       Using session c4bef719-ae6f-4d85-b4bb-ffc09ee189b9 (webberj at localhost:4064). Idle timeout: 10.0 min. Current group: system
>       -! 06/13/14 12:33:54.390 warning: Proxy keep alive failed.
>
> At this point, a core dump is generated.
>
> As per Roger's email below, I have run an strace, but the resulting logfile is too big to send! I've looked through the logfile but am not sure what I am looking for in order to just snip a section out of it!  Is there something specific in the log that I should look for, or can I transfer this file another way?

Looking through the log, the first SIGSEGV is in thread 29086.
Immediately before, it's using fd=38 (client/logback-classic.jar) and a bit further back fd=5 (ice-glacier2.jar).  After this point, it repeatedly segfaults, presumably inside its own SEGV handler.

It also spawns threads 29101 and 29105, both of which also subsequently segfault; it's not clear why, but the JVM may be in a horrible mess by this point.

This /might/ point to an issue in client-logback.jar, but that's not certain.
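Since the full strace log is too large to send, one way to extract just the relevant section is to find the first SIGSEGV and keep a window of lines leading up to it.  A minimal sketch (the log lines below are synthetic, illustrative strace-style output, not taken from the real log):

```python
def segv_context(lines, window=20):
    """Return the lines leading up to (and including) the first SIGSEGV."""
    for i, line in enumerate(lines):
        if 'SIGSEGV' in line:
            start = max(0, i - window)
            return lines[start:i + 1]
    return []  # no segfault recorded

# Tiny synthetic example:
log = [
    '29086 read(38, "..."..., 8192) = 8192',   # fd=38: logback-classic.jar
    '29086 read(5, "..."..., 8192) = 8192',    # fd=5: ice-glacier2.jar
    '29086 --- SIGSEGV (Segmentation fault) ---',
    '29086 rt_sigreturn(0x1) = 0',
]
snippet = segv_context(log, window=2)
```

The resulting snippet (the window before the first fault) should be small enough to attach to a mail.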

> The /var/log/messages file just has the following lines:
>
> Jun 13 12:29:42 v0246 abrt[29178]: Saved core dump of pid 28122 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre/bin/java) to /var/spool/abrt/ccpp-2014-06-13-12:29:35-28122 (750710784 bytes)
> Jun 13 12:29:42 v0246 abrtd: Directory 'ccpp-2014-06-13-12:29:35-28122' creation detected
> Jun 13 12:29:42 v0246 abrt[29178]: /var/spool/abrt is 1527378116 bytes (more than 1279MiB), deleting 'ccpp-2014-06-13-11:58:33-27089'
> Jun 13 12:29:51 v0246 kernel: end_request: I/O error, dev fd0, sector 0
> Jun 13 12:29:51 v0246 kernel: end_request: I/O error, dev fd0, sector 0

The latter looks erroneous (just no floppy disc present; not sure what's trying to access it).  The former may be useful for getting a stack trace; if it's different from the one you posted, it might be useful to get a backtrace of each thread.
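Per-thread backtraces can be pulled from the abrt-saved core by driving gdb in batch mode.  A sketch that just assembles the command line, assuming abrt stores the core as a file named "coredump" inside the directory from the log above (a common abrt convention, but worth checking on your system):

```python
def gdb_backtrace_cmd(binary, core):
    """Build a batch-mode gdb invocation that prints every thread's backtrace."""
    return ['gdb', '--batch',
            '-ex', 'thread apply all bt',
            binary, core]

cmd = gdb_backtrace_cmd(
    '/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.55.x86_64/jre/bin/java',
    # assumed location: "coredump" file inside the abrt directory
    '/var/spool/abrt/ccpp-2014-06-13-12:29:35-28122/coredump')

# To actually run it (requires gdb and the core file on the server):
#   import subprocess
#   subprocess.run(cmd, check=True)
```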


Regards,
Roger

--
Dr Roger Leigh -- Open Microscopy Environment
Wellcome Trust Centre for Gene Regulation and Expression,
College of Life Sciences, University of Dundee,
Dow Street, Dundee DD1 5EH Scotland UK   Tel: (01382) 386364

The University of Dundee is a registered Scottish Charity, No: SC015096
-------------- next part --------------
A non-text attachment was scrubbed...
Name: files.when.core.dump.tar.gz
Type: application/x-gzip
Size: 3366 bytes
Desc: files.when.core.dump.tar.gz
URL: <http://lists.openmicroscopy.org.uk/pipermail/ome-users/attachments/20140624/de6b3150/attachment.bin>
