[ome-users] Stall issues/download issues still, even with gevent...

Sun Jan 31 13:40:43 GMT 2016

Hi Jake,

128/256 workers seems like a lot. Did you have a chance to look at http://docs.gunicorn.org/en/stable/design.html#how-many-workers?
Are you using the latest libs: gunicorn 19.4.5, gevent 1.0.2 and nginx 1.9+?
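
For a rough idea of scale, the gunicorn design notes linked above suggest (2 x number of cores) + 1 as a starting point for the worker count. A minimal sketch of that rule of thumb follows; the function name is only illustrative, and the result is a starting value to tune under load, not a hard limit:

```python
import multiprocessing


def suggested_workers(cores: int) -> int:
    """Starting point from the gunicorn design notes: (2 x cores) + 1."""
    return 2 * cores + 1


if __name__ == "__main__":
    # cpu_count() reports the cores visible on the machine running this script.
    cores = multiprocessing.cpu_count()
    print(f"{cores} cores: start with about {suggested_workers(cores)} workers")
```

On an 8-core machine this gives 17, far below 128 or 256.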

In your nginx config turn buffering off:

location @proxy_to_app {
    ...
    proxy_buffering off;
    proxy_pass http://omeroweb_omero;
}

Start gunicorn in debug mode:

$ omero web start --workers 5 --worker-connections 1000 --wsgi-args ' --timeout 60 --graceful-timeout 60  --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log '
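
For reference, the options in the command above correspond to gunicorn's own setting names roughly as follows. This is only a sketch in the form of a gunicorn Python config file, meant as a way to read the flags. Whether an OMERO.web installation can be pointed at such a file is an assumption that this thread does not confirm.

```python
# Sketch of the same options expressed as gunicorn settings.
# Values are copied from the command line above.

workers = 5                  # --workers 5
worker_class = "gevent"      # --worker-class gevent
worker_connections = 1000    # --worker-connections 1000 (per worker)
timeout = 60                 # --timeout 60: seconds a silent worker may live
graceful_timeout = 60        # --graceful-timeout 60: seconds allowed on restart
loglevel = "debug"           # --log-level=DEBUG
errorlog = "/home/omero/OMERO.server/var/log/gunicorn.log"  # --error-logfile
```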

Make sure max_requests is set to 0, so that workers are not restarted after a fixed number of requests:

omero config set omero.web.application_server.max_requests 0

The gunicorn log will then show:

[2016-01-31 12:48:56 +0000] [31569] [INFO] Starting gunicorn 19.4.5
[2016-01-31 12:48:56 +0000] [31569] [DEBUG] Arbiter booted
[2016-01-31 12:48:56 +0000] [31569] [INFO] Listening at: http://127.0.0.1:4080 (31569)
[2016-01-31 12:48:56 +0000] [31569] [INFO] Using worker: gevent
[2016-01-31 12:48:56 +0000] [31574] [INFO] Booting worker with pid: 31574
[2016-01-31 12:48:56 +0000] [31575] [INFO] Booting worker with pid: 31575
[2016-01-31 12:48:56 +0000] [31576] [INFO] Booting worker with pid: 31576
[2016-01-31 12:48:56 +0000] [31581] [INFO] Booting worker with pid: 31581
[2016-01-31 12:48:56 +0000] [31590] [INFO] Booting worker with pid: 31590
[2016-01-31 12:48:56 +0000] [31569] [DEBUG] 5 workers
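
To check a log like this for the WORKER TIMEOUT entries quoted further down, a small script can count entries per level and collect the pids that timed out. This is a sketch that assumes only the two line layouts seen in this thread (with and without the bracketed timestamp and UTC offset); the helper name summarize is made up for illustration.

```python
import re
from collections import Counter

# Matches both timestamp styles seen in this thread:
#   [2016-01-31 12:48:56 +0000] [31569] [INFO] ...
#   2016-01-31 09:26:00 [3852] [CRITICAL] ...
LINE_RE = re.compile(
    r"^\[?(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(?: [+-]\d{4})?\]? "
    r"\[(?P<pid>\d+)\] \[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)
TIMEOUT_RE = re.compile(r"WORKER TIMEOUT \(pid:(\d+)\)")


def summarize(lines):
    """Return (entries per log level, set of worker pids that timed out)."""
    levels = Counter()
    timed_out = set()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a gunicorn log line
        levels[match.group("level")] += 1
        hit = TIMEOUT_RE.search(match.group("msg"))
        if hit:
            timed_out.add(int(hit.group(1)))
    return levels, timed_out


# Sample lines copied from the logs in this thread.
sample = [
    "[2016-01-31 12:48:56 +0000] [31574] [INFO] Booting worker with pid: 31574",
    "2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)",
    "2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)",
    "2016-01-31 09:26:01 [5314] [INFO] Booting worker with pid: 5314",
]
levels, timed_out = summarize(sample)
print(levels["CRITICAL"], sorted(timed_out))
```

To run it against a real file, pass the open file object instead of the sample list, for example summarize(open(path)).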

I tested this on an 8-core Xeon with 20 simultaneous downloads of a 3GB TIFF, for example:

wget --limit-rate=10M -O /data/files/30661.tiff https://server.openmicroscopy.org/omero/webgateway/archived_files/download/30661/

100%[======================================================================>] 3,087,100,535 9.76MB/s   in 9m 21s
2016-01-31 13:21:10 (5.25 MB/s) - '/data/files/30661.tiff' saved [3087100535/3087100535]

There were no timeouts, and OMERO.web remained responsive throughout.

As Simon mentioned, we see problems at higher speeds: when a transfer goes above 15-20MB/s, it blocks workers. This issue is under investigation.
Based on your example, I am guessing your average speed was about 12MB/s, so it could be hitting that limit.
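
The figures in this thread can be sanity-checked with a little arithmetic. The sketch below makes several assumptions that the thread does not confirm: that wget reports MB/s in units of 1024 x 1024 bytes, that "9.5GB" and "12MB/s" are decimal units, that the transfer rate is constant, and that a download only succeeds if it finishes within the gunicorn timeout. The function names are illustrative.

```python
def mib_per_second(num_bytes: int, seconds: float) -> float:
    """Average rate in MiB/s (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / seconds / 2**20


def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Time to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


# wget test above: 3,087,100,535 bytes in 9m 21s.
wget_avg = mib_per_second(3_087_100_535, 9 * 60 + 21)
print(f"wget average: {wget_avg:.2f} MiB/s")

# Assumed figures: a 9.5GB file at a guessed 12MB/s (decimal units).
needed = transfer_seconds(9.5e9, 12e6)
print(f"9.5GB at 12MB/s: about {needed:.0f} s")
```

Under those assumptions the 9.5GB file needs roughly 13 minutes, which falls between the 360 s timeout that failed and the 1440 s timeout that worked.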

Could you try it and let us know?

Ola
Software Engineer
Open Microscopy Environment
University of Dundee

On 31 Jan 2016, at 01:31, Jake Carroll <jake.carroll at uq.edu.au> wrote:

Hi again

Unfortunately, I'm still having issues with large downloads failing via the web interface.

I'm using a startup string such as this:

omero web start --workers 128 --wsgi-args '--worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'

It doesn't seem to matter what integer I pass to --workers; we still see stalls and failures on downloads over the web interface.

I'm trying to download a 9.5GB ims format file.

The g_error.log looks interesting?

root@omero-prod-gen2:~# tail -f ~omero/OMERO.server/var/log/g_error.log
2016-01-31 09:23:53 [4781] [INFO] Booting worker with pid: 4781
2016-01-31 09:23:53 [4794] [INFO] Booting worker with pid: 4794
2016-01-31 09:23:53 [4798] [INFO] Booting worker with pid: 4798
2016-01-31 09:23:53 [4814] [INFO] Booting worker with pid: 4814
2016-01-31 09:23:53 [4808] [INFO] Booting worker with pid: 4808
2016-01-31 09:23:53 [4823] [INFO] Booting worker with pid: 4823
2016-01-31 09:23:53 [4827] [INFO] Booting worker with pid: 4827
2016-01-31 09:23:53 [4838] [INFO] Booting worker with pid: 4838
2016-01-31 09:23:53 [4858] [INFO] Booting worker with pid: 4858
2016-01-31 09:23:53 [4874] [INFO] Booting worker with pid: 4874
2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)
2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)
2016-01-31 09:26:01 [5314] [INFO] Booting worker with pid: 5314

I managed to download (randomly?) more than I ever have before, with 1.7GB of the file downloaded in this configuration - but it is still failing/stalling.

What could I be missing?

I even tried with 256 workers:

omero@omero-prod-gen2:~$ omero web start --workers 256 --wsgi-args '--worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'

...but the workers still seem to time out at *some* random point early on:

2016-01-31 09:29:24 [7360] [INFO] Booting worker with pid: 7360
2016-01-31 09:29:24 [7371] [INFO] Booting worker with pid: 7371
2016-01-31 09:30:14 [5433] [CRITICAL] WORKER TIMEOUT (pid:7045) <-- happened almost immediately after booting the workers.
2016-01-31 09:30:14 [5433] [CRITICAL] WORKER TIMEOUT (pid:7045)
2016-01-31 09:30:15 [8273] [INFO] Booting worker with pid: 8273

*SO THEN* I tried booting the worker processes with a very long time out:

omero web start --workers 256 --wsgi-args '-t 360 --worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'

And after a much, much longer download, 4.2GB of my 9.5GB ims file, it finally started to show signs of trouble again:

2016-01-31 09:49:32 [8394] [CRITICAL] WORKER TIMEOUT (pid:10451)
2016-01-31 09:49:32 [8394] [CRITICAL] WORKER TIMEOUT (pid:10451)
2016-01-31 09:49:33 [11503] [INFO] Booting worker with pid: 11503

And then it failed again, unfortunately.

So I made the timeout an enormous number:

omero web start --workers 256 --wsgi-args '-t 1440 --worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'

...and I can finally drag in my 9.5GB file over the OMERO web interface, without timeout failures.

Something doesn't feel quite right, does it?

-jc
