[ome-users] Stall issues/download issues still, even with gevent...
Aleksandra Tarkowska (Staff)
A.Tarkowska at dundee.ac.uk
Sun Jan 31 13:40:43 GMT 2016
Hi Jake,
128/256 workers seems like a lot. Did you have a chance to look at http://docs.gunicorn.org/en/stable/design.html#how-many-workers?
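For reference, the rule of thumb on that gunicorn page is (2 x number of cores) + 1 workers. A minimal Python sketch of the arithmetic follows; the function name suggested_workers is made up for illustration and is not part of gunicorn or OMERO.

```python
import multiprocessing


def suggested_workers(cores=None):
    """Rule of thumb from the gunicorn design notes: (2 x cores) + 1."""
    if cores is None:
        # Fall back to the number of cores on the current machine.
        cores = multiprocessing.cpu_count()
    return 2 * cores + 1


print(suggested_workers(8))  # 17
```

With the gevent worker class each worker can serve many connections concurrently (see --worker-connections), so a small number of workers should normally be enough.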
Are you using the latest libs: gunicorn 19.4.5, gevent 1.0.2 and nginx 1.9+?
In your nginx config turn buffering off:
location @proxy_to_app {
...
proxy_buffering off;
proxy_pass http://omeroweb_omero;
}
Start gunicorn in debug mode:
$ omero web start --workers 5 --worker-connections 1000 --wsgi-args ' --timeout 60 --graceful-timeout 60 --worker-class gevent --log-level=DEBUG --error-logfile=/home/omero/OMERO.server/var/log/gunicorn.log '
Make sure max_requests is set to 0:
omero config set omero.web.application_server.max_requests 0
Gunicorn log will show:
[2016-01-31 12:48:56 +0000] [31569] [INFO] Starting gunicorn 19.4.5
[2016-01-31 12:48:56 +0000] [31569] [DEBUG] Arbiter booted
[2016-01-31 12:48:56 +0000] [31569] [INFO] Listening at: http://127.0.0.1:4080 (31569)
[2016-01-31 12:48:56 +0000] [31569] [INFO] Using worker: gevent
[2016-01-31 12:48:56 +0000] [31574] [INFO] Booting worker with pid: 31574
[2016-01-31 12:48:56 +0000] [31575] [INFO] Booting worker with pid: 31575
[2016-01-31 12:48:56 +0000] [31576] [INFO] Booting worker with pid: 31576
[2016-01-31 12:48:56 +0000] [31581] [INFO] Booting worker with pid: 31581
[2016-01-31 12:48:56 +0000] [31590] [INFO] Booting worker with pid: 31590
[2016-01-31 12:48:56 +0000] [31569] [DEBUG] 5 workers
I tested that on an 8-core Xeon with 20 simultaneous downloads of a 3GB TIFF, for example:
wget --limit-rate=10M -O /data/files/30661.tiff https://server.openmicroscopy.org/omero/webgateway/archived_files/download/30661/
100%[======================================================================>] 3,087,100,535 9.76MB/s in 9m 21s
2016-01-31 13:21:10 (5.25 MB/s) - '/data/files/30661.tiff' saved [3087100535/3087100535]
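As a sanity check on the wget summary above, the average rate can be recomputed from the byte count and the elapsed time. This sketch assumes wget reports MB/s in units of 1024*1024 bytes; the helper name is hypothetical.

```python
def average_rate_mib_s(total_bytes, minutes, seconds):
    """Average transfer rate in units of 1024*1024 bytes per second."""
    elapsed = minutes * 60 + seconds
    return total_bytes / elapsed / (1024 * 1024)


# 3,087,100,535 bytes in 9m 21s, as printed by wget.
rate = average_rate_mib_s(3087100535, 9, 21)
print(round(rate, 2))  # 5.25
```

So each of the 20 downloads averaged well below the 10M rate limit passed to wget.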
There were no timeouts. OMERO.web was responding as well.
As Simon mentioned, we see problems at higher speeds: when a transfer goes above 15-20MB/s, it blocks workers. This issue is under investigation.
Based on your example, I am guessing your average speed was about 12MB/s, so it could be hitting that limit.
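One way to arrive at roughly 12MB/s from the numbers in your mail, assuming the worker was killed when the 360 second timeout expired after about 4.2GB had been transferred. That is an assumption: the logs do not show exactly when the request started. The function names are for illustration only.

```python
def average_speed_mb_s(downloaded_gb, elapsed_s):
    """Average speed in MB/s, taking 1GB as 1000MB."""
    return downloaded_gb * 1000 / elapsed_s


def seconds_needed(file_gb, speed_mb_s):
    """Time to transfer the whole file at a constant speed."""
    return file_gb * 1000 / speed_mb_s


speed = average_speed_mb_s(4.2, 360)   # 4.2GB before the 360s timeout
needed = seconds_needed(9.5, speed)    # the full 9.5GB ims file
print(round(speed, 1), round(needed))  # 11.7 814
```

Under the same assumption, the full 9.5GB file would need roughly 814 seconds at that rate, which would fit with the download only completing once the timeout was raised to 1440.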
Could you try it and let us know?
Ola
Software Engineer
Open Microscopy Environment
University of Dundee
On 31 Jan 2016, at 01:31, Jake Carroll <jake.carroll at uq.edu.au> wrote:
Hi again
Unfortunately, I'm still having issues with large downloads failing via the web interface.
I'm using a startup string such as this:
omero web start --workers 128 --wsgi-args '--worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'
And it doesn't seem to really matter what workers INT I use, we'll still see stalls and fails on download over the web interface.
I'm trying to download a 9.5GB ims format file.
The g_error.log looks interesting?
root@omero-prod-gen2:~# tail -f ~omero/OMERO.server/var/log/g_error.log
2016-01-31 09:23:53 [4781] [INFO] Booting worker with pid: 4781
2016-01-31 09:23:53 [4794] [INFO] Booting worker with pid: 4794
2016-01-31 09:23:53 [4798] [INFO] Booting worker with pid: 4798
2016-01-31 09:23:53 [4814] [INFO] Booting worker with pid: 4814
2016-01-31 09:23:53 [4808] [INFO] Booting worker with pid: 4808
2016-01-31 09:23:53 [4823] [INFO] Booting worker with pid: 4823
2016-01-31 09:23:53 [4827] [INFO] Booting worker with pid: 4827
2016-01-31 09:23:53 [4838] [INFO] Booting worker with pid: 4838
2016-01-31 09:23:53 [4858] [INFO] Booting worker with pid: 4858
2016-01-31 09:23:53 [4874] [INFO] Booting worker with pid: 4874
2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)
2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)
2016-01-31 09:26:01 [5314] [INFO] Booting worker with pid: 5314
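To see how often workers are being killed, the WORKER TIMEOUT lines in a log like the one above can be tallied per pid. This is a rough sketch: the function name is hypothetical, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches e.g. "[3852] [CRITICAL] WORKER TIMEOUT (pid:4608)".
TIMEOUT_RE = re.compile(r"\[(\d+)\] \[CRITICAL\] WORKER TIMEOUT \(pid:(\d+)\)")


def count_worker_timeouts(lines):
    """Return a Counter mapping worker pid -> number of WORKER TIMEOUT lines."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts


sample = [
    "2016-01-31 09:23:53 [4874] [INFO] Booting worker with pid: 4874",
    "2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)",
    "2016-01-31 09:26:00 [3852] [CRITICAL] WORKER TIMEOUT (pid:4608)",
    "2016-01-31 09:26:01 [5314] [INFO] Booting worker with pid: 5314",
]
print(count_worker_timeouts(sample))  # Counter({4608: 2})
```

To run it over a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.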
I managed to download (randomly?) more than I ever have before, with 1.7GB of the file downloaded in this configuration - but it is still failing/stalling.
What could I be missing?
I even tried with 256 workers:
omero@omero-prod-gen2:~$ omero web start --workers 256 --wsgi-args '--worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'
...but the workers still seem to time out at *some* random point early on:
2016-01-31 09:29:24 [7360] [INFO] Booting worker with pid: 7360
2016-01-31 09:29:24 [7371] [INFO] Booting worker with pid: 7371
2016-01-31 09:30:14 [5433] [CRITICAL] WORKER TIMEOUT (pid:7045) <-- happened almost immediately after booting the workers.
2016-01-31 09:30:14 [5433] [CRITICAL] WORKER TIMEOUT (pid:7045)
2016-01-31 09:30:15 [8273] [INFO] Booting worker with pid: 8273
*SO THEN* I tried booting the worker processes with a very long time out:
omero web start --workers 256 --wsgi-args '-t 360 --worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'
And, after a much, much longer download, 4.2GB of my 9.5GB ims file, it finally started to show signs of trouble again:
2016-01-31 09:49:32 [8394] [CRITICAL] WORKER TIMEOUT (pid:10451)
2016-01-31 09:49:32 [8394] [CRITICAL] WORKER TIMEOUT (pid:10451)
2016-01-31 09:49:33 [11503] [INFO] Booting worker with pid: 11503
And then it failed again, unfortunately.
So I made the timeout an enormous number:
omero web start --workers 256 --wsgi-args '-t 1440 --worker-class gevent --error-logfile=/home/omero/OMERO.server/var/log/g_error.log'
...and I can finally drag in my 9.5GB file over the OMERO web interface, without timeout failures.
Something doesn't feel quite right, does it?
-jc