uwsgi + nginx: status 502 on heavy load - django

I have this nginx + uwsgi + django application which works pretty well.
But when load on server gets a bit heavier, I start getting these 502 errors on random requests.
Here's the relevant section of uwsgi.ini:
master = true
#more processes, more computing power
processes = 32
threads = 3
post-buffering = 1
# this one must match net.core.somaxconn
listen = 8000
I have this huge number of processes to handle larger numbers of simultaneous requests, and I even added some threads on top of that, but I must admit I expected the listen directive to take care of that. In reality, the 502 errors seem to start the moment I exceed 32 * 3 simultaneous requests.
The way I understood it, listen should allow up to 8000 connections (in this case) to queue up and wait to be served, but it doesn't seem to have any real effect.
The uwsgi log clearly confirms the setting is active:
your server socket listen backlog is limited to 8000 connections
But I seem to be misunderstanding the effect of the setting.
Anyway, I'd just like to solve the 502s, so I'm open to suggestions in any direction.
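For reference, the kernel caps the accept backlog at net.core.somaxconn, so listen = 8000 only takes effect if that sysctl is at least as high; a quick way to check and raise it on a Linux box (values illustrative):
# check the current kernel cap on the accept backlog
sysctl net.core.somaxconn
# raise it so that listen = 8000 can actually take effect
sudo sysctl -w net.core.somaxconn=8000
# persist across reboots
echo "net.core.somaxconn = 8000" | sudo tee -a /etc/sysctl.conf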

I had a similar issue where, under load, I started to get 502s back from uWSGI.
Long story short: check that you are on HTTP/1.1, that you are using --http11-socket, and that you are on uWSGI 2.0.16 or later.
HTTP/1.1 uses persistent connections by default. I found that 502s occur when the listen queue grows and connections start getting dropped. Have a look at my article for the whole story:
https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html
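In case it helps, here is roughly what that setup can look like; this is only a sketch, and the addresses, upstream name and worker counts are made up for illustration:
# uwsgi.ini
http11-socket = 127.0.0.1:8001   # speak keep-alive HTTP/1.1 to nginx (needs uWSGI >= 2.0.16)
processes = 8
threads = 2
listen = 4096
# nginx
upstream app {
    server 127.0.0.1:8001;
    keepalive 32;                    # keep idle connections to uWSGI open
}
server {
    location / {
        proxy_pass http://app;
        proxy_http_version 1.1;          # talk HTTP/1.1 to the upstream
        proxy_set_header Connection "";  # let the upstream connection persist
    }
}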

Related

Recommended settings for uwsgi

I have a mysql + django + uwsgi + nginx application, and I recently had some issues with uwsgi's default configuration, so I want to reconfigure it, but I have no idea what the recommended values are.
Another problem is that I couldn't find the default settings that uwsgi uses and that makes debugging really hard.
Using the default configuration, the site was too slow under real traffic (too many requests stuck waiting for the uwsgi socket). So I used a configuration from some tutorial and it had cpu-affinity=1 and processes=4 which fixed the issue. The configuration also had limit-as=512 and now the app gets MemoryErrors so I guess 512MB is not enough.
My questions are:
How can I tell what the recommended settings are? I don't need it to be super perfect, just to handle the traffic in a decent way and not crash from memory errors, etc. Specifically, the recommended value for limit-as is what I need most right now.
What are the default values of uwsgi's settings?
Thanks!
We usually run quite small applications... rarely more than 2000 requests per minute. But anyway, it is hard to compare different applications. This is what we use in production:
Recommendations from the documentation:
harakiri = 20 # respawn processes taking more than 20 seconds
limit-as = 256 # limit the project to 256 MB
max-requests = 5000 # respawn processes after serving 5000 requests
daemonize = /var/log/uwsgi/yourproject.log # background the process & log
uwsgi_conf.yml
processes: 4
threads: 4
# This part might be important too: it limits the log file to 200 MB and
# rotates it once
log-maxsize: 200000000
log-backupname: /var/log/uwsgi/yourproject_backup.log
We use the following project for deployment and configuration of our django apps (no documentation here, sorry... we just use it internally):
https://github.com/iterativ/deployit/blob/ubuntu1604/deployit/fabrichelper/fabric_templates/uwsgi.yaml
How can you tell whether you configured it correctly? Since it depends so much on your application, I would recommend using a monitoring tool such as newrelic.com and analysing the results.
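Putting those pieces together, a starting-point uwsgi.ini might look like the sketch below; every value (especially limit-as) is only illustrative and needs to be tuned against your own app's memory footprint and traffic:
[uwsgi]
master = true
processes = 4        # roughly 1-2 per CPU core is a common starting point
threads = 4
harakiri = 20        # respawn workers stuck for more than 20 seconds
max-requests = 5000  # respawn workers after 5000 requests (guards against slow leaks)
limit-as = 512       # address-space cap in MB; raise it if you hit MemoryError
listen = 1024        # accept backlog; must not exceed net.core.somaxconn
daemonize = /var/log/uwsgi/yourproject.log
log-maxsize = 200000000   # rotate the log at ~200 MB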

How to solve AWS ELB/EC2 HTTP 503 with timeout settings?

I'm getting intermittent but regular 503 errors ("Service Unavailable: Back-end server is at capacity") from a site consisting of 2 t2.medium instances behind an ELB. None are under particularly heavy load and all monitoring seems normal.
The AWS docs here:
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html
say that a potential cause is mismatched timeout settings between the ELB and EC2s: "set the keep-alive timeout to greater than or equal to the idle timeout settings of your load balancer"
Apache conf on the EC2s has:
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5
Idle timeout on the load balancer is 60 seconds.
This would seem, then, to be a cause, but I'm unsure about the fix. Increasing the Apache KeepAliveTimeout isn't - I understand - normally recommended, and I'm equally unsure about the effect that reducing the idle timeout on the ELB will have on site performance.
What's the recommended approach? How can I get an idea of what the ideal settings are for my setup and the sort of traffic level (currently about 30-50 requests/min) it deals with?
I would lower the idle timeout in the ELB. Clients will need to open new connections more often, but that's only slightly slower than reusing a keep-alive connection.
Raising the keep-alive timeout to 60 in Apache would also fix the 503s, but you need to be careful not to run out of connections or memory, especially with the prefork MPM, because a lot more slots will sit in keep-alive. Use the worker MPM (or the event MPM if you're not scared of the "This MPM is experimental" warning), and make sure MaxClients is high enough to handle all requests but low enough not to run out of memory.
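For reference, a sketch of what that could look like with the event MPM; the numbers are purely illustrative and depend on your instance's RAM and per-thread footprint:
# keep connections open at least as long as the ELB's 60 s idle timeout
KeepAlive On
KeepAliveTimeout 65
MaxKeepAliveRequests 100

<IfModule mpm_event_module>
    StartServers           2
    ServerLimit            8
    ThreadsPerChild        25
    MaxRequestWorkers      200     # ServerLimit * ThreadsPerChild (called MaxClients before 2.4)
    MaxConnectionsPerChild 10000
</IfModule>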

why uwsgi workers idle but nginx showed a lot of timeouts?

Stack: nginx, uwsgi, django
uwsgitop and top both showed the uwsgi workers as idle, while the nginx error log said the upstream timed out.
I thought some requests needed a lot of resources, such as waiting for the db or a cache, while the others did not. But after checking the timed-out requests, most of them were not resource-hungry; every kind of request had timed out.
So why didn't nginx send requests to the idle workers if the others were really busy? Why did the uwsgi master keep some workers busy and the others idle?
I'd like to answer my own question.
Change the kernel parameter net.ipv4.ip_conntrack_max from 65560 to 6556000.
Here is the full story of how we found the answer:
Users said: slow, slow, slow.
nginx was flooded with "upstream connection timed out" errors.
I checked the uwsgi logs, found some errors and fixed them; then found more and fixed more, and this loop lasted for days. Until yesterday I thought it had nothing to do with uwsgi, memcached, the db, redis, or anything else on the backend, because the uwsgi workers were idle.
So I thought nginx must have something wrong: reload, restart, check connections, workers, proxy_read_timeout, etc. No luck.
I checked ulimit -n, which reported 1024, the default. I have 8 nginx workers, so connections could reach 1024 * 8; I thought that should be OK, since nginx never complained about too many open files. Anyway, I raised it to 4096. No luck.
I then checked the number of connections and their states, and the problem appeared: the upstream connections were almost all stuck in SYN_SENT until they timed out, with only 2 or 3 of 300 connections in ESTABLISHED state. We wanted to know why, and one of my friends told me to use tcpdump, the magic tool I had never dared to try.
That led us to syslog, where we found the error, and we finally resolved the problem by raising the conntrack limit as above.
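If you suspect the same thing, here is a sketch of the check and the fix; note that on newer kernels the sysctl is called net.netfilter.nf_conntrack_max rather than net.ipv4.ip_conntrack_max:
# see how close the connection-tracking table is to its limit
sysctl net.netfilter.nf_conntrack_max
cat /proc/sys/net/netfilter/nf_conntrack_count
# raise the limit immediately...
sudo sysctl -w net.netfilter.nf_conntrack_max=6556000
# ...and persist it across reboots
echo "net.netfilter.nf_conntrack_max = 6556000" | sudo tee -a /etc/sysctl.conf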
I had a similar issue where my listen queue was growing and the RPS was low despite all the workers being idle.
samuel found one of the causes, but there are a few more potential culprits for this behaviour (a sketch of where each setting lives follows the list):
net.core.somaxconn not high enough
uwsgi --listen not high enough
nginx worker processes too low
nginx worker connections too low
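A rough sketch of the corresponding settings; the values are only placeholders to show where each one lives:
# sysctl (persist in /etc/sysctl.conf): kernel accept-queue cap
net.core.somaxconn = 4096
# uwsgi.ini: backlog must not exceed somaxconn
listen = 4096
# nginx.conf
worker_processes auto;
events {
    worker_connections 4096;
}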
If none of these help, check in your logs that inbound requests to uWSGI are HTTP/1.1 and not HTTP/1.0, and then use --http11-socket.
Here are some of my findings when I wrestled with this issue: https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html
The nginx tuning page also has some other configurations that may or may not be useful in solving this issue:
https://www.nginx.com/blog/tuning-nginx/

nginx/gunicorn connection hanging for 60 seconds

I'm doing an HTTP POST request against an nginx->gunicorn->Django app. The response body comes back quickly, but the request doesn't finish completely for about 60 more seconds.
By "finish completely" I mean that various clients I've tried (Chrome, wget, an Android app I'm building) indicate that the request is still in progress, as if waiting for more data. Listening in with Wireshark, I see that all data arrives quickly, and then after 60 seconds the closing FIN/ACK finally comes.
The same POST request on local development server (./manage.py runserver) executes quickly. Also, it executes quickly against gunicorn directly, bypassing nginx. Also works quickly in Apache/mod_wsgi setup.
GET requests have no issues. Even other POST requests are fine. One difference I'm aware of is that this specific request returns 201 not 200.
I figure it has something to do with Content-Length headers and closed vs keep-alive connections, but I don't yet know how things are supposed to work correctly.
The backend server (gunicorn) is currently closing connections, which makes sense.
Should backend server include Content-Length header, or Transfer-encoding: chunked? Or should nginx be able to cope without these, and add them as needed?
I assume connection keep-alive is good to have, and should not be disabled in nginx.
Update: Setting keepalive_timeout to 0 in nginx.conf fixes my problem, but of course keep-alive is gone. I'm still not sure what the issue is. Probably something in the stack (my Django app or gunicorn) doesn't implement chunked transfer correctly and confuses clients.
It sounds like your upstream server (gunicorn) is somehow holding the connection open on that specific API call. I don't know why (it depends on your code, I think), but the default proxy_read_timeout in nginx is 60 seconds, so it sounds like the end of this response isn't being received for some reason.
I use a very similar setup and I don't notice any issues on POSTs generally, or any other requests.
Note that return HttpResponse(status=201) has caused me issues before; it seems Django prefers an explicitly empty body, i.e. return HttpResponse("", status=201), to work. I usually set something in the body where I'm expected to, so this may be something to watch out for.
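For illustration, a minimal sketch of that pattern (the view name and the 405 fallback are made up):
# views.py
from django.http import HttpResponse

def create_thing(request):
    if request.method == "POST":
        # ... create the object here ...
        # returning an explicitly empty body (rather than omitting it)
        # is what avoided the hanging 201 response for me
        return HttpResponse("", status=201)
    return HttpResponse(status=405)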

JMeter, Jetty Performance test and Keep-Alive issues

Ok, so I created a very simple WAR which serves a simple Hello World .jsp. With all the HTML it's about 200 bytes.
Deployed it on my server running Jetty 7.5.x and JDK 6u27.
On my client computer I created a simple JMeter test plan with a Thread Group, HTTP Request, Response Assertion, and Summary Report. The client is also running JDK 6u27.
I set up the thread group with 5 threads running for 60 secs and got 5800 requests/sec.
Then I setup 10 threads and got 6800 requests/sec
The moment I disable Keep-Alive in JMeter on the HTTP Request sampler, I get lots of big pauses, on the client side I suppose, and it doesn't seem like the server is receiving anything. I get fewer pauses (or barely any) at 5 threads, but at 10 threads it hangs pretty much all the time.
What does this mean exactly?
Keep in mind I'm technically creating a REST service, and I was getting the same issue there, so I thought maybe I was doing something funky in my service, until I figured out it's a Keep-Alive issue, since it happens even on a static web app. So in reality I will have one client request and one server response; the client will not be keeping the connection open.
My guess is that since Keep-Alive is what allows HTTP connection (and thereby socket) reuse, you are running out of available ephemeral port numbers: there are only 64k port numbers, and since connections must have unique client/server port combos (and the server port is fixed), you can quickly go through them. Now, if ports were reusable as soon as a connection was closed by one side, it would not matter; however, per the TCP spec, both sides MUST wait a configurable amount of time (default: 2 minutes) before reuse is considered safe.
For more details you can read a TCP book (like the "Stevens book"); the above is a simplification.
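If you do need to load-test without keep-alive, the usual client-side mitigations on Linux are to widen the ephemeral port range and allow faster reuse of sockets in TIME_WAIT; a sketch (values illustrative, and tcp_tw_reuse only affects outgoing connections):
# give the JMeter client a wider ephemeral port range
sudo sysctl -w net.ipv4.ip_local_port_range="10240 65535"
# allow outgoing connections to reuse sockets still in TIME_WAIT
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
# shorten the FIN_WAIT_2 timeout a bit (this does not change TIME_WAIT itself)
sudo sysctl -w net.ipv4.tcp_fin_timeout=30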