nginx/gunicorn connection hanging for 60 seconds - django

I'm doing a HTTP POST request against a nginx->gunicorn->Django app. Response body comes back quickly, but the request doesn't finish completely for about 60 more seconds.
By "finish completely" I mean that various clients I've tried (Chrome, wget, Android app I'm building) indicate that request is still in progress, as if waiting for more data. Listening in from Wireshark I see that all data arrives quickly, then after 60 seconds ACK FIN finally comes.
The same POST request on local development server (./manage.py runserver) executes quickly. Also, it executes quickly against gunicorn directly, bypassing nginx. Also works quickly in Apache/mod_wsgi setup.
GET requests have no issues. Even other POST requests are fine. One difference I'm aware of is that this specific request returns 201 not 200.
I figure it has someting to do with Content-Length headers, closed vs keepalive connections, but don't yet know how things are supposed to work correctly.
Backend server (gunicorn) is currently closing connections, this makes sense.
Should backend server include Content-Length header, or Transfer-encoding: chunked? Or should nginx be able to cope without these, and add them as needed?
I assume connection keep-alive is good to have, and should not be disabled in nginx.
Update: Setting keepalive_timeout to 0 in nginx.conf fixes my problem. But, of course, keep-alive is gone. I'm still not sure what's the issue. Probably something in the stack (my Django app or gunicorn) doesn't implement chunked transfer correctly, and confuses clients.

It sounds like your upstream server (gunicorn) is somehow holding the connection open on that specific api call - I don't know why (depends on your code, I think), but the default proxy_read_timeout option in nginx is 60 seconds, so it sounds like this response isn't being received, for some reason.
I use a very similar setup and I don't notice any issues on POSTs generally, or any other requests.
Note that return HttpResponse(status=201) has caused me issues before - it seems Django prefers an explicitly empty body: return HttpResponse("", status=201) to work. I usually set something in the body where I'm expected to - this may be something to watch out for.

Related

libcurl: send GET requests after timeout limit is reached

Problem:
OS: Ubuntu 20.04.1 LTS
When a target URL updates its content, recently libcurl has had unexpected polling delays / timeouts anywhere between 2 and 20+ seconds between sending a GET request to the target URL and receiving any response.
I have no idea what has been causing this behaviour, and have detailed all of the strace reports, tshark results, entire libcurl C++ program, attempts to diagnose, and other terminal outputs at the following SO question, but have had no luck in diagnosing this for about four months:
libcurl: abnormal GET response delays
There seems to be something between the client server and remote server that is stopping packets from being returned, but only when the page changes its content. During this polling delay / timeout, no other requests can be sent - therefore any new data uploaded on the remote server cannot be retrieved quickly.
This issue did not exist before mid-July 2021. Given that after four months this problem still hasn't been solved, I want to attempt a workaround that will still send requests to the target when this polling delay presents itself. I won't understand what caused the polling timeouts, but hopefully I will be able to retrieve the data without delays like the program used to do.
Target URL: https://ir.eia.gov/wpsr/table4.csv
Summary questions:
Q1. Is there a timeout option with libcurl that, when exceeded, the program does not exit but instead sends another GET request to the target URL?
Q2. Since this problem only arises when the target URL makes a scheduled content update, could there be a chance the target URL changes its IP address and thus there is some delay caused by a DNS resolution server in between the client and remote side on the return leg? I am going to attempt to use a tool like pingPlotter to see if there is a delay at some specific IP address between the outbound GET request and the response.
Before any scheduled page content changes, the latency between the outbound GET request and the response is <100 ms.

Propagate http abort/close from nginx to uwsgi / Django

I have a Django application web application, and I was wondering if it was possible to have nginx propagate the abort/close to uwsgi/Django.
Basically I know that nginx is aware of the premature abort/close because it defaults to uwsgi_ignore_client_abort to "off", and you get nginx 499 errors in your nginx logs when requests are aborted/closed before the response is sent. Once uwsgi finishes processing the request it throws an "IO Error" when it goes to return the response to nginx.
Turning uwsgi_ignore_client_abort to "on" just makes nginx unaware of the abort/close, and removes the uwsgi "IO Errors" because uwsgi can still write back to nginx.
My use case is that I have an application where people page through some ajax results very quickly, and so if the quickly page through I abort the pending ajax request for the page that they skipped, this keeps the client clean and efficient. But this does nothing for the server side (uwsgi/Django) because they still have to process every single request even if nothing will be waiting for the response.
Now obviously there may be certain pages, where I don't want the request to be prematurely aborted for any reason. But I use celery for long running requests that may fall into that category.
So is this possible? uwsgi's hariakari setting makes me think that it is at some level.... just can't figure out how to do it.
My use case is that I have an application where people page through some ajax results very quickly, and so if the quickly page through I abort the pending ajax request for the page that they skipped, this keeps the client clean and efficient.
Aborting an AJAX request on the client side is done through XMLHttpRequest.abort(). If the request has not yet been sent out when abort() is called, then the request won't go out. But if the request has been sent, the server won't know that the request has been aborted. The connection won't be closed, there won't be any message sent to the server, nothing. If you want the server to know that a request is no longer needed, you basically need to come up with a way to identify requests so that when you make the initial request you get an identifier for it. Then, through another AJAX request you could tell the server that an earlier request should be cancelled. (If you search questions about abort() like this one and search for "server" you'll find explanations saying the same.)
Note that uwsgi_ignore_client_abort is something that deals with connection closures at the TCP level. That's a different thing from aborting an AJAX request. There is generally no action you can take in JavaScript that will entail closing a TCP connection. The browser optimizes the creation and destruction of connections to suit its needs. Just now, I did this:
I used lsof to check whether any process had a connection to example.com. There were none. (lsof is a *nix utility that allows listing open files. Network connections are "files" in *nix.)
I opened a page to example.com in Chrome. lsof showed the connection and the process that opened it.
Then I closed the page.
I polled with lsof to see if the connection I identified earlier was still opened. It stayed open for about one minute after I closed the page even though there was no real need to keep the connection open.
And there's no amount of fiddling with uswgi settings that will make it be aware of aborts performed through XMLHttpRequest.abort()
The use-case scenario you gave was one where users were paging fast through some results. I can see two possibilities for the description given in the question:
The user waits for a refresh before paging further. For instance, Alice is looking through an list of user names sorted alphabetically for user "Zeno" and each time a new page is shown, she sees the name is not there and pages down. In this case, there's nothing to abort because the user's action is dependent on the request having been handled first. (The user has to see the new page before making a decision.)
The user just pages down without waiting for a refresh. Alice again is looking for "Zeno" but she figures it's going to be on the last page so click, click, click she goes. In this case, you can debounce the requests made to the server. Then the next page button is pressed, increment the number of the page that should be shown to the user but don't send the request right away. Instead, you wait for a small delay after the user ceases clicking the button to then send the request with final page number and so you make one request instead of a dozen. Here is an example of a debounce performed for a DataTables search.
Now obviously there may be certain pages, where I don't want the request to be prematurely aborted for any reason.
This is precisely the problem behind taking this one way or the other.
Obviously, you may not want to continue spending system resources processing a connection that has since been aborted, e.g., an expensive search operation.
But then maybe the connection was important enough that it still has to be processed even if the client has disconnected.
E.g., the very same expensive search operation, but one that's actually not client-specific, and will be cached by nginx for all subsequent clients, too.
Or maybe an operation that modifies the state of your application — you clearly wouldn't want to have your application to have an inconsistent state!
As mentioned, the problem is with uWSGI, not with NGINX. However, you cannot have uWSGI automatically decide what was your intention, without you revealing such intention yourself to uWSGI.
And how exactly will you reveal your intention in your code? A whole bunch of programming languages don't really support multithreaded and/or asynchronous programming models, which makes it entirely non-trivial to cancel operations.
As such, there is no magic solution here. Even the concurrency-friendly programming languages like Golang are having issues around the WithCancel context — you may have to pass it around in every function call that could possibly block, making the code very ugly.
Are you already doing the above context passing in Django? If not, then the solution is ugly but very simple — any time you can clearly abort the request, check whether the client is still connected with uwsgi.is_connected(uwsgi.connection_fd()):
http://lists.unbit.it/pipermail/uwsgi/2013-February/005362.html

Use Nginx to prevent downstream timeout by sending blank lines

We have a setup where a CDN is calling Nginx which is calling a uwsgi server. Some of the requests take a lot of time for Django to handle, so we are relying on the CDN for caching. However, the CDN has a hard timeout of 30 seconds, which is unfortunately not configurable.
If we were able to send a blank line every few seconds before the request is received from the uwsgi server, it would mean that the CDN would not timeout. Is there a way to send a blank line every few seconds with Nginx until the response is received?
I see a few possibilities:
Update your Django app to work this way-- have /it/ start dribbling a response immediately.
Rework your design to avoid user's periodically having requests that take more than 30 seconds to respond. Use a frequent cron job to prime the cache on your backend server, so when the CDN asks for assets, they are already ready. Web servers can be configured to check for a static ".gz" versions of URLs, which might be a good fit here.
Configure Nginx to cache the requests. The first time the CDN requests the slow URL, it may timeout, but Nginx ought to eventually cache the result anyway. The next time the CDN asks, Nginx should have the cached response ready.

Nginx uwsgi (104: Connection reset by peer) while reading response header from upstream

Environment is Nginx + uwsgi.
Getting a 502 bad gateway error from Nginx on certain GET requests. Seems to be related to the length of the URL. In our particular case, it was a long list of GET parameters. Shorten the GET parameters and no 502 error.
From the nginx/error.log
[error] 22113#0: *1 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.1.100, server: server.domain.com, request: "GET <long_url_here>"
No information in the uwsgi error log.
After spending a lot of time on this, I finally figured it out. There are many references to Nginx and connection reset by peer. Most of them seemed to be related to PHP. I couldn't find an answer that was specific to Nginx and uwsgi.
I finally found a reference to fastcgi and a 502 bad gateway error (https://support.plesk.com/hc/en-us/articles/213903705). That lead me to look for a buffer size limit in the uwsgi configuration which exists as buffer-size. The default value is 4096. From the documentation, it says:
If you plan to receive big requests with lots of headers you can increase this value up to 64k (65535).
There are many ways to configure uwsgi, I happen to use a .ini file. So in my .ini file I tried:
buffer-size=65535
This fixed the problem. You can adjust that to taste. Maybe start with the max and work back until you have an acceptable value, or just leave it at the max.
This was frustrating to track down because there was no error on the uwsgi side of things.
I was getting the same nginx error and also there was no information in uwsgi log. The problem was that in some cases the application was not consuming the whole request body as advised in http://uwsgi-docs.readthedocs.org/en/latest/ThingsToKnow.html:
If an HTTP request has a body (like a POST request generated by a form), you have to read (consume) it in your application. If you do not do this, the communication socket with your webserver may be clobbered. If you are lazy you can use the post-buffering option that will automatically read data for you. For Rack applications this is automatically enabled.
Of course, this is not a problem in your case, but it may be useful for others who are getting the same nginx error.
We just need to increase the attribute "output_buffering" value, in php.ini, to a greater value like 65535 or another appropriate value.
When we receive a message like (104: Connection reset by peer) while reading response header from upstream, most often, we could blame the upstream side of this kind of error.
As described, the connection was reset by the upstream peer, not by nginx itself. Nginx as a client can barely do anything to make it right.
I'm suspecting if modifying buffer-size will do the magic. Basically the command changes the buffer size where response headers are cached. This would take effect when the response header is too big, of which case we receive a message saying upstream sent too big header while reading response header from upstream, and that is totally different thing from connection reset by peer.
Since this kind of error is trigger randomly, I would suggest you check whether nginx uses keepalive when talking to upstreams. If this was the case, the connection might be reset by upstream server when the idle timed out whereas nginx had no idea that the connection had been dropped, hence forwarding the request using the same connection.
There's no elegant solution to fix it as far as I know. You could do retry or set a keepalive_timeout value to the upstream connection pool in nginx to avoid the problem.
referencing:
Apache HttpClient Interim Error: NoHttpResponseException
http://tengine.taobao.org/document/http_upstream_keepalive_timeout.html
--post-buffering 32768
worked for me as suggested (and discouraged) here NGINX + uWSGI Connection Reset by Peer
I don't have time to investigate it further at the moment (quick prototyping mode :), but since it took me a lot of time to find this hack, it might be worth posting here.
This can happen if your request/response headers are quite large.
To fix it, in /etc/uwsgi/apps-available/your-app.ini add buffer-size=65535
It doesn't come up occasionally.
I guess the most possible reason of that is the size of your php-fpm.log is oversize. Try to change your log_level to upper level in php-fpm.conf and clear the logs.
Anyway, it works for me.
You need to re-install PHP:
apt-get install --reinstall php5-fpm

JMeter, Jetty Performance test and Keep-Alive issues

Ok, so I created a very simple WAR which serves a simple Hello World .jsp. With all the HTML it's about 200bytes.
Deployed it on my server running Jetty 7.5.x jdk 6u27
On my client computer create simple JMeter test plan with: Thread Group, HTTP Request, Response Assertion, Summary Report Client also running jdk6u27
I set up the thread group to 5 threads running for 60secs and I got 5800 requests/sec
Then I setup 10 threads and got 6800 requests/sec
The moment I disable Keep-Alive in JMeter on the HTTP Request sampler. I seem to get lots of big pauses on the client side I suppose, it doesn't seem the server is receiving anything. I get less pauses at 5 threads or barely none but at 10 threads it hangs pretty much all the time.
What does this mean exactly?
Keep in mind I'm technically creating a REST service and I was getting the same issue, so I though maybe I was doing something funky in my service, till I figured out it's a Keep-Alive issue as it's doing it pretty much on a staic web app. So in reality I will have 1 client request 1 server response. The client will not be keeping the connection open.
My guess is that since Keep-Alive is what allows HTTP Connection (and thereby, socket) reuse, you are running out of available ephemeral port numbers -- there are only 64k port numbers, and since connections must have unique client/server port combos (and server port is fixed), you can quickly go through those. Now, if ports were reusable as soon as connection was closed by one side, it would not matter: however, as per TCP spec, both sides MUST wait for configurable amount of time (default: 2 minutes) until reuse is considered safe.
For more details you can read a TCP book (like "Stevens book"); above is a simplification.