Django on Apache - Prevent 504 Gateway Timeout

I have a Django server running on Apache via mod_wsgi. I have a massive background task, called via an API call, that searches emails and generally takes a few hours to complete.
To facilitate debugging - since exceptions and everything else happen in the background - I created an API call that runs the task blocking, so the browser actually blocks for those hours and then receives the results.
On localhost this is fine. However, in the real Apache environment, I get a 504 Gateway Timeout error after about 30 minutes.
How do I change the settings so that Apache allows - just for this debug phase - the HTTP request to block for a few hours without returning a 504 Gateway Timeout?
I'm assuming this can be changed in the Apache configuration.

You should not be doing long-running tasks within Apache processes, nor even waiting for them. Use a background task queueing system such as Celery to run them. Have the web request return as soon as the task is queued, and implement some sort of polling mechanism as necessary to see whether the job is complete and the results can be obtained.
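A minimal sketch of that pattern, assuming Celery is already configured for the project; the task, the run_email_search helper, and the view names are made up for illustration:

    # tasks.py - hypothetical task module
    from celery import shared_task

    @shared_task
    def search_emails_task(query):
        # placeholder for the long-running email search
        return run_email_search(query)   # hypothetical helper

    # views.py - enqueue the task and poll for its result instead of blocking
    from celery.result import AsyncResult
    from django.http import JsonResponse

    def start_search(request):
        task = search_emails_task.delay(request.GET.get("q", ""))
        return JsonResponse({"task_id": task.id})     # returns immediately

    def search_status(request, task_id):
        result = AsyncResult(task_id)
        if result.ready():
            return JsonResponse({"state": result.state, "results": result.get()})
        return JsonResponse({"state": result.state})  # still running

The browser (or a small debug script) then polls the status endpoint instead of holding a connection open for hours.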
Also, are you sure the 504 isn't coming from some front-end proxy (explicit or transparent) or load balancer? There is no default timeout in Apache of 30 minutes.

Related

504 timeout on AWS with nginx and gunicorn

I am running a Python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, and the EC2 instance is behind an application load balancer. Occasionally I get a 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH, which I use all the time otherwise). I then need to restart everything, which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image). In that case the gunicorn worker times out (I see the timeout message in the logs), the 504 error appears, and the instance becomes unreachable. I set gunicorn to time out after 5 minutes (300 seconds), but it falls over sooner than that. There is nothing really useful in the CloudWatch logs.
I am looking for ways to resolve this for all current and future cases. That is, if the site gets overloaded, I want it to return an error message instead of becoming completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here to find the cause, but I think it is OOM (out of memory), mainly because you have to restart the instance even to log in over SSH.
Nginx uses an event-driven approach to handling requests, so a single nginx worker can handle thousands of requests simultaneously. Gunicorn, on the other hand, uses sync workers by default, which means a request stays with a worker until it has been fully processed.
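If the sync workers turn out to be part of the problem, a rough gunicorn.conf.py sketch like the one below keeps the worker count and request lifetime bounded (the numbers are illustrative guesses, not recommendations, and gevent has to be installed):

    # gunicorn.conf.py - illustrative values only, tune for your instance size
    worker_class = "gevent"        # event-driven workers instead of the default sync
    workers = 3                    # keep this modest; every worker holds its own memory
    worker_connections = 1000      # concurrent connections per gevent worker
    timeout = 120                  # kill workers stuck longer than this (seconds)
    max_requests = 500             # recycle workers periodically to limit memory growth
    max_requests_jitter = 50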
When you send it a large request, the machine tries to process that request until memory runs out, and mostly this will not be detected by any service running inside the machine. Try monitoring memory with a monitoring tool in AWS, or just SSH in and run htop before calling the API.
In most cases with Django/gunicorn the culprit is OOM.
Edit:
AFAIK you cannot catch an OOM; the only thing you can do is look at the aftermath, i.e. after the system restarts, check /var/log/syslog. As I said, monitor memory with the AWS monitoring tools (I don't have much experience with AWS).
And regarding the solution:
First, increase the memory of your EC2 instance until you no longer get the error, to see how big the problem is.
Then optimise your application by profiling which part is actually taking that much memory. I haven't used any memory profiler myself, so maybe you can tell me afterwards which one is better.
Beyond that, the only thing you can do is optimise your application: see the common gotchas, best practices, query optimisations, etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python
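For the profiling step, a minimal sketch using Python's built-in tracemalloc; the process_image function here is just a stand-in for whatever code path you suspect:

    import tracemalloc

    def process_image(data):
        # stand-in for the real image-processing code path you suspect
        return [bytes(data) * 100 for _ in range(1000)]

    tracemalloc.start()
    result = process_image(b"x" * 1024)
    snapshot = tracemalloc.take_snapshot()

    # print the top 10 allocation sites by source line
    for stat in snapshot.statistics("lineno")[:10]:
        print(stat)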

Performance issues AWS - Docker - Django

I have an application deployed with docker on an EC2 instance: t3a.xlarge.
My application uses 7 different containers (cf. the docker-ps.png image):
A Django app, serving as an API (using Python 3.6)
An Angular application (using Angular 2+)
A memcached server
A certbot (using Let's Encrypt to automatically renew my SSL certificates)
An Nginx, used as a reverse proxy to serve my Angular application and my Django API
A Postgres database
A pgAdmin in order to manage my database
The issues happen when we send a push notification to our users via Firebase (around 42,000 users). The API stops responding for a certain amount of time: from 1 to 6 minutes.
The Django API uses the Gunicorn web server (https://gunicorn.org/) with this configuration:
gunicorn xxxx_api.wsgi -b 0.0.0.0:80 --max-requests 500 --max-requests-jitter 50 --enable-stdio-inheritance -k gevent --workers=16 -t 80
The server and the containers never crash. When I check the metrics, we never use more than 60% of the CPU. Here is a screenshot of some metrics from when the notification was sent: https://ibb.co/Mc0v7R1
Is it because we are using more bandwidth than our instance allows? Or should I use another AWS service?
Memory utilisation metrics are not captured for EC2 instances, since OS-level metrics are not available to AWS. You can collect custom metrics yourself.
Reference:
https://awscloudengineer.com/create-custom-cloudwatch-metrics-centos-7/
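If you prefer scripting it over installing the CloudWatch agent, a rough sketch using psutil and boto3 (both assumed to be installed; the namespace, metric name and region are arbitrary) could push a memory metric from the instance, e.g. from a cron job run every minute:

    import boto3
    import psutil

    # adjust the region to wherever the instance lives
    cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

    memory_percent = psutil.virtual_memory().percent

    cloudwatch.put_metric_data(
        Namespace="Custom/EC2",              # arbitrary custom namespace
        MetricData=[{
            "MetricName": "MemoryUtilization",
            "Value": memory_percent,
            "Unit": "Percent",
        }],
    )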
I think your problem is about the design. You could try sending your push notifications through an async queue, using something like SNS & SQS (the AWS way) or Celery & Redis (the traditional way).
If you choose the traditional way, this post could help you:
https://blog.devartis.com/sending-real-time-push-notifications-with-django-celery-and-redis-829c7f2a714f
I think it's because of queued HTTP requests to Firebase. I believe you are sending 42,000 Firebase requests in a loop. I/O calls are blocking in nature: if you are running the Django app single-threaded under gunicorn, those 42,000 HTTP calls will block new calls until they are finished. They will stay in the queue as long as the connection stays alive or the requests are within nginx's threshold. I don't think 42,000 push notifications will exhaust memory or processing unless the payload is very large.
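Whichever queueing option you pick, the idea is the same: move the 42,000 sends out of the request/response cycle and into batched background tasks. A rough sketch with Celery, where send_push is a hypothetical wrapper around your Firebase client and the batch size is arbitrary:

    # tasks.py - illustrative sketch, assumes Celery is already configured
    from celery import shared_task

    BATCH_SIZE = 500

    @shared_task
    def send_push_batch(tokens, payload):
        for token in tokens:
            send_push(token, payload)    # hypothetical wrapper around the Firebase call

    def queue_notifications(all_tokens, payload):
        # called from the API view: enqueue small batches and return immediately
        for i in range(0, len(all_tokens), BATCH_SIZE):
            send_push_batch.delay(all_tokens[i:i + BATCH_SIZE], payload)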

Use Nginx to prevent downstream timeout by sending blank lines

We have a setup where a CDN is calling Nginx which is calling a uwsgi server. Some of the requests take a lot of time for Django to handle, so we are relying on the CDN for caching. However, the CDN has a hard timeout of 30 seconds, which is unfortunately not configurable.
If we were able to send a blank line every few seconds until the response is received from the uwsgi server, the CDN would not time out. Is there a way to send a blank line every few seconds with Nginx until the response is received?
I see a few possibilities:
Update your Django app to work this way: have it start dribbling a response immediately (see the sketch after this list).
Rework your design to avoid users periodically having requests that take more than 30 seconds to respond. Use a frequent cron job to prime the cache on your backend server, so that when the CDN asks for assets they are already ready. Web servers can be configured to check for static ".gz" versions of URLs, which might be a good fit here.
Configure Nginx to cache the requests. The first time the CDN requests the slow URL it may time out, but Nginx ought to eventually cache the result anyway. The next time the CDN asks, Nginx should have the cached response ready.
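A rough sketch of the first option, using Django's StreamingHttpResponse to emit a blank line as a keep-alive while the real work runs in a thread; compute_report is a placeholder for your slow view logic:

    import threading
    import time

    from django.http import StreamingHttpResponse

    def compute_report():
        # placeholder for the slow work that blows past the CDN's 30-second limit
        time.sleep(90)
        return "the real response body\n"

    def slow_view(request):
        def stream():
            result = {}
            worker = threading.Thread(target=lambda: result.update(body=compute_report()))
            worker.start()
            while worker.is_alive():
                yield "\n"            # blank line every few seconds as a keep-alive
                time.sleep(5)
            worker.join()
            yield result["body"]
        return StreamingHttpResponse(stream(), content_type="text/plain")

For the keep-alive bytes to actually reach the CDN you would also have to disable response buffering in uwsgi/Nginx, and this only works for response formats that tolerate leading whitespace.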

jetty 404 error page on hot deployment

I am currently using Jetty 9.1.4 on Windows.
When I deploy the war file without the hot deployment config and then restart the Jetty service, all client connections to my Jetty server wait during that 5-10 second startup for the server to finish loading, and then clients are able to view the contents.
With the hot deployment config on, the default Jetty 404 error page shows during that 5-10 second loading interval.
Is there any way I can make hot deployment behave the same as a complete restart - client connections wait instead of seeing the 404 error page?
Unfortunately this does not seem to be possible currently after talking with the Jetty developers on IRC #jetty.
One solution I will try is two Jetty instances with a load-balancing reverse proxy (e.g. nginx) in front of them, taking one instance down at a time for deployment.
Of course this immediately leads to new requirements (session persistence/sharing) which need to be handled. So in conclusion: much work to do in the Java world for zero-downtime deployments.
Edit: I will try this; it seems like a simple enough solution: http://rafaelsteil.com/zero-downtime-deploy-script-for-jetty/ GitHub: https://github.com/rafaelsteil/jetty-zero-downtime-deploy

Asynchronous celery task is blocking main application (maybe)

I have a django application running behind varnish and nginx.
There is a periodic task running every two minutes, accessing a locally running jsonrpc daemon and updating a django model with the result.
Sometimes the Django app stops responding, ending in an nginx gateway failure message. Looking through the logs, it seems that when this happens the backend task accessing the jsonrpc daemon is also timing out.
The task itself is pretty simple: A value is requested from jsonrpc daemon and saved in a django model, either updating an existing entry or creating a new one. I don't think that any database deadlock is involved here.
I am a bit lost on how to track this down. To start, I don't know whether the timeout of the task is causing the overall site timeout OR some other problem is causing BOTH timeouts. After all, a timeout in the asynchronous task should not have any influence on the website response?