Huge time latency between nginx and upstream django backdend - django

So we have a setup of nginx ingress controller as reverse proxy for a django based backend app in production (GKE k8s cluster). We have used opentelemetry to trace this entire stack(Signoz being the actual tool). One of our most critical api is validate-cart.
And we have observed that this api sometimes take a lotta time, like 10-20 seconds and even more. But if we look at the trace of one of such request in Signoz, the actual backend takes very less time like 100ms but the total trace starting from nginx shows 29+ seconds. As you can see from the screen shot attached.
And looking at the p99 latency, the nginx service has way bigger spikes than the order. This graph is populated for the same validate-cart api.
Have been banging my head around this for quite some time and i am still stuck.
I am assuimg there might be some case of request being queued at either nginx or django layer. But i am trusting otel libraries that are used to trace django, to start the trace the moment it hit django layer and since there isn't a big latency at django layer, issue might be at nginx layer. I have traced nginx using the third party module open-tracing since ingress-nginx doesn't yet support opentelemetry where more tracing information about nginx layer is provided.
Nginx opentracing

Related

504 Gateway Time-out from google cloud platform, but only sometimes

I'm hosting a CouchBase single node cluster in GCP and a flask backend OpenShift cluster which supports angular frontend. The problem is, when a post is called in my flask by my angular, it is taking too much time to get connected the VM (couchbase) and hence flask has to return a "504 Gateway Time-out". But this happens only sometimes. Sometimes it just works very well with proper speed. Not able to troubleshoot. The total data size is less than 100M, and everything is 100% memory resident in Couchbase. So I guess this is not a problem with Couchbase. Just the connection latency to GCP.
My guess is that the first time your flask backend is trying to connect for the first time to your VM, it's taking more than usual as it needs to establish the connection, authenticate and possibly do other things depending on your use-case.
This is a common problem when hosting your app on App Engine or something similar and the solution there is to use "warm-up requests". This basically spins up the whole connection (and in app engine case the instance) and makes a test connection just so when the desired connection comes, everything is already set up.
So I suggest that you check how warm-up requests work and configure something similar between your flask and VM. So basically a route in flask with the only purpose of establishing a test connections with a test package. This way your next connection will be up to speed with no 504 errors.
try to clear to cache of load balancer in GCP console
I already faced same kind of issue and resolved it using above technique

Intermittent 502 gateway errors with AWS ALB in front of ECS services running express / nginx

Backgound:
We are running a single page application being served via nginx with a node js (v12.10) backend running express. It runs as containers via ECS and currently we are running three t3a mediums as our container instances with the api and web services each running 6 replicas across these. We use an ALB to handle our load balancing / routing of requests. We run three subnets across 3 AZ's with the load balancer associated with all three and the instances spread across the 3 AZ's as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both front and back end. I have downloaded the ALB access logs and the interesting thing about all of these requests is that they all show the following.
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
Now I know that some people have had issues like this with keepAlive times that were shorter on the server side than on the ALB side, therefore connections were being forceably closed that the ALB then tries to reuse (which is in line with the guidelines for troubleshooting on AWS). However when looking at the keepAlive times for our back end they are set higher than our ALB currently by double. Also the requests themselves can be replayed via chrome dev tools and they succeed (im not sure if this is a valid way to check a malformed request, it seemed reasonable).
I am very new to this area and if anyone has some suggestions as to where to look or what sort of tests to run that might help me pinpoint this issue it would be greatly appreciated. I have run some load tests on certain endpoints and duplicated the 502 errors, however the errors under heavy load differ from the intermittent ones I have seen on our logs in that the target_processing_time is quite high so to my mind this is another issue altogether. At this stage I would like to understand the errors that show a target_processing_time of basically zero to start with.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (caused due to a behavior change in NodeJS 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR is you need to set the nodejs http.Server keepAliveTimeout (which is in ms) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is also something called an http-keepalive which sets an http header, which has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in nodejs where setting the keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may need to also set headersTimeout as well).

504 gateway timeout for any requests to Nginx with lot of free resources

We have been maintaining a project internally which has both web and mobile application platform. The backend of the project is developed in Django 1.9 (Python 3.4) and deployed in AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use Redis based cache server to serve resource intensive heavy queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (by following the 2*n+1 rule). Load wise, there are like 20-30 mobile users making requests in every minute and there are 5-10 users checking the web panel every hour. So I would say, not very much load.
Now this setup works alright for 80% days. But when something goes wrong (for example, we detect a bug in the live system and we had to switch off the server for maintenance for few hours. In the mean time, the mobile apps have a queue of requests ready in their app. So when we make the backend live, a lot of users hit the system at the same time.), the server stops behaving normally and started responding with 504 gateway timeout error.
Surprisingly every time this happened, we found the server resources (CPU, Memory) to be free by 70-80% and the connection pool in the databases are mostly free.
Any idea where the problem is? How to debug? If you have already faced a similar issue, please share the fix.
Thank you,

Optimizing nginx requests per second

I'm trying to optimize a single core 1GB ram Digital Ocean VPS to handle more requests per second. After some tweaking (workers/gzip etc.) it serves about 15 requests per second. I don't have anything to compare it with but I think this number can be higher.
The stack works like this:
VPS -> Docker container -> nginx (ssl) -> Varnish -> nginx -> uwsgi (Django)
I'm aware of the fact that this is a long chain and that Docker might cause some overhead. However, almost all requests can be handled by Varnish.
These are my tests results:
ab -kc 100 -n 1000 https://mydomain | grep 'Requests per second'
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Requests per second: 18.87 [#/sec] (mean)
I have actually 3 questions:
Am I correct that 18.87 requests per second is low?
For a simple Varnished Django blog app, what would be an adequate value (indication)?
I already applied the recommended tweaks (tuned for my system) from this tutorial. What can I tweak more and how do I figure out the bottlenecks.
First some note about Docker. It is not meant to run a multiple processes in a single docker container. Docker is not a replacement for a VM. It simply allows to run processes in isolation. So the docker diagram should be:
VPS -> docker nginx container -> docker varnish container -> docker django container
To make your life using multiple Docker containers simpler, I would recommend to use Docker-compose. It is not perfect but its an excellent start.
Old but still fundamentally relent blog post about that. Note that some suggestions are no-longer relevant like nsenter since docker exec command is now available but most of the blog post is still correct.
As for your performance issues, yes, 18 requests per second is pretty low. However the issue is probably has nothing to do with nginx and is most likely in your Django application and possibly varnish (however very unlikely).
To debug PA issues in Django, I would recommend to use django-debug-toolbar. Most issues in Django are caused by unnecessary SQL queries. You can see them easily in debug toolbar. To solve most them you can use select_related() and prefetch_related. For more detailed analysis, I would also recommend profiling your application. cProfile is a great start. Also some IDEs like PyCharm include built-in profilers so its pretty easy to profile your application to see which functions are taking most of the time which you can optimize. Finally you can use 3rd party tools to profile your application. Even free newrelic account will give you quite a bit of information. Alternatively you can use opbeat which is a new cool kid on the block.

jetty 404 error page on hot deployment

I am currently using Jetty 9.1.4 on Windows.
When I deploy the war file without hot deployment config, and then restart the Jetty service. During that 5-10 seconds starting process, all client connections to my Jetty server are waiting for the server to finish loading. Then clients will be able to view the contents.
Now with hot deployment config on, the default Jetty 404 error page shows within that 5-10 second loading interval.
Is there anyway I can make the hot deployment has the same behavior as the complete restart - clients connections will wait instead seeing the 404 error page ?
Unfortunately this does not seem to be possible currently after talking with the Jetty developers on IRC #jetty.
One solution I will try to use are two Jetty instances with a loadbalancing reverse proxy (e.g. nginx) before them and taking one instance down for deployment.
Of course this will instantly lead to new requirements (session persistence/sharing) which need to be handled. So in conclusion: much work to do in the Java world for zero downtime on deployments.
Edit: I will try this, seems like a simple enough solution http://rafaelsteil.com/zero-downtime-deploy-script-for-jetty/ Github: https://github.com/rafaelsteil/jetty-zero-downtime-deploy