How to solve high request latency on Cloud Run (nearly 4minutes) - google-cloud-platform

Hi everyone at this point I am at a loss. I am in the process of brining our application up in Cloud Run. I did have a little bit of worry running "serverless" for this application stack. I have 1 frontend NextJS application and Backend GraphQL (node) application. I use schema stitching against the backend to connect to our managed CMS service.
I bumped the number of max instances the Serverless VPC access connector and this really helped the requests from 5/6 slowdown to 1/6.
Right now, 1/6 requests are being very slow (nearly 4minutes to resolve). The slow request is when the browser asks for the API to return information from the CMS + some information on the CMS data from our internal DB. I have this same application running in heroku with less resources and it is running smooth so to me that rules out a code issue.
Areas I have checked:
Cloud SQL CPU and connection pool (very low)
Container Image size (small as a node app can be)
Keeping min and max number of containers at the same number
Allocating CPU
I am at the point of thinking the backend application is just not a proper fit for Cloud Run but I do like the simplicity of getting applications running in cloud run and their approach to serverless.
Any help appreciated

Related

Concurrent requests to Nginx Server

I am having a problem with my server dealing with a large volume of concurrent users signing on and operating at the same time. Our business case requires the user base to be logging in at the exact same time (1 min window) and performing various operations in our application. Once the server goes past a 1000 concurrent users, the website starts loading very slowly and constant giving 502 errors.
We have reviewed the server metrics (CPU, RAM, Data traffic utilization) on the cloud are most resources are operating at below 10%. Scaling up the server doesn't help.
While the website is constant giving 502 errors are responding after a long time, any direct database queries and SSH connection are working fine. As such we have concluded that issue is primarily focused on the number of concurrent requests the server is handling due to any Nginx or Gunicorn configuration we may have set up incorrectly.
Please advice on any possible incorrect configuration (or any other solution) to this issue :
Server info :
Server specs - AWS c4.8xlarge ( CPU and RAM)
Web Server -Nginx
Backend Server -Gunicorn
nginx conf imageGunicorn conf file

Django extremely slow page loads when using remote database

I have a working Django application that is running locally using an sqlite3 database without problem. However, when I change the Django database settings to use my external AWS RDS database all my pages start taking upwards of 40 seconds to load. I have checked my AWS metrics and my instance is not even close to being fully utilized. When I make a request to a view with no database read/write operations I also get the same problem. My activity monitor shows my local CPU spiking with each request. It shows a process named 'WindowsServer' using most of the CPU during each request.
I am aware more latency is expected when using a remote database but I don't think this should result in 40 second page lags. What other problems that could be causing this behaviour?
AWS database monitoring
Local machine
So your computer has connection to the server in Amazon, that's the problem with latency. Production servers should be in the same place as DB servers(or should have very very good connection, so the latency is lowered as much as possible.)
--edit--
So we need more details. What is your ISP? What is your connection properties? Uplink, downlink? What are pings to servers in AWS?

504 Gateway Time-out from google cloud platform, but only sometimes

I'm hosting a CouchBase single node cluster in GCP and a flask backend OpenShift cluster which supports angular frontend. The problem is, when a post is called in my flask by my angular, it is taking too much time to get connected the VM (couchbase) and hence flask has to return a "504 Gateway Time-out". But this happens only sometimes. Sometimes it just works very well with proper speed. Not able to troubleshoot. The total data size is less than 100M, and everything is 100% memory resident in Couchbase. So I guess this is not a problem with Couchbase. Just the connection latency to GCP.
My guess is that the first time your flask backend is trying to connect for the first time to your VM, it's taking more than usual as it needs to establish the connection, authenticate and possibly do other things depending on your use-case.
This is a common problem when hosting your app on App Engine or something similar and the solution there is to use "warm-up requests". This basically spins up the whole connection (and in app engine case the instance) and makes a test connection just so when the desired connection comes, everything is already set up.
So I suggest that you check how warm-up requests work and configure something similar between your flask and VM. So basically a route in flask with the only purpose of establishing a test connections with a test package. This way your next connection will be up to speed with no 504 errors.
try to clear to cache of load balancer in GCP console
I already faced same kind of issue and resolved it using above technique

504 gateway timeout for any requests to Nginx with lot of free resources

We have been maintaining a project internally which has both web and mobile application platform. The backend of the project is developed in Django 1.9 (Python 3.4) and deployed in AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use Redis based cache server to serve resource intensive heavy queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (by following the 2*n+1 rule). Load wise, there are like 20-30 mobile users making requests in every minute and there are 5-10 users checking the web panel every hour. So I would say, not very much load.
Now this setup works alright for 80% days. But when something goes wrong (for example, we detect a bug in the live system and we had to switch off the server for maintenance for few hours. In the mean time, the mobile apps have a queue of requests ready in their app. So when we make the backend live, a lot of users hit the system at the same time.), the server stops behaving normally and started responding with 504 gateway timeout error.
Surprisingly every time this happened, we found the server resources (CPU, Memory) to be free by 70-80% and the connection pool in the databases are mostly free.
Any idea where the problem is? How to debug? If you have already faced a similar issue, please share the fix.
Thank you,

Django app freezing with a few concurrent requests

I have a django app without views, I only use it to provide a REST API using django-piston package.
Since I have deployed it to amazon-ec2 with mod-wsgi, after some requests it freezes, and the CPU goes to 100% of usage divided by python and httpd processes.
I'm using Postgres 8.4, Python 2.5 and Django 'ENGINE': 'django.contrib.gis.db.backends.postgis'.
Logs don't show me any problem. How can I debug the problem?
Sounds like you're in a micro instance. Micro instances are able to burst large amounts of cpu for a VERY short amount of time, after that they must drop to very low background levels for an extended duration or else amazon with harshly throttle it. If you're getting concurrent requests most likely even a lightly cpu intensive app would cause the throttling to kick in.
Micro instances are only usable for very very light traffic on something like a very basic blog and that's like it.
Their user guide goes into this in detail: Micro Instance guide.