Concurrent requests to Nginx Server - amazon-web-services

I am having a problem with my server handling a large volume of concurrent users signing on and operating at the same time. Our business case requires the whole user base to log in at the same time (within a 1-minute window) and perform various operations in our application. Once the server goes past 1000 concurrent users, the website starts loading very slowly and constantly returns 502 errors.
We have reviewed the server metrics (CPU, RAM, data traffic utilization) on the cloud, and most resources are operating at below 10%. Scaling up the server doesn't help.
While the website is constantly giving 502 errors or responding only after a long time, direct database queries and SSH connections work fine. We have therefore concluded that the issue is primarily related to the number of concurrent requests the server is handling, probably because of an Nginx or Gunicorn setting we have configured incorrectly.
Please advise on any possible incorrect configuration (or any other solution) for this issue:
Server info:
Server specs - AWS c4.8xlarge ( CPU and RAM)
Web Server - Nginx
Backend Server - Gunicorn
(Attached images: nginx conf, Gunicorn conf file)

Related

Cloud SQL Proxy connection times out occasionally

We use a single-tenant architecture for our instances. Each instance contains 3 Django apps, i.e. Django, Celery {worker, beat}, and a few other things that don't interact with the database. We deploy cloudsql-proxy as a sidecar for these Django containers, which run as a Pod in Google Kubernetes Engine. We are using CloudSQL (Postgres 9.6) by Google, and it has a public IP address.
The problem is that we are getting OperationalErrors on the Django side, i.e.
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
When we check the Pod's logs for the time the OperationalError occurred, we see the following error from the cloudsql-proxy container:
couldn't connect to db_instance: dial tcp our_db_instance_public_ip:3307: connect: connection timed out
It is not that the connection to the database doesn't work. It works most of the time, but sometimes it throws the above errors, which is a pain because we run Celery tasks every other minute and they fail because of this. Sometimes this occurs while an end user is interacting with our application, and their requests fail.
Our application isn't under very high load. We set the maximum number of database connections to 1000, and the peak number of connections is around 35 (the sum of all instances' connections). I checked the database stats and it seems pretty happy, i.e. CPU utilization almost never goes above 50%, disk is 30% used, and memory usage is around 50%.
I can provide more details if needed. Would appreciate any help!

Django extremely slow page loads when using remote database

I have a working Django application that is running locally using an sqlite3 database without problem. However, when I change the Django database settings to use my external AWS RDS database all my pages start taking upwards of 40 seconds to load. I have checked my AWS metrics and my instance is not even close to being fully utilized. When I make a request to a view with no database read/write operations I also get the same problem. My activity monitor shows my local CPU spiking with each request. It shows a process named 'WindowsServer' using most of the CPU during each request.
I am aware that more latency is expected when using a remote database, but I don't think this should result in 40-second page lags. What other problems could be causing this behaviour?
AWS database monitoring
Local machine
Your computer is connecting to a server in Amazon, so the problem is latency. Production servers should be in the same place as the DB servers (or should have a very good connection to them, so that latency is as low as possible).
--edit--
So we need more details. What is your ISP? What are your connection properties (uplink, downlink)? What are the ping times to servers in AWS?
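If ICMP ping is blocked, the round trip to the database host can be estimated by timing TCP connects instead. A minimal sketch, assuming the database accepts TCP connections on a known port; the function name and the host in the usage comment are hypothetical:

```python
import socket
import time


def tcp_connect_latency_ms(host, port, attempts=3, timeout=2.0):
    """Time a few TCP connects to host:port and return the samples in ms.

    A TCP connect takes roughly one network round trip, so this is a
    rough stand-in for ping when ICMP is not available.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Open the connection and close it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


# Usage (hypothetical RDS endpoint, default PostgreSQL port):
# print(tcp_connect_latency_ms("mydb.example.rds.amazonaws.com", 5432))
```

If each sample is in the hundreds of milliseconds, a page that issues many sequential queries will pay that cost once per query.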

504 gateway timeout for any request to Nginx with a lot of free resources

We have been maintaining a project internally that has both web and mobile application platforms. The backend of the project is developed in Django 1.9 (Python 3.4) and deployed in AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use a Redis-based cache server to serve resource-intensive queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (following the 2*n+1 rule). Load-wise, there are about 20-30 mobile users making requests every minute and 5-10 users checking the web panel every hour. So I would say, not very much load.
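For reference, the 2*n+1 rule mentioned above is simple arithmetic on the core count. A minimal sketch (the function name is made up for illustration):

```python
def gunicorn_workers(cores):
    """Return the (2 * cores) + 1 worker count used as a starting point."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 2 * cores + 1


# The 2-core instance described above gives 5 workers.
print(gunicorn_workers(2))
```

The rule is only a starting point derived from CPU count; it says nothing about how long each request holds a worker.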
This setup works fine on about 80% of days. But when something goes wrong, the server stops behaving normally and starts responding with 504 gateway timeout errors. For example, we detect a bug in the live system and have to switch off the server for maintenance for a few hours. In the meantime, the mobile apps build up a queue of pending requests, so when we bring the backend back up, a lot of users hit the system at the same time.
Surprisingly, every time this happened, we found the server resources (CPU, memory) to be 70-80% free and the database connection pools mostly free.
Any idea where the problem is? How to debug? If you have already faced a similar issue, please share the fix.
Thank you,

Apache too slow to respond, but CPU and memory not maxed out

The problem
2 Apache servers have a long response time, but I do not see CPU or memory maxing out.
Details
I have 2 Apache servers serving static content for a client.
This website has a lot of traffic.
At high traffic I have ~10 requests per second (HTML, CSS, JS, images).
Each HTML page makes 30 other requests to the servers to load JS, CSS, and images.
Safari's developer tools show that 2 MB is transferred each time I hit an HTML page.
These two servers are running on Amazon Web Services.
Both instances are m1.large (2 CPUs, 7.5 GB RAM).
I'm serving images from the same server.
The servers are in the US, but a lot of traffic comes from Europe.
I tried
changing from prefork to worker
increasing processes
increasing threads
increasing time out
I'm running benchmarks with ab (apachebench) and I do not see improvement.
My questions are:
Is it possible that serving the images and large resources like JS (400k) might be slowing down the server?
Is it possible that 5 requests per second per server is just too much traffic and there is no tuning I can do, so the only solution is to add more servers?
Does Amazon Web Services have a problem with bandwidth?
New Info
My files are being read from a mounted directory on GlusterFS
Metrics collected with ab (ApacheBench) run on an EC2 instance on the same network
Connections: 500
Concurrency: 200
Server with files on mounted directory (files on glusterfs)
Request per second: 25.26
Time per request: 38.954
Transfer rate: 546.02
Server without files on mounted directory (files on local storage)
Request per second: 1282.62
Time per request: 0.780
Transfer rate: 27104.40
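To put the two runs side by side, here is a quick sketch that computes how far apart they are. The numbers are copied from the ab output above; the dictionary keys and function name are made up for illustration:

```python
# Figures copied from the two ab runs above.
glusterfs = {"requests_per_second": 25.26,
             "time_per_request": 38.954,
             "transfer_rate": 546.02}
local = {"requests_per_second": 1282.62,
         "time_per_request": 0.780,
         "transfer_rate": 27104.40}


def slowdown_factors(slow, fast):
    """Return how many times worse the slow run is on each metric."""
    return {
        # Higher is better for these two, so divide fast by slow.
        "requests_per_second": fast["requests_per_second"] / slow["requests_per_second"],
        "transfer_rate": fast["transfer_rate"] / slow["transfer_rate"],
        # Lower is better here, so divide slow by fast.
        "time_per_request": slow["time_per_request"] / fast["time_per_request"],
    }


for metric, factor in slowdown_factors(glusterfs, local).items():
    print(f"{metric}: {factor:.1f}x")
```

All three metrics come out at roughly a 50x difference between the mounted directory and local storage.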
New Question
Is it possible that reading the resources (HTML, JS, CSS, images) from a mounted directory (NFS or GlusterFS) might dramatically slow down Apache's performance?
Thanks
It is absolutely possible (and indeed probable) that serving up large static resources could slow down your server. Apache has to keep a worker thread open the entire time that each one of these pieces of content is being downloaded. The larger the file, the longer the download, and the longer you have to hold a thread open. You might be reaching your max threads limit before reaching any memory limit you have set for Apache.
First, I would recommend getting all of your static content off of your server and into CloudFront or a similar CDN. That way your web server only has to worry about the primary web requests. This might take the requests per second (and the related number of open Apache threads) down from 10 requests/second to about 0.3 requests/second (based on your 30:1 ratio of secondary content requests to primary requests).
Reducing the number of requests you serve by over an order of magnitude will certainly help server performance and may allow you to go down to a single server or, if you still want multiple servers (which is a good idea), to reduce the size of your servers.
One thing basically all high-volume websites have in common is that they leave the business of serving static content to a CDN. Once you get to the point of being a high-volume site, you absolutely must consider this (or at least serve static content from different servers using Nginx, Lighty, or some other web server better suited to serving static content than Apache is).
After offloading your static traffic, you can really start worrying about tuning your web servers to handle the primary requests. When you get to that point, you will need to know a few things:
The average memory usage for a single request thread
The amount of memory that you have allocated to Apache (maybe 70-80% of overall instance memory if this is dedicated Apache server)
The average amount of time it takes your application to respond to requests
Based on that, it is a pretty simple formula to make a good starting point for tuning your max thread settings.
Say you had the following:
Apache memory: 4000KB
Avg. thread memory: 20KB
Avg. time per request: 0.5 s
That means your configuration could handle request throughput as follows:
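400 requests/second = 4000 KB / (20 KB * 0.5 seconds/request)
Put another way, the memory budget allows 200 threads (4000 KB / 20 KB), each completing 2 requests per second. If your actual load is, say, 100 requests/second and each request averages 0.5 s, you would need about 50 threads to handle it.
Obviously, you would want to set your max threads higher than 50 (up to the 200 that fit in memory) to account for request spikes and such, but at least this gives you a good place to start.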
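The sizing arithmetic can be sketched in a few lines. This is only an illustration using the example figures above; the function names are made up, and the 100 requests/second load is an assumed target, not a measurement:

```python
import math


def max_threads(apache_memory_kb, thread_memory_kb):
    """Threads that fit in the memory budget set aside for Apache."""
    return int(apache_memory_kb // thread_memory_kb)


def throughput_ceiling(threads, avg_request_seconds):
    """Requests/second if every thread is busy all the time."""
    return threads / avg_request_seconds


def threads_for_load(requests_per_second, avg_request_seconds):
    """Threads kept busy on average by a given request rate."""
    return math.ceil(requests_per_second * avg_request_seconds)


threads = max_threads(4000, 20)
print(threads)                           # threads that fit in memory
print(throughput_ceiling(threads, 0.5))  # upper bound on requests/second
print(threads_for_load(100, 0.5))        # threads busy at 100 requests/second
```

With the example numbers, the memory budget allows 200 threads, which bounds throughput at 400 requests/second, while a load of 100 requests/second keeps about 50 threads busy. The gap between the two is the headroom available for spikes.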
Try to start/stop the instance. This will move you to a different host. If the host your instance is on is having any issues, that will mitigate it.
Beyond checking system load numbers, take a look at memory usage, IO and CPU usage.
Look at your system log to see if anything produced an error that may explain the current situation.
Check out Eric J.'s answer in this thread: Amazon EC2 Bitnami Wordpress Extremely Slow

Normal CPU Usage But Slow Jetty Response Time During Peak Hour

We have several web servers running Jetty to serve 100 requests per second.
During peak hours, Jetty's response time becomes slow and the number of requests it handles drops.
We have checked that
- Jetty's CPU usage is around 20-30%, which is healthy.
- IO figure is normal
- no slow query in our DB and the DB server is healthy too.
- network is healthy.
Adding more web servers solved the problem.
But I don't understand why CPU usage does not rise when the web traffic load is heavy.
Has anyone had a similar experience?