We have a Django web app that serves a moderate number of users, running on an Ubuntu machine with 8 cores and at least 32 GB of RAM. We have no problems with users connecting via their browsers. However, on the backend (on the same server) we are also running a Twisted server. The Django web app connects to this Twisted server, but after about 1100-1200 such connections (including a number of persistent connections to other devices on the backend), all new connections start to time out. The Twisted server worked fine under low load, but now it seems unable to accept any new connections from Django; they all time out. We do not see anything obviously wrong with our code, which we have been working on for a couple of years now, so it should be fairly stable.

We have already set our soft and hard ulimits in /etc/security/limits.conf to 50000/65000, and we have raised somaxconn to 65536. The limits for our Twisted process are printed below. The total number of open files across the top 25 processes is just over 5000. Unfortunately, we still cannot get more than roughly 1100-1200 simultaneous connections to our Twisted server.

What should we look at to get our Twisted connections accepting again? Are there other sysctl or Ubuntu Linux parameters we need to change? Are there Twisted parameters we need to change?
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 465901 465901 processes
Max open files 50000 65000 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 465901 465901 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Twisted is a thin shell around your application. When there is a performance problem, the problem is almost always somewhere inside the application, not in Twisted, so there is no general answer to this question.
That said, there are a couple of investigation techniques you could use. Is your Twisted process consuming 100% CPU? If so, you are going to need to split it up into multiple processes somehow (using spawnProcess, sendFileDescriptor, and adoptStreamPort to allow I/O to be done in subprocesses). If not, then your problem is probably some inadvertent blocking I/O preventing the reactor from servicing requests; you might use something like twisted_hang to diagnose hot spots where the reactor is getting "stuck".
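For the second case, the core idea behind twisted_hang can be sketched in a few lines: a heartbeat scheduled on the reactor proves the reactor is alive, and a watchdog thread dumps the reactor thread's stack when the heartbeat stops firing. This is a minimal illustration of the technique, not twisted_hang's actual code; the interval and threshold are made-up values:

```python
# Sketch of the twisted_hang idea: a watchdog thread notices when the
# reactor stops servicing callLater calls and prints where it is stuck.
import sys
import threading
import time
import traceback

from twisted.internet import reactor

BEAT_INTERVAL = 0.1   # how often the reactor should prove it is alive
HANG_THRESHOLD = 0.5  # seconds without a heartbeat before we complain

beat = threading.Event()

def heartbeat():
    beat.set()
    reactor.callLater(BEAT_INTERVAL, heartbeat)

def watchdog(reactor_thread_id):
    while True:
        beat.clear()
        if not beat.wait(HANG_THRESHOLD):
            # The reactor missed its heartbeat: dump what the reactor
            # thread is doing right now (probably blocking I/O).
            frame = sys._current_frames()[reactor_thread_id]
            traceback.print_stack(frame)
            time.sleep(HANG_THRESHOLD)  # avoid spamming while still hung

def start_monitor():
    reactor_thread_id = threading.get_ident()  # the reactor runs here
    heartbeat()
    threading.Thread(
        target=watchdog, args=(reactor_thread_id,), daemon=True
    ).start()

reactor.callWhenRunning(start_monitor)
reactor.run()
```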
There's also the possibility that the problem is on Django's side of the connection. However, with no information about how Django is making those connections, there's little more I can do than guess.
Related
I deployed a .NET Core MVC app on an AWS Windows Server 2019 instance (32 GB RAM, 8 cores). It is an online exam application, so it needs to handle 100k concurrent requests. Which server should I use?
The concurrent connection behavior depends on the maximum concurrent connections in the site's advanced settings, the queue length and maximum worker processes in the application pool's advanced settings, and the maximum threads in the thread pool. Besides that, note that system.webServer/serverRuntime has an appConcurrentRequestLimit attribute with a default of 5000. So if you need to handle highly concurrent requests, you can go to
IIS Manager -> site -> Configuration Editor -> system.webServer/serverRuntime -> appConcurrentRequestLimit.
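For reference, that Configuration Editor change ends up as an attribute in applicationHost.config; a hypothetical fragment (the value here is an example, not a recommendation) would look roughly like:

```xml
<!-- applicationHost.config (illustrative value) -->
<system.webServer>
    <serverRuntime appConcurrentRequestLimit="100000" />
</system.webServer>
```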
I set the number of core connections per host using cass_cluster_set_max_connections_per_host() and the number of I/O threads using cass_cluster_set_num_threads_io().
Using netstat, I can see that my client host establishes (core connections × I/O threads) TCP connections to each host in my cluster. I am wondering: what is the difference between an I/O thread and a core connection? Also, if a client communicates with a Cassandra cluster of 10 hosts, the number of core connections is set to 2, and I/O threads is set to 4, then there are essentially 10 × 4 × 2 = 80 connections established from one host to the cluster, all in a single session. How are those connections utilized? Doesn't that seem excessive?
I am trying to tune those values so that if the cluster is connected to by 100 hosts simultaneously, the speed won't degrade. Or are those settings unrelated to speed? Any more information or links are appreciated!
Here is what the official documentation says about these fields:
cass_cluster_set_num_threads_io: This is the number of threads that will handle query requests. Default value: 1
cass_cluster_set_max_connections_per_host: Sets the maximum number of connections made to each server in each I/O thread. Default value: 2
I am wondering: what is the difference between an I/O thread and a core connection?
I/O threads are responsible for all the network operations between the client and the server. So if you have 1000 messages waiting on network operations, this thread will pick up the requests one by one and execute them. The default value is 1.
Once a message is picked up by the I/O thread, it uses one of the connections specified by set_max_connections to make the request. The default value is 2, so that the I/O thread can intelligently switch between connections based on server latency and throughput.
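You can verify the (I/O threads × connections per host) multiplication from the client side, much as the question did with netstat. A minimal Python sketch, assuming the third-party psutil package and Cassandra's default native-transport port 9042:

```python
# Count established TCP connections per remote Cassandra host.
# Expect roughly io_threads * connections_per_host for each host.
# May require elevated privileges on some operating systems.
from collections import Counter

import psutil  # third-party: pip install psutil

CASSANDRA_PORT = 9042  # assumed default native-transport port

counts = Counter(
    conn.raddr.ip
    for conn in psutil.net_connections(kind="tcp")
    if conn.status == psutil.CONN_ESTABLISHED
    and conn.raddr
    and conn.raddr.port == CASSANDRA_PORT
)
for host, n in sorted(counts.items()):
    print(f"{host}: {n} connections")
```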
I am trying to tune those values so that if the cluster is connected to by 100 hosts simultaneously, the speed won't degrade.
You can either keep the max connections constant and increase the number of I/O threads, or the other way around. There is no clearly better approach of the two; you will need to benchmark and see which works for your case.
I think that if you have fewer but larger requests, then increasing the number of connections makes more sense, but that still requires benchmarking.
This link also provides some extra info.
I'm trying to get a starting point for understanding what could cause a socket stall, and would appreciate any insights any of you might have.
So, the server is a modern dual-socket Xeon (2 × 6 cores @ 3.5 GHz) running Windows Server 2012. In a single process, there are 6 blocking TCP sockets with default options, each running on its own thread (not NUMA/core pinned). 5 of them are connected to the same remote server and receive very heavy loads (hundreds of thousands of small ~75-byte messages per second). The last socket is connected to a different server with a very light send/receive load for administrative messaging.
The problem I ran into was a 5-second stall on the admin messaging socket. Multiple send calls on the socket returned successfully, yet nothing was received from the remote server (a protocol ACK should arrive within milliseconds), and nothing was received by the remote admin server, for 5 seconds. It was as if that socket just turned off for a bit. After the 5-second stall passed, all of the ACKs came in a burst, and afterwards everything continued normally. During this time, the other sockets were receiving much higher message volumes than normal, but there was no indication of any interruption or stall: the data logs showed nothing unusual (light logging, maybe 500 msgs/sec).
From what I understand, a socket send call does not ensure that data has gone out on the wire, only that the transfer to the TCP stack was successful. So I'm trying to understand the different scenarios that could cause a 5-second stall on the admin socket. Is it possible that, due to the large amount of data being received, the TCP stack was essentially overwhelmed and prioritized the sockets that were being most heavily utilized? What other situations could potentially cause this?
Thank you!
If the sockets are receiving hundreds of thousands of 75-byte messages per second, there is a possibility that the server has some resource at maximum capacity. It may not be bandwidth: at 100k messages/second, 75 bytes × 8 bits × 100,000 is only around 60 Mbit/s of payload (more with TCP/IP headers), which a modern NIC handles easily. But it could be CPU utilization.
You should use two tools to understand your problem:
perfmon, to see the utilization of CPU (user and privileged: https://technet.microsoft.com/en-us/library/aa173932(v=sql.80).aspx), memory, bandwidth, and disk queue length. You can also check the number of interrupts and context switches with perfmon.
A sniffer like Wireshark, to see whether at the TCP level data is being transmitted and responses received.
Something else I would do is write a timestamp right after the send call, and right before and after the read call, in the thread in charge of the admin socket. Maybe it is a coding problem.
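A minimal sketch of that timestamping suggestion, in Python for brevity (your code is presumably native; `sock` here is assumed to be an already-connected blocking TCP socket):

```python
# Log wall-clock timings around send/recv on the admin socket so a
# stall shows up in the log instead of passing silently.
import time

def timed_send(sock, payload: bytes) -> None:
    t0 = time.monotonic()
    sock.sendall(payload)
    print(f"send of {len(payload)} bytes returned after "
          f"{time.monotonic() - t0:.6f}s")

def timed_recv(sock, bufsize: int = 4096) -> bytes:
    t0 = time.monotonic()
    data = sock.recv(bufsize)  # blocks until data arrives (or close)
    print(f"recv of {len(data)} bytes returned after "
          f"{time.monotonic() - t0:.6f}s")
    return data
```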
The fact that send calls return successfully doesn't mean the data was immediately sent. In TCP, the data is stored in the send buffer, and from there the TCP stack sends it to the other end.
If your system is CPU-bound (perfmon will tell you whether this is true), then you should pay attention to the comments written by @EJP; this is something that can happen when the machine is under heavy load. With the tools mentioned above, you can see whether the receive window on the admin socket has closed, or whether the socket read on the admin side is simply taking a long time.
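In Wireshark, a closed receive window on the admin connection shows up as zero-window packets, which you can isolate with a display filter such as:

```
tcp.analysis.zero_window
```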
I am deploying a uWSGI server for a Django app. Each request has a latency of around 2 seconds, and I need to handle 100 QPS. On a 4-core machine, how should I configure the number of processes and the number of threads? I have tried playing with the values, but I do not understand what I am doing.
Go through the uWSGI Things to know page. 100 requests per second should be easily attainable with uWSGI.
Based on uWSGI behavior I've experienced, I would recommend that you start with processes only and don't use any threads. With both processes and threads, we observed what seemed to be an affinity for threads over processes: a single process handled all requests until its thread pool was fully occupied, and only then were requests handled by the next process. This resulted in poor utilization of resources, as a single core was maxed out while all the others sat idle. Turning off threading resulted in a massive performance boost for our particular use model.
Your experience may be different. The uWSGI authors stress that there isn't any magic config combination: it's completely dependent on your particular use case. You need to benchmark your app against various configurations to find the sweet spot. Additionally, unless you're able to use benchmarks that perfectly model your actual production load, you'll want to continue monitoring performance and methodically tweaking settings after you deploy.
From the Things to know page:
There is no magic rule for setting the number of processes or threads
to use. It is very much application and system dependent. Simple math
like processes = 2 * cpucores will not be enough. You need to
experiment with various setups and be prepared to constantly monitor
your apps. uwsgitop could be a great tool to find the best values.
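To put numbers on the original question: at 100 requests/second with ~2 s per request, Little's law says roughly 100 × 2 = 200 requests will be in flight at once, so the processes × threads product needs to be around 200. A hypothetical uwsgi.ini starting point (the module path and every value here are illustrative, and per the advice above you may want to shift the balance toward processes if threads behave poorly for your app):

```ini
[uwsgi]
; hypothetical module path
module = mysite.wsgi:application
master = true
; e.g. 2 workers per core on a 4-core machine
processes = 8
; 8 x 25 = 200 concurrent request slots
threads = 25
; listen backlog for bursts (the kernel's somaxconn must allow it)
listen = 256
; recycle any request stuck longer than 30 seconds
harakiri = 30
```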
The problem
Two Apache servers have long response times, but I do not see CPU or memory maxing out.
Details
I have 2 Apache servers serving static content for clients.
This web site has a lot of traffic.
At high traffic I have ~10 requests per second (HTML, CSS, JS, images).
Each HTML page triggers 30 additional requests to the servers for JS, CSS, and images.
The Safari developer tools show that 2 MB is transferred each time I hit an HTML page.
These two servers are running on Amazon Web Services.
Both instances are m1.large (2 vCPUs, 7.5 GB RAM).
I'm serving images from the same servers.
The servers are in the US, but a lot of traffic comes from Europe.
I tried
changing from prefork to worker
increasing processes
increasing threads
increasing time out
I'm running benchmarks with ab (ApacheBench) and I do not see any improvement.
My questions are:
Is it possible that serving the images and large resources like JS (400 KB) might be slowing down the servers?
Is it possible that 5 requests per second per server is just too much traffic, and there is no tuning I can do, so the only solution is to add more servers?
Does Amazon Web Services have a problem with bandwidth?
New Info
My files are being read from a mounted directory on GlusterFS
Metrics collected with ab (ApacheBench), run from an EC2 instance on the same network:
Connections: 500
Concurrency: 200
Server with files on a mounted directory (files on GlusterFS):
Requests per second: 25.26
Time per request: 38.954 [ms]
Transfer rate: 546.02 [Kbytes/sec]
Server without files on a mounted directory (files on local storage):
Requests per second: 1282.62
Time per request: 0.780 [ms]
Transfer rate: 27104.40 [Kbytes/sec]
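For reference, numbers like those come from an ApacheBench invocation along these lines (URL hypothetical, and assuming "Connections: 500" above maps to ab's total request count `-n`):

```
ab -n 500 -c 200 http://my-test-server.example/index.html
```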
New Question
Is it possible that reading the resources (HTML, JS, CSS, images) from a mounted directory (NFS or GlusterFS) might dramatically slow down Apache's performance?
Thanks
It is absolutely possible (and indeed probable) that serving up large static resources could slow down your server. You have to hold an Apache worker thread open for the entire time each of these pieces of content is being downloaded. The larger the file, the longer the download, and the longer you have to hold a thread open. You might be hitting your max-threads limit well before reaching any of the memory limits you have set for Apache.
First, I would recommend getting all of your static content off of your servers and into CloudFront or a similar CDN. Your web servers would then only have to worry about the primary page requests. That could take the request rate (and the related number of open Apache threads) down from 10 requests/second to roughly 0.3 requests/second (based on your 30:1 ratio of secondary content requests to primary requests).
Reducing the number of requests you serve by over an order of magnitude will certainly help server performance, and may allow you to reduce to a single server (or, if you still want multiple servers, which is a good idea, possibly reduce their size).
One thing you will find that essentially all high-volume websites have in common is that they leave the business of serving static content to a CDN. Once you reach the point of being a high-volume site, you absolutely must consider this (or at least serve static content from separate servers using Nginx, Lighttpd, or some other web server better suited to static content than Apache is).
After offloading your static traffic, you can start worrying about tuning your web servers to handle the primary requests. At that point, you will need to know a few things:
The average memory usage for a single request thread
The amount of memory you have allocated to Apache (maybe 70-80% of overall instance memory if this is a dedicated Apache server)
The average amount of time it takes your application to respond to requests
Based on that, there is a pretty simple formula for a good starting point when tuning your max-threads setting.
Say you had the following:
Apache memory: 4000 KB
Avg. thread memory: 20 KB
Avg. time per request: 0.5 s
That memory budget allows at most 4000 KB / 20 KB = 200 concurrent threads, and since each request holds a thread for about 0.5 s, 200 threads could sustain up to 200 / 0.5 = 400 requests/second.
Looked at from the demand side, a target of 100 requests/second at 0.5 s per request keeps about 100 × 0.5 = 50 threads busy at any moment.
Obviously, you would want to set your max threads higher than 50 to account for request spikes and such, but at least this gives you a good place to start.
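As a concrete (and purely illustrative) translation of that starting point into worker-MPM directives, with headroom above the 50-thread baseline:

```apache
# Hypothetical worker MPM sizing (Apache 2.2 names; on 2.4,
# MaxClients is spelled MaxRequestWorkers).
# 8 processes x 25 threads = 200 threads max, well above the ~50
# concurrent threads needed for 100 req/s at 0.5 s per request.
<IfModule mpm_worker_module>
    ServerLimit          8
    StartServers         2
    ThreadsPerChild      25
    MaxClients           200
    MinSpareThreads      25
    MaxSpareThreads      75
</IfModule>
```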
Try stopping and starting the instance. This will move you to a different physical host; if the host your instance is on is having any issues, that will mitigate them.
Beyond checking system load numbers, take a look at memory usage, I/O, and CPU usage.
Look at your system log to see if anything produced an error that may explain the current situation.
Check out Eric J.'s answer in this thread: Amazon EC2 Bitnami Wordpress Extremely Slow