Locust.io: Controlling the requests per second parameter

I have been trying to load test my API server using Locust.io on EC2 compute-optimized instances. It provides easy-to-configure options for setting the wait time between consecutive requests and the number of concurrent users. In theory, rps = #_users / wait_time. However, while testing, this rule breaks down beyond a fairly low threshold of #_users (around 1200 users in my experiment). Varying hatch_rate and #_of_slaves, including in a distributed test setting, had little to no effect on the rps.
Experiment info
The test was done on a c3.4xlarge AWS EC2 compute node (AMI image) with 16 vCPUs, General Purpose SSD storage, and 30GB RAM. During the test, CPU utilization peaked at 60% (depending on the hatch rate, which controls the number of concurrent processes spawned), staying under 30% on average.
Locust.io
Setup: uses pyzmq, with each vCPU core set up as a slave. Single POST request setup with a request body of ~20 bytes and a response body of ~25 bytes. Request failure rate: < 1%, with a mean response time of 6ms.
Variables: time between consecutive requests set to 450ms (min: 100ms, max: 1000ms), hatch rate at a comfortable 30 per second, and RPS measured by varying #_users.
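For reference, a minimal locustfile sketch approximating this setup with the current Locust API (the endpoint path and payload below are hypothetical placeholders, not from the actual test code):

from locust import HttpUser, task, between

class ApiUser(HttpUser):
    # Wait 0.1-1.0 s between consecutive requests, as in the experiment.
    wait_time = between(0.1, 1.0)

    @task
    def post_payload(self):
        # ~20-byte request body; "/endpoint" is a hypothetical path.
        self.client.post("/endpoint", json={"k": "v"})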
The RPS follows the equation as predicted for up to 1000 users. Increasing #_users beyond that has diminishing returns, with a cap reached at roughly 1200 users. #_users isn't the only independent variable here; changing the wait time affects the RPS as well. However, changing the experiment setup to a 32-core instance (c3.8xlarge) or 56 cores (in a distributed setup) doesn't affect the RPS at all.
So really, what is the way to control the RPS? Is there something obvious I am missing here?

(one of the Locust authors here)
First, why do you want to control the RPS? One of the core ideas behind Locust is to describe user behavior and let that generate load (requests in your case). The question Locust is designed to answer is: How many concurrent users can my application support?
I know it is tempting to go after a certain RPS number and sometimes I "cheat" as well by striving for an arbitrary RPS number.
But to answer your question: are you sure your Locusts don't end up in a deadlock? As in, they complete a certain number of requests and then become idle because they have no other task to perform? It's hard to tell what's happening without seeing the test code.
Distributed mode is recommended for larger production setups, and most real-world load tests I've run have been on multiple but smaller instances. But it shouldn't matter if you are not maxing out the CPU. Are you sure you are not saturating a single CPU core? Not sure what OS you are running, but if it's Linux, what is your load value?

While there is no direct way of controlling RPS, you can try the constant_pacing and constant_throughput options for wait_time.
From the docs:
https://docs.locust.io/en/stable/api.html#locust.wait_time.constant_throughput
In the following example the task will always be executed once every second, no matter the task execution time:
from locust import User, constant_throughput

class MyUser(User):
    wait_time = constant_throughput(1)
constant_pacing is the inverse of this.
So if you run with 100 concurrent users, the test will run at 100 RPS (assuming each request takes less than 1 second in the first place).
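As a hedged sketch, hitting a target RPS then becomes a matter of choosing the user count (the /api path and payload here are hypothetical):

from locust import HttpUser, task, constant_throughput

class SteadyUser(HttpUser):
    # Each user performs at most 1 task iteration per second,
    # so total RPS is roughly equal to the number of running users.
    wait_time = constant_throughput(1)

    @task
    def ping(self):
        self.client.post("/api", json={"k": "v"})  # hypothetical endpoint

Running this with 500 users should hold roughly 500 RPS, provided individual requests stay under one second.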

Related

Latency jitters when using shared memory for IPC

I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer and atomic variables for enforcing memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for the setup with 2 processes:
sender process - writes to the buffer in shared memory, then sleeps, so it pushes data at intervals of around 55 microseconds.
receiver process - runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high for safety), although ideally a maximum of 1 element will be present in the buffer with the current setup. Also, I am pushing data around 3 million times to get a good estimate of the end-to-end latency.
To measure the latency, I get the current time in nanoseconds and store it in a vector (resized to 3 million at the beginning). I have a 6-core setup with isolated CPUs, and I use taskset to pin the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine during this testing. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already verified that all data transfer is accurate and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation was too high (a few thousand nanoseconds). Looking at the numbers, I found that there are 2-3 instances where the latency shoots up to 600,000 ns and then gradually comes down (in steps of around 56,000 ns - probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample "jitter" here (values in nanoseconds):
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery datapoints, the std_dev becomes much smaller. So I went digging into what could be the reason for this. Initially I was looking for a pattern, or whether it occurs periodically, but it does not seem so in my opinion.
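For illustration, here is a small Python sketch of that analysis step - row-wise stats and spike filtering - on synthetic data (the real measurements are in C++; the numbers below are stand-ins):

import random, statistics

# Synthetic stand-in for the ~3M measured latencies (ns): mostly 300-400 ns,
# with one injected spike like the 568086-ns sample above.
latencies = [random.gauss(350, 40) for _ in range(100_000)]
latencies[50_000] = 568_086

print("mean  :", statistics.mean(latencies))
print("median:", statistics.median(latencies))
print("stdev :", statistics.pstdev(latencies))

# Filter out spikes far above the median and recompute the spread.
cutoff = 10 * statistics.median(latencies)
steady = [v for v in latencies if v < cutoff]
print("stdev without spikes:", statistics.pstdev(steady))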
I ran the receiver process with perf stat -d; it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor voluntary_ctxt_switches and nonvoluntary_ctxt_switches, and I see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches stays constant once the data sharing starts. But the problem is that over the roughly 200 seconds of my setup's runtime, the number of latency spikes is around 2 or 3 and does not match the frequency of these context-switch numbers. (What is this count then?)
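To monitor those counters I poll /proc/<pid>/status once per second; a small Python helper along these lines (assuming the standard Linux /proc layout) does the job:

import sys, time

def ctxt_switches(pid):
    # Parse the two *_ctxt_switches counters from /proc/<pid>/status.
    counts = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.endswith("ctxt_switches"):
                counts[key] = int(value)
    return counts

pid = int(sys.argv[1])
while True:
    print(ctxt_switches(pid))
    time.sleep(1)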
I also followed a thread which feels relevant, but couldn't get anywhere with it.
For the core running the receiver process, the trace on core 1 with context switches is (but the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
 1)  jemallo-22010  =>  <idle>-0
 1)  <idle>-0       =>  kworker-138
 1)  kworker-138    =>  <idle>-0
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are:

name                               receiver_core   sender_core
enp1s0f0np1-0                      2               0
eno1                               0               3280
Non-maskable interrupts            25              25
Local timer interrupts             2K              ~3M
Performance monitoring interrupts  25              25
Rescheduling interrupts            9               12
Function call interrupts           120             110
machine-check polls                1               1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be that the spikes are not caused by context switches in the first place, but a number in the 600 µs range does hint in that direction. Leads in any other direction would be very helpful. I have also tried restarting the server.
Turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file. Stopping that recording removed the spikes. So the high latency was due to some kind of write flush happening.

Performance testing using Ultimate thread group

I want to use the Ultimate Thread Group for my test, with 2100 users of concurrency and a Synchronizing Timer with the number of simulated users to group by set to 100.
I want to configure the thread group to run for 10 minutes.
I am not sure how to distribute that across initial delay, startup time, hold load, and shutdown time.
We cannot suggest anything meaningful because we don't know what your desired load pattern is.
Normally people configure threads arrival/leaving so it would be:
Ramp-up phase - the load increases gradually, which allows you to correlate the increasing load with changing metrics like response time, transactions per second, errors per second, etc.
"Plateau" phase - check how the system behaves under constant sustained load.
Ramp-down phase - allows you to check whether the system gets back to normal when the load decreases.
If you don't have better ideas, go for 33% each for ramp-up, plateau, and ramp-down; in your case it will be easiest to take 3 minutes for ramp-up and ramp-down and 4 minutes to hold the load.
The relevant Ultimate Thread Group configuration for that 3/4/3-minute split would be a single row with Start Threads Count: 2100, Initial Delay: 0, Startup Time: 180, Hold Load For: 240, and Shutdown Time: 180 (all times in seconds).
With regards to the Synchronizing Timer: it acts as a rendezvous point for all Samplers in its scope. Given a ramp-up of 180 seconds for 2100 users, roughly 11.7 users arrive every second, so the first request will be executed around the 9th second of your test, once 100 users have arrived; from then on, requests will be executed in "spikes" of 100 users at a time.
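The arithmetic above, as a quick sketch:

users, rampup_s, group_size = 2100, 180, 100
arrival_rate = users / rampup_s          # ~11.7 users per second
first_spike = group_size / arrival_rate  # ~8.6 s until the first batch of 100 fires
print(arrival_rate, first_spike)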

How to do a very large number of HTTP requests in the shortest time

So we have a huge database with around 300,000 URLs. These URLs have to be pinged to get data from them (they are radio stations playing songs; the data is metadata about what is playing).
Some of them are sometimes inactive and sometimes active.
At any given time, around 80,000 are active. Some respond slowly, some respond quickly. I have a server and I am thinking of doing this using C++.
My goal is to ping and parse (or crawl) them within 1 minute and keep repeating the process, because the information (the song playing on them) can change over time, mostly within 2-7 minutes. But I am not sure if it is possible.
What should be my approach to do it?
I have thought of creating two programs: one to test whether each URL is active or not, run twice a day, which also records how long it generally takes to respond - does it usually respond slowly, and is it responding slower than usual right now?
The other does the actual crawling, where the fastest URLs are crawled first, with some dedicated threads for URLs which respond faster.
I would love better ideas or better solutions for this. Can anyone tell me how to do the maths to find the number of dedicated threads I should allot to each, to get the results in the least amount of time?
You don't need raw CPU performance (the CPU is not your bottleneck at the moment), but you do need to avoid network-layer stalls: if the request timeout is 60 seconds and you have 16 threads and hit 16 very slow servers (which will eventually time out), you are stalled for 60 seconds, processing nothing else.
So I would start with, let's say, 500 threads (and a 15-30s timeout, if you know the very slow radios can fit even that), keep some statistics about their turnaround, and keep adding more worker threads dynamically for every request which didn't get a response within 2-3 seconds. 80000/500 = 160, so each "normally quick" worker thread then has to ping around 160 URLs; if each takes 2 seconds, that's still 320 seconds ≈ 5 min! So 500 sounds like a minimum.
That said, having 500+ threads will somewhat burden CPU and memory (not sure how much; with a decent thread/memory model implementation, 500 doesn't sound like much for a modern x86 CPU with GBs of RAM, and even 5000 still sounds reasonable), but I would worry a lot more about the network layer and about possible firewalls in between; you need server-grade networking for that volume of requests (if I tried something like that from home, my own router would filter me out with default settings, detecting it as some kind of DoS attack).
So gather some statistics on how long a request takes on average, then take your target round time (2-7 min) and divide the number of URLs by the number of requests each thread can complete in that time. For example, with an average ping of 5s and a round time of 3 min: 300,000 / (3*60/5) = 8333.3, so at least ~8300 threads are needed. Then you will have to profile your app to verify that with 8000+ threads it will not choke on something else but will really handle the task as expected.
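The same estimate as a quick sketch:

urls, round_s, avg_ping_s = 300_000, 3 * 60, 5
requests_per_thread = round_s / avg_ping_s   # 36 requests per thread per round
threads_needed = urls / requests_per_thread  # ~8333 threads
print(requests_per_thread, threads_needed)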
(The other option is to fire asynchronous HTTP requests from a single thread, but that sort of creates its own thread per task anyway, so I would rather manage the threads myself and use synchronous HTTP calls.)
And thinking about the dynamic-growth mechanics: you can keep counters of how many new requests were added in the last second and how many finished (either responded or failed); after a few seconds of running, these should start to form a "throughput" statistic, and if the throughput is under the desired threshold, you can add more threads.
About active/inactive: keep the response time/last-seen/last-checked together with the URL, and add some further logic to check a URL only when it makes sense (e.g. not within the next 60s if it just responded, or check inactive ones only 6h after the last test). You also need to avoid checking the same URL in two different threads at the same time, so some central manager code should feed the threads with targets (maybe a thread-safe FIFO queue - you can actually use its size to estimate how well the worker threads are keeping up, and add more threads when the queue is not emptying fast enough; that avoids adding the statistics code to the threads themselves). A sketch of that manager/worker layout follows.
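A hedged Python sketch of the layout (the question asks about C++, but the structure carries over; the URLs, thread count, and timeout are illustrative):

import queue
import threading
import urllib.request

url_queue = queue.Queue()

def worker():
    while True:
        url = url_queue.get()
        try:
            # Short timeout so slow stations don't stall the thread for long.
            with urllib.request.urlopen(url, timeout=15) as resp:
                metadata = resp.read(1024)  # metadata payloads are small
        except Exception:
            pass  # mark the station inactive / retry later
        finally:
            url_queue.task_done()

# Start with ~500 threads, as suggested above.
for _ in range(500):
    threading.Thread(target=worker, daemon=True).start()

# The manager feeds targets; queue depth tells you when to add workers.
for url in ["http://radio.example/stream1", "http://radio.example/stream2"]:
    url_queue.put(url)
url_queue.join()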

Elastic Beanstalk high CPU load after a week of running

I am running a single-instance worker on AWS Beanstalk. It is a single-container Docker that runs some processes once every business day. Mostly, the processes sync a large number of small files from S3 and analyze those.
The setup runs fine for about a week, and then CPU load starts growing linearly in time, as in this screenshot.
The CPU load stays at a considerable level, slowing down my scheduled processes. At the same time, my top-resources tracking command running inside the container (privileged Docker mode enabled to allow it):
echo "%CPU %MEM ARGS $(date)" && ps -e -o pcpu,pmem,args --sort=pcpu | cut -d" " -f1-5 | tail
shows nearly no CPU load (which changes only during the time that my daily process runs, seemingly accurately reflecting system load at those times).
What am I missing here in terms of the origin of this "background" system load? Wondering if anybody has seen similar behavior, and/or could suggest additional diagnostics from inside the running container.
So far I have been re-starting the setup every week to remove the "background" load, but that is sub-optimal since the first run after each restart has to collect over 1 million small files from S3 (while subsequent daily runs add only a few thousand files per day).
The profile is a bit odd, especially the linear growth - almost as if something is accumulating and taking progressively longer to process.
I don't have enough information to point at a specific issue. A few things that you could check:
Are you collecting files anywhere, whether intentionally or in a cache or transfer folder? It could be that the system is running background processes (AV, indexing, defrag, dedupe, etc.) and the "large number of small files" is accumulating into something that has to be paged or handled inefficiently.
Does any part of your process use a weekly naming convention or housekeeping process? Might you be getting conflicts, or accumulating workload as the week rolls over - i.e. the 2nd week is actually processing both the 1st and 2nd week's data but never completing, so each day it gets progressively worse? I saw something similar where an inappropriate bubble-sort process never completed (it never reached the completion condition because the slow but steady inflow of data kept resetting it) and its demand grew progressively as the array got larger.
Do you have any logging on a weekly rollover cycle?
Are there any other key performance metrics following the trend? (network, disk IO, memory, paging, etc.)
Do consider whether it is a false positive: if it is genuinely high CPU, there should be other metrics mirroring the CPU behaviour - cache use, disk IO, S3 transfer statistics/logging.
RL

How to make TensorFlow use more available CPU

How can I fully utilize each of my EC2 cores?
I'm using a c4.4xlarge AWS Ubuntu EC2 instance and TensorFlow to build a large convolutional neural network. nproc says that my EC2 instance has 16 cores. When I run my convnet training code, the top utility says that I'm only using 400% CPU. I was expecting it to use 1600% CPU because of the 16 cores. The AWS EC2 monitoring tab confirms that I'm only using 25% of my CPU capacity. This is a huge network; on my new Mac Pro it consumes about 600% CPU and takes a few hours to build, so I don't think the reason is that my network is too small.
I believe the line below ultimately determines CPU usage:
sess = tf.InteractiveSession(config=tf.ConfigProto())
I admit I don't fully understand the relationship between threads and cores, but I tried increasing the number of threads. It had the same effect as the line above: still 400% CPU.
NUM_THREADS = 16
sess = tf.InteractiveSession(config=tf.ConfigProto(intra_op_parallelism_threads=NUM_THREADS))
EDIT:
htop shows that I am actually using all 16 of my EC2 cores, but each core is only at about 25%.
top shows that my total CPU % is around 400%, but occasionally it will shoot up to 1300% and then almost immediately go back down to ~400%. This makes me think there could be a deadlock problem.
Several things you can try:
Increase the number of threads
You already tried changing intra_op_parallelism_threads. Depending on your network, it can also make sense to increase inter_op_parallelism_threads. From the docs:
inter_op_parallelism_threads: Nodes that perform blocking operations are enqueued on a pool of inter_op_parallelism_threads available in each process. 0 means the system picks an appropriate number.
intra_op_parallelism_threads: The execution of an individual op (for some op types) can be parallelized on a pool of intra_op_parallelism_threads. 0 means the system picks an appropriate number.
(Side note: the values from the configuration file referenced above are not the actual default values TensorFlow uses, just example values. You can see the actual default configuration by manually inspecting the object returned by tf.ConfigProto().)
TensorFlow uses 0 for both of the above options, meaning it tries to choose appropriate values itself. I don't think TensorFlow picked poor values that caused your problem, but you can try out different values for the above options to be on the safe side.
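For example, a sketch setting both options explicitly with the TF 1.x-style session API used in the question:

import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=16,  # threads used within a single op
    inter_op_parallelism_threads=16,  # threads used across independent ops
)
sess = tf.InteractiveSession(config=config)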
Extract traces to see how well your code parallelizes
Have a look at
tensorflow code optimization strategy
It gives you a timeline trace of which threads are busy when. In such a trace you can see whether the actual computation happens on far fewer threads than are available; this could be the case for your network. Synchronization points show up as short moments where all threads are active at once, which could explain the sporadic peaks in CPU utilization that you experience.
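One common way to extract such a trace with the TF 1.x API (train_op and feed-dict handling here stand in for your own training step):

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

# Run one training step with tracing enabled (sess and train_op are yours).
sess.run(train_op, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace file viewable at chrome://tracing.
tl = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(tl.generate_chrome_trace_format())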
Miscellaneous
Make sure you are not running out of memory (htop)
Make sure you are not doing a lot of I/O or something similar