WSO2 API Manager CPU graphs - wso2

Wanted to understand what is this system load average parameter that is displayed under CPU metrics. How do we interpret it?

It must be the load average in the CPU.
In case you're not familiar with it,
CPU load average can be used to troubleshoot Tenable products'
performance issue. This metric can be collected by a debug, such as SC
debug sc-systeminfo.txt shows like this: Results of "uptime" are:
15:23:38 up 13 days, 8 min, 1 user, load average: 3.84, 3.72, 2.41
CPU load is the number of processes which are being executed by CPU or
waiting to be executed by CPU. So CPU load average is the average
number of processes being or waiting executed over past 1, 5 and 15
minutes. So the number shown above means:
load average over the last 1 minute is 3.84
load average over the last 5 minute is 3.72
load average over the last 15 minute is 2.41
Ref: https://community.tenable.com/s/article/What-is-CPU-Load-Average

Related

Latency jitters when using shared memory for IPC

I am using shared memory for transferring data between two process, using boost::interprocess::managed_shared_memory to allocate a vector as buffer and atomic variables for enforcing memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for the setup with 2 processes -
sender process - writes to the buffer in shared memory, and sleeps. So the rate at which it pushes data is in interval of around 55 microseconds.
receiver process - runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high for safety), although ideally a maximun of 1 element will be present in the buffer as per the current setup. Also, I am pushing data around 3 million times to get a good estimate for the end to end latency.
To measure the latency - I get the current time in nanoseconds and store it in a vector (resized to size 3 million at the beginning). I have a 6 core setup, with isolated cpus, and I do taskset to different cores for both sender and receiver process. I also make sure no other program is running from my end on the machine when doing this testing. Output of /proc/cmdline
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already done the verification that all data transfer is accurate and nothing is lost. So simple row-wise subtraction of the timestamp is sufficient to get the latency.
I am getting latency of around a 300-400 nanosecods as mean and median of the distribution, but the standard deviation was too high (few thousands of nanos). On looking at the numbers, I found out that there are 2-3 instances where the latency shoots upto 600000 nanos, and then gradually comes down (in steps of around 56000 nanos - probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample "jitter" here -
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery datapoints, the std_dev becomes very less. So I went digging into what can be the reason for this. Initially I was looking if there was some pattern, or if it is occuring periodically, but it doesnot seem so in my opinion.
I ran the receiver process with perf stat -d, it clearly shows the number of context switches to be 0.
Interestingly, when looking the receiver process's /proc/${pid}/status, I monitor
voluntary_ctxt_switches, nonvoluntary_ctxt_switches and see that the nonvoluntary_ctxt_switches increase at a rate of around 1 per second, and voluntary_ctxt_switches is constant once the data sharing starts. But the problem is that for around the 200 seconds of my setup runtime, the number of latency spikes is around 2 or 3 and does not match the frequency of this context_switch numbers. (what is this count then?)
I also followed a thread which feels relevant, but cant get anything.
For the core running the receiver process, the trace on core 1 with context switch is (But the number of spikes this time was 5)-
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
2) <idle>-0 => kworker-138
3) kworker-138 => <idle>-0
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are
name
receiver_core
sender_core
enp1s0f0np1-0
2
0
eno1
0
3280
Non-maskable interrupts
25
25
Local timer interrupts
2K
~3M
Performance monitoring interrupts
25
25
Rescheduling interrupts
9
12
Function call interrupts
120
110
machine-check polls
1
1
I am not exactly sure of what most of these numbers represent. But I am curious as why there are rescheduling interrupts, and what is enp1s0f0np1-0.
It might be the case that the spike is not coming due to context switches at the first place, but a number of the range 600 mics does hunch towards that. Leads towards any other direction would be very helpful. I have also tried restarting the server.
Turns out the problem was indeed not related to context switch.
I was also dumping the received data in a file. Stopping that recording removed the spikes. So, the high latency was due to some kind of write flush happening.

Performance testing using Ultimate thread group

I want to use ultimate thread group for my test with 2100 users concurrency and synchronising timer with number of simulated users to group by 100.
Here I want to configure the thread group for 10 mins.
I am not sure how to distribute it across initial delay ,start up time, hold load and shut down time
We cannot suggest anything meaningful because we don't know what is your desired load pattern.
Normally people configure threads arrival/leaving so it would be:
Ramp-up phase - so the load would increase gradually, it will allow you to correlate increasing load with the changing metrics like response time, transactions per second, errors per second, etc.
"Plateau" phase - check how does system behave under constant sustained load
Ramp-down phase - it will allow to check whether system gets back to normal when the load decreases
If you don't have better ideas - go for 33% for ramp-up, plateau and ramp-down, in your case it will be easier to take 3 minutes for ramp-up and ramp-down and 4 minutes for the time to holds the load.
The relevant Ultimate Thread Group configuration:
With regards to the Synchronizing Timer, what it will do is to act as a rendezvous point for all Samplers in it's scope so given ramp-up of 180 seconds for 2100 users it means that 11.6 users will arrive every second so first request will be executed on 8th second of your test with 100 users then requests will be executed one by one each with 100 users in form of "spikes"

How to calculate average turnaround time - Round Robin and FIFO scheduling?

Five processes begins with their execution at (0, 0, 2, 3, 3) seconds and execute for (2, 2, 1, 2, 2) seconds. How do I calculate average turnaround time if:
a) We use Round Robin (quantum 1 sec.)
b) We use FIFO scheduling?
I am not sure how to solve this, could you guys help me out?
Here is the link of .png table;
table link
I suppose that your exercise is about scheduling tasks on a single processor. My understanding is hence the following:
With FIFO, each task is scheduled in order of arrival and is executed until it's completed
With RR, earch tasks scheduled is executed for a quantum of time only, sharing the processor between all active processes.
In this case you obtain such a scheduling table:
The turnaround is the time between the time the job is submitted, and the time it is ended. In first case, I find 19 in total thus 3.8 in average. In the second case, I find 25 in total and 5 on average.
On your first try, you have processes running in parallel. This would assume 2 processors. But if 2 processors are available, the round robin and the FIFO would have the same result, as there are always enough processors for serving the active processes (thus no waiting time). The total turnaround would be 9 and the average 1,8.

Profiling Python on large dataset

I have a dataset with 3 Mio lines to process. Processing functions are cythonized. When I do the entire processing on a small subsample of 10000 lines, processing time is about 1,5 minute, a subsample of 30000 lines gives a processing time of 3 min. However, when I process the whole dataset after 10 hours only 1/4th of the dataset is processed, although I expect a processing time of max. 5 hours. I'm running Ubuntu 14.04 64 Bit and Anaconda 64 bit. RAM usage is at 50%. I deactivated directing to login after a period of inactivity, performance stayed the same. Switching of the screen after inactivity didn't influence execution time eighter. What else could be the reason for this unexpectedly slow execution?

Locust.io: Controlling the request per second parameter

I have been trying to load test my API server using Locust.io on EC2 compute optimized instances. It provides an easy-to-configure option for setting the consecutive request wait time and number of concurrent users. In theory, rps = wait time X #_users. However while testing, this rule breaks down for very low thresholds of #_users (in my experiment, around 1200 users). The variables hatch_rate, #_of_slaves, including in a distributed test setting had little to no effect on the rps.
Experiment info
The test has been done on a C3.4x AWS EC2 compute node (AMI image) with 16 vCPUs, with General SSD and 30GB RAM. During the test, CPU utilization peaked at 60% max (depends on the hatch rate - which controls the concurrent processes spawned), on an average staying under 30%.
Locust.io
setup: uses pyzmq, and setup with each vCPU core as a slave. Single POST request setup with request body ~ 20 bytes, and response body ~ 25 bytes. Request failure rate: < 1%, with mean response time being 6ms.
variables: Time between consecutive requests set to 450ms (min:100ms and max: 1000ms), hatch rate at a comfy 30 per sec, and RPS measured by varying #_users.
The RPS follows the equation as predicted for upto 1000 users. Increasing #_users after that has diminishing returns with a cap reached at roughly 1200 users. #_users here isn't the independent variable, changing the wait time affects the RPS as well. However, changing the experiment setup to 32 cores instance (c3.8x instance) or 56 cores (in a distributed setup) doesn't affect the RPS at all.
So really, what is the way to control the RPS? Is there something obvious I am missing here?
(one of the Locust authors here)
First, why do you want to control the RPS? One of the core ideas behind Locust is to describe user behavior and let that generate load (requests in your case). The question Locust is designed to answer is: How many concurrent users can my application support?
I know it is tempting to go after a certain RPS number and sometimes I "cheat" as well by striving for an arbitrary RPS number.
But to answer your question, are you sure your Locusts doesn't end up in a dead lock? As in, they complete a certain number of requests and then become idle because they have no other task to perform? Hard to tell what's happening without seeing the test code.
Distributed mode is recommended for larger production setups and most real-world load tests I've run have been on multiple but smaller instances. But it shouldn't matter if you are not maxing out the CPU. Are you sure you are not saturating a single CPU core? Not sure what OS you are running but if Linux, what is your load value?
While there is no direct way of controlling rps, you can try constant_pacing and constant_throughput option in wait_time
From docs
https://docs.locust.io/en/stable/api.html#locust.wait_time.constant_throughput
In the following example the task will always be executed once every 1 seconds, no matter the task execution time:
class MyUser(User):
wait_time = constant_throughput(1)
constant_pacing is inverse of this.
So if you run with 100 concurrent users, test will run at 100rps (assuming each request takes less than 1 second in first place