Latency jitter when using shared memory for IPC - C++

I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer, and atomic variables to enforce memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for this setup with 2 processes:
Sender process: writes to the buffer in shared memory, then sleeps, so it pushes data at intervals of roughly 55 microseconds.
Receiver process: runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high, for safety), although ideally a maximum of 1 element will be present in the buffer with the current setup. Also, I am pushing data around 3 million times to get a good estimate of the end-to-end latency.
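For context, here is a minimal sketch of this kind of setup (illustrative only, not my actual code: the SpscRing type, the segment name "ipc_demo" and the object name "ring" are made up, and it assumes std::atomic is lock-free so that it works across processes):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <boost/interprocess/managed_shared_memory.hpp>

    // Illustrative single-producer/single-consumer ring buffer living in shared memory.
    struct SpscRing {
        static constexpr std::size_t kCapacity = 4096;   // power of two, matches the 4K buffer above
        std::atomic<std::size_t> head{0};                // advanced by the producer
        std::atomic<std::size_t> tail{0};                // advanced by the consumer
        std::uint64_t slots[kCapacity];                  // payload, e.g. a sequence number or timestamp

        bool push(std::uint64_t v) {
            const std::size_t h = head.load(std::memory_order_relaxed);
            if (h - tail.load(std::memory_order_acquire) == kCapacity) return false;  // full
            slots[h % kCapacity] = v;
            head.store(h + 1, std::memory_order_release);                             // publish to the consumer
            return true;
        }
        bool pop(std::uint64_t& v) {
            const std::size_t t = tail.load(std::memory_order_relaxed);
            if (head.load(std::memory_order_acquire) == t) return false;              // empty
            v = slots[t % kCapacity];
            tail.store(t + 1, std::memory_order_release);
            return true;
        }
    };

    int main() {
        namespace bip = boost::interprocess;
        // Both processes open the same named segment; find_or_construct gives them the same object.
        bip::managed_shared_memory segment(bip::open_or_create, "ipc_demo", 1 << 20);
        SpscRing* ring = segment.find_or_construct<SpscRing>("ring")();
        (void)ring;  // sender calls push() every ~55 us, receiver busy-loops on pop()
    }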
To measure the latency, I get the current time in nanoseconds and store it in a vector (resized to 3 million entries at the beginning). I have a 6-core setup with isolated CPUs, and I taskset the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine while doing this testing. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already verified that all data transfer is accurate and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
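To illustrate the measurement method (a sketch, not the actual code; in the real setup each process keeps its own vector and the subtraction is done offline, and it assumes both processes read the same monotonic clock, which std::chrono::steady_clock / CLOCK_MONOTONIC provides system-wide on Linux):

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical helper: nanoseconds on the steady (monotonic) clock.
    inline std::uint64_t now_ns() {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(
                   std::chrono::steady_clock::now().time_since_epoch()).count();
    }

    int main() {
        constexpr std::size_t kSamples = 3'000'000;
        // Resized up front so the hot loops never allocate.
        std::vector<std::uint64_t> send_ts(kSamples), recv_ts(kSamples);

        // Sender side:   send_ts[i] = now_ns(); push(...); sleep ~55 us.
        // Receiver side: busy-loop on pop();    recv_ts[i] = now_ns();

        // Offline, the per-message latency is the row-wise difference:
        std::vector<std::int64_t> latency(kSamples);
        for (std::size_t i = 0; i < kSamples; ++i)
            latency[i] = static_cast<std::int64_t>(recv_ts[i] - send_ts[i]);
    }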
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation was too high (a few thousand nanoseconds). Looking at the numbers, I found that there are 2-3 instances where the latency shoots up to 600,000 nanoseconds and then gradually comes down (in steps of around 56,000 nanoseconds; probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample "jitter" here:
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery data points, the standard deviation becomes very small. So I went digging into what the reason for this could be. Initially I was looking for some pattern, or whether it occurs periodically, but it does not seem so in my opinion.
I ran the receiver process under perf stat -d; it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor voluntary_ctxt_switches and nonvoluntary_ctxt_switches, and I see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches stays constant once the data sharing starts. But the problem is that over the roughly 200 seconds my setup runs, the number of latency spikes is around 2 or 3, which does not match the frequency of these context-switch counts. (What is this count, then?)
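For reference, the counters I am watching can also be read programmatically; a small sketch (the pid is just an example value):

    #include <fstream>
    #include <iostream>
    #include <string>

    // Print the voluntary/nonvoluntary context-switch counters of a process.
    void print_ctxt_switches(int pid) {
        std::ifstream status("/proc/" + std::to_string(pid) + "/status");
        for (std::string line; std::getline(status, line); )
            if (line.rfind("voluntary_ctxt_switches", 0) == 0 ||
                line.rfind("nonvoluntary_ctxt_switches", 0) == 0)
                std::cout << line << '\n';
    }

    int main() { print_ctxt_switches(22010); }  // e.g. the receiver's pid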
I also followed a thread which seemed relevant, but couldn't get anything out of it.
For the core running the receiver process (core 1), the trace entries involving context switches are as follows (but the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
2) <idle>-0 => kworker-138
3) kworker-138 => <idle>-0
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are:

name                                 receiver_core   sender_core
enp1s0f0np1-0                        2               0
eno1                                 0               3280
Non-maskable interrupts              25              25
Local timer interrupts               2K              ~3M
Performance monitoring interrupts    25              25
Rescheduling interrupts              9               12
Function call interrupts             120             110
Machine-check polls                  1               1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be the case that the spikes are not caused by context switches in the first place, but a number in the range of 600 microseconds does hint in that direction. Leads in any other direction would be very helpful. I have also tried restarting the server.

It turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file. Stopping that recording removed the spikes, so the high latency was due to some kind of write flush happening.
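For what it's worth, a common way to keep that recording off the hot path is to buffer the received data in a preallocated in-memory vector during the run and write it to disk only after the measurement is over (or hand it off to a dedicated writer thread on a non-isolated core). A rough sketch, with a made-up Sample struct and file name:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Sample { std::uint64_t seq, recv_ns; };   // hypothetical record layout

    int main() {
        std::vector<Sample> log;
        log.reserve(3'000'000);                      // preallocate so the hot loop never allocates

        // Hot loop: only an in-memory append, no file I/O and no flush on this path.
        // while (running) { Sample s = /* popped data + timestamp */; log.push_back(s); }

        // After the run: dump everything in one go, off the latency-critical path.
        std::FILE* f = std::fopen("received.bin", "wb");
        if (f) {
            std::fwrite(log.data(), sizeof(Sample), log.size(), f);
            std::fclose(f);
        }
    }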

Related

CLIPS EnvAssertString API-function slow performance

I'm trying to develop an expert system which is capable of managing a real-time data flow. During coding I found a delay in operation which varied from 3 to almost 20 milliseconds, and this is totally inappropriate for the project. Profiling the application showed that the problem resided in the EnvAssertString function, whereas EnvRun did not produce any delay. I tried to temporarily disable garbage collection before EnvAssertString, but it didn't help. The function in question is called between 10 and 50 times in a row while processing a single block of data, and the blocks arrive at a rate of approx. 15 blocks per second.
How can I fix this? Is there any chance of speeding the process up? Is CLIPS at all suitable for a real-time response like that (several calls in a row to EnvAssertString shouldn't take longer than 1 ms)?

MQTTnet poor performance when multi-threaded despite same number of messages being sent

I am sending around 20,000 messages per second; these can come from a number of arbitrary threads (the messages are processed before being sent to MQTTnet).
I have found that the fewer the threads, the better the performance; going over 16 simultaneous senders causes MQTTnet to grind to a halt, even with 10k messages per second.
It is not the threads that are slow: I poll the MQTTnet managed client buffer size every 10 seconds and see it increasing to the point where it becomes full (at the limit that I have set).
This is with the most recent code version and is something I noticed a number of months ago (as of today, August 2020). It was highlighted by my recent ThreadRipper system upgrade and by my code creating a number of threads equal to the number of environment processors: the same code base used 8 threads on the previous hardware and 48 on the new hardware, which caused the "failure".
48 decode/send threads caused MQTTnet to grind to a halt, whereas 4 to 8 threads were OK and performant. I can see the speed on the NIC to the MQTT server drop from 8 Mbps (with 4 to 8 sending threads) to less than 100 kbps when higher thread counts are used.
A local or remote MQTT server makes no difference. As mentioned, I can see the send buffer within MQTTnet increase, and it will do so until memory exhaustion unless a limit is set (either way, it will drop messages once under duress from the threads and in this state).
Note, in both cases, the total number of messages being sent per second remained the same - the only variable was the number of worker threads the messages were being sent from.
Is this a bug, or something I am doing wrong? Should I create my own queue to front-end the managed client and dispatch one message at a time? (I don't want to reinvent the wheel, but I want to ensure I am using the library correctly.)
I have found this seems to be related to running with versus without the debugger: starting without the debugger is orders of magnitude faster and can scale all the way to 48 threads (as per the environment processor count) without any issue and without any queuing whatsoever.
Strange, as the message volume is the same in both cases, the only difference being the thread count as mentioned (and even with the debugger attached and 8 threads, it can keep up without issue).
It seems there is an overhead when debugging with multiple sending threads, which may be obvious, but I couldn't find where this was documented.

How is Flink checkpointing time related to buffer alignment size or alignment time?

My streaming Flink job has a checkpointing time of 2-3 s (15-20% of the time), 3-4 min (8-12% of the time), and 2 min on average. We have two operators which are stateful. The first is the Kafka consumer as source (FlinkKafkaConsumer010) and the other is the HDFS sink (CustomBucketingSink). These two make up a state of around 1-1.5 GB for savepoints and 800 MB-6 GB (3 GB average) for checkpoints. We have a 30 s tumbling processing window. The checkpointing duration and the minimum pause between two checkpoints is 3 min. My job consumes around 3 million records per minute on average and around 20 million records per minute at peak time. There is more than enough CPU and memory for Flink.
Now here are my doubts:
1) Even when some checkpoints' state sizes are much smaller (70-80% less) than others', they take minutes (15-20% of the time) compared to the others, which take 5-10 secs.
2) The buffer alignment size sometimes increases to 7-8 GB compared to the 800 MB-1 GB average, but the checkpointing time is not affected by this. I would guess it should take more time, as it has to wait for the checkpoint barrier.
3) Will the checkpointing time be affected if we increase the tumbling window size? I assume it should affect neither the savepoint time nor the checkpoint time.
4) A few of the sub-tasks which sink into HDFS take 2-3 mins (5-10% of the time). So while 98% of sub-tasks are completed in 30-50 secs, 1-2 sub-tasks (95% of the time it is only one) take 2-3 mins, which delays the whole checkpointing time. The problem is not with the node on which these sub-tasks are running, because it happens sometimes on one node and sometimes on another.
5) We are getting one exception every 6-8 hours, which restarts the job: TimerException{java.nio.channels.ClosedByInterruptException} at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$TriggerTask.run(SystemProcessingTimeService.java:288)
6) How can we minimize the buffer alignment time?
7) Savepoint time increases or decreases with the rate of input or the state size, but checkpointing time does not behave the same way. Checkpointing time sometimes shows an inverse relation with state size, or one could say it is not affected by the state size at all.
8) Whenever we restart the job, all sub-tasks take a uniform time for 2-3 days on all nodes, but afterwards 1-2 sub-tasks take 2-3 minutes compared to the others, which take 15-30 secs. I might be wrong about this behaviour, but as far as I have observed, this is also the case.
Note that windows are stateful, and unless you are doing incremental aggregation, longer windows have more state, which will in turn affect checkpoint sizes and durations.
It would be helpful to know which state backend you are using, and whether or not you are using incremental checkpointing.
I would start by trying to find the cause of the slow sink subtask(s) causing the backpressure, which is in turn causing the painful checkpointing. Could be data skew, or resource starvation, for example. Some common causes include insufficient CPU, network, or disk bandwidth, or AWS (or other API) rate limits. It may seem that you have plenty of CPU, for example, but one hot key can put way too much load on one thread, and thereby hold back the entire cluster.
If you find a way to correct the imbalance at the sink, then the checkpoint alignment problems should calm down. (Note that if you can tolerate duplicate results, you could disable checkpoint barrier alignment by choosing CheckpointingMode.AT_LEAST_ONCE.)

How to do a very large number of HTTP requests in the shortest time

So we have a very large database which has around 300,000 URLs. These URLs have to be pinged to get data from them (the URLs are radio stations which are playing songs; the data is the metadata).
Some of them are sometimes inactive and sometimes active.
At any given time, around 80,000 are active. Some respond slowly, some respond quickly. I have a server and I am thinking of doing this using C++.
My goal is to ping and parse (or crawl) them within 1 minute and keep repeating the process, because the information (the song playing on them) can change over time, mostly every 2-7 minutes. But I am not sure if it is possible.
What should be my approach to do it?
I have thought of creating two programs: one to test whether a URL is active or not, run twice a day, which also records how much time it generally takes to respond and whether it is responding slower than usual right now.
And the other to do the actual crawling, where the fastest will be crawled first and some dedicated threads are kept for the URLs which respond faster.
I would love better ideas or better solutions for this. Can anyone tell me how to do the maths to find out the number of dedicated threads I should allot to each, to get the results in the least amount of time?
You don't need CPU performance (it is not your bottleneck at the moment), but you need to avoid network-layer stalls... if the request timeout is 60 seconds, and you have 16 threads, and you hit 16 very slow servers (which will eventually time out), you are stalled for 60 seconds and not processing anything more.
So I would start with, let's say, 500 threads (and something like a 15-30 s timeout, if you know the very slow radios can fit even that), keep some statistics about their turnaround, and keep adding more worker threads dynamically for every request which didn't get a response within 2-3 secs. 80000/500 = 160, so each "normally quick" worker thread then has to ping around 160 URLs; if each takes 2 seconds, that's still 320 s ≈ 5 min! So 500 sounds like a minimum.
That said, having 500+ threads will somewhat burden the CPU and memory (not sure by how much; with a decent thread/memory model implementation, 500 doesn't sound like much for a modern x86 CPU with GBs of RAM, and even 5000 still sounds reasonable), but I would worry a lot more about the network layer and about possible firewalls around you: you need server-grade networking for such an amount of requests (if I tried something like that from home, my own router would filter me out with default settings, detecting it as some kind of DoS attack).
So gather some statistics on how long the requests take on average, then take your target time (2-7 min), and divide the number of URLs accordingly; for example, with an average ping of 5 s and a round time of 3 min, you need at least 300,000/(3*60/5) = 8333.33 threads. Then you will have to profile your app to verify that with 8000+ threads it will not choke on something else, but will really handle the task as expected.
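Written out as code, the same back-of-the-envelope calculation (using the example numbers from the paragraph above) looks like this:

    #include <cmath>
    #include <cstdio>

    int main() {
        const double urls = 300000;         // total URLs to poll
        const double avg_response_s = 5;    // average response time per request
        const double target_round_s = 180;  // desired time for one full round (3 min)

        // Each thread can finish target_round_s / avg_response_s requests per round,
        // so the minimum thread count is:
        const double min_threads = urls / (target_round_s / avg_response_s);
        std::printf("at least %.0f threads\n", std::ceil(min_threads));  // ~8334
    }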
(Another option is to fire asynchronous HTTP requests from a single thread, but that sort of creates its own threads for each task anyway, so I would rather manage the threads myself and use synchronous HTTP calls.)
And thinking about the dynamic-grow mechanics... you can keep some counters about how many new requests were added in the last second and how many finished (either responded or failed); after a few seconds of running, these should start to form some kind of "throughput" statistic, and if the throughput is under the desired threshold, you can add more threads.
About active/inactive... keep the response time / last-seen / last-checked together with the URL, and add some further logic to check a URL only when it makes sense (like not within the next 60 s if it just responded, or checking an inactive one only 6 h after the last test). You also need to avoid checking the same URL in two different threads at the same time, so some central manager code should feed the threads with targets (maybe some FIFO thread-safe queue, as sketched below ... you can actually use its size to estimate how well the worker threads are keeping up, so you can add more threads when you see the queue is not emptying fast enough; that avoids adding the statistics code to the threads themselves).
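To make that manager/queue idea concrete, here is a minimal sketch (not production code): a mutex-protected FIFO of URLs feeding a fixed pool of worker threads. fetch_one is a placeholder for the actual synchronous HTTP call (e.g. via libcurl), and the URL pattern and counts are made up.

    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <vector>

    std::queue<std::string> work;   // URLs to check in this round
    std::mutex m;
    std::condition_variable cv;
    bool done = false;              // no more URLs will be added this round

    void fetch_one(const std::string& url) {
        // Placeholder for the real synchronous HTTP request + metadata parsing.
        (void)url;
    }

    void worker() {
        for (;;) {
            std::string url;
            {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [] { return done || !work.empty(); });
                if (work.empty()) return;      // round finished and queue drained
                url = std::move(work.front());
                work.pop();
            }
            fetch_one(url);                    // the slow part happens outside the lock
        }
    }

    int main() {
        // Manager: fill the queue once per round; watching the queue size over time
        // tells you whether the pool is keeping up and whether to add more workers.
        {
            std::lock_guard<std::mutex> lk(m);
            for (int i = 0; i < 80000; ++i)
                work.push("http://radio" + std::to_string(i) + ".example/status");
            done = true;
        }
        std::vector<std::thread> pool;
        for (int i = 0; i < 500; ++i) pool.emplace_back(worker);
        cv.notify_all();
        for (auto& t : pool) t.join();
    }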

Locust.io: Controlling the request per second parameter

I have been trying to load test my API server using Locust.io on EC2 compute-optimized instances. It provides an easy-to-configure option for setting the wait time between consecutive requests and the number of concurrent users. In theory, rps = #_users / wait_time. However, while testing, this rule breaks down beyond fairly low values of #_users (in my experiment, around 1200 users). The variables hatch_rate and #_of_slaves, including in a distributed test setting, had little to no effect on the rps.
Experiment info
The test has been done on a c3.4xlarge AWS EC2 compute node (AMI image) with 16 vCPUs, General Purpose SSD storage and 30 GB RAM. During the test, CPU utilization peaked at 60% max (depending on the hatch rate, which controls how quickly the concurrent users are spawned), staying under 30% on average.
Locust.io
Setup: uses pyzmq, with each vCPU core set up as a slave. A single POST request with a request body of ~20 bytes and a response body of ~25 bytes. Request failure rate: < 1%, with a mean response time of 6 ms.
Variables: time between consecutive requests set to 450 ms (min: 100 ms, max: 1000 ms), hatch rate at a comfy 30 per sec, and RPS measured by varying #_users.
The RPS follows the equation as predicted for up to 1000 users. Increasing #_users beyond that has diminishing returns, with a cap reached at roughly 1200 users. #_users here isn't the only independent variable; changing the wait time affects the RPS as well. However, changing the experiment setup to a 32-core instance (c3.8xlarge) or 56 cores (in a distributed setup) doesn't affect the RPS at all.
So really, what is the way to control the RPS? Is there something obvious I am missing here?
(one of the Locust authors here)
First, why do you want to control the RPS? One of the core ideas behind Locust is to describe user behavior and let that generate load (requests in your case). The question Locust is designed to answer is: How many concurrent users can my application support?
I know it is tempting to go after a certain RPS number and sometimes I "cheat" as well by striving for an arbitrary RPS number.
But to answer your question: are you sure your Locust users don't end up in a deadlock? As in, they complete a certain number of requests and then become idle because they have no other task to perform? It is hard to tell what's happening without seeing the test code.
Distributed mode is recommended for larger production setups, and most real-world load tests I've run have been on multiple but smaller instances. But it shouldn't matter if you are not maxing out the CPU. Are you sure you are not saturating a single CPU core? Not sure what OS you are running, but if it's Linux, what is your load value?
While there is no direct way of controlling rps, you can try the constant_pacing and constant_throughput options for wait_time.
From the docs:
https://docs.locust.io/en/stable/api.html#locust.wait_time.constant_throughput
In the following example the task will always be executed once every 1 second, no matter the task execution time:
    from locust import User
    from locust.wait_time import constant_throughput  # imports added for completeness

    class MyUser(User):
        wait_time = constant_throughput(1)
constant_pacing is the inverse of this.
So if you run with 100 concurrent users, the test will run at 100 rps (assuming each request takes less than 1 second in the first place).