MQTTnet: poor performance when multi-threaded despite the same number of messages being sent

I am sending around 20,000 messages per second; these can come from a number of arbitrary threads (the messages are processed before being sent to MQTTnet).
I have found that the fewer the threads, the better the performance; going over 16 simultaneous senders causes MQTTnet to grind to a halt even at 10k messages per second.
It is not the threads that are slow: I poll the MQTTnet Managed Client buffer size every 10 seconds and see it increasing until it becomes full (at the limit I have set).
This is with the most recent code version and something I noticed a number of months ago (as of today, August 2020). It was highlighted by my recent ThreadRipper system upgrade: my code creates a number of threads equal to the number of environment processors, so the same code base used 8 threads on the previous hardware but 48 on the new hardware, which caused the "failure".
48 decode/send threads caused MQTTnet to grind to a halt, whereas 4 to 8 threads were OK and performant. I can see the speed on the NIC to the MQTT server drop from 8 Mbps (with 4 to 8 sending threads) to less than 100 kbps when higher thread counts are used.
A local or remote MQTT server makes no difference - as mentioned, I can see the send buffer within MQTTnet increase, and it will do so until memory exhaustion unless a limit is set (either way, it will drop messages once under duress from threads and in this state).
Note, in both cases, the total number of messages being sent per second remained the same - the only variable was the number of worker threads the messages were being sent from.
Is this a bug, or something I am doing wrong? Should I create my own queue to front-end the managed client and dispatch one message at a time? (I don't want to reinvent the wheel, but I want to ensure I am using the library correctly.)

I have found this seems to be related to running with the debugger attached versus starting without debugging - starting without debugging is orders of magnitude faster and can scale all the way to 48 threads (as per the environment processor count) without any queuing whatsoever.
Strange, as the message volume is the same in both cases, the only difference being the thread count as mentioned (and even with the debugger attached, 8 threads can keep up without issue).
It seems there is an overhead when debugging with multiple sending threads - which may be obvious, but I couldn't find where this was documented.

Related

Latency jitters when using shared memory for IPC

I am using shared memory for transferring data between two processes, using boost::interprocess::managed_shared_memory to allocate a vector as a buffer and atomic variables for enforcing memory synchronization (similar to boost::lockfree::spsc_queue).
I was measuring the end-to-end latency for the setup with 2 processes -
sender process - writes to the buffer in shared memory, and sleeps, so it pushes data roughly every 55 microseconds.
receiver process - runs a busy loop to see if something can be consumed from the buffer.
I am using a RingBuffer of size 4K (high, for safety), although ideally at most 1 element will be present in the buffer with the current setup. Also, I am pushing data around 3 million times to get a good estimate of the end-to-end latency.
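For concreteness, here is a minimal sketch of this kind of single-producer/single-consumer ring buffer in shared memory. It is a simplification, not the actual code: it uses POSIX shared memory and std::atomic indices instead of boost::interprocess::managed_shared_memory, and the names and sizes are illustrative.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <new>
#include <sys/mman.h>
#include <unistd.h>

struct Slot { uint64_t payload; };

// Single-producer/single-consumer ring buffer living in shared memory.
struct RingBuffer {
    static constexpr std::size_t kCapacity = 4096;       // 4K slots, as in the setup above
    std::atomic<uint64_t> head{0};                        // written only by the producer
    std::atomic<uint64_t> tail{0};                        // written only by the consumer
    Slot slots[kCapacity];

    bool push(const Slot& s) {                            // sender process
        uint64_t h = head.load(std::memory_order_relaxed);
        if (h - tail.load(std::memory_order_acquire) == kCapacity) return false;  // full
        slots[h % kCapacity] = s;
        head.store(h + 1, std::memory_order_release);     // publish the write
        return true;
    }
    bool pop(Slot& out) {                                 // receiver process (busy loop)
        uint64_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;              // empty
        out = slots[t % kCapacity];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

// Both processes map the same named segment; exactly one of them creates it.
RingBuffer* map_ring(bool create) {
    int fd = shm_open("/spsc_demo", create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0) return nullptr;
    if (create && ftruncate(fd, sizeof(RingBuffer)) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, sizeof(RingBuffer), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return nullptr;
    return create ? new (p) RingBuffer() : static_cast<RingBuffer*>(p);
}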
To measure the latency, I get the current time in nanoseconds and store it in a vector (resized to 3 million entries at the beginning). I have a 6-core setup with isolated CPUs, and I use taskset to pin the sender and receiver processes to different cores. I also make sure no other program of mine is running on the machine during this testing. Output of /proc/cmdline:
initrd=\initramfs-linux-lts.img root=PARTUUID=cc2a533b-d26d-4995-9166-814d7f59444d rw isolcpus=0-4 intel_idle.max_cstate=0 idle=poll
I have already verified that all data transfer is accurate and nothing is lost, so a simple row-wise subtraction of the timestamps is sufficient to get the latency.
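As a reference for the measurement itself, here is a minimal sketch of the timestamping described above (the names and the 3 million sample count are illustrative; on Linux, std::chrono::steady_clock maps to CLOCK_MONOTONIC, which is shared between the two processes):

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nanosecond timestamp on a clock that both processes share (CLOCK_MONOTONIC).
inline uint64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch()).count();
}

int main() {
    constexpr std::size_t kSamples = 3000000;
    std::vector<uint64_t> send_ts(kSamples), recv_ts(kSamples);   // pre-sized, no allocation in the loop

    // ... the sender fills send_ts[i] just before pushing item i,
    //     the receiver fills recv_ts[i] just after popping item i ...

    std::vector<int64_t> latency_ns(kSamples);                    // row-wise subtraction
    for (std::size_t i = 0; i < kSamples; ++i)
        latency_ns[i] = int64_t(recv_ts[i]) - int64_t(send_ts[i]);
    return 0;
}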
I am getting a latency of around 300-400 nanoseconds as the mean and median of the distribution, but the standard deviation is too high (a few thousand nanos). Looking at the numbers, I found that there are 2-3 instances where the latency shoots up to 600000 nanos and then gradually comes down (in steps of around 56000 nanos - probably queueing is happening and consecutive pops from the buffer are successful). Attaching a sample "jitter" here -
568086 511243 454416 397646 340799 284018 227270 170599 113725 57022 396
If I filter out these jittery data points, the std_dev becomes much smaller. So I went digging into what the reason for this could be. Initially I looked for some pattern, or whether it occurs periodically, but it does not seem so in my opinion.
I ran the receiver process with perf stat -d, it clearly shows the number of context switches to be 0.
Interestingly, when looking at the receiver process's /proc/${pid}/status, I monitor
voluntary_ctxt_switches and nonvoluntary_ctxt_switches and see that nonvoluntary_ctxt_switches increases at a rate of around 1 per second, while voluntary_ctxt_switches is constant once the data sharing starts. But the problem is that for the roughly 200 seconds of my setup's runtime, the number of latency spikes is around 2 or 3 and does not match the frequency of these context-switch numbers. (What is this count then?)
I also followed a thread which felt relevant, but couldn't get anything out of it.
For the core running the receiver process, the context-switch trace on core 1 is below (but the number of spikes this time was 5):
$ grep " 1)" trace | grep "=>"
1) jemallo-22010 => <idle>-0
2) <idle>-0 => kworker-138
3) kworker-138 => <idle>-0
I also checked the difference between /proc/interrupts before and after the run of the setup.
The differences are
name                                receiver_core    sender_core
enp1s0f0np1-0                       2                0
eno1                                0                3280
Non-maskable interrupts             25               25
Local timer interrupts              2K               ~3M
Performance monitoring interrupts   25               25
Rescheduling interrupts             9                12
Function call interrupts            120              110
machine-check polls                 1                1
I am not exactly sure what most of these numbers represent, but I am curious why there are rescheduling interrupts, and what enp1s0f0np1-0 is.
It might be that the spikes are not caused by context switches in the first place, but a number in the 600-microsecond range does hint in that direction. Leads in any other direction would be very helpful. I have also tried restarting the server.
It turns out the problem was indeed not related to context switches.
I was also dumping the received data to a file. Stopping that recording removed the spikes. So the high latency was due to some kind of write flush happening.
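A hedged sketch of one way to keep such recording without stalling the receive loop: the hot path only appends to an in-memory buffer, and a separate thread does the actual file writes, so any write flush happens off the critical path. The class and names are illustrative, not from the original code.

#include <condition_variable>
#include <cstdint>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class AsyncRecorder {
public:
    explicit AsyncRecorder(const std::string& path)
        : out_(path, std::ios::binary), writer_([this] { run(); }) {}
    ~AsyncRecorder() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        writer_.join();
    }
    void record(uint64_t value) {                       // called from the receive loop
        { std::lock_guard<std::mutex> lk(m_); pending_.push_back(value); }
        cv_.notify_one();
    }
private:
    void run() {                                        // background writer thread
        std::vector<uint64_t> batch;
        std::unique_lock<std::mutex> lk(m_);
        while (!done_ || !pending_.empty()) {
            cv_.wait(lk, [this] { return done_ || !pending_.empty(); });
            batch.swap(pending_);
            lk.unlock();                                // write (and possibly flush) without holding the lock
            if (!batch.empty())
                out_.write(reinterpret_cast<const char*>(batch.data()),
                           static_cast<std::streamsize>(batch.size() * sizeof(uint64_t)));
            batch.clear();
            lk.lock();
        }
    }
    std::ofstream out_;
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<uint64_t> pending_;
    bool done_ = false;
    std::thread writer_;
};

The remaining cost on the hot path is a short mutex hold and an occasional vector reallocation; a pre-sized buffer or a lock-free queue would remove even that.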

Response time of an application unintuitively correlated with the number of input triggers in a given time period

I have a multi-threaded C++ network application which listens for UDP packets as input - this data then hops through the application, being processed via various queues and threads, and is finally pushed out on a TCP socket.
What I am seeing is that if inputs come in slowly, let's say 5/sec, the total response time (in-to-out) is slow (let's say 100 ms), and if the inputs come in fast, e.g. 20/sec, the response time is also fast (~50 ms). This observation is really weird, and it also messes up the response time in the fast case because the first response is always slow. Just to make sure - the application is doing exactly the same amount of work in both the slow and fast cases.
Things that have been tried to investigate this -
It's a dual Xeon box running a Linux 2.6 kernel - I disabled Turbo Boost and made sure the processors are in C0 states.
Eliminated network causes; the root cause is within the box (in software or hardware).
I have a fake input going through the system from input to output on a timer to keep the application "warm" - no effect. (The application's worker threads are busy-waiting and pinned to cores.)
perf points indicate that EVERYTHING gets slower - which basically means that the processors are slowing down when not under continuous load - but nothing else suggests that (i7z/turbostat), or I am reading them incorrectly.
Does someone have any color on what might be happening?
In my experience, this would be the nasty doings of Nagle's devilish device. It sounds extremely plausible to me that more data in the buffer fills up a TCP packet, which then gets sent. Without much data, the TCP packet is waiting for the ACK from the other side, which is, as we all know, delayed.
Solution - learn to make sure Nagle's program killer is disabled the first thing after you create any send-capable TCP socket.
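For reference, a minimal sketch of doing exactly that on a POSIX socket (Winsock uses the same TCP_NODELAY option with setsockopt); the function name is illustrative:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Create a TCP socket and disable Nagle's algorithm immediately,
// so small writes go out at once instead of waiting to coalesce.
int make_tcp_socket_without_nagle() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    int flag = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) != 0) {
        // handle/log the error as appropriate for the application
    }
    return fd;
}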

Identify the reason for a 200 ms freeze in a time-critical loop

New description of the problem:
I currently run our new data acquisition software in a test environment. The software has two main threads. One contains a fast loop which communicates with the hardware and pushes the data into a dual buffer. Every few seconds, this loop freezes for 200 ms. I did several tests but none of them let me figure out what the software is waiting for. Since the software is rather complex and the test environment could also interfere with the software, I need a tool/technique to find out what the recorder thread is waiting for while it is blocked for 200 ms. Which tool would be useful to achieve this?
Original question:
In our data acquisition software, we have two threads that provide the main functionality. One thread is responsible for collecting the data from the different sensors and a second thread saves the data to disc in big blocks. The data is collected in a double buffer. It typically contains 100000 bytes per item and collects up to 300 items per second. One buffer is used to write to in the data collection thread and one buffer is used to read the data and save it to disc in the second thread. If all the data has been read, the buffers are switched. The switch of the buffers seems to be a major performance problem. Each time the buffer switches, the data collection thread blocks for about 200 ms, which is far too long. However, once in a while the switching is much faster, taking nearly no time at all. (Test PC: Windows 7 64 bit, i5-4570 CPU @ 3.2 GHz (4 cores), 16 GB DDR3 (800 MHz)).
My guess is that the performance problem is linked to the data being exchanged between cores. Only when the threads happen to run on the same core is the exchange much faster. I thought about setting the thread affinity mask to force both threads to run on the same core, but that also means I lose real parallelism. Another idea was to let the buffers collect more data before switching, but that dramatically reduces the update frequency of the data display, since it has to wait for the buffer switch before it can access the new data.
My question is: Is there a technique to move data from one thread to another which does not disturb the collection thread?
Edit: The double buffer is implemented as two std::vectors which are used as ring buffers. A bool (int) variable is used to tell which buffer is the active write buffer. Each time the double buffer is accessed, the bool value is checked to know which vector should be used. Switching the buffers in the double buffer just means toggling this bool value. Of course during the toggling all reading and writing is blocked by a mutex. I don't think that this mutex could possibly be blocking for 200 ms. By the way, the 200 ms are very reproducible for each switch event.
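For reference, a minimal sketch of the structure described in this edit - two buffers, a flag selecting the active write buffer, and a mutex held only around the access and the toggle. The names are illustrative, not the original code.

#include <mutex>
#include <utility>
#include <vector>

class DoubleBuffer {
public:
    void push(std::vector<char> item) {                 // data collection thread
        std::lock_guard<std::mutex> lk(m_);
        buffers_[writeIndex_].push_back(std::move(item));
    }
    // Saving thread: toggle the active buffer and take everything collected so far.
    std::vector<std::vector<char>> takeAll() {
        std::lock_guard<std::mutex> lk(m_);
        int readIndex = writeIndex_;
        writeIndex_ = 1 - writeIndex_;                  // toggle the flag
        std::vector<std::vector<char>> out = std::move(buffers_[readIndex]);
        buffers_[readIndex].clear();                    // leave a valid, empty buffer behind
        return out;
    }
private:
    std::mutex m_;
    int writeIndex_ = 0;                                // which buffer is currently written to
    std::vector<std::vector<char>> buffers_[2];
};

As the answers below note, a mutex held only for this toggle cannot by itself account for 200 ms.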
Locking and releasing a mutex just to switch one bool variable will not take 200ms.
The main problem is probably that the two threads are blocking each other in some way.
This kind of blocking is called lock contention. Basically, it occurs whenever one process or thread attempts to acquire a lock held by another process or thread. Instead of parallelism you have two threads waiting for each other to finish their part of the work, with an effect similar to a single-threaded approach.
For further reading I recommend this article, which describes lock contention in more detail.
Since you are running on Windows, maybe you use Visual Studio? If so, I would resort to the VS profiler, which is quite good (IMHO) in such cases, as long as you don't need to check data/instruction caches (then Intel's VTune is the natural choice). From my experience, VS is good enough to catch contention problems as well as CPU bottlenecks. You can run it directly from VS or as a standalone tool; you don't need VS installed on your test machine, you can just copy the tool and run it locally.
VSPerfCmd.exe /start:SAMPLE /attach:12345 /output:samples - attach to process 12345 and gather CPU sampling info
VSPerfCmd.exe /detach:12345 - detach from process
VSPerfCmd.exe /shutdown - shutdown the profiler, the samples.vsp is written (see first line)
Then you can open the file and inspect it in Visual Studio. If you don't see anything making your CPU busy, switch to contention profiling - just change the "start" argument from "SAMPLE" to "CONCURRENCY".
The tool is located under %YourVSInstallDir%\Team Tools\Performance Tools\; AFAIR it has been available since VS2010.
Good luck
After discussing the problem in the chat, it turned out that the Windows Performance Analyser is a suitable tool to use. The software is part of the Windows SDK and can be opened using the command wprui in a command window. (Alois Kraus posted this useful link: http://geekswithblogs.net/akraus1/archive/2014/04/30/156156.aspx in the chat). The following steps revealed what the software had been waiting on:
Record information with the WPR using the default settings and load the saved file in the WPA.
Identify the relevant thread. In this case, the recording thread and the saving thread obviously had the highest CPU load. The saving thread could be easily identified: since it saves data to disc, it is the one with file access. (Look at Memory->Hard Faults.)
Check out Computation->CPU usage (Precise) and select Utilization by Process, Thread. Select the process you are analysing. Best display the columns in the order: NewProcess, ReadyingProcess, ReadyingThreadId, NewThreadID, [yellow bar], Ready (µs) sum, Wait(µs) sum, Count...
Under ReadyingProcess, I looked for the process with the largest Wait (µs) since I expected this one to be responsible for the delays.
Under ReadyingThreadID I checked each line referring to the thread with the delays in the NewThreadId column. After a short search, I found a thread that showed frequent Waits of about 100 ms, which always showed up as a pair. In the column ReadyingThreadID, I was able to read the id of the thread the recording loop was waiting for.
According to its CPU usage, this thread did basically nothing. In our special case, this led me to the assumption that the serial port I/O commands could cause this wait. After deactivating them, the delay was gone. The important discovery was that the 200 ms delay was in fact composed of two 100 ms delays.
Further analysis showed that the fetch-data command sent via the virtual serial port pair sometimes gets lost. This might be linked to very high CPU load in the data saving and compression loop. If the fetch command gets lost, no data is received, and both the first and the second attempt to receive the data time out after their 100 ms timeout.

Can I set a single thread's priority above 15 for a normal priority process?

I have a data acquisition application running on Windows 7, using VC2010 in C++. One thread is a heartbeat which sends out a change every .2 seconds to keep-alive some hardware which has a timeout of about .9 seconds. Typically the heartbeat call takes 10-20ms and the thread spends the rest of the time sleeping.
Occasionally, however, there will be a delay of 1-2 seconds and the hardware will shut down momentarily. The heartbeat thread is running at THREAD_PRIORITY_TIME_CRITICAL, which is 15 for a normal-priority process. My other threads are running at normal priority, although I use a DLL to control some other hardware and have noticed with Process Explorer that it starts several threads running at level 15.
I can't track down the source of the slowdown, but other threads in my application are seeing the same kind of delays when this happens. I have made several optimizations to the heartbeat code even though it is quite simple, but the occasional failures are still happening. Now I wonder if I can increase the priority of this thread beyond 15 without specifying REALTIME_PRIORITY_CLASS for the entire process. If not, are there any downsides I should be aware of to using REALTIME_PRIORITY_CLASS? (Other than this heartbeat thread, the rest of the application doesn't have real-time timing needs.)
(Or does anyone have any ideas about how to track down these slowdowns...not sure if the source could be in my app or somewhere else on the system).
Update: So I hadn't actually tried passing 31 into my AfxBeginThread call, and it turns out it ignores that value and sets the thread to normal priority instead of the 15 that I get with THREAD_PRIORITY_TIME_CRITICAL.
Update: Turns out running the Disk Defragmenter is a good way to cause lots of thread delays. Even running the process at REALTIME_PRIORITY_CLASS and the heartbeat thread at THREAD_PRIORITY_TIME_CRITICAL (level 31) doesn't seem to help. Next thing to try is calling AvSetMmThreadCharacteristics("Pro Audio")
Update: Scheduling heartbeat thread as "Pro Audio" does work to increase the thread's priority beyond 15 (Base=1, Dynamic=24) but it doesn't seem to make any real difference when defrag is running. I've been able to correlate many of the slowdowns with the disk defragmenter so turned off the weekly scan. Still can't explain some delays so we're going to increase to a 5-10 second watchdog timeout.
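For reference, a minimal sketch of the MMCSS registration mentioned in the updates above (link against avrt.lib; the loop body is a placeholder):

#include <windows.h>
#include <avrt.h>

void heartbeat_thread()
{
    // Register this thread with MMCSS under the "Pro Audio" task, which
    // raises its priority beyond the 15 available to a normal-priority process.
    DWORD taskIndex = 0;
    HANDLE mmcss = AvSetMmThreadCharacteristics(TEXT("Pro Audio"), &taskIndex);
    // mmcss is NULL on failure; check GetLastError() in real code.

    // ... heartbeat loop: send the keep-alive roughly every 200 ms ...

    if (mmcss != NULL)
        AvRevertMmThreadCharacteristics(mmcss);
}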
Even if you could, increasing the priority will not help. The highest priority runnable thread gets the processor at all times.
Most likely there is some extended interrupt processing occurring while interrupts are disabled. Interrupts effectively work at a higher priority than any thread.
It could be video, network, disk, serial, USB, etc. It will take some insight to selectively disable drivers or use alternate drivers to see if the system hesitation is affected. Once you find it, figuring out a way to prevent it might range from trivial to impossible, depending on what it is.
Without more knowledge about the system, it is hard to say. Have you tried running it on a different PC?
Officially you can't use REALTIME threads in a process which does not have the REALTIME_PRIORITY_CLASS.
Unofficially you could play with the undocumented NtSetInformationThread
see:
http://undocumented.ntinternals.net/UserMode/Undocumented%20Functions/NT%20Objects/Thread/NtSetInformationThread.html
But since I have not tried it, I don't have any more info about this.
On the other hand, as was said before, you can never be sure that the OS will not take its time when your thread's quantum expires. Certain poorly written drivers are often the cause of such latency.
Otherwise, there is software which can tell you if you have misbehaving kernel parts:
http://www.thesycon.de/deu/latency_check.shtml
I would try using CreateWaitableTimer() & SetWaitableTimer() and see if they are subject to the same preemption problems.
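A minimal sketch of that suggestion - a periodic waitable timer driving the 200 ms heartbeat (error handling omitted; the loop body is a placeholder):

#include <windows.h>

void heartbeat_with_waitable_timer()
{
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);   // auto-reset timer

    LARGE_INTEGER dueTime;
    dueTime.QuadPart = -2000000LL;   // first fire after 200 ms (100-ns units, negative = relative)
    SetWaitableTimer(timer, &dueTime, 200 /* period, ms */, NULL, NULL, FALSE);

    for (;;)
    {
        WaitForSingleObject(timer, INFINITE);
        // ... send the keep-alive to the hardware ...
    }
    // CancelWaitableTimer(timer); CloseHandle(timer);       // not reached in this sketch
}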

Detecting the cause of CPU usage jumps in a C++ process

Given: multithreaded (~20 threads) C++ application under RHEL 5.3.
When testing under load, top shows that CPU usage jumps around in the range of 10-40% every second.
The design is mostly pretty simple - most of the threads implement the active object design pattern: each thread has a thread-safe queue, requests from other threads are pushed onto the queue, and the thread polls the queue and processes incoming requests. A processed request causes a new request to be pushed to the next processing thread.
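For reference, a minimal sketch of this kind of active object - a worker thread draining its own thread-safe queue (this version blocks on a condition variable rather than busy-polling; the names are illustrative, not from the original application):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

class ActiveObject {
public:
    ActiveObject() : worker_([this] { run(); }) {}
    ~ActiveObject() {
        post([this] { stop_ = true; });                 // processed in order, after pending requests
        worker_.join();
    }
    void post(std::function<void()> request) {          // called from other threads
        { std::lock_guard<std::mutex> lk(m_); queue_.push(std::move(request)); }
        cv_.notify_one();
    }
private:
    void run() {
        while (!stop_) {
            std::function<void()> request;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !queue_.empty(); });
                request = std::move(queue_.front());
                queue_.pop();
            }
            request();                                  // may post a follow-up to the next ActiveObject
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> queue_;
    bool stop_ = false;                                 // only touched from the worker thread
    std::thread worker_;
};

Under mutex contention, threads built like this spend their time parked on the queue lock rather than doing work, which is one way the CPU usage pattern described above can arise.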
The process has several TCP/UDP connections, over each of which data is received/sent under high load.
I know I did not provide sufficient data. This is a pretty big application, and I'm not familiar with all of its parts. It was recently ported from Windows to Linux over the ACE library (used for the networking part).
Supposing the problem is in the application and not an external one, what techniques/tools/approaches can be used to discover the problem? For example, I suspect that this may be caused by some mutex contention.
I faced a similar problem some time back, and here are the steps that helped me.
1) Start by using strace to see where the application is spending its time executing system calls.
2) Use OProfile to profile both the application and the kernel.
3) If you are using an SMP system, look at the NUMA settings. In my case that caused havoc.
/proc/appPID/numa_maps gives a quick look at how memory is being accessed; NUMA misses can cause the jumps.
4) You have mentioned TCP connections in your app. Look at the MTU size and check that it is set to the right value, and depending on the type of data being transferred, use Nagle's delay appropriately.