ZeroMQ: how to reduce multithread-communication latency with inproc? - c++

I'm using inproc and PAIR to achieve inter-thread communication and trying to solve a latency problem due to polling. Correct me if I'm wrong: Polling is inevitable, because a plain recv() call will usually block and cannot take a specific timeout.
In my current case, among N threads, each of the N-1 worker threads has a main while-loop. The N-th thread is a controller thread that may notify all the worker threads to quit at any time. However, the worker threads have to poll with a timeout to get that quit message. This introduces latency; the timeout parameter is usually 1000 ms.
Here is an example
while (true) {
    const std::chrono::milliseconds nTimeoutMs(1000);
    std::vector<zmq::poller_event<std::size_t>> events(n);
    // wait_all() blocks for up to nTimeoutMs and returns the number of ready events
    size_t nEvents = m_poller.wait_all(events, nTimeoutMs);
    bool isToQuit = false;
    // only the first nEvents entries of events are filled in
    for (size_t i = 0; i < nEvents; ++i) {
        auto& evt = events[i];
        zmq::message_t out_recved;
        try {
            evt.socket.recv(out_recved, zmq::recv_flags::dontwait);
        }
        catch (std::exception& e) {
            trace("{}: Caught exception while polling: {}. Skipped.", GetLogTitle(), e.what());
            continue;
        }
        if (!out_recved.empty()) {
            if (IsToQuit(out_recved))
                isToQuit = true;
            break;
        }
    }
    if (isToQuit)
        break;
    //
    // main business
    //
    ...
}
To make things worse, when the main loop has nested loops, the worker threads then need to include more polling code in each layer of the nested loops. Very ugly.
The reason why I chose ZMQ for multithread communication is because of its elegance and the potential of getting rid of thread-locking. But I never realized the polling overhead.
Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation? Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?

A statement ( a hypothesis ) posted above:
"...a plain recv() call will usually block and cannot take a specific timeout."
is not correct:
- a plain .recv( ZMQ_NOBLOCK ) call will never "block" ( ZMQ_NOBLOCK is the older name of ZMQ_DONTWAIT ),
- a plain .recv( ZMQ_NOBLOCK ) call can be wrapped so as to mimic "a specific timeout".
Another statement ( a hypothesis ) posted above:
"...have to use polling with a timeout ... introduces a latency, the latency parameter is usually 1000ms."
is not correct either:
- one need not use polling with a timeout,
- still less need one set a 1000 ms code-"injected" latency, which is obviously spent only in the no-new-message state.
Q : "Am I able to achieve the typical latency when using a regular mutex or an std::atomic data operation?"
Yes.
Q : "Should I understand that the inproc is in fact a network communication pattern in disguise so that some latency is inevitable?"
No. The inproc transport class is the fastest of all the transports: it is protocol-less / stack-less in principle and has more to do with very fast pointer mechanics, as in the pointer management of a dual-ended ring buffer.
The Best Next Step:
1 ) Re-factor your code so that it always uses only the zero-wait { .poll() | .recv() } methods, properly wrapped for both event-specific and no-event-specific looping.
2 ) If you then want to shave the last few [us] off the loop turn-around time, you may tune the Context() instance to work with a larger number of I/O threads ( nIOthreads > N ) "under the hood". Note that inproc traffic does not pass through the I/O threads, so this matters only if other transports are in use as well.
optionally 3 ) For almost hard-real-time system designs, one may finally map the Context() threads and the socket-handling threads deterministically onto specific, non-overlapping CPU cores ( using a carefully crafted affinity map ).
Having set 1000 [ms] in code, it is not fair to complain about spending those very 1000 [ms] waiting in a timeout that one has coded oneself. There is no excuse for doing this.
Do not blame ZeroMQ for behaviour that was coded on the application side of the API.
Never.
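For comparison with the std::atomic route that the question mentions, here is a minimal sketch of a zero-wait quit check. It uses only the standard library rather than ZeroMQ, and the names `run_worker` and `step` are hypothetical, chosen for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>

// Runs `step` repeatedly until `quit` becomes true.
// The quit check is a single atomic load: it never blocks and has no timeout.
// Returns the number of completed loop passes.
std::size_t run_worker(const std::atomic<bool>& quit,
                       const std::function<void(std::size_t)>& step) {
    std::size_t iterations = 0;
    while (!quit.load(std::memory_order_acquire)) {
        step(iterations);  // the "main business" of one loop pass
        ++iterations;
    }
    return iterations;
}
```

A controller thread would call `quit.store(true, std::memory_order_release)`; the worker notices it at its next check, so the quit latency is bounded by the duration of one `step` rather than by a poll timeout. The same loop shape applies when the check is a recv() with the dontwait flag instead of an atomic load.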

Related

boost::asio write() API stuck while writing the data

While the server sends data to a client (multiple chunks of data), if the client stops reading after some packets, the server gets stuck in boost::asio::write(), which results in unwanted behavior of the product.
We thought of switching to async_write() with a timer over it, so that we could fall back to the original good state if such a condition occurs. However, due to design faults we could not use the io_service (because of high concurrency) after async_write(), so we never got the callbacks needed to stop the timer.
So, is there any way (without using io_service) to unblock the write() API?
Something like executing write() on a separate thread and terminating it through some timer. But then the question arises: is there any way to clear out the boost buffers that already hold pending write data?
Any help would be appreciated.
Thanks.
Eventually went with using boost::asio::async_write() but with io_service::poll() -> poll being non-blocking.
run() was not an option as the system is highly concurrent and read/write had to share the same io_service.
Pseudo code looks something like this:
data_to_write = size of data
set current_bytes_transferred = 0
set timeout_occurred to false
/*
   current_bytes_transferred -> updated by the async_write() callback
   timeout_occurred          -> set by a separate timer
*/
// keep polling until the write completes or the timer fires
while ((data_to_write != current_bytes_transferred) && (!timeout_occurred))
{
    // poll() is used instead of run() as the system
    // has high concurrency and read and write operations
    // share the same io_service
    io_service.poll();
    if (data_to_write == current_bytes_transferred)
    {
        // SUCCESS write logic
    }
    else if (timeout_occurred)
    {
        // timeout logic
    }
}
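The control flow of that loop can be sketched without Boost, assuming the completion state and the non-blocking poll step are passed in as callables. `pump_until_done`, `poll_once`, `done` and `WriteResult` are hypothetical names; `poll_once` stands in for io_service.poll(), and the timer is replaced here by a steady-clock deadline.

```cpp
#include <chrono>
#include <functional>

// Outcome of driving the loop.
enum class WriteResult { Completed, TimedOut };

// Calls `poll_once` (a stand-in for a non-blocking io_service.poll())
// until `done()` reports completion or `timeout` has elapsed.
WriteResult pump_until_done(const std::function<void()>& poll_once,
                            const std::function<bool()>& done,
                            std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline)
            return WriteResult::TimedOut;
        poll_once();  // runs whatever handlers are ready, never blocks
    }
    return WriteResult::Completed;
}
```

Like the pseudo code, this spins on the poll step, so it trades CPU time for not blocking.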

ZeroMQ - pub / sub latency

I'm looking into ZeroMQ to see if it's a fit for a soft-realtime application. I was very pleased to see that the latency for small payloads were in the range of 30 micro-seconds or so. However in my simple tests, I'm getting about 300 micro-seconds.
I have a simple publisher and subscriber, basically copied from examples off the web and I'm sending one byte through localhost.
I've played around for about two days w/ different sockopts and I'm striking out.
Any help would be appreciated!
publisher:
#include <iostream>
#include <zmq.hpp>
#include <unistd.h>
#include <sys/time.h>

int main()
{
    zmq::context_t context (1);
    zmq::socket_t publisher (context, ZMQ_PUB);
    publisher.bind("tcp://*:5556");
    struct timeval timeofday;
    zmq::message_t msg(1);
    while(true)
    {
        gettimeofday(&timeofday,NULL);
        publisher.send(msg);
        std::cout << timeofday.tv_sec << ", " << timeofday.tv_usec << std::endl;
        usleep(1000000);
    }
}
subscriber:
#include <iostream>
#include <zmq.hpp>
#include <sys/time.h>

int main()
{
    zmq::context_t context (1);
    zmq::socket_t subscriber (context, ZMQ_SUB);
    subscriber.connect("tcp://localhost:5556");
    subscriber.setsockopt(ZMQ_SUBSCRIBE, "", 0);
    struct timeval timeofday;
    zmq::message_t update;
    while(true)
    {
        subscriber.recv(&update);
        gettimeofday(&timeofday,NULL);
        std::cout << timeofday.tv_sec << ", " << timeofday.tv_usec << std::endl;
    }
}
Is the Task Definition real?
When speaking about any *-real-time design, validating the capability of the architecture matters more than the implementation that follows.
Taking your source code as-is, your readings ( which would ideally be posted together with your code snippets, so that an MCVE re-test can cross-validate them ) will not tell much, because the numbers do not distinguish how much time was spent in the sending-side loop, in the sending-side zmq data acquisition / copy / scheduling / wire-level formatting / datagram dispatch, and in the receiving-side unloading from the media / copy / decode / pattern-match / propagation to the receiver buffer(s).
If interested in ZeroMQ internals, there are good performance-related application notes available.
If striving for a minimum-latency design:
- remove all overheads
- remove all tcp-header processing from the proposed PUB/SUB channel by replacing the tcp transport
- avoid all non-essential logic overheads in processing: there is no sense in spending time on subscriber-side pattern matching ( sure, newer versions of ZMQ have moved to publisher-side filtering, but the idea is clear ) that is built into the selected archetype; using ZMQ_PAIR avoids any such matching, independently of the transport class. If something is intended to block, rather change the signalling socket layout so as to avoid blocking in principle ( this ought to be a real-time system, as you have said above )
- apply "latency masking" where possible on the target multi-core / many-core hardware, so as to squeeze the last drops of spare time out of your hardware / tools ... benchmark experimental setups with the help of more I/O threads, zmq::context_t context( N ); where N > 1
Missing target:
As Alice in Wonderland taught more than a century ago, whenever no goal is defined, any road leads to the target.
With a soft-real-time ambition, it should not be a problem to state a maximum allowed end-to-end latency and to derive from it a constraint for the transport-layer latency.
Without that, 30 us, 300 us or even 3 ms have no meaning per se, so no one can decide whether these figures are "enough" for some subsystem or not.
A reasonable next step:
- define real-time stability horizon(s) ... if used for real-time control
- define real-time design constraints ... for signal / data acquisition(s), for processing task(s), for self-diagnostic & control services
- avoid any blocking design-wise, and validate / prove that no blocking will ever appear under any possible real-world operating circumstances [ formal proof methods are ready for such a task ]. No one would like to see an AlertPanel [ Waiting for data ] during their next jet landing, or to have a lovely [ hour-glass ] animated icon be the last thing seen before an autonomous car crashes right into a wall, because the control system got busy, for whatever reason, in a devastatingly blocking manner.
Quantified targets make sense for testing.
If a given threshold permits a 500 ms stability horizon ( which may be a safe value for a slow hydraulic actuator / control loop, but may fail for a guided-missile control system, and still more for any system without [ mass & moment-of-inertia ], such as the DSP family of RT control systems ), you can test end-to-end whether your processing fits within it.
If you know that your incoming data stream brings about 10 kB every 500 us, you can test whether your design can keep pace with the burst traffic or not.
If your tests show that the mock-up design misses the target ( does not meet the performance / time-constrained figures ), you know pretty well where the design or the architecture needs to be improved.
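Testing against a quantified target can be sketched as follows, assuming the measured end-to-end latencies have already been collected in microseconds. `percentile_us` and `meets_budget` are hypothetical helper names, and the nearest-rank percentile is just one common choice of statistic.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: the smallest sample such that at least
// `fraction` (0 < fraction <= 1) of all samples are <= it.
// `samples_us` must not be empty.
double percentile_us(std::vector<double> samples_us, double fraction) {
    std::sort(samples_us.begin(), samples_us.end());
    std::size_t rank =
        static_cast<std::size_t>(std::ceil(fraction * samples_us.size()));
    rank = std::max<std::size_t>(1, std::min(rank, samples_us.size()));
    return samples_us[rank - 1];
}

// True if the chosen percentile of the measured latencies fits the budget.
bool meets_budget(const std::vector<double>& samples_us,
                  double fraction, double budget_us) {
    return percentile_us(samples_us, fraction) <= budget_us;
}
```

With a stated budget, a reading such as 300 us becomes a pass or a fail instead of a bare number.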
First make sure you run producer and consumer on different physical cores (not HT).
Second, it depends A LOT on the hardware and OS. Last time I measured kernel IO (4-5 years ago) the results were indeed 10 to 20us around send/recv system calls.
You have to tune your kernel settings for low latency and set TCP_NODELAY.

High CPU usage while using poll system call to wait on fds

I have an unusual problem: the Linux poll system call used in my code reports the fds it waits on as readable (POLLIN) every millisecond. This is causing high CPU usage. I have supplied a timeout of 100 milliseconds, but it seems to be of no use. Can anyone suggest an alternative?
for (;;) {
    ACE_Time_Value doWork(0, 20000);
    // Added to decrease CPU usage, but it causes low throughput.
    // On removing this we see high CPU usage, but the throughput is achieved.
    ACE_OS::sleep(doWork);
    ..
    .
    ..
    if ((exitCode = fxDoWork()) < 0) {
        break;
    }
}

fxDoWork()
{
    ACE_Time_Value selectTime;
    selectTime.set(0, 100000);
    ..
    ..
    ..
    // POLLIN is reported every millisecond; the timeout is not useful at all
    ACE_INT32 waitResult = ACE_OS::poll(myPollfds, eventCount, &selectTime);
    ..
    ..
    ..
}
===============================================================
It sounds like you want to accumulate enough data OR a specific timeout happens to reduce CPU usage, right? If that's the case, you can use recvmmsg(): http://man7.org/linux/man-pages/man2/recvmmsg.2.html
The recvmmsg() system call is an extension of recvmsg(2) that allows
the caller to receive multiple messages from a socket using a single
system call. (This has performance benefits for some applications.)
A further extension over recvmsg(2) is support for a timeout on the
receive operation.
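A minimal sketch of draining several datagrams per wake-up with recvmmsg(), assuming a Linux datagram socket. `receive_batch`, `kBatch` and `kMaxDatagram` are hypothetical names and sizes. The sketch passes no timeout and uses MSG_WAITFORONE instead, because the BUGS section of the same man page notes that the timeout is checked only after each datagram is received.

```cpp
#include <sys/socket.h>  // recvmmsg, struct mmsghdr (Linux-specific)
#include <sys/uio.h>     // struct iovec
#include <array>
#include <cstddef>
#include <string>
#include <vector>

// Receives up to kBatch datagrams from `fd` with a single system call.
// MSG_WAITFORONE blocks until the first datagram arrives, then collects
// whatever else is already queued without waiting any further.
// Returns the received payloads; the result is empty on error.
std::vector<std::string> receive_batch(int fd) {
    constexpr unsigned kBatch = 8;               // assumed batch size
    constexpr std::size_t kMaxDatagram = 2048;   // assumed max payload

    std::array<std::array<char, kMaxDatagram>, kBatch> buffers{};
    std::array<iovec, kBatch> iovecs{};
    std::array<mmsghdr, kBatch> headers{};

    for (unsigned i = 0; i < kBatch; ++i) {
        iovecs[i].iov_base = buffers[i].data();
        iovecs[i].iov_len = buffers[i].size();
        headers[i].msg_hdr.msg_iov = &iovecs[i];
        headers[i].msg_hdr.msg_iovlen = 1;
    }

    const int received =
        recvmmsg(fd, headers.data(), kBatch, MSG_WAITFORONE, nullptr);

    std::vector<std::string> out;
    for (int i = 0; i < received; ++i)
        out.emplace_back(buffers[i].data(), headers[i].msg_len);
    return out;
}
```

Called once after poll() reports POLLIN, this handles a burst of queued datagrams in one pass instead of one wake-up per datagram.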

How to call a method/function 50 times in a second

How do I call a method/function 50 times in a second and then calculate the time spent? If the time spent is less than one second, then sleep for (1 - timeSpent) seconds.
Below is the pseudo code
start_time = // find current time
int msg_count = 0;
while(1)
{
    send_msg();
    msg_count++;
    // Check time after sending 50 messages
    if(msg_count % 50 == 0)
    {
        curr_time = // find current time
        int timeSpent = curr_time - start_time;
        int waitingTime = (timeSpent < 1 sec) ? (1 sec - timeSpent) : 0;
        wait for waitingTime;
        start_time = // find current time (start of the next batch)
    }
}
I am new to timer APIs. Can anyone tell me which timer APIs I have to use to achieve this? I want portable code.
First, read the time(7) man page.
Then you may want to call timer_create(2) to set up a timer. To query about time, use clock_gettime(2)
You may also want to wait and multiplex on some input and output. poll(2) is useful for this. To sleep for a small amount of time without using the CPU, consider nanosleep(2).
If using timers that deliver signals, read signal(7) and be careful, because signal handlers are restricted to async-signal-safe functions (consider having a signal handler which just sets some global volatile sig_atomic_t flag). You may also be interested in the Linux-specific timerfd_create(2) (whose descriptor you could poll or pass to your event loop).
You might want to use some existing event loop library, like libevent or libev (or those from GTK/Glib, Qt, etc...), which are often using poll (or fancier things). The linux specific eventfd(2) and signalfd(2) might be very helpful.
Advanced Linux Programming is also useful to read.
If send_msg is doing network I/O, you probably need to redesign your program around some event loop (perhaps your own, based on poll): you'll need to multiplex (i.e. poll) on both network sends and network receives. Continuation-passing style is then a useful paradigm.
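Since the question asks for portable code, here is a minimal sketch of the batch-then-sleep scheme using only the C++11 standard library (std::chrono and std::this_thread) rather than the POSIX calls above. `send_in_batches` and its parameters are hypothetical names; `send_msg` is passed in as a callable.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Sends `per_period` messages, then sleeps for whatever is left of
// `period`; repeats this `periods` times. Returns the total number sent.
int send_in_batches(const std::function<void()>& send_msg,
                    int per_period, int periods,
                    std::chrono::milliseconds period) {
    using clock = std::chrono::steady_clock;  // monotonic, unaffected by clock changes
    int sent = 0;
    for (int p = 0; p < periods; ++p) {
        const auto start = clock::now();
        for (int i = 0; i < per_period; ++i) {
            send_msg();
            ++sent;
        }
        // sleep_until returns immediately if the deadline has already
        // passed, which covers the "time spent >= period" case.
        std::this_thread::sleep_until(start + period);
    }
    return sent;
}
```

For the case in the question, the call would be `send_in_batches(send_msg, 50, periods, std::chrono::milliseconds(1000))`.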

Limit iterations per time unit

Is there a way to limit iterations per time unit? For example, I have a loop like this:
for (int i = 0; i < 100000; i++)
{
// do stuff
}
I want to limit the loop above so there will be maximum of 30 iterations per second.
I would also like the iterations to be evenly positioned in the timeline so not something like 30 iterations in first 0.4s and then wait 0.6s.
Is that possible? It does not have to be completely precise (though the more precise it will be the better).
@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
Then you should probably have the receiving program send an acknowledgment when it has finished receiving the last chunk of data you sent, and only then send the next chunk. Anything else will just cause you frustration down the line as circumstances change.
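Suppose you have a good Now() function (GetTickCount() is a bad example: it is OS-specific and has poor precision):
for (int i = 0; i < 1000; i++) {
    DWORD iteration_start = GetTickCount();
    // do stuff
    // DWORD is unsigned, so compare before subtracting; the elapsed time
    // itself stays correct even if the tick counter wraps around.
    DWORD elapsed = GetTickCount() - iteration_start;
    if (elapsed < EXPECTED_ITERATION_TIME_MS)
        Sleep(EXPECTED_ITERATION_TIME_MS - elapsed);
}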
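A portable variant of the same idea can be sketched with std::chrono::steady_clock as the Now() function. `run_paced` is a hypothetical name. Deadlines are kept absolute, so the iterations stay evenly spaced and one slow iteration does not shift all the later ones.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Runs `body(i)` for i in [0, count), starting the iterations at a fixed
// spacing of `interval`. Each deadline is derived from the previous
// deadline, not from the time the previous iteration finished.
void run_paced(int count, std::chrono::microseconds interval,
               const std::function<void(int)>& body) {
    using clock = std::chrono::steady_clock;
    auto next = clock::now();
    for (int i = 0; i < count; ++i) {
        std::this_thread::sleep_until(next);  // returns at once if already past
        body(i);
        next += interval;
    }
}
```

For 30 iterations per second the interval would be about `std::chrono::microseconds(33333)`. The precision still depends on the scheduler's sleep granularity.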
You can check the elapsed time inside the loop, but that may not be the usual solution. Because computation time depends entirely on the performance of the machine and the algorithm, people optimize it during development (e.g. many game programmers require at least 25-30 frames per second for properly smooth animation).
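The easiest way (on Windows) is to use QueryPerformanceCounter(). A sketch follows.
#include <windows.h>

LARGE_INTEGER freq, c1, c2, c3, c4;
QueryPerformanceFrequency(&freq);
// time per iteration in milliseconds for 30 iterations / sec
const double timeWanted = 1e3 / 30.0;
for (int i = 0; i < 100000; i++) {
    QueryPerformanceCounter(&c1);
    // do stuff
    QueryPerformanceCounter(&c2);
    // time spent in "do stuff", in milliseconds
    double timeElapsed = (double)(c2.QuadPart - c1.QuadPart) * 1e3 / (double)freq.QuadPart;
    double timeDiff = timeWanted - timeElapsed;
    if (timeDiff > 0) {
        // busy-wait for the rest of this iteration's time slot
        QueryPerformanceCounter(&c3);
        QueryPerformanceCounter(&c4);
        while ((double)(c4.QuadPart - c3.QuadPart) * 1e3 / (double)freq.QuadPart < timeDiff)
            QueryPerformanceCounter(&c4);
    }
}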
EDIT: You must make sure that the 'do stuff' area takes less time than one frame period, or else the pacing has no effect. Also, instead of 1e3 for milliseconds, you can go all the way to nanoseconds with 1e9 (if you want that much accuracy).
WARNING... this will eat your CPU but give you good 'software' timing... Do it in a separate thread (and only if you have more than 1 processor) so that any GUIs won't lock. You can put a conditional in there to stop the loop if this is a multi-threaded app too.
@FredOverflow My program is running very fast. It is sending data over wifi to another program which is not fast enough to handle them at the current rate. – Richard Knop
What you might need is a buffer or queue on the receiver side. The thread that receives the messages from the client (e.g. through a socket) gets each message and puts it in the queue. The actual consumer of the messages reads/pops from the queue. Of course you need concurrency control for your queue.
Besides the flow-control methods mentioned, you may also need to maintain an accurate, specific sending rate on the sender side. Usually it can be done like this.
E.g. if you want to send at 10 Mbps, create a timer with a 1 ms interval so that it calls a predefined function every 1 ms. Then, in the timer handler function, by keeping track of 2 static variables, 1) the time elapsed since sending began and 2) how many bytes had been sent up to the last call, you can easily calculate how much data needs to be sent in the current call (or just sleep and wait for the next call).
This way, you can "stream" data in a very stable way with very little jitter, and this approach is usually adopted in video streaming. Of course it also depends on how accurate the timer is.
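The calculation done in the timer handler can be sketched as a pure function of the two tracked values. `bytes_due` is a hypothetical name, and the sketch assumes a non-negative elapsed time and a bit rate that is a multiple of 8.

```cpp
#include <chrono>
#include <cstdint>

// Bytes that should be sent now to stay on the target bit rate, given the
// time elapsed since sending began and the bytes already sent.
// Returns 0 when the sender is ahead of schedule.
std::uint64_t bytes_due(std::uint64_t rate_bits_per_sec,
                        std::chrono::microseconds elapsed,
                        std::uint64_t bytes_sent) {
    // total bytes that should have gone out by now
    const std::uint64_t target_bytes =
        rate_bits_per_sec / 8 *
        static_cast<std::uint64_t>(elapsed.count()) / 1000000;
    return target_bytes > bytes_sent ? target_bytes - bytes_sent : 0;
}
```

At 10 Mbps and a 1 ms tick this yields 1250 bytes per call. Because the amount is derived from the total elapsed time, a late timer tick is compensated by a larger send on the next call rather than accumulating as drift.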