Crash in a modified version of an official ZeroMQ multithreaded example - C++

I'm new to zmq and cppzmq. While trying to run the multithreaded example from the official guide (http://zguide.zeromq.org/cpp:mtserver), I hit a few problems.
My setup:
macOS Mojave, Xcode 10.3
libzmq 4.3.2 via Homebrew
cppzmq GitHub HEAD
Problem 1
When running the source code in the guide, it hangs forever without any output showing up on stdout.
Here is the code directly copied from the Guide.
/*
Multithreaded Hello World server in C
*/
#include <pthread.h>
#include <unistd.h>
#include <cassert>
#include <cstring> // for memcpy
#include <string>
#include <iostream>
#include <zmq.hpp>
void *worker_routine (void *arg)
{
zmq::context_t *context = (zmq::context_t *) arg;
zmq::socket_t socket (*context, ZMQ_REP);
socket.connect ("inproc://workers");
while (true) {
// Wait for next request from client
zmq::message_t request;
socket.recv (&request);
std::cout << "Received request: [" << (char*) request.data() << "]" << std::endl;
// Do some 'work'
sleep (1);
// Send reply back to client
zmq::message_t reply (6);
memcpy ((void *) reply.data (), "World", 6);
socket.send (reply);
}
return (NULL);
}
int main ()
{
// Prepare our context and sockets
zmq::context_t context (1);
zmq::socket_t clients (context, ZMQ_ROUTER);
clients.bind ("tcp://*:5555");
zmq::socket_t workers (context, ZMQ_DEALER);
workers.bind ("inproc://workers");
// Launch pool of worker threads
for (int thread_nbr = 0; thread_nbr != 5; thread_nbr++) {
pthread_t worker;
pthread_create (&worker, NULL, worker_routine, (void *) &context);
}
// Connect work threads to client threads via a queue
zmq::proxy (static_cast<void*>(clients),
static_cast<void*>(workers),
nullptr);
return 0;
}
It crashes soon after I put a breakpoint in the while loop of the worker.
Problem 2
Noticing that the compiler prompted me to replace deprecated API calls, I modified the above sample code to make the warnings disappear.
/*
Multithreaded Hello World server in C
*/
#include <pthread.h>
#include <unistd.h>
#include <array>   // for std::array
#include <cassert>
#include <cstring> // for memcpy
#include <string>
#include <iostream>
#include <cstdio>
#include <zmq.hpp>
void *worker_routine (void *arg)
{
zmq::context_t *context = (zmq::context_t *) arg;
zmq::socket_t socket (*context, ZMQ_REP);
socket.connect ("inproc://workers");
while (true) {
// Wait for next request from client
std::array<char, 1024> buf{'\0'};
zmq::mutable_buffer request(buf.data(), buf.size());
socket.recv(request, zmq::recv_flags::dontwait);
std::cout << "Received request: [" << (char*) request.data() << "]" << std::endl;
// Do some 'work'
sleep (1);
// Send reply back to client
zmq::message_t reply (6);
memcpy ((void *) reply.data (), "World", 6);
try {
socket.send (reply, zmq::send_flags::dontwait);
}
catch (zmq::error_t& e) {
printf("ERROR: %X\n", e.num());
}
}
return (NULL);
}
int main ()
{
// Prepare our context and sockets
zmq::context_t context (1);
zmq::socket_t clients (context, ZMQ_ROUTER);
clients.bind ("tcp://*:5555"); // who i talk to.
zmq::socket_t workers (context, ZMQ_DEALER);
workers.bind ("inproc://workers");
// Launch pool of worker threads
for (int thread_nbr = 0; thread_nbr != 5; thread_nbr++) {
pthread_t worker;
pthread_create (&worker, NULL, worker_routine, (void *) &context);
}
// Connect work threads to client threads via a queue
zmq::proxy (clients, workers);
return 0;
}
I'm not claiming this is a literal translation of the original broken example; it's my effort to make things compile and run without obvious memory errors.
This code keeps giving me error number 9523DFB (hex; 156384763 in decimal) from the try-catch block. I can't find the definition of this error number in the official docs, but learned from this question that it's the native ZeroMQ error EFSM:
The zmq_send() operation cannot be performed on this socket at the moment due to the socket not being in the appropriate state. This error may occur with socket types that switch between several states, such as ZMQ_REP.
I'd appreciate it if anyone can point out where I did wrong.
UPDATE
I tried polling according to @user3666197's suggestion, but the program still hangs. Inserting any breakpoint effectively crashes the program, making it difficult to debug.
Here is the new worker code
void *worker_routine (void *arg)
{
zmq::context_t *context = (zmq::context_t *) arg;
zmq::socket_t socket (*context, ZMQ_REP);
socket.connect ("inproc://workers");
zmq::pollitem_t items[1] = { { socket, 0, ZMQ_POLLIN, 0 } };
while (true) {
if(zmq::poll(items, 1, -1) < 1) {
printf("Terminating worker\n");
break;
}
// Wait for next request from client
std::array<char, 1024> buf{'\0'};
socket.recv(zmq::buffer(buf), zmq::recv_flags::none);
std::cout << "Received request: [" << (char*) buf.data() << "]" << std::endl;
// Do some 'work'
sleep (1);
// Send reply back to client
zmq::message_t reply (6);
memcpy ((void *) reply.data (), "World", 6);
try {
socket.send (reply, zmq::send_flags::dontwait);
}
catch (zmq::error_t& e) {
printf("ERROR: %s\n", e.what());
}
}
return (NULL);
}

Welcome to the domain of the Zen-of-Zero
Suspect #1: the code jumps straight into an unresolvable live-lock due to a move into an ill-directed state of the distributed-Finite-State-Automaton:
While I have always advocated preferring non-blocking .recv()-s, the code above simply commits suicide right at this step:
socket.recv( request, zmq::recv_flags::dontwait ); // socket being == ZMQ_REP
kills all chances for any other future life but the very error The zmq_send() operation cannot be performed on this socket at the moment due to the socket not being in the appropriate state.
as
going into the .send()-able state is possible if and only if a previous .recv() has delivered a real message.
The Best Next Step :
Review the code and either use a blocking form of the .recv() before going to .send(), or better, use a { blocking | non-blocking } form of .poll( { 0 | timeout }, ZMQ_POLLIN ) before entering an attempt to .recv(), and keep doing other things if there is nothing to receive yet ( so as to avoid the suicidal throwing of the dFSA into an unresolvable collision, flooding your stdout/stderr with a second-spaced flow of printf( "ERROR: %X\n", e.num() ); ).
Error Handling :
Better use const char *zmq_strerror ( int errnum ); being fed by int zmq_errno (void);
The Problem 1 :
In contrast to the suicidal ::dontwait flag behind the Problem 2 root cause, the Problem 1 root cause is that a blocking form of the first .recv() here moves all the worker-threads into a nondeterministically long, possibly infinite, waiting state, as .recv() blocks any further step until a real message arrives ( which, from the MCVE, it does not seem it ever will ), so your pool-of-threads remains in a pool-wide blocked-waiting state and nothing will ever happen until a message arrives.
Update on how the REQ/REP works :
The REQ/REP Scalable Communication Pattern Archetype works like a distributed pair of people - one, let's call her Mary, asks ( Mary .send()-s the REQ ), while the other one, say Bob the REP, listens in a potentially infinitely long blocking .recv() ( or takes due care, using .poll(), to orderly and regularly check whether Mary has asked about something or not, and continues with his own hobbies or gardening otherwise ). Once Bob's end gets a message, Bob can go and .send() Mary a reply ( not before, as he knows nothing about when and what Mary would ( or would not ) ask in the nearer or farther future ), and Mary is fair not to ask her next REQ.send()-question to Bob any sooner than after Bob has ( REP.send() ) replied and Mary has received Bob's message ( REQ.recv() ) - which is fair and more symmetric than real life may exhibit among real people under one roof :o)
The code?
The code is not a reproducible MCVE. The main() creates five Bobs ( hanging waiting for a call from Mary, somewhere over the inproc:// transport-class ), but no Mary ever calls, or does she? There is no visible sign of any Mary trying to do so, much less of her ( their - it could even be a dynamic community in an N:M herd-of-Marys:herd-of-5-Bobs relation ) attempt(s) to handle the REP-ly(s) coming from any one of the 5 Bobs.
Persevere; ZeroMQ took me some time of scratching my own head, yet the years after I took due care to learn the Zen-of-Zero are still a rewarding eternal walk in the Gardens of Paradise. No localhost serial-code IDE will ever be able to "debug" a distributed-system ( unless a distributed-inspector infrastructure is in place; a due architecture for a distributed-system monitor/tracer/debugger is another layer of distributed messaging/signaling atop the debugged distributed messaging/signaling system ) - so do not expect it from a trivial localhost serial-code IDE.
If still in doubt, isolate potential troublemakers - replace inproc:// with tcp://, and if the toys do not work with tcp:// ( where one can wire-line trace the messages ), they won't with the inproc:// memory-zone tricks either.

About the hanging I saw in my UPDATE, I finally figured out what's going on: it was a false expectation on my part.
The sample code in my question was never meant to be self-contained service/client code: it is a server-only app with a ZMQ_REP socket. It just waits for some client code to send requests through ZMQ_REQ sockets. So the "hang" I was seeing is completely normal!
As soon as I hook up a client app to it, things start rolling instantly. This chapter is somewhere in the middle of the Guide, and since I was only concerned with multithreading I skipped many code samples and messaging patterns, which led to my confusion.
The code comments even say it's a server, but I expected explicit confirmation from the program. So, to be fair, the lack of a visual cue and the compiler deprecation warnings made me question the sample code as a new user, but the story the code tells is valid.
Such a shame about the wasted time! But all of a sudden everything @user3666197 says in his answer makes sense.
For the completeness of this question, the updated server thread worker code that works:
// server.cpp
void *worker_routine (void *arg)
{
zmq::context_t *context = (zmq::context_t *) arg;
zmq::socket_t socket (*context, ZMQ_REP);
socket.connect ("inproc://workers");
while (true) {
// Wait for next request from client
std::array<char, 1024> buf{'\0'};
socket.recv(zmq::buffer(buf), zmq::recv_flags::none);
std::cout << "Received request: [" << (char*) buf.data() << "]" << std::endl;
// Do some 'work'
sleep (1);
// Send reply back to client
zmq::message_t reply (6);
memcpy ((void *) reply.data (), "World", 6);
try {
socket.send (reply, zmq::send_flags::dontwait);
}
catch (zmq::error_t& e) {
printf("ERROR: %s\n", e.what());
}
}
return (NULL);
}
The much needed client code:
// client.cpp
int main (void)
{
void *context = zmq_ctx_new ();
// Socket to talk to server
void *requester = zmq_socket (context, ZMQ_REQ);
zmq_connect (requester, "tcp://localhost:5555");
int request_nbr;
for (request_nbr = 0; request_nbr != 10; request_nbr++) {
zmq_send (requester, "Hello", 6, 0);
char buf[6];
zmq_recv (requester, buf, 6, 0);
printf ("Received reply %d [%s]\n", request_nbr, buf);
}
zmq_close (requester);
zmq_ctx_destroy (context);
return 0;
}
The server worker does not have to poll manually because it has been wrapped into the zmq::proxy.

Related

ZeroMQ IPC across several instances of a program

I am having some problems with inter-process communication in ZMQ between several instances of a program.
I am using Linux OS
I am using zeromq/cppzmq, header-only C++ binding for libzmq
If I run two instances of this application (say in two terminals), I provide one with an argument to be a listener, then provide the other with an argument to be a sender. The listener never receives a message. I have tried TCP and IPC to no avail.
#include <zmq.hpp>
#include <string>
#include <iostream>
int ListenMessage();
int SendMessage(std::string str);
zmq::context_t global_zmq_context(1);
int main(int argc, char* argv[] ) {
std::string str = "Hello World";
if (atoi(argv[1]) == 0) ListenMessage();
else SendMessage(str);
zmq_ctx_destroy(& global_zmq_context);
return 0;
}
int SendMessage(std::string str) {
assert(global_zmq_context);
std::cout << "Sending \n";
zmq::socket_t publisher(global_zmq_context, ZMQ_PUB);
assert(publisher);
int linger = 0;
int rc = zmq_setsockopt(publisher, ZMQ_LINGER, &linger, sizeof(linger));
assert(rc==0);
rc = zmq_connect(publisher, "tcp://127.0.0.1:4506");
if (rc == -1) {
printf ("E: connect failed: %s\n", strerror (errno));
return -1;
}
zmq::message_t message(static_cast<const void*> (str.data()), str.size());
rc = publisher.send(message);
if (rc == -1) {
printf ("E: send failed: %s\n", strerror (errno));
return -1;
}
return 0;
}
int ListenMessage() {
assert(global_zmq_context);
std::cout << "Listening \n";
zmq::socket_t subscriber(global_zmq_context, ZMQ_SUB);
assert(subscriber);
int rc = zmq_setsockopt(subscriber, ZMQ_SUBSCRIBE, "", 0);
assert(rc==0);
int linger = 0;
rc = zmq_setsockopt(subscriber, ZMQ_LINGER, &linger, sizeof(linger));
assert(rc==0);
rc = zmq_bind(subscriber, "tcp://127.0.0.1:4506");
if (rc == -1) {
printf ("E: bind failed: %s\n", strerror (errno));
return -1;
}
std::vector<zmq::pollitem_t> p = {{subscriber, 0, ZMQ_POLLIN, 0}};
while (true) {
zmq::message_t rx_msg;
// when timeout (the third argument here) is -1,
// then block until ready to receive
std::cout << "Still Listening before poll \n";
zmq::poll(p.data(), 1, -1);
std::cout << "Found an item \n"; // not reaching
if (p[0].revents & ZMQ_POLLIN) {
// received something on the first (only) socket
subscriber.recv(&rx_msg);
std::string rx_str;
rx_str.assign(static_cast<char *>(rx_msg.data()), rx_msg.size());
std::cout << "Received: " << rx_str << std::endl;
}
}
return 0;
}
This code will work if I running one instance of the program with two threads
std::thread t_sub(ListenMessage);
sleep(1); // Slow joiner in ZMQ PUB/SUB pattern
std::thread t_pub(SendMessage str);
t_pub.join();
t_sub.join();
But I am wondering why when running two instances of the program the code above won't work?
Thanks for your help!
In case one has never worked with ZeroMQ, one may here enjoy first looking at "ZeroMQ Principles in less than Five Seconds" before diving into further details.
Q : wondering why when running two instances of the program the code above won't work?
This code will never fly - and it has nothing to do with thread-based or process-based [CONCURRENT] processing.
It was caused by a wrong design of the Inter Process Communication.
ZeroMQ can provide for this either one of the supported transport-classes :{ ipc:// | tipc:// | tcp:// | norm:// | pgm:// | epgm:// | vmci:// } plus having even smarter one for in-process comms, an inproc:// transport-class ready for inter-thread comms, where a stack-less communication may enjoy the lowest ever latency, being just a memory-mapped policy.
The selection of L3/L2-based networking stack for an Inter-Process-Communication is possible, yet sort of the most "expensive" option.
The Core Mistake :
Given that choice, any single process ( not speaking about a pair of processes ) will collide on an attempt to .bind() its AccessPoint onto the very same TCP/IP address:port#.
The Other Mistake :
Even for the sake of a solo programme launched, both of the spawned threads attempt to .bind() their own private AccessPoints, yet neither makes an attempt to .connect() to a matching "opposite" AccessPoint.
At least one has to successfully .bind(), and
at least one has to successfully .connect(), so as to get a "channel", here of the PUB/SUB Archetype.
ToDo:
decide about a proper, right-enough Transport-Class ( best avoid an overkill to operate the full L3/L2-stack for localhost/in-process IPC )
refactor the Address:port# management ( for 2+ processes not to fail on .bind()-s to the same ( hard-wired ) address:port# )
always detect and handle appropriately the returned {PASS|FAIL}-s from API calls
always set LINGER to zero explicitly ( you never know )

Binding a subscriber socket and connecting a publisher socket in ZeroMQ gives an error when the code is run. Why?

In this code the Subscriber socket ( in subscriber.cpp ) binds to port 5556.
It receives updates/messages from the publisher ( in publisher.cpp ), whose socket connects to the subscriber at 5556 and sends updates/messages to it.
I know that convention is to .bind() a publisher and not to call .connect() on it. But by theory every socket type can .bind() or .connect().
But both programs give a zmq error when run. Why?
This is CPP code.
publisher.cpp
#include <iostream>
#include <zmq.hpp>
#include <zhelpers.hpp>
using namespace std;
int main () {
zmq::context_t context (1);
zmq::socket_t publisher(context, ZMQ_PUB);
publisher.connect("tcp://*:5556");
while (1) {
zmq::message_t request (12);
memcpy (request.data (), "Pub-1 Data", 12);
sleep(1);
publisher.send (request);
}
return 0;
}
subscriber.cpp
#include <iostream>
#include <zmq.hpp>
int main (int argc, char *argv[])
{
zmq::context_t context (1);
zmq::socket_t subscriber (context, ZMQ_SUB);
subscriber.bind("tcp://localhost:5556");
subscriber.setsockopt(ZMQ_SUBSCRIBE, "", 0); // subscribe to all messages
// Process 10 updates
int update_nbr;
for (update_nbr = 0; update_nbr < 10 ; update_nbr++) {
zmq::message_t update;
subscriber.recv (&update);
std::string updt = std::string(static_cast<char*>(update.data()), update.size());
std::cout << "Received Update/Messages/TaskList " << update_nbr <<" : "<< updt << std::endl;
}
return 0;
}
No, there is no problem in reversed .bind()/.connect()
This principally works fine.
Yet, the PUB/SUB Formal Archetype is subject to a so called late-joiner syndrome.
Without thorough debugging details, as were requested above, one may just repeat general rules of thumb:
In newer API versions one may
add rc = <aSocket>.setsockopt( ZMQ_CONFLATE, 1 ); assert( rc == 0 && "CONFLATE" );
add rc = <aSocket>.setsockopt( ZMQ_IMMEDIATE, 1 ); assert( rc == 0 && "IMMEDIATE" );
and so forth,
all that to better tune the context-instance + socket-instance attributes, so as to minimise the late-joiner syndrome effects.
There is no problem with reversed bind()/connect().
The code works when I changed the line subscriber.bind("tcp://localhost:5556");
to
subscriber.bind("tcp://*:5556");
and
publisher.connect("tcp://*:5556");
to
publisher.connect("tcp://localhost:5556");

ZeroMQ socket.recv() raised a STACK_OVERFLOW exception

If I use this code in a .dll, a call to socket.recv() raises a STACK_OVERFLOW exception, but when the same code is compiled as an .exe it works.
Why?
I run the .dll test with "C:\windows\system32\rundll32.exe myDll.dll StartUp"
void StartUp()
{
zmq::context_t context(1);
zmq::socket_t socket(context, ZMQ_REP);
socket.bind("tcp://127.0.0.1:3456");
zmq::message_t msgIN, msgOUT("test", 4);
while (true){
socket.recv(&msgIN);
socket.send(msgOUT);
};
}
callstack :
libzmq-v120-mt-gd-4_2_2.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_=0x0231f700, int timeout_=0x00000000)
libzmq-v120-mt-gd-4_2_2.dll!zmq::io_thread_t::in_event()
libzmq-v120-mt-gd-4_2_2.dll!zmq::select_t::loop()
libzmq-v120-mt-gd-4_2_2.dll!zmq::select_t::worker_routine(void * arg_=0x002f1778)
libzmq-v120-mt-gd-4_2_2.dll!thread_routine(void * arg_=0x002f17c0)
main thread callstack:
libzmq-v120-mt-gd-4_2_2.dll!zmq::signaler_t::wait(int timeout_=0xffffffff)
libzmq-v120-mt-gd-4_2_2.dll!zmq::mailbox_t::recv(zmq::command_t * cmd_=0x0019f3c0, int timeout_=0xffffffff)
libzmq-v120-mt-gd-4_2_2.dll!zmq::socket_base_t::process_commands(int timeout_, bool throttle_)
libzmq-v120-mt-gd-4_2_2.dll!zmq::socket_base_t::recv(zmq::msg_t * msg_=0x0019f628, int flags_=0x00000000)
libzmq-v120-mt-gd-4_2_2.dll!s_recvmsg(zmq::socket_base_t * s_=0x006f6c70, zmq_msg_t * msg_=0x0019f628, int flags_=0x00000000)
libzmq-v120-mt-gd-4_2_2.dll!zmq_msg_recv(zmq_msg_t * msg_=0x0019f628, void * s_=0x006f6c70, int flags_=0x00000000)
mydll.dll!zmq::socket_t::recv(zmq::message_t * msg_=0x0019f628, int flags_=0x00000000)
mydll.dll!StartUp()
Update:
This example also crashes for the same reason. Does anyone know any reason for the stack-overflow exception?
zmq::context_t context(1);
zmq::socket_t socket(context, ZMQ_REP);
socket.bind("tcp://*:7712");
while (1){
Sleep(10);
}
A reverse problem-isolation MCVE:
And how did this myDll.dll-test work,
if run by C:\windows\system32\rundll32.exe myDll.dll StartUp? Post the screen outputs.
void StartUp()
{
std::cout << "INF:: ENTRY POINT ( C:\windows\system32\rundll32.exe myDll.dll StartUp )" << std::endl;
std::cout << "INF:: WILL SLEEP ( C:\windows\system32\rundll32.exe myDll.dll StartUp )" << std::endl;
Sleep( 10 );
std::cout << "INF:: SLEPT WELL ( C:\windows\system32\rundll32.exe myDll.dll StartUp )" << std::endl;
std::cout << "INF:: WILL RETURN ( C:\windows\system32\rundll32.exe myDll.dll StartUp )" << std::endl;
}
The reason for the crash is the SizeOfStackCommit value in the OPTIONAL_HEADER of the rundll32 file.
It was too small (0xC000); I changed it to 0x100000. Now everything works.
ZeroMQ objects require a certain respect to work with:
there are many features under the radar that may wreak havoc, as you have already seen on your screen.
Best read with due care both the ZeroMQ C++ binding reference documentation and the original ZeroMQ API documentation ( which the C++ binding often refers to ).
Both emphasise never handling zmq::message_t instances directly, but rather via "service" functions ( often re-wrapped as instance methods in C++ ).
zmq::message_t messageIN,
messageOUT;
bool successFlag;
while (true){
successFlag = socket.recv( &messageIN );
assert( successFlag && "EXC: .recv( &messageIN )" );
/* The zmq_recv() function shall receive a message
from the socket referenced by the socket argument
and store it in the message referenced by the msg
argument.
Any content previously stored in msg shall be
properly deallocated.
If there are no messages available on the specified
socket the zmq_recv() function shall block
until the request can be satisfied.
*/
messageOUT.copy( messageIN );
successFlag = socket.send( messageOUT );
assert( successFlag && "EXC: .send( messageOUT )" );
/* The zmq_send() function shall queue the message
referenced by the msg argument to be sent to
the socket referenced by the socket argument.
The flags argument is a combination of the flags
defined { ZMQ_NOBLOCK, ZMQ_SNDMORE }
The zmq_msg_t structure passed to zmq_send()
is nullified during the call.
If you want to send the same message to multiple
sockets you have to copy it using (e.g.
using zmq_msg_copy() ).
A successful invocation of zmq_send()
does not indicate that the message
has been transmitted to the network,
only that it has been queued on the socket
and ØMQ has assumed responsibility for the message.
*/
};
My suspect is reference counting: more and more instances produced by the zmq::message_t message; constructor in an infinite while( true ){...}-loop, none of which has ever met its own fair destructor. The STACK, having a physically limited capacity and no STACK-management care inside the DLL, will fail sooner or later.
zmq::message_t instances are quite an expensive toy, so good resource-management practices ( pre-allocation, reuse, controlled destruction ) are always welcome in professional code.
Q.E.D.
Tail remarks for clarity purposes:
A bit paraphrasing Dijkstra's view on error hunting and software testing: "If I see no error, that does not mean there is none in the piece of code ( the less so if any external functions are linked in addition to it )."
No stack allocations?
Yes, no visible ones.
ZeroMQ API puts more light into it:
"The zmq_msg_init_size() function shall allocate any resources required to store a message size bytes long and initialise the message object referenced by msg to represent the newly allocated message.
The implementation shall choose whether to store message content on the stack (small messages) or on the heap (large messages). For performance reasons zmq_msg_init_size() shall not clear the message data."
Many years spent on using cross-platform distributed systems based on the ZeroMQ API since v2.1+ have taught me a lot about being careful with explicit resource control, the more so once you did not develop your own language binding for the native API.
After all the unsupported criticism, let's add one more citation from ZeroMQ:
This adds a view of how proper indirect manipulation of the message_t content is done by the library's C++ binding itself, wrapped into trivial helper functions:
from zhelpers.hpp:
// Receive 0MQ string from socket and convert into string
static std::string
s_recv (zmq::socket_t & socket) {
zmq::message_t message;
socket.recv(&message);
return std::string(static_cast<char*>(message.data()), message.size());
}
// Convert string to 0MQ string and send to socket
static bool
s_send (zmq::socket_t & socket, const std::string & string) {
zmq::message_t message(string.size());
memcpy (message.data(), string.data(), string.size());
bool rc = socket.send (message);
return (rc);
}
// Sends string as 0MQ string, as multipart non-terminal
static bool
s_sendmore (zmq::socket_t & socket, const std::string & string) {
zmq::message_t message(string.size());
memcpy (message.data(), string.data(), string.size());
bool rc = socket.send (message, ZMQ_SNDMORE);
return (rc);
}

ZeroMQ PUSH/PULL

Some part of zmq is not behaving in a predictable manner.
I'm using VS2013 and zmq 3.2.4. In order not to 'lose' messages in my pub-sub framework [aside: I believe this is a design flaw; I should be able to start my subscriber first, then the publisher, and I should receive all messages], I must synchronise the publisher with the subscriber a la durapub/durasub etc. I am using the durasub.cpp and durapub.cpp examples found in the ZeroMQ guide.
If I use the examples as-is, the system works perfectly.
If I now add scoping braces around the ZMQ_PUSH socket in durasub.cpp
{
zmq::socket_t sync (context, ZMQ_PUSH);
sync.connect(syncstr.c_str());
s_send (sync, "sync");
}
the system stops working. The matching 'ZMQ_PULL' signal never reaches the application level in durapub.cpp.
I have stepped through the C++ wrapper to check the return values from zmq_close and all is well. As far as ZMQ is concerned it has delivered the message to the endpoint.
Hopefully I've done something obviously stupid?
There's more. The addition of
std::this_thread::sleep_for(std::chrono::milliseconds(1));
allows the system (ie the pub/sub) to start working again. So it's clearly a race-condition, presumably in the reaper thread as it destroys the socket.
More digging around. I think LIBZMQ-179 refers to the problem as well.
EDIT#2 2014-08-13 03:00 [UTC+0000]
Publisher.cpp:
#include <zmq.hpp>
#include <zhelpers.hpp>
#include <string>
int main (int argc, char *argv[])
{
zmq::context_t context(1);
std::string bind_point("tcp://*:5555");
std::string sync_bind("tcp://*:5554");
zmq::socket_t sync(context, ZMQ_PULL);
sync.bind(sync_bind.c_str());
// We send updates via this socket
zmq::socket_t publisher(context, ZMQ_PUB);
publisher.bind(bind_point.c_str());
// Wait for synchronization request
std::string tmp = s_recv (sync);
std::cout << "Received: " << tmp << std::endl;
int numbytessent = s_send (publisher, "END");
std::cout << numbytessent << "bytes sent" << std::endl;
}
Subscriber.cpp
#include <zmq.hpp>
#include <zhelpers.hpp>
#include <string>
int main (int argc, char *argv[])
{
std::string connectstr("tcp://127.0.0.1:5555");
std::string syncstr("tcp://127.0.0.1:5554");
zmq::context_t context(1);
zmq::socket_t subscriber (context, ZMQ_SUB);
subscriber.setsockopt(ZMQ_SUBSCRIBE, "", 0);
subscriber.connect(connectstr.c_str());
#if ENABLE_PROBLEM
{
#endif // ENABLE_PROBLEM
zmq::socket_t sync (context, ZMQ_PUSH);
sync.connect(syncstr.c_str());
s_send (sync, "sync");
#if ENABLE_PROBLEM
}
#endif // ENABLE_PROBLEM
while (1)
{
std::cout << "Receiving..." << std::endl;
std::string s = s_recv (subscriber);
std::cout << s << std::endl;
if (s == "END")
{
break;
}
}
}
Compile each cpp to its own exe.
Start both exes (starting order is irrelevant)
If ENABLE_PROBLEM is defined:
Publisher: (EMPTY prompt)
Subscriber: 'Receiving...'
And then you have to kill both processes because they're hung...
If ENABLE_PROBLEM is not defined:
Publisher: 'Received: sync'
'3 bytes sent'
Subscriber: 'Receiving...'
'END'
EDIT#1 2014-08-11: Original post has changed, without leaving revisions visible
What is the goal?
With all due respect, it is quite hard to isolate the goal and mock up any PASS/FAIL test to validate it from just the three SLOCs above.
So let's start step by step.
What ZMQ-primitives are used there?
T.B.D.
post-EDIT#1: ZMQ_PUSH + ZMQ_PULL + ( hidden ZMQ_PUB + ZMQ_SUB ... next time rather post ProblemDOMAIN-context-complete sources, best enriched with self-test-case outputs like:
...
// <code>-debug-isolation-framing ------------------------------------------------
std::cout << "---[Pre-test]: sync.connect(syncstr.c_str()) argument" << std::endl;
std::cout << syncstr.c_str() << std::endl;
std::cout << "---[Use/exec]: " << std::endl;
sync.connect( syncstr.c_str());
// <code>-debug-isolation-framing ------------------------------------------------
...
)
What ZMQ-context create/terminate life-cycle-policy is deployed?
T.B.D.
post-EDIT#1: n.b.: ZMQ_LINGER rather influences the .close() of the resource, which may take place way before a ZMQ_Context termination appears. ( And may block ... which hurts ... )
A note on "When does ZMQ_LINGER really matter?"
This parameter comes into action once a Context is near to be terminated, while a sending-queue is not empty yet and an attempt to zmq_close() is being handled.
In most architectures ( ... the more so in low-latency / high-performance ones, where microseconds and nanoseconds count ... ) the (shared/restricted) resources setup/disposal operations appear, for many reasons, either at the very beginning or at the very end of the system life-cycle. Needless to tell more about why; just imagine the overheads directly associated with all the setup/discard operations, which are altogether simply not feasible to take place ( the less to repetitively take place ... ) during the routine flow of operations in near-real-time system designs.
So, having the system processes come to the final "tidy-up" phase ( just before exit )
setting ZMQ_LINGER == 0 simply ignores whatever is still inside the <sender>'s queue and allows a prompt zmq_close() + zmq_term()
Similarly, ZMQ_LINGER == -1 puts on whatever is still hanging inside the <sender>'s queue a label of [Having an utmost value], so that the whole system has to wait ad infinitum, until ( hopefully any ) <receiver> retrieves & "consumes" all the en-queued messages, before any zmq_close() + zmq_term() is allowed to take place ... which could be pretty long and is fully out of your control ...
And finally, ZMQ_LINGER > 0 serves as a compromise: wait a defined amount of [msec]-s, should some <receiver> come and retrieve the en-queued message(s). However, at the given TimeDOMAIN milestone, the system proceeds to zmq_close() + zmq_term(), for a graceful clean release of all reserved resources, and exits in accord with the system design's timing constraints.

async_receive_from stops receiving after a few packets under Linux

I have a setup with multiple peers broadcasting udp packets (containing images) every 200ms (5fps).
While receiving both the local stream as external streams works fine under Windows, the same code (except for the socket->cancel(); in Windows XP, see comment in code) produces rather strange behavior under Linux:
The first few (5~7) packets sent by another machine (when this machine starts streaming) are received as expected;
After this, the packets from the other machine are received after irregular, long intervals (12s, 5s, 17s, ...) or hit the timeout (defined as 20 seconds). At certain moments, there is again a burst of (3~4) packets received as expected.
The packets sent by the machine itself are still being received as expected.
Using Wireshark, I see both local and external packets arriving as they should, with correct time intervals between consecutive packets. The behavior also presents itself when the local machine is only listening to a single other stream, with the local stream disabled.
This is some code from the receiver (with some updates as suggested below, thanks!):
Receiver::Receiver(port p)
{
    this->port = p;
    this->stop = false;
}

int Receiver::run()
{
    io_service io_service;
    boost::asio::ip::udp::socket socket(
        io_service,
        boost::asio::ip::udp::endpoint(boost::asio::ip::udp::v4(),
                                       this->port));
    while(!stop)
    {
        const int bufflength = 65000;
        int timeout = 20000;
        char sockdata[bufflength];
        boost::asio::ip::udp::endpoint remote_endpoint;
        int rcvd;
        bool read_success = this->receive_with_timeout(
            sockdata, bufflength, &rcvd, &socket, remote_endpoint, timeout);
        if(read_success)
        {
            std::cout << "read success " << remote_endpoint.address().to_string() << std::endl;
        }
        else
        {
            std::cout << "read fail" << std::endl;
        }
    }
    return 0;
}
void handle_receive_from(
    bool* toset, boost::system::error_code error, size_t length, int* outsize)
{
    if(!error || error == boost::asio::error::message_size)
    {
        *toset = (length > 0);
        *outsize = length;
    }
    else
    {
        std::cout << error.message() << std::endl;
    }
}

// Update: error check
void handle_timeout(bool* toset, boost::system::error_code error)
{
    if(!error)
    {
        *toset = true;
    }
    else
    {
        std::cout << error.message() << std::endl;
    }
}
bool Receiver::receive_with_timeout(
    char* data, int buffl, int* outsize,
    boost::asio::ip::udp::socket *socket,
    boost::asio::ip::udp::endpoint &sender_endpoint, int msec_tout)
{
    bool timer_overflow = false;
    bool read_result = false;
    deadline_timer timer( socket->get_io_service() );
    timer.expires_from_now( boost::posix_time::milliseconds(msec_tout) );
    timer.async_wait( boost::bind(&handle_timeout, &timer_overflow,
                                  boost::asio::placeholders::error) );
    socket->async_receive_from(
        boost::asio::buffer(data, buffl), sender_endpoint,
        boost::bind(&handle_receive_from, &read_result,
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred, outsize));
    socket->get_io_service().reset();
    while ( socket->get_io_service().run_one() )
    {
        if ( read_result )
        {
            timer.cancel();
        }
        else if ( timer_overflow )
        {
            // not to be used on Windows XP, Windows Server 2003, or earlier
            socket->cancel();
            // Update: added run_one()
            socket->get_io_service().run_one();
        }
    }
    // Update: added run_one()
    socket->get_io_service().run_one();
    return read_result;
}
When the timer exceeds 20 seconds, the error message "Operation canceled" is returned, but it is difficult to get any other information about what is going on.
Can anyone identify a problem or give me some hints to get some more information about what is going wrong? Any help is appreciated.
Okay, what you're doing is that when you call receive_with_timeout, you're setting up the two asynchronous requests (one for the receive, one for the timeout). When the first one completes, you cancel the other.
However, you never invoke io_service::run_one() again to allow its callback to complete. When you cancel an operation in boost::asio, it invokes the handler, usually with an error code indicating that the operation has been aborted or canceled. In this case, I believe you have a handler left dangling once you destroy the deadline_timer, since it holds a pointer to a stack variable where it stores its result.
The solution is to call run_one() again, to process the canceled callback's result, prior to exiting the function. You should also check the error code being passed to your timeout handler, and only treat it as a timeout if there was no error.
Also, in the case where you do have a timeout, you need to execute run_one() so that the async_receive_from handler can execute and report that it was canceled.
After a clean installation with Xubuntu 12.04 instead of an old install with Ubuntu 10.04, everything now works as expected. Maybe it is because the new install runs a newer kernel, probably with improved networking? Anyway, a re-install with a newer version of the distribution solved my problem.
If anyone else gets unexpected network behavior with an older kernel, I would advise trying it on a system with a newer kernel installed.