ZeroMQ: Address in use error when re-binding socket - c++

After binding a ZeroMQ socket to an endpoint and closing that socket, binding another socket to the same endpoint takes several attempts; the calls to zmq_bind before the successful one fail with the error "Address in use" (EADDRINUSE).
The following code demonstrates the problem:
#include <cassert>
#include <iostream>
#include "zmq.h"

int main() {
    void *ctx = zmq_ctx_new();
    assert( ctx );
    void *skt;

    skt = zmq_socket( ctx, ZMQ_REP );
    assert( skt );
    assert( zmq_bind( skt, "tcp://*:5555" ) == 0 );
    assert( zmq_close( skt ) == 0 );

    skt = zmq_socket( ctx, ZMQ_REP );
    assert( skt );
    int fail = 0;
    while ( zmq_bind( skt, "tcp://*:5555" ) ) { ++fail; }
    std::cout << fail << std::endl;
}
I'm using ZeroMQ 4.0.3 on Windows XP SP3; the compiler is VS 2008. libzmq.dll has been built with the provided Visual Studio solution.
This prints 1 here when doing a "Debug" build (of both the code above and libzmq.dll) and 0 when using a "Release" build. Strangely enough, when running the code above with a mixed build configuration (Debug code with the Release lib), fail counts up to 6.

Pieter Hintjens gave me the hint on the mailing list:
The call to zmq_close initiates the socket shutdown. This is done in a special "reaper" thread started by ZeroMQ to make the call to zmq_close asynchronous and non-blocking. See "The reaper thread" in the whitepaper about ZeroMQ's architecture.
The code above does not wait for the thread that does the actual work, so the endpoint does not become available immediately.
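A practical consequence (a minimal sketch of my own, not from the mailing list; the 10 ms delay and attempt cap are arbitrary, and std::this_thread assumes a C++11 compiler, so on VS 2008 a Sleep() call would take its place): since the close is handled asynchronously by the reaper thread, the second bind can back off briefly between attempts instead of spinning.

#include <cerrno>
#include <chrono>
#include <thread>
#include "zmq.h"

// Retry zmq_bind until the reaper thread has finished closing the previous
// socket and the endpoint becomes free again.
static int bind_with_retry( void *skt, const char *endpoint )
{
    for ( int attempt = 0; attempt < 100; ++attempt ) {
        if ( zmq_bind( skt, endpoint ) == 0 )
            return 0;                  // bound successfully
        if ( zmq_errno() != EADDRINUSE )
            return -1;                 // a different error: give up
        std::this_thread::sleep_for( std::chrono::milliseconds( 10 ) );
    }
    return -1;                         // endpoint still in use after all attempts
}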

When a TCP socket is closed, it enters a state called TIME_WAIT. While the socket is in that state it is not really closed, which in turn means that the address it used is not available until it leaves that state.
So if you run your program twice in close succession, the socket from the first run will still be in TIME_WAIT when you attempt the second run, and you get an error like this.
You might want to read more about TCP, and especially about its operation and states.

Related

Crash in a modified version of an official ZeroMQ multithreaded example

I'm new to zmq and cppzmq. I'm trying to run the multithreaded example in the official guide: http://zguide.zeromq.org/cpp:mtserver
My setup
macOS Mojave, Xcode 10.3
libzmq 4.3.2 via Homebrew
cppzmq GitHub HEAD
I hit a few problems.
Problem 1
When running the source code from the guide, it hangs forever without any stdout output showing up.
Here is the code directly copied from the Guide.
/*
    Multithreaded Hello World server in C
*/
#include <pthread.h>
#include <unistd.h>
#include <cassert>
#include <cstring>      // memcpy
#include <string>
#include <iostream>
#include <zmq.hpp>

void *worker_routine (void *arg)
{
    zmq::context_t *context = (zmq::context_t *) arg;
    zmq::socket_t socket (*context, ZMQ_REP);
    socket.connect ("inproc://workers");

    while (true) {
        // Wait for next request from client
        zmq::message_t request;
        socket.recv (&request);
        std::cout << "Received request: [" << (char*) request.data() << "]" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (6);
        memcpy ((void *) reply.data (), "World", 6);
        socket.send (reply);
    }
    return (NULL);
}

int main ()
{
    // Prepare our context and sockets
    zmq::context_t context (1);
    zmq::socket_t clients (context, ZMQ_ROUTER);
    clients.bind ("tcp://*:5555");
    zmq::socket_t workers (context, ZMQ_DEALER);
    workers.bind ("inproc://workers");

    // Launch pool of worker threads
    for (int thread_nbr = 0; thread_nbr != 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, (void *) &context);
    }

    // Connect work threads to client threads via a queue
    zmq::proxy (static_cast<void*>(clients),
                static_cast<void*>(workers),
                nullptr);
    return 0;
}
It crashes soon after I put a breakpoint in the while loop of the worker.
Problem 2
Noticing that the compiler prompted me to replace deprecated API calls, I modified the above sample code to make the warnings disappear.
/*
    Multithreaded Hello World server in C
*/
#include <pthread.h>
#include <unistd.h>
#include <array>        // std::array
#include <cassert>
#include <cstring>      // memcpy
#include <string>
#include <iostream>
#include <cstdio>
#include <zmq.hpp>

void *worker_routine (void *arg)
{
    zmq::context_t *context = (zmq::context_t *) arg;
    zmq::socket_t socket (*context, ZMQ_REP);
    socket.connect ("inproc://workers");

    while (true) {
        // Wait for next request from client
        std::array<char, 1024> buf{'\0'};
        zmq::mutable_buffer request(buf.data(), buf.size());
        socket.recv(request, zmq::recv_flags::dontwait);
        std::cout << "Received request: [" << (char*) request.data() << "]" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (6);
        memcpy ((void *) reply.data (), "World", 6);
        try {
            socket.send (reply, zmq::send_flags::dontwait);
        }
        catch (zmq::error_t& e) {
            printf("ERROR: %X\n", e.num());
        }
    }
    return (NULL);
}

int main ()
{
    // Prepare our context and sockets
    zmq::context_t context (1);
    zmq::socket_t clients (context, ZMQ_ROUTER);
    clients.bind ("tcp://*:5555"); // who I talk to.
    zmq::socket_t workers (context, ZMQ_DEALER);
    workers.bind ("inproc://workers");

    // Launch pool of worker threads
    for (int thread_nbr = 0; thread_nbr != 5; thread_nbr++) {
        pthread_t worker;
        pthread_create (&worker, NULL, worker_routine, (void *) &context);
    }

    // Connect work threads to client threads via a queue
    zmq::proxy (clients, workers);
    return 0;
}
I'm not claiming this is a literal translation of the original broken example, but it's my effort to make things compile and run without obvious memory errors.
This code keeps giving me error number 9523DFB (156384763 in decimal) from the try-catch block. I couldn't find the definition of the error number in the official docs, but learned from this question that it's the native ZeroMQ error EFSM:
The zmq_send() operation cannot be performed on this socket at the moment due to the socket not being in the appropriate state. This error may occur with socket types that switch between several states, such as ZMQ_REP.
I'd appreciate it if anyone can point out where I did wrong.
UPDATE
I tried polling according to #user3666197's suggestion, but the program still hangs. Inserting any breakpoint effectively crashes the program, making it difficult to debug.
Here is the new worker code
void *worker_routine (void *arg)
{
    zmq::context_t *context = (zmq::context_t *) arg;
    zmq::socket_t socket (*context, ZMQ_REP);
    socket.connect ("inproc://workers");

    zmq::pollitem_t items[1] = { { socket, 0, ZMQ_POLLIN, 0 } };
    while (true) {
        if (zmq::poll(items, 1, -1) < 1) {
            printf("Terminating worker\n");
            break;
        }

        // Wait for next request from client
        std::array<char, 1024> buf{'\0'};
        socket.recv(zmq::buffer(buf), zmq::recv_flags::none);
        std::cout << "Received request: [" << (char*) buf.data() << "]" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (6);
        memcpy ((void *) reply.data (), "World", 6);
        try {
            socket.send (reply, zmq::send_flags::dontwait);
        }
        catch (zmq::error_t& e) {
            printf("ERROR: %s\n", e.what());
        }
    }
    return (NULL);
}
Welcome to the domain of the Zen-of-Zero
Suspect #1: the code jumps straight into an unresolvable live-lock, due to a move into an ill-directed state of the distributed Finite-State-Automaton:
While I have always advocated preferring non-blocking .recv()-s, the code above simply commits suicide right at this step:
socket.recv( request, zmq::recv_flags::dontwait ); // socket being == ZMQ_REP
which kills all chances for any future life except the very error The zmq_send() operation cannot be performed on this socket at the moment due to the socket not being in the appropriate state,
because
going into the .send()-able state is possible if and only if a previous .recv() has delivered a real message.
The Best Next Step :
Review the code and either use a blocking form of .recv() before going to .send(), or, better, use a { blocking | non-blocking } form of .poll( { 0 | timeout }, ZMQ_POLLIN ) before attempting to .recv(), and keep doing other things if there is nothing to receive yet ( so as to avoid throwing the dFSA into an unresolvable collision and flooding your stdout/stderr with a once-per-second stream of printf(" ERROR: %X\n", e.num() ); )
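As an illustration of that advice (a minimal sketch of mine, not the answerer's code; the 100 ms timeout is an arbitrary choice and the endpoint matches the question's inproc://workers), the worker polls for ZMQ_POLLIN and only calls .recv() once a message is really waiting:

#include <chrono>
#include <cstring>
#include <pthread.h>
#include <zmq.hpp>

// Poll-before-recv REP worker (sketch).
void *polling_worker( void *arg )
{
    zmq::context_t *context = (zmq::context_t *) arg;
    zmq::socket_t socket( *context, ZMQ_REP );
    socket.connect( "inproc://workers" );

    zmq::pollitem_t items[] = { { static_cast<void*>( socket ), 0, ZMQ_POLLIN, 0 } };
    while ( true ) {
        zmq::poll( items, 1, std::chrono::milliseconds( 100 ) );  // wait up to 100 ms
        if ( !( items[0].revents & ZMQ_POLLIN ) )
            continue;               // nothing to receive yet -- do other work, then poll again

        zmq::message_t request;
        socket.recv( request, zmq::recv_flags::none );            // will not block: data is ready
        zmq::message_t reply( 6 );
        memcpy( reply.data(), "World", 6 );
        socket.send( reply, zmq::send_flags::none );              // REP is now in its send-able state
    }
    return NULL;
}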
Error Handling :
Better to use const char *zmq_strerror( int errnum ), fed by int zmq_errno( void );
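For instance, with the zmq::error_t caught in the code above, a small helper can print both the text and the number (a sketch):

#include <cstdio>
#include <zmq.hpp>

// Print a readable description of a caught ZeroMQ error (sketch).
static void report_zmq_error( const zmq::error_t &e )
{
    // zmq_strerror() turns the numeric error carried by the exception into text,
    // e.g. "Operation cannot be accomplished in current state" for EFSM.
    std::printf( "ERROR: %s (%d)\n", zmq_strerror( e.num() ), e.num() );
}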
The Problem 1 :
In contrast to the suicidal ::dontwait flag that is the root cause of Problem 2, the root cause of Problem 1 is that a blocking form of the first .recv() moves all the worker threads into an indeterministically long, possibly infinite, waiting state, as the .recv() blocks any further step until a real message arrives ( which, judging from the MCVE, it never will ), so your pool of threads remains in a pool-wide blocked-waiting state and nothing will ever happen until a message arrives.
Update on how the REQ/REP works :
The REQ/REP Scalable Communication Pattern Archetype works like a distributed pair of people - one, let's call her Mary, asks ( Mary .send()-s the REQ ), while the other one, say Bob the REP, listens in a potentially infinitely long blocking .recv() ( or takes due care, using .poll(), to orderly and regularly check whether Mary has asked about something or not, and continues his own hobbies or gardening otherwise ). Once Bob's end gets a message, Bob can go and .send() Mary a reply ( not before, as he knows nothing about when and what Mary would ( or would not ) ask in the nearer or farther future ), and Mary is fair not to ask her next REQ.send()-question of Bob any sooner than after Bob has ( REP.send() ) replied and Mary has received Bob's message ( REQ.recv() ) - which is fair and more symmetric than real life may exhibit among real people under one roof :o)
The code?
The code is not a reproducible MCVE. The main() creates five Bobs ( hanging, waiting for a call from Mary, somewhere over the inproc:// transport-class ), but no Mary ever calls, or does she? There is no visible sign of any Mary trying to do so, much less of her ( their - it could even be a dynamic community in an N:M herd-of-Mary(s):herd-of-5-Bobs relation ) attempt(s) to handle the REP-ly(s) coming from any of the 5 Bobs.
Persevere; ZeroMQ took me some time of scratching my own head, yet the years after I took due care to learn the Zen-of-Zero are still a rewarding eternal walk in the Gardens of Paradise. No localhost serial-code IDE will ever be able to "debug" a distributed system ( unless a distributed-inspector infrastructure is in place; a due architecture for a distributed-system monitor/tracer/debugger is another layer of distributed messaging/signaling atop of the debugged distributed messaging/signaling system ) - so do not expect it from a trivial localhost serial-code IDE.
If still in doubt, isolate the potential troublemakers - replace inproc:// with tcp://, and if the toys do not work with tcp:// ( where one can wire-line trace the messages ), they won't work with the inproc:// memory-zone tricks either.
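For illustration, that isolation step amounts to changing only the endpoint strings (a sketch; port 5560 is an arbitrary pick):

// in main(): expose the back-end over TCP instead of process-internal memory
workers.bind( "tcp://127.0.0.1:5560" );     // was: workers.bind( "inproc://workers" );

// in worker_routine(): connect to the same TCP endpoint
socket.connect( "tcp://127.0.0.1:5560" );   // was: socket.connect( "inproc://workers" );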
About the hanging that I saw in my UPDATED question, I finally figured out what was going on. It was a false expectation on my part.
This very sample code in my question was never meant to be a self-contained service/client pair: it is a server-only app with a ZMQ_REP socket. It just waits for client code to send requests through ZMQ_REQ sockets. So the "hang" I was seeing is completely normal!
As soon as I hook a client app up to it, things start rolling instantly. This chapter is somewhere in the middle of the Guide and I was only concerned with multithreading, so I skipped many code samples and messaging patterns, which led to my confusion.
The code comments even said it's a server, but I expected to see explicit confirmation from the program. So, to be fair, the lack of a visual cue and the compiler deprecation warnings caused me, as a new user, to question the sample code, but the story the code tells is valid.
Such a shame about the wasted time! But all of a sudden everything #user3666197 says in his answer starts to make sense.
For completeness of this question, here is the updated server worker code that works:
// server.cpp
void *worker_routine (void *arg)
{
    zmq::context_t *context = (zmq::context_t *) arg;
    zmq::socket_t socket (*context, ZMQ_REP);
    socket.connect ("inproc://workers");

    while (true) {
        // Wait for next request from client
        std::array<char, 1024> buf{'\0'};
        socket.recv(zmq::buffer(buf), zmq::recv_flags::none);
        std::cout << "Received request: [" << (char*) buf.data() << "]" << std::endl;

        // Do some 'work'
        sleep (1);

        // Send reply back to client
        zmq::message_t reply (6);
        memcpy ((void *) reply.data (), "World", 6);
        try {
            socket.send (reply, zmq::send_flags::dontwait);
        }
        catch (zmq::error_t& e) {
            printf("ERROR: %s\n", e.what());
        }
    }
    return (NULL);
}
The much needed client code:
// client.cpp
#include <stdio.h>
#include <zmq.h>

int main (void)
{
    void *context = zmq_ctx_new ();

    // Socket to talk to server
    void *requester = zmq_socket (context, ZMQ_REQ);
    zmq_connect (requester, "tcp://localhost:5555");

    int request_nbr;
    for (request_nbr = 0; request_nbr != 10; request_nbr++) {
        zmq_send (requester, "Hello", 6, 0);
        char buf[6];
        zmq_recv (requester, buf, 6, 0);
        printf ("Received reply %d [%s]\n", request_nbr, buf);
    }
    zmq_close (requester);
    zmq_ctx_destroy (context);
    return 0;
}
The server worker does not have to poll manually because it has been wrapped into the zmq::proxy.

ZeroMQ with NORM - address already in use error was thrown on 2nd .bind() - why?

I'm using ZeroMQ with the NACK-Oriented Reliable Multicast ( NORM ) norm:// protocol. The documentation contains only Python code, so here is my C++ code:
PUB Sender :
string sendHost    = "norm://2,127.0.0.1:5556";// <NormNodeId>,<addr:port>
string tag         = "MyTag";
string sentMessage = "HelloWorld";
string fullMessage = tag + sentMessage;

zmq::context_t *context = new zmq::context_t( 20 );
zmq::socket_t   publisher( *context, ZMQ_PUB );

zmq_connect( publisher, sendHost.c_str() );
zmq_send(    publisher,
             fullMessage.c_str(),
             fullMessage.size(),
             0
             );
SUB Receiver :
char message[256];
string receiveHost = "norm://1,127.0.0.1:5556";// <NormNodeId>,<addr:port>
string tag         = "MyTag";

zmq::context_t *context = new zmq::context_t( 20 );
zmq::socket_t   subscriber( *context, ZMQ_SUB );

zmq_bind(       subscriber, receiveHost.c_str() );
zmq_setsockopt( subscriber, ZMQ_SUBSCRIBE, tag.c_str(), tag.size() );

int bytesReceived = zmq_recv( subscriber,
                              message,
                              256,
                              0
                              );
cout << bytesReceived << endl;
cout << message << endl;
The problem I'm facing: according to the documentation, .bind() and .connect() are interchangeable.
In my case both sides end up doing a .bind(), which causes ZeroMQ to throw an error saying the second bind fails, due to an address already in use error.
... they both do a bind, which causes ZeroMQ to throw an error saying the second bind fails
Yes, it is correct that this fails.
The first .bind() "takes ownership" of the port and this is an exclusive role.
The interchangeability of { .bind() | .connect() } is to be understood so that it does not matter which side .bind()-s and which one .connect()-s.
Until this moment I have seen no one interpret this property in such a manner that both sides would try to .connect() ( to a non-existent, .bind()-(not)-exposed Access Point ), much less try to .bind() an already "occupied" port ( in the case of residing on the same localhost ), or remain in a nox-et-solitudo state for the cases where either of the .bind()-s establishes such a .connect()-ready state on both ports on different localhost-s, which both thereafter remain in silent solitude ( forever ), as there is ( and will be ) no attempt to make any .connect()-ion go live and operational.
No, you need just one .bind(), which can from that moment on handle 0+ future .connect()-requests arriving to establish a live PUB/SUB channel, for any respective <transport-class> protocol, including the newly added norm://.
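A minimal sketch of that division of roles (my own illustration, not from the original answer; it reuses the question's endpoint strings, and whichever side owns the .bind(), the other side must only .connect()):

// publisher process (sketch): the single owner of the endpoint calls .bind() exactly once
zmq::context_t ctxA( 1 );
zmq::socket_t  publisher( ctxA, ZMQ_PUB );
publisher.bind( "norm://2,127.0.0.1:5556" );        // one .bind() per endpoint

// subscriber process (sketch): this peer, and any further ones, only .connect()
zmq::context_t ctxB( 1 );
zmq::socket_t  subscriber( ctxB, ZMQ_SUB );
subscriber.connect( "norm://1,127.0.0.1:5556" );     // <NormNodeId>,<addr:port> as in the question
subscriber.setsockopt( ZMQ_SUBSCRIBE, "MyTag", 5 );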
Anyways, welcome norm:// to the Family of ZeroMQ protocols.
Confused?
You may enjoy a further five-second read
about the main conceptual differences in [ ZeroMQ hierarchy in less than a five seconds ] or other posts and discussions here.

Can statvfs block on certain network devices? How to handle that case?

I am using keybase (a cloud-based data store for your SSH and other keys) and today somehow it did not restart when I started X Windows.
As a result, the command df (and thus statvfs() in my code) would just block after telling me that the transport was down.
$ df
df: '/home/alexis/"/home/alexis/.local/share/keybase/fs"': Transport endpoint is not connected
df: /run/user/1000/gvfs: Transport endpoint is not connected
_
The prompt would sit there and never return.
I don't care much that df gets stuck at the moment, but I'm wondering how I should update my C++ code to handle the case where statvfs() blocks, because that's not acceptable in my application. I just don't see a way to break out of that call without using a signal (SIGALRM comes to mind).
Is there a better way to handle this case?
(Note: my code is in C++, although a C solution should work just fine and is likely what is required, hence the tagging with both languages.)
This code wraps statvfs() in a function that sets up an alarm to interrupt the call. It will return -1 with errno set to EINTR should the alarm fire and interrupt the call to statvfs() (I haven't tried this so it may not be perfect...):
#include <signal.h>
#include <sys/statvfs.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

// alarm handler doesn't need to do anything
// other than simply exist
static void alarm_handler( int sig )
{
    return;
}

 .
 .
 .

// statvfs() with a timeout measured in seconds
// will return -1 with errno set to EINTR should
// it time out
int statvfs_try( const char *path, struct statvfs *s, unsigned int seconds )
{
    struct sigaction newact;
    struct sigaction oldact;

    // make sure they're entirely clear (yes I'm paranoid...)
    memset( &newact, 0, sizeof( newact ) );
    memset( &oldact, 0, sizeof( oldact ) );

    sigemptyset( &newact.sa_mask );

    // note that sa_flags does not have SA_RESTART set, so
    // statvfs should be interrupted on a signal
    // (hopefully your libc doesn't restart it...)
    newact.sa_flags = 0;
    newact.sa_handler = alarm_handler;
    sigaction( SIGALRM, &newact, &oldact );

    alarm( seconds );

    // clear errno
    errno = 0;

    int rc = statvfs( path, s );

    // save the errno value as alarm() and sigaction() might change it
    int saved_errno = errno;

    // clear any alarm and reset the signal handler
    alarm( 0 );
    sigaction( SIGALRM, &oldact, NULL );

    errno = saved_errno;
    return( rc );
}
That could also use some error checking, especially on the sigaction() calls, but it's long enough to generate a scroll bar already, so I left that out.
If you find your process remains stuck in the statvfs() call, and you're running on Linux, run your process under strace and trace the actual system calls. You should see the call to statvfs(), then an alarm signal that interrupts the statvfs() call. If you see another call to statvfs(), that means your libc has restarted the system call.
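To show how the wrapper might be called (a sketch that continues the code above; the 5-second timeout and the path are arbitrary):

#include <errno.h>
#include <stdio.h>

int main( void )
{
    struct statvfs s;

    // give statvfs() at most 5 seconds before SIGALRM interrupts it
    if ( statvfs_try( "/home/alexis", &s, 5 ) == -1 )
    {
        if ( errno == EINTR )
            fprintf( stderr, "statvfs() timed out\n" );
        else
            perror( "statvfs() failed" );
        return 1;
    }

    printf( "block size: %lu, blocks free: %lu\n",
            (unsigned long) s.f_bsize, (unsigned long) s.f_bfree );
    return 0;
}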

address reuse error when using fork() + execlp with boost::asio in Linux

I have a program which listens on a TCP port for a particular string and launches an application using an execlp call. I am doing a fork() to launch a child process before this execlp call. After this launch, the parent process again starts listening on the same port. I am closing the socket in the child process.
I have written a wrapper over boost::asio::tcp_socket where I am setting the addr_reuse option to true before binding the socket.
Now my problem is that on Linux I get an address reuse error after a few launches of the application. My program continuously tries to accept connections (or, more precisely, tries to schedule an accept on the boost::asio::io_service) until bind and then accept succeed, so I receive the error in this loop.
Strangely, if I close (or kill) the launched executable this error stops coming, which means bind succeeds. I am sure that the same port is not being used anywhere in the launched application.
I am using asynchronous socket operations. Any idea why I am getting this error?
Here is how I am accepting on the socket (I also call reset on the boost::asio::tcp_socket (_tcpSocket) shared pointer before starting a new accept):
boost::asio::ip::tcp::endpoint endPoint(boost::asio::ip::tcp::v4(), port);
_acceptor.reset( new boost::asio::ip::tcp::acceptor( *_ioService.get() ) );
_acceptor->open( endPoint.protocol() );
_acceptor->set_option( boost::asio::ip::tcp::acceptor::reuse_address(true) );

boost::system::error_code ec;
_acceptor->bind( endPoint, ec );
if ( ec.value() != boost::system::errc::success )
{
    ec.clear();
    _acceptor->close( ec );
    close();
    return false;
}

ec.clear();
_acceptor->listen( boost::asio::socket_base::max_connections, ec );
if ( ec.value() != boost::system::errc::success )
{
    return false;
}

_acceptor->async_accept( *_tcpSocket,
                         boost::bind( &TCPSocket::_handleAsyncAccept,
                                      this,
                                      boost::asio::placeholders::error ) );
Here is how I am forking:
pid_t pid = fork();
switch (pid)
{
    case 0:
    {
        /// close all sockets for child process, as it might cause addr reuse error in parent process
        _asyncNO->closeAll();
        std::string binary = "<binaryName>";
        std::string path = "<binaryPath>";
        if ( execlp( path.c_str(), binary.c_str(), controllerIP.c_str(), (char *)0 ) == -1 )
        {
            LOG_ERROR("System call failed !!")
        }
    }
    break;
    default:
        break;
}
I have removed logging for simplicity.
As #TannerSansbury said in the comments, this is likely because Boost.Asio needs to be notified of fork():
Newer version of the documentation on forking with Boost.Asio.
Relevant section reproduced here:
Boost.Asio supports programs that utilise the fork() system call. Provided the program calls io_service.notify_fork() at the appropriate times, Boost.Asio will recreate any internal file descriptors (such as the "self-pipe trick" descriptor used for waking up a reactor). The notification is usually performed as follows:
io_service_.notify_fork(boost::asio::io_service::fork_prepare);
if (fork() == 0)
{
    io_service_.notify_fork(boost::asio::io_service::fork_child);
    // ...
}
else
{
    io_service_.notify_fork(boost::asio::io_service::fork_parent);
    // ...
}
User-defined services can also be made fork-aware by overriding the io_service::service::fork_service() virtual function.
Note that any file descriptors accessible via Boost.Asio's public API (e.g. the descriptors underlying basic_socket<>, posix::stream_descriptor, etc.) are not altered during a fork. It is the program's responsibility to manage these as required.
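Applied to the launcher in the question, the fork/execlp block might look roughly like this (a sketch under the assumption that _ioService, _asyncNO and controllerIP exist as in the original code, and that the binary placeholders stay as placeholders):

_ioService->notify_fork( boost::asio::io_service::fork_prepare );

pid_t pid = fork();
if ( pid == 0 )
{
    // child: let Asio recreate its internal descriptors, then drop the
    // parent's sockets before replacing the process image
    _ioService->notify_fork( boost::asio::io_service::fork_child );
    _asyncNO->closeAll();
    execlp( "<binaryPath>", "<binaryName>", controllerIP.c_str(), (char *)0 );
    _exit( 1 );   // only reached if execlp() fails
}
else
{
    // parent: keep running the existing io_service and acceptor
    _ioService->notify_fork( boost::asio::io_service::fork_parent );
}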

C++ / Gloox: how to check when connection is down?

I'm trying to write my own Jabber bot in C++/gloox. Everything goes fine, but when the internet connection is down, the bot thinks that it's still connected, and when the connection is up again the bot of course doesn't respond to any message.
Every time once the bot has successfully connected, gloox's recv() returns ConnNoError, even if the interface is down and the cable unplugged.
I tried using blocking and non-blocking gloox connections and recv(), all without any result. Periodically checking the availability of the XMPP server in a different thread does not seem like a good idea, so how do I properly check whether the bot is connected right now or not?
If it's not possible to do with gloox only, please point me to some good method, but let it be available on Unix.
I have the same question, and found the reason why recv always returns ConnNoError. Here is what I found. When the connection is established, recv calls a function named dataAvailable in ConnectionTCPBase.cpp which returns
( ( select( m_socket + 1, &fds, 0, 0, timeout == -1 ? 0 : &tv ) > 0 ) && FD_ISSET( m_socket, &fds ) != 0 )
Searching Google, I found this thread; it said FD_ISSET( m_socket, &fds ) would detect whether the socket is readable but not whether it is closed ... the return value of FD_ISSET( m_socket, &fds ) is always 0, even when the network is down. In such a case the return value of dataAvailable is false, so the code below finally returns ConnNoError in recv.
if( !dataAvailable( timeout ) )
{
    m_recvMutex.unlock();
    return ConnNoError;
}
I don't know whether it is a bug or not; it seems not.
Later I tried another way: write to the socket directly, which will cause a SIGPIPE if the socket is closed; catch that signal, then use cleanup to disconnect.
I finally figured out a graceful solution to this problem, using a heartbeat.
In the gloox thread, call heartBeat(), where m_pClient is a pointer to an instance of gloox::Client:
void CXmpp::heartBeat()
{
    m_pClient->xmppPing(m_pClient->jid(), this);
    if (++heart > 3) {
        m_pClient->disconnect();
    }
}
xmppPing will register itself with the event handler; when the ping comes back, it will call handleEvent, and in handleEvent:
void CEventHandler::handleEvent(const Event& event)
{
    std::string sEvent;
    switch (event.eventType())
    {
        case Event::PingPing:
            sEvent = "PingPing";
            break;
        case Event::PingPong:
            sEvent = "PingPong";
            // received from server, decrease the heartbeat count
            --heart;
            break;
        case Event::PingError:
            sEvent = "PingError";
            break;
        default:
            break;
    }
    return;
}
I connect to the server, turn off the network, and 3 seconds later I got a disconnect!
You have to define onDisconnect(ConnectionError e) to be able to handle the disconnect event. The documentation is at http://camaya.net/api/gloox-0.9.9.12/classgloox_1_1ConnectionListener.html#a2
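For completeness, a minimal sketch of such a listener (the class name and the reconnect comment are mine; ConnectionListener also requires onConnect() and onTLSConnect() to be implemented):

#include <gloox/client.h>
#include <gloox/connectionlistener.h>

class BotConnectionListener : public gloox::ConnectionListener
{
public:
    virtual void onConnect()
    {
        // connection (re)established
    }

    virtual void onDisconnect( gloox::ConnectionError e )
    {
        // called by gloox when the stream is closed or an error is detected,
        // e.g. after the heartbeat logic above calls disconnect();
        // schedule a reconnect attempt here
    }

    virtual bool onTLSConnect( const gloox::CertInfo& info )
    {
        return true;   // accept the certificate (sketch only; verify it in real code)
    }
};

// registration, e.g. right after constructing the gloox::Client:
//     m_pClient->registerConnectionListener( new BotConnectionListener() );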