multiple boost::asio ssl clients running on same system - c++

I have a simple Boost ASIO SSL client which calls a web API. The client is a slight modification of the Boost SSL documentation example.
//http.h
class Http {
public:
static void WebApiCall(...);
};
//http.cpp
void Http::WebApiCall(...) {
try {
// .......
boost::asio::io_service io_service;
tcp::resolver resolver(io_service);
tcp::resolver::query query(serverip, serverport);
tcp::resolver::iterator endpoint_iterator = resolver.resolve(query);
boost::asio::ssl::context ctx(io_service, boost::asio::ssl::context::tlsv1); // ERROR # 1
// ....
// Setting SSL Context Properties Here
// ....
boost::shared_ptr<boost::asio::ssl::stream<tcp::socket> > ssocket(new boost::asio::ssl::stream<tcp::socket>(io_service, ctx));
boost::asio::ip::tcp::endpoint endpoint = *endpoint_iterator;
ssocket->lowest_layer().connect(endpoint);
boost::system::error_code er;
ssocket->handshake(boost::asio::ssl::stream_base::client,er);
boost::asio::streambuf request;
std::ostream request_stream(&request);
// ....
// Set Headers & Body of HTTP Request here
// ....
size_t written = 0;
written = boost::asio::write(*ssocket, request); // ERROR # 2
// .....
// Read server response
boost::asio::streambuf response;
boost::system::error_code error;
int read_bytes = 0;
std::string TempBuf = "";
std::ostringstream responseStringstream;
std::stringstream response_stream;
while ( boost::asio::read(*ssocket,response,boost::asio::transfer_at_least(1), error)) {
read_bytes = read_bytes + response.size();
responseStringstream << &response;
}
// Do some stuff with server response....
// ....
} catch ( const boost::system::system_error &error ) {
// Print the exception ..
}
}
// client.cpp
Http::WebApiCall(<api_to_call>)
You can see it's a simple HTTP client with one static function that implements the actual SSL-enabled HTTP client using ASIO.
Use Case:
1000 processes of this client are running on one machine. All processes make a POST request periodically (e.g. once every minute) to one resource at approximately the same time. The machine runs Ubuntu and I do not seem to be out of memory (I have around 6 GB free).
This client works perfectly, but in one case, where I have to simulate some load on my server, I launched 1000 processes of this client, all on one machine, all calling the same API on the same server using the same public certificates, except that every client has its own OAuth token¹. In this situation I am getting two types of exceptions:
Errors:
ERROR # 2: Some clients (NOT ALL) get an error while writing (write: short read). From different forums and the Boost sources it seems the server is sending SSL_Shutdown, causing ASIO to throw this error, which, as per my findings, is normal behavior. My question is, why is the server sending SSL_Shutdown at this point? Does this have anything to do with multiple processes calling the same resource from the same machine? According to the ASIO docs, ASIO SSL is not thread safe, but in this case I am running only one thread in each of several processes (which I believe is perfectly safe); besides, the above code is itself thread safe. Is the underlying OpenSSL behaving erratically?
ERROR # 1: Sometimes I get an exception while creating the Boost ASIO SSL context, simply saying "context: ssl error". Again the same thoughts: why does it behave like this? Does this have something to do with multiple processes? Is OpenSSL mixing things up in this scenario?
My client has been running perfectly for the last year as one process per machine and I have never seen these errors before. Any thoughts are appreciated.
¹ (Just mentioning about OAuth but I don't think this has anything to do with it)

Q: Some clients (NOT ALL) while writing get error (write: short read). From different forums and Boost sources it seems the server is sending SSL_Shutdown causing ASIO to throw this error, which, as per my finding is normal behavior.
The most likely cause is that you're pushing the system beyond a resource limit. E.g. the client or the server may run out of file handles. On my Linux box the number of open files is limited to 1024 by default; ulimit -a outputs:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256878
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 95
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 256878
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
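The "open files" value above can also be read from inside a process. The following is a minimal sketch, assuming a POSIX system; the helper names open_file_limits and raise_open_file_limit_to_hard are made up for illustration and are not part of the question's code.

```cpp
#include <sys/resource.h>

// Reads the soft and hard limits on open file descriptors for this process.
// Returns false if the limits cannot be queried.
bool open_file_limits(rlim_t& soft, rlim_t& hard) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
    soft = lim.rlim_cur;
    hard = lim.rlim_max;
    return true;
}

// Tries to raise the soft limit up to the hard limit (hypothetical helper).
// Returns true if the kernel accepted the new value.
bool raise_open_file_limit_to_hard() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) return false;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```

Logging the soft limit at startup of each client process, and on the server, is one way to check whether a descriptor limit is a plausible cause before looking elsewhere.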
Q: My question is, why server is sending SSL_Shutdown at this point?
Most likely because of the above.
Q: Does this have to do anything with multiple processes calling the same resource from same machine?
No.
Q: From ASIO docs ASIO SSL is not thread safe, but in this case I am running only one thread but different processes (which I believe is perfectly safe), besides above code is itself thread safe. Is underlying openssl behaving erratically?
Thread safety or the underlying SSL library is not the issue here.
Q: Sometimes get an exception while creating Boost ASIO SSL Context, simply saying "context: ssl error". Again same thoughts, why it behaves like this? Does this has something to do with multiple processes, is openssl mixing things up in this scenario?
It's unlikely, but it's possible that each instance of ssl::context incurs overhead. You might try allocating it statically/out of the loop.
That said, it's more likely that the initialization of the SSL context simply runs into (the same) resource limit, as it will likely open some system-configuration files and/or check for the existence of well-known paths (e.g. the CApath etc.)

Related

OpenSSL client stuck in endless read

I am using cpp-httplib to retrieve some data from a server using long polling (that is, the client will issue a request to the server, and the server will just keep the connection open until the required data is available or a timeout is reached).
The program is running on my Raspberry Pi, which sits behind a router that does not have a static outgoing IP address. Every time the IP is reassigned (or, at least, close to that point in time), my program breaks: the thread currently performing the poll is stuck forever in httplib::SSLClient::Get, which is caused by a blocking read() syscall. Neither server nor client timeouts are able to do anything, whereas a connection close should make read immediately return 0, which is what I would have expected in this situation.
Inspecting the program with gdb shows the following:
(gdb) thread 2
(gdb) where
__libc_read (nbytes=5, buf=0x75608edb, fd=3) at ../sysdeps/unix/sysv/linux/read.c:26
__libc_read (fd=3, buf=0x75608edb, nbytes=5) at ../sysdeps/unix/sysv/linux/read.c:24
0x76d1862c in ?? () from /usr/lib/arm-linux-gnueabihf/libcrypto.so.1.1
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
I am not doing anything (as far as I know) that could accidentally overwrite return addresses.
For comparison, a 'healthy' stack trace during a SSLClient::Get can be found here.
The actual code is quite a lot, but here's a short version that shows the same behaviour:
#include <iostream>
#include <thread>
#define CPPHTTPLIB_OPENSSL_SUPPORT 1
#include "httplib.h"
void poll(httplib::SSLClient* c, char* path) {
while (true) {
auto response = c->Get(path);
// Get yields an empty result when the request fails.
if (response) std::cout << response->body << std::endl;
}
}
int main(int argc, char* argv[]) {
if (argc >= 3) {
httplib::SSLClient client(argv[1], 443, 20);
std::thread poll_thread(poll, &client, argv[2]);
poll_thread.join();
} else {
std::cerr << "Usage: ./poll <host> <path>" << std::endl;
return 1;
}
}
I can think of some workarounds that might or might not work, but I'd really like to know why and how this is happening in the first place.
Just expanding on the keep_alive option I mentioned in the comment.
In the scenario you described, it seems possible that the underlying TCP socket connection was terminated in an unclean fashion. I.e., you say the IP address was reassigned.
Ideally, when there is a TCP socket termination, you want your code to exit out of any blocked read/poll operation. That is what will happen for normal socket closures, e.g., say the remote process is killed, or the remote process just decides it is time to close. But if the IP address of your host is changed .... I'm not sure there will necessarily be a low-level TCP message that says, in effect, this connection is now closed. So the consequence for your program is that it can still hold a local socket (the local TCP endpoint), and not realise the connection has dropped.
This is where something like keep_alive comes in. The idea is that the kernel will send keep-alive packets to keep testing if the connection is established; if these ever fail, then it can close the local socket (and so your blocking read, or blocking select, will return with some sort of end-of-stream error).
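As a rough illustration of the kernel-level keep-alive idea, here is a sketch that switches it on for a plain TCP socket. It assumes a POSIX system; the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT tuning options are Linux-specific, so they are guarded. The function name enable_keepalive and its parameters are hypothetical, and how to get at the underlying descriptor in cpp-httplib depends on the library version and is not shown.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enables TCP keep-alive probes on an already created TCP socket.
//   idle_s     - seconds of silence before the first probe is sent
//   interval_s - seconds between two probes
//   count      - unanswered probes before the connection is considered dead
// Returns true if every option was applied.
bool enable_keepalive(int fd, int idle_s, int interval_s, int count) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0)
        return false;
#if defined(TCP_KEEPIDLE) && defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)
    // Linux-specific tuning; without it the system-wide defaults apply.
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) != 0)
        return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) != 0)
        return false;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) != 0)
        return false;
#else
    (void)idle_s; (void)interval_s; (void)count;
#endif
    return true;
}
```

With values such as enable_keepalive(fd, 60, 10, 3), a silently dropped connection should surface as a read error after roughly 60 + 3 * 10 seconds instead of blocking indefinitely. The exact timing is up to the kernel.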
Separately to keep_alive, you can also consider application heart-beat messages (e.g., websocket has ping/pong). In addition to ensuring the TCP connection remains established, it confirms whether the remote application is healthy.

ZeroMq: Too many open files.. Number of fd usage growing continuously on the same object

I create objects of the same class, each of which includes 2 ZeroMQ subscriber sockets and 1 ZeroMQ request socket, in different threads. I use inproc ZeroMQ sockets that belong to the same ZContext.
Each time I create the object, the number of open files (lsof | wc -l) on the server (running CentOS 7) increases. Creating the first object increases the number of open files by 300, the second one by 304, and it keeps growing.
As my program can use many of these objects during runtime, this can result in a too many open files error for ZeroMQ even though I set the limit to 524288 (ulimit -n). As the number of objects gets higher, each object consumes more of the open file limit, some of them around 1500.
During runtime my program crashes with the too many open files error when many objects have been created and threads are doing their work (sending messages to another server or clients) on the objects.
How can I overcome this?
example code:
void Agent::run(void *ctx) {
zmq::context_t *_context = (zmq::context_t *) ctx;
zmq::socket_t dataSocket(*(_context),ZMQ_SUB);
zmq::socket_t orderRequestSocket(*(_context),ZMQ_REQ);//REQ
std::string bbpFilter = "obprice.1";
std::string bapFilter = "obprice.2";
std::string orderFilter = "order";
dataSocket.connect("inproc://ordertrade_publisher");
dataSocket.connect("inproc://orderbook_prices_pub");
orderRequestSocket.connect("inproc://frontend_oman_agent");
int rc;
try {
zmq::message_t filterMessage;
zmq::message_t orderMessage;
rc = dataSocket.recv(&filterMessage);
dataSocket.recv(&orderMessage);
//CALCULATION AND SEND ORDER
// end:
return;
}
catch(std::exception& e) {
std::cerr<< "Exception:" << e.what() << std::endl;
Order.cancel_order(orderRequestSocket);
return;
}
}
I'm running into this as well. I'm not sure I have a solution, but I see that a context (zmq::context_t) has a maximum number of sockets. See zmq_ctx_set for more detail. This limit defaults to ZMQ_MAX_SOCKETS_DFLT which appears to be 1024.
You might just need to increase the number of sockets your context can have, although I suspect there might be some leaking going on (at least in my case).
UPDATE:
I was able to fix my leak through a combination of socket options:
- ZMQ_RCVTIMEO - I was already using this to avoid waiting forever if the other end wasn't there. My system handles this by only making one request on a socket, then closing it.
- ZMQ_LINGER - set to 0 so the socket doesn't wait around trying to send the failed message. The default behavior is infinite linger. This is probably the key to your problem.
- ZMQ_IMMEDIATE - this option restricts the queueing of messages to only completed connections. Without a queue, there's no need for the socket to linger.
I can't say for sure if I need both linger and immediate, but they both seemed appropriate to my use case; they might help yours. With these options set, my number of open files does not grow infinitely.

Windows XP socket error with recv()

I'm having a strange behaviour with the recv() function.
My C++ (MFC) application with WinSock implements a simple HTTP client (non-blocking socket) for accessing HTML pages on a web server. Some of these pages take a few seconds to load. On Windows 7 this is not a problem, because recv() also returns partial data. But on Windows XP the recv() function always returns SOCKET_ERROR and the error code is WSAEWOULDBLOCK. Only when the connection is finished is the data returned in one access.
Does anyone know this problem? How can I force Windows XP to also receive partial data?
I set the buffer size (SO_RCVBUF) to 1000 bytes. On Windows 7 this is also reflected in the TCP window size; on XP it is not.
The real problem I have with this issue is that I don't know how to check whether the connection is still alive. How can I check if a connection is still alive? Or how can I specify a timeout (max time between two received packets from the server)?
By default, a socket operates in blocking mode, so the only way you can get a WSAEWOULDBLOCK error at all is if you explicitly put the socket into non-blocking mode instead. Doing so, you agree to handle WSAEWOULDBLOCK (otherwise, don't use non-blocking mode).
WSAEWOULDBLOCK is not a real error, it is just an indication that the operation you attempted to perform cannot be completed at that moment because it would block the calling thread. You need to detect this "error" and simply retry the same operation again at a later time, preferably after a socket state change is detected.
For recv(), WSAEWOULDBLOCK simply means there is no data available on the socket to be read at that moment. In non-blocking mode, you should be using select() (or WSAEventSelect(), or WSAAsyncSelect(), or Overlapped I/O, or an I/O Completion Port) to detect inbound data before you then read it.
That being said, you are implementing an HTTP client, so you must follow the HTTP protocol properly, regardless of the socket I/O mode you are using, regardless of your socket buffer sizes. You must follow the pseudo code logic I outlined in this answer on another question:
You must follow the rules outlined in RFC 2616. Namely:
1. Read until the "\r\n\r\n" sequence is encountered. Do not read any more bytes past that yet.
2. Analyze the received headers, per the rules in RFC 2616 Section 4.4. They tell you the actual format of the remaining response data.
3. Read the data per the format discovered in #2.
4. Check the received headers for the presence of a Connection: close header if the response is using HTTP 1.1, or the lack of a Connection: keep-alive header if the response is using HTTP 0.9 or 1.0. If detected, close your end of the socket connection because the server is closing its end. Otherwise, keep the connection open and re-use it for subsequent requests (unless you are done using the connection, in which case do close it).
5. Process the received data as needed.
In short, you need to do something more like this instead (pseudo code):
string headers[];
byte data[];
string statusLine = read a CRLF-delimited line;
int statusCode = extract from status line;
string responseVersion = extract from status line;
do
{
string header = read a CRLF-delimited line;
if (header == "") break;
add header to headers list;
}
while (true);
if ( !((statusCode in [1xx, 204, 304]) || (request was "HEAD")) )
{
if (headers["Transfer-Encoding"] ends with "chunked")
{
do
{
string chunk = read a CRLF delimited line;
int chunkSize = extract from chunk line;
if (chunkSize == 0) break;
read exactly chunkSize number of bytes into data storage;
read and discard until a CRLF has been read;
}
while (true);
do
{
string header = read a CRLF-delimited line;
if (header == "") break;
add header to headers list;
}
while (true);
}
else if (headers["Content-Length"] is present)
{
read exactly Content-Length number of bytes into data storage;
}
else if (headers["Content-Type"] == "multipart/byteranges")
{
string boundary = extract from Content-Type header;
read into data storage until terminating boundary has been read;
}
else
{
read bytes into data storage until disconnected;
}
}
if (!disconnected)
{
if (responseVersion == "HTTP/1.1")
{
if (headers["Connection"] == "close")
close connection;
}
else
{
if (headers["Connection"] != "keep-alive")
close connection;
}
}
check statusCode for errors;
process data contents, per info in headers list;
As you can see, HTTP requires reading CRLF-delimited lines of text, or fixed lengths of raw bytes. To do that, you must call recv() in a loop until you encounter the terminating CRLF, or have received the expected number of bytes, whichever the case may be. Whether you use a synchronous loop that just ignores WSAEWOULDBLOCK errors while looping, or you use a state machine driven by asynchronous events/callbacks, that is up to you to decide. That doesn't change how you must process the HTTP protocol.
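The "read a CRLF-delimited line" step can be kept separate from the socket I/O by appending whatever recv() delivers to a buffer and extracting complete lines from it. A minimal sketch of that idea follows; the helper name pop_crlf_line is hypothetical, and the buffer is assumed to be a std::string that the caller fills from recv().

```cpp
#include <string>

// Removes the first CRLF-terminated line from `buffer` and stores it
// (without the CRLF) in `line`. Returns false, leaving both arguments
// untouched, when no complete line has arrived yet.
bool pop_crlf_line(std::string& buffer, std::string& line) {
    const std::string::size_type pos = buffer.find("\r\n");
    if (pos == std::string::npos) return false;
    line = buffer.substr(0, pos);
    buffer.erase(0, pos + 2);  // drop the line and its CRLF
    return true;
}
```

A false return simply means "call recv() again and append", which fits both a synchronous loop and an event-driven state machine. An empty line returned after the headers marks the end of the header block.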
This applies to all versions of Windows (even all platforms that use BSD-style socket APIs). What you are encountering is not a Windows bug at all. It is an underlying flaw in your understanding of how to use socket I/O correctly and effectively.
As for checking if the connection is alive, recv() will return 0 if the server closed the connection gracefully, or will report an error otherwise (usually WSAECONNABORTED or WSAECONNRESET, though there can be others). But an abnormal disconnect may take a long time to detect, so you should implement timeouts in your code instead. In synchronous mode, you can use setsockopt(SO_RCVTIMEO). In non-blocking mode, you can use select(). In asynchronous (overlapped) mode, you can use WaitForSingleObject() on whatever event/object you use to drive your state machine.
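The select()-based timeout can be sketched as below. To keep the example self-contained it uses the POSIX socket API rather than WinSock, so treat it as an illustration of the pattern only: under WinSock the would-block check would compare WSAGetLastError() against WSAEWOULDBLOCK instead of inspecting errno. The RecvStatus enum and the function name recv_with_timeout are invented for this sketch.

```cpp
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <cerrno>
#include <cstddef>

// Outcome of one receive attempt (illustrative names).
enum RecvStatus { RECV_DATA, RECV_TIMEOUT, RECV_CLOSED, RECV_ERROR };

// Waits up to timeout_ms for the socket to become readable, then reads once.
// On RECV_DATA, `received` holds the number of bytes stored in buf.
// `len` is expected to be greater than zero.
RecvStatus recv_with_timeout(int fd, char* buf, std::size_t len,
                             int timeout_ms, std::size_t& received) {
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    const int ready = select(fd + 1, &readable, NULL, NULL, &tv);
    if (ready == 0) return RECV_TIMEOUT;   // nothing arrived in time
    if (ready < 0) return RECV_ERROR;

    const ssize_t n = recv(fd, buf, len, 0);
    if (n > 0) {
        received = static_cast<std::size_t>(n);
        return RECV_DATA;                  // possibly only part of the message
    }
    if (n == 0) return RECV_CLOSED;        // peer closed gracefully
    // On a non-blocking socket a would-block result is not a failure.
    if (errno == EWOULDBLOCK || errno == EAGAIN) return RECV_TIMEOUT;
    return RECV_ERROR;
}
```

The caller decides how many consecutive RECV_TIMEOUT results count as a dead connection, which gives the "max time between two received packets" behaviour asked about in the question.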
You can't expect recv to give you any data on a non-blocking socket. If there's no data available it returns WOULDBLOCK. You just need to call recv again (normally after select notifies you some data is available). Whether you get data on the first (or any) call is going to depend on how fast the server is sending it.
When the socket is closed you'll get a different error from recv, like WSAECONNRESET or WSAENOTCONN. select will also notify you when the socket is closed.
It's very strange.
Today I have changed my software to use blocking sockets. But it still doesn't work on Windows XP. Windows 7 is no problem.
So I thought: Let's try another PC. On this PC (also Windows XP) it does work. Now I tried a 3rd PC with Windows XP and here it also works.
I still don't know what the problem is but I think there must be a bug with the PC.

Boost.Asio - how to check that all intermediate handlers performed in a strand?

I have client-server app that uses Boost.Asio and SSL. Server has many threads. Server uses async_read and sometimes I get invoked async handler with bytes_received != bytes_expected and ec == 0. Most of the time app misses 1-500 bytes out of 2048-16384 bytes (usual size of logical packets in the app).
My issue is similar to this one - Why boost::asio::async_read completion sometimes executed with bytes_transferred=0 and ec=0?.
To fix it I made a dedicated io_service::strand for each ssl::stream and one io_service::strand for the ssl::context (which is shared among threads). I wrapped all async invocations with io_service::strand::wrap() (async_read, async_write, async_wait). Routines are launched via io_service::strand::post(). I added debug checks before each invocation of a socket method (io_service::strand::running_in_this_thread()). But I still have broken packets...
According to this answer https://stackoverflow.com/a/12801042/1802974, during the processing of async_read/async_write Boost.Asio can create intermediate handlers, which must run inside the strand. This is the last chance to fix this weird issue. How can I check that all intermediate handlers are actually performed inside the dedicated socket strand?
UPDATE
After a lot of debugging I found my issue - it is not connected to SSL at all. That was because of wrong manipulation of buffers...
In my app I use std::vector<char> as a buffer (two vectors for each socket - one for read operations and one for write operations). The vectors are created right after the socket is opened, with default (zero) length.
I have a binary protocol. Each packet consists of a header and a payload. The header contains the length of the payload. After the server receives the header of a packet, it checks that the read buffer is sufficient to receive the payload. If the buffer is smaller than the packet payload, the server increases the buffer.
Code that was used to manage buffer size (do not do this ever):
if (buf.capacity() >= newSize)
{
return;
}
buf.resize(newSize);
After that I constructed asio buffer like this:
boost::asio::buffer(myVector, expectedPacketLength)
I.e. this overload was used: boost.org
Now we will see the error:
- Client sends a packet of length 400
- Server increases the size of the buffer to 400 (size == 400, capacity == 400)
- Server successfully receives the packet
- Client sends a packet of length 415
- Server increases the size of the buffer to 415. But `std::vector` has its own logic to increase capacity and reallocate data. After 'resizing', the actual parameters are capacity == 800, size == 415.
- Server successfully receives the packet
- Client sends a packet of length 430
- Server tries to increase the size of the buffer, but the capacity is already bigger than 430, so nothing is done.
- Server creates the buffer. But the mentioned overload of `boost::asio::buffer` has its own logic: the size of the created buffer is 415.
- Voila, we miss the last 15 bytes of the packet
Why I post it here?
Be careful with the buffers )) Just use std::vector::reserve to increase the capacity and use the most basic overload of boost::asio::buffer boost.org
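The size/capacity mix-up can be reproduced with std::vector alone, without Boost. In the sketch below, grow_by_capacity mirrors the check from the post, while grow_by_size is an alternative that is not from the original post and compares against size() instead. Both function names are made up for this illustration. It relies on the fact that boost::asio::buffer(vector, n) never yields more than the vector's size() in bytes.

```cpp
#include <cstddef>
#include <vector>

// The check from the post: it compares against capacity(), so size() can
// stay smaller than `needed` whenever the vector already owns enough memory.
void grow_by_capacity(std::vector<char>& buf, std::size_t needed) {
    if (buf.capacity() >= needed) return;
    buf.resize(needed);
}

// Alternative check (an assumption, not the author's fix): it compares
// against size(), so size() >= needed always holds afterwards.
void grow_by_size(std::vector<char>& buf, std::size_t needed) {
    if (buf.size() >= needed) return;
    buf.resize(needed);
}
```

After grow_by_size(buf, n), a buffer built from the vector with length n covers the whole packet, because the vector's size is at least n.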

zeromq: reset REQ/REP socket state

When you use the simple ZeroMQ REQ/REP pattern you depend on a fixed send()->recv() / recv()->send() sequence.
As this article describes you get into trouble when a participant disconnects in the middle of a request because then you can't just start over with receiving the next request from another connection but the state machine would force you to send a request to the disconnected one.
Has there emerged a more elegant way to solve this since the mentioned article has been written?
Is reconnecting the only way to solve this (apart from not using REQ/REP but use another pattern)
As the accepted answer seems so terribly sad to me, I did some research and found that everything we need was actually in the documentation.
Calling .setsockopt() with the correct parameter can help you reset your socket's state machine without brutally destroying it and rebuilding another one on top of the previous one's dead body.
(yeah I like the image).
ZMQ_REQ_CORRELATE: match replies with requests
The default behaviour of REQ sockets is to rely on the ordering of messages to match requests and responses and that is usually sufficient. When this option is set to 1, the REQ socket will prefix outgoing messages with an extra frame containing a request id. That means the full message is (request id, 0, user frames…). The REQ socket will discard all incoming messages that don't begin with these two frames.
Option value type int
Option value unit 0, 1
Default value 0
Applicable socket types ZMQ_REQ
ZMQ_REQ_RELAXED: relax strict alternation between request and reply
By default, a REQ socket does not allow initiating a new request with zmq_send(3) until the reply to the previous one has been received. When set to 1, sending another message is allowed and has the effect of disconnecting the underlying connection to the peer from which the reply was expected, triggering a reconnection attempt on transports that support it. The request-reply state machine is reset and a new request is sent to the next available peer.
If set to 1, also enable ZMQ_REQ_CORRELATE to ensure correct matching of requests and replies. Otherwise a late reply to an aborted request can be reported as the reply to the superseding request.
Option value type int
Option value unit 0, 1
Default value 0
Applicable socket types ZMQ_REQ
The complete documentation is here
The good news is that, as of ZMQ 3.0 and later (the modern era), you can set a timeout on a socket. As others have noted elsewhere, you must do this after you have created the socket, but before you connect it:
zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 ) # milliseconds
Then, when you actually try to receive the reply (after you have sent a message to the REP socket), you can catch the error that will be asserted if the timeout is exceeded:
try:
    send( message, 0 )
    send_failed = False
except zmq.Again:
    logging.warning( "Image send failed." )
    send_failed = True
However! When this happens, as observed elsewhere, your socket will be in a funny state, because it will still be expecting the response. At this point, I cannot find anything that works reliably other than just restarting the socket. Note that if you disconnect() the socket and then re connect() it, it will still be in this bad state. Thus you need to
def reset_my_socket():
    # Rebind the module-level socket name to the fresh socket.
    global zmq_req_socket
    zmq_req_socket.close()
    zmq_req_socket = zmq_context.socket( zmq.REQ )
    zmq_req_socket.setsockopt( zmq.RCVTIMEO, 500 ) # milliseconds
    zmq_req_socket.connect( zmq_endpoint )
You will also notice that because I close()d the socket, the receive timeout option was "lost", so it is important to set that on the new socket.
I hope this helps. And I hope that this does not turn out to be the best answer to this question. :)
There is one solution to this and that is adding timeouts to all calls. Since ZeroMQ by itself does not really provide simple timeout functionality I recommend using a subclass of the ZeroMQ socket that adds a timeout parameter to all important calls.
So, instead of calling s.recv() you would call s.recv(timeout=5.0) and if a response does not come back within that 5 second window it will return None and stop blocking. I had made a futile attempt at this when I ran into this problem.
I'm actually looking into this at the moment, because I am retro fitting a legacy system.
I am coming across code constantly that "needs" to know about the state of the connection. However the thing is I want to move to the message passing paradigm that the library promotes.
I found the following function : zmq_socket_monitor
What it does is monitor the socket passed to it and generate events that are then passed to an "inproc" endpoint - at that point you can add handling code to actually do something.
There is also an example (actually test code) here : github
I have not got any specific code to give at the moment (maybe at the end of the week) but my intention is to respond to the connect and disconnects such that I can actually perform any resetting of logic required.
Hope this helps, and despite quoting 4.2 docs, I am using 4.0.4 which seems to have the functionality as well.
Note I notice you talk about python above, but the question is tagged C++ so that's where my answer is coming from...
Update: I'm updating this answer with this excellent resource here: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ Socket programming is complicated so do check out the references in this post.
None of the answers here seem accurate or useful. The OP is not looking for information on BSD socket programming. He is trying to figure out how to robustly handle accept()ed client-socket failures in ZMQ on the REP socket to prevent the server from hanging or crashing.
As already noted -- this problem is complicated by the fact that ZMQ tries to pretend that the server's listen()ing socket is the same as an accept()ed socket (and there is nowhere in the documentation that describes how to set basic timeouts on such sockets.)
My answer:
After doing a lot of digging through the code, the only relevant socket options passed along to accept()ed socks seem to be keep alive options from the parent listen()er. So the solution is to set the following options on the listen socket before calling send or recv:
void zmq_setup(zmq::context_t** context, zmq::socket_t** socket, const char* endpoint)
{
// Free old references.
if(*socket != NULL)
{
// The destructor closes the socket.
delete *socket;
*socket = NULL;
}
if(*context != NULL)
{
// Shutdown all previous server client-sockets.
// The destructor terminates the context.
delete *context;
*context = NULL;
}
*context = new zmq::context_t(1);
*socket = new zmq::socket_t(**context, ZMQ_REP);
// Enable TCP keep alive.
int is_tcp_keep_alive = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE, &is_tcp_keep_alive, sizeof(is_tcp_keep_alive));
// Only send 2 probes to check if client is still alive.
int tcp_probe_no = 2;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_CNT, &tcp_probe_no, sizeof(tcp_probe_no));
// How long does a con need to be "idle" for in seconds.
int tcp_idle_timeout = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_IDLE, &tcp_idle_timeout, sizeof(tcp_idle_timeout));
// Time in seconds between individual keep alive probes.
int tcp_probe_interval = 1;
(**socket).setsockopt(ZMQ_TCP_KEEPALIVE_INTVL, &tcp_probe_interval, sizeof(tcp_probe_interval));
// Discard pending messages in buf on close.
int is_linger = 0;
(**socket).setsockopt(ZMQ_LINGER, &is_linger, sizeof(is_linger));
// TCP user timeout on unacknowledged send buffer
int is_user_timeout = 2;
(**socket).setsockopt(ZMQ_TCP_MAXRT, &is_user_timeout, sizeof(is_user_timeout));
// Start internal enclave event server.
printf("Host: Starting enclave event server\n");
(**socket).bind(endpoint);
}
What this does is tell the operating system to aggressively check the client socket for timeouts and reap them for cleanup when a client doesn't return a heart beat in time. The result is that the OS will send a SIGPIPE back to your program and socket errors will bubble up to send / recv - fixing a hung server. You then need to do two more things:
1. Handle SIGPIPE errors so the program doesn't crash
#include <signal.h>
#include <zmq.hpp>
// zmq_setup def here [...]
int main(int argc, char** argv)
{
// Ignore SIGPIPE signals.
signal(SIGPIPE, SIG_IGN);
// ... rest of your code after
// (Could potentially also restart the server
// sock on N SIGPIPEs if you're paranoid.)
// Start server socket.
const char* endpoint = "tcp://127.0.0.1:47357";
// Must start as NULL because zmq_setup checks both pointers.
zmq::context_t* context = NULL;
zmq::socket_t* socket = NULL;
zmq_setup(&context, &socket, endpoint);
// Message buffers.
zmq::message_t request;
zmq::message_t reply;
// ... rest of your socket code here
}
2. Check for -1 returned by send or recv and catch ZMQ errors.
// E.g. skip broken accepted sockets (pseudo-code.)
while (1)
{
try
{
if ((*socket).recv(&request) == -1)
throw -1;
}
catch (...)
{
// Prevent any endless error loops killing CPU.
sleep(1);
// Reset ZMQ state machine.
try
{
zmq::message_t blank_reply = zmq::message_t();
(*socket).send (blank_reply);
}
catch (...)
{
// Ignore errors from the reset attempt.
}
continue;
}
// ... handle the request and send the reply here
}
Notice the weird code that tries to send a reply on a socket failure? In ZMQ, a REP server "socket" is an endpoint to another program making a REQ socket to that server. The result is if you go do a recv on a REP socket with a hung client, the server sock becomes stuck in a broken receive loop where it will wait forever to receive a valid reply.
To force an update on the state machine, you try to send a reply. ZMQ detects that the socket is broken, and removes it from its queue. The server socket becomes "unstuck", and the next recv call returns a new client from the queue.
To enable timeouts on an async client (in Python 3), the code would look something like this:
import asyncio
import zmq
import zmq.asyncio
ctx = zmq.asyncio.Context()
async def req(endpoint):
    ms = 2000 # In milliseconds.
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.SNDTIMEO, ms)
    sock.setsockopt(zmq.RCVTIMEO, ms)
    sock.setsockopt(zmq.LINGER, ms) # Drop unsent messages at most ms after close().
    sock.setsockopt(zmq.CONNECT_TIMEOUT, ms)
    # Connect the socket.
    # Connections don't strictly happen here.
    # ZMQ waits until the socket is used (which is confusing, I know.)
    sock.connect(endpoint)
    # Send some bytes.
    await sock.send(b"some bytes")
    # Recv bytes and convert to unicode.
    msg = await sock.recv()
    msg = msg.decode(u"utf-8")
Now you have some failure scenarios when something goes wrong.
By the way -- if anyone's curious -- the default value for TCP idle timeout in Linux seems to be 7200 seconds or 2 hours. So you would be waiting a long time for a hung server to do anything!
Sources:
https://github.com/zeromq/libzmq/blob/84dc40dd90fdc59b91cb011a14c1abb79b01b726/src/tcp_listener.cpp#L82 TCP keep alive options preserved for client sock
http://www.tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/ How does keep alive work
https://github.com/zeromq/libzmq/blob/master/builds/zos/README.md Handling sig pipe errors
https://github.com/zeromq/libzmq/issues/2586 for information on closing sockets
https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/
https://github.com/zeromq/libzmq/issues/976
Disclaimer:
I've tested this code and it seems to be working, but ZMQ does complicate testing this a fair bit because the client re-connects on failure? If anyone wants to use this solution in production, I recommend writing some basic unit tests, first.
The server code could also be improved a lot with threading or polling to be able to handle multiple clients at once. As it stands, a malicious client can temporarily take up resources from the server (3 second timeout) which isn't ideal.