Boost and Windows sockets - Properly handling TCP client disconnect scenarios - c++

I have a class called ServerConnectionHandler that creates a boost thread for reading data from the server. The boost thread is bound to the ServerConnectionHandler object. Relevant pieces of code are below:
ServerConnectionHandler::~ServerConnectionHandler()
{
    close();
}

void ServerConnectionHandler::close()
{
    closesocket(m_ConnectSocket);
    WSACleanup();
}
void ServerConnectionHandler::MsgLoop()
{
    int size_recv = 0;
    char chunk[DEFAULT_BUFLEN];
    while (1)
    {
        memset(chunk, 0, DEFAULT_BUFLEN);
        size_recv = recv(m_ConnectSocket, chunk, DEFAULT_BUFLEN, 0);
        if (size_recv > 0)
        {
            for (int i = 0; i < size_recv; ++i)
            {
                if (chunk[i] == '\n')
                {
                    m_tcpEventHandler.OnClientMessage(m_RecBuffer);
                    m_RecBuffer.clear();
                }
                else
                {
                    m_RecBuffer.append(1, chunk[i]);
                }
            }
        }
        else if (size_recv == 0)
        {
            close();
            const std::string error = "MsgReceiver Received 0 bytes because connection was closed. MsgReceiver shutting down.\n";
            m_tcpEventHandler.OnClientSocketError(error);
            break;
        }
        else
        {
            char error[512];
            sprintf(error, "Error on Receiving Socket. Recv=[%d], WSAError=[%d]. MsgReceiver shutting down.\n", size_recv, WSAGetLastError());
            m_tcpEventHandler.OnClientSocketError(error);
            close();
            break;
        }
    }
    // NOTE: This will eventually call the destructor of ServerConnectionHandler...
    m_tcpEventHandler.OnClientDisconnect("Disconnected. Reason: Remote host snapped connection.");
}
My problem is that when close() is called in the destructor, the receiver thread is still running and crashes when it attempts to call any of the m_tcpEventHandler.OnClient...() methods because the object has been destroyed at this point.
I need to be able to handle this cleanly in 3 different cases:
1. When the user manually disconnects the client (the destructor will be called in this case).
2. When the client is disconnected from the server (for example because the server crashed).
3. When the application shuts down (needs to cleanly disconnect everything - similar to #1).
Right now, this code only works for case #2. I don't want to slow down the receiver thread with any locking, as performance is critical. From what I've read, people create a volatile bool flag that tells the receiver thread to stop. The problem I see with this approach: what if the thread is in the middle of handling a message (m_tcpEventHandler.OnClientMessage()) right when the destructor is called? Then it could immediately hit code for the destroyed object (m_tcpEventHandler could in turn use ServerConnectionHandler's member variables or methods). I can't think of a clean way to handle all 3 cases here.

Before you close the socket in the destructor, shut it down for input. That will cause the receive thread to get an end of stream and exit nicely. You might want to add a little handshake between the dtor and the receiver thread before the final close, or you might just want to rely on the receiver thread closing the socket and not close it in the dtor at all.
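For illustration, a minimal sketch of that ordering against the question's code. disconnect() and m_ReceiverThread are assumed names, not part of the original class; SD_BOTH is the standard Winsock shutdown flag:

void ServerConnectionHandler::disconnect()
{
    // Shut the socket down first: the blocked recv() in MsgLoop() then
    // returns 0 (or SOCKET_ERROR), and the receiver thread exits its loop.
    shutdown(m_ConnectSocket, SD_BOTH);

    // Simple handshake with the receiver thread: wait for it to leave
    // MsgLoop() before any member it touches can be destroyed.
    m_ReceiverThread.join();

    // Only now is it safe to release the socket.
    closesocket(m_ConnectSocket);
    WSACleanup();
}

Call this from the destructor (and from the application shutdown path) and cases #1 and #3 reduce to the same sequence; case #2 keeps working because MsgLoop() already breaks out when recv() returns 0.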

"My problem is that when close() is called in the destructor, the receiver thread is still running" - seems to me your problem is merely thread synchronization, then. Little to do with connections.
Making the communications asynchronous gives you a lot more control over the receiving thread.
You could e.g. use Boost Asio to do the asynchronous socket reads (and writes, of course). If you add an "infinite" deadline_timer to the asynch queue, you can just cancel() that timer, which could be used by the receiving thread to stop the receive and do some more cleanups (e.g. write a "Goodbye" message to the remote end).
(If the latter were not required, just cancelling all async operations could be achieved by simply shutting down the io_service. That would be rather uncourteous, but not a bad idea in fast shutdown paths.)
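A minimal sketch of the deadline_timer idea, with illustrative member names (m_stop_timer, m_socket) that are not from the question:

// Park a timer that never expires on the async queue at startup:
m_stop_timer.expires_at(boost::posix_time::pos_infin);
m_stop_timer.async_wait([this](const boost::system::error_code& ec)
{
    if (ec == boost::asio::error::operation_aborted)
    {
        // Reached only when cancel() is called: write a "Goodbye" message
        // here if desired, then close the socket, which also cancels any
        // outstanding async reads on it.
        m_socket.close();
    }
});

// From the shutdown path (post this through the io_service if it can be
// reached from another thread, since timers are not thread-safe):
m_stop_timer.cancel();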

Related

How to stop a detached thread which is blocking on a socket?

I have two threads. The first creates a Logic object, detaching a second thread to spin, blocking on OpenSSL socket to receive messages:
struct Logic
{
    Logic()
    {
        std::thread t1(&Logic::run, this);
        t1.detach();
    }

    void run()
    {
        while (true)
        {
            // Gets data from SSL (blocking socket)
            // Processes data
            // Updates timestamp
        }
    }

    uint64_t timestamp;
};
The first thread returns, enters a while loop and continually checks whether the detached thread is still running (or whether it's blocked permanently).
while (true)
{
    Logic logic;  // note: "Logic logic();" would declare a function, not an object
    while (true)
    {
        if (timestamp_not_updated)
        {
            break; // Break, destroy current Logic object and create another
        }
    }
}
If the timestamp stops being updated, the inner while loop breaks, causing the Logic object to be destroyed and a new one created.
When this restart behaviour triggers I get a seg fault. thread apply all bt shows 3 threads, not 2: the original detached thread (blocking on OpenSSL) still exists. I thought it would be destroyed along with the object.
How do I stop a detached thread which is blocking/waiting on a resource, so I can restart my class? I need the blocking behaviour because I don't have anything else to do (besides receive the packet), and it's better for performance than to keep calling into OpenSSL.
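No answer is quoted here, but the shutdown-before-close advice from the first answer above applies equally: keep the thread joinable and unblock it before destruction. A rough sketch, where ssl and socket_fd stand in for the question's elided OpenSSL state:

#include <openssl/ssl.h>
#include <sys/socket.h>
#include <cstdint>
#include <thread>

struct Logic
{
    Logic(SSL* s, int fd) : ssl(s), socket_fd(fd), t1(&Logic::run, this) {}

    ~Logic()
    {
        // Unblock the reader first, then join; a detached thread can never
        // be stopped safely while it blocks inside SSL_read.
        shutdown(socket_fd, SHUT_RDWR);
        t1.join();
    }

    void run()
    {
        char buf[4096];
        while (true)
        {
            int n = SSL_read(ssl, buf, sizeof buf);
            if (n <= 0)
                break;  // read fails once shutdown() tears down the transport
            // process data, update timestamp ...
        }
    }

    SSL* ssl;
    int socket_fd;
    uint64_t timestamp = 0;
    std::thread t1;  // declared last so it starts after the other members
};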

boost::asio::ip::tcp::socket.read_some() stops working. No exception or errors detected

I am currently debugging a server (win32/64) that uses Boost.Asio 1.78.
The code is a blend of legacy, older legacy, and some newer code. None of this code is mine, and I can't answer for why something is done in a certain way; I'm just trying to understand why this is happening and hopefully fix it without rewriting it from scratch. This code has been running for years on 50+ servers with no errors - just these 2 servers misbehave.
I have one client (dot.net) that is connected to two servers. The client sends the same data to the 2 servers. The servers run the same code, shown in the code section below.
All is working well, but now and then communication halts. No errors or exceptions on either end - it just halts, and never on both servers at the same time. This happens very seldom, like every 3 months or less often. I have no way of reproducing it in a debugger because I don't know where to look for this behavior.
On the client side the socket appears to be working/open but does not accept new data. No errors are detected in the socket.
Here's a shortened version of the code describing the functions. I want to stress that I can't detect any errors or exceptions during these failures. The code just stops at "m_socket->read_some()".
The only way to "unblock" right now is to close the socket manually and restart the acceptor. When I manually close the socket, the read_some method returns with an error code, so I know it is inside that call that execution stops.
Questions:
What may go wrong here and give this behavior?
What parameters should I log to enable me to determine what is happening, and from where.
main code:
std::shared_ptr<boost::asio::io_service> io_service_is = std::make_shared<boost::asio::io_service>();
auto is_work = std::make_shared<boost::asio::io_service::work>(*io_service_is.get());
auto acceptor = std::make_shared<TcpAcceptorWrapper>(*io_service_is.get(), port);
acceptor->start();
auto threadhandle = std::thread([&io_service_is]() {io_service_is->run();});
TcpAcceptorWrapper:
void start()
{
    m_asio_tcp_acceptor.open(boost::asio::ip::tcp::v4());
    m_asio_tcp_acceptor.bind(boost::asio::ip::tcp::endpoint(boost::asio::ip::tcp::v4(), m_port));
    m_asio_tcp_acceptor.listen();
    start_internal();
}
void start_internal()
{
    m_asio_tcp_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });
}
Handler code:
m_current_session = std::make_shared<TcpSession>(&m_socket);
std::condition_variable condition;
std::mutex mutex;
bool stopped(false);
m_current_session->run(condition, mutex, stopped);
{
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [&stopped] { return stopped; });
}
TcpSession runner:
void run(std::condition_variable& complete, std::mutex& mutex, bool& stopped)
{
    auto self(shared_from_this());
    std::thread([this, self, &complete, &mutex, &stopped]()
    {
        { // mutex scope
            // Lock and hold mutex from tcp_acceptor scope
            std::lock_guard<std::mutex> lock(mutex);
            while (true)
            {
                std::array<char, M_BUFFER_SIZE> buffer;
                try
                {
                    boost::system::error_code error;
                    /* Next call just hangs/blocks, but only rarely - like once every 3 months or more seldom */
                    std::size_t read = m_socket->read_some(boost::asio::buffer(buffer, M_BUFFER_SIZE), error);
                    if (error || read == -1)
                    {
                        // This never happens
                        break;
                    }
                    // inside this all is working
                    process(buffer);
                }
                catch (std::exception& ex)
                {
                    // This never happens
                    break;
                }
                catch (...)
                {
                    // Neither does this
                    break;
                }
            }
            stopped = true;
        } // mutex released
        complete.notify_one();
    }).detach();
}
This:
m_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });
Handler code:
std::condition_variable condition;
std::mutex mutex;
bool stopped(false);
m_current_session->run(condition, mutex, stopped);
{
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [&stopped] { return stopped; });
}
Is strange. It suggests you are using an "async" accept, but the handler blocks unconditionally until the session completes. That's the opposite of asynchrony. You could much more easily write the same code without the asynchrony, and also without the thread and the synchronization around it.
My intuition says something is blocking on the mutex. Have you established that the session's stack is actually inside the read_some frame, e.g. by breaking in a debugger during a "lock-up"?
When I manually close the socket the read_some method returns with error code so I know it is inside there I have an issue.
You can't legally do that. Your socket is in use on a thread - in a blocking read -, and you're bound to close it from a separate thread. That's a race-condition (see docs). If you want cancellable operations, use async_read*.
There are more code smells: read_some is a low-level primitive that is rarely what you want at the application level; detached threads with manual synchronization on termination could be packaged tasks; shared boolean flags could be atomics; notify_one outside the mutex could lead to thread starvation on some platforms; etc.
If you can share more code I'll be happy to sketch simplified solutions that remove the problems.
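For illustration, a hedged sketch of what the cancellable async_read* approach could look like for this session type (do_read() is an assumed name; m_socket, m_buffer and process() are from the question):

void TcpSession::do_read()
{
    auto self(shared_from_this());
    m_socket->async_read_some(
        boost::asio::buffer(m_buffer),
        [this, self](boost::system::error_code ec, std::size_t /*bytes*/)
        {
            if (ec)
                return;         // covers operation_aborted after cancel()/close()
            process(m_buffer);  // process() as in the question
            do_read();          // chain the next read; no dedicated thread needed
        });
}

Because the read is now asynchronous, shutdown becomes a matter of posting a cancel()/close() on the socket through the io_service, which removes the cross-thread race described above.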

For boost io_service, is only-one thread blocked on epoll_wait?

I read the source code of Boost.Asio, and I want to find out whether only one thread calls epoll_wait (assuming the epoll reactor is used).
I also want to understand how it deals with more than one thread calling epoll_wait, since that could cause different threads to read from the same socket at the same time.
I read some key codes as follows:
// Prepare to execute first handler from queue.
operation* o = op_queue_.front();
op_queue_.pop();
bool more_handlers = (!op_queue_.empty());
if (o == &task_operation_)
{
    task_interrupted_ = more_handlers;
    if (more_handlers && !one_thread_)
        wakeup_event_.unlock_and_signal_one(lock);
    else
        lock.unlock();
    task_cleanup on_exit = { this, &lock, &this_thread };
    (void)on_exit;
    // Run the task. May throw an exception. Only block if the operation
    // queue is empty and we're not polling, otherwise we want to return
    // as soon as possible.
    task_->run(!more_handlers, this_thread.private_op_queue);
}
task_ is the epoll reactor, and it calls epoll_wait in its run method.
I guess only one thread calls it, because there is only one task_operation_ in the op_queue_ - am I right?
If I want to use epoll with multiple threads, should I use "EPOLLONESHOT" to ensure that only one thread handles a given socket at a time?
The first case is when you are using a single instance of io_service and calling the io_service::run method from multiple threads.
Let's look at the scheduler::run function (simplified):
std::size_t scheduler::run(asio::error_code& ec)
{
    mutex::scoped_lock lock(mutex_);
    std::size_t n = 0;
    for (; do_run_one(lock, this_thread, ec); lock.lock())
        if (n != (std::numeric_limits<std::size_t>::max)())
            ++n;
    return n;
}
So, with the lock held, it calls the do_run_one method, which is something like:
std::size_t scheduler::do_run_one(mutex::scoped_lock& lock,
    scheduler::thread_info& this_thread,
    const asio::error_code& ec)
{
    while (!stopped_)
    {
        if (!op_queue_.empty())
        {
            // Prepare to execute first handler from queue.
            operation* o = op_queue_.front();
            op_queue_.pop();
            bool more_handlers = (!op_queue_.empty());
            if (o == &task_operation_)
            {
                task_interrupted_ = more_handlers;
                if (more_handlers && !one_thread_)
                    wakeup_event_.unlock_and_signal_one(lock);
                else
                    lock.unlock();
                task_cleanup on_exit = { this, &lock, &this_thread };
                (void)on_exit;
                task_->run(!more_handlers, this_thread.private_op_queue);
            }
            else
            {
                //......
            }
        }
        else
        {
            wakeup_event_.clear(lock);
            wakeup_event_.wait(lock);
        }
    }
    return 0;
}
The interesting part of the code are these lines:
if (more_handlers && !one_thread_)
    wakeup_event_.unlock_and_signal_one(lock);
else
    lock.unlock();
The case we are discussing now is the one with multiple threads, so the first condition will be satisfied (assuming we have quite a number of pending tasks in op_queue_).
What wakeup_event_.unlock_and_signal_one ends up doing is releasing/unlocking the lock and notifying one of the threads waiting on a condition variable. So, with this, at least one other thread (whoever gets the lock) can call do_run_one now.
The task_ in your case is epoll_reactor, as you said. In its run method it calls epoll_wait (while not holding the scheduler's lock_).
The interesting thing here is what it does when it iterates over all the ready descriptors that epoll_wait returned: it pushes them back onto the operation queue it received by reference as an argument. The operations pushed now have the runtime type descriptor_state instead of task_operation_:
for (int i = 0; i < num_events; ++i)
{
    void* ptr = events[i].data.ptr;
    if (ptr == &interrupter_)
    {
        // interrupter event: a wake-up only, not a descriptor operation
    }
    else
    {
        // don't call work_started() here. This still allows the scheduler to
        // stop if the only remaining operations are descriptor operations.
        descriptor_state* descriptor_data = static_cast<descriptor_state*>(ptr);
        descriptor_data->set_ready_events(events[i].events);
        ops.push(descriptor_data);
    }
}
So, in the next iteration of the while loop inside scheduler::do_run_one, the completed operations will hit the else branch (which I elided in my paste earlier):
else
{
    std::size_t task_result = o->task_result_;
    if (more_handlers && !one_thread_)
        wake_one_thread_and_unlock(lock);
    else
        lock.unlock();
    // Ensure the count of outstanding work is decremented on block exit.
    work_cleanup on_exit = { this, &lock, &this_thread };
    (void)on_exit;
    // Complete the operation. May throw an exception. Deletes the object.
    o->complete(this, ec, task_result);
    return 1;
}
This calls the complete function pointer, which in turn invokes the user-supplied handler passed to the async_read or async_write API.
The second case is where you create a pool of io_service objects and call each one's run method on 1 or more threads, i.e. the mapping between io_service objects and threads can be 1:1 or 1:N as suits your application. This way you can assign an io_service object to each socket object in round-robin fashion, as sketched below.
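A minimal sketch of such an io_service-per-thread pool (all names here are illustrative, not from Boost):

#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

// One io_service per thread: every socket assigned to a given io_service
// has all of its handlers run on exactly one thread.
class io_service_pool
{
public:
    explicit io_service_pool(std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            auto ios = std::make_shared<boost::asio::io_service>();
            work_.push_back(std::make_shared<boost::asio::io_service::work>(*ios));
            services_.push_back(ios);
            threads_.emplace_back([ios] { ios->run(); });
        }
    }

    ~io_service_pool()
    {
        work_.clear();  // let each run() return once its handlers drain
        for (auto& t : threads_)
            t.join();
    }

    // Hand out io_services round-robin, e.g. one per accepted socket.
    boost::asio::io_service& get()
    {
        return *services_[next_++ % services_.size()];
    }

private:
    std::vector<std::shared_ptr<boost::asio::io_service>> services_;
    std::vector<std::shared_ptr<boost::asio::io_service::work>> work_;
    std::vector<std::thread> threads_;
    std::size_t next_ = 0;
};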
Now, coming to your question:
If I wanna use epoll in multi-threading, or I may use "EPOLLONESHOT"
so that it can ensure that one thread handle one socket at one time.
If I understood this correctly, you want all the events for a given socket handled by one thread? I think this is possible by following approach number 2, i.e. creating a pool of io_service objects and mapping each one to a single thread. This way you can be sure that all the activity on a particular socket is handled by only one thread, namely the thread on which that io_service's run() is called.
You do not have to worry about setting EPOLLONESHOT in the above case.
I am not so sure about getting the same behaviour with the first approach (multiple threads and one io_service).
But if you are not using threads at all, i.e. your io_service runs on a single thread, then you don't have to worry about any of this - after all, the purpose of asio is to abstract all this away.
Only a single thread will invoke epoll_wait. Once the thread receives event notifications for descriptors, it will demultiplex the descriptors to all threads running the io_service. Per the Platform-Specific Implementation Notes:
Threads:
Demultiplexing using epoll is performed in one of the threads that calls io_service::run(), io_service::run_one(), io_service::poll() or io_service::poll_one().
A single descriptor will be processed by a single thread that will perform the I/O. Hence, when using asynchronous operations, I/O will not be performed concurrently for a given socket.

How to handle a SIGPIPE error inside the object that generated it?

I have two applications, one server and the other client, both written in C++ and Qt, but both of them also use a C library that uses C socket methods to perform socket communication between them (and all of this on Linux).
When both of them are connected and I close the client, the next time the server tries to send a message to it, it gets a SIGPIPE and closes. I did some research on the web and on SO to see how I could create a handler for SIGPIPE so that, instead of closing the application, I'd tell the timers that constantly send the information to stop.
Now I did learn how to simply handle the signal: create a method that receives an int and use signal(SIGPIPE, myMethod) inside main() or globally (note: I learned that from SO and yes, I know that signal() is obsolete).
But the problem is that this way I'm unable to stop sending information to the dead client, because the method that handles the signal needs to be either outside the class which sends the message or a static method, neither of which has access to my server object.
To clarify, here is the current architecture:
//main.cpp
void signal_callback_handler(int signum)
{
    qDebug() << "Caught signal SIGPIPE" << signum << "; closing the application";
    exit(EXIT_FAILURE);
}

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);
    app.setApplicationName("ConnEmulator");
    app.setApplicationVersion("1.0.0");
    app.setOrganizationName("Embrasul");
    app.setOrganizationDomain("http://www.embrasul.com.br");
    MainWidget window;
    window.show();
    /* Catch Signal Handler SIGPIPE */
    signal(SIGPIPE, signal_callback_handler);
    return app.exec();
}
//The MainWidget class (simplified)
MainWidget::MainWidget(QWidget *parent) :
    QWidget(parent),
    ui(new Ui::MainWidget),
    timerSendData(new QTimer(this))
{
    ui->setupUi(this);
    connect(timerSendData, SIGNAL(timeout()), this, SLOT(slotSendData()));
    timerSendData->start();
    //...
}

void MainWidget::slotSendData()
{
    //Prepares data
    //...
    //Here the sending message is called with send()
    if (hal_socket_write_to_client(&socket_descriptor, (u_int8_t *)buff_write, myBufferSize) == -1)
        qDebug() << "Error writing to client";
}
//Socket library
int hal_socket_write_to_client(socket_t *obj, u_int8_t *buffer, int size)
{
    struct s_socket_private * const socket_obj = (struct s_socket_private *)obj;
    int retval = send(socket_obj->client_fd, buffer, size, 0);
    if (retval < 0)
        perror("write_to_client");
    return retval;
}
So how can I make the MainWidget object created inside main() handle the signal so it can call timerSendData->stop()?
SIGPIPE is ugly, but possible to deal with in a way that's fully encapsulated, thread-safe, and does not affect anything but the code making the write that might cause SIGPIPE. The general method is:
1. Block SIGPIPE with pthread_sigmask (or sigprocmask, but the latter is not guaranteed to be safe in multi-threaded programs) and save the original signal mask.
2. Perform the operation that might raise SIGPIPE.
3. Call sigtimedwait with a zero timeout to consume any pending SIGPIPE signal.
4. Restore the original signal mask (unblocking SIGPIPE if it was unblocked before).
Here's a try at some sample code using this method, in the form of a pure wrapper to write that avoids SIGPIPE:
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

ssize_t write_nosigpipe(int fd, void *buf, size_t len)
{
    sigset_t oldset, newset;
    ssize_t result;
    siginfo_t si;
    struct timespec ts = {0};

    /* 1. Block SIGPIPE for this thread, saving the original mask. */
    sigemptyset(&newset);
    sigaddset(&newset, SIGPIPE);
    pthread_sigmask(SIG_BLOCK, &newset, &oldset);

    /* 2. Perform the write that might raise SIGPIPE. */
    result = write(fd, buf, len);

    /* 3. Consume any pending SIGPIPE (zero timeout, so this never blocks). */
    while (sigtimedwait(&newset, &si, &ts) >= 0 || errno != EAGAIN);

    /* 4. Restore the original signal mask. */
    pthread_sigmask(SIG_SETMASK, &oldset, 0);
    return result;
}
It's untested (not even compiled) and may need minor fixes, but hopefully gets the point across. Obviously for efficiency you'd want to do this on a larger granularity than single write calls (for example, you could block SIGPIPE for the duration of the whole library function until it returns to the outside caller).
An alternate design would be simply blocking SIGPIPE and never unblocking it, and documenting in the function's interface that it leaves SIGPIPE blocked (note: blocking is thread-local and does not affect other threads) and possibly leaves SIGPIPE pending (in the blocked state). Then the caller would be responsible for restoring it if necessary, so the rare caller that wants SIGPIPE could get it (but after your function finishes) by unblocking the signal while the majority of callers could happily leave it blocked. The blocking code works like in the above, with the sigtimedwait/unblocking part removed. This is similar to Maxim's answer except that the impact is thread-local and thus thread-safe.
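A sketch of that alternate design (the function name is illustrative):

#include <pthread.h>
#include <signal.h>

// Blocks SIGPIPE for the calling thread and leaves it blocked. The change is
// thread-local, so other threads keep their own masks; the rare caller that
// really wants SIGPIPE can unblock it again afterwards.
void block_sigpipe(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGPIPE);
    pthread_sigmask(SIG_BLOCK, &set, NULL);
}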
Now I did learn how to simply handle the signal: create a method that receives a int and use signal(SIGPIPE, myMethod)
You just need to ignore SIGPIPE, no handler is needed:
// don't raise SIGPIPE when sending into broken TCP connections
::signal(SIGPIPE, SIG_IGN);
But the problem is that by doing this way I'm unable to stop the sending of information to the dead client, for the method that handles the signal needs to be either outside the class which sends the message or a static method, which don't have access to my server object.
When SIGPIPE is ignored writing into a broken TCP connection returns error code EPIPE, which the socket wrappers you use should handle like the connection has been closed. Ideally, the socket wrapper should pass MSG_NOSIGNAL flag to send, so that send never raises SIGPIPE.
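Applied to the wrapper from the question, that suggestion looks roughly like this (a sketch; the EPIPE branch is where the caller-visible "connection closed" handling would go):

#include <errno.h>
#include <sys/socket.h>

int hal_socket_write_to_client(socket_t *obj, u_int8_t *buffer, int size)
{
    struct s_socket_private * const socket_obj = (struct s_socket_private *)obj;
    // MSG_NOSIGNAL: a broken connection returns -1 with errno == EPIPE
    // instead of raising SIGPIPE for the whole process.
    int retval = send(socket_obj->client_fd, buffer, size, MSG_NOSIGNAL);
    if (retval < 0 && errno == EPIPE)
    {
        // Peer is gone: the -1 propagates to MainWidget::slotSendData(),
        // which can then call timerSendData->stop() itself - no signal
        // handler and no global state needed.
    }
    return retval;
}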

boost asio io_service::run() exits 'early' - or not?

Can anyone tell me under what conditions boost::asio's io_service::run() method will return? The documentation for io_service::run() seems to suggest that as long as there is work to be done or handlers to be dispatched, run() won't return.
The reason I'm asking this is that we have a legacy https client that contacts a server and executes http POST's. The separation of concerns in the client is a bit different than what we'd like so we're changing a few things about it, but we're running into problems.
Right now, the client basically has a mis-named connect() call that effectively drives the entire protocol conversation with the server. The connect() call starts off by creating a boost::asio::ip::tcp::resolver object and calling ::async_resolve() on it. This starts a chain where new asio calls are made from within asio callbacks.
void connect()
{
    m_resolver.async_resolve(query, bind(&clientclass::resolve_callback, this));
    thread = new boost::thread(bind(&boost::asio::io_service::run, m_io_service));
}
void resolve_callback(error_code& e, resolver::iterator i)
{
    if (!e)
    {
        tcp::endpoint endpoint = *i;
        m_socket.lowest_layer().async_connect(endpoint, bind(&clientclass::connect_callback, this, _1, ++i));
    }
}
void connect_callback(error_code& e, resolver::iterator i)
{
    if (!e)
    {
        m_socket.async_handshake(boost::asio::ssl::stream_base::client,
            bind(&clientclass::handshake_callback, this, _1));
    }
}
void handshake_callback(error_code& e)
{
    if (!e)
    {
        mesg = format_hello_message();
        http_send(mesg, bind(&clientclass::hello_resp_handler, this, _1, _2));
    }
}
void http_send(stringstream& mesg, reply_handler handler)
{
    async_write(m_socket, m_request_buffer, bind(&clientclass::write_complete_callback, this, _1, handler));
}
void write_complete_callback(error_code& e, reply_handler handler)
{
    if (!e)
    {
        async_read_until(m_socket, m_reply_buffer, "\r\n\r\n", bind(&clientclass::handle_reply, this, handler));
    }
}
...
Anyways, this continues through the protocol until the protocol conversation is done. From the code here you can see that while connect() is running on the main thread, all of the subsequent callbacks and requests are coming back on the worker thread that is created in connect(). This is 'working' code.
When I try to break this chain up and expose it via an external interface, it stops working. In particular, I'm having handle_handshake() called outside of the clientclass object. Then http_send() is part of the interface (or is called by the external interface), and it creates a new worker thread to call io_service::run(). What happens is that even though async_write() has been called, and even though write_complete_callback() hasn't run yet, io_service::run() exits. It exits without error and claims that no handlers were dispatched - but there's still 'work' to be done?
So what I'm wondering is what is io_service::run()'s definition of 'work'? Is it a pending request? Why is it that io_service::run() never returns during this chain of requests and responses in the existing code, but when I try to start the thread up again and start a new chain, it returns almost immediately before it's finished its work?
The definition of work in the context of the run() call is any pending asynchronous operations on that io_service object. This includes the invocations of the handlers in response to an operation. So, if a handler for one operation starts another operation, there is always work available.
In addition, there is an io_service::work class that can be used to create work on an io_service that never completes until the object is destroyed.
When a single chain completes, the io_service has completed all asynchronous operations and all of the handlers have been invoked without starting a new operation, so it returns. Until you call io_service::reset(), further calls to run() will return without executing any operations.
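A short sketch of both points together (names are illustrative):

#include <boost/asio.hpp>
#include <memory>
#include <thread>

int main()
{
    boost::asio::io_service io;

    // The work object keeps run() from returning between async chains.
    auto work = std::make_unique<boost::asio::io_service::work>(io);
    std::thread t([&io] { io.run(); });

    // ... start async operations at any time; run() stays alive ...

    work.reset();  // drop the artificial work; run() returns once idle
    t.join();

    io.reset();    // required before run() can be called again
    return 0;
}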