What causes a random crash in boost::coroutine?

I have a multithreaded application which uses boost::asio and boost::coroutine via its integration in boost::asio. Every thread has its own io_service object. The only shared state between threads is the set of connection pools, which are locked with a mutex when a connection is taken from or returned to a pool. When there are not enough connections in the pool, I push an infinite asio::steady_timer into the internal structure of the pool, wait on it asynchronously, and yield from the coroutine function. When another thread returns a connection to the pool, it checks whether there are waiting timers; if so, it takes a waiting timer from the internal structure, gets its io_service object, and posts a lambda which wakes up the timer to resume the suspended coroutine. I have random crashes in the application. I tried to investigate the problem with valgrind. It finds some issues, but I cannot understand them because they happen in boost::coroutine and boost::asio internals. Here are fragments from my code and from the valgrind output. Can someone see and explain the problem?
Here is the calling code:
template <class ContextsType>
void executeRequests(ContextsType& avlRequestContexts)
{
    AvlRequestDataList allRequests;
    for(auto& requestContext : avlRequestContexts)
    {
        if(!requestContext.pullProvider || !requestContext.toAskGDS())
            continue;

        auto& requests = requestContext.pullProvider->getRequestsData();
        copy(requests.begin(), requests.end(), back_inserter(allRequests));
    }

    if(allRequests.size() == 0)
        return;

    boost::asio::io_service ioService;
    curl::AsioMultiplexer multiplexer(ioService);

    for(auto& request : allRequests)
    {
        using namespace boost::asio;

        spawn(ioService, [&multiplexer, &request](yield_context yield)
        {
            request->prepare(multiplexer, yield);
        });
    }

    while(true)
    {
        try
        {
            VLOG_DEBUG(avlGeneralLogger, "executeRequests: Starting ASIO event loop.");
            ioService.run();
            VLOG_DEBUG(avlGeneralLogger, "executeRequests: ASIO event loop finished.");
            break;
        }
        catch(const std::exception& e)
        {
            VLOG_ERROR(avlGeneralLogger, "executeRequests: Error while executing GDS request: " << e.what());
        }
        catch(...)
        {
            VLOG_ERROR(avlGeneralLogger, "executeRequests: Unknown error while executing GDS request.");
        }
    }
}
Here is the prepare function implementation, which is called in the spawned lambda:
void AvlRequestData::prepareImpl(curl::AsioMultiplexer& multiplexer,
                                 boost::asio::yield_context yield)
{
    auto& ioService = multiplexer.getIoService();
    _connection = _pool.getConnection(ioService, yield);
    _connection->prepareRequest(xmlRequest, xmlResponse, requestTimeoutMS);

    multiplexer.addEasyHandle(_connection->getHandle(),
        [this](const curl::EasyHandleResult& result)
        {
            if(0 == result.responseCode)
                returnQuota();
            VLOG_DEBUG(lastSeatLogger, "Response " << id << ": " << xmlResponse);
            _pool.addConnection(std::move(_connection));
        });
}
void AvlRequestData::prepare(curl::AsioMultiplexer& multiplexer,
                             boost::asio::yield_context yield)
{
    try
    {
        prepareImpl(multiplexer, yield);
    }
    catch(const std::exception& e)
    {
        VLOG_ERROR(lastSeatLogger, "Error while preparing request: " << e.what());
        returnQuota();
    }
    catch(...)
    {
        VLOG_ERROR(lastSeatLogger, "Unknown error while preparing request.");
        returnQuota();
    }
}
The returnQuota function is a pure virtual method of the AvlRequestData class; its implementation for the TravelportRequestData class, which is used in all my tests, is the following:
void returnQuota() const override
{
    auto& avlQuotaManager = AvlQuotaManager::getInstance();
    avlQuotaManager.consumeQuotaTravelport(-1);
}
Here are the push and pop methods of the connection pool.
auto AvlConnectionPool::getConnection(
        TimerPtr timer,
        asio::yield_context yield) -> ConnectionPtr
{
    lock_guard<mutex> lock(_mutex);
    while(_connections.empty())
    {
        _timers.emplace_back(timer);
        timer->expires_from_now(
            asio::steady_timer::clock_type::duration::max());
        _mutex.unlock();
        coroutineAsyncWait(*timer, yield);
        _mutex.lock();
    }
    ConnectionPtr connection = std::move(_connections.front());
    _connections.pop_front();

    VLOG_TRACE(defaultLogger, str(format("Got connection from pool: %s. Connections count %d.")
                                  % _connectionPoolName % _connections.size()));

    ++_connectionsGiven;
    return connection;
}
void AvlConnectionPool::addConnection(ConnectionPtr connection,
                                      Side side /* = Back */)
{
    lock_guard<mutex> lock(_mutex);
    if(Front == side)
        _connections.emplace_front(std::move(connection));
    else
        _connections.emplace_back(std::move(connection));

    VLOG_TRACE(defaultLogger, str(format("Added connection to pool: %s. Connections count %d.")
                                  % _connectionPoolName % _connections.size()));

    if(_timers.empty())
        return;

    auto timer = _timers.back();
    _timers.pop_back();

    auto& ioService = timer->get_io_service();
    ioService.post([timer](){ timer->cancel(); });

    VLOG_TRACE(defaultLogger, str(format("Connection pool %s: Waiting thread resumed.")
                                  % _connectionPoolName));
}
This is the implementation of coroutineAsyncWait.
inline void coroutineAsyncWait(boost::asio::steady_timer& timer,
                               boost::asio::yield_context yield)
{
    boost::system::error_code ec;
    timer.async_wait(yield[ec]);
    if(ec && ec != boost::asio::error::operation_aborted)
        throw std::runtime_error(ec.message());
}
And finally the first part of the valgrind output:
==8189== Thread 41:
==8189== Invalid read of size 8
==8189== at 0x995F84: void boost::coroutines::detail::trampoline_push_void, void, boost::asio::detail::coro_entry_point, void (anonymous namespace)::executeRequests > >(std::vector<(anonymous namespace)::AvlRequestContext, std::allocator<(anonymous namespace)::AvlRequestContext> >&)::{lambda(boost::asio::basic_yield_context >)#1}>&, boost::coroutines::basic_standard_stack_allocator > >(long) (trampoline_push.hpp:65)
==8189== Address 0x2e3b5528 is not stack'd, malloc'd or (recently) free'd
When I run valgrind with a debugger attached, it stops in the following function in trampoline_push.hpp in the boost::coroutine library.
53│ template< typename Coro >
54│ void trampoline_push_void( intptr_t vp)
55│ {
56│ typedef typename Coro::param_type param_type;
57│
58│ BOOST_ASSERT( vp);
59│
60│ param_type * param(
61│ reinterpret_cast< param_type * >( vp) );
62│ BOOST_ASSERT( 0 != param);
63│
64│ Coro * coro(
65├> reinterpret_cast< Coro * >( param->coro) );
66│ BOOST_ASSERT( 0 != coro);
67│
68│ coro->run();
69│ }

Ultimately I found that when objects need to be deleted, boost::asio doesn't handle it gracefully without proper use of shared_ptr and weak_ptr. When crashes do occur, they are very difficult to debug, because it's hard to look into what the io_service queue is doing at the time of failure.
After doing a full asynchronous client architecture recently and running into random crashing issues, I have a few tips to offer. Unfortunately, I cannot know whether these will solve your issues, but hopefully it provides a good start in the right direction.
Boost Asio Coroutine Usage Tips
Use boost::asio::asio_handler_invoke instead of io_service.post():
auto& ioService = timer->get_io_service();
ioService.post([timer]() { timer->cancel(); });
Using post/dispatch within a coroutine is usually a bad idea. Always use asio_handler_invoke when you are called from a coroutine. In this case, however, you can probably safely call timer->cancel() without posting it to the message loop anyway.
Your timers do not appear to use shared_ptr objects. Regardless of what is going on in the rest of your application, there is no way to know for sure when these objects should be destroyed. I would highly recommend using shared_ptr objects for all of your timer objects. Also, any pointer to a class method should use shared_from_this() as well. Using a plain this can be quite dangerous if the object is destructed (on the stack) or goes out of scope somewhere else in a shared_ptr. Whatever you do, do not use shared_from_this() in the constructor of an object!
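For illustration, here is a minimal sketch of what that could look like for the waiting timers in the question, assuming TimerPtr is a shared_ptr alias (the helper name is hypothetical):

// Sketch only: hold every waiting timer in a shared_ptr, so that any posted
// lambda or pending handler that still references it keeps it alive.
using TimerPtr = std::shared_ptr<boost::asio::steady_timer>;

TimerPtr makeWaitTimer(boost::asio::io_service& ioService)
{
    auto timer = std::make_shared<boost::asio::steady_timer>(ioService);
    timer->expires_from_now(boost::asio::steady_timer::clock_type::duration::max());
    return timer;
}

// When waking a waiter, capture the shared_ptr by value; the timer then
// cannot be destroyed before the posted lambda has run:
//     ioService.post([timer]() { timer->cancel(); });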
If you're getting a crash when a handler within the io_service is being executed, but part of the handler is no longer valid, this is a seriously difficult thing to debug. The handler object that is pumped into the io_service includes any pointers to timers, or pointers to objects that might be necessary to execute the handler.
I highly recommend going overboard with shared_ptr objects wrapped around any asio classes. If the problem goes away, then it's likely an order-of-destruction issue.
Is the failure address location on the heap somewhere, or is it pointing to the stack? This will help you diagnose whether it's an object going out of scope in a method at the wrong time, or if it is something else. For instance, this proved to me that all of my timers must become shared_ptr objects even within a single-threaded application.

Related

boost::asio::ip::tcp::socket.read_some() stops working. No exception or errors detected

I am currently debugging a server (win32/64) that utilizes Boost.Asio 1.78.
The code is a blend of legacy, older legacy, and some newer code. None of this code is mine. I can't answer for why something is done in a certain way; I'm just trying to understand why this is happening and hopefully fix it without rewriting it from scratch. This code has been running for years on 50+ servers with no errors. Just these 2 servers misbehave.
I have one client (.NET) that is connected to two servers. The client sends the same data to the 2 servers. The servers run the same code, which follows in the code section.
All is working well, but now and then communication halts. No errors or exceptions on either end. It just halts. Never on both servers at the same time. This happens very seldom, like every 3 months or less often. I have no way of reproducing it in a debugger because I don't know where to look for this behavior.
On the client side the socket appears to be working/open but does not accept new data. No errors are detected in the socket.
Here's shortened code describing the functions. I want to stress that I can't detect any errors or exceptions during these failures. The code just stops at m_socket->read_some().
The only way to "unblock" right now is to close the socket manually and restart the acceptor. When I manually close the socket, the read_some method returns with an error code, so I know it is inside there that it stops.
Questions:
What may go wrong here and give this behavior?
What parameters should I log to enable me to determine what is happening, and from where.
main code:
std::shared_ptr<boost::asio::io_service> io_service_is = std::make_shared<boost::asio::io_service>();
auto is_work = std::make_shared<boost::asio::io_service::work>(*io_service_is.get());
auto acceptor = std::make_shared<TcpAcceptorWrapper>(*io_service_is.get(), port);
acceptor->start();
auto threadhandle = std::thread([&io_service_is]() {io_service_is->run();});
TcpAcceptorWrapper:
void start(){
    m_asio_tcp_acceptor.open(boost::asio::ip::tcp::v4());
    m_asio_tcp_acceptor.bind(boost::asio::ip::tcp::endpoint(boost::asio::ip::tcp::v4(), m_port));
    m_asio_tcp_acceptor.listen();
    start_internal();
}

void start_internal(){
    m_asio_tcp_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });
}
Handler code:
m_current_session = std::make_shared<TcpSession>(&m_socket);

std::condition_variable condition;
std::mutex mutex;
bool stopped(false);

m_current_session->run(condition, mutex, stopped);

{
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [&stopped] { return stopped; });
}
TcpSession runner:
void run(std::condition_variable& complete, std::mutex& mutex, bool& stopped){
    auto self(shared_from_this());
    std::thread([this, self, &complete, &mutex, &stopped]() {
        { // mutex scope
            // Lock and hold mutex from tcp_acceptor scope
            std::lock_guard<std::mutex> lock(mutex);
            while (true) {
                std::array<char, M_BUFFER_SIZE> buffer;
                try {
                    boost::system::error_code error;
                    /* Next call just hangs/blocks, but only rarely - like once every 3 months or less often */
                    std::size_t read = m_socket->read_some(boost::asio::buffer(buffer, M_BUFFER_SIZE), error);
                    if (error || read == -1) {
                        // This never happens
                        break;
                    }
                    // inside this all is working
                    process(buffer);
                } catch (std::exception& ex) {
                    // This never happens
                    break;
                } catch (...) {
                    // Neither does this
                    break;
                }
            }
            stopped = true;
        } // mutex released
        complete.notify_one();
    }).detach();
}
This:

m_acceptor.async_accept(m_socket, [this](boost::system::error_code error) { /* Handler code */ });

Handler code:

std::condition_variable condition;
std::mutex mutex;
bool stopped(false);

m_current_session->run(condition, mutex, stopped);

{
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [&stopped] { return stopped; });
}
Is strange. It suggests you are using an "async" accept, but the handler blocks unconditionally until the session completes. That's the opposite of asynchrony. You could write the same code much more easily without the asynchrony, and also without the thread and the synchronization around it.
My intuition says something is blocking the mutex. Have you established that the session stack is actually inside the read_some frame when e.g. doing a debugger break during a "lock-up"?
When I manually close the socket the read_some method returns with error code so I know it is inside there I have an issue.
You can't legally do that. Your socket is in use on a thread - in a blocking read -, and you're bound to close it from a separate thread. That's a race-condition (see docs). If you want cancellable operations, use async_read*.
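As a rough sketch of that cancellable alternative (assumptions: the session owns the socket and a buffer member; asyncReadLoop, process and finish are placeholder names, not your code):

// Sketch: an async read loop owned by the session. A cancel()/close() posted
// on the IO thread completes the pending read with operation_aborted instead
// of racing against a blocking read_some on another thread.
void TcpSession::asyncReadLoop()
{
    auto self(shared_from_this());
    m_socket->async_read_some(
        boost::asio::buffer(m_buffer), // e.g. a std::array<char, M_BUFFER_SIZE> member
        [this, self](boost::system::error_code ec, std::size_t n) {
            if (ec)
                return finish(); // includes operation_aborted after cancel()
            process(m_buffer, n);
            asyncReadLoop();
        });
}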
There are more code smells (read_some is a low-level primitive that is rarely what you want at the application level, detached threads with manual synchronization on termination could be packaged tasks, shared boolean flags could be atomics, notify_one outside the mutex could lead to thread starvation on some platforms, etc.).
If you can share more code I'll be happy to sketch simplified solutions that remove the problems.

How to ensure that the messages will be enqueued in chronological order on multithreaded Asio io_service?

Following Michael Caisse's cppcon talk, I created a connection handler MyUserConnection which has a sendMessage method. The sendMessage method adds a message to the queue, similarly to send() in the cppcon talk. My sendMessage method is called from multiple threads outside of the connection handler at high rates. The messages must be enqueued chronologically.
When I run my code with only one io_service::run call (i.e. one io_service thread), it async_writes and empties my queue as expected (FIFO). However, the problem occurs when there are, for example, 4 io_service::run calls; then the queue is not filled, or the send calls are not made chronologically.
class MyUserConnection : public std::enable_shared_from_this<MyUserConnection> {
public:
    MyUserConnection(asio::io_service& io_service, SslSocket socket) :
            service_(io_service),
            socket_(std::move(socket)),
            strand_(io_service) {
    }

    void sendMessage(std::string msg) {
        auto self(shared_from_this());
        service_.post(strand_.wrap([self, msg]() {
            self->queueMessage(msg);
        }));
    }

private:
    void queueMessage(const std::string& msg) {
        bool writeInProgress = !sendPacketQueue_.empty();
        sendPacketQueue_.push_back(msg);
        if (!writeInProgress) {
            startPacketSend();
        }
    }

    void startPacketSend() {
        auto self(shared_from_this());
        asio::async_write(socket_,
            asio::buffer(sendPacketQueue_.front().data(), sendPacketQueue_.front().length()),
            strand_.wrap([self](const std::error_code& ec, std::size_t /*n*/) {
                self->packetSendDone(ec);
            }));
    }

    void packetSendDone(const std::error_code& ec) {
        if (!ec) {
            sendPacketQueue_.pop_front();
            if (!sendPacketQueue_.empty()) { startPacketSend(); }
        } else {
            // end(); // My end call
        }
    }

    asio::io_service& service_;
    SslSocket socket_;
    asio::io_service::strand strand_;
    std::deque<std::string> sendPacketQueue_;
};
I'm quite sure that I misinterpreted the strand and io_service::post when running the connection handler on a multithreaded io_service. I'm also quite sure that the problem is the messages not being enqueued chronologically, rather than them not being async_written chronologically. How can I ensure that the messages will be enqueued in chronological order in the sendMessage call on a multithreaded io_service?
If you use a strand, the order is guaranteed to be the order in which you post the operations to the strand.
Of course, if there is some kind of "correct ordering" between threads that post then you have to synchronize the posting between them, that's your application domain.
Here's a modernized, simplified take on your MyUserConnection class with a self-contained server test program:
Live On Coliru
#include <boost/asio.hpp>
#include <boost/asio/ssl.hpp>
#include <deque>
#include <iostream>
#include <mutex>

namespace asio = boost::asio;
namespace ssl = asio::ssl;
using asio::ip::tcp;
using boost::system::error_code;
using SslSocket = ssl::stream<tcp::socket>;

class MyUserConnection : public std::enable_shared_from_this<MyUserConnection> {
  public:
    MyUserConnection(SslSocket&& socket) : socket_(std::move(socket)) {}

    void start() {
        std::cerr << "Handshake initiated" << std::endl;
        socket_.async_handshake(ssl::stream_base::handshake_type::server,
                                [self = shared_from_this()](error_code ec) {
                                    std::cerr << "Handshake complete" << std::endl;
                                });
    }

    void sendMessage(std::string msg) {
        post(socket_.get_executor(),
             [self = shared_from_this(), msg = std::move(msg)]() {
                 self->queueMessage(msg);
             });
    }

  private:
    void queueMessage(std::string msg) {
        outbox_.push_back(std::move(msg));
        if (outbox_.size() == 1)
            sendLoop();
    }

    void sendLoop() {
        std::cerr << "Sendloop " << outbox_.size() << std::endl;
        if (outbox_.empty())
            return;

        asio::async_write( //
            socket_, asio::buffer(outbox_.front()),
            [this, self = shared_from_this()](error_code ec, std::size_t) {
                if (!ec) {
                    outbox_.pop_front();
                    sendLoop();
                } else {
                    end();
                }
            });
    }

    void end() {}

    SslSocket socket_;
    std::deque<std::string> outbox_;
};

int main() {
    asio::thread_pool ioc;

    ssl::context ctx(ssl::context::sslv23_server);
    ctx.set_password_callback([](auto...) { return "test"; });
    ctx.use_certificate_file("server.pem", ssl::context::file_format::pem);
    ctx.use_private_key_file("server.pem", ssl::context::file_format::pem);
    ctx.use_tmp_dh_file("dh2048.pem");

    tcp::acceptor a(ioc, {{}, 8989u});

    for (;;) {
        auto s = a.accept(make_strand(ioc.get_executor()));
        std::cerr << "accepted " << s.remote_endpoint() << std::endl;
        auto sess = std::make_shared<MyUserConnection>(SslSocket(std::move(s), ctx));
        sess->start();
        for (int i = 0; i < 30; ++i) {
            post(ioc, [sess, i] {
                std::string msg = "message #" + std::to_string(i) + "\n";
                {
                    static std::mutex mx;
                    // Lock so console output is guaranteed in the same order
                    // as the sendMessage call
                    std::lock_guard lk(mx);
                    std::cout << "Sending " << msg << std::flush;
                    sess->sendMessage(std::move(msg));
                }
            });
        }
        break; // for online demo
    }

    ioc.join();
}
If you run it a few times, you will see that
the order in which the threads post is not deterministic (that's up to the kernel scheduling)
the order in which messages are sent (and received) is exactly the order in which they are posted.
On a multi-core, or even on a single-core preemptive OS, you cannot truly feed messages into a queue in strictly chronological order. Even if you use a mutex to synchronize write access to the queue, the strict order is no longer guaranteed once multiple writers wait on the mutex and the mutex becomes free. At best, the order in which the waiting writer threads acquire the mutex is implementation dependent (OS code dependent), but it is best to assume it is just random.
With that said, strict chronological order is a matter of definition in the first place. To explain that, imagine your PC has some digital output bits (one for each writer thread) and you connected a logic analyzer to those bits. Now imagine you pick some spot in the code where you toggle such a bit in your enqueue function. Even if that bit toggle takes place just one assembly instruction prior to acquiring the mutex, it is possible that the order changed while the writer code approached that point. You could also set it at other arbitrary points prior to that (e.g. when you enter the enqueue function), but then the same reasoning applies. Hence, strict chronological order is in itself a matter of definition.
There is an analogy to the case where a CPU's interrupt controller has multiple inputs and you tried to build a system which processes those interrupts in strictly chronological order. Even if all interrupt inputs were signaled at exactly the same moment (a switch pulling them all to the signaled state simultaneously), some order would still occur (e.g. caused by hardware logic, or just by noise at the input pins, or by the system's interrupt dispatcher function; some CPUs (e.g. MIPS 4102) have a single interrupt vector, and assembly code checks the possible interrupt sources and dispatches to dedicated interrupt handlers).
This analogy helps see the pattern: It comes down to asynchronous inputs on a synchronous system. Which is a notoriously hard problem in itself.
So, the best you could possibly do, is to make a suitable definition of your applications "strict ordering" and live with it.
Then, to avoid violations of your definition, you could use a priority queue instead of a normal FIFO data type, using an atomic counter as the priority:
At your chosen point in the code, atomically read and increment the counter.
This is your message's sequence number.
Assemble your message and enqueue it into the priority queue, using your sequence number as priority.
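A compact sketch of that scheme (names are illustrative; a mutex still guards the queue itself, but the atomic counter fixes each message's position beforehand):

#include <atomic>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

struct StampedMessage {
    uint64_t    seq;     // the atomically assigned sequence number
    std::string payload;
};

struct BySeq { // smallest sequence number is served first
    bool operator()(StampedMessage const& a, StampedMessage const& b) const {
        return a.seq > b.seq;
    }
};

std::atomic<uint64_t> g_sequence{0};
std::mutex g_mx;
std::priority_queue<StampedMessage, std::vector<StampedMessage>, BySeq> g_queue;

void enqueue(std::string payload) {
    // The atomic read-and-increment *defines* the message's position in the
    // application's chosen ordering, independent of who wins the mutex next.
    StampedMessage m{g_sequence.fetch_add(1), std::move(payload)};
    std::lock_guard<std::mutex> lk(g_mx);
    g_queue.push(std::move(m));
}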
Another possible approach is to define a notion of "simultaneous" which is detectable on the other side of the queue (so the reader cannot assume strict ordering for a set of "simultaneous" messages). This could be implemented by reading some high-frequency tick count; all messages which carry the same "time stamp" are considered simultaneous on the reader side.
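A sketch of that alternative, assuming a steady-clock microsecond tick is an acceptable stamp:

#include <chrono>
#include <cstdint>

// Messages carrying the same stamp are "simultaneous" by definition; the
// reader must not assume any particular order within one stamp value.
inline uint64_t tickStamp() {
    using namespace std::chrono;
    return duration_cast<microseconds>(
               steady_clock::now().time_since_epoch()).count();
}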

Cancelling boost::asio::async_read gracefully

I have a class that looks like this:
class MyConnector : public boost::noncopyable, public boost::enable_shared_from_this<MyConnector>
{
public:
    typedef MyConnector this_type;
    boost::asio::ip::tcp::socket _plainSocket;
    boost::shared_ptr<std::vector<uint8_t>> _readBuffer;

    // lot of obvious stuff removed....

    void readProtocol()
    {
        _readBuffer = boost::make_shared<std::vector<uint8_t>>(12, 0);
        boost::asio::async_read(_plainSocket, boost::asio::buffer(&_readBuffer->at(0), 12),
            boost::bind(&this_type::handleReadProtocol, shared_from_this(),
                boost::asio::placeholders::bytes_transferred, boost::asio::placeholders::error));
    }

    void handleReadProtocol(size_t bytesRead, const boost::system::error_code& error)
    {
        // handling code removed
    }
};
This class instance generally waits to receive the 12-byte protocol header before trying to read the full message. However, when I try to cancel this read operation and destroy the object, it doesn't happen. When I call _plainSocket.cancel(ec), handleReadProtocol is not called with that ec. The socket disconnects, but the handler is not called.
boost::system::error_code ec;
_plainSocket.cancel(ec);
And the shared_ptr to the MyConnector object that was passed using shared_from_this() is not released. The object remains like a zombie in heap memory. How do I cancel the async_read() in such a way that the MyConnector object's reference count is decremented, allowing the object to destroy itself?
Two things: one, in handleReadProtocol, make sure that, if there is an error, readProtocol is not called again. Cancelled operations still call the handler, but with an error code set.
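A sketch of that guard, matching the handler signature above:

void handleReadProtocol(size_t bytesRead, const boost::system::error_code& error)
{
    if (error) {
        // A cancelled read completes with boost::asio::error::operation_aborted.
        // By returning without starting another async_read, the handler's
        // shared_from_this() copy is released and the reference count can
        // finally reach zero.
        return;
    }
    // ... parse the 12-byte protocol header, then issue the next read ...
}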
Second, asio recommends shutting down and closing the socket if you're finished with the connection. For example:
asio::post(_plainSocket.get_executor(), [this] {
    if (_plainSocket.is_open()) {
        asio::error_code ec;
        /* For portable behaviour with respect to graceful closure of a connected socket, call
         * shutdown() before closing the socket. */
        _plainSocket.shutdown(asio::ip::tcp::socket::shutdown_both, ec);
        if (ec) {
            Log(fmt::format("Socket shutdown error {}.", ec.message()));
            ec.clear();
        }
        _plainSocket.close(ec);
        if (ec)
            Log(fmt::format("Socket close error {}.", ec.message()));
    }
});

How to design proper release of a boost::asio socket or wrapper thereof

I am making a few attempts at making my own simple asynch TCP server using boost::asio after not having touched it for several years.
The latest example listing I can find is:
http://www.boost.org/doc/libs/1_54_0/doc/html/boost_asio/tutorial/tutdaytime3/src.html
The problem I have with this example listing is that (I feel) it cheats, and it cheats big, by making the tcp_connection a shared_ptr, such that it doesn't worry about the lifetime management of each connection. (I think) they do this for brevity, since it is a small tutorial, but that solution is not real world.
What if you wanted to send a message to each client on a timer, or something similar? A collection of client connections is going to be necessary in any real world non-trivial server.
I am worried about the lifetime management of each connection. I figure the natural thing to do would be to keep some collection of tcp_connection objects, or pointers to them, inside tcp_server, adding to that collection from the OnConnect callback and removing from it in OnDisconnect.
Note that OnDisconnect would most likely be called from an actual Disconnect method, which in turn would be called from OnReceive callback or OnSend callback, in the case of an error.
Well, therein lies the problem.
Consider we'd have a callstack that looked something like this:
tcp_connection::~tcp_connection
tcp_server::OnDisconnect
tcp_connection::OnDisconnect
tcp_connection::Disconnect
tcp_connection::OnReceive
This would cause errors as the call stack unwinds and we are executing code in an object that has had its destructor called... I think, right?
I imagine everyone doing server programming comes across this scenario in some fashion. What is a strategy for handling it?
I hope the explanation is good enough to follow. If not let me know and I will create my own source listing, but it will be very large.
Edit:
Related: Memory management in asynchronous C++ code
IMO not an acceptable answer; it relies on cheating with shared_ptr outstanding on receive calls and nothing more, and is not real world. What if the server wanted to say "Hi" to all clients every 5 minutes? A collection of some kind is necessary. What if you are calling io_service.run on multiple threads?
I am also asking on the boost mailing list:
http://boost.2283326.n4.nabble.com/How-to-design-proper-release-of-a-boost-asio-socket-or-wrapper-thereof-td4693442.html
Like I said, I fail to see how using smart pointers is "cheating, and cheating big". I also do not think your assessment that "they do this for brevity" holds water.
Here's a slightly redacted excerpt¹ from our code base that exemplifies how using shared_ptrs doesn't preclude tracking connections.
It shows just the server side of things, with
a very simple connection object in connection.hpp; this uses the enable_shared_from_this idiom
just the fixed-size connection_pool (we have dynamically resizing pools too, hence the locking primitives). Note how we can do actions on all active connections.
So you'd trivially write something like this to write to all clients, e.g. on a timer:

_pool.for_each_active([](auto const& conn) {
    send_message(conn, hello_world_packet);
});

a sample listener that shows how it ties in with the connection_pool (which has a sample method to close all connections)
Code Listings
connection.hpp
#pragma once

#include "xxx/net/rpc/protocol.hpp"
#include "log.hpp"
#include "stats_filer.hpp"
#include <memory>

namespace xxx { namespace net { namespace rpc {

    struct connection : std::enable_shared_from_this<connection>, protected LogSource {
        typedef std::shared_ptr<connection> ptr;

      private:
        friend struct io;
        friend struct listener;

        boost::asio::io_service& _svc;
        protocol::socket         _socket;
        protocol::endpoint       _ep;
        protocol::endpoint       _peer;

      public:
        connection(boost::asio::io_service& svc, protocol::endpoint ep)
                : LogSource("rpc::connection"),
                  _svc(svc),
                  _socket(svc),
                  _ep(ep)
        {}

        void init() {
            _socket.set_option(protocol::no_delay(true));
            _peer = _socket.remote_endpoint();
            g_stats_filer_p->inc_value("asio." + _ep.address().to_string() + ".sockets_accepted");
            debug() << "New connection from " << _peer;
        }

        protocol::endpoint endpoint() const { return _ep; }
        protocol::endpoint peer() const     { return _peer; }
        protocol::socket&  socket()         { return _socket; }

        // TODO encapsulation
        int handle() {
            return _socket.native_handle();
        }

        bool valid() const { return _socket.is_open(); }

        void cancel() {
            _svc.post([this] { _socket.cancel(); });
        }

        using shutdown_type = boost::asio::ip::tcp::socket::shutdown_type;
        void shutdown(shutdown_type what = shutdown_type::shutdown_both) {
            _svc.post([=] { _socket.shutdown(what); });
        }

        ~connection() {
            g_stats_filer_p->inc_value("asio." + _ep.address().to_string() + ".sockets_disconnected");
        }
    };

} } }
connection_pool.hpp
#pragma once

#include <algorithm>
#include <functional>
#include <mutex>
#include <vector>
#include "xxx/threads/null_mutex.hpp"
#include "xxx/net/rpc/connection.hpp"
#include "stats_filer.hpp"
#include "log.hpp"

namespace xxx { namespace net { namespace rpc {

    // not thread-safe by default, but pass e.g. std::mutex for `Mutex` if you need it
    template <typename Ptr = xxx::net::rpc::connection::ptr, typename Mutex = xxx::threads::null_mutex>
    struct basic_connection_pool : LogSource {
        using WeakPtr = std::weak_ptr<typename Ptr::element_type>;

        basic_connection_pool(size_t size, std::string name = "connection_pool")
                : LogSource(std::move(name)), _pool(size)
        { }

        bool try_insert(Ptr const& conn) {
            std::lock_guard<Mutex> lk(_mx);
            auto slot = std::find_if(_pool.begin(), _pool.end(), std::mem_fn(&WeakPtr::expired));

            if (slot == _pool.end()) {
                g_stats_filer_p->inc_value("asio." + conn->endpoint().address().to_string() + ".connections_dropped");
                error() << "dropping connection from " << conn->peer() << ": connection pool (" << _pool.size() << ") saturated";
                return false;
            }

            *slot = conn;
            return true;
        }

        template <typename F>
        void for_each_active(F action) {
            auto locked = [=] {
                using namespace std;
                lock_guard<Mutex> lk(_mx);
                vector<Ptr> locked(_pool.size());
                transform(_pool.begin(), _pool.end(), locked.begin(), mem_fn(&WeakPtr::lock));
                return locked;
            }();

            for (auto const& p : locked)
                if (p) action(p);
        }

        constexpr static bool synchronizing() {
            return not std::is_same<xxx::threads::null_mutex, Mutex>();
        }

      private:
        void dump_stats(LogSource::LogTx tx) const {
            // lock is assumed!
            size_t empty = 0, busy = 0, idle = 0;

            for (auto& p : _pool) {
                switch (p.use_count()) {
                    case 0:  empty++; break;
                    case 1:  idle++;  break;
                    default: busy++;  break;
                }
            }

            tx << "usage empty:" << empty << " busy:" << busy << " idle:" << idle;
        }

        Mutex _mx;
        std::vector<WeakPtr> _pool;
    };

    // TODO FIXME use null_mutex once growing is no longer required AND if
    // en-pooling still only happens from the single IO thread (XXX-2535)
    using server_connection_pool = basic_connection_pool<xxx::net::rpc::connection::ptr, std::mutex>;

} } }
listener.hpp
#pragma once

#include "xxx/threads/null_mutex.hpp"
#include <mutex>
#include "xxx/net/rpc/connection_pool.hpp"
#include "xxx/net/rpc/io_operations.hpp"

namespace xxx { namespace net { namespace rpc {

    struct listener : std::enable_shared_from_this<listener>, LogSource {
        typedef std::shared_ptr<listener> ptr;

        protocol::acceptor _acceptor;
        protocol::endpoint _ep;

        listener(boost::asio::io_service& svc, protocol::endpoint ep, server_connection_pool& pool)
                : LogSource("rpc::listener"), _acceptor(svc), _ep(ep), _pool(pool)
        {
            _acceptor.open(ep.protocol());

            _acceptor.set_option(protocol::acceptor::reuse_address(true));
            _acceptor.set_option(protocol::no_delay(true));
            ::fcntl(_acceptor.native(), F_SETFD, FD_CLOEXEC); // FIXME use non-racy socket factory?
            _acceptor.bind(ep);

            _acceptor.listen(32);
        }

        void accept_loop(std::function<void(connection::ptr conn)> on_accept) {
            auto self = shared_from_this();
            auto conn = std::make_shared<xxx::net::rpc::connection>(_acceptor.get_io_service(), _ep);

            _acceptor.async_accept(conn->_socket, [this, self, conn, on_accept](boost::system::error_code ec) {
                if (ec) {
                    auto tx = ec == boost::asio::error::operation_aborted ? debug() : warn();
                    tx << "failed accept " << ec.message();
                } else {
                    ::fcntl(conn->_socket.native(), F_SETFD, FD_CLOEXEC); // FIXME use non-racy socket factory?

                    if (_pool.try_insert(conn)) {
                        on_accept(conn);
                    }

                    self->accept_loop(on_accept);
                }
            });
        }

        void close() {
            _acceptor.cancel();
            _acceptor.close();

            _acceptor.get_io_service().post([=] {
                _pool.for_each_active([](auto const& sp) {
                    sp->shutdown(connection::shutdown_type::shutdown_both);
                    sp->cancel();
                });
            });

            debug() << "shutdown";
        }

        ~listener() {
        }

      private:
        server_connection_pool& _pool;
    };

} } }
¹ download as gist https://gist.github.com/sehe/979af25b8ac4fd77e73cdf1da37ab4c2
While others have answered similarly regarding the second half of this answer, the most complete answer I could find came from asking the same question on the Boost mailing list.
http://boost.2283326.n4.nabble.com/How-to-design-proper-release-of-a-boost-asio-socket-or-wrapper-thereof-td4693442.html
I will summarize here in order to assist those that arrive here from a search in the future.
There are 2 options
1) Close the socket in order to cancel any outstanding io, then post a callback for the post-disconnection logic on the io_service, and let the server class be called back when the socket has been disconnected. It can then safely release the connection. As long as only one thread had called io_service::run, other asynchronous operations will already have been resolved when the callback is made. However, if multiple threads had called io_service::run, this is not safe.
2) As others have pointed out in their answers, using a shared_ptr to manage the connection's lifetime, with outstanding io operations keeping it alive, is viable. We can then use a collection of weak_ptrs to the connections in order to access them when we need to. The latter is the tidbit that had been omitted from other posts on the topic, which confused me.
The way that asio solves the "deletion problem" where there are outstanding async methods is that it splits each async-enabled object into 3 classes, e.g.:
server
server_service
server_impl
there is one service per io_loop (see use_service<>). The service creates an impl for the server, which is now a handle class.
This has separated the lifetime of the handle and the lifetime of the implementation.
Now, in the handle's destructor, a message can be sent (via the service) to the impl to cancel all outstanding IO.
The handle's destructor is free to wait for those io calls to be queued if necessary (for example if the server's work is being delegated to a background io loop or thread pool).
It has become a habit of mine to implement all io_service-enabled objects this way, as it makes coding with asio very much simpler.
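A bare-bones sketch of that split, for illustration only (it collapses the service layer into a direct post; real asio services implement a larger interface, and all names here are hypothetical):

#include <boost/asio.hpp>
#include <memory>

class server_impl : public std::enable_shared_from_this<server_impl> {
  public:
    explicit server_impl(boost::asio::io_service& svc) : _svc(svc) {}

    void start() {}      // would issue the outstanding async operations
    void cancel_all() {} // would close sockets / cancel timers; pending
                         // handlers then complete with operation_aborted

  private:
    boost::asio::io_service& _svc;
};

class server { // the handle that application code owns
  public:
    explicit server(boost::asio::io_service& svc)
        : _svc(svc), _impl(std::make_shared<server_impl>(svc)) {}

    ~server() {
        // Destroying the handle does not destroy the impl: the posted lambda
        // (and any outstanding handlers) keep the impl alive, and it dies
        // only after the cancellation has drained through the io_service.
        auto impl = _impl;
        _svc.post([impl] { impl->cancel_all(); });
    }

  private:
    boost::asio::io_service&     _svc;
    std::shared_ptr<server_impl> _impl;
};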
Connection lifetime is a fundamental issue with boost::asio. Speaking from experience, I can assure you that getting it wrong causes "undefined behaviour"...
The asio examples use shared_ptr to ensure that a connection is kept alive whilst it may have outstanding handlers in an asio::io_service. Note that even in a single thread, an asio::io_service runs asynchronously to the application code, see CppCon 2016: Michael Caisse "Asynchronous IO with Boost.Asio" for an excellent description of the precise mechanism.
A shared_ptr enables the lifetime of a connection to be controlled by the shared_ptr instance count. IMHO it's not "cheating and cheating big", but an elegant solution to a complicated problem.
However, I agree with you that just using shared_ptr's to control connection lifetimes is not a complete solution since it can lead to resource leaks.
In my answer here: Boost async_* functions and shared_ptr's, I proposed using a combination of shared_ptr and weak_ptr to manage connection lifetimes. An HTTP server using a combination of shared_ptr's and weak_ptr's can be found here: via-httplib.
The HTTP server is built upon an asynchronous TCP server which uses a collection of (shared_ptr's to) connections, created on connects and destroyed on disconnects as you propose.

Thread safety of boost::asio io_service and std::containers

I'm building a network service with boost::asio and I'm unsure about the thread safety.
io_service.run() is called only once, from a thread dedicated to the io_service work.
send_message(), on the other hand, can be called either by the code inside the second io_service's handlers mentioned later, or by the main thread upon user interaction. And that is why I'm getting nervous.
std::deque<message> out_queue;

// send_message will be called by two different threads
void send_message(MsgPtr msg){
    while (out_queue->size() >= 20){
        Sleep(50);
    }
    io_service_.post([this, msg]() { deliver(msg); });
}

// from my understanding, deliver will only be called by the thread which called io_service.run()
void deliver(const MsgPtr msg){
    bool write_in_progress = !out_queue.empty();
    out_queue.push_back(msg);
    if (!write_in_progress)
    {
        write();
    }
}
void write()
{
    auto self(shared_from_this());
    asio::async_write(socket_,
        asio::buffer(out_queue.front().header(),
                     message::header_length),
        [this, self](asio::error_code ec, std::size_t /*length*/)
        {
            if (!ec)
            {
                asio::async_write(socket_,
                    asio::buffer(out_queue.front().data(),
                                 out_queue.front().paddedPayload_size()),
                    [this, self](asio::error_code ec, std::size_t /*length*/)
                    {
                        if (!ec)
                        {
                            out_queue.pop_front();
                            if (!out_queue.empty())
                            {
                                write();
                            }
                        }
                    });
            }
        });
}
Is this scenario safe?
A similar second scenario: when the network thread receives a message, it posts it into another asio::io_service, which is also run by its own dedicated thread. This io_service uses a std::unordered_map to store callback function objects, etc.
std::unordered_map<int, eventSink> eventSinkMap_;
//...

// called by the main thread (GUI), writes a callback function object to the map
int IOReactor::registerEventSink(std::function<void(int, std::shared_ptr<message>)> fn, QObject* window, std::string endpointId){
    util::ScopedLock lock(&sync_);
    eventSink es;
    es.id = generateRandomId();
    // ....
    std::pair<int, eventSink> eventSinkPair(es.id, es);
    eventSinkMap_.insert(eventSinkPair);
    return es.id;
}

// called by the second thread, the network service thread, when a message was received
void IOReactor::onMessageReceived(std::shared_ptr<message> msg, ConPtr con)
{
    reactor_io_service_.post([=](){ handleReceive(msg, con); });
}

// should be called only by the one thread running reactor_io_service.run()
// read and write access to the map
void IOReactor::handleReceive(std::shared_ptr<message> msg, ConPtr con){
    util::ScopedLock lock(&sync_);

    auto es = eventSinkMap_.find(msg->requestId);
    if (es != eventSinkMap_.end())
    {
        auto fn = es->second.handler;
        auto ctx = es->second.context;
        QMetaObject::invokeMethod(ctx, "runInMainThread", Qt::QueuedConnection, Q_ARG(std::function<void(int, std::shared_ptr<msg::IMessage>)>, fn), Q_ARG(int, CallBackResult::SUCCESS), Q_ARG(std::shared_ptr<msg::IMessage>, msg));
        eventSinkMap_.erase(es);
    }
}
First of all: do I even need to use a lock here?
Of course both methods access the map, but they are not accessing the same elements (the receive handler cannot try to access or read an element that has not yet been registered/inserted into the map). Is that thread-safe?
First of all, a lot of context is missing (where is onMessageReceived invoked, and what is ConPtr?), and you have too many questions. I'll give you some specific pointers that will help you, though.
You should be nervous here:
void send_message(MsgPtr msg){
    while (out_queue->size() >= 20){
        Sleep(50);
    }
    io_service_.post([this, msg]() { deliver(msg); });
}
The check out_queue->size() >= 20 requires synchronization unless out_queue is thread safe.
The call to io_service_.post is safe, because io_service is thread safe. Since you have one dedicated IO thread, this means that deliver() will run on that thread. Right now, you need synchronization there too.
I strongly suggest using a proper thread-safe queue there.
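For instance (a sketch, not drop-in code): a small bounded queue that blocks producers on a condition variable instead of the Sleep(50) poll above:

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class bounded_queue {
  public:
    explicit bounded_queue(std::size_t limit) : limit_(limit) {}

    void push(T item) { // called by any producer thread
        std::unique_lock<std::mutex> lk(mx_);
        not_full_.wait(lk, [this] { return q_.size() < limit_; });
        q_.push_back(std::move(item));
    }

    bool try_pop(T& out) { // called on the IO thread
        std::lock_guard<std::mutex> lk(mx_);
        if (q_.empty())
            return false;
        out = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return true;
    }

  private:
    std::mutex mx_;
    std::condition_variable not_full_;
    std::deque<T> q_;
    std::size_t limit_;
};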
Q. first of all: Do I even need to use a lock here?
Yes you need to lock to do the map lookup (otherwise you get a data race with the main thread inserting sinks).
You do not need to lock during the invocation (in fact, that seems like a very unwise idea that could lead to performance issues or lockups). The reference remains valid due to iterator invalidation rules.
The deletion of course requires a lock again. I'd revise the code to do the lookup and removal at once, and invoke the sink only after releasing the lock. NOTE: You will have to think about exceptions here (in your code, when there is an exception during invocation, the sink doesn't get removed (ever?)). This might be important to you.
Live Demo
void handleReceive(std::shared_ptr<message> msg, ConPtr con){
    util::ScopedLock lock(&sync_);

    auto es = eventSinkMap_.find(msg->requestId);
    if (es != eventSinkMap_.end())
    {
        auto fn = es->second.handler;
        auto ctx = es->second.context;
        eventSinkMap_.erase(es); // invalidates es
        lock.unlock();

        // invoke in whatever way you require
        fn(static_cast<int>(CallBackResult::SUCCESS), std::static_pointer_cast<msg::IMessage>(msg));
    }
}
}