boost::asio reasoning behind num_implementations for io_service::strand - c++

We've been using asio in production for years, and recently we reached a critical point where our servers became loaded just enough to notice a mysterious issue.
In our architecture, each separate entity that runs independently uses its own strand object. Some of the entities can perform long work (reading from a file, performing a MySQL request, etc.). Naturally, the work is performed within handlers wrapped with the strand. All sounds nice and pretty and should work flawlessly, until we began to notice impossible things: timers expiring seconds after they should, even though threads were 'waiting for work', and work halting for no apparent reason. It looked like long work performed inside one strand had an impact on other, unrelated strands; not all of them, but most.
Countless hours were spent pinpointing the issue. The trail led to the way the strand object is created: strand_service::construct (here).
For some reason the developers decided to have a limited number of strand implementations, meaning that some totally unrelated objects share a single implementation and are therefore bottlenecked by it.
The standalone (non-Boost) asio library uses a similar approach, but instead of shared implementations, each implementation is independent and may merely share a mutex object with other implementations (here).
What is it all about? I have never heard of limits on the number of mutexes in a system, or of any overhead related to their creation/destruction. Even if the latter were a problem, it could easily be solved by recycling mutexes instead of destroying them.
Here is the simplest test case to show how dramatic the performance degradation is:
#include <boost/asio.hpp>
#include <atomic>
#include <functional>
#include <iostream>
#include <memory>
#include <thread>
#include <unistd.h> // for sleep()
#include <vector>
std::atomic<bool> running{true};
std::atomic<int> counter{0};
struct Work
{
    Work(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<Work> _this(new Work(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<Work> _this)
    {
        counter.fetch_add(1, std::memory_order_relaxed);
        if (running.load(std::memory_order_relaxed)) {
            start_the_work(_this->_strand.get_io_service());
        }
    }

    boost::asio::strand _strand;
};
struct BlockingWork
{
    BlockingWork(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<BlockingWork> _this(new BlockingWork(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<BlockingWork> _this)
    {
        sleep(5);
    }

    boost::asio::strand _strand;
};
int main(int argc, char ** argv)
{
    boost::asio::io_service io_service;
    std::unique_ptr<boost::asio::io_service::work> work{new boost::asio::io_service::work(io_service)};

    for (std::size_t i = 0; i < 8; ++i) {
        Work::start_the_work(io_service);
    }

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < 8; ++i) {
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
    }

    if (argc > 1) {
        std::cout << "Spawning a blocking work" << std::endl;
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
        BlockingWork::start_the_work(io_service);
    }

    sleep(5);
    running = false;
    work.reset();

    for (auto && worker : workers) {
        worker.join();
    }

    std::cout << "Work performed:" << counter.load() << std::endl;
    return 0;
}
Build it using this command:
g++ -o asio_strand_test_case -pthread -I/usr/include -std=c++11 asio_strand_test_case.cpp -lboost_system
Test run in a usual way:
time ./asio_strand_test_case
Work performed:6905372
real 0m5.027s
user 0m24.688s
sys 0m12.796s
Test run with a long blocking work:
time ./asio_strand_test_case 1
Spawning a blocking work
Work performed:770
real 0m5.031s
user 0m0.044s
sys 0m0.004s
The difference is dramatic. What happens is that each new non-blocking work item creates a new strand object, until one of them shares its implementation with the strand of the blocking work. When that happens, it's a dead end until the long work finishes.
Edit:
Reduced the amount of parallel work down to the number of worker threads (from 1000 to 8) and updated the test run output, because the issue is more visible when the two numbers are close.

Well, an interesting issue, and +1 for giving us a small example reproducing it.
The problem you are having with the Boost implementation, as I understand it, is that by default it instantiates only a limited number of strand_impl objects: 193, as I see in my version of Boost (1.59).
What this means is that a large number of requests will be in contention, as they wait for the lock to be released by another handler that is using the same instance of strand_impl.
My guess for doing such a thing would be to avoid overloading the OS by creating lots and lots of mutexes. That would be bad. The current implementation allows locks to be reused (and in a configurable way, as we will see below).
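For reference, here is a simplified paraphrase of what strand_service::construct does, reconstructed from memory of the Boost 1.5x sources (check the source linked in the question for your exact version). The service owns a fixed pool of BOOST_ASIO_STRAND_IMPLEMENTATIONS slots and hashes each new strand into one of them, which is why totally unrelated strands can collide:

// Simplified sketch, not the verbatim Boost code.
void strand_service::construct(implementation_type& impl)
{
    boost::asio::detail::mutex::scoped_lock lock(mutex_);
    std::size_t salt = salt_++;
    std::size_t index = reinterpret_cast<std::size_t>(&impl);
    index += (reinterpret_cast<std::size_t>(&impl) >> 3);
    index ^= salt + 0x9e3779b9 + (index << 6) + (index >> 2);
    index = index % num_implementations; // colliding strands share one strand_impl
    if (!implementations_[index].get())
        implementations_[index].reset(new strand_impl);
    impl = implementations_[index].get();
}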
In my setup:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -O2 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:489696
real 0m5.016s
user 0m1.620s
sys 0m4.069s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:188480
real 0m5.031s
user 0m0.611s
sys 0m1.495s
Now, there is a way to change this number of cached implementations by setting the macro BOOST_ASIO_STRAND_IMPLEMENTATIONS.
Below is the result I got after setting it to a value of 1024:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -DBOOST_ASIO_STRAND_IMPLEMENTATIONS=1024 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:450928
real 0m5.017s
user 0m2.708s
sys 0m3.902s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:458603
real 0m5.027s
user 0m2.611s
sys 0m3.902s
Almost the same for both cases! You might want to adjust the value of the macro as per your needs to keep the deviation small.

Note that if you don't like Asio's implementation you can always write your own strand which creates a separate implementation for each strand instance. This might be better for your particular platform than the default algorithm.
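As a starting point, here is a minimal sketch of such a strand (the class name own_strand is made up here, and lifetime management and error handling are omitted, so treat it as an illustration rather than production code). Each instance owns its own mutex and queue, so no two strands ever contend with each other:

#include <boost/asio.hpp>
#include <deque>
#include <functional>
#include <mutex>

class own_strand
{
public:
    explicit own_strand(boost::asio::io_service& io) : _io(io) {}

    template <typename Handler>
    void post(Handler handler)
    {
        std::lock_guard<std::mutex> lock(_mutex);
        _queue.push_back(std::move(handler));
        if (!_running) {             // schedule a drain only if none is pending
            _running = true;
            _io.post([this] { run_one(); });
        }
    }

private:
    void run_one()
    {
        std::function<void()> handler;
        {
            std::lock_guard<std::mutex> lock(_mutex);
            handler = std::move(_queue.front());
            _queue.pop_front();
        }
        handler();                   // executed outside the lock, one at a time

        std::lock_guard<std::mutex> lock(_mutex);
        if (_queue.empty())
            _running = false;
        else
            _io.post([this] { run_one(); });
    }

    boost::asio::io_service& _io;
    std::mutex _mutex;
    std::deque<std::function<void()>> _queue;
    bool _running = false;
};

Handlers still run on the io_service threads, but at most one per own_strand at a time, and blocking inside one instance cannot stall any other.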

Edit: As of recent Boost releases, standalone ASIO and Boost.ASIO are back in sync. This answer is preserved for historical interest.
Standalone ASIO and Boost.ASIO have become quite detached in recent years, as standalone ASIO is slowly being morphed into the reference Networking TS implementation for standardisation. All the "action" is happening in standalone ASIO, including major bug fixes; only very minor bug fixes are made to Boost.ASIO. There are several years of difference between them by now.
I'd therefore suggest that anyone finding any problems at all with Boost.ASIO switch over to standalone ASIO. The conversion is usually not hard; look into the many macro configs for switching between C++ 11 and Boost in config.hpp. Historically, Boost.ASIO was auto-generated by script from standalone ASIO. It may be that Chris has kept those scripts working, in which case you could regenerate a brand shiny new Boost.ASIO with all the latest changes. I'd suspect such a build is not well tested, however.
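To give a feel for the conversion, here is a minimal sketch against standalone ASIO (an illustration under assumptions: it presumes a header-only Asio on the include path, and depending on the Asio version io_service may be spelled io_context; note the asio:: namespace and that -lboost_system is no longer needed):

#include <asio.hpp>
#include <iostream>

int main()
{
    asio::io_service io_service;              // asio:: instead of boost::asio::
    asio::io_service::strand strand(io_service);
    strand.post([] { std::cout << "hello from a strand" << std::endl; });
    io_service.run();
    return 0;
}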

Related

Is it possible to force MPI to always block on send?

Is there a way to force MPI to always block on send? This might be useful when looking for deadlocks in a distributed algorithm which otherwise depends on the buffering MPI might choose to do on send.
For example, the following program (run with 2 processes) works without problems on my machine:
// C++
#include <chrono> // for std::chrono_literals
#include <iostream>
#include <thread>
// Boost
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
int main() {
    using namespace std::chrono_literals;
    mpi::environment env;
    mpi::communicator world;
    auto me = world.rank();
    auto other = 1 - me;
    char buffer[10] = {0};
    while (true) {
        world.send(other, 0, buffer);
        world.recv(other, 0, buffer);
        std::cout << "Node " << me << " received" << std::endl;
        std::this_thread::sleep_for(200ms);
    }
}
But if I change the size of the buffer to 10000 it blocks indefinitely.
For pure MPI code, what you describe is exactly what MPI_Ssend() gives you. However, you are not using pure MPI here; you are using boost::mpi, and unfortunately, according to boost::mpi's documentation, MPI_Ssend() isn't supported.
That said, maybe boost::mpi offers another way, but I doubt it.
If you want blocking behavior, use MPI_Ssend. It blocks until a matching receive has been posted, without buffering the request. The amount of buffering provided by MPI_Send is (intentionally) implementation specific; the behavior you get for a buffer of 10000 may differ with a different implementation.
I don't know whether you can actually tweak the buffering configuration, and I wouldn't try, because it would not be portable. Instead, I'd use the MPI_Ssend variant in some debug configuration and the default MPI_Send when best performance is needed.
(Disclaimer: I'm not familiar with Boost's implementation, but MPI is a standard. Also, I saw Gilles' comment after posting this answer...)
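Since boost::mpi communicators convert implicitly to a raw MPI_Comm, one way to apply this advice is to drop down to the C API just for the send. A minimal sketch of the ping-pong above rewritten this way; note that with MPI_Ssend both ranks block in the send, so this version deadlocks immediately, which is exactly the buffering dependence the question wants to expose:

#include <boost/mpi.hpp>
#include <mpi.h>

int main() {
    boost::mpi::environment env;
    boost::mpi::communicator world;
    int other = 1 - world.rank();
    char buffer[10] = {0};
    // MPI_Ssend completes only once the matching receive has been posted.
    MPI_Ssend(buffer, sizeof buffer, MPI_CHAR, other, 0, world);
    MPI_Recv(buffer, sizeof buffer, MPI_CHAR, other, 0, world,
             MPI_STATUS_IGNORE);
}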
You can also consider tuning the eager limit value (http://blogs.cisco.com/performance/what-is-an-mpi-eager-limit) to force the send operation to block on any message size. The way to establish the eager limit depends on the MPI implementation. With Intel MPI you can use the I_MPI_EAGER_THRESHOLD environment variable (see https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm), for instance.

Using C++11 futures: Nested calls of std::async crash: Compiler/Standard library bug?

After experiencing crashes when introducing nested calls of std::async in my real program, I was able to reproduce the problem in the following minimal example. It crashes often, but not always. Do you see anything that goes wrong, or is it a compiler or standard library bug? Note that the problem remains if get() calls on the futures are added.
#include <future>
#include <vector>
int main (int, char *[])
{
    std::vector<std::future<void>> v;
    v.reserve(100);

    for (int i = 0; i != 100; ++i)
    {
        v.emplace_back(std::async(std::launch::async, [] () {
            std::async(std::launch::async, [] { });
        }));
    }

    return 0;
}
I observe two different kinds of crashes (in about every fifth run):
Termination with "This application has requested the Runtime to terminate it in an unusual way."
Termination after throwing an instance of 'std::future_error', what(): Promise already satisfied.
Environment:
Windows 7
gcc version 4.8.2 (i686-posix-dwarf-rev3, Built by
MinGW-W64 project), as provided by Qt 5.3.2
Command line call: g++ -std=c++11 -pthread futures.cpp
Compiled and run on two independent machines
Option -pthread?
Could it be that, in my environment, the option -pthread is for some reason silently ignored? I observe the same behavior with and without that option.
Since this question is still "unanswered," after talking with some people from Lounge<C++>, I think I can say that it's pretty obvious from the comments that this was due to an implementation error on either MinGW/MinGW-w64's or pthread's part at the time. With gcc 4.9.1, MinGW-w64, the problem no longer appears. In fact, the program above appears to compile and run correctly even on a version earlier than 4.8.2 with POSIX threading.
I am not an expert myself, but my guess is that the exact trip-up happens when the program tries to write to the same promise twice, which should be a big no-no, as a std::async should write its result only once (again, I'm not sure I'm right here, and other comments and edits will most likely clarify).
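For what it's worth, here is a small standalone program reproducing just that error: writing to a promise twice throws std::future_error with the promise_already_satisfied code, which libstdc++ reports as "Promise already satisfied":

#include <future>
#include <iostream>

int main()
{
    std::promise<int> p;
    p.set_value(1);
    try {
        p.set_value(2); // a promise's shared state may only be written once
    } catch (const std::future_error& e) {
        std::cout << e.what() << std::endl;
    }
}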
Also, this may be a related problem: std::future exception on gcc experimental implementation of C++0x

Asynchronous call to MATLAB's engEvalString

Edit 2: Problem solved, see my answer.
I am writing a C++ program that communicates with MATLAB through the Engine API. The C++ application is running on Windows 7, and interacting with MATLAB 2012b (32-bit).
I would like to make a time-consuming call to the MATLAB engine using engEvalString, but cannot figure out how to make the call asynchronous. No callback is necessary (though it would be nice if possible).
The following is a minimal example of what doesn't work.
#include <boost/thread.hpp>
extern "C" {
#include <engine.h>
}
int main()
{
    Engine* eng = engOpen("");
    engEvalString(eng, "x=10");
    boost::thread asyncEval(&engEvalString, eng, "y=5");
    boost::this_thread::sleep(boost::posix_time::seconds(10));
    return 0;
}
After running this program, I switch to the MATLAB engine window and find:
» x
x =
10
» y
Undefined function or variable 'y'.
So it seems that the second call, which should set y=5, is never processed by the MATLAB engine.
The thread definitely runs; you can check this by moving the engEvalString call into a local function and launching that as the thread instead.
I'm really stumped here, and would appreciate any suggestions!
EDIT: As Shafik pointed out in his answer, the engine is not thread-safe. I don't think this should be an issue for my use case, as the calls I need to make are ~5 seconds apart, for a calculation that takes 2 seconds. The reason I cannot wait for this calculation is that the C++ application is a "medium-hard" real-time robot controller which should send commands at 50Hz. If this rate drops below 30Hz, the robot will assume network issues and close the connection.
So, I figured out the problem, but would love it if someone could explain why!
The following works:
#include <boost/thread.hpp>
extern "C" {
#include <engine.h>
}
void asyncEvalString()
{
    Engine* eng = engOpen("");
    engEvalString(eng, "y=5");
}

int main()
{
    Engine* eng = engOpen("");
    engEvalString(eng, "x=10");
    boost::thread asyncEvalString(&asyncEvalString);
    boost::this_thread::sleep(boost::posix_time::seconds(1));
    engEvalString(eng, "z=15");
    return 0;
}
As you can see, you need to get a new pointer to the engine in the new thread. The pointer returned in asyncEvalString is different from the original pointer returned by engOpen in the main function, yet both pointers continue to operate without problem:
» x
x =
10
» y
y =
5
» z
z =
15
Finally, to tackle the problem of thread safety, a mutex could be set up around the engEvalString calls to ensure only one thread uses the engine at any one time (see the sketch below). The asyncEvalString function could also be modified to trigger a callback once engEvalString has completed.
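A minimal sketch of that mutex idea (the helper name safeEval is made up here; it assumes each thread has already obtained its own Engine* as in the workaround above):

#include <mutex>

std::mutex engineMutex;

void safeEval(Engine* eng, const char* cmd)
{
    // Only one thread may talk to the engine at any one time.
    std::lock_guard<std::mutex> lock(engineMutex);
    engEvalString(eng, cmd);
}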
I would, however, appreciate someone explaining why the above solution works. Threads share the heap-allocated memory of the process, and can even access memory on other threads' stacks (?), so I fail to understand why the first Engine* was suddenly invalid when used in a separate thread.
So, according to this MathWorks document, the engine is not thread safe, and I doubt this will work:
http://www.mathworks.com/help/matlab/matlab_external/using-matlab-engine.html
and according to this document, engOpen forks a new process, which would probably explain the rest of the behavior you are seeing:
http://www.mathworks.com/help/matlab/apiref/engopen.html
Also see this on threads and forks; think twice before mixing them:
http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

How can I use boost::thread::timed_join with nanoseconds enabled in boost::date_time?

Here is some C++ code illustrating my problem with a minimal example:
// uncomment the next line to make it hang up:
//#define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG // needed for nanosecond support of boost
#include <boost/thread.hpp>
#include <iostream>

void foo()
{
    while (true);
}

int main(int noParameters, char **parameterArray)
{
    boost::thread MyThread(&foo);

    if (MyThread.timed_join(boost::posix_time::seconds(1)))
    {
        std::cout << "\nDone!\n";
    }
    else
    {
        std::cerr << "\nTimed out!\n";
    }
}
As long as I don't turn on the nanosecond support, everything works as expected, but as soon as I uncomment the #define needed for nanosecond support in boost::posix_time, the program doesn't get past the if-statement any more, just as if I had called join() instead of timed_join().
Now, I've already figured out that this happens because BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG changes the actual data representation of the timestamps from a single 64-bit integer to 64+32 bits. A lot of Boost is implemented entirely inside headers, but the thread methods are not, so they cannot adapt to the new data format without being recompiled with the appropriate options. Since the code is meant to run on an external server, compiling my own version of Boost is not an option, and neither is turning off the nanosecond support.
Therefore my question is as follows: Is there a way to pass on a value (on the order of seconds) to timed_join() without using the incompatible 96bit posix_time methods and without modifying the standard boost packages?
I'm running on Ubuntu 12.04 with boost 1.46.1.
Unfortunately, I don't think your problem can be cleanly solved as written. Since the library you're linking against was compiled without nanosecond support, you violate the one-definition rule, by definition, if you enable nanosecond support for any piece that's already compiled into the library binary. In this case, you're enabling it across the function calls to timed_join.
The obvious solution is to decide which is less painful to give up: building your own Boost, or removing nanosecond times.
The less obvious "hack" that may or may not totally work is to write your own timed_join wrapper that takes a thread object and an int representing seconds or ms or whatever. This function is then implemented in a source file with nothing else, one that does not enable nanosecond times, for the specific purpose of calling into the compiled Boost binary (a sketch follows below). Again, I want to stress that if at any point you fail to completely segregate such usages, you'll violate the one-definition rule and run into undefined behavior.
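A minimal sketch of that wrapper (the name timed_join_seconds is made up here; the key point is that this translation unit is compiled without BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG, so its view of posix_time matches the prebuilt Boost binary):

// timed_join_wrapper.cpp -- build WITHOUT -DBOOST_DATE_TIME_POSIX_TIME_STD_CONFIG
#include <boost/thread.hpp>

bool timed_join_seconds(boost::thread& t, int seconds)
{
    return t.timed_join(boost::posix_time::seconds(seconds));
}

// Callers compiled with nanosecond support see only the declaration:
//     bool timed_join_seconds(boost::thread& t, int seconds);
// keeping every instantiation of timed_join inside this one file.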

c++ threads - parallel processing

I was wondering how to execute two processes in a dual-core processor in c++.
I know threads (or multi-threading) is not a built-in feature of c++.
There is threading support in Qt, but I did not understand anything from their reference. :(
So, does anyone know a simple way for a beginner to do it. Cross-platform support (like Qt) would be very helpful since I am on Linux.
Try Multithreading in C++0x, Part 1: Starting Threads as a 101. If your compiler does not have C++0x support, stay with Boost.Thread.
Take a look at Boost.Thread. This is cross-platform and a very good library to use in your C++ applications.
What specifically would you like to know?
The POSIX threads (pthreads) library is probably your best bet if you just need a simple threading library; it has implementations on both Windows and Linux.
A guide can be found e.g. here. A Win32 implementation of pthreads can be downloaded here.
Edit: Didn't see you were on Linux. In that case I'm not 100% sure, but I think the libraries are probably already bundled with your GCC installation.
I'd recommend using the Boost.Thread library instead. It wraps the platform specifics of Win32 and POSIX, and gives you a solid set of threading and synchronization objects. It's also in very heavy use, so finding help on any issues you encounter on SO and other sites is easy.
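For a first taste, here is a minimal Boost.Thread program (assuming you link with -lboost_thread):

#include <boost/thread.hpp>
#include <iostream>

void work()
{
    std::cout << "hello from a boost thread" << std::endl;
}

int main()
{
    boost::thread t(&work);
    t.join(); // wait for the thread to finish
    return 0;
}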
You can search for a free PDF book "C++-GUI-Programming-with-Qt-4-1st-ed.zip" and read Chapter 18 about Multi-threading in Qt.
Concurrent programming features supported by Qt include (but are not limited to) the following:
Mutex
Read Write Lock
Semaphore
Wait Condition
Thread Specific Storage
However, be aware of the following trade-offs with Qt:
Performance penalties vs. native threading libraries. POSIX threads (pthreads) have been native to Linux since kernel 2.4 and may not substitute for <process.h> in the W32API in all situations.
Inter-thread communication in Qt is implemented with SIGNAL and SLOT constructs. These are NOT part of the C++ language and are implemented as macros which require the proprietary code generators provided by Qt in order to compile.
If you can live with the above limitations, just follow these recipes for using QThread:
#include <QtCore>
Derive your own class from QThread. You must implement a public function run() that returns void, containing the instructions to be executed.
Instantiate your own class and call start() to kick off a new thread.
Sample code:
#include <QtCore>

class MyThread : public QThread {
public:
    void run() {
        // do something
    }
};

int main(int argc, char** argv) {
    MyThread t1, t2;
    t1.start(); // default implementation from QThread::start() is fine
    t2.start(); // another thread
    t1.wait();  // wait for thread to finish
    t2.wait();
    return 0;
}
As an important note: since C++11, concurrent threading is available in the standard library itself:
#include <future>
#include <string>
#include <thread>

class Example
{
public: // member functions must be accessible to take their addresses in main
    auto DoStuff() -> std::string
    {
        return "Doing Stuff";
    }

    auto DoStuff2() -> std::string
    {
        return "Doing Stuff 2";
    }
};

int main()
{
    Example EO;

    std::string (Example::*func_pointer)();
    func_pointer = &Example::DoStuff;
    // std::launch::async: launches upon declaration
    std::future<std::string> thread_one = std::async(std::launch::async, func_pointer, &EO);

    std::string (Example::*func_pointer_2)();
    func_pointer_2 = &Example::DoStuff2;
    std::future<std::string> thread_two = std::async(std::launch::deferred, func_pointer_2, &EO);
    thread_two.get(); // std::launch::deferred: launches upon calling get()
}
Both std::async (with std::launch::async or std::launch::deferred) and std::thread are fully compatible with Qt, and in some cases may work better across different OS environments.
For parallel processing, see this.