Is it possible to force MPI to always block on send? - c++

Is there a way to force MPI to always block on send? This might be useful when looking for deadlocks in a distributed algorithm that would otherwise be hidden by the buffering MPI might choose to do on send.
For example, the following program (run with 2 processes) works without problems on my machine:
// C++
#include <iostream>
#include <thread>
// Boost
#include <boost/mpi.hpp>
namespace mpi = boost::mpi;
int main() {
    using namespace std::chrono_literals;

    mpi::environment env;
    mpi::communicator world;

    auto me = world.rank();
    auto other = 1 - me;
    char buffer[10] = {0};

    while (true) {
        world.send(other, 0, buffer);
        world.recv(other, 0, buffer);
        std::cout << "Node " << me << " received" << std::endl;
        std::this_thread::sleep_for(200ms);
    }
}
But if I change the size of the buffer to 10000, it blocks indefinitely.

For pure MPI codes, what you describe is exactly what MPI_Ssend() gives you. However, you are not using pure MPI here; you are using boost::mpi. And unfortunately, according to boost::mpi's documentation, MPI_Ssend() isn't supported.
That said, maybe boost::mpi offers another way, but I doubt it.

If you want blocking behavior, use MPI_Ssend. It will block until a matching receive has been posted, without buffering the request. The amount of buffering provided by MPI_Send is (intentionally) implementation specific. The behavior you get for a buffer of 10000 may differ across implementations.
I don't know if you can actually tweak the buffering configuration, and I wouldn't try, because it would not be portable. Instead, I'd use the MPI_Ssend variant in some debug configuration and the default MPI_Send when the best performance is needed.
(disclaimer: I'm not familiar with boost's implementation, but MPI is a standard. Also, I saw Gilles' comment after posting this answer...)
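Since boost::mpi doesn't expose MPI_Ssend(), one workaround is to drop down to the C API just for the send while keeping boost::mpi everywhere else: boost::mpi::communicator converts implicitly to MPI_Comm. A minimal sketch of the debug/release switch idea (DEBUG_MPI is a hypothetical compile-time flag, not part of any library):

// Sketch: synchronous send via the raw MPI C API, usable alongside boost::mpi.
#include <boost/mpi.hpp>
#include <mpi.h>

namespace mpi = boost::mpi;

void send_chars(mpi::communicator& world, int dest, int tag,
                char* buffer, int count)
{
    // boost::mpi::communicator converts implicitly to MPI_Comm.
#ifdef DEBUG_MPI
    // Synchronous send: blocks until the matching receive is posted.
    MPI_Ssend(buffer, count, MPI_CHAR, dest, tag, world);
#else
    // Standard send: may buffer; behavior is implementation specific.
    MPI_Send(buffer, count, MPI_CHAR, dest, tag, world);
#endif
}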

You can consider tuning the eager limit value (http://blogs.cisco.com/performance/what-is-an-mpi-eager-limit) to force the send operation to block for any message size. How to set the eager limit depends on the MPI implementation; on Intel MPI, for instance, you can use the I_MPI_EAGER_THRESHOLD environment variable (see https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm).

Related

How to read data in a specific memory address in c++

I have created an integer variable using the following code in first.cpp:
#include <iostream>
#include <string> // needed for std::string

using namespace std;

int main()
{
    int myvar = 10;
    cout << &myvar;
    // I only need the two steps above.
    // The following steps are coded to make this program run continuously.
    cout << "Enter your name" << endl;
    string name;
    cin >> name;
    return 0;
}
Output of first.cpp is:
0x6dfed4
While the above program is running, I also run the following program in second.cpp:
#include <iostream>

using namespace std;

int main()
{
    // In this program I want to read the content of the myvar variable in first.cpp.
    // How to do it?
}
Using the second program, I want to read the content of the myvar variable in first.cpp.
How can I do it?
Thank you.
To elaborate on my comment above:
As I wrote, each program will run in a different process.
Each process has a separate address space.
Therefore using the address of a variable in another process doesn't make any sense.
In order to communicate between processes, you need some kind of IPC (inter-process communication):
Inter-process communication
If you only need to share a variable, the first mechanism that comes to mind is to use shared memory:
Shared memory
Shared-memory is very much OS dependent.
You didn't mention which OS you are using.
On Windows you can use Win API for managing shared memory:
Windows shared memory
There's an equivalent on Linux, but I am not familiar with the details.
Alternatively you can use the boost solution which is cross platform:
boost shared memory
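To make the shared-memory suggestion concrete, here is a minimal Boost.Interprocess sketch; the segment name "myvar_shm" is made up for the example:

// Writer (stands in for first.cpp): publishes myvar in named shared memory.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <iostream>

int main()
{
    using namespace boost::interprocess;
    shared_memory_object::remove("myvar_shm");          // start clean
    shared_memory_object shm(create_only, "myvar_shm", read_write);
    shm.truncate(sizeof(int));                          // room for one int
    mapped_region region(shm, read_write);
    *static_cast<int*>(region.get_address()) = 10;      // this is "myvar"
    std::cout << "Press enter to quit..." << std::endl;
    std::cin.get();                                     // keep the writer alive
    shared_memory_object::remove("myvar_shm");
}

The reader (stands in for second.cpp) opens the same segment by name:

// Reader: opens the named segment and reads the int.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <iostream>

int main()
{
    using namespace boost::interprocess;
    shared_memory_object shm(open_only, "myvar_shm", read_only);
    mapped_region region(shm, read_only);
    std::cout << *static_cast<const int*>(region.get_address()) << std::endl;
}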
If you delve into that, it will become clear that these kinds of solutions come with some complications.
So the question is: why do you need to do it? Maybe there's a better solution to your problem
(you haven't described what it actually is, so there's not much more I can say).

What's the purpose of Boost pipe and why it's important?

Apologies if this question is overly broad. I'm new to C++ and trying to understand different stream types and why they matter (or don't matter).
I'm learning by coding a simple program that launches a child process and processes its output. I'm following the Boost process synchronous IO example: https://www.boost.org/doc/libs/1_75_0/doc/html/boost_process/tutorial.html#boost_process.tutorial.io.
One of the examples can be reduced to this:
#include <boost/process.hpp>
using namespace std;
using namespace boost::process;
int main(int argc, char *argv[]) {
    opstream in;
    ipstream out;
    child c("c++filt", std_out > out, std_in < in);

    in << "_ZN5boost7process8tutorialE" << endl;
    in.pipe().close(); // This will help c++filt quit, so we don't hang at wait() forever
    c.wait();
    return 0;
}
My question is:
Why do we have to use a boost opstream? Can I use istringstream instead (besides the fact that it doesn't compile)? Can I make it compile with istringstream?
Boost document said:
Boost.process provides the pipestream (ipstream, opstream, pstream) to wrap around the pipe and provide an implementation of the std::istream, std::ostream and std::iostream interface.
Does being a pipe matter, i.e., does the pipe have significant implications here?
What Are Processes, How Do They Talk?
Programs interact with their environment in various ways. One set of channels are the standard input, output and error streams.
These are often tied to a terminal or files by a shell (cmd.exe, sh, bash etc).
Now if programs interact with each other, like:
ls | rev
to list files and send the output to another program (rev, which reverses each line), this is implemented with pipes. Pipes are an operating system feature, not a boost idea. All major operating systems have them.
Fun fact: the | operator used in most shells to indicate this type of output/input redirection between processes is called the PIPE symbol.
What Is A Pipe, Then?
Pipes are basically "magic" file descriptors that refer to an "IO channel" rather than a file. Pipes have two ends: one party writes to one end, the other party reads from the other.
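On POSIX systems the raw OS feature looks roughly like this (a sketch of pipe(2) within a single process, just to show the two ends; shells wire the ends to different processes):

#include <cstdio>
#include <unistd.h>

int main()
{
    int fds[2];
    if (pipe(fds) != 0)                // fds[0]: read end, fds[1]: write end
        return 1;

    const char msg[] = "hello through the pipe";
    write(fds[1], msg, sizeof(msg));   // one party writes to one end

    char buf[64] = {0};
    read(fds[0], buf, sizeof(buf));    // the other reads from the other end
    std::printf("%s\n", buf);

    close(fds[0]);
    close(fds[1]);
}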
Why?
Two reasons that come to mind right away
Files require disk IO and syncing, making them slow.
Another fun fact: MS-DOS implemented pipes in terms of temporary files (on disk) for a very long time:
MS-DOS 2.0 introduced the ability to pipe the output of one program as the input of another. Since MS-DOS was a single-tasking operating system, this was simulated by redirecting the first program’s output to a temporary file and running it to completion, then running the second program with its input redirected from that temporary file. Now all of a sudden, MS-DOS needed a location to create temporary files! For whatever reason, the authors of MS-DOS chose to use the TEMP variable to control where these temporary files were created.
The pipe enables asynchronous IO. This can be important in case processes have two-way (full duplex) IO going on.
Okay Do I Care?
Yes, no, maybe.
You mostly don't. The ipstream/opstream classes are 100% compatible with std::istream/std::ostream, so if you had a function that expects them:
void simulate_input(std::ostream& os)
{
    for (int i = 0; i < 10; ++i) {
        os << "_ZN5boost7process8tutorialE" << std::endl;
    }
}
You can perfectly use it in your sample:
bp::opstream in;
bp::ipstream out;
bp::child c("c++filt", bp::std_out > out, bp::std_in < in);

simulate_input(in);
in.close();
c.wait();
When You Definitely Need It
In full-duplex situations you can easily induce a deadlock where both programs wait for input from the other end because they're doing their IO synchronously.
You can find examples + solution here:
How to reproduce deadlock hinted to by Boost process documentation?
boost::process::child will not exit after closing input stream
Boost::Process Pipe Streams and Unit Test

boost::asio reasoning behind num_implementations for io_service::strand

We've been using asio in production for years now, and recently we reached a critical point where our servers became loaded just enough to notice a mysterious issue.
In our architecture, each separate entity that runs independently uses a personal strand object. Some of the entities can perform long work (reading from a file, performing a MySQL request, etc.). Obviously, the work is performed within handlers wrapped with the strand. All sounds nice and pretty and should work flawlessly, until we began to notice impossible things like timers expiring seconds after they should, even though threads were 'waiting for work', and work halting for no apparent reason. It looked like long work performed inside one strand had an impact on other unrelated strands, not all of them, but most.
Countless hours were spent pinpointing the issue. The trail led to the way the strand object is created: strand_service::construct (here).
For some reason the developers decided to have a limited number of strand implementations, meaning that some totally unrelated objects will share a single implementation and hence be bottlenecked because of it.
In the standalone (non-Boost) asio library a similar approach is used. But instead of shared implementations, each implementation is now independent but may share a mutex object with other implementations (here).
What is this all about? I have never heard of limits on the number of mutexes in a system, or of any overhead related to their creation/destruction. Though the latter problem could easily be solved by recycling mutexes instead of destroying them, as sketched below.
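To illustrate the recycling idea (this is not Asio's actual mechanism, just a sketch of the alternative being argued for):

#include <memory>
#include <mutex>
#include <vector>

// A trivial pool: mutexes are created on demand and returned for reuse,
// so each strand could get an exclusive mutex without constant create/destroy.
class MutexPool
{
public:
    std::unique_ptr<std::mutex> acquire()
    {
        std::lock_guard<std::mutex> lock(guard_);
        if (pool_.empty())
            return std::unique_ptr<std::mutex>(new std::mutex);
        std::unique_ptr<std::mutex> m = std::move(pool_.back());
        pool_.pop_back();
        return m;
    }

    void release(std::unique_ptr<std::mutex> m)
    {
        std::lock_guard<std::mutex> lock(guard_);
        pool_.push_back(std::move(m));
    }

private:
    std::mutex guard_;
    std::vector<std::unique_ptr<std::mutex>> pool_;
};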
I have a simple test case to show how dramatic the performance degradation is:
#include <boost/asio.hpp>
#include <atomic>
#include <functional>
#include <iostream>
#include <memory>   // std::shared_ptr, std::unique_ptr
#include <thread>
#include <vector>   // std::vector
#include <unistd.h> // sleep()

std::atomic<bool> running{true};
std::atomic<int> counter{0};

struct Work
{
    Work(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<Work> _this(new Work(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<Work> _this)
    {
        counter.fetch_add(1, std::memory_order_relaxed);

        if (running.load(std::memory_order_relaxed)) {
            start_the_work(_this->_strand.get_io_service());
        }
    }

    boost::asio::strand _strand;
};

struct BlockingWork
{
    BlockingWork(boost::asio::io_service & io_service)
        : _strand(io_service)
    { }

    static void start_the_work(boost::asio::io_service & io_service)
    {
        std::shared_ptr<BlockingWork> _this(new BlockingWork(io_service));
        _this->_strand.get_io_service().post(_this->_strand.wrap(std::bind(do_the_work, _this)));
    }

    static void do_the_work(std::shared_ptr<BlockingWork> _this)
    {
        sleep(5);
    }

    boost::asio::strand _strand;
};

int main(int argc, char ** argv)
{
    boost::asio::io_service io_service;
    std::unique_ptr<boost::asio::io_service::work> work{new boost::asio::io_service::work(io_service)};

    for (std::size_t i = 0; i < 8; ++i) {
        Work::start_the_work(io_service);
    }

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < 8; ++i) {
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
    }

    if (argc > 1) {
        std::cout << "Spawning a blocking work" << std::endl;
        workers.push_back(std::thread([&io_service] {
            io_service.run();
        }));
        BlockingWork::start_the_work(io_service);
    }

    sleep(5);
    running = false;
    work.reset();

    for (auto && worker : workers) {
        worker.join();
    }

    std::cout << "Work performed:" << counter.load() << std::endl;
    return 0;
}
Build it using this command:
g++ -o asio_strand_test_case -pthread -I/usr/include -std=c++11 asio_strand_test_case.cpp -lboost_system
Test run the usual way:
time ./asio_strand_test_case
Work performed:6905372
real 0m5.027s
user 0m24.688s
sys 0m12.796s
Test run with a long blocking work:
time ./asio_strand_test_case 1
Spawning a blocking work
Work performed:770
real 0m5.031s
user 0m0.044s
sys 0m0.004s
The difference is dramatic. What happens is that each new non-blocking work item creates a new strand object, until one shares the same implementation with the strand of the blocking work. When this happens, it's a dead end until the long work finishes.
Edit:
Reduced the parallel work down to the number of worker threads (from 1000 to 8) and updated the test run output. I did this because the issue is more visible when both numbers are close.
Well, an interesting issue and +1 for giving us a small example reproducing the exact issue.
The problem you are having, as I understand it, with the Boost implementation is that by default it instantiates only a limited number of strand_impl objects: 193, as I see in my version of Boost (1.59).
Now, what this means is that a large number of requests will be in contention, as they will be waiting for the lock to be unlocked by another handler (using the same instance of strand_impl).
My guess is that this is done to avoid overloading the OS by creating lots and lots of mutexes. That would be bad. The current implementation allows the locks to be reused (and in a configurable way, as we will see below).
In my setup:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -O2 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:489696
real 0m5.016s
user 0m1.620s
sys 0m4.069s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:188480
real 0m5.031s
user 0m0.611s
sys 0m1.495s
Now, there is a way to change this number of cached implementations by setting the macro BOOST_ASIO_STRAND_IMPLEMENTATIONS.
Below is the result I got after setting it to a value of 1024:
MacBook-Pro:asio_test amuralid$ g++ -std=c++14 -DBOOST_ASIO_STRAND_IMPLEMENTATIONS=1024 -o strand_issue strand_issue.cc -lboost_system -pthread
MacBook-Pro:asio_test amuralid$ time ./strand_issue
Work performed:450928
real 0m5.017s
user 0m2.708s
sys 0m3.902s
MacBook-Pro:asio_test amuralid$ time ./strand_issue 1
Spawning a blocking work
Work performed:458603
real 0m5.027s
user 0m2.611s
sys 0m3.902s
Almost the same for both cases! You might want to adjust the value of the macro as per your needs to keep the deviation small.
Note that if you don't like Asio's implementation you can always write your own strand which creates a separate implementation for each strand instance. This might be better for your particular platform than the default algorithm.
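A minimal sketch of such a per-instance strand, assuming handlers only need a post()-style interface (no wrap(), no allocator hooks, and the strand must outlive its queued handlers):

#include <boost/asio.hpp>
#include <deque>
#include <functional>
#include <mutex>

// Each instance owns its own mutex and queue, so unrelated strands never contend.
class simple_strand
{
public:
    explicit simple_strand(boost::asio::io_service& io) : io_(io) {}

    void post(std::function<void()> handler)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        queue_.push_back(std::move(handler));
        if (!running_) {                     // schedule a drain if none is in flight
            running_ = true;
            io_.post([this] { run_one(); });
        }
    }

private:
    void run_one()
    {
        std::function<void()> handler;
        {
            std::unique_lock<std::mutex> lock(mutex_);
            handler = std::move(queue_.front());
            queue_.pop_front();
        }
        handler();                           // run outside the lock

        std::unique_lock<std::mutex> lock(mutex_);
        if (queue_.empty())
            running_ = false;                // nothing left; the next post reschedules
        else
            io_.post([this] { run_one(); }); // keep draining, one handler at a time
    }

    boost::asio::io_service& io_;
    std::mutex mutex_;
    std::deque<std::function<void()>> queue_;
    bool running_ = false;
};

Because at most one run_one() is in flight per instance, handlers posted to the same simple_strand never run concurrently, yet no implementation is ever shared between instances.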
Edit: As of recent Boosts, standalone ASIO and Boost.ASIO are now in sync. This answer is preserved for historical interest.
Standalone ASIO and Boost.ASIO have become quite detached in recent years, as standalone ASIO slowly morphs into the reference Networking TS implementation for standardisation. All the "action" is happening in standalone ASIO, including major bug fixes. Only very minor bug fixes are made to Boost.ASIO. There are several years of difference between them by now.
I'd therefore suggest anyone finding any problems at all with Boost.ASIO should switch over to standalone ASIO. The conversion is usually not hard; look into the many macro configs for switching between C++11 and Boost in config.hpp. Historically, Boost.ASIO was auto-generated by script from standalone ASIO; Chris may have kept those scripts working, in which case you could regenerate a shiny new Boost.ASIO with all the latest changes. I'd suspect such a build is not well tested, however.

Threadsafe logging

I want to implement a simple class for logging from multiple threads. The idea is that each object that wants to log receives an ostream object that it can write messages to using the usual operators. The desired behaviour is that a message is added to the log when the stream is flushed. This way, messages will not get interrupted by messages from other threads. I want to avoid using a temporary stringstream to store the message, as that would make most logging statements at least two-liners. As I see it, the standard way of achieving this would be to implement my own streambuf, but this seems very cumbersome and error-prone. Is there a simpler way to do this? If not, do you know a good article/howto/guide on custom streambufs?
Thanks in advance,
Space_C0wbo0y
UPDATE:
Since it seems to work I added my own answer.
Take a look at log4cpp; it has multi-thread support. It may save you time.
So, I took a look at Boost.IOstreams and here's what I've come up with:
class TestSink : public boost::iostreams::sink {
public:
    std::streamsize write( const char * s, std::streamsize n ) {
        std::string message( s, n );
        /* This would add a message to the log instead of cout.
           The log implementation is threadsafe. */
        std::cout << message << std::endl;
        return n;
    }
};
TestSink can be used to create a stream buffer (see the stream_buffer template). Every thread will receive its own instance of TestSink, but all TestSinks will write to the same log. TestSink is used as follows:
TestSink sink;
boost::iostreams::stream_buffer< TestSink > testbuf( sink, 50000 );
std::ostream out( &testbuf );

for ( int i = 0; i < 10000; i++ )
    out << "test" << i;
out << std::endl;
The important fact here is that TestSink.write is only called when the stream is flushed (std::endl or std::flush), or when the internal buffer of the stream_buffer instance is full (the default buffer size cannot hold 40000 chars, so I initialize it to 50000). In this program, TestSink.write is called exactly once (the output is too long to post here). This way I can write log messages using normal formatted stream IO without any temporary variables and be sure that the message is posted to the log in one piece when I flush the stream.
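For completeness, the threadsafe log that the sink's comment alludes to could be as simple as a mutex that serializes complete messages; a sketch (post_to_log is a made-up name):

#include <boost/thread/mutex.hpp>
#include <iostream>
#include <string>

namespace {
    boost::mutex log_mutex;  // one mutex guards the shared log sink

    // Posts one complete, already-formatted message to the log.
    void post_to_log(const std::string& message)
    {
        boost::mutex::scoped_lock lock(log_mutex);
        std::cout << message << std::endl;  // stand-in for the real log target
    }
}

TestSink::write would then call post_to_log(message) instead of writing to cout directly.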
I will leave the question open another day, in case there are different suggestions/problems I have not considered.
You think log4cpp is too heavy and you reach for Boost.IOStreams instead? Huh?
You may wish to consider logog. It's thread-safe for POSIX, Win32 and Win64.
Re. your own response: if you are using this for error logging and your program crashes before flushing your stream, then your logging is a bit useless, isn't it?

Non-threadsafe file I/O in C/C++

While troubleshooting some performance problems in our apps, I found out that C's stdio.h functions (and, at least for our vendor, C++'s fstream classes) are threadsafe. As a result, every time I do something as simple as fgetc, the RTL has to acquire a lock, read a byte, and release the lock.
This is not good for performance.
What's the best way to get non-threadsafe file I/O in C and C++, so that I can manage locking myself and get better performance?
MSVC provides _fputc_nolock, and GCC provides unlocked_stdio and flockfile, but I can't find any similar functions in my compiler (CodeGear C++Builder).
I could use the raw Windows API, but that's not portable, and I assume it would be slower than an unlocked fgetc for character-at-a-time I/O.
I could switch to something like the Apache Portable Runtime, but that could potentially be a lot of work.
How do others approach this?
Edit: Since a few people wondered, I had tested this before posting. fgetc doesn't do system calls if it can satisfy reads from its buffer, but it does still do locking, so locking ends up taking an enormous percentage of time (hundreds of locks to acquire and release for a single block of data read from disk). Not doing character-at-a-time I/O would be a solution, but C++Builder's fstream classes unfortunately use fgetc (so if I want to use iostream classes, I'm stuck with it), and I have a lot of legacy code that uses fgetc and friends to read fields out of record-style files (which would be reasonable if it weren't for locking issues).
I'd simply not do IO one char at a time where performance matters.
fgetc is almost certainly not reading a byte each time you call it (where by 'reading' I mean invoking a system call to perform I/O). Look somewhere else for your performance bottleneck, as this is probably not the problem, and using unsafe functions is certainly not the solution. Any lock handling you do will probably be less efficient than the handling done by the standard routines.
The easiest way would be to read the entire file into memory, and then provide your own fgetc-like interface to that buffer.
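A sketch of that approach; since the buffer is owned by a single reader, no locking is involved after the initial slurp:

#include <cstdio>
#include <vector>

class BufferedReader
{
public:
    explicit BufferedReader(const char* path) : pos_(0)
    {
        if (std::FILE* f = std::fopen(path, "rb")) {
            std::fseek(f, 0, SEEK_END);
            long size = std::ftell(f);
            std::fseek(f, 0, SEEK_SET);
            if (size > 0) {
                buf_.resize(static_cast<std::size_t>(size));
                std::size_t got = std::fread(&buf_[0], 1, buf_.size(), f);
                buf_.resize(got);           // keep only what was actually read
            }
            std::fclose(f);
        }
    }

    // fgetc-like interface to the in-memory buffer, no locking anywhere.
    int getc()
    {
        return pos_ < buf_.size()
            ? static_cast<unsigned char>(buf_[pos_++])
            : EOF;
    }

private:
    std::vector<char> buf_;
    std::size_t pos_;
};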
Why not just memory-map the file? Memory mapping is portable (except on Windows Vista, which requires you to jump through hoops to use it now, the dumbasses). Anyhow, map your file into memory, and do your own locking/not-locking on the resulting memory location.
The OS handles all the locking required to actually read from the disk; you'll NEVER be able to eliminate this overhead. But your processing overhead, on the other hand, won't be affected by extraneous locking other than what you do yourself.
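A POSIX sketch of the memory-mapping approach (Windows would use CreateFileMapping/MapViewOfFile instead); scanning the mapped bytes involves no stdio locking at all:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

long count_newlines(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    const char* data = static_cast<const char*>(p);
    long n = 0;
    for (off_t i = 0; i < st.st_size; ++i)
        if (data[i] == '\n')                // your own, lock-free access
            ++n;

    munmap(p, st.st_size);
    close(fd);
    return n;
}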
The multi-platform approach is pretty simple: avoid functions or operators where the standard specifies that they should use a sentry. sentry is an inner class of the iostream classes which ensures stream consistency for every output character; in a multi-threaded environment it locks the stream-related mutex for each character being output. This avoids race conditions at a low level but still makes the output unreadable, since strings from two threads might be output concurrently, as the following example shows:
thread 1 should write: abc
thread 2 should write: def
The output might look like: adebcf instead of abcdef or defabc. This is because sentry is implemented to lock and unlock per character.
The standard defines this for all functions and operators dealing with istream or ostream. The only way to avoid it is to use stream buffers and your own locking (per string, for example).
I have written an app which outputs some data to a file and measures the speed. If you add a function here which outputs using the fstream directly, without using the buffer and flush, you will see the speed difference. It uses boost, but I hope that is not a problem for you. Try to remove all the streambuffers and see the difference with and without them. In my case the performance drawback was a factor of 2-3 or so.
The following article by N. Myers explains how locales and sentry in C++ IOStreams work. And you should definitely look up in the ISO C++ standard document which functions use a sentry.
Good Luck,
Ovanes
#include <vector>
#include <fstream>
#include <iterator>
#include <algorithm>
#include <iostream>
#include <cassert>
#include <cstdlib>
#include <boost/progress.hpp>
#include <boost/shared_ptr.hpp>

double do_copy_via_streambuf()
{
    const size_t len = 1024*2048;
    const size_t factor = 5;
    ::std::vector<char> data(len, 1);
    std::vector<char> buffer(len*factor, 0);
    ::std::ofstream
        ofs("test.dat", ::std::ios_base::binary|::std::ios_base::out);
    noskipws(ofs);
    std::streambuf* rdbuf = ofs.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
    ::std::ostreambuf_iterator<char> oi(rdbuf);
    boost::progress_timer pt;
    for(size_t i=1; i<=250; ++i)
    {
        ::std::copy(data.begin(), data.end(), oi);
        if(0==i%factor)
            rdbuf->pubsync();
    }
    ofs.flush();
    double rate = 500 / pt.elapsed();
    std::cout << rate << std::endl;
    return rate;
}

void count_avarage(const char* op_name, double (*fct)())
{
    double av_rate=0;
    const size_t repeat = 1;
    std::cout << "doing " << op_name << std::endl;
    for(size_t i=0; i<repeat; ++i)
        av_rate+=fct();
    std::cout << "average rate for " << op_name << ": " << av_rate/repeat
              << "\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n"
              << std::endl;
}

int main()
{
    count_avarage("copy via streambuf iterator", do_copy_via_streambuf);
    return 0;
}
One thing to consider is building a custom runtime. Most compilers provide the source of the runtime library (I'd be surprised if it weren't in the C++Builder package).
This could end up being a lot of work, but maybe they've localized the thread support to make something like this easy. For example, the embedded-systems compiler I'm using is designed for this: they have documented hooks to add the lock routines. However, it's possible that this could become a maintenance headache, even if it turns out to be relatively easy initially.
Another similar route would be to talk to someone like Dinkumware about using a 3rd party runtime that provides the capabilities you need.