In my application I am implementing what I would call a "reduce on sparse vectors" via a TBB flow graph compared with MPI one-sided communication (RMA). The central piece of the algorithm looks as follows:
auto &reduce = m_g_R.add<function_node<ReductionJob, ReductionJob>>(
serial,
[=, &reduced_bi](ReductionJob rj) noexcept
{
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
if (m_R_comms[r] != MPI_COMM_NULL)
{
const size_t n = reduced_bi.dim(r);
MPI_Win win;
MPI_Win_create(
buffer,
r == mr ? n * sizeof(T) : 0,
sizeof(T),
MPI_INFO_NULL,
m_R_comms[r],
&win
);
if (n > 0 && r != mr)
{
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
size_t i = 0;
do
{
while (i < n && !mask[i]) ++i;
size_t base = i;
while (i < n && mask[i]) ++i;
if (i > base) MPI_Accumulate(
buffer + base, i - base, MpiType<T>::type,
0,
base, i - base, MpiType<T>::type,
MPI_SUM,
win
);
}
while (i < n);
MPI_Win_unlock(0, win);
}
MPI_Win_free(&win);
}
return rj;
}
);
This is executed for each rank r participating in the calculation, with reduced_bi.dim(r) specifying how many elements each rank owns. mr is the current rank, and the communicators are created in such a way that the target process is root for each of them. buffer is an array of T = double (typically), and mask is an std::vector<bool> identifying which elements are non-zero. The combination of loops splits the communication into chunks of non-zero elements.
This generally works fine and the results are correct, matching my previous implementation using MPI_Reduce. However, it seems to be crucial that the concurrency level for this node is set to serial, meaning that at most one TBB task (and thus at most one thread) executes this code at a time.
I would like to set it to unlimited to improve performance, and indeed that works fine on my laptop with small jobs, running MPICH 3.4.1. On the cluster where I actually want to run the computation, however, which uses OpenMPI 4.1.1, it runs for a while before crashing with a segfault and a backtrace involving a bunch of UCX functions.
So I am wondering: is it simply not allowed to have multiple threads issue RMA operations in parallel like this (and it only works accidentally on my laptop), or am I hitting a bug or limitation on the cluster? Nothing in the documentation tells me directly that what I am trying to do is unsupported.
Of course, MPI is initialized with MPI_THREAD_MULTIPLE, and to repeat: the snippet as posted above works fine; only when I change serial to unlimited to enable concurrent execution do I hit the problem on the cluster.
In reply to Victor Eijkhout's comment(s) below, here is a complete sample program that reproduces the issue. It runs fine on my laptop (tested specifically with mpirun -n 16), but crashes on the cluster when I run it with 16 ranks spread across 4 cluster nodes.
#include <iostream>
#include <vector>
#include <thread>
#include <mpi.h>
int main(void)
{
int requested = MPI_THREAD_MULTIPLE, provided;
MPI_Init_thread(nullptr, nullptr, requested, &provided);
if (provided != requested)
{
std::cerr << "Failed to initialize MPI with full thread support!"
<< std::endl;
exit(1);
}
int mr, nr;
MPI_Comm_rank(MPI_COMM_WORLD, &mr);
MPI_Comm_size(MPI_COMM_WORLD, &nr);
const size_t dim = 1024;
const size_t repeat = 100;
std::vector<double> send(dim, static_cast<double>(mr) + 1.0);
std::vector<double> recv(dim, 0.0);
MPI_Win win;
MPI_Win_create(
recv.data(),
recv.size() * sizeof(double),
sizeof(double),
MPI_INFO_NULL,
MPI_COMM_WORLD,
&win
);
std::vector<std::thread> threads;
for (size_t i = 0; i < repeat; ++i)
{
threads.clear();
threads.reserve(nr);
for (int r = 0; r < nr; ++r) if (r != mr)
{
threads.emplace_back([r, &send, &win]
{
MPI_Win_lock(MPI_LOCK_SHARED, r, 0, win);
for (size_t i = 0; i < dim; ++i) MPI_Accumulate(
send.data() + i, 1, MPI_DOUBLE,
r,
i, 1, MPI_DOUBLE,
MPI_SUM,
win
);
MPI_Win_unlock(r, win);
});
}
for (auto &t : threads) t.join();
MPI_Barrier(MPI_COMM_WORLD);
if (mr == 0) std::cout << recv.front() << std::endl;
}
MPI_Win_free(&win);
MPI_Finalize();
}
Note: I am intentionally using plain threads here to avoid unnecessary dependencies. It should be linked with -lpthread.
The specific error I get on the cluster is this, using OpenMPI 4.1.1:
*** An error occurred in MPI_Accumulate
*** reported by process [1829189442,11]
*** on win ucx window 3
*** MPI_ERR_RMA_SYNC: error executing rma sync
*** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
*** and potentially your MPI job)
Possibly relevant parts of the ompi_info output:
Open MPI: 4.1.1
Open MPI repo revision: v4.1.1
Open MPI release date: Apr 24, 2021
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes, OMPI progress: no, Event lib: yes)
It has been compiled with UCX/1.10.1.
The style in C++ is to put the * or & with the type, not the identifier. This is called out specifically near the beginning of Stroustrup’s first book, and is an intentional difference from C style.
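For example:
int x = 0;
int* p = &x;     // * written with the type, not the identifier
int& r = x;      // likewise for &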
Create -- Lock -- Unlock -- Free
C++ Core Guidelines R.1: Manage resources automatically using resource handles and RAII (Resource Acquisition Is Initialization)
Use a wrapper class, either written for the purpose, designed for this C API, a general purpose resource manager template, or unique_ptr with a custom deleter, rather than explicit calls that must be matched up for correct behavior.
RAII is one of C++'s fundamental strengths, and using it will go a long way toward making your code less buggy and more maintainable over time.
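A minimal sketch of what such a handle could look like for MPI_Win (the class name and interface are made up for illustration; a real wrapper would probably also manage the lock/unlock epoch):
#include <mpi.h>

// Owns an MPI window: created in the constructor, freed in the destructor.
class WinGuard
{
    MPI_Win m_win;
public:
    WinGuard(void* base, MPI_Aint size, int disp_unit, MPI_Comm comm)
    {
        MPI_Win_create(base, size, disp_unit, MPI_INFO_NULL, comm, &m_win);
    }
    ~WinGuard() { MPI_Win_free(&m_win); }
    WinGuard(const WinGuard&) = delete;            // non-copyable: unique ownership
    WinGuard& operator=(const WinGuard&) = delete;
    MPI_Win get() const { return m_win; }
};
With something like that, the explicit MPI_Win_free in the node body disappears and the window is released on every path out of the scope.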
Use structured bindings (destructuring syntax).
const auto r = std::get<0>(rj);
auto *buffer = std::get<1>(rj)->data.data();
auto &mask = std::get<1>(rj)->mask;
Rather than referring to get<0> and get<1>, you can name the components immediately:
const auto& [r, fred] = rj;
auto* buffer = fred->data.data();
auto& mask = fred->mask;
Related
I have a threading issue under Windows.
I am developing a program that runs complex physical simulations for different conditions, say one condition per hour of the year, which amounts to 8760 simulations. I group those simulations per thread, so that each thread runs a for loop over 273 simulations (on average).
I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux all the threads sit between 98% and 100% usage, while under Windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows runs.
I did my best to make the objects immutable to avoid issues, and that provided a big gain on my old 8-thread Intel i7. With many more threads, however, this issue arises.
For threading, I have tried a custom parallel for, and the taskflow library. Both perform identically for what I want to do.
Is there something fundamental about Windows threads that produces this behaviour?
The custom parallel for code:
/**
 * parallel for
 * @tparam Index integer type
 * @tparam Callable function type
 * @param start start index of the loop
 * @param end final +1 index of the loop
 * @param func function to evaluate
 * @param nb_threads number of threads; if zero, it is determined automatically
 */
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt this to your example code, you could replace b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I have had very good experience with is mimalloc (there are others, such as Hoard or the one from TBB). You do not need to modify your source code; you just need to link against the allocator library as described in its documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays; that way, all elements are located next to each other in memory, allowing more efficient access (a minimal sketch follows after these notes).
But in general I suggest to use a dedicated linear algebra library such as Eigen instead of the hand rolled matrix implementation to exploit vectorization (SSE2, AVX,...) and to get the benefits of a highly optimized library.
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
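Here is a minimal sketch of the contiguous-storage idea from the first note (a made-up DenseMatrix, not your Matrix class):
#include <cstddef>
#include <vector>

// All elements in one flat buffer; element (i, j) lives at row-major offset i * cols + j.
class DenseMatrix
{
    int m_rows;
    int m_cols;
    std::vector<double> m_data;   // single contiguous allocation
public:
    DenseMatrix(int rows, int cols)
        : m_rows(rows), m_cols(cols), m_data(static_cast<std::size_t>(rows) * cols) {}
    double& operator()(int i, int j)       { return m_data[static_cast<std::size_t>(i) * m_cols + j]; }
    double  operator()(int i, int j) const { return m_data[static_cast<std::size_t>(i) * m_cols + j]; }
    int getRows() const { return m_rows; }
    int getCols() const { return m_cols; }
};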
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.
I am trying to send message to all MPI processes from a process and also receive message from all those processes in a process. It is basically an all to all communication where every process sends message to every other process (except itself) and receives message from every other process.
The following example code snippet shows what I am trying to achieve. The problem with MPI_Send is its behavior: for small message sizes it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves. As a workaround, I replaced the code below with a combined blocking send and receive, MPI_Sendrecv, for example: MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUSES_IGNORE). I make that call for every rank of MPI_COMM_WORLD inside a loop, and this approach gives me what I am trying to achieve (all-to-all communication). However, it takes a lot of time, which I want to cut down with a more time-efficient approach. I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that the buffer size (16400) may differ between iterations in the actual implementation of the MPI_all_to_all function below. I am using MPI_TAG to differentiate calls from different iterations, which I cannot do with the scatter and gather functions.
#define BUFFER_SIZE 16400
void MPI_all_to_all(int MPI_TAG)
{
int size;
int rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int* intSendPack = new int[BUFFER_SIZE]();
int* intReceivePack = new int[BUFFER_SIZE]();
for (int prId = 0; prId < size; prId++) {
if (prId != rank) {
MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
MPI_COMM_WORLD);
}
}
for (int sId = 0; sId < size; sId++) {
if (sId != rank) {
MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
}
I want to know if there is a way I can perform all to all communication using any efficient communication model. I am not sticking to MPI_Send, if there is some other way which provides me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that allows you to compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>
#define BUFFER_SIZE 16384
void point2point(int*, int*, int, int);
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int rank_id = 0, com_sz = 0;
double t0 = 0.0, tf = 0.0;
MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);
int* intSendPack = new int[BUFFER_SIZE]();
int* result = new int[BUFFER_SIZE*com_sz]();
std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
// Send-Receive
t0 = MPI_Wtime();
point2point(intSendPack, result, rank_id, com_sz);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Send-receive time: " << tf - t0 << std::endl;
// Collective
std::fill(result, result + BUFFER_SIZE*com_sz, 0);
std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
t0 = MPI_Wtime();
MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
tf = MPI_Wtime();
if (!rank_id)
std::cout << "Allgather time: " << tf - t0 << std::endl;
MPI_Finalize();
delete[] intSendPack;
delete[] result;
return 0;
}
// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
MPI_Status status;
// Exchange and store the data
for (int i=0; i<com_sz; i++){
if (i != rank_id){
MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
}
}
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, which should end up identical on all ranks. result is flat; each rank occupies BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point communication, the array is reset to its original state before the collective benchmark.
Time is measured around an MPI_Barrier, which gives you the maximum time over all ranks.
I ran the benchmark on one node of NERSC Cori KNL using Slurm. I ran each case a few times just to make sure the values are consistent and I'm not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, and it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much performance difference between the recommended Slurm settings on that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs in less than a minute with the recommended settings can run for 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster thanks to dedicated optimizations such as superior algorithms for arranging the communication. A 2-5x speedup is considerable, since communication often contributes the most to the overall time.
I am learning MPI and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the simplest possible code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
int main()
{
int world_size, world_rank;
int nFits = 100;
double arrCount[100];
double *rBuf = NULL;
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
assert(world_size!=1);
int nElements = nFits/(world_size-1);
if(world_rank>0){
for(int k = 0; k < nElements; k++)
{
arrCount[k] = k;
}}
MPI_Barrier(MPI_COMM_WORLD);
if(world_rank==0)
{
rBuf = (double*) malloc( nFits*sizeof(double));
}
MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(world_rank==0){
for(int i = 0; i < nFits; i++)
{
cout<<rBuf[i]<<"\n";
}}
MPI_Finalize();
exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must allocate memory for its part of the receive buffer.
Now, you could use MPI_Gatherv and specify recvcounts[0] (and a sendcount at the root) of 0 to follow your example closely. But usually you would prefer to write an MPI application such that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
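For completeness, a sketch of the MPI_Gatherv variant using the variable names from your program (add #include <vector>; counts and displacements are computed on every rank for simplicity):
std::vector<int> recvcounts(world_size), displs(world_size);
recvcounts[0] = 0;                         // the root contributes no data
displs[0] = 0;
for (int r = 1; r < world_size; ++r)
{
    recvcounts[r] = nElements;
    displs[r] = (r - 1) * nElements;       // pack the workers' blocks contiguously
}
int sendcount = (world_rank == 0) ? 0 : nElements;
MPI_Gatherv(arrCount, sendcount, MPI_DOUBLE,
            rBuf, recvcounts.data(), displs.data(), MPI_DOUBLE,
            0, MPI_COMM_WORLD);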
I have an algorithm where in each iteration each node has to calculate a segment of an array, where each element of x_ depends on all the elements of x.
x_[i] = some_func(x) // each x_[i] depends on the entire x
That is, each iteration takes x and calculates x_, which will be the new x for the next iteration.
A way of parallelizing this in MPI would be to split x_ between the nodes and have an Allgather call after the calculation of x_, so that each processor sends its x_ to the appropriate location in x on all the other processors, then repeat. This is very inefficient since it requires an expensive Allgather call every iteration, not to mention that it requires as many copies of x as there are nodes.
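For reference, a minimal sketch of that Allgather-per-iteration scheme (N, n_iter, rank, nranks and some_func are placeholders, and x is assumed to split evenly across the ranks):
const int chunk = N / nranks;              // number of elements owned by this rank
std::vector<double> x(N), x_(chunk);
for (int it = 0; it < n_iter; ++it)
{
    for (int i = 0; i < chunk; ++i)
        x_[i] = some_func(x, rank * chunk + i);   // each element reads all of x
    // replicate the full updated vector on every rank for the next iteration
    MPI_Allgather(x_.data(), chunk, MPI_DOUBLE,
                  x.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);
}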
I've thought of an alternative way that doesn't require copying. If the program is running on a single machine, with shared RAM, would it be possible to just share the x_ between the nodes (without copying)? That is, after calculating x_ each processor would make it visible to the other nodes, which could then use it as their x for the next iteration without needing to make several copies. I can design the algorithm so that no processor accesses the same x_ at the same time, which is why making a private copy for each node is overkill.
I guess what I'm asking is: can I share memory in MPI simply by tagging an array as shared-between-nodes, as opposed to manually making a copy for each node? (for simplicity assume I'm running on one CPU)
You can share memory within a node using MPI_Win_allocate_shared from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).
MPI functions
The following are taken from the MPI 3.1 standard.
Allocating shared memory
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
IN disp_unit local unit size for displacements, in bytes (positive integer)
IN info info argument (handle)
IN comm intra-communicator (handle)
OUT baseptr address of local allocated window segment (choice)
OUT win window object returned by the call (handle)
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)
(if you want the Fortran declaration, click the link)
You deallocate memory using MPI_Win_free. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX shared memory, but it makes the interface much simpler for the user.
Querying the node allocations
In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
IN win shared memory window object (handle)
IN rank rank in the group of window win (non-negative integer) or MPI_PROC_NULL
OUT size size of the window segment (non-negative integer)
OUT disp_unit local unit size for displacements, in bytes (positive integer)
OUT baseptr address for load/store access to window segment (choice)
int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)
(if you want the Fortran declaration, click the link above)
Synchronizing shared memory
MPI_WIN_SYNC(win)
IN win window object (handle)
int MPI_Win_sync(MPI_Win win)
This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.
You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize) to attain a consistent view of data.
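For instance, a full memory barrier in C++11 can be expressed as follows (a sketch; whether a plain fence is enough for your access pattern is something you still need to reason about):
#include <atomic>

// Full fence: comparable in intent to the GCC intrinsic __sync_synchronize().
std::atomic_thread_fence(std::memory_order_seq_cst);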
Synchronization
If you already understand interprocess shared-memory semantics, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize both memory and control flow correctly. MPI_Win_sync handles the former, while existing MPI synchronization functions like MPI_Barrier and MPI_Send+MPI_Recv work for the latter. Or you can use MPI-3 atomics to build counters and locks, as sketched below.
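As a small illustration of the last point, a shared counter could be built on top of MPI_Fetch_and_op roughly like this (a sketch, assuming an integer counter exposed at displacement 0 of a window win hosted by rank 0):
int one = 1;
int ticket = 0;
MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
// Atomically add 1 to the counter at rank 0 and fetch its previous value.
MPI_Fetch_and_op(&one, &ticket, MPI_INT, 0 /* target rank */, 0 /* disp */, MPI_SUM, win);
MPI_Win_unlock(0, win);
// 'ticket' now holds a value unique to this increment.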
Example program
The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.
This program demonstrates unidirectional pair-wise synchronization through shared memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler; a sketch of that follows after the example.
#include <stdio.h>
#include <mpi.h>
/* This function synchronizes process rank i with process rank j
* in such a way that this function returns on process rank j
* only after it has been called on process rank i.
*
* No additional semantic guarantees are provided.
*
* The process ranks are with respect to the input communicator (comm). */
int p2p_xsync(int i, int j, MPI_Comm comm)
{
/* Avoid deadlock. */
if (i==j) {
return MPI_SUCCESS;
}
int rank;
MPI_Comm_rank(comm, &rank);
int tag = 666; /* The number of the beast. */
if (rank==i) {
MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
} else if (rank==j) {
MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
}
return MPI_SUCCESS;
}
/* If val is the same at all MPI processes in comm,
* this function returns 1, else 0. */
int coll_check_equal(int val, MPI_Comm comm)
{
int minmax[2] = {-val,val};
MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
return ((-minmax[0])==minmax[1] ? 1 : 0);
}
int main(int argc, char * argv[])
{
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int * shptr = NULL;
MPI_Win shwin;
MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0,sizeof(int),
MPI_INFO_NULL, MPI_COMM_WORLD,
&shptr, &shwin);
/* l=local r=remote */
MPI_Aint rsize = 0;
int rdisp;
int * rptr = NULL;
int lint = -999;
MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
if (rptr==NULL || rsize!=sizeof(int)) {
printf("rptr=%p rsize=%zu \n", rptr, (size_t)rsize);
MPI_Abort(MPI_COMM_WORLD, 1);
}
/*******************************************************/
MPI_Win_lock_all(0 /* assertion */, shwin);
if (rank==0) {
*shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
MPI_Win_sync(shwin);
}
for (int j=1; j<size; j++) {
p2p_xsync(0, j, MPI_COMM_WORLD);
}
if (rank!=0) {
MPI_Win_sync(shwin);
}
lint = *rptr;
MPI_Win_unlock_all(shwin);
/*******************************************************/
if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
if (rank==0) {
printf("SUCCESS!\n");
}
} else {
printf("rank %d: lint = %d \n", rank, lint);
}
MPI_Win_free(&shwin);
MPI_Finalize();
return 0;
}
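As a footnote to the WORM remark above, the write-once, read-many case could look roughly like this (a sketch with a 1024-element slab and no error handling; rank and the MPI init/teardown are as in the example above):
/* Rank 0 writes the slab exactly once; everyone else only reads it afterwards. */
double * slab = NULL;
MPI_Win shwin;
MPI_Win_allocate_shared(rank==0 ? 1024*sizeof(double) : 0, sizeof(double),
                        MPI_INFO_NULL, MPI_COMM_WORLD, &slab, &shwin);
if (rank!=0) {
    MPI_Aint sz; int du;
    MPI_Win_shared_query(shwin, 0, &sz, &du, &slab);  /* address of rank 0's slab */
}
MPI_Win_lock_all(0, shwin);
if (rank==0) {
    for (int i=0; i<1024; i++) slab[i] = (double)i;   /* the single write phase */
    MPI_Win_sync(shwin);
}
MPI_Barrier(MPI_COMM_WORLD);                          /* control-flow synchronization */
if (rank!=0) MPI_Win_sync(shwin);
/* ... arbitrary read-only use of slab on all ranks ... */
MPI_Win_unlock_all(shwin);
MPI_Win_free(&shwin);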
I've got a function that is typically run 50 times (to run 50 simulations). Usually this is done sequentially in a single thread, but I'd like to speed things up using multiple threads. The threads don't need to access each other's memory or data, so I don't think data races are an issue. Essentially each thread should just complete its task and report back to main that it has finished, also returning a double value.
First of all, looking through all the Boost documentation and examples has really confused me, and I'm not sure what I'm looking for anymore: boost::thread? boost::future? Could someone give an example of what is applicable in my case? Additionally, I don't understand how to specify how many threads to run; is it more that I would run 50 threads and the OS handles when to execute them?
If your code is completely CPU-bound (no network/disk IO), then you would benefit from starting as many background threads as you have CPUs. Use Boost's hardware_concurrency() function to determine that number and/or allow the user to set it. Just starting a bunch of threads is not helpful, as that will increase the overhead caused by creating, switching and terminating threads.
The code starting the threads is a simple loop, followed by another loop to wait for the threads' completion. You can also use the thread_group class for that (a sketch follows below). If the number of jobs is not known and can't be distributed at thread startup, consider using a thread pool where you start a sensible number of threads and then hand them jobs as they become available.
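A minimal sketch of the thread_group approach, batching at most one thread per core (Data and run_sim are assumed to be the questioner's types; run_one and run_all are made-up helpers):
#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>

double run_sim(Data*);                         // as in the question

void run_one(Data* d, double* out) { *out = run_sim(d); }   // store the result through a pointer

void run_all(Data* data, double* results, unsigned ntasks)
{
    unsigned nprocs = boost::thread::hardware_concurrency();
    if (nprocs == 0) nprocs = 2;               // could not detect the core count
    for (unsigned first = 0; first < ntasks; first += nprocs)
    {
        boost::thread_group group;             // one batch of at most nprocs threads
        for (unsigned i = first; i < first + nprocs && i < ntasks; ++i)
            group.create_thread(boost::bind(&run_one, &data[i], &results[i]));
        group.join_all();                      // wait for this batch before starting the next
    }
}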
Read the Boost.Thread Futures docs for an idea of using futures and async to achieve this. It also shows how to do it manually (the hard way) using thread objects.
Given this serial code:
double run_sim(Data*);
int main()
{
const unsigned ntasks = 50;
double results[ntasks];
Data data[ntasks];
for (unsigned i=0; i<ntasks; ++i)
results[i] = run_sim(&data[i]);
}
A naive parallel version would be:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/bind.hpp>
double run_sim(Data*);
int main()
{
const unsigned nsim = 50;
Data data[nsim];
boost::future<double> futures[nsim];
for (unsigned i=0; i<nsim; ++i)
futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
double results[nsim];
for (unsigned i=0; i<nsim; ++i)
results[i] = futures[i].get();
}
Because boost::async doesn't yet support deferred functions, every async call will create a new thread, so this will spawn 50 threads at once. This might perform quite badly, so you could split it up into smaller blocks:
#define BOOST_THREAD_PROVIDES_FUTURE
#include <boost/thread/future.hpp>
#include <boost/thread/thread.hpp>
#include <boost/bind.hpp>
double run_sim(Data*);
int main()
{
const unsigned nsim = 50;
unsigned nprocs = boost::thread::hardware_concurrency();
if (nprocs == 0)
nprocs = 2; // cannot determine number of cores, let's say 2
Data data[nsim];
boost::future<double> futures[nsim];
double results[nsim];
for (unsigned i=0; i<nsim; ++i)
{
if ( ((i+1) % nprocs) != 0 )
futures[i] = boost::async(boost::bind(&run_sim, &data[i]));
else
results[i] = run_sim(&data[i]);
}
for (unsigned i=0; i<nsim; ++i)
if ( ((i+1) % nprocs) != 0 )
results[i] = futures[i].get();
}
If hardware_concurrency() returns 4, this will create three new threads, then call run_sim synchronously in the main() thread, then create another three new threads, then call run_sim synchronously again, and so on. This prevents all 50 threads from being created at once, as the main thread stops to do some of the work, which allows some of the other threads to complete.
The code above requires quite a recent version of Boost; it's slightly easier using standard C++ if you can use C++11:
#include <future>
double run_sim(Data*);
int main()
{
const unsigned nsim = 50;
Data data[nsim];
std::future<double> futures[nsim];
double results[nsim];
unsigned nprocs = std::thread::hardware_concurrency();
if (nprocs == 0)
nprocs = 2;
for (unsigned i=0; i<nsim; ++i)
{
if ( ((i+1) % nprocs) != 0 )
futures[i] = std::async(std::launch::async, &run_sim, &data[i]);
else
results[i] = run_sim(&data[i]);
}
for (unsigned i=0; i<nsim; ++i)
if ( ((i+1) % nprocs) != 0 )
results[i] = futures[i].get();
}