Share memory across MPI nodes to prevent unnecessary copying - C++

I have an algorithm where in each iteration each node has to calculate a segment of an array, where each element of x_ depends on all the elements of x.
x_[i] = some_func(x) // each x_[i] depends on the entire x
That is, each iteration takes x and calculates x_, which will be the new x for the next iteration.
One way of parallelizing this in MPI would be to split x_ between the nodes and have an Allgather call after the calculation of x_, so that each processor would send its x_ to the appropriate location in x on all the other processors, then repeat. This is very inefficient since it requires an expensive Allgather call every iteration, not to mention that it requires as many copies of x as there are nodes.
I've thought of an alternative way that doesn't require copying. If the program is running on a single machine, with shared RAM, would it be possible to just share the x_ between the nodes (without copying)? That is, after calculating x_ each processor would make it visible to the other nodes, which could then use it as their x for the next iteration without needing to make several copies. I can design the algorithm so that no processor accesses the same x_ at the same time, which is why making a private copy for each node is overkill.
I guess what I'm asking is: can I share memory in MPI simply by tagging an array as shared-between-nodes, as opposed to manually making a copy for each node? (for simplicity assume I'm running on one CPU)

You can share memory within a node using MPI_Win_allocate_shared from MPI-3. It provides a portable way to use Sys5 and POSIX shared memory (and anything similar).
MPI functions
The following are taken from the MPI 3.1 standard.
Allocating shared memory
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
IN disp_unit local unit size for displacements, in bytes (positive integer)
IN info info argument (handle)
IN comm intra-communicator (handle)
OUT baseptr address of local allocated window segment (choice)
OUT win window object returned by the call (handle)
int MPI_Win_allocate_shared(MPI_Aint size, int disp_unit, MPI_Info info, MPI_Comm comm, void *baseptr, MPI_Win *win)
(see the MPI 3.1 standard for the Fortran declaration)
You deallocate memory using MPI_Win_free. Both allocation and deallocation are collective. This is unlike Sys5 or POSIX shared memory, but it makes the interface much simpler for the user.
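As a rough illustration of how this maps onto the question, here is a minimal sketch in which every rank allocates its own slice of x inside one shared window. It assumes all ranks run on the same node and that N is divisible by the number of ranks; N is a placeholder, not anything from the question's code.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const MPI_Aint N = 1000;            /* assumed total length of x */
    MPI_Aint local_n = N / size;        /* assumes N is divisible by size */

    double *my_slice = NULL;
    MPI_Win win;
    /* Collective call: every rank contributes local_n doubles to one window. */
    MPI_Win_allocate_shared(local_n * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, MPI_COMM_WORLD, &my_slice, &win);

    /* ... write my_slice, read the other ranks' slices via MPI_Win_shared_query
       (see the next section) ... */

    MPI_Win_free(&win);                 /* collective, like the allocation */
    MPI_Finalize();
    return 0;
}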
Querying the node allocations
In order to know how to perform load-store against another process' memory, you need to query the address of that memory in the local address space. Sharing the address in the other process' address space is incorrect (it might work in some cases, but one cannot assume it will work).
MPI_WIN_SHARED_QUERY(win, rank, size, disp_unit, baseptr)
IN win shared memory window object (handle)
IN rank rank in the group of window win (non-negative integer) or MPI_PROC_NULL
OUT size size of the window segment (non-negative integer)
OUT disp_unit local unit size for displacements, in bytes (positive integer)
OUT baseptr address for load/store access to window segment (choice)
int MPI_Win_shared_query(MPI_Win win, int rank, MPI_Aint *size, int *disp_unit, void *baseptr)
(again, see the MPI 3.1 standard for the Fortran declaration)
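Continuing the sketch above, a small hypothetical helper shows the intended usage: ask the window for the local address of another rank's segment and use that pointer for load/store.

#include <mpi.h>

/* Hypothetical helper: return a load/store pointer to the segment that
 * `owner` allocated in a shared-memory window. The returned address is only
 * valid in the calling process; never exchange raw pointers between ranks. */
static double *segment_of(MPI_Win win, int owner, MPI_Aint *seg_size)
{
    int disp_unit = 0;
    double *ptr = NULL;
    MPI_Win_shared_query(win, owner, seg_size, &disp_unit, &ptr);
    return ptr;
}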
Synchronizing shared memory
MPI_WIN_SYNC(win)
IN win window object (handle)
int MPI_Win_sync(MPI_Win win)
This function serves as a memory barrier for load-store accesses to the data associated with the shared memory window.
You can also use ISO language features (i.e. those provided by C11 and C++11 atomics) or compiler extensions (e.g. GCC intrinsics such as __sync_synchronize) to attain a consistent view of data.
Synchronization
If you already understand interprocess shared-memory semantics, the MPI-3 implementation will be easy to understand. If not, just remember that you need to synchronize memory and control flow correctly. There is MPI_Win_sync for the former, while existing MPI synchronization functions like MPI_Barrier and MPI_Send+MPI_Recv will work for the latter. Or you can use MPI-3 atomics to build counters and locks.
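For the iteration described in the question (every rank writes its slice of x_, then everyone reads the whole array), a plausible per-iteration pattern is sketched below. This is an assumption about how the pieces fit together, not code from the question; it presumes the window was opened with MPI_Win_lock_all, as in the example program that follows.

#include <mpi.h>

/* Hypothetical per-iteration synchronization for the x / x_ update.
 * Assumes the caller holds MPI_Win_lock_all on win_x_ and that each rank
 * writes only its own slice of the shared array. */
static void finish_iteration(MPI_Win win_x_, MPI_Comm comm)
{
    /* each rank has just written its slice of x_ */
    MPI_Win_sync(win_x_);     /* make my writes visible to the other ranks    */
    MPI_Barrier(comm);        /* nobody starts reading x_ before all writes   */
    MPI_Win_sync(win_x_);     /* make the other ranks' writes visible to me   */
    /* now every rank may read the full x_ as its x for the next iteration */
}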
Example program
The following code is from https://github.com/jeffhammond/HPCInfo/tree/master/mpi/rma/shared-memory-windows, which contains example programs of shared-memory usage that have been used by the MPI Forum to debate the semantics of these features.
This program demonstrates unidirectional pair-wise synchronization through shared-memory. If you merely want to create a WORM (write-once, read-many) slab, that should be much simpler.
#include <stdio.h>
#include <mpi.h>

/* This function synchronizes process rank i with process rank j
 * in such a way that this function returns on process rank j
 * only after it has been called on process rank i.
 *
 * No additional semantic guarantees are provided.
 *
 * The process ranks are with respect to the input communicator (comm). */
int p2p_xsync(int i, int j, MPI_Comm comm)
{
    /* Avoid deadlock. */
    if (i==j) {
        return MPI_SUCCESS;
    }

    int rank;
    MPI_Comm_rank(comm, &rank);

    int tag = 666; /* The number of the beast. */

    if (rank==i) {
        MPI_Send(NULL, 0, MPI_INT, j, tag, comm);
    } else if (rank==j) {
        MPI_Recv(NULL, 0, MPI_INT, i, tag, comm, MPI_STATUS_IGNORE);
    }

    return MPI_SUCCESS;
}

/* If val is the same at all MPI processes in comm,
 * this function returns 1, else 0. */
int coll_check_equal(int val, MPI_Comm comm)
{
    int minmax[2] = {-val,val};
    MPI_Allreduce(MPI_IN_PLACE, minmax, 2, MPI_INT, MPI_MAX, comm);
    return ((-minmax[0])==minmax[1] ? 1 : 0);
}

int main(int argc, char * argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int * shptr = NULL;
    MPI_Win shwin;
    MPI_Win_allocate_shared(rank==0 ? sizeof(int) : 0, sizeof(int),
                            MPI_INFO_NULL, MPI_COMM_WORLD,
                            &shptr, &shwin);

    /* l=local r=remote */
    MPI_Aint rsize = 0;
    int rdisp;
    int * rptr = NULL;
    int lint = -999;
    MPI_Win_shared_query(shwin, 0, &rsize, &rdisp, &rptr);
    if (rptr==NULL || rsize!=sizeof(int)) {
        printf("rptr=%p rsize=%zu \n", rptr, (size_t)rsize);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /*******************************************************/
    MPI_Win_lock_all(0 /* assertion */, shwin);

    if (rank==0) {
        *shptr = 42; /* Answer to the Ultimate Question of Life, The Universe, and Everything. */
        MPI_Win_sync(shwin);
    }
    for (int j=1; j<size; j++) {
        p2p_xsync(0, j, MPI_COMM_WORLD);
    }
    if (rank!=0) {
        MPI_Win_sync(shwin);
    }
    lint = *rptr;

    MPI_Win_unlock_all(shwin);
    /*******************************************************/

    if (1==coll_check_equal(lint,MPI_COMM_WORLD)) {
        if (rank==0) {
            printf("SUCCESS!\n");
        }
    } else {
        printf("rank %d: lint = %d \n", rank, lint);
    }

    MPI_Win_free(&shwin);
    MPI_Finalize();
    return 0;
}

Related

Passing a large 2D array in MPI C++

I have a task to speed up a program using MPI.
Let's assume I have a large 2D array (1000x1000 or bigger) on the input. I have a working sequential program that divides the 2D array into chunks (for example 10x10) and calculates a result for each chunk, which is a double (so we have a function whose argument is a 10x10 2D array and whose result is a double).
My first idea to speed up:
Create a 1D array of size N*N (for example 10x10 = 100) and send the array to another process
double* buffer = new double[dataPortionSize];
//copy some data to buffer
MPI_Send(buffer, dataPortionSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
Receive it in another process, calculate the result, and send the result back
double* buf = new double[dataPortionSize];
MPI_Recv(buf, dataPortionSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
double result = function->calc(buf);
MPI_Send(&result, 1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
This program was much slower than the sequential version. It looks like MPI needs a lot of time to pass an array to another process.
My second idea:
Pass the whole 2d input array to all processes
// data is protected field in base class, it is injected during runtime
MPI_Send(&(data[0][0]), dataSize * dataSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
And receive data like this
double **arrayAlloc( int size ) {
    double **result = new double*[ size ];
    for ( int i = 0; i < size; i++ )
        result[ i ] = new double[ size ];
    return result;
}

double **data = arrayAlloc(dataSize);
MPI_Recv(&data[0][0], dataSize * dataSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
Unfortunately, I got a bunch of errors during execution. The crashes are pretty random; twice the program even ended successfully.
My third idea:
Pass the memory address to all processes, but I found this:
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
Does anyone have an idea how to speed this up? I understand that the key to speed is passing the array(s) between processes efficiently, but I don't have an idea how to do this.
You have multiple issues here. I'll try to go through them in some arbitrary order.
As someone else explained, your second attempt fails because MPI expects you to work with a single contiguous array, not an array of pointers. So you want to allocate something like matrix = new double[rows * cols] and then access individual rows as &matrix[row * cols] or an individual value as matrix[row * cols + col].
This would be a data structure that you can send, receive, scatter, and gather with MPI. It would also be faster in general.
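A minimal sketch of that layout, under the assumption that rows and cols match the problem size from the question; the MPI_Bcast at the end is only there to show that the whole matrix is one contiguous buffer.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    const int rows = 1000, cols = 1000;        // assumed problem size
    double *matrix = new double[rows * cols];  // one contiguous allocation

    matrix[3 * cols + 5] = 1.0;                // element (3,5)
    double *row7 = &matrix[7 * cols];          // start of row 7
    row7[0] = 2.0;

    // The whole matrix is a single MPI buffer:
    MPI_Bcast(matrix, rows * cols, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    delete[] matrix;
    MPI_Finalize();
    return 0;
}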
You are correct to assume that MPI takes time to transfer data. Even in the best case it costs as much as a memcpy, and usually significantly more. If your program does too little work between transfers, it will not be faster.
Your first attempt may have failed because the first process doesn't do anything useful while waiting for the result. You didn't include the receive operation in your code sample. However, if you wrote something like this:
for(int block = 0; block < nblocks; ++block) {
    generate_data(buf);
    MPI_Send(buf, ...);
    MPI_Recv(buf, ...);
}
Then you cannot expect a speedup because the process is not doing anything useful while waiting for the result. You can avoid this with double buffering. Let the first process generate the next data block before waiting in the receive operation for the result. Something like this:
generate_data(0, input); /* 0-th block */
MPI_Send(input, ...);
for(int block = 1; block < nblocks; ++block) {
    generate_data(block, input); /* 1st up to nth block */
    MPI_Recv(output, ...); /* 0-th up to n-1-th block */
    MPI_Send(input, ...);
}
MPI_Recv(output, ...); /* n-th block */
Now calculations in both processes can overlap.
You shouldn't use MPI_Send and MPI_Recv to begin with! MPI is designed for collective operations like MPI_Scatter and MPI_Gather. What you should do is generate N blocks for N processes, MPI_Scatter them across all processes, let each process compute its result, and then MPI_Gather the results back at the root process.
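A rough sketch of that scatter/compute/gather pattern, assuming one block per process and a placeholder computation (the question's function->calc is not available here):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block_size = 100;                  // e.g. one 10x10 chunk
    double *all_blocks = NULL;
    if (rank == 0) {
        all_blocks = new double[size * block_size]();
        /* ... fill all_blocks with the input data ... */
    }

    // Every rank (including the root) gets one block.
    double *my_block = new double[block_size];
    MPI_Scatter(all_blocks, block_size, MPI_DOUBLE,
                my_block, block_size, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double my_result = 0.0;                      // stand-in for function->calc(my_block)
    for (int i = 0; i < block_size; i++)
        my_result += my_block[i];

    // Collect one double per rank at the root.
    double *results = (rank == 0) ? new double[size] : NULL;
    MPI_Gather(&my_result, 1, MPI_DOUBLE, results, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    delete[] my_block;
    delete[] all_blocks;
    delete[] results;
    MPI_Finalize();
    return 0;
}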
Even better, let every process work independently, if possible. Of course this depends on your data but if you can generate and process data blocks independently from one another, don't do any communication. Just let them all work alone. Something like this:
int rank, worldsize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &worldsize);

for(int block = rank; block < nblocks; block += worldsize) {
    process_data(block);
}

Time-efficient design model for sending to and receiving from all MPI processes: MPI all-to-all communication

I am trying to send a message from one process to all other MPI processes and also to receive messages from all of them. It is basically an all-to-all communication where every process sends a message to every other process (except itself) and receives a message from every other process.
The following example code snippet shows what I am trying to achieve. Now, the problem with MPI_Send is its behavior: for small message sizes it acts as non-blocking, but for larger messages (on my machine, BUFFER_SIZE 16400) it blocks. I am aware that this is how MPI_Send behaves.
As a workaround, I replaced the code below with a blocking send+receive, i.e. MPI_Sendrecv, for example MPI_Sendrecv(intSendPack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, intReceivePack, BUFFER_SIZE, MPI_INT, processId, MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE). I make this call for every rank of MPI_COMM_WORLD inside a loop, and this approach gives me what I am trying to achieve (all-to-all communication). However, it takes a lot of time, which I want to cut down with a more time-efficient approach.
I have tried MPI scatter and gather to perform the all-to-all communication, but one issue is that the buffer size (16400) may differ between iterations in the actual implementation of this MPI_all_to_all function call. Here I am using MPI_TAG to differentiate calls from different iterations, which I cannot do in the scatter and gather functions.
#define BUFFER_SIZE 16400

void MPI_all_to_all(int MPI_TAG)
{
    int size;
    int rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* intReceivePack = new int[BUFFER_SIZE]();

    for (int prId = 0; prId < size; prId++) {
        if (prId != rank) {
            MPI_Send(intSendPack, BUFFER_SIZE, MPI_INT, prId, MPI_TAG,
                     MPI_COMM_WORLD);
        }
    }
    for (int sId = 0; sId < size; sId++) {
        if (sId != rank) {
            MPI_Recv(intReceivePack, BUFFER_SIZE, MPI_INT, sId, MPI_TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
}
I want to know if there is a way I can perform all to all communication using any efficient communication model. I am not sticking to MPI_Send, if there is some other way which provides me what I am trying to achieve, I am happy with that. Any help or suggestion is much appreciated.
This is a benchmark that lets you compare the performance of collective vs. point-to-point communication in an all-to-all exchange:
#include <iostream>
#include <algorithm>
#include <mpi.h>

#define BUFFER_SIZE 16384

void point2point(int*, int*, int, int);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank_id = 0, com_sz = 0;
    double t0 = 0.0, tf = 0.0;
    MPI_Comm_size(MPI_COMM_WORLD, &com_sz);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank_id);

    int* intSendPack = new int[BUFFER_SIZE]();
    int* result = new int[BUFFER_SIZE*com_sz]();
    std::fill(intSendPack, intSendPack + BUFFER_SIZE, rank_id);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);

    // Send-Receive
    t0 = MPI_Wtime();
    point2point(intSendPack, result, rank_id, com_sz);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Send-receive time: " << tf - t0 << std::endl;

    // Collective
    std::fill(result, result + BUFFER_SIZE*com_sz, 0);
    std::fill(result + BUFFER_SIZE*rank_id, result + BUFFER_SIZE*(rank_id+1), rank_id);
    t0 = MPI_Wtime();
    MPI_Allgather(intSendPack, BUFFER_SIZE, MPI_INT, result, BUFFER_SIZE, MPI_INT, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    tf = MPI_Wtime();
    if (!rank_id)
        std::cout << "Allgather time: " << tf - t0 << std::endl;

    MPI_Finalize();
    delete[] intSendPack;
    delete[] result;
    return 0;
}

// Send/receive communication
void point2point(int* send_buf, int* result, int rank_id, int com_sz)
{
    MPI_Status status;
    // Exchange and store the data
    for (int i=0; i<com_sz; i++){
        if (i != rank_id){
            MPI_Sendrecv(send_buf, BUFFER_SIZE, MPI_INT, i, 0,
                         result + i*BUFFER_SIZE, BUFFER_SIZE, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
        }
    }
}
Here every rank contributes its own array intSendPack to the array result on all other ranks, and result should end up the same on all of them. result is flat: each rank's contribution takes BUFFER_SIZE entries starting at rank_id*BUFFER_SIZE. After the point-to-point communication, the array is reset to its initial contents before the collective run.
Time is measured by setting up an MPI_Barrier which will give you the maximum time out of all ranks.
I ran the benchmark on 1 node of NERSC Cori KNL using SLURM. I ran each case a few times just to make sure the values were consistent and I was not looking at an outlier, but you should run it maybe 10 or so times to collect proper statistics.
Here are some thoughts:
For a small number of processes (5) and a large buffer size (16384), collective communication is about twice as fast as point-to-point, but it becomes about 4-5 times faster when moving to a larger number of ranks (64).
In this benchmark there is not much difference between the performance with the recommended SLURM settings on that specific machine and the default settings, but in real, larger programs with more communication there is a very significant one (a job that runs in under a minute with the recommended settings can take 20-30 minutes or more with the defaults). The point is: check your settings, it may make a difference.
What you were seeing with Send/Receive for larger messages was actually a deadlock. I saw it too for the message size shown in this benchmark. In case you missed them, there are two SO posts worth reading on it: a buffering explanation and a word on deadlocking.
In summary, adjust this benchmark to represent your code more closely and run it on your system, but collective communication in all-to-all or one-to-all situations should be faster because of dedicated optimizations such as superior algorithms used for communication arrangement. A 2-5x speedup is considerable, since communication often contributes the most to the overall run time.
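As a side note on the varying buffer size mentioned in the question (this is not part of the benchmark above, just a sketch under that assumption): if each rank's contribution can change size between iterations, the counts themselves can be exchanged first and MPI_Allgatherv used instead of a per-iteration tag.

#include <mpi.h>
#include <vector>

// Hypothetical all-to-all with per-rank, per-iteration sizes.
void all_to_all_varying(const std::vector<int>& my_data, std::vector<int>& result,
                        MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    // 1. Everybody learns everybody's count for this iteration.
    int my_count = static_cast<int>(my_data.size());
    std::vector<int> counts(size);
    MPI_Allgather(&my_count, 1, MPI_INT, counts.data(), 1, MPI_INT, comm);

    // 2. Compute displacements and the total size of the gathered array.
    std::vector<int> displs(size, 0);
    for (int i = 1; i < size; ++i)
        displs[i] = displs[i-1] + counts[i-1];
    result.resize(displs[size-1] + counts[size-1]);

    // 3. Gather the variable-sized contributions from all ranks.
    MPI_Allgatherv(my_data.data(), my_count, MPI_INT,
                   result.data(), counts.data(), displs.data(), MPI_INT, comm);
}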

MPI - How to create partial arrays for workers when array initialization value must be constant?

I don't have much experience with C++ or MPI currently, so I assume this will be an easy question to answer.
I want to be able to change the number of processes that can work on my array sort for experimentation purposes, but when I try to declare a partial array for my worker to work on, I receive an error stating that the array size variable, PART, needs to be constant.
Is this due to how I calculated or parsed it, or is it an MPI mechanism?
const int arraySize = 10000;

int main(int argc, char ** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    int size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int PART = floor(arraySize / size);
    auto start = std::chrono::high_resolution_clock::now(); // start timer

    //================================ WORKER PROCESSES ===============================
    if (rank != 0)
    {
        int tmpArray[PART]; // HERE IS MY PROBLEM
        MPI_Recv(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); // receive data into locally initialized array
        qsort(&tmpArray[0], PART, sizeof(int), compare); // quick sort
        MPI_Send(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD); // send sorted array back to rank 0
    }
If the size of an array is determined at runtime, as in your case, this would give a variable-length array, which is supported in C but not in standard C++.
So in C++, the size of an array needs to be a (compile-time) constant.
To overcome this, you'll have to use dynamic memory allocation. This can be achieved either through the "classic C" functions malloc and free (which are rarely used in C++), through their C++ counterparts new and delete (or new[] and delete[]), or, preferably, through container objects such as std::vector<int> that encapsulate these memory-allocation issues for you. For example:
auto tmpArray = std::make_unique<int[]>(PART);
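A minimal sketch of the std::vector variant for the worker branch, assuming the same PART as in the question (plain integer division instead of floor) and std::sort in place of the qsort/compare pair, which is not shown here:

#include <mpi.h>
#include <algorithm>
#include <vector>

const int arraySize = 10000;

int main(int argc, char ** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int PART = arraySize / size;          // a runtime value is fine here

    if (rank != 0)
    {
        std::vector<int> tmpArray(PART);        // heap-allocated, size known at runtime
        MPI_Recv(tmpArray.data(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::sort(tmpArray.begin(), tmpArray.end());
        MPI_Send(tmpArray.data(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    /* ... rank 0 distributes the chunks and collects the sorted pieces, as in the question ... */

    MPI_Finalize();
    return 0;
}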

Simple MPI_Gather test with memcpy error

I am learning MPI and trying to create examples of some of the functions. I've gotten several to work, but I am having issues with MPI_Gather. I had a much more complex fitting test, but I trimmed it down to the simplest code. I am still, however, getting the following error:
root#master:/home/sgeadmin# mpirun ./expfitTest5
Assertion failed in file src/mpid/ch3/src/ch3u_request.c at line 584: FALSE
memcpy argument memory ranges overlap, dst_=0x1187e30 src_=0x1187e40 len_=400
internal ABORT - process 0
I am running one master instance and two node instances through AWS EC2. I have all the appropriate libraries installed, as I've gotten other MPI examples to work. My program is:
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <mpi.h>
using namespace std;

int main()
{
    int world_size, world_rank;
    int nFits = 100;
    double arrCount[100];
    double *rBuf = NULL;

    MPI_Init(NULL,NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    assert(world_size!=1);

    int nElements = nFits/(world_size-1);

    if(world_rank>0){
        for(int k = 0; k < nElements; k++)
        {
            arrCount[k] = k;
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if(world_rank==0)
    {
        rBuf = (double*) malloc( nFits*sizeof(double));
    }

    MPI_Gather(arrCount, nElements, MPI_DOUBLE, rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if(world_rank==0){
        for(int i = 0; i < nFits; i++)
        {
            cout<<rBuf[i]<<"\n";
        }
    }

    MPI_Finalize();
    exit(0);
}
Is there something I am not understanding in malloc or MPI_Gather? I've compared my code to other samples, and can't find any differences.
The root process in a gather operation does participate in the operation, i.e. it sends data to its own receive buffer. That also means you must allocate memory for its part of the receive buffer.
Now, you could use MPI_Gatherv and specify a recvcounts[0]/sendcount at the root of 0 to follow your example closely. But usually you would prefer to write an MPI application in such a way that the root participates equally in the operation, i.e. int nElements = nFits/world_size.
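A possible version of the second suggestion, in which every rank, the root included, fills and contributes nFits/world_size elements. It assumes world_size divides nFits evenly and is only meant to illustrate the advice above:

#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    const int nFits = 100;
    const int nElements = nFits / world_size;   // assumes world_size divides nFits
    double *arrCount = new double[nElements];
    for (int k = 0; k < nElements; k++)
        arrCount[k] = k;                        // every rank, root included

    double *rBuf = NULL;
    if (world_rank == 0)
        rBuf = new double[nElements * world_size];

    MPI_Gather(arrCount, nElements, MPI_DOUBLE,
               rBuf, nElements, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (world_rank == 0)
        printf("rBuf[0]=%g rBuf[%d]=%g\n", rBuf[0],
               nElements * world_size - 1, rBuf[nElements * world_size - 1]);

    delete[] arrCount;
    delete[] rBuf;
    MPI_Finalize();
    return 0;
}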

Unknown MPI_Datatype in MPI_Get_count

I want to send several instances of struct S to some processes. The layout of each struct could be different; that is, s.v might have a different size in each instance. When receiving data, I do not know the exact MPI_Datatype to pass to MPI_Get_count because that information is available only in the sender process. Also consider that struct S has a lot of non-primitive members, so I cannot simply assume MPI_INT as the datatype when receiving. How can I acquire the MPI_Datatype to be used for MPI_Get_count?
#include <mpi.h>
#include <vector>

struct S
{
    std::vector<int> v;
    MPI_Datatype mpi_dtype;

    void make_layout()
    {
        const int nblock = 1;
        int block_count[nblock] = {static_cast<int>(v.size())};
        MPI_Aint offset[nblock];
        MPI_Get_address(&v[0], &offset[0]);
        MPI_Datatype block_type[nblock] = {MPI_INT};
        MPI_Type_create_struct(nblock, block_count, offset, block_type, &mpi_dtype);
        MPI_Type_commit(&mpi_dtype);
    }
};
int main()
{
    int rank, size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
    {
        S s0;
        s0.v.resize(7);
        s0.make_layout();
        MPI_Send(MPI_BOTTOM, 1, s0.mpi_dtype, 1, 0, MPI_COMM_WORLD);
    }
    else
    {
        S s1; // note that right now s1.mpi_dtype != s0.mpi_dtype
        MPI_Status status;
        int number_amount;
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        // MPI_Get_count(&status, ???, &number_amount);
        // MPI_Recv(...);
    }

    MPI_Finalize();
    return 0;
}
MPI messages are generally typeless, that is, they do not carry the datatypes in them. There is no way on the receiver side to obtain detailed information about the message structure. A viable option is to send the structure as two messages. The first one will contain some kind of description, e.g., the length of each vector. Once received, that information is used by the receiver to prepare the data object and to construct the appropriate datatype. Then, the second message will transmit the actual data. Boost.MPI completely automates this process - check its source code to learn how.
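A minimal sketch of that two-message scheme, with hypothetical helpers send_S/recv_S that handle only the vector member (the rest of the structure would be described and sent the same way):

#include <mpi.h>
#include <vector>

// Hypothetical helper: the first message carries the description (the length),
// the second carries the actual data.
void send_S(const std::vector<int>& v, int dest, MPI_Comm comm)
{
    int n = static_cast<int>(v.size());
    MPI_Send(&n, 1, MPI_INT, dest, 0, comm);          // 1. description
    MPI_Send(v.data(), n, MPI_INT, dest, 1, comm);    // 2. payload
}

// Hypothetical helper: the receiver sizes the object first, then receives into it.
std::vector<int> recv_S(int src, MPI_Comm comm)
{
    int n = 0;
    MPI_Recv(&n, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
    std::vector<int> v(n);
    MPI_Recv(v.data(), n, MPI_INT, src, 1, comm, MPI_STATUS_IGNORE);
    return v;
}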
Another option is to pack the description with the data in a single message using MPI_Pack. On the receiver side, the message size could be probed with type MPI_PACKED and the data should be unpacked using MPI_Unpack. I don't think the added complexity of this case will offset the small performance loss due to sending two messages instead of one.
In your case that is not necessary as your structure has a single member, which is a vector of integers. Therefore, the correct type to give to MPI_Get_count is MPI_INT.