MPI - sending parts of image to different processes - c++

I'm writing a program in which process 0 sends parts of image to other processes which transform (long operation) this part and send back to the rank 0. I have a problem with one thing. To reproduce my issue I wrote a simple example. An image with size 512x512px is split on 4 parts (vertical stripes) by process 0. Next other processes save this part on disk. The problem is that each process saves the same part. I discovered that the image is split on parts correctly but problem is probably with sending data. What's wrong in my code?
Run:
mpirun -np 5 ./example
Main:
int main(int argc, char **argv) {
int size, rank;
MPI_Request send_request, rec_request;
MPI_Status status;
ostringstream s;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
Mat mat = imread("/home/user/original.jpg", CV_LOAD_IMAGE_COLOR);
if (!mat.data) exit(-1);
int idx = 1;
for (int c = 0; c < 512; c += 128) {
Mat slice = mat(Rect(c, 0, 128, 512)).clone();
MPI_Isend(slice.data, 128 * 512 * 3, MPI_BYTE, idx, 0, MPI_COMM_WORLD, &send_request);
idx++;
}
}
if (rank != 0) {
Mat test = Mat(512, 128, CV_8UC3);
MPI_Irecv(test.data, 128 * 512 * 3, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &rec_request);
MPI_Wait(&rec_request, &status);
s << "/home/user/p" << rank << ".jpg";
imwrite(s.str(), test);
}
MPI_Finalize();
return 0;
}

If you insist on using non-blocking operations, then the proper way to issue multiple of them at the same time is:
MPI_Request *send_reqs = new MPI_Request[4];
int idx = 1;
for (int c = 0; c < 512; c += 128) {
Mat slice = mat(Rect(c, 0, 128, 512)).clone();
MPI_Isend(slice.data, 128 * 512 * 3, MPI_BYTE, idx, 0, MPI_COMM_WORLD, &send_reqs[idx-1]);
idx++;
}
MPI_Waitall(4, send_reqs, MPI_STATUSES_IGNORE);
delete [] send_reqs;
Another (and IMHO better) option would be to utilise MPI_Scatterv to scatter the original data buffer. Thus you could even save cloning parts of the image matrix.
if (rank == 0) {
Mat mat = imread("/home/user/original.jpg", CV_LOAD_IMAGE_COLOR);
if (!mat.data) exit(-1);
int *send_counts = new int[size];
int *displacements = new int[size];
// The following calculations assume row-major storage
for (int i = 0; i < size; i++) {
send_counts[i] = displacements[i] = 0;
}
int idx = 1;
for (int c = 0; c < 512; c += 128) {
displacements[idx] = displacements[idx-1] + send_counts[idx-1];
send_counts[idx] = 128 * 512 * 3;
idx++;
}
MPI_Scatterv(mat.data, send_counts, displacements, MPI_BYTE,
NULL, 0, MPI_BYTE, 0, MPI_COMM_WORLD);
delete [] send_counts;
delete [] displacements;
}
if (1 <= rank && rank <= 4) {
Mat test = Mat(512, 128, CV_8UC3);
MPI_Scatterv(NULL, NULL, NULL, MPI_BYTE,
test.data, 128 * 512 * 3, MPI_BYTE, 0, MPI_COMM_WORLD);
s << "/home/user/p" << rank << ".jpg";
imwrite(s.str(), test);
}
Note how the arguments to MPI_Scatterv are prepared. Since you are scattering to 4 MPI processes only, setting certain elements of send_counts[] to zero allows the program to function correctly with more than 5 MPI processes. Also, the root rank in your original code doesn't send to itself, therefore send_counts[0] must be zero.

The problem is that you are not waiting till the send operation completes before the matrix Mat is destructed. Use MPI_Send instead of MPI_Isend.
If you really want to use non blocking communication, you have to keep track of all MPI_Request objects and of all Mat images until the send is complete.

Related

MPI_Scatterv (c) will give segmentation fault

I've built a fairly simple c code that reads a pgm image, splits it in different sections and sends it to various cores to elaborate it.
In order to account for some elaboration margins (each core has to access a larger area of the image than the it needs to write on), I can't simply split the image but I first have to create an array where I add the before mentioned margins.
As a quick example: an image is 1600x1200 (width x height), I have 2 cores, I want to access an area of 3x3 centered on the pixel and I'm splitting this image horizontal line by horizontal line then the subdivision would be -> the first core gets the pixels from 0 to 6011600, the second core gets the pixels from 5091600 to 1200*1600.
Now, I believe there is nothing wrong in how I implemented this in my program, still I get this error:
[ct1pt-tnode003:22389:0:22389] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffe7f60ead8)
==== backtrace (tid: 22389) ====
0 0x000000000004ee05 ucs_debug_print_backtrace() ???:0
1 0x0000000000402624 main() ???:0
2 0x0000000000022505 __libc_start_main() ???:0
3 0x0000000000400d99 _start() ???:0
This is my code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <math.h>
#include <time.h>
#include "testlibscatter.h"
#include <mpi.h>
#define MSGLEN 2048
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
int m = atoi(argv[1]), n = atoi(argv[2]), kern_type = atoi(argv[3]);
double kernel[m*n];
int i_rank, ranks;
int param, symm;
MPI_Comm_rank( MPI_COMM_WORLD, &i_rank);
MPI_Comm_size( MPI_COMM_WORLD, &ranks);
int xsize, ysize, maxval;
xsize = 0;
ysize = 0;
maxval = 0;
void * ptr;
switch (kern_type){
case 1:
meankernel(m, n, kernel);
break;
case 2:
weightkernel(m, n, param, kernel);
break;
case 3:
gaussiankernel(m, n, param, symm, kernel);
break;
}
if (i_rank == 0){
read_pgm_image(&ptr, &maxval, &xsize, &ysize, "check_me2.pgm");
}
MPI_Bcast(&xsize, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&ysize, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&maxval, 1, MPI_INT, 0, MPI_COMM_WORLD);
int flo, start, end, i;
flo = floor(ysize/ranks);
int first, last;
first = start - (m - 1)/2;
last = end + (m - 1)/2;
if (start == 0){
first = 0;
}
if (end == ysize){
last = ysize;
}
int sendcounts[ranks];
int displs[ranks];
int first2[ranks];
int last2[ranks];
int c_start2[ranks];
int c_end2[ranks];
int num;
num = (ranks - 1) * (m-1);
printf("num is %d\n", num);
unsigned short int bigpic[xsize*(ysize + num)];
if (i_rank == 0){
for(i = 0; i < ranks; i++){
c_start2[i] = i * flo;
c_end2[i] = (i + 1) * flo;
if ( i == ranks - 1){
c_end2[i] = ysize;
}
first2[i] = c_start2[i] - (m - 1)/2;
last2[i] = c_end2[i] + (m - 1)/2;
if (c_start2[i] == 0){
first2[i] = 0;
}
if (c_end2[i] == ysize){
last2[i] = ysize;
}
sendcounts[i] = (last2[i] - first2[i]) * xsize;
}
int i, j, k, index, index_disp = 0;
index = 0;
displs[0] = 0;
for (k = 0; k < ranks; k++){
for (i = first2[k]*xsize; i < last2[k]*xsize; i++){
bigpic[index] = ((unsigned short int *)ptr)[i];
index++;
}
printf("%d\n", displs[index_disp]);
index_disp++;
displs[index_disp] = index;
}
}
MPI_Bcast(displs, ranks, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(sendcounts, ranks, MPI_INT, 0, MPI_COMM_WORLD);
unsigned short int minipic[xsize*(last-first)];
MPI_Barrier(MPI_COMM_WORLD);
MPI_Scatterv(&bigpic[0], sendcounts, displs, MPI_UNSIGNED_SHORT, minipic, (last-first)*xsize, MPI_UNSIGNED_SHORT, 0, MPI_COMM_WORLD);
MPI_Finalize();
}
the function kernel simply returns an array of m*n doubles to edit the image, while the read_pgm_image returns a void pointer with the values of the image read.
I've tried printing the values of bigpic and they show no problem.
In the code shown here, start and end are used uninitialised in the computations of first and last:
int flo, start, end, i;
~~~~~~~~~~
flo = floor(ysize/ranks);
int first, last;
first = start - (m - 1)/2; // <---- start has a random value here
last = end + (m - 1)/2; // <---- end has a random value here
If the values are very large, the size of minipic may become larger than the stack size:
unsigned short int minipic[xsize*(last-first)];
^^^^^^^^^^ random (possibly large) value
A strong indication that this is indeed the cause is the fact that the address of the fault 0x7ffe7f60ead8 is very close to the end of the positive part of the virtual address space, which is where most 64-bit OSes allocate the stack area of the main thread.
Always compile with -Wall in order to get back as many diagnostic messages from the compiler as possible.

MPI_ALLgather to send local maxima to all processes, bad termination of processes

Hi I wish to find local maxima for all processes and then pass all the local maxima to all processes to make a single array, then use the MPI ring topology to compare the local maxima and then finally output the global maxima.
I can do MPI_ALLreduce more efficiently and already did that, but I want to test efficiency of the ring topology and produce the same result as allreduce.
I was following a tutorial to use MPI_Allgather and it returns me some bad error. THe code is as follows:
int main(int argc, char **argv)
{
int rank, size;
MPI_Init (&argc, &argv); // initializes MPI
//MPI_Comm comm;
double max_store[4];
double *rbuf;
//MPI_Comm_size( comm, &rank);
MPI_Comm_rank (MPI_COMM_WORLD, &rank); // get current MPI-process ID. O, 1, ...
MPI_Comm_size (MPI_COMM_WORLD, &size); // get the total number of processes
/* define how many integrals */
const int n = 10;
double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0,5.0};
double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0,-5.0};
double result, mean;
int m;
const unsigned int N = 5;
double max = -1;
cout.precision(6);
cout.setf(ios::fixed | ios::showpoint);
srand(time(NULL) * rank); // each MPI process gets a unique seed
m = 4; // initial number of intervals
// convert command-line input to N = number of points
//N = atoi( argv[1] );
for (unsigned int i=0; i <=N; i++)
{
result = int_mcnd(f, a, b, n, m);
mean = result/(pow(10,10));
m = m*4;
if( mean > max)
{
max = mean;
}
}
//if ( rank < 4 && rank >= 0 )
//{
//max_store[rank] = max;
//}
printf("Process ID %i, local_max = %f\n",rank, max);
// All processes get the global max, stored in place of the local max
MPI_Allreduce( MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
printf("Process ID %d, global_max = %f\n",rank, max);
rbuf = (double *)malloc(rank*4*sizeof(double));
MPI_Allgather( max_store, 4, MPI_DOUBLE, rbuf, 4, MPI_DOUBLE, MPI_COMM_WORLD);
//print the array containing max from each processor
//int k;
//for( int k = 0; k < 4; k++ )
//{
//printf( "%1.5e\n", max_store[k]);
//}
double send_junk = max_store[0];
double rec_junk;
//double global_max;
MPI_Status status;
if(rank==0)
{
MPI_Send(&send_junk, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); // send data to process 1
}
if(rank==1)
{
MPI_Recv(&rec_junk, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status); // receive data from process 0
}
//check between process 0 and process 1 maxima
if(rec_junk>=max_store[1])
{
rec_junk = max_store[0];
}
else
{
rec_junk = max_store[1];
}
send_junk = rec_junk;
MPI_Send(&send_junk, 4, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD); // send data to process 2
if(rank==2)
{
MPI_Recv(&rec_junk, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status); // receive data from process 1
}
//check between process 1 and process 2 maxima
if(rec_junk>=max_store[2])
{
rec_junk = rec_junk;
}
else
{
rec_junk = max_store[2];
}
send_junk = rec_junk;
MPI_Send(&send_junk, 4, MPI_DOUBLE, 3, 0, MPI_COMM_WORLD); // send data to process 3
if(rank==3)
{
MPI_Recv(&rec_junk, 4, MPI_DOUBLE, 2, 0, MPI_COMM_WORLD, &status); // receive data from process 2
}
//check between process 2 and process 3 maxima
if(rec_junk>=max_store[3])
{
rec_junk = rec_junk;
}
else
{
rec_junk = max_store[3];
}
printf("global ring max = %f", rec_junk);
MPI_Finalize(); // programs should always perform a "graceful" shutdown
return 0;
}
I am interested to know how to send the maximas in a single array and all the processes will have access to it, so that I can compare the values in a ring topology. thanks very much.
You are not allocating the receive buffer correctly. It needs to be big enough to store 4 entries from every rank. You currently have:
rbuf = (double *)malloc(rank*4*sizeof(double));
when it should be
rbuf = (double *)malloc(size*4*sizeof(double));

finding global maxima of a function from comparing each processor's local maxima using MPI ring topology

I wish to use the MPI ring topology, passing each processor's maxima around the ring, comparing the local maxima and then output the global maxima for all processors.
I am using a 10 dimensional Monte Carlo integration function. My first idea was to make an array with each processor's local maxima, then pass that value, compare and output the highest value. But I couldn't elegantly code to make an array which will take only each processors' max value and store it corresponding to rank of the processor, this way I can also keep track which processor got the global maxima.
I didn't finish my code yet, right now I am interested to see if an array with local maxima from processor's can be created. the way I coded, it's very time consuming and if there is a lot of processors, then I have to declare them each time, and yet I couldn't produce the array I am looking for.
I am sharing the code here:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <mpi.h>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
double y;
int j;
y = 0.0;
for (j = 0; j < n-1; j = j+1)
{
y = y + exp(-pow((1-x[j]),2)-100*(pow((x[j+1] - pow(x[j],2)),2)));
}
y = y;
return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int),double a[], double b[], int n, int m)
{
double r, x[n], v;
int i, j;
r = 0.0;
v = 1.0;
// initial seed value (use system time)
//srand(time(NULL));
// step 1: calculate the common factor V
for (j = 0; j < n; j = j+1)
{
v = v*(b[j]-a[j]);
}
// step 2: integration
for (i = 1; i <= m; i=i+1)
{
// calculate random x[] points
for (j = 0; j < n; j = j+1)
{
x[j] = a[j] + (rand()) /( (RAND_MAX/(b[j]-a[j])));
}
r = r + fn(x,n);
}
r = r*v/m;
return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
int rank, size;
MPI_Init (&argc, &argv); // initializes MPI
MPI_Comm_rank (MPI_COMM_WORLD, &rank); // get current MPI-process ID. O, 1, ...
MPI_Comm_size (MPI_COMM_WORLD, &size); // get the total number of processes
/* define how many integrals */
const int n = 10;
double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0,5.0};
double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0,-5.0};
double result, mean;
int m;
const unsigned int N = 5;
double max = -1;
double max_store[4];
cout.precision(6);
cout.setf(ios::fixed | ios::showpoint);
srand(time(NULL) * rank); // each MPI process gets a unique seed
m = 4; // initial number of intervals
// convert command-line input to N = number of points
//N = atoi( argv[1] );
for (unsigned int i=0; i <=N; i++)
{
result = int_mcnd(f, a, b, n, m);
mean = result/(pow(10,10));
if( mean > max)
{
max = mean;
}
//cout << setw(10) << m << setw(10) << max << setw(10) << mean << setw(10) << rank << setw(10) << size <<endl;
m = m*4;
}
//cout << setw(30) << m << setw(30) << result << setw(30) << mean <<endl;
printf("Process %d of %d mean = %1.5e\n and local max = %1.5e\n", rank, size, mean, max );
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
for( int k = 0; k < 4; k++ )
{
printf( "%1.5e\n", max_store[k]);
}
//double max_store[4] = {4.43095e-02, 5.76586e-02, 3.15962e-02, 4.23079e-02};
double send_junk = max_store[0];
double rec_junk;
MPI_Status status;
// This next if-statment implemeents the ring topology
// the last process ID is size-1, so the ring topology is: 0->1, 1->2, ... size-1->0
// rank 0 starts the chain of events by passing to rank 1
if(rank==0) {
// only the process with rank ID = 0 will be in this block of code.
MPI_Send(&send_junk, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); // send data to process 1
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, size-1, 0, MPI_COMM_WORLD, &status); // receive data from process size-1
}
else if( rank == size-1) {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // recieve data from process rank-1 (it "left" neighbor")
MPI_Send(&send_junk, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); // send data to its "right neighbor", rank 0
}
else {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // recieve data from process rank-1 (it "left" neighbor")
MPI_Send(&send_junk, 1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD); // send data to its "right neighbor" (rank+1)
}
printf("Process %d send %1.5e\n and recieved %1.5e\n", rank, send_junk, rec_junk );
MPI_Finalize(); // programs should always perform a "graceful" shutdown
return 0;
}
compile with :
mpiCC -o gd test_code.cpp
mpirun -np 4 ./gd
I would appreciate suggestion:
if there is a more elegant way to make local maxima arrays?
How to compare the local maxima and decide the global maxima while passing the values in a ring?
Also feel free to modify the code to provide me a better example to work with. I would appreciate any suggestion. thanks.
For this sort of thing, better using either MPI_Reduce() or MPI_Allreduce() with MPI_MAX as operator. The former will compute the max over the values exposed by all processes and give the result to the "root" process only, while the later will do the same, but give the results to all processes.
// Only process of rank 0 get the global max
MPI_Reduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
// All processes get the global max
MPI_Allreduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
// All processes get the global max, stored in place of the local max
// after the call ends - this might be the most interesting one for you
MPI_Allreduce( MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
As you can see, you could just insert the 3rd example into your code to solve your problem.
BTW, unrelated remark, but this hurts my eyes:
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
What about something like this:
if ( rank < 4 && rank >= 0 ) {
max_store[rank] = max;
}

MPI_Gatherv unequal 2D array [duplicate]

(Suppose all the matrices are stored in row-major order.) An example that illustrate the problem is to distribute a 10x10 matrix over a 3x3 grid, so that the size of the sub-matrices in each node looks like
|-----+-----+-----|
| 3x3 | 3x3 | 3x4 |
|-----+-----+-----|
| 3x3 | 3x3 | 3x4 |
|-----+-----+-----|
| 4x3 | 4x3 | 4x4 |
|-----+-----+-----|
I've seen many posts on Stackoverflow (such as sending blocks of 2D array in C using MPI and MPI partition matrix into blocks). But they only deal with blocks of same size (in which case we can simply use MPI_Type_vector or MPI_Type_create_subarray and only one MPI_Scatterv call).
So, I'm wondering what is the most efficient way in MPI to scatter a matrix to a grid of processors where each processor has a block with a specified size.
P.S. I've also looked at MPI_Type_create_darray, but it seems not letting you specify block size for each processor.
You have to go through at least one extra step in MPI to do this.
The problem is that the most general of the gather/scatter routines, MPI_Scatterv and MPI_Gatherv, allow you to pass a "vector" (v) of counts/displacements, rather than just one count for Scatter and Gather, but the types are all assumed to be the same. Here, there's no way around it; the memory layouts of each block are different, and so have to be treated by a different type. If there were only one difference between the blocks – some had different numbers of columns, or some had different number of rows – then just using different counts would suffice. But with different columns and rows, counts won't do it; you really need to be able to specify different types.
So what you really want is an often-discussed but never implemented MPI_Scatterw (where w means vv; e.g., both counts and types are vectors) routine. But such a thing doesn't exist. The closest you can get is the much more general MPI_Alltoallw call, which allows completely general all-to-all sending and receiving of data; as the spec states, "The MPI_ALLTOALLW function generalizes several MPI functions by carefully selecting the input arguments. For example, by making all but one process have sendcounts(i) = 0, this achieves an MPI_SCATTERW function.".
So you can do this with MPI_Alltoallw by having all processes other than the one that originally has all the data ( we'll assume that it's rank 0 here) sent all their send counts to zero. All tasks will also have all their receive counts to zero except for the first - the amount of data they'll get from rank zero.
For process 0's send counts, we'll first have to define four different kinds of types (the 4 different sizes of subarrays), and then the send counts will all be 1, and the only part that remains is figuring out the send displacements (which, unlike scatterv, is here in units of bytes, because there's no single type one could use as a unit):
/* 4 types of blocks -
* blocksize*blocksize, blocksize+1*blocksize, blocksize*blocksize+1, blocksize+1*blocksize+1
*/
MPI_Datatype blocktypes[4];
int subsizes[2];
int starts[2] = {0,0};
for (int i=0; i<2; i++) {
subsizes[0] = blocksize+i;
for (int j=0; j<2; j++) {
subsizes[1] = blocksize+j;
MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]);
MPI_Type_commit(&blocktypes[2*i+j]);
}
}
/* now figure out the displacement and type of each processor's data */
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = 1;
senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char);
int idx = typeIdx(row, col, blocks);
sendtypes[proc] = blocktypes[idx];
}
}
MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes,
&(localdata[0][0]), recvcounts, recvdispls, recvtypes,
MPI_COMM_WORLD);
And this will work.
But the problem is that the Alltoallw function is so completely general, that it's difficult for implementations to do much in the line of optimization; so I'd be surprised if this performed as well as a scatter of equally-sized blocks.
So another approach is to do something like a two phases of communication.
The simplest such approach follows after noting that you can almost get all the data where it needs to go with a single MPI_Scatterv() call: in your example, if we operate in units of a single column vector with column=1 and rows=3 (the number of rows in most of the blocks of the domain), you can scatter almost all of the global data to the other processors. The processors each get 3 or 4 of these vectors, which distributes all of the data except the very last row of the global array, which can be handled by a simple second scatterv. That looks like this;
/* We're going to be operating mostly in units of a single column of a "normal" sized block.
* There will need to be two vectors describing these columns; one in the context of the
* global array, and one in the local results.
*/
MPI_Datatype vec, localvec;
MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);
MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);
/* The originating process needs to allocate and fill the source array,
* and then define types defining the array chunks to send, and
* fill out senddispls, sendcounts (1) and sendtypes.
*/
if (rank == 0) {
/* create the vector type which will send one column of a "normal" sized-block */
/* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */
/* will still have to do something to tidy up the last row of values */
/* we need to make the type have extent of 1 char for scattering */
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize);
}
}
recvcounts = localsizes[1];
MPI_Scatterv(globalptr, sendcounts, senddispls, vec,
&(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD);
MPI_Type_free(&localvec);
if (rank == 0)
MPI_Type_free(&vec);
/* now we need to do one more scatter, scattering just the last row of data
* just to the processors on the last row.
* Here we recompute the send counts
*/
if (rank == 0) {
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = 0;
senddispls[proc] = 0;
if ( isLastRow(row,blocks) ) {
sendcounts[proc] = blocksize;
senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize;
if ( isLastCol(col,blocks) )
sendcounts[proc] += 1;
}
}
}
recvcounts = 0;
if ( isLastRow(myrow, blocks) ) {
recvcounts = blocksize;
if ( isLastCol(mycol, blocks) )
recvcounts++;
}
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
&(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);
So far so good. But it's a shame to have most of the processors sitting around doing nothing during that final, "cleanup" scatterv.
So a nicer approach is to scatter all the rows in a first phase, and scatter that data amongst the columns in a second phase. Here we create new communicators, with each processor belonging to two new communicators - one representing other processors in the same block row, and the other in the same block column. In the first step, the origin processor distributes all the rows of the global array to the other processors in the same column communicator - which can be done in a single scatterv. Then those processors, using a single scatterv and the same columns data type as in the previous example, scatter the columns to each processor in the same block row as it. The result is two fairly simple scatterv's distributing all of the data:
/* create communicators which have processors with the same row or column in them*/
MPI_Comm colComm, rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm);
MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm);
/* first, scatter the array by rows, with the processor in column 0 corresponding to each row
* receiving the data */
if (mycol == 0) {
int sendcounts[ blocks[0] ];
int senddispls[ blocks[0] ];
senddispls[0] = 0;
for (int row=0; row<blocks[0]; row++) {
/* each processor gets blocksize rows, each of size globalsizes[1]... */
sendcounts[row] = blocksize*globalsizes[1];
if (row > 0)
senddispls[row] = senddispls[row-1] + sendcounts[row-1];
}
/* the last processor gets one more */
sendcounts[blocks[0]-1] += globalsizes[1];
/* allocate my rowdata */
rowdata = allocchar2darray( sendcounts[myrow], globalsizes[1] );
/* perform the scatter of rows */
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
&(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm);
}
/* Now, within each row of processors, we can scatter the columns.
* We can do this as we did in the previous example; create a vector
* (and localvector) type and scatter accordingly */
int locnrows = blocksize;
if ( isLastRow(myrow, blocks) )
locnrows++;
MPI_Datatype vec, localvec;
MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);
MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);
int sendcounts[ blocks[1] ];
int senddispls[ blocks[1] ];
if (mycol == 0) {
for (int col=0; col<blocks[1]; col++) {
sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
senddispls[col] = col*blocksize;
}
}
char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL;
MPI_Scatterv(rowptr, sendcounts, senddispls, vec,
&(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);
which is simpler and should be a relatively good balance between performance and robustness.
Running all these three methods works:
bash-3.2$ mpirun -np 6 ./allmethods alltoall
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - alltoall
Rank 0:
abc
hij
opq
Rank 1:
defg
klmn
rstu
Rank 2:
vwx
cde
jkl
Rank 3:
yzab
fghi
mnop
Rank 4:
qrs
xyz
efg
lmn
Rank 5:
tuvw
abcd
hijk
opqr
bash-3.2$ mpirun -np 6 ./allmethods twophasevecs
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - two phase, vectors, then cleanup
Rank 0:
abc
hij
opq
Rank 1:
defg
klmn
rstu
Rank 2:
vwx
cde
jkl
Rank 3:
yzab
fghi
mnop
Rank 4:
qrs
xyz
efg
lmn
Rank 5:
tuvw
abcd
hijk
opqr
bash-3.2$ mpirun -np 6 ./allmethods twophaserowcol
Global array:
abcdefg
hijklmn
opqrstu
vwxyzab
cdefghi
jklmnop
qrstuvw
xyzabcd
efghijk
lmnopqr
Method - two phase - row, cols
Rank 0:
abc
hij
opq
Rank 1:
defg
klmn
rstu
Rank 2:
vwx
cde
jkl
Rank 3:
yzab
fghi
mnop
Rank 4:
qrs
xyz
efg
lmn
Rank 5:
tuvw
abcd
hijk
opqr
The code implementing these methods follows; you can set block sizes to more typical sizes for your problem and run on a realistic number of processors to get some sense of which will be best for your application.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"
/* auxiliary routines, found at end of program */
char **allocchar2darray(int n, int m);
void freechar2darray(char **a);
void printarray(char **data, int n, int m);
void rowcol(int rank, const int blocks[2], int *row, int *col);
int isLastRow(int row, const int blocks[2]);
int isLastCol(int col, const int blocks[2]);
int typeIdx(int row, int col, const int blocks[2]);
/* first method - alltoallw */
void alltoall(const int myrow, const int mycol, const int rank, const int size,
const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
const char *const globalptr, char **localdata) {
/*
* get send and recieve counts ready for alltoallw call.
* everyone will be recieving just one block from proc 0;
* most procs will be sending nothing to anyone.
*/
int sendcounts[ size ];
int senddispls[ size ];
MPI_Datatype sendtypes[size];
int recvcounts[ size ];
int recvdispls[ size ];
MPI_Datatype recvtypes[size];
for (int proc=0; proc<size; proc++) {
recvcounts[proc] = 0;
recvdispls[proc] = 0;
recvtypes[proc] = MPI_CHAR;
sendcounts[proc] = 0;
senddispls[proc] = 0;
sendtypes[proc] = MPI_CHAR;
}
recvcounts[0] = localsizes[0]*localsizes[1];
recvdispls[0] = 0;
/* The originating process needs to allocate and fill the source array,
* and then define types defining the array chunks to send, and
* fill out senddispls, sendcounts (1) and sendtypes.
*/
if (rank == 0) {
/* 4 types of blocks -
* blocksize*blocksize, blocksize+1*blocksize, blocksize*blocksize+1, blocksize+1*blocksize+1
*/
MPI_Datatype blocktypes[4];
int subsizes[2];
int starts[2] = {0,0};
for (int i=0; i<2; i++) {
subsizes[0] = blocksize+i;
for (int j=0; j<2; j++) {
subsizes[1] = blocksize+j;
MPI_Type_create_subarray(2, globalsizes, subsizes, starts, MPI_ORDER_C, MPI_CHAR, &blocktypes[2*i+j]);
MPI_Type_commit(&blocktypes[2*i+j]);
}
}
/* now figure out the displacement and type of each processor's data */
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = 1;
senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize)*sizeof(char);
int idx = typeIdx(row, col, blocks);
sendtypes[proc] = blocktypes[idx];
}
}
MPI_Alltoallw(globalptr, sendcounts, senddispls, sendtypes,
&(localdata[0][0]), recvcounts, recvdispls, recvtypes,
MPI_COMM_WORLD);
}
/* second method: distribute almost all data using colums of size blocksize,
* then clean up the last row with another scatterv */
void twophasevecs(const int myrow, const int mycol, const int rank, const int size,
const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
const char *const globalptr, char **localdata) {
int sendcounts[ size ];
int senddispls[ size ];
int recvcounts;
for (int proc=0; proc<size; proc++) {
sendcounts[proc] = 0;
senddispls[proc] = 0;
}
/* We're going to be operating mostly in units of a single column of a "normal" sized block.
* There will need to be two vectors describing these columns; one in the context of the
* global array, and one in the local results.
*/
MPI_Datatype vec, localvec;
MPI_Type_vector(blocksize, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);
MPI_Type_vector(blocksize, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);
/* The originating process needs to allocate and fill the source array,
* and then define types defining the array chunks to send, and
* fill out senddispls, sendcounts (1) and sendtypes.
*/
if (rank == 0) {
/* create the vector type which will send one column of a "normal" sized-block */
/* then all processors except those in the last row need to get blocksize*vec or (blocksize+1)*vec */
/* will still have to do something to tidy up the last row of values */
/* we need to make the type have extent of 1 char for scattering */
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
senddispls[proc] = (row*blocksize*globalsizes[1] + col*blocksize);
}
}
recvcounts = localsizes[1];
MPI_Scatterv(globalptr, sendcounts, senddispls, vec,
&(localdata[0][0]), recvcounts, localvec, 0, MPI_COMM_WORLD);
MPI_Type_free(&localvec);
if (rank == 0)
MPI_Type_free(&vec);
/* now we need to do one more scatter, scattering just the last row of data
* just to the processors on the last row.
* Here we recompute the sendcounts
*/
if (rank == 0) {
for (int proc=0; proc<size; proc++) {
int row, col;
rowcol(proc, blocks, &row, &col);
sendcounts[proc] = 0;
senddispls[proc] = 0;
if ( isLastRow(row,blocks) ) {
sendcounts[proc] = blocksize;
senddispls[proc] = (globalsizes[0]-1)*globalsizes[1]+col*blocksize;
if ( isLastCol(col,blocks) )
sendcounts[proc] += 1;
}
}
}
recvcounts = 0;
if ( isLastRow(myrow, blocks) ) {
recvcounts = blocksize;
if ( isLastCol(mycol, blocks) )
recvcounts++;
}
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
&(localdata[blocksize][0]), recvcounts, MPI_CHAR, 0, MPI_COMM_WORLD);
}
/* third method: first distribute rows, then columns, each with a single scatterv */
void twophaseRowCol(const int myrow, const int mycol, const int rank, const int size,
const int blocks[2], const int blocksize, const int globalsizes[2], const int localsizes[2],
const char *const globalptr, char **localdata) {
char **rowdata ;
/* create communicators which have processors with the same row or column in them*/
MPI_Comm colComm, rowComm;
MPI_Comm_split(MPI_COMM_WORLD, myrow, rank, &rowComm);
MPI_Comm_split(MPI_COMM_WORLD, mycol, rank, &colComm);
/* first, scatter the array by rows, with the processor in column 0 corresponding to each row
* receiving the data */
if (mycol == 0) {
int sendcounts[ blocks[0] ];
int senddispls[ blocks[0] ];
senddispls[0] = 0;
for (int row=0; row<blocks[0]; row++) {
/* each processor gets blocksize rows, each of size globalsizes[1]... */
sendcounts[row] = blocksize*globalsizes[1];
if (row > 0)
senddispls[row] = senddispls[row-1] + sendcounts[row-1];
}
/* the last processor gets one more */
sendcounts[blocks[0]-1] += globalsizes[1];
/* allocate my rowdata */
rowdata = allocchar2darray( sendcounts[myrow], globalsizes[1] );
/* perform the scatter of rows */
MPI_Scatterv(globalptr, sendcounts, senddispls, MPI_CHAR,
&(rowdata[0][0]), sendcounts[myrow], MPI_CHAR, 0, colComm);
}
/* Now, within each row of processors, we can scatter the columns.
* We can do this as we did in the previous example; create a vector
* (and localvector) type and scatter accordingly */
int locnrows = blocksize;
if ( isLastRow(myrow, blocks) )
locnrows++;
MPI_Datatype vec, localvec;
MPI_Type_vector(locnrows, 1, globalsizes[1], MPI_CHAR, &vec);
MPI_Type_create_resized(vec, 0, sizeof(char), &vec);
MPI_Type_commit(&vec);
MPI_Type_vector(locnrows, 1, localsizes[1], MPI_CHAR, &localvec);
MPI_Type_create_resized(localvec, 0, sizeof(char), &localvec);
MPI_Type_commit(&localvec);
int sendcounts[ blocks[1] ];
int senddispls[ blocks[1] ];
if (mycol == 0) {
for (int col=0; col<blocks[1]; col++) {
sendcounts[col] = isLastCol(col, blocks) ? blocksize+1 : blocksize;
senddispls[col] = col*blocksize;
}
}
char *rowptr = (mycol == 0) ? &(rowdata[0][0]) : NULL;
MPI_Scatterv(rowptr, sendcounts, senddispls, vec,
&(localdata[0][0]), sendcounts[mycol], localvec, 0, rowComm);
MPI_Type_free(&localvec);
MPI_Type_free(&vec);
if (mycol == 0)
freechar2darray(rowdata);
MPI_Comm_free(&rowComm);
MPI_Comm_free(&colComm);
}
int main(int argc, char **argv) {
int rank, size;
int blocks[2] = {0,0};
const int blocksize=3;
int globalsizes[2], localsizes[2];
char **globaldata;
char *globalptr = NULL;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0 && argc < 2) {
fprintf(stderr,"Usage: %s method\n Where method is one of: alltoall, twophasevecs, twophaserowcol\n", argv[0]);
MPI_Abort(MPI_COMM_WORLD,1);
}
/* calculate sizes for a 2d grid of processors */
MPI_Dims_create(size, 2, blocks);
int myrow, mycol;
rowcol(rank, blocks, &myrow, &mycol);
/* create array sizes so that last block has 1 too many rows/cols */
globalsizes[0] = blocks[0]*blocksize+1;
globalsizes[1] = blocks[1]*blocksize+1;
if (rank == 0) {
globaldata = allocchar2darray(globalsizes[0], globalsizes[1]);
globalptr = &(globaldata[0][0]);
for (int i=0; i<globalsizes[0]; i++)
for (int j=0; j<globalsizes[1]; j++)
globaldata[i][j] = 'a'+(i*globalsizes[1] + j)%26;
printf("Global array: \n");
printarray(globaldata, globalsizes[0], globalsizes[1]);
}
/* the local chunk we'll be receiving */
localsizes[0] = blocksize; localsizes[1] = blocksize;
if ( isLastRow(myrow,blocks)) localsizes[0]++;
if ( isLastCol(mycol,blocks)) localsizes[1]++;
char **localdata = allocchar2darray(localsizes[0],localsizes[1]);
if (!strcasecmp(argv[1], "alltoall")) {
if (rank == 0) printf("Method - alltoall\n");
alltoall(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
} else if (!strcasecmp(argv[1],"twophasevecs")) {
if (rank == 0) printf("Method - two phase, vectors, then cleanup\n");
twophasevecs(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
} else {
if (rank == 0) printf("Method - two phase - row, cols\n");
twophaseRowCol(myrow, mycol, rank, size, blocks, blocksize, globalsizes, localsizes, globalptr, localdata);
}
for (int proc=0; proc<size; proc++) {
if (proc == rank) {
printf("\nRank %d:\n", proc);
printarray(localdata, localsizes[0], localsizes[1]);
}
MPI_Barrier(MPI_COMM_WORLD);
}
freechar2darray(localdata);
if (rank == 0)
freechar2darray(globaldata);
MPI_Finalize();
return 0;
}
char **allocchar2darray(int n, int m) {
char **ptrs = malloc(n*sizeof(char *));
ptrs[0] = malloc(n*m*sizeof(char));
for (int i=0; i<n*m; i++)
ptrs[0][i]='.';
for (int i=1; i<n; i++)
ptrs[i] = ptrs[i-1] + m;
return ptrs;
}
void freechar2darray(char **a) {
free(a[0]);
free(a);
}
void printarray(char **data, int n, int m) {
for (int i=0; i<n; i++) {
for (int j=0; j<m; j++)
putchar(data[i][j]);
putchar('\n');
}
}
void rowcol(int rank, const int blocks[2], int *row, int *col) {
*row = rank/blocks[1];
*col = rank % blocks[1];
}
int isLastRow(int row, const int blocks[2]) {
return (row == blocks[0]-1);
}
int isLastCol(int col, const int blocks[2]) {
return (col == blocks[1]-1);
}
int typeIdx(int row, int col, const int blocks[2]) {
int lastrow = (row == blocks[0]-1);
int lastcol = (col == blocks[1]-1);
return lastrow*2 + lastcol;
}
Not sure if that applies to you, but it helped me in the past so it might be usefull to others.
My answer applies in the context of parallele IO. The thing is that, if you know your access are not overlapping, you can successfully write/read even with variable sizes by using MPI_COMM_SELF
A piece of code I use every day contains:
MPI_File fh;
MPI_File_open(MPI_COMM_SELF, path.c_str(), MPI_MODE_CREATE|MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
// Lot of computation to get the size right
MPI_Datatype filetype;
MPI_Type_create_subarray(gsizes.size(), &gsizes[0], &lsizes[0], &offset[0], MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);
MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);
MPI_File_write(fh, &block->field[0], block->field.size(), MPI_FLOAT, MPI_STATUS_IGNORE);
MPI_File_close(&fh);

MPI receiving in multiple parts

Usually when I want to send a buffer to next processor and receive another one from previous one I use the following:
MPI_Irecv(rcv_buff,rcv_size,
MPI_DOUBLE,rcv_p,0,world,
&request);
MPI_Send(snd_buff,snd_size,
MPI_DOUBLE,snd_p,0,world);
MPI_Wait(&request,&status);
Suppose that I want to put the first rcv_size0 elements of rcv_buff in array0 and the rest (rcv_size1 elements) in array1, where:
rcv_size1=rcv_size-rcv_size0;
normally what I do is that I first create a dummy array like rcv_buff here and then start copying the values to array0 and array1. My question is that is there any way in MPI to receive the sent bytes in two or more sequences? for example directly receive the first size0 elements in array0 and the rest in array1?
You can do this - receiving into two buffers - by creating a type specific to that pair of buffers:
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
int recv_split(const int total, const int src, const int tag,
double *buffA, const int sizeA, double *buffB) {
if (total <= 0) return -1;
if (sizeA > total) return -1;
if (buffA == NULL) return -2;
if (buffB == NULL) return -2;
const int sizeB = total - sizeA;
int blocksizes[2] = {sizeA, sizeB};
MPI_Datatype types[2] = {MPI_DOUBLE, MPI_DOUBLE};
MPI_Aint displacements[2], addrA, addrB;
MPI_Datatype splitbuffer;
MPI_Status status;
displacements[0] = 0;
MPI_Get_address(buffA, &addrA);
MPI_Get_address(buffB, &addrB);
displacements[1] = addrB - addrA;
MPI_Type_create_struct(2, blocksizes, displacements, types, &splitbuffer);
MPI_Type_commit(&splitbuffer);
MPI_Recv(buffA, 1, splitbuffer, src, tag, MPI_COMM_WORLD, &status);
MPI_Type_free(&splitbuffer);
return 0;
}
int main(int argc, char **argv) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
const int sendSize = 15;
const int tag = 1;
if (rank == 0 && size >= 2) {
double sendbuff[sendSize];
for (int i=0; i<sendSize; i++)
sendbuff[i] = 1.*i;
MPI_Send(sendbuff, sendSize, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
}
if (rank == 1) {
const int buffLen = 12;
const int recvIntoA = 10;
double buffA[buffLen];
double buffB[buffLen];
for (int i=0; i<buffLen; i++) {
buffA[i] = buffB[i] = -1.;
}
recv_split(sendSize, 0, tag, buffA, recvIntoA, buffB);
printf("---Buffer A--\n");
for (int i=0; i<buffLen; i++)
printf("%5.1lf ", buffA[i]);
printf("\n---Buffer B--\n");
for (int i=0; i<buffLen; i++)
printf("%5.1lf ", buffB[i]);
printf("\n");
}
MPI_Finalize();
return 0;
}
compiling and running gives
$ mpicc -o recvsplit recvsplit.c -std=c99
$ mpirun -np 2 ./recvsplit
---Buffer A--
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 -1.0 -1.0
---Buffer B--
10.0 11.0 12.0 13.0 14.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
Just note that this type will only work for this pair of buffers; different pairs will generally have different relative displacements. You can also of course always receive into one large staging buffer and manually unpack into different buffers, using your own code or MPI_Unpack.
There's nothing that I know of directly in MPI that would allow you to do that, though there's probably some nasty pointer magic you could use to make it work. In general, it would be much cleaner to do it as two sends if you wanted to do things that way.
Another thing that's not a direct answer to your question, but you might not have known about is that your three line command above can be combined into one using MPI_SENDRECV. Try this line out:
MPI_Sendrecv(snd_buff, snd_size, MPI_DOUBLE, snd_p, 0,
rcv_buff, rcv_size, MPI_DOUBLE, rcv_p, 0,
world, &status);