Summing and Gathering elements of array element-wise in MPI - c++

After doing the calculations to multiply a matrix with a vector using a Cartesian topology, I got the following processes with their ranks and local vectors:
P0 (process with rank = 0) = [2, 9]
P1 (process with rank = 1) = [2, 3]
P2 (process with rank = 2) = [1, 9]
P3 (process with rank = 3) = [4, 6]
Now I need to sum the elements of the even-rank processes and of the odd ones separately, like this:
temp1 = [3 , 18]
temp2 = [6 , 9]
and then , gather the results in a different vector, like this:
result = [3 , 18 , 6 , 9]
My attempt was to use MPI_Reduce and then MPI_Gather, like this:
// Previous code
double *temp1, *temp2;
if (myrank % 2 == 0) {
    BOOLEAN flag = Allocate_vector(&temp1, local_m); // function to allocate space for vectors
    MPI_Reduce(local_y, temp1, local_n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Gather(temp1, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, comm);
    free(temp1);
}
else {
    Allocate_vector(&temp2, local_m);
    MPI_Reduce(local_y, temp2, local_n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Gather(temp2, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, comm);
    free(temp2);
}
But the answer is not correct. It seems that the code sums all the elements of the even and odd processes together and then gives a segmentation fault error:
Wrong_result = [21 15 0 0]
and this error:
*** Error in `./test': double free or corruption (fasttop): 0x00000000013c7510 ***
*** Error in `./test': double free or corruption (fasttop): 0x0000000001605b60 ***

It won't work the way you are trying to do it. To perform reduction over the elements of a subset of processes, you have to create a subcommunicator for them. In your case, the odd and the even processes share the same comm, therefore the operations are not over the two separate groups of processes but rather over the combined group.
You should use MPI_Comm_split to perform a split, perform the reduction using the two new subcommunicators, and finally have rank 0 in each subcommunicator (let's call those leaders) participate in the gather over another subcommunicator that contains those two only:
// Make sure rank is set accordingly
MPI_Comm_rank(comm, &rank);

// Split even and odd ranks in separate subcommunicators
MPI_Comm subcomm;
MPI_Comm_split(comm, rank % 2, 0, &subcomm);

// Perform the reduction in each separate group
double *temp;
Allocate_vector(&temp, local_n);
MPI_Reduce(local_y, temp, local_n, MPI_DOUBLE, MPI_SUM, 0, subcomm);

// Find out our rank in subcomm
int subrank;
MPI_Comm_rank(subcomm, &subrank);

// At this point, we no longer need subcomm. Free it and reuse the variable.
MPI_Comm_free(&subcomm);

// Separate both group leaders (rank 0) into their own subcommunicator
MPI_Comm_split(comm, subrank == 0 ? 0 : MPI_UNDEFINED, 0, &subcomm);
if (subcomm != MPI_COMM_NULL) {
    MPI_Gather(temp, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, subcomm);
    MPI_Comm_free(&subcomm);
}

// Free resources
free(temp);
The result will be in gResult of rank 0 in the latter subcomm, which happens to be rank 0 in comm because of the way the splits are performed.
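For completeness, gResult has to point to valid memory only on that rank; a minimal sketch, reusing the Allocate_vector helper from the question (its exact signature is an assumption on my part):
double *gResult = NULL;
if (rank == 0)
    Allocate_vector(&gResult, 2 * local_n); // room for one block of local_n from each group leader (even and odd)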
Not as simple as expected, I guess, but that is the price of having convenient collective operations in MPI.
On a side note, in the code shown you are allocating temp1 and temp2 with length local_m, while in all collective calls the length is specified as local_n. If it happens that local_n > local_m, heap corruption will occur.
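A minimal fix is to allocate with the same count that is passed to the collectives, keeping the Allocate_vector helper from the question:
BOOLEAN flag = Allocate_vector(&temp1, local_n); // was local_m; must match the count used in MPI_Reduce/MPI_Gather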

Related

MPI_collective communication

I am trying to code quicksort in MPI. The parallelization algorithm is simple: the root scatters the list over MPI_COMM_WORLD, then each node executes qsort() on its subarray, and MPI_Gather() is used to give all subarrays back to the root, which runs qsort on the result again. So simple. However, I get an error. I guessed that maybe the size of the sub-arrays is not exact, because the code simply divides the size of the list by comm_size, so it seemed likely that there would be a segmentation fault. However, I give the list a size of 1000 and use 4 processors, so the result of the division is 250 and there should be no segmentation fault. But there is. Could you tell me where I am wrong?
int main()
{
    int array[1000];
    int arrsize;
    int chunk;
    int* subarray;
    int rank;
    int comm_size;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        time_t t;
        srand((unsigned)time(&t));
        int arrsize = sizeof(array) / sizeof(int);
        for (int i = 0; i < arrsize; i++)
            array[i] = rand() % 1000;
        printf("\n this is processor %d and the unsorted array is:", rank);
        printArray(array, arrsize);
    }

    MPI_Scatter(array, arrsize, MPI_INT, subarray, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    chunk = (int)(arrsize / comm_size);
    subarray = (int*)calloc(arrsize, sizeof(int));

    if (rank != 0)
    {
        qsort(subarray, chunk, sizeof(int), comparetor);
    }

    MPI_Gather(subarray, chunk, MPI_INT, array, arrsize, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
    {
        qsort(array, arrsize, sizeof(int), comparetor);
        printf("\n this is processor %d and this is sorted array: ", rank);
        printArray(array, arrsize);
    }

    free(subarray);
    MPI_Finalize();
    return 0;
}
and the error says:
Invalid MIT-MAGIC-COOKIE-1 key[h:04865] *** Process received signal ***
[h:04865] Signal: Segmentation fault (11)
[h:04865] Signal code: Address not mapped (1)
[h:04865] Failing at address: 0x421e45
[h:04865] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1906b29210]
[h:04865] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e533)[0x7f1906c71533]
[h:04865] [ 2] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x4054f)[0x7f190699654f]
[h:04865] [ 3] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_datatype_sndrcv+0x51a)[0x7f1906f3288a]
[h:04865] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_scatter_intra_basic_linear+0x12c)[0x7f1906f75dec]
[h:04865] [ 5] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Scatter+0x10d)[0x7f1906f5952d]
[h:04865] [ 6] ./parallelQuickSortMPI(+0xc8a5)[0x5640c424b8a5]
[h:04865] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f1906b0a0b3]
[h:04865] [ 8] ./parallelQuickSortMPI(+0xc64e)[0x5640c424b64e]
[h:04865] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node h exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The reason for the segmentation fault is in the lines below.
MPI_Scatter( array,arrsize,MPI_INT, subarray,chunk,MPI_INT,0,MPI_COMM_WORLD);
chunk = (int)(arrsize/comm_size);
subarray = (int*)calloc(arrsize,sizeof(int));
You are allocating subarray and calculating the chunk size only after the MPI_Scatter operation. MPI_Scatter is a collective operation, and the necessary memory (e.g. the receive array) as well as the size to receive must be declared and defined before the call.
chunk = (int)(arrsize/comm_size);
subarray = (int*)calloc(arrsize,sizeof(int));
MPI_Scatter( array,arrsize,MPI_INT, subarray,chunk,MPI_INT,0,MPI_COMM_WORLD);
The above is the right way. You will move past the segmentation fault with this change.

C++ split integer array into chunks

I guess my question has 2 parts:
(1) Is this the right approach to send different chunks of an array to different processors?
Let's say I have n processors whose rank ranges from 0 to n-1.
I have an array of size d. I want to split this array into k equally-sized chunks. Assume d is divisible by k.
I want to send each of these chunks to a processor whose rank is less than k.
It would be easy if I could use something like MPI_Scatter, but this function sends to EVERY other processor, and I only want to send to a certain number of procs.
So what I did was write a loop of k iterations that does k MPI_Isends.
Is this efficient?
(2) If it is, how do I split an array into chunks? There's always the easy way, which is
int size = d/k;
int buffs[k][size];
for (int rank = 0; rank < k; ++rank)
{
    for (int i = 0; i < size; ++i)
        buffs[rank][i] = input[rank*size + i];
    MPI_Isend(&buffs[rank], size, MPI_INT, rank, 1, comm, &request);
}
What you are looking for is MPI_Scatterv, which allows you to explicitly specify the length of each chunk and its position relative to the beginning of the buffer. If you don't want to send data to certain ranks, simply set the length of their chunks to 0:
int blen[n];
int displ[n];

for (int rank = 0; rank < n; rank++)
{
    blen[rank] = (rank < k) ? size : 0;
    displ[rank] = rank * size;
}

int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

MPI_Scatterv(input, blen, displ, MPI_INT,
             mybuf, myrank < k ? size : 0, MPI_INT,
             0, MPI_COMM_WORLD);
Note that for rank >= k the displacements will run past the end of the buffer. That is all right since block lengths are set to zero for rank >= k and no data will be accessed.
As for your original approach, it is not portable and might not always work. The reason is that you are overwriting the same request handle and you never wait for the sends to complete. The correct implementation is:
MPI_Request request[k];

for (int rank = 0; rank < k; ++rank)
{
    MPI_Isend(&input[rank*size], size, MPI_INT, rank, 1, comm, &request[rank]);
}

MPI_Waitall(k, request, MPI_STATUSES_IGNORE);
The optimal implementation would be to use MPI_Scatter in a subcommunicator:
MPI_Comm subcomm;
MPI_Comm_split(MPI_COMM_WORLD, myrank < k ? 0 : MPI_UNDEFINED, myrank,
               &subcomm);

// Now there are k ranks in subcomm
// Perform the scatter in the subcommunicator
if (subcomm != MPI_COMM_NULL)
    MPI_Scatter(input, size, MPI_INT, mybuf, size, MPI_INT, 0, subcomm);
The MPI_Comm_split call splits MPI_COMM_WORLD and creates a new communicator from all original ranks less than k. It uses the original rank as the key for ordering the ranks in the new communicator, so rank 0 in MPI_COMM_WORLD becomes rank 0 in subcomm. Since MPI_Scatter often performs better than MPI_Scatterv, this is the optimal solution.

Computing permutations using a number of elements

I am trying to generate all possible permutations of a set of elements. The order doesn't matter, and elements may be present multiple times. The number of elements in each permutation is equal to the total number of elements.
A basic recursive algorithm for computing permutations following this schema (as I am writing in C++, the code will look similar to it):
elems = [0, 1, .., n-1]; // n unique elements. numbers only exemplary.
current = [];            // array of size n
perms(elems, current, 0); // initial call

perms(array elems, array current, int depth) {
    if (depth == elems.size) print current;
    else {
        for (elem : elems) {
            current[depth] = elem;
            perms(elems, current, depth+1);
        }
    }
}
would produce a large number of redundant sequences, e.g.:
0, 0, .., 0, 0
0, 0, .., 0, 1 // this
0, 0, .., 0, 2
. . . . .
. . . . .
0, 0, .., 0, n-1
0, 0, .., 1, 0 // is the same as this
. . . . . // many more redundant ones to follow
I tried to identify when exactly generating values can be skipped, but so far I have found nothing useful. I am sure I can find a way to hack around this, but I am also sure that there is a rule behind it which I just haven't managed to see.
Edit: Possible solution
elems = [0, 1, .., n-1]; // n unique elements. numbers only exemplary.
current = [];            // array of size n
perms(elems, current, 0, 0); // initial call

perms(array elems, array current, int depth, int minimum) {
    if (depth == elems.size) print current;
    else {
        for (int i=minimum; i<elems.size; i++) {
            current[depth] = elems[i];
            perms(elems, current, depth+1, i);
        }
    }
}
Make your first position vary from 0 to n. Then set your second position to 1 and make your first position vary from 1 to n. Then set the second to 2 --> first from 2 to n, and so on.
I believe one such rule is to have the elements in each sequence be in non-decreasing order (or non-increasing, if you prefer).
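For illustration, here is a small self-contained C++ translation of that rule, essentially the edited perms above with a minimum index (names borrowed from the pseudocode; the count in the comment is my own check, not from the question):
#include <cstddef>
#include <cstdio>
#include <vector>

// Only recurse on indices >= minimum, so every emitted sequence is
// non-decreasing and each multiset of n elements appears exactly once.
void perms(const std::vector<int>& elems, std::vector<int>& current,
           std::size_t depth, std::size_t minimum) {
    if (depth == elems.size()) {
        for (int v : current) std::printf("%d ", v);
        std::printf("\n");
        return;
    }
    for (std::size_t i = minimum; i < elems.size(); ++i) {
        current[depth] = elems[i];
        perms(elems, current, depth + 1, i); // never go back to a smaller index
    }
}

int main() {
    std::vector<int> elems = {0, 1, 2};
    std::vector<int> current(elems.size());
    perms(elems, current, 0, 0); // prints the 10 multisets of size 3 over {0, 1, 2}
    return 0;
}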

Bijective mapping of integers

English is not my native language: sorry for my mistakes. Thank you in advance for your answers.
I'm learning C++ and I'm trying to check to what extent two sets with the same number of integers--in whatever order--are bijective.
Example :
int ArrayA [4] = { 0, 0, 3, 4 };
int ArrayB [4] = { 4, 0, 0, 3 };
ArrayA and ArrayB are bijective.
My implementation is naive.
int i, x=0;
std::sort(std::begin(ArrayA), std::end(ArrayA));
std::sort(std::begin(ArrayB), std::end(ArrayB));
for (i=0; i<4; i++) if (ArrayA[i]!=ArrayB[i]) x++;
If x == 0, then the two sets are bijective. Easy.
My problem is the following: I would like to count the number of bijections between the sets, and not only the whole property of the relationship between ArrayA and ArrayB.
Example :
int ArrayA [4] = { 0, 0, 0, 1 }
int ArrayB [4] = { 3, 1, 3, 0 }
Are the sets bijective as a whole? No. But there are 2 bijections (0 and 0, 1 and 1).
With my code, the output would be 1 bijection. Indeed, if we sort the arrays, we get:
ArrayA = 0, 0, 0, 1;
ArrayB = 0, 1, 3, 3.
A side-by-side comparison shows only a bijection between 0 and 0.
Then, my question is:
Do you know a method to map elements between two equally-sized sets and count the number of bijections, whatever the order of the integers?
Solved!
The answer given by Ivaylo Strandjev works:
Sort the sets,
Use the std::set_intersection function,
Profit.
You need to count the number of elements that are contained in both sets. This is called set intersection and it can be done with a standard function - std::set_intersection, part of the <algorithm> header. Keep in mind you still need to sort the two arrays first.
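A minimal sketch of that approach, using the second pair of arrays from the question:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

int main() {
    int ArrayA[4] = { 0, 0, 0, 1 };
    int ArrayB[4] = { 3, 1, 3, 0 };

    std::sort(std::begin(ArrayA), std::end(ArrayA));
    std::sort(std::begin(ArrayB), std::end(ArrayB));

    // The multiset intersection keeps each value min(countA, countB) times,
    // so its size is the number of matched pairs.
    std::vector<int> common;
    std::set_intersection(std::begin(ArrayA), std::end(ArrayA),
                          std::begin(ArrayB), std::end(ArrayB),
                          std::back_inserter(common));

    std::cout << common.size() << " bijection(s)" << std::endl; // prints 2 for this input
    return 0;
}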

Understanding MPI_Allgatherv in plain english

I've been learning how to implement MPI over the past couple of weeks and I'm having a very hard time understanding how to set up some of the input arguments for MPI_Allgatherv. I'll use a toy example because I need to take baby steps here. Some of the research I've done is listed at the end of this post (including my previous question, which led me to this question). First, a quick summary of what I'm trying to accomplish:
--Summary-- I'm taking a std::vector A, having multiple processors work on different parts of A, and then taking the updated parts of A and redistributing those updates to all processors. Therefore, all processors start with copies of A, update portions of A, and end with fully updated copies of A.--End--
Let's say I have a std::vector < double > containing 5 elements called "mydata" initialized as follows:
for (int i = 0; i < 5; i++)
{
    mydata[i] = (i+1)*1.1;
}
Now let's say I'm running my code on 2 nodes (int tot_proc = 2). I identify the "current" node using "int id_proc," therefore, the root processor has id_proc = 0. Since the number of elements in mydata is odd, I cannot evenly distribute the work between processors. Let's say that I always break the work up as follows:
if (id_proc < tot_proc - 1)
{
    //handle mydata.size()/tot_proc elements
}
else
{
    //handle whatever is left over
}
In this example, that means:
id_proc = 0 will work on mydata[0] and mydata[1] (2 elements, since 5/2 = 2) … and … id_proc = 1 will work on mydata[2] - mydata[4] (3 elements, since 5/2 + 5%2 = 3)
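As a concrete sketch of that split (my own illustration, assuming id_proc and tot_proc have already been set via MPI_Comm_rank and MPI_Comm_size):
int base  = mydata.size() / tot_proc;          // 5/2 = 2
int start = id_proc * base;                    // 0 on rank 0, 2 on rank 1
int count = (id_proc < tot_proc - 1)
              ? base                           // every rank but the last handles 'base' elements
              : (int)mydata.size() - start;    // the last rank handles whatever is left over (3 here)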
Once each processor has worked on their respective portions of mydata, I want to use Allgatherv to merge the results together so that mydata on each processor contains all of the updated values. We know Allgatherv takes 8 arguments: (1) the starting address of the elements/data being sent, (2) the number of elements being sent, (3) the type of data being sent, which is MPI_DOUBLE in this example, (4) the address of the location you want the data to be received (no mention of "starting" address), (5) the number of elements being received, (6) the "displacements" in memory relative to the receiving location in argument #4, (7) the type of data being received, again, MPI_DOUBLE, and (8) the communicator you're using, which in my case is simply MPI_COMM_WORLD.
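For reference, the C binding of the routine being described is (argument names as in the MPI standard; the const qualifiers were added in MPI-3):
int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm);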
Now here's where the confusion begins. Since processor 0 worked on the first two elements, and processor 1 worked on the last 3 elements, then processor 0 will need to SEND the first two elements, and processor 1 will need to SEND the last 3 elements. To me, this suggests that the first two arguments of Allgatherv should be:
Processor 0: MPI_Allgatherv(&mydata[0],2,…
Processor 1: MPI_Allgatherv(&mydata[2],3,…
(Q1) Am I right about that? If so, my next question is in regard to the format of argument 2. Let's say I create a std::vector < int > sendcount such that sendcount[0] = 2, and sendcount[1] = 3.
(Q2) Does Argument 2 require the reference to the first location of sendcount, or do I need to send the reference to the location relevant to each processor? In other words, which of these should I do:
Q2 - OPTION 1
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[0],…
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[0],…
Q2 - OPTION 2
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[id_proc], … (here id_proc = 0)
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[id_proc], … (here id_proc = 1)
...On to Argument 4. Since I am collecting different sections of mydata back into itself, I suspect that this argument will look similar to Argument 1. i.e. it should be something like &mydata[?]. (Q3) Can this argument simply be a reference to the beginning of mydata (i.e. &mydata[0]), or do I have to change the index the way I did for Argument 1? (Q4) Imagine I had used 3 processors. This would mean that Processor 1 would be sending mydata[2] and mydata[3] which are in "the middle" of the vector. Since the vector's elements are contiguous, then the data that Processor 1 is receiving has to be split (some goes before, and mydata[4] goes after). Do I have to account for that split in this argument, and if so, how?
...Slightly more confusing to me is Argument 5 but I had an idea this morning. Using the toy example: if Processor 0 is sending 2 elements, then it will be receiving 3, correct? Similarly, if Processor 1 is sending 3 elements, then it is receiving 2. (Q5) So, if I were to create a std::vector < int > recvcount, couldn't I just initialize it as:
for (int i = 0; i < tot_proc; i++)
{
recvcount[i] = mydata.size() - sendcount[i];
}
And if that is true, then do I pass it to Allgatherv as &recvcount[0] or &recvcount[id_proc] (similar to Argument 2)?
Finally, Argument 6. I know this is tied to my input for Argument 4. My guess is the following: if I were to pass &mydata[0] as Argument 4 on all processors, then the displacements are the number of positions in memory that I need to move in order to get to the first location where data actually needs to be received. For example,
Processor 0: MPI_Allgatherv( … , &mydata[0], … , 2, … );
Processor 1: MPI_Allgatherv( … , &mydata[0], … , 0, … );
(Q5) Am I right in thinking that the above two lines mean "Processor 0 will receive data beginning at location &mydata[0+2]. Processor 1 will receive data beginning at location &mydata[0+0]."? And what happens when the data needs to be split like in Q4? Finally, since I am collecting portions of a vector back into itself (replacing mydata with updated mydata by overwriting it), this tells me that all processors other than the root process will be receiving data beginning at &mydata[0]. (Q6) If this is true, then shouldn't the displacements be 0 for all processors that are not the root?
Some of the links I've read:
Difference between MPI_allgather and MPI_allgatherv
Difference between MPI_Allgather and MPI_Alltoall functions?
Problem with MPI_Gatherv for std::vector
C++: Using MPI's gatherv to concatenate vectors of differing lengths
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Allgatherv.html
https://computing.llnl.gov/tutorials/mpi/#Routine_Arguments
My previous post on stackoverflow:
MPI C++ matrix addition, function arguments, and function returns
Most tutorials, etc, that I've read just gloss over Allgatherv.
Part of the confusion here is that you're trying to do an in-place gather; you are trying to send from and receive into the same array. If you're doing that, you should use the MPI_IN_PLACE option, in which case you don't explicitly specify the send location or count. Those are there for when you are sending from a different buffer than the one you're receiving into, but in-place gathers are somewhat more constrained.
So this works:
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char **argv) {
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size < 2) {
        std::cerr << "This demo requires at least 2 procs." << std::endl;
        MPI_Finalize();
        return 1;
    }

    int datasize = 2*size + 1;
    std::vector<int> data(datasize);

    /* break up the elements */
    int *counts = new int[size];
    int *disps  = new int[size];

    int pertask = datasize/size;
    for (int i=0; i<size-1; i++)
        counts[i] = pertask;
    counts[size-1] = datasize - pertask*(size-1);

    disps[0] = 0;
    for (int i=1; i<size; i++)
        disps[i] = disps[i-1] + counts[i-1];

    int mystart = disps[rank];
    int mycount = counts[rank];
    int myend   = mystart + mycount - 1;

    /* everyone initialize our data */
    for (int i=mystart; i<=myend; i++)
        data[i] = 0;

    int nsteps = size;
    for (int step = 0; step < nsteps; step++) {
        for (int i=mystart; i<=myend; i++)
            data[i] += rank;

        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       &(data[0]), counts, disps, MPI_INT, MPI_COMM_WORLD);

        if (rank == step) {
            std::cout << "Rank " << rank << " has array: [";
            for (int i=0; i<datasize-1; i++)
                std::cout << data[i] << ", ";
            std::cout << data[datasize-1] << "]" << std::endl;
        }
    }

    delete [] disps;
    delete [] counts;

    MPI_Finalize();
    return 0;
}
Running gives
$ mpirun -np 3 ./allgatherv
Rank 0 has array: [0, 0, 1, 1, 2, 2, 2]
Rank 1 has array: [0, 0, 2, 2, 4, 4, 4]
Rank 2 has array: [0, 0, 3, 3, 6, 6, 6]