I am trying to code quicksort in MPI. The parallelization scheme is simple: the root scatters the list across MPI_COMM_WORLD, each node runs qsort() on its sub-array, and MPI_Gather() brings all the sub-arrays back to the root, which runs qsort() on the whole list once more. Simple enough, yet I get an error. I guessed that maybe the size of the sub-arrays is not exact, because the code simply divides the list size by comm_size, so a segmentation fault would be likely. However, I use a list of size 1000 and 4 processors, so the division gives exactly 250 and there should be no segmentation fault. But there is. Could you tell me where I am wrong?
int main()
{
    int array[1000];
    int arrsize;
    int chunk;
    int *subarray;
    int rank;
    int comm_size;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
    {
        time_t t;
        srand((unsigned)time(&t));
        int arrsize = sizeof(array) / sizeof(int);
        for (int i = 0; i < arrsize; i++)
            array[i] = rand() % 1000;
        printf("\n this is processor %d and the unsorted array is:", rank);
        printArray(array, arrsize);
    }

    MPI_Scatter(array, arrsize, MPI_INT, subarray, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    chunk = (int)(arrsize / comm_size);
    subarray = (int*)calloc(arrsize, sizeof(int));

    if (rank != 0)
    {
        qsort(subarray, chunk, sizeof(int), comparetor);
    }

    MPI_Gather(subarray, chunk, MPI_INT, array, arrsize, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
    {
        qsort(array, arrsize, sizeof(int), comparetor);
        printf("\n this is processor %d and this is sorted array: ", rank);
        printArray(array, arrsize);
    }

    free(subarray);
    MPI_Finalize();
    return 0;
}
and the error says:
Invalid MIT-MAGIC-COOKIE-1 key[h:04865] *** Process received signal ***
[h:04865] Signal: Segmentation fault (11)
[h:04865] Signal code: Address not mapped (1)
[h:04865] Failing at address: 0x421e45
[h:04865] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f1906b29210]
[h:04865] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x18e533)[0x7f1906c71533]
[h:04865] [ 2] /lib/x86_64-linux-gnu/libopen-pal.so.40(+0x4054f)[0x7f190699654f]
[h:04865] [ 3] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_datatype_sndrcv+0x51a)[0x7f1906f3288a]
[h:04865] [ 4] /lib/x86_64-linux-gnu/libmpi.so.40(ompi_coll_base_scatter_intra_basic_linear+0x12c)[0x7f1906f75dec]
[h:04865] [ 5] /lib/x86_64-linux-gnu/libmpi.so.40(PMPI_Scatter+0x10d)[0x7f1906f5952d]
[h:04865] [ 6] ./parallelQuickSortMPI(+0xc8a5)[0x5640c424b8a5]
[h:04865] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f1906b0a0b3]
[h:04865] [ 8] ./parallelQuickSortMPI(+0xc64e)[0x5640c424b64e]
[h:04865] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node h exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The reason for the segmentation fault is in the lines below.
MPI_Scatter( array,arrsize,MPI_INT, subarray,chunk,MPI_INT,0,MPI_COMM_WORLD);
chunk = (int)(arrsize/comm_size);
subarray = (int*)calloc(arrsize,sizeof(int));
You are allocating subarray and calculating the chunk size only after the MPI_Scatter operation. MPI_Scatter is a collective operation, and the necessary memory (e.g. the receive buffer) as well as the count to receive must be allocated and defined before the call.
chunk = (int)(arrsize/comm_size);
subarray = (int*)calloc(arrsize,sizeof(int));
MPI_Scatter( array,arrsize,MPI_INT, subarray,chunk,MPI_INT,0,MPI_COMM_WORLD);
Above is the right way. You will move past the segmentation fault with this change.
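For completeness, here is a minimal self-contained sketch of the whole flow (my own sketch, not the poster's program). Besides moving the allocation and the chunk computation before the collective, it uses chunk rather than arrsize as the per-rank count in MPI_Scatter and MPI_Gather and makes sure the array size is defined on every rank, since those counts would otherwise mismatch once the reordering is in place. The comparator below stands in for the comparetor the question references.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <ctime>

/* stand-in for the comparator referenced in the question */
static int comparetor(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

int main(void)
{
    const int ARRSIZE = 1000;            /* assumed divisible by the number of ranks */
    int array[1000];
    int rank, comm_size;

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int chunk = ARRSIZE / comm_size;                     /* computed before the collective */
    int *subarray = (int *)calloc(chunk, sizeof(int));   /* allocated before the collective */

    if (rank == 0) {
        srand((unsigned)time(NULL));
        for (int i = 0; i < ARRSIZE; i++)
            array[i] = rand() % 1000;
    }

    /* each rank receives exactly 'chunk' ints and sorts them locally */
    MPI_Scatter(array, chunk, MPI_INT, subarray, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    qsort(subarray, chunk, sizeof(int), comparetor);
    MPI_Gather(subarray, chunk, MPI_INT, array, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        qsort(array, ARRSIZE, sizeof(int), comparetor);  /* final sort of the gathered chunks */
        printf("sorted: first=%d last=%d\n", array[0], array[ARRSIZE - 1]);
    }

    free(subarray);
    MPI_Finalize();
    return 0;
}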
Related
char m[2];
vector< vector<tuple<long int, long int>> > f(3);
for (i = 0; i < M; i++) {
    scanf("%s %ld %ld", m, &n, &c);
    if (m[0] == '1') {                            // f --> 1
        f[m[1]-65].push_back(make_tuple(c, n));   // turn ASCII to number
        m1 += cost;
    }
    .....   // irrelevant to f
}
I have this piece of code and I am facing this problem:
f is supposed to be a 2D vector of vectors of tuples.
I know I need 3 rows in f but I don't know the number of elements in each row (that's why I have a vector).
I thought that putting that 3 there defines that I want 3 rows. But it seems something is wrong: when I have a test case where f[0], f[1] or f[2] has to hold more than two elements, I get this error message: *** stack smashing detected ***: <unknown> terminated Aborted (core dumped).
If, on the other hand, I remove that 3 completely: vector< vector<tuple<long int, long int>> > f, I get a seg fault, and I suppose it has to do with the fact that I am accessing, with f[m[1]-65], a row that does not yet exist?
So the thing is, I need to be able to separate my input according to m, and I know that m[0] = {1,2,3}, so I am using that to get m[0] - 65 = {0,1,2} as indices and fill those 3 vectors. Can you help me get through this?
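For what it's worth, here is a minimal sketch of how I read the intended layout (a sketch under my own assumptions, not an official answer): the letter after the leading digit selects the row, with 'A'..'C' mapping to rows 0..2, and the token is read into a buffer with room for the terminating '\0'. Note that scanf("%s") into char m[2] writes out of bounds for any token of two or more characters, which is consistent with the stack-smashing report.
#include <cstdio>
#include <tuple>
#include <vector>
using namespace std;

int main() {
    int M = 3;                                   // number of input lines (assumed)
    vector<vector<tuple<long, long>>> f(3);      // 3 fixed rows, each row grows as needed
    char m[8];                                   // room for a short token plus '\0'
    long n, c;
    for (int i = 0; i < M; i++) {
        if (scanf("%7s %ld %ld", m, &n, &c) != 3)
            break;
        if (m[0] == '1')                         // assumes m[1] is 'A', 'B' or 'C'
            f[m[1] - 'A'].push_back(make_tuple(c, n));
    }
    for (size_t r = 0; r < f.size(); r++)
        printf("row %zu holds %zu tuples\n", r, f[r].size());
    return 0;
}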
This is a function that I made to calculate Fibonacci numbers, but when I run it I get the following error:
Error in `': free(): invalid next size (fast): 0x0000000000667890
int fib(int n) {
int fibn=0;
std::vector<int> x{0,1};
for(int i = 2 ; i <= n ; i++)
{
x[i]=x[i-2]+x[i-1];
}
fibn=x[n];
return fibn;
}
std::vector<int> x{0,1};
You have a vector with two elements. Valid indices are 0 and 1.
for(int i = 2 ; i <= n ; i++)
{
x[i]=x[i-2]+x[i-1];
}
From the first iteration onward you access x[2] and beyond, which is outside the bounds of the vector. The behaviour of the program is undefined.
You don't need to store the series in a vector since you're only returning the last value. You only need to store the last two values.
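A small sketch of that suggestion, keeping only the last two values (assumes n >= 0):
int fib(int n) {
    if (n == 0) return 0;
    int prev = 0, cur = 1;           // fib(0) and fib(1)
    for (int i = 2; i <= n; i++) {
        int next = prev + cur;       // fib(i)
        prev = cur;
        cur = next;
    }
    return cur;
}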
Your vector x only has 2 elements, but your loop starts by setting i to 2 and then does x[i] (aka x[2]) on the first iteration, which is out of bounds since only the indices 0 and 1 are valid.
Remember that array indices start at 0 in C++.
Accessing out of bounds is Undefined Behaviour and as a result your entire program is invalid and the compiler is not required to generate anything sensible, nor is it obliged to tell you about your error.
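As a side illustration (my addition, not part of either answer): std::vector::at performs the same element access with bounds checking, so this kind of mistake surfaces as an exception rather than as silent memory corruption:
#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
    std::vector<int> x{0, 1};
    try {
        x.at(2) = x.at(0) + x.at(1);   // at() throws std::out_of_range; operator[] would not
    } catch (const std::out_of_range &e) {
        std::printf("caught: %s\n", e.what());
    }
    return 0;
}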
After doing calculations to multiply a matrix with a vector using a Cartesian topology, I got the following processes with their ranks and vectors:
P0 (process with rank = 0) =[2 , 9].
P1 (process with rank = 1) =[2 , 3]
P2 (process with rank = 2) =[1 , 9]
P3 (process with rank = 3) =[4 , 6].
Now I need to sum the elements of the even-rank processes and the odd-rank ones separately, like this:
temp1 = [3 , 18]
temp2 = [6 , 9]
and then , gather the results in a different vector, like this:
result = [3 , 18 , 6 , 9]
My attempt to do it is to use MPI_Reduce and then MPI_Gather, like this:
// Previous code
double *temp1, *temp2;
if (myrank % 2 == 0) {
    BOOLEAN flag = Allocate_vector(&temp1, local_m); // function to allocate space for vectors
    MPI_Reduce(local_y, temp1, local_n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Gather(temp1, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, comm);
    free(temp1);
}
else {
    Allocate_vector(&temp2, local_m);
    MPI_Reduce(local_y, temp2, local_n, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Gather(temp2, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, comm);
    free(temp2);
}
But the answer is not correct. It seems that the code sums all elements of the even and odd processes together and then gives a segmentation fault error:
Wrong_result = [21 15 0 0]
and this error
*** Error in `./test': double free or corruption (fasttop): 0x00000000013c7510 ***
*** Error in `./test': double free or corruption (fasttop): 0x0000000001605b60 ***
It won't work the way you are trying to do it. To perform reduction over the elements of a subset of processes, you have to create a subcommunicator for them. In your case, the odd and the even processes share the same comm, therefore the operations are not over the two separate groups of processes but rather over the combined group.
You should use MPI_Comm_split to perform a split, perform the reduction using the two new subcommunicators, and finally have rank 0 in each subcommunicator (let's call those leaders) participate in the gather over another subcommunicator that contains those two only:
// Make sure rank is set accordingly
MPI_Comm_rank(comm, &rank);
// Split even and odd ranks in separate subcommunicators
MPI_Comm subcomm;
MPI_Comm_split(comm, rank % 2, 0, &subcomm);
// Perform the reduction in each separate group
double *temp;
Allocate_vector(&temp, local_n);
MPI_Reduce(local_y, temp, local_n , MPI_DOUBLE, MPI_SUM, 0, subcomm);
// Find out our rank in subcomm
int subrank;
MPI_Comm_rank(subcomm, &subrank);
// At this point, we no longer need subcomm. Free it and reuse the variable.
MPI_Comm_free(&subcomm);
// Separate both group leaders (rank 0) into their own subcommunicator
MPI_Comm_split(comm, subrank == 0 ? 0 : MPI_UNDEFINED, 0, &subcomm);
if (subcomm != MPI_COMM_NULL) {
MPI_Gather(temp, local_n, MPI_DOUBLE, gResult, local_n, MPI_DOUBLE, 0, subcomm);
MPI_Comm_free(&subcomm);
}
// Free resources
free(temp);
The result will be in gResult of rank 0 in the latter subcomm, which happens to be rank 0 in comm because of the way the splits are performed.
Not as simple as expected, I guess, but that is the price of having convenient collective operations in MPI.
On a side note, in the code shown you are allocating temp1 and temp2 to be of length local_m, while in all collective calls the length is specified as local_n. If it happens that local_n > local_m, then heap corruption will occur.
I guess my question has 2 parts:
(1) Is this the right approach to send different chunks of an array to different processors?
Let's say I have n processors whose rank ranges from 0 to n-1.
I have an array of size d. I want to split this array into k equally-sized chunks. Assume d is divisible by k.
I want to send each of these chunks to a processor whose rank is less than k.
It would be easy if I could use something like MPI_Scatter, but this function sends to EVERY processor, and I only want to send to a certain number of procs.
So what I did was to loop over k iterations and do k MPI_Isends.
Is this efficient?
(2) If it is, how do I split an array into chunks? There's always the easy way, which is
int size = d/k;
int buffs[k][size];
for (int rank = 0; rank < k; ++rank)
{
    for (int i = 0; i < size; ++i)
        buffs[rank][i] = input[rank*size + i];
    MPI_Isend(&buffs[rank], size, MPI_INT, rank, 1, comm, &request);
}
What you are looking for is MPI_Scatterv which allows you to explicitly specify the length of each chunk and its position relative to the beginning of the buffer. If you don't want to send data to certain ranks, simply set the length of their chunks to 0:
int blen[n];
int displ[n];   // MPI_Scatterv takes the displacements as int, not MPI_Aint

for (int rank = 0; rank < n; rank++)
{
    blen[rank] = (rank < k) ? size : 0;
    displ[rank] = rank * size;
}

int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

MPI_Scatterv(input, blen, displ, MPI_INT,
             mybuf, myrank < k ? size : 0, MPI_INT,
             0, MPI_COMM_WORLD);
Note that for rank >= k the displacements will run past the end of the buffer. That is all right since block lengths are set to zero for rank >= k and no data will be accessed.
As for your original approach, it is not portable and might not always work. The reason is that you are overwriting the same request handle and you never wait for the sends to complete. The correct implementation is:
MPI_Request request[k];
for (int rank = 0; rank < k; ++rank)
{
MPI_Isend(&input[rank*size], size, MPI_INT, rank, 1, comm, &request[rank]);
}
MPI_Waitall(k, request, MPI_STATUSES_IGNORE);
The best implementation would be to use MPI_Scatter in a subcommunicator:
MPI_Comm subcomm;
MPI_Comm_split(MPI_COMM_WORLD, myrank < k ? 0 : MPI_UNDEFINED, myrank,
&subcomm);
// Now there are k ranks in subcomm
// Perform the scatter in the subcommunicator
if (subcomm != MPI_COMM_NULL)
MPI_Scatter(input, size, MPI_INT, mybuf, size, MPI_INT, 0, subcomm);
The MPI_Comm_split call splits MPI_COMM_WORLD and creates a new communicator from all original ranks less than k. It uses the original rank as the key for ordering the ranks in the new communicator, therefore rank 0 in MPI_COMM_WORLD becomes rank 0 in subcomm. Since MPI_Scatter often performs better than MPI_Scatterv, this is the best solution.
I've been learning how to implement MPI over the past couple of weeks and I'm having a very hard time understanding how to set up some of the input arguments for MPI_Allgatherv. I'll use a toy example because I need to take baby steps here. Some of the research I've done is listed at the end of this post (including my previous question, which led me to this question). First, a quick summary of what I'm trying to accomplish:
--Summary-- I'm taking a std::vector A, having multiple processors work on different parts of A, and then taking the updated parts of A and redistributing those updates to all processors. Therefore, all processors start with copies of A, update portions of A, and end with fully updated copies of A.--End--
Let's say I have a std::vector < double > containing 5 elements called "mydata" initialized as follows:
for (int i = 0; i < 5; i++)
{
mydata[i] = (i+1)*1.1;
}
Now let's say I'm running my code on 2 nodes (int tot_proc = 2). I identify the "current" node using "int id_proc," therefore, the root processor has id_proc = 0. Since the number of elements in mydata is odd, I cannot evenly distribute the work between processors. Let's say that I always break the work up as follows:
if (id_proc < tot_proc - 1)
{
//handle mydata.size()/tot_proc elements
}
else
{
//handle whatever is left over
}
In this example, that means:
id_proc = 0 will work on mydata[0] and mydata[1] (2 elements, since 5/2 = 2) … and … id_proc = 1 will work on mydata[2] - mydata[4] (3 elements, since 5/2 + 5%2 = 3)
Once each processor has worked on its respective portion of mydata, I want to use Allgatherv to merge the results together so that mydata on each processor contains all of the updated values. We know Allgatherv takes 8 arguments:
(1) the starting address of the elements/data being sent,
(2) the number of elements being sent,
(3) the type of data being sent, which is MPI_DOUBLE in this example,
(4) the address of the location where you want the data to be received (no mention of a "starting" address),
(5) the number of elements being received,
(6) the "displacements" in memory relative to the receiving location in argument #4,
(7) the type of data being received, again MPI_DOUBLE, and
(8) the communicator you're using, which in my case is simply MPI_COMM_WORLD.
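For reference (my addition, not from the original post), the standard C binding lines those eight arguments up as follows; note that the receive counts and the displacements are arrays with one entry per rank, while the send count is a single int (const qualifiers as of MPI-3):
int MPI_Allgatherv(const void *sendbuf, int sendcount, MPI_Datatype sendtype,
                   void *recvbuf, const int recvcounts[], const int displs[],
                   MPI_Datatype recvtype, MPI_Comm comm);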
Now here's where the confusion begins. Since processor 0 worked on the first two elements, and processor 1 worked on the last 3 elements, then processor 0 will need to SEND the first two elements, and processor 1 will need to SEND the last 3 elements. To me, this suggests that the first two arguments of Allgatherv should be:
Processor 0: MPI_Allgatherv(&mydata[0],2,…
Processor 1: MPI_Allgatherv(&mydata[2],3,…
(Q1) Am I right about that? If so, my next question is in regard to the format of argument 2. Let's say I create a std::vector < int > sendcount such that sendcount[0] = 2, and sendcount[1] = 3.
(Q2) Does Argument 2 require the reference to the first location of sendcount, or do I need to send the reference to the location relevant to each processor? In other words, which of these should I do:
Q2 - OPTION 1
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[0],…
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[0],…
Q2 - OPTION 2
Processor 0: MPI_Allgatherv(&mydata[0], &sendcount[id_proc], … (here id_proc = 0)
Processor 1: MPI_Allgatherv(&mydata[2], &sendcount[id_proc], … (here id_proc = 1)
...On to Argument 4. Since I am collecting different sections of mydata back into itself, I suspect that this argument will look similar to Argument 1. i.e. it should be something like &mydata[?]. (Q3) Can this argument simply be a reference to the beginning of mydata (i.e. &mydata[0]), or do I have to change the index the way I did for Argument 1? (Q4) Imagine I had used 3 processors. This would mean that Processor 1 would be sending mydata[2] and mydata[3] which are in "the middle" of the vector. Since the vector's elements are contiguous, then the data that Processor 1 is receiving has to be split (some goes before, and mydata[4] goes after). Do I have to account for that split in this argument, and if so, how?
...Slightly more confusing to me is Argument 5 but I had an idea this morning. Using the toy example: if Processor 0 is sending 2 elements, then it will be receiving 3, correct? Similarly, if Processor 1 is sending 3 elements, then it is receiving 2. (Q5) So, if I were to create a std::vector < int > recvcount, couldn't I just initialize it as:
for (int i = 0; i < tot_proc; i++)
{
recvcount[i] = mydata.size() - sendcount[i];
}
And if that is true, then do I pass it to Allgatherv as &recvcount[0] or &recvcount[id_proc] (similar to Argument 2)?
Finally, Argument 6. I know this is tied to my input for Argument 4. My guess is the following: if I were to pass &mydata[0] as Argument 4 on all processors, then the displacements are the number of positions in memory that I need to move in order to get to the first location where data actually needs to be received. For example,
Processor 0: MPI_Allgatherv( … , &mydata[0], … , 2, … );
Processor 1: MPI_Allgatherv( … , &mydata[0], … , 0, … );
(Q5) Am I right in thinking that the above two lines means "Processor 0 will receive data beginning at location &mydata[0+2]. Processor 1 will receive data beginning at location &mydata[0+0]." ?? And what happens when the data needs to be split like in Q4? Finally, since I am collecting portions of a vector back into itself (replacing mydata with updated mydata by overwriting it), then this tells me that all processors other than the root process will be receiving data beginning at &mydata[0]. (Q6) If this is true, then shouldn't the displacements be 0 for all processors that are not the root?
Some of the links I've read:
Difference between MPI_allgather and MPI_allgatherv
Difference between MPI_Allgather and MPI_Alltoall functions?
Problem with MPI_Gatherv for std::vector
C++: Using MPI's gatherv to concatenate vectors of differing lengths
http://www.mcs.anl.gov/research/projects/mpi/www/www3/MPI_Allgatherv.html
https://computing.llnl.gov/tutorials/mpi/#Routine_Arguments
My previous post on stackoverflow:
MPI C++ matrix addition, function arguments, and function returns
Most tutorials, etc, that I've read just gloss over Allgatherv.
Part of the confusion here is that you're trying to do an in-place gather; you are trying to send from and receive into the same array. If you're doing that, you should use the MPI_IN_PLACE option, in which case you don't explicitly specify the send location or count. Those are there for the case where you are sending from a different buffer than the one you're receiving into, but in-place gathers are somewhat more constrained.
So this works:
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char **argv) {
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (size < 2) {
        std::cerr << "This demo requires at least 2 procs." << std::endl;
        MPI_Finalize();
        return 1;
    }

    int datasize = 2*size + 1;
    std::vector<int> data(datasize);

    /* break up the elements */
    int *counts = new int[size];
    int *disps  = new int[size];

    int pertask = datasize/size;
    for (int i=0; i<size-1; i++)
        counts[i] = pertask;
    counts[size-1] = datasize - pertask*(size-1);

    disps[0] = 0;
    for (int i=1; i<size; i++)
        disps[i] = disps[i-1] + counts[i-1];

    int mystart = disps[rank];
    int mycount = counts[rank];
    int myend   = mystart + mycount - 1;

    /* everyone initialize our data */
    for (int i=mystart; i<=myend; i++)
        data[i] = 0;

    int nsteps = size;
    for (int step = 0; step < nsteps; step++ ) {
        for (int i=mystart; i<=myend; i++)
            data[i] += rank;

        MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                       &(data[0]), counts, disps, MPI_INT, MPI_COMM_WORLD);

        if (rank == step) {
            std::cout << "Rank " << rank << " has array: [";
            for (int i=0; i<datasize-1; i++)
                std::cout << data[i] << ", ";
            std::cout << data[datasize-1] << "]" << std::endl;
        }
    }

    delete [] disps;
    delete [] counts;

    MPI_Finalize();
    return 0;
}
Running gives
$ mpirun -np 3 ./allgatherv
Rank 0 has array: [0, 0, 1, 1, 2, 2, 2]
Rank 1 has array: [0, 0, 2, 2, 4, 4, 4]
Rank 2 has array: [0, 0, 3, 3, 6, 6, 6]