I'm trying to scatter values among processes belonging to a hypercube group (quicksort project).
Depending on the number of processes, I either create a new communicator excluding the excess processes, or I duplicate MPI_COMM_WORLD if the world size is exactly a power of 2 and thus fits a hypercube.
In both cases, processes other than 0 receive their data, but:
- In the first scenario, process 0 throws segmentation fault 11
- In the second scenario, nothing faults, but the values received by process 0 are gibberish.
NOTE: If I try a regular MPI_Scatter everything works well.
//Input
vector<int> LoadFromFile();
int d; //dimension of hypercube
int p; //active processes
int idle; //idle processes
vector<int> values; //values loaded
int arraySize; //number of total values to distribute
int main(int argc, char* argv[])
{
int mpiWorldRank;
int mpiWorldSize;
int mpiRank;
int mpiSize;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &mpiWorldRank);
MPI_Comm_size(MPI_COMM_WORLD, &mpiWorldSize);
MPI_Comm MPI_COMM_HYPERCUBE;
d = log2(mpiWorldSize);
p = pow(2, d); //Number of processes belonging to the hypercube
idle = mpiWorldSize - p; //number of processes in excess
int toExclude[idle]; //array of idle processes to exclude from communicator
int sendCounts[p]; //array of values sizes to be sent to processes
//
int i = 0;
while (i < idle)
{
toExclude[i] = mpiWorldSize - 1 - i;
++i;
}
//CREATING HYPERCUBE GROUP: Group of size of power of 2 -----------------
MPI_Group world_group;
MPI_Comm_group(MPI_COMM_WORLD, &world_group);
// Remove excessive processors if any from communicator
if (idle > 0)
{
MPI_Group newGroup;
MPI_Group_excl(world_group, 1, toExclude, &newGroup);
MPI_Comm_create(MPI_COMM_WORLD, newGroup, &MPI_COMM_HYPERCUBE);
//Abort any processor not part of the hypercube.
if (mpiWorldRank > p)
{
cout << "aborting: " << mpiWorldRank <<endl;
MPI_Finalize();
return 0;
}
}
else
{
MPI_Comm_dup(MPI_COMM_WORLD, &MPI_COMM_HYPERCUBE);
}
MPI_Comm_rank(MPI_COMM_HYPERCUBE, &mpiRank);
MPI_Comm_size(MPI_COMM_HYPERCUBE, &mpiSize);
//END OF: CREATING HYPERCUBE GROUP --------------------------
if (mpiRank == 0)
{
//STEP1: Read input
values = LoadFromFile();
arraySize = values.size();
}
//Transforming input vector into an array
int valuesArray[values.size()];
if(mpiRank == 0)
{
copy(values.begin(), values.end(), valuesArray);
}
//Broadcast input size to all processes
MPI_Bcast(&arraySize, 1, MPI_INT, 0, MPI_COMM_HYPERCUBE);
//MPI_Scatterv: determining size of arrays to be received and displacement
int nmin = arraySize / p;
int remainingData = arraySize % p;
int displs[p];
int recvCount;
int k = 0;
for (i=0; i<p; i++)
{
sendCounts[i] = i < remainingData
? nmin+1
: nmin;
displs[i] = k;
k += sendCounts[i];
}
recvCount = sendCounts[mpiRank];
int recvValues[recvCount];
//Following MPI_Scatter works well:
// MPI_Scatter(&valuesArray, 13, MPI_INT, recvValues , 13, MPI_INT, 0, MPI_COMM_HYPERCUBE);
MPI_Scatterv(&valuesArray, sendCounts, displs, MPI_INT, recvValues , recvCount, MPI_INT, 0, MPI_COMM_HYPERCUBE);
int j = 0;
while (j < recvCount)
{
cout << "rank " << mpiRank << " received: " << recvValues[j] << endl;
++j;
}
MPI_Finalize();
return 0;
}
First of all, you are supplying the wrong arguments to MPI_Group_excl:
MPI_Group_excl(world_group, 1, toExclude, &newGroup);
// ^
The second argument specifies the number of entries in the exclusion list and should therefore be equal to idle. Since you are excluding a single rank only, the resulting group has mpiWorldSize-1 ranks and hence MPI_Scatterv expects that both sendCounts[] and displs[] have that many elements. Of those, only p elements are properly initialised and the rest are random, therefore MPI_Scatterv crashes in the root.
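With the exclusion list you already build, the corrected call is simply:
MPI_Group_excl(world_group, idle, toExclude, &newGroup);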
Another error is the code that aborts the idle processes: it should read if (mpiWorldRank >= p).
I would recommend replacing the entire exclusion code with a single call to MPI_Comm_split instead:
MPI_Comm comm_hypercube;
int colour = mpiWorldRank >= p ? MPI_UNDEFINED : 0;
MPI_Comm_split(MPI_COMM_WORLD, colour, mpiWorldRank, &comm_hypercube);
if (comm_hypercube == MPI_COMM_NULL)
{
MPI_Finalize();
return 0;
}
When no process supplies MPI_UNDEFINED as its colour, the call is equivalent to MPI_Comm_dup.
Note that you should avoid using names starting with MPI_ in your code, as those could clash with symbols from the MPI implementation.
Additional note: std::vector<T> uses contiguous storage, so you can do without copying the elements into a regular array and simply pass the address of the first element in the call to MPI_Scatter(v):
MPI_Scatterv(&values[0], ...);
I am trying to write a C++ program using MPI, in which each rank sends a matrix to rank 0. When the matrix size is relatively small, the code works perfectly. However, when the matrix size becomes big, the code starts to give a strange error that only happens when I use a specific number of CPUs.
If you feel the full code is too long, please skip directly to the minimal example below.
To avoid overlooking some part, I give the full source code here:
#include <iostream>
#include <mpi.h>
#include <cmath>
int world_size;
int world_rank;
MPI_Comm comm;
int m, m_small, m_small2;
int index(int row, int column)
{
return m * row + column;
}
int index3(int row, int column)
{
return m_small2 * row + column;
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
m = atoi(argv[1]); //Size
int ndims = 2;
int *dims = new int[ndims];
int *period = new int[ndims];
int *coords = new int[ndims];
for (int i=0; i<ndims; i++) dims[i] = 0;
for (int i=0; i<ndims; i++) period[i] = 0;
for (int i=0; i<ndims; i++) coords[i] = 0;
MPI_Dims_create(world_size, ndims, dims);
MPI_Cart_create(MPI_COMM_WORLD, ndims, dims, period, 0, &comm);
MPI_Cart_coords(comm, world_rank, ndims, coords);
double *a, *a_2;
if (0 == world_rank) {
a = new double [m*m];
for (int i=0; i<m; i++) {
for (int j=0; j<m; j++) {
a[index(i,j)] = 0;
}
}
}
/*m_small is along the vertical direction, m_small2 is along the horizontal direction*/
//The upper cells take the remainder of the total lattice points along the vertical direction divided by the cell count along that direction
if (0 == coords[0]){
m_small = int(m / dims[0]) + m % dims[0];
}
else m_small = int(m / dims[0]);
//The left cells take the remainder of the total lattice points along the horizontal direction divided by the cell count along that direction
if (0 == coords[1]) {
m_small2 = int(m / dims[1]) + m % dims[1];
}
else m_small2 = int(m / dims[1]);
double *a_small = new double [m_small * m_small2];
/*Initialization of matrix*/
for (int i=0; i<m_small; i++) {
for (int j=0; j<m_small2; j++) {
a_small[index3(i,j)] = 2.5 ;
}
}
if (0 == world_rank) {
a_2 = new double[m_small*m_small2];
for (int i=0; i<m_small; i++) {
for (int j=0; j<m_small2; j++) {
a_2[index3(i,j)] = 0;
}
}
}
int loc[2];
int m1_rec, m2_rec;
MPI_Request send_req;
MPI_Isend(coords, 2, MPI_INT, 0, 1, MPI_COMM_WORLD, &send_req);
//This Isend may have problem!
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
if (0 == world_rank) {
for (int i = 0; i < world_size; i++) {
MPI_Recv(loc, 2, MPI_INT, i, 1, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
/*Determine the size of matrix for receiving the information*/
if (0 == loc[0]) {
m1_rec = int(m / dims[0]) + m % dims[0];
} else {
m1_rec = int(m / dims[0]);
}
if (0 == loc[1]) {
m2_rec = int(m / dims[1]) + m % dims[1];
} else {
m2_rec = int(m / dims[1]);
}
//This receive may have problem!
MPI_Recv(a_2, m1_rec * m2_rec, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
delete[] a_small;
if (0 == world_rank) {
delete[] a;
delete[] a_2;
}
delete[] dims;
delete[] period;
delete[] coords;
MPI_Finalize();
return 0;
}
Basically, the code reads an input value m and then constructs a big m x m matrix. MPI creates a 2D topology according to the number of CPUs, which divides the big matrix into sub-matrices. The size of each sub-matrix is m_small x m_small2. There should be no problem in these steps.
The problem happens when I send the sub-matrix in each rank to rank-0 using MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req); and MPI_Recv(a_2, m1_rec * m2_rec, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);.
For example, when I run the code with the command mpirun -np 2 ./a.out 183, I get the error
Read -1, expected 133224, errno = 14
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x7fb23b485010
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node dx1-500-24164 exited on signal 11 (Segmentation fault).
Strangely, if I change the number of CPUs or decrease the value of the input argument, the problem is not there anymore. Also, if I just comment out the MPI_Isend/Recv, there is no problem either.
So I am really wondering how to solve this problem.
Edit.1
A minimal example to reproduce the problem.
When the matrix is small there is no problem, but the problem appears when you increase the matrix size (at least for me):
#include <iostream>
#include <mpi.h>
#include <cmath>
int world_size;
int world_rank;
MPI_Comm comm;
int m, m_small, m_small2;
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
m = atoi(argv[1]); //Size
double *a_2;
//Please increase the size of m_small and m_small2 and wait for the problem to happen
m_small = 100;
m_small2 = 200;
double *a_small = new double [m_small * m_small2];
if (0 == world_rank) {
a_2 = new double[m_small*m_small2];
}
MPI_Request send_req;
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
if (0 == world_rank) {
for (int i = 0; i < world_size; i++) {
MPI_Recv(a_2, m_small*m_small2, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUSES_IGNORE);
}
}
delete[] a_small;
if (0 == world_rank) {
delete[] a_2;
}
MPI_Finalize();
return 0;
}
Command to run: mpirun -np 2 ./a.out 183 (the input argument is actually not used by the code this time)
The problem is in the line
MPI_Isend(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
MPI_Isend is a non-blocking send (which you pair with a blocking MPI_Recv), so when it returns, the library may still be using a_small until you wait for the send to complete, e.g. with MPI_Wait(&send_req, MPI_STATUS_IGNORE); only then are you free to reuse a_small. As written, you delete a_small while it may still be in use by the non-blocking send, which accesses freed memory and likely causes the segfault and crash. Try a blocking send instead:
MPI_Send(a_small, m_small*m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
This call returns only when a_small can be reused (including by deletion), though the data may still not have been received by the receiver at that point; it may instead be held in an internal temporary buffer.
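Alternatively, if you want to keep the non-blocking send, a minimal sketch of the fix (using the variables of the minimal example above) is to wait on the request before freeing the buffer:
MPI_Request send_req;
MPI_Isend(a_small, m_small * m_small2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &send_req);
if (0 == world_rank) {
    for (int i = 0; i < world_size; i++) {
        MPI_Recv(a_2, m_small * m_small2, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
// Complete the send before deleting the buffer it references.
MPI_Wait(&send_req, MPI_STATUS_IGNORE);
delete[] a_small;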
I have been struggling to parallelize this function, which calculates interactions between particles. My idea was to use Allgatherv, which should distribute my original buffer to all other processes. Then, using rank, each process runs a loop that calculates its own part. In this program, MPI is wrapped to show stats, which is why I am calling it as mpi->function. Unfortunately, when I run it I receive the following error. Can somebody advise what is wrong?
void calcInteraction() {
totalInteraction = 0.0;
int procs_number;
int rank;
int particle_amount = particles;
mpi->MPI_Comm_size(MPI_COMM_WORLD,&procs_number);
mpi->MPI_Comm_rank(MPI_COMM_WORLD,&rank);
//long sendBuffer[particle_amount];
//int *displs[procs_number];
//long send_counts[procs_number+mod];
int mod = (particle_amount%procs_number);
int elements_per_process = particle_amount/procs_number;
int *send_buffer = new int[particle_amount]; //data to send
int *displs = new int[procs_number]; //displacement
int *send_counts = new int[procs_number];
if(rank == 0)
{
for(int i = 0; i < particle_amount; i++)
{
send_buffer[i] = i; // filling buffer with particles
}
}
for (int i = 0; i < procs_number;i++)
{
send_counts[i] = elements_per_process; // filling buffer since it can't be empty
}
// calculating displacement
displs[ 0 ] = 0;
for ( int i = 1; i < procs_number; i++ )
displs[ i ] = displs[ i - 1 ] + send_counts[ i - 1 ];
int allData = displs[ procs_number - 1 ] + send_counts[ procs_number - 1 ];
int * endBuffer = new int[allData];
int start,end; // initializing indices
cout<<"entering allgather"<<endl;
mpi->MPI_Allgatherv(send_buffer,particle_amount,MPI_INT,endBuffer,send_counts,displs,MPI_INT,MPI_COMM_WORLD);
// ^from ^how many ^send ^receive ^how many ^displ ^send ^communicator
// to send type buffer receive type
start = rank*elements_per_process;
cout<<"start = "<< start <<endl;
if(rank == procs_number) //in case that particle_amount is not even
{
end = (rank+1)*elements_per_process + mod;
}
else
{
end = (rank+1)*elements_per_process;
}
cout<<"end = "<< end <<endl;
cout << "calcInteraction" << endl;
for (long idx = start; idx < end; idx++) {
for (long idxB = start; idxB < end; idxB++) {
if (idx != idxB) {
totalInteraction += physics->interaction(x[idx], y[idx], z[idx], age[idx], x[idxB], y[idxB],
z[idxB], age[idxB]);
}
}
}
cout << "calcInteraction - done" << endl;
}
You are not using MPI_Allgatherv() correctly.
I had an idea to use Allgatherv which should distribute my original
buffer to all other processes.
The description suggests you need MPI_Scatter[v](), which slices your array on a given rank and distributes the chunks to all the MPI tasks.
If all tasks should receive the full array, then MPI_Bcast() is what you need.
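For reference, broadcasting the full array from rank 0 would look like this (a sketch using the question's variables, assuming the wrapper exposes MPI_Bcast like the other calls):
mpi->MPI_Bcast(send_buffer, particle_amount, MPI_INT, 0, MPI_COMM_WORLD);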
Anyway, let's assume you need an all gather.
First, you must ensure all tasks have the same particles value.
Second, since you gather the same amount of data from every MPI task and store it in a contiguous location, you can simplify your code with MPI_Allgather(). If only the last task might have a bit less data, then you can use MPI_Allgatherv() (but this is not what your code is currently doing) or transmit some ghost data so you can use the simpler (and probably more optimized) MPI_Allgather().
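As an illustration, if every task held its own chunk of elements_per_process values in a local buffer (local_chunk is a hypothetical name, not from the original code), the simpler collective would be:
mpi->MPI_Allgather(local_chunk, elements_per_process, MPI_INT,
                   endBuffer, elements_per_process, MPI_INT, MPI_COMM_WORLD);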
Last but not least, you should send elements_per_process elements (and not particle_amount). That should be enough to get rid of the crash (e.g. MPI_ERR_TRUNCATE). That being said, I am not sure it will achieve the result you need or expect.
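Concretely, keeping the rest of the original code, the corrected call would be:
mpi->MPI_Allgatherv(send_buffer, elements_per_process, MPI_INT,
                    endBuffer, send_counts, displs, MPI_INT, MPI_COMM_WORLD);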
I am writing a program that calculates the sum of every number from 1 up to 1000, i.e. 1+2+3+4+5+...+1000. First, I assign the summation jobs to 10 processors: processor 0 gets 1-100, processor 1 gets 101-200, and so on. The partial sums are stored in an array.
After all summations have been done in parallel, the processors send their values to processor 0 (processor 0 receives the values using non-blocking send/recv), and processor 0 sums up all the values and displays the result.
Here is the code:
#include <mpi.h>
#include <iostream>
using namespace std;
int summation(int, int);
int main(int argc, char ** argv)
{
int * array;
int total_proc;
int curr_proc;
int limit = 0;
int partial_sum = 0;
int upperlimit = 0, lowerlimit = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &total_proc);
MPI_Comm_rank(MPI_COMM_WORLD, &curr_proc);
MPI_Request send_request, recv_request;
/* checking if 1000 is divisible by number of procs, else quit */
if(1000 % total_proc != 0)
{
MPI_Finalize();
if(curr_proc == 0)
cout << "**** 1000 is not divisible by " << total_proc << " ...quitting..."<< endl;
return 0;
}
/* number of partial summations */
limit = 1000/total_proc;
array = new int [total_proc];
/* assigning jobs to processors */
for(int i = 0; i < total_proc; i++)
{
if(curr_proc == i)
{
upperlimit = upperlimit + limit;
lowerlimit = (upperlimit - limit) + 1;
partial_sum = summation(upperlimit, lowerlimit);
array[i] = partial_sum;
}
else
{
upperlimit = upperlimit + limit;
lowerlimit = (upperlimit - limit) + 1;
}
}
cout << "** Partial Sum From Process " << curr_proc << " is " << array[curr_proc] << endl;
/* send and receive - non blocking */
for(int i = 1; i < total_proc; i++)
{
if(curr_proc == i) /* (i = current processor) */
{
MPI_Isend(&array[i], 1, MPI_INT, 0, i, MPI_COMM_WORLD, &send_request);
cout << "-> Process " << i << " sent " << array[i] << " to Process 0" << endl;
MPI_Irecv(&array[i], 1, MPI_INT, i, i, MPI_COMM_WORLD, &recv_request);
//cout << "<- Process 0 received " << array[i] << " from Process " << i << endl;
}
}
MPI_Finalize();
if(curr_proc == 0)
{
for(int i = 1; i < total_proc; i++)
array[0] = array[0] + array[i];
cout << "Sum is " << array[0] << endl;
}
return 0;
}
int summation(int u, int l)
{
int result = 0;
for(int i = l; i <= u; i++)
result = result + i;
return result;
}
Output:
** Partial Sum From Process 0 is 5050
** Partial Sum From Process 3 is 35050
-> Process 3 sent 35050 to Process 0
<- Process 0 received 35050 from Process 3
** Partial Sum From Process 4 is 45050
-> Process 4 sent 45050 to Process 0
<- Process 0 received 45050 from Process 4
** Partial Sum From Process 5 is 55050
-> Process 5 sent 55050 to Process 0
<- Process 0 received 55050 from Process 5
** Partial Sum From Process 6 is 65050
** Partial Sum From Process 8 is 85050
-> Process 8 sent 85050 to Process 0
<- Process 0 received 85050 from Process 8
-> Process 6 sent 65050 to Process 0
** Partial Sum From Process 1 is 15050
** Partial Sum From Process 2 is 25050
-> Process 2 sent 25050 to Process 0
<- Process 0 received 25050 from Process 2
<- Process 0 received 65050 from Process 6
** Partial Sum From Process 7 is 75050
-> Process 1 sent 15050 to Process 0
<- Process 0 received 15050 from Process 1
-> Process 7 sent 75050 to Process 0
<- Process 0 received 75050 from Process 7
** Partial Sum From Process 9 is 95050
-> Process 9 sent 95050 to Process 0
<- Process 0 received 95050 from Process 9
Sum is -1544080023
Printing the contents of the array:
5050
536870912
-1579286148
-268433415
501219332
32666
501222192
32666
1
0
I'd like to know what is causing this.
If I print the array before MPI_Finalize is invoked it works fine.
The most important flaw in your program is how you divide the work. In MPI, every process executes the main function. Therefore, you must ensure that all the processes execute your summation function if you want them to collaborate on building the result.
You don't need the for loop. Every process executes main separately. They just have different curr_proc values, and you can compute which portion of the job each one has to perform based on that:
/* assigning jobs to processors */
int chunk_size = 1000 / total_proc;
lowerlimit = curr_proc * chunk_size + 1; /* summation() includes both bounds */
upperlimit = (curr_proc + 1) * chunk_size;
partial_sum = summation(upperlimit, lowerlimit);
Then, the way the master process receives all the other processes' partial sums is not correct:
MPI rank values (curr_proc) run from 0 up to total_proc - 1 (one less than the value returned by MPI_Comm_size).
In your loop, each rank i > 0 posts both the send and the matching receive itself, so rank 0 never receives anything.
You are using the immediate versions of send and receive, MPI_Isend and MPI_Irecv, but you never wait until those requests are completed. You should use MPI_Waitall for that purpose.
The correct version would be something like the following:
if( curr_proc == 0 ) {
    // master process receives all data
    for( int i = 1; i < total_proc; i++ )
        MPI_Recv( &array[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
} else {
    // other processes send their data to the master
    MPI_Send( &partial_sum, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );
}
This all-to-one communication pattern is known as gather. In MPI there is a function which already performs this functionality: MPI_Gather.
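As a sketch with the variables above, gathering every rank's partial_sum into rank 0's array would be:
MPI_Gather( &partial_sum, 1, MPI_INT, array, 1, MPI_INT, 0, MPI_COMM_WORLD );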
Finally, what you intend to perform is known as a reduction: take a given amount of numeric values and generate a single output value by repeatedly applying a single operation (a sum, in your case). In MPI there is a function which does that, too: MPI_Reduce.
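A minimal sketch of that reduction, assuming each rank has computed partial_sum as above:
int total = 0;
// combine every rank's partial_sum into total on rank 0
MPI_Reduce( &partial_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );
if( curr_proc == 0 )
    cout << "Sum is " << total << endl;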
I strongly suggest you do some basic guided exercises before trying to build your own program. MPI is difficult to understand at the beginning, and building a good base is vital for adding complexity later on. A hands-on tutorial is also a good way of getting started with MPI.
EDIT: I forgot to mention that you don't need to enforce an even division of the problem size (1000 in this case) by the number of resources (total_proc). Depending on the case, you can either assign the remainder to a single process:
chunk_size = 1000 / total_proc;
lowerlimit = curr_proc * chunk_size + 1;
upperlimit = (curr_proc + 1) * chunk_size;
if( curr_proc == total_proc - 1 )
    upperlimit += 1000 % total_proc; /* the last rank absorbs the remainder */
Or balance it as much as possible:
int rem = 1000 % total_proc;
int extra = curr_proc < rem ? 1 : 0;              /* this rank takes one extra element */
lowerlimit = curr_proc * chunk_size               /* as usual */
             + min( curr_proc, rem ) + 1;         /* plus the remainder handed out so far */
upperlimit = lowerlimit + chunk_size - 1 + extra; /* inclusive upper bound */
In the second case the load imbalance is at most 1, while in the first case it can reach total_proc - 1 in the worst case.
You're only initializing array[i], the element that corresponds to the curr_proc id. The other elements in that array will be uninitialized, resulting in random values. In your send/receive print loop, you only access the initialized element.
I'm not that familiar with MPI, so I'm guessing, but you might want to allocate array before calling MPI_Init. Or call MPI_Recv on process 0, not on each individual one.
The code here is CUDA code meant to find the shortest path between every pair of nodes using Dijkstra's algorithm.
My code logic works perfectly in a C program, but not in CUDA. I'm using 1 block with N threads, N being user-entered.
First doubt: every thread has its own copy of the variables except the shared variable temp. Correct?
When I print the results, I store all the values in array d and print them, and they are zero for every entry. This is possible only if control never enters the loop after s = threadIdx.x.
Please help; I have been debugging this for the last 24 hours.
Given Input is:
Number of vertices: 4
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : 0 1 1
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : 0 2 5
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : 0 3 2
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : 1 3 4
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : 2 3 7
enter the source,destination and cost of the edge\n Enter -1 to end
Input\n Edges start from Zero : -1 -1 -1
#include<stdio.h>
#include<stdlib.h>
#include<time.h>
#include<sys/time.h>
#define nano 1000000L
__global__ void dijkstras(int *a, int *b, int *n)
{
int i;
int d[10],p[10],v[10];
// d stores distance/cost of each path
// p stores path taken
// v stores the nodes already travelled to
int k,u,s;
int check =0;
// shared memory on cuda device
__shared__ int temp[20];
for(i=0; i < (*n)*(*n); i++)
{
temp[i] = a[i];
}
check = check + 1;
__syncthreads();
// we're passing int s -- node from which distances are calculated
s = threadIdx.x;
for(i=0; i<(*n); i++)
{
d[i]=temp[s*(*n)+i];
if(d[i]!=9999)
p[i]=1;
else
p[i]=0;
v[i]=0;
}
p[s]=0;
v[s]=1;
for(i=0; i<((*n)-1); i++)
{
// findmin starts here
int i1,j1,min=0;
for(i1=0;i1<(*n);i1++)
{
if(v[i1]==0)
{
min=i1;
break;
}
}
for(j1=min+1;j1<(*n);j1++)
{
if((v[j1]==0) && (d[j1]<d[min]))
min=j1;
}
k = min;
// findmin ends here
v[k]=1;
for(u=0; u<(*n); u++)
{
if((v[u]==0) && (temp[k*(*n)+u]!=9999))
{
if(d[u]>d[k]+temp[k*(*n)+u])
{
d[u]=d[k]+temp[k*(*n)+u];
p[u]=k;
}
}
}
//storing output
int count = 0;
for(i = (s*(*n)); i< (s+1) * (*n); i++)
{
b[i] = d[count];
count++;
}
}
*n = check;
}
int main()
{
int *a, *b, *n;
int *d_a, *d_b, *d_n;
int i,j,c;
int check = 0;
printf("enter the number of vertices.... : ");
n = (int*)malloc(sizeof(int));
scanf("%d",n);
int size = (*n) * (*n) * sizeof(int);
//allocating device memory
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_n, sizeof(int));
a = (int*)malloc(size);
b = (int*)malloc(size);
check = check +1;
for(i=0; i<(*n); i++)
for(j=0; j<=i; j++)
if(i==j)
a[(i*(*n) + j)]=0;
else
a[(i*(*n) + j)]=a[(j*(*n) + i)]=9999;
printf("\nInitial matrix is\n");
for(i=0;i<(*n);i++)
{
for(j=0;j<(*n);j++)
{
printf("%d ",a[i*(*n)+j]);
}
printf("\n");
}
while(1)
{
printf("\n enter the source,destination and cost of the edge\n Enter -1 to end Input\n Edges start from Zero : \n");
scanf("%d %d %d",&i,&j,&c);
if(i==-1)
break;
a[(i*(*n) + j)]=a[(j*(*n) + i)]=c;
}
printf("\nInput matrix is\n");
for(i=0;i<(*n);i++)
{
for(j=0;j<(*n);j++)
{
printf("%d ",a[i*(*n)+j]);
}
printf("\n");
}
check = check +1;
// copying input matrix to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_n, n, sizeof(int), cudaMemcpyHostToDevice);
check++;
struct timeval start,stop;
double time;
int N = *n;
gettimeofday(&start,NULL);
dijkstras<<<1,N>>>(d_a, d_b, d_n);
gettimeofday(&stop,NULL);
time=(double)(stop.tv_sec-start.tv_sec)+(double)(stop.tv_usec-start.tv_usec)/(double)nano;
printf("\n TIME TAKEN: %lf\n",time);
check++;
// copying result from device to host
cudaMemcpy(b, d_b, size, cudaMemcpyDeviceToHost);
cudaMemcpy(n, d_n, sizeof(int), cudaMemcpyDeviceToHost);
check++;
// printing result
printf("the shortest paths are....");
for(i=0; i<(N); i++)
{
for(j=0; j<(N); j++)
{
if(i != j)
printf("\n the cost of the path from %d to %d = %d\n",i,j,b[i*(N) + j]);
}
printf("\n\n");
}
printf("your debug value of check in main is %d\n",check); //5
printf("your debug value of check in device is %d\n",*n); // 1+ 7+ 10
free(a); free(b);free(n);
cudaFree(d_a); cudaFree(d_b);cudaFree(d_n);
}
The root cause of this problem was supplying an uninitialised device variable as a kernel argument. In this kernel call:
dijkstras<<<1,N>>>(d_a, d_b, d_n);
d_n had been allocated memory, but never assigned a value, resulting in undefined behaviour within the kernel.
I would contend this proved hard for the original poster to detect because of a poor design decision in the kernel itself. In this prototype:
__global__ void dijkstras(int *a, int *b, int *n)
n was being used as both an input and an output, with two completely different meanings, which made the problem with the call far harder to detect. If the prototype were:
__global__ void dijkstras(int *a, int *b, int n, int *check)
then the roles of n and check would be far clearer, and the likelihood of making a mistake when calling the kernel, and of missing it when debugging, would be lessened.
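A minimal sketch of the resulting call site, assuming check gets its own device allocation (d_check is a name introduced here for illustration):
int *d_check;
cudaMalloc((void **)&d_check, sizeof(int));
// the problem size is now passed by value, so no device copy of n is needed
dijkstras<<<1, N>>>(d_a, d_b, *n, d_check);
cudaMemcpy(&check, d_check, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_check);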
I am trying to develop a parallel random walker simulation with MPI and C++.
In my simulation, each process can be thought of as a cell which can contain particles (random walkers). The cells are aligned in one dimension with periodic boundary conditions (i.e. ring topology).
In each time step, a particle can stay in its cell or go into the left or right neighbour cell with a certain probability. To make it a bit easier, only the last particle in each cell's list can walk. If the particle walks, it has to be sent to the process with the according rank (MPI_Isend + MPI_Probe + MPI_Recv + MPI_Waitall).
However, after the first step my particles start disappearing, i.e. the messages are getting 'lost' somehow.
Below is a minimal example (sorry if it's still rather long). To better track the particle movements, each particle has an ID which corresponds to the rank of the process in which it started. After each step, each cell prints the IDs of the particles stored in it.
#include <mpi.h>
#include <vector>
#include <iostream>
#include <random>
#include <string>
#include <sstream>
#include <chrono>
#include <algorithm>
using namespace std;
class Particle
{
public:
int ID; // this is the rank of the process which initialized the particle
Particle () : ID(0) {};
Particle (int ID) : ID(ID) {};
};
stringstream msg;
string msgString;
int main(int argc, char** argv)
{
// Initialize the MPI environment
MPI_Init(NULL, NULL);
// Get the number of processes
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
// Get the rank of the process
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
// communication declarations
MPI_Status status;
// get the ranks of neighbors (periodic boundary conditions)
int neighbors[2];
neighbors[0] = (world_size + world_rank - 1) % world_size; // left neighbor
neighbors[1] = (world_size + world_rank + 1) % world_size; // right neighbor
// declare particle type
MPI_Datatype type_particle;
MPI_Type_contiguous (1, MPI_INT, &type_particle);
MPI_Type_commit (&type_particle);
// every process inits 1 particle with ID = world_rank
vector<Particle> particles;
particles.push_back (Particle(world_rank));
// obtain a seed from the timer
typedef std::chrono::high_resolution_clock myclock;
myclock::time_point beginning = myclock::now();
myclock::duration d = myclock::now() - beginning;
unsigned seed2 = d.count();
default_random_engine generator (seed2);
uniform_real_distribution<double> distribution (0, 1);
// ------------------------------------------------------------------
// begin time loop
//-------------------------------------------------------------------
for (int t=0; t<10; t++)
{
// ------------------------------------------------------------------
// 1) write a message string containing the current list of particles
//-------------------------------------------------------------------
// write the rank and the particle IDs into the msgString
msg << "rank " << world_rank << ": ";
for (auto& i : particles)
{
msg << i.ID << " ";
}
msg << "\n";
msgString = msg.str();
msg.str (string()); msg.clear ();
// to print the messages in order, the messages are gathered by root (rank 0) and then printed
// first, gather nums to root
int num = msgString.size();
int rcounts[world_size];
MPI_Gather( &num, 1, MPI_INT, rcounts, 1, MPI_INT, 0, MPI_COMM_WORLD);
// root now has correct rcounts, using these we set displs[] so
// that data is placed contiguously (or concatenated) at receive end
int displs[world_size];
displs[0] = 0;
for (int i=1; i<world_size; ++i)
{
displs[i] = displs[i-1]+rcounts[i-1]*sizeof(char);
}
// create receive buffer
int rbuf_size = displs[world_size-1]+rcounts[world_size-1];
char *rbuf = new char[rbuf_size];
// gather the messages
MPI_Gatherv( &msgString[0], num, MPI_CHAR, rbuf, rcounts, displs, MPI_CHAR,
0, MPI_COMM_WORLD);
// root prints the messages
if (world_rank == 0)
{
cout << endl << "step " << t << endl;
for (int i=0; i<rbuf_size; i++)
cout << rbuf[i];
}
// ------------------------------------------------------------------
// 2) send particles randomly to neighbors
//-------------------------------------------------------------------
Particle snd_buf;
int sndDest = -1;
// 2a) if there are particles left, prepare a message. otherwise, proceed to step 2b)
if (!particles.empty ())
{
// write the last particle in the list to a buffer
snd_buf = particles.back ();
// flip a coin. with a probability of 50 %, the last particle in the list gets sent to a random neighbor
double rnd = distribution (generator);
if (rnd <= .5)
{
particles.pop_back ();
// pick random neighbor
if (rnd < .25)
{
sndDest = neighbors[0]; // send to the left
}
else
{
sndDest = neighbors[1]; // send to the right
}
}
}
// 2b) always send a message to each neighbor (even if it's empty)
MPI_Request requests[2];
for (int i=0; i<2; i++)
{
int dest = neighbors[i];
MPI_Isend (
&snd_buf, // void* data
sndDest==dest ? 1 : 0, // int count <---------------- send 0 particles to every neighbor except the one specified by sndDest
type_particle, // MPI_Datatype
dest, // int destination
0, // int tag
MPI_COMM_WORLD, // MPI_Comm
&requests[i]
);
}
// ------------------------------------------------------------------
// 3) probe and receive messages from each neighbor
//-------------------------------------------------------------------
for (int i=0; i<2; i++)
{
int src = neighbors[i];
// probe to determine if the message is empty or not
MPI_Probe (
src, // int source,
0, // int tag,
MPI_COMM_WORLD, // MPI_Comm comm,
&status // MPI_Status* status
);
int nRcvdParticles = 0;
MPI_Get_count (&status, type_particle, &nRcvdParticles);
// if the message is non-empty, receive it
if (nRcvdParticles > 0) // this proc can receive max. 1 particle from each neighbor
{
Particle rcv_buf;
MPI_Recv (
&rcv_buf, // void* data
1, // int count
type_particle, // MPI_Datatype
src, // int source
0, // int tag
MPI_COMM_WORLD, // MPI_Comm comm
MPI_STATUS_IGNORE // MPI_Status* status
);
// add received particle to the list
particles.push_back (rcv_buf);
}
}
MPI_Waitall (2, requests, MPI_STATUSES_IGNORE);
}
// ------------------------------------------------------------------
// end time loop
//-------------------------------------------------------------------
// Finalize the MPI environment.
MPI_Finalize();
if (world_rank == 0)
cout << "\nMPI_Finalize()\n";
return 0;
}
I ran the simulation with 8 processes and below is a sample of the output. In step 1, it still seems to work well, but beginning with step 2 the particles begin disappearing.
step 0
rank 0: 0
rank 1: 1
rank 2: 2
rank 3: 3
rank 4: 4
rank 5: 5
rank 6: 6
rank 7: 7
step 1
rank 0: 0
rank 1: 1
rank 2: 2 3
rank 3:
rank 4: 4 5
rank 5:
rank 6: 6 7
rank 7:
step 2
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4: 4
rank 5:
rank 6: 6 7
rank 7:
step 3
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4:
rank 5:
rank 6: 6
rank 7:
step 4
rank 0: 0
rank 1:
rank 2: 2
rank 3:
rank 4:
rank 5:
rank 6: 6
rank 7:
I have no idea what's wrong with the code... Somehow, the combination MPI_Isend + MPI_Probe + MPI_Recv + MPI_Waitall does not seem to work... Any help is really appreciated!
There is an error in your code. The following logic (irrelevant code and arguments omitted) is wrong:
MPI_Probe(..., &status);
MPI_Get_count (&status, type_particle, &nRcvdParticles);
// if the message is non-empty, receive it
if (nRcvdParticles > 0)
{
MPI_Recv();
}
MPI_Probe does not remove zero-sized messages from the message queue. The only MPI calls that do so are MPI_Recv and the combination of MPI_Irecv + MPI_Test/MPI_Wait. You must receive all messages, including zero-sized ones, otherwise they will block the reception of further messages with the same (source, tag) combination. Although receiving a zero-sized message writes nothing into the receive buffer, it removes the message envelope from the queue, after which the next matching message can be received.
Solution: move the call to MPI_Recv before the conditional operator.
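A minimal sketch of the corrected receive loop from the code above (always receive, then keep the particle only if one actually arrived):
for (int i=0; i<2; i++)
{
    int src = neighbors[i];
    MPI_Probe (src, 0, MPI_COMM_WORLD, &status);
    int nRcvdParticles = 0;
    MPI_Get_count (&status, type_particle, &nRcvdParticles);
    Particle rcv_buf;
    // always receive, even zero-sized messages, so the envelope is removed from the queue
    MPI_Recv (&rcv_buf, 1, type_particle, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (nRcvdParticles > 0)
        particles.push_back (rcv_buf);
}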