Deadlock in MPI_Reduce() when run on multiple nodes - c++

I have a problem with my MPI code: it hangs when run on multiple nodes but completes successfully when run on a single node. I am not sure how to debug this. Can someone help me debug this issue?
Program Usage:
mpicc -o string strin.cpp
mpirun -np 4 -npernode 2 -hostfile hosts ./string 12 0.1 0.9 10 2
My Code:
#include <iostream>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
int main ( int argc, char **argv )
{
float *y, *yold;
float *v, *vold;
int nprocs, myid;
FILE *f = NULL;
MPI_Status status;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
// const int NUM_MASSES = 1000;
// const float Ktension = 0.1;
// const float Kdamping = 0.9;
// const float duration = 10.0;
#if 0
if ( argc != 5 ) {
std::cout << "usage: " << argv[0] << " NUM_MASSES durationInSecs Ktension Kdamping\n";
return 2;
}
#endif
int NUM_MASSES = atoi ( argv[1] );
float duration = atof ( argv[2] );
float Ktension = atof ( argv[3] );
float Kdamping = atof ( argv[4] );
const int PICKUP_POS = NUM_MASSES / 7; // change this for diff harmonics
const int OVERSAMPLING = 16; // run sim at this multiple of audio sampling rate
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name, &namelen);
// open output file
if (myid == 0) {
f = fopen ( "rstring.raw", "wb" );
if (!f) {
std::cout << "can't open output file\n";
return 1;
}
}
// allocate displacement and velocity arrays
y = new float[NUM_MASSES];
yold = new float[NUM_MASSES];
v = new float[NUM_MASSES];
// initialize displacements (pluck it!) and velocities
for (int i = 0; i < NUM_MASSES; i++ ) {
v[i] = 0.0f;
yold[i] = y[i] = 0.0f;
if (i == NUM_MASSES/2 )
yold[i] = 1.0; // impulse at string center
}
// Broadcast data
//MPI_Bcast(y, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//MPI_Bcast(yold, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//MPI_Bcast(v, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//int numIters = duration * 44100 * OVERSAMPLING;
int numIters = atoi( argv[5] );
for ( int t = 0; t < numIters; t++ ) {
// for each mass element
float sum = 0;
float gsum = 0;
int i_start;
int i_end ;
i_start = myid * (NUM_MASSES/nprocs);
i_end = i_start + (NUM_MASSES/nprocs);
for ( int i = i_start; i < i_end; i++ ) {
if ( i == 0 || i == NUM_MASSES-1 ) {
} else {
float accel = Ktension * (yold[i+1] + yold[i-1] - 2*yold[i]);
v[i] += accel;
v[i] *= Kdamping;
y[i] = yold[i] + v[i];
sum += y[i];
}
}
MPI_Reduce(&sum, &gsum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
float *tmp = y;
y = yold;
yold = tmp;
if (myid == 0) {
//printf("%f\n", gsum);
if ( t % OVERSAMPLING == 0 ) {
fwrite ( &gsum, sizeof(float), 1, f );
}
}
}
if (myid == 0) {
fclose ( f );
}
MPI_Finalize();
}

If you have the possibility to do it, you may try to run your application inside a parallel debugger (like TotalView).
Otherwise, when the program hangs, you can attach a freely available serial debugger (like GDB) to one process at a time to see where the potential problem may be located.
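For example, a rough sketch of attaching GDB (node01 and the PID are placeholders; find the hung rank's PID with ps on each node):
$ ssh node01
$ ps aux | grep ./string      # find the PID of the hanging rank
$ gdb -p <PID>
(gdb) bt                      # backtrace shows where the rank is stuck, e.g. inside MPI_Reduce
(gdb) detach
(gdb) quit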

My guess is that you are waiting to receive a message that no node ever sends. If every node first tries to receive a message, which node will send it?
You can modify the program, for example: if id == 0 then send(msg), else receive(&msg), and try using timeouts.
Write down on a piece of paper how it works and how the nodes interact, and you will see where the problem is.

I finally found the answer on the Open MPI mailing list. I think the problem is caused by the way my hosts are set up.
I guess the TCP BTL gets confused by virtual interfaces (vmnet?) when running on multiple nodes. I limited the interfaces used with the "--mca btl_tcp_if_include eth0" argument, and this solved my issue.
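For reference, the run command then looks something like this (same arguments as above, only the MCA option added):
mpirun --mca btl_tcp_if_include eth0 -np 4 -npernode 2 -hostfile hosts ./string 12 0.1 0.9 10 2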

Making Mandelbrot with MPI
So I've made a Mandelbrot generator and everything worked fine. Now I'm throwing in a speedup from MPI. Process 0 generates a file named mbrot.ppm and adds the appropriate metadata, then divides up the workload into chunks.
Each process receives its chunk's starting and ending positions and gets to work calculating its portion of the Mandelbrot set. To write to the mbrot.ppm file, each process saves its data in an array so it doesn't write to the file before the previous process finishes.
My Problem
It's a runtime error that says:
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node Lenovo exited on signal 11 (Segmentation fault).
I believe it comes from the line int data[3][xrange][yrange]; (line 120) since the print statement after this line never executes. Would there be an obvious reason I'm missing why this multi-dimensional array is causing me problems?
Full Code
#include <iostream>
#include <mpi.h>
#include <unistd.h>
#include <stdlib.h>
#include <math.h>
#include <fstream>
#define MCW MPI_COMM_WORLD
using namespace std;
struct Complex {
double r;
double i;
};
Complex operator + (Complex s, Complex t) {
Complex v;
v.r = s.r + t.r;
v.i = s.i + t.i;
return v;
};
Complex operator * (Complex s, Complex t) {
Complex v;
v.r = s.r * t.r - s.i * t.i;
v.i = s.r * t.i + s.i * t.r;
return v;
};
int rcolor(int iters) {
if (iters == 255) return 0;
return 32 * (iters % 8);
};
int gcolor(int iters) {
if (iters == 255) return 0;
return 32 * (iters % 8);
};
int bcolor(int iters) {
if (iters == 255) return 0;
return 32 * (iters % 8);
};
int mbrot(Complex c, int maxIters) {
int i = 0;
Complex z;
z = c;
while (i < maxIters && z.r * z.r + z.i * z.i < 4) {
z = z * z + c;
i++;
}
return i;
};
int main(int argc, char * argv[]) {
int rank, size;
MPI_Init( & argc, & argv);
MPI_Comm_rank(MCW, & rank);
MPI_Comm_size(MCW, & size);
if (size < 2) {
printf("Not an MPI process if only 1 process runs.\n");
exit(1);
}
if (size % 2 != 0) {
printf("Please use a even number\n");
exit(1);
}
Complex c1, c2, c;
char path[] = "brot.ppm";
int DIM;
int chunk[4];
c1.r = -1;
c1.i = -1;
c2.r = 1;
c2.i = 1;
if (rank == 0) { //start the file
ofstream fout;
fout.open(path);
DIM = 2000; // pixel dimensions
fout << "P3" << endl; // The file type .ppm
fout << DIM << " " << DIM << endl; // dimensions of the image
fout << "255" << endl; // color depth
fout.close();
// making dimesions marks
for (int i = 0; i < size; i++) {
chunk[0] = 0; // startX
chunk[1] = DIM; // endX
chunk[2] = (DIM / size) * i; // startY
chunk[3] = (DIM / size) * (i + 1); // endY
MPI_Send(chunk, 4, MPI_INT, i, 0, MCW);
};
};
MPI_Recv(chunk, 4, MPI_INT, 0, 0, MCW, MPI_STATUS_IGNORE);
printf("Process %d recieved chunk\n\t StartX: %d, EndX: %d\n\t StartY: %d, EndY: %d\n", rank, chunk[0], chunk[1], chunk[2], chunk[3]);
// do stuff save in array
// data[3 elements][Xs][Ys]
int xrange = chunk[1] - chunk[0];
int yrange = chunk[3] - chunk[2];
printf("Process %d, x: %d, y: %d\n", rank, xrange, yrange);
int data[3][xrange][yrange];
printf("done\n");
// generate data for mandlebrot
for (int j = chunk[2]; j < chunk[3]; ++j) {
for (int i = chunk[0]; i < chunk[1]; ++i) {
// calculate one pixel of the DIM x DIM image
c.r = (i * (c1.r - c2.r) / DIM) + c2.r;
c.i = (j * (c1.i - c2.i) / DIM) + c2.i;
int iters = mbrot(c, 255);
data[0][i][j] = rcolor(iters);
data[1][i][j] = gcolor(iters);
data[2][i][j] = bcolor(iters);
}
}
printf("here2\n");
// taking turns to write their data to file
for (int k = 0; k < size; k++) {
if (rank == k) {
ofstream fout;
fout.open(path, ios::app);
fout << rank << " was here" << endl;
for (int j = chunk[2]; j < chunk[3]; ++j) {
for (int i = chunk[0]; i < chunk[1]; ++i) {
fout << data[0][i][j] << " " << data[1][i][j] << " " << data[2][i][j] << " ";
}
fout << endl;
}
printf("Process %d done and waiting\n", rank);
} else {
MPI_Barrier(MCW);
}
}
MPI_Finalize();
};
How to Run
$ mpic++ -o mbrot.out mbrot.cpp
$ mpirun -np 4 mbrot.out

MPI and Segmentation Faults

Alright so this program is meant to simulate a solar system by semi-randomly generating a star, semi-randomly generating planets around the star, simulating the passing of time (using MPI to spread out the computational load), and determining habitability of resulting planets. I should have it commented for readability.
I am, however, having a problem getting MPI working. As far as I can tell, I'm doing something wrong that prevents it from initializing properly. Here are the errors I get:
OrbitPlus.cpp:323:50: error: invalid conversion from ‘char’ to ‘char**’ [-fpermissive]
system1 = Time( system, n , dt , argc, **argv);
^
OrbitPlus.cpp:191:33: error: initializing argument 5 of ‘std::vector<std::vector<float> > Time(std::vector<std::vector<float> >, int, float, int, char**)’ [-fpermissive]
std::vector<std::vector<float>> Time( std::vector<std::vector<float>> system , int n, float dt, int argc, char **argv){
^
I do find it interesting that both errors are treated as -fpermissive errors when I compile it with:
mpic++ -std=c++11 -o OrbitPlus OrbitPlus.cpp
So it seems that if I were feeling adventurous I could just run the code with the -fpermissive option and roll the dice, but I don't feel like being so brave. Clearly the errors are related to each other.
Here's my code.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <tuple>
#include <vector>
#include <stdio.h>
#include <math.h>
#include <complex>
#include <stdint.h>
#include <time.h>
#include <string.h>
#include <algorithm>
#include "mpi.h"
double MyRandom(){
//////////////////////////
//Random Number Generator
//Returns number between 0-99
//////////////////////////
double y = 0;
unsigned seed = time(0);
std::srand(seed);
uint64_t x = std::rand();
x ^= x << 13;
x ^= x >> 7;
x ^= x << 17;
x = (1070739 * x) % 2199023255530;
y = x / 21990232555.31 ;
return y;
}
////////////////////////
///////////////////////
std::tuple< char , float , float , float , int > Star(){
////////////////////////////
//Star will generate a Star
//Randomly or User Selected
//Class, Luminosity, Probability, Radius, Mass, Temperature
//Stars always take up 99% of the mass of the system.
///////////////////////////
char Class;
int choice = 8;
float L, R, M, T;
double y = 4;
std::tuple< char , float , float , float , float > star( Class , L , R , M , T) ;
std::cout << "Select Star Class (OBAFGKM) or Select 8 for Random" << std::endl;
std::cout << "1 = O, 2 = B, 3 = A, 4 = F, 5 = G, 6 = K, 7 = M : ";
std::cin >> choice;
if ( choice == 8 ) {
y = MyRandom();
if (y <= 0.003) choice = 1;
if ((y > 0.003) && (y <= 0.133)) choice = 2;
if ((y > 0.133) && (y <= 0.733)) choice = 3;
if ((y > 0.733) && (y <= 3.733)) choice = 4;
if ((y > 3.733) && (y <= 11.333)) choice = 5;
if ((y > 11.333) && (y <= 23.433)) choice = 6;
else choice = 7;
}
if (choice == 1) {
Class = 'O';
L = 30000;
R = 0.0307;
M = 16;
T = 30000;
}
if (choice == 2) {
Class = 'B';
L = 15000;
R = 0.0195;
M = 9;
T = 20000;
}
if (choice == 3) {
Class = 'A';
L = 15;
R = 0.00744;
M = 1.7;
T = 8700;
}
if (choice == 4) {
Class = 'F';
L = 3.25;
R = 0.00488;
M = 1.2;
T = 6750;
}
if (choice == 5) {
Class = 'G';
L = 1;
R = 0.00465;
M = 1;
T = 5700;
}
if (choice == 6) {
Class = 'K';
L = 0.34;
R = 0.00356;
M = 0.62;
T = 4450;
}
if (choice == 7) {
Class = 'M';
L = 0.08;
R = 0.00326;
M = 0.26;
T = 3000;
}
return star;
}
////////////
///////////
std::vector< std::vector<float> > Planet( float L, float R, float M, int T, int n){
///////////////////////////
//Planet generates the Planets
//Random 1 - 10, Random distribution 0.06 - 6 JAU unless specified by User
//Frost line Calculated, First Planet after Frost line is the Jupiter
//The Jupiter will have the most mass of all Jovian worlds
//Otherwise divided into Jovian and Terrestrial Worlds, Random Masses within groups
//Also calculates if a planet is in the Habitable Zone
////////////////////////////
float frostline, innerCHZ, outerCHZ;
float a = 0.06; // a - albedo
float m = M / 100; //Mass of the Jupiter always 1/100th mass of the Star.
std::vector<float> sys;
std::vector<std::vector <float>> system;
for (int i = 0 ; i < n ; i++){
sys.push_back( MyRandom()/10 * 3 ) ; //Distances in terms of Sol AU
}
sort(sys.begin(), sys.end() );
for (int i = 0 ; i < n ; i++){
system[i].push_back(sys[i]);
system[i].push_back(0); //system[i][0] is x, system[i][1] is y
}
frostline = (0.6 * T / 150) * (0.6 * T/150) * R / sqrt(1 - a);
innerCHZ = sqrt(L / 1.1);
outerCHZ = sqrt(L / 0.53);
for (int i = 0 ; i < n ; i++){
if (system[i][0] <= frostline) {
float tmass = m * 0.0003 * MyRandom();
system[i].push_back(tmass) ; //system[i][2] is mass, [3] is marker for the Jupiter
system[i].push_back(0) ;
}
if ((system[i][0] >= frostline) && (system[i-1][0] < frostline)){
system[i].push_back(m) ;
float J = 1;
system[i].push_back(J) ;
}
if ((system[i][0] >= frostline) && (system[i-1][0] >= frostline)) {
float jmass = m * 0.01 * MyRandom();
system[i].push_back(jmass) ;
system[i].push_back(0) ;
}
if ((system[i][0] >= innerCHZ) && (system[i][0] <= outerCHZ)){
float H = 1;
system[i].push_back(H);
}
else system[i].push_back(0); //[4] is habitable marker
}
return system;
}
////////////
////////////
std::vector<std::vector<float>> Time( std::vector<std::vector<float>> system , int n, float dt, int argc, char **argv){
#define ASIZE 3 //Setup
int MPI_Init(int *argc, char ***argv);
int rank, numtasks = n, namelen, rc;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;
MPI_Init( &argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Get_processor_name(processor_name, &namelen);
rc = MPI_Bcast(&system, ASIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD); //Master
// Broadcast computed initial values to all other processes
if (rc != MPI_SUCCESS) {
fprintf(stderr, "Oops! An error occurred in MPI_Bcast()\n");
MPI_Abort(MPI_COMM_WORLD, rc);
}
//Slaves
const float pi = 4 * atan(1.0);
const float G = 6.67 * pow(10,-11);
float a_x, a_y;
for (int i = 0 ; i < n; i++) {
if (rank != i){
a_x = G * system[i][2] * (system[i][0]-system[rank][0]) / ((system[i][0]-system[rank][0]) * (system[i][0]-system[rank][0]));
a_y = G * system[i][2] * (system[i][1]-system[rank][1]) / ((system[i][1]-system[rank][1]) * (system[i][1]-system[rank][1]));
}
if (rank == i){
a_x = G * system[i][2] * 100 * system[i][0] / (system[i][0] * system[i][0]);
a_y = G * system[i][2] * 100 * system[i][1] / (system[i][1] * system[i][1]);
}
a_x += a_x;
a_y += a_y;
}
for (int i=0; i < n; i++){
system[i][0] += system[i][5] * dt + 0.5 * a_x * dt * dt;
system[i][1] += system[i][6] * dt + 0.5 * a_y * dt * dt;
system[i][5] += a_x * dt;
system[i][6] += a_y * dt;
}
for(int i=0 ; i<n ; i++){
for(int j=0 ; j<i ; j++){
if (system[j][0] == 0 && system[j][1] == 0){
system.erase(system.begin() + j);
} // crash into star
if (system[j][0] == system[i][0] && system[j][1] == system[i][1]){
system[i][2] += system[j][2];
system.erase(system.begin() + j);
} // planet crash
} //check co-ordinates
} // planet destroy loop
for(int i = 0 ; i < n ; i++){
if (sqrt(system[i][0]*system[i][0] + system[i][1]*system[i][1]) >= 60) system.erase(system.begin() + i);
}
//Send results back to the first process
if (rank != 0){// All processes except the one of rank 0
MPI_Send(&system, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
}
else {
for (int j = 1; j < numtasks; j++) {
MPI_Recv(&system, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 1,
MPI_COMM_WORLD, &status);
}
}
MPI_Finalize();
///////////////////////////
//Time advances the solar system.
//Plots the Orbits
//Uses MPI to spread it's calculations.
///////////////////////////
return system;
}
////////////
////////////
std::vector<bool> FinalCheck( std::vector<std::vector<float>> system, std::vector<bool> Water, int n){
///////////////////////////
//Final Checks
//Reports if a Planet spent the whole Time in the Habitable Zone
///////////////////////////
for (int i = 0 ; i < n ; i++){
if (system[i][4] == 1.0) Water.push_back(true);
else Water.push_back(false);
}
return Water;
}
////////////
////////////
int main(int argc, char** argv){
char Class;
float L, R, M, T;
std::tuple< char , float , float , float , float > star( Class , L , R , M , T );
star = Star();
int n = MyRandom()/10 + 1;
std::vector<std::vector <float>> system ;
std::vector<std::vector <float>> system1 ;
system = Planet( L , R , M, T, n);
float G = 6.67 * pow(10,-11), pi = 4 * atan(1.0), dt;
for (int i = 0; i < n; i++){
if (system[i][3] == 1){
dt = 2 * pi * .01 * pow(system[i][0] * 1.5 * pow(10,8), 1.5) / sqrt(G * M * 2 * pow(10,30));
}
system[i].push_back(0.0); //system[i][5] is speed in x-axis
system[i].push_back( sqrt(6.67 * pow(10,-11) * 2 * pow(10,30) * M / system[i][0])); //system[i][6] is speed in y-axis
}
std::ofstream Finder;
std::ofstream Report;
Finder.open("plotdata.dat");
Report.open("report.txt");
Finder << "# Plot Co-ordinates" << std::endl;
for (int i = 0 ; i < 1000 ; i++) {
system1 = Time( system, n , dt , argc, argv);
for (int j=0 ; j<n ; j++){
Finder << "[color " << j << "] " << system[j][0] << " " << system[j][1] << std::endl;
if((system[j][4] == 1.0) && ( (sqrt(system[j][0] * system[j][0] + system[j][1] * system[j][1]) < sqrt(L / 1.1) ) || ((sqrt(system[j][0] * system[j][0] + system[j][1] * system[j][1]) > sqrt(L / 0.53)) ))) system[j][4] = 0.0;
}
system = system1;
}
Finder.close();
int m;
m = system.size()/system[0].size();
std::vector<bool> Water;
Water = FinalCheck( system, Water, n);
//Report
for (int i = 0 ; i < n ; i++){
Report << "Planet " << i << "ends up at" << system[i][0] << " and " << system[i][1] << "has mass " << system[i][2] ;
if (system[i][3] == 1) Report << ", which is the 'Jupiter' of the system." ;
if (system[i][4] == 1) Report << ", which can have liquid water on the surface." ;
}
Report.close();
///////////////////////////
//Report cleans everything up and gives the results
//Shows the plot, lists the Planets
//Reports the Positions and Masses of all Planets
//Reports which was the Jupiter and which if any were Habitable
//////////////////////////
return 0;
}
Any thoughts the gurus here have would be appreciated, especially with getting rid of those -fpermissive errors.
EDIT 1 - The code as presented will now compile completely, but it returns a segmentation fault during the Star routine, after the user inputs the star type but before it actually makes a star, as far as I can tell.

code is running, but the gpu function won't be executed

I have two functions:
The add_cpu function works fine, but the add_gpu function does not.
I tried checking some options in my GPU driver software and read my code over and over again. I tried the exact same code on another machine and it worked fine.
The checkError result on the current machine is 1, which it shouldn't be.
The checkError result on my laptop is 0, which is correct.
Does anyone have any suggestion as to what the problem with the graphics card or the system is?
I have no clue what the problem is here.
Did I miss some sort of option?
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>
#include <math.h>
#define out std::cout <<
#define end << std::endl
__global__
void add_gpu( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) y[i] = x[i] + y[i];
}
void add_cpu( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) y[i] = x[i] + y[i];
}
void init( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
int checkError( int n, float f, float* y ) {
float c = 0.0f;
for ( int i = 0; i < n; i++ ) c = fmax( c, fabs( y[i] - f ) );
return c;
}
void print( int n, float* obj, char* str = "obj: " ) {
out str << obj[0];
for ( int i = 1; i < n; i++ ) out ", " << obj[i];
out "" end;
}
int main( ) {
int n = 1 << 5;
float* x, * y;
float error = 0.0f;
cudaMallocManaged( &x, n * sizeof( float ) );
cudaMallocManaged( &y, n * sizeof( float ) );
init( n, x, y );
print( n, x, "x" );
print( n, y, "y" );
add_gpu<< <1, 1 >> > ( n, x, y );
//add_cpu(n, x, y);
cudaDeviceSynchronize( );
print( n, y, "y" );
error = checkError( n, 3.0f, y );
out "error: " << error end;
cudaFree( x );
cudaFree( y );
return 0;
}
I don't see exactly where the problem is, but in order to debug it you should check for CUDA errors.
Most CUDA functions return a CUDA status. You can use a little wrapper function like this to check the errors:
void checkCudaError(const cudaError_t error) {
    if (error != cudaSuccess) {
        std::cout << "Cuda error: " << cudaGetErrorString(error) << std::endl;
        // maybe do something else
    }
}
and call functions like cudaMallocManaged() this way:
checkCudaError(cudaMallocManaged(&x, n * sizeof(float)));
For all operations which are performed on the device (like custom kernels), you should run the kernel and after that call
cudaGetLastError()
and maybe also wrap it in checkCudaError():
checkCudaError(cudaGetLastError());
Note that cudaGetLastError() will return an error if one occurred at any earlier point, so you have to find the place where the first error occurs. That is why you should check for CUDA errors every time the GPU is used in some way.
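Put together, the kernel launch from the question could be checked roughly like this (just a sketch reusing the checkCudaError() wrapper above, not an official API):
add_gpu<<<1, 1>>>(n, x, y);
checkCudaError(cudaGetLastError());        // reports launch errors (bad configuration, no usable device, ...)
checkCudaError(cudaDeviceSynchronize());   // reports errors raised while the kernel actually runs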
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc263dbe6574220cc776b45438fc351e8
Without copying the data to the device, your GPU doesn't know the data, and without copying it back, your host doesn't know the results.

MPI Gather Corrupting Arrays

I have written an MPI code in C++ for my Raspberry Pi cluster which generates an image of the Mandelbrot set. What happens is that on each node (excluding the master, processor 0) part of the Mandelbrot set is calculated, resulting in each node having a 2D array of ints that indicates whether each xy point is in the set.
It appears to work well on each node individually, but when all the arrays are gathered to the master using this command:
MPI_Gather(&inside, 1, MPI_INT, insideFull, 1, MPI_INT, 0, MPI_COMM_WORLD);
it corrupts the data, and the result is an array full of garbage.
(inside is each node's 2D array holding its part of the set. insideFull is also a 2D array, but it holds the whole set.)
Why would it be doing this?
(This led me to wonder whether it is corrupting the data because the master isn't sending its array to itself (or at least I don't want it to). So part of my question is also: is there an MPI_Gather variant that doesn't send anything from the root process and just collects from everything else?)
Thanks
EDIT: here's the whole code. If anyone can suggest better ways of how I'm transferring the arrays, please say.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
// ONLY USE MULTIPLES OF THE NUMBER OF SLAVE PROCESSORS
#define ImageHeight 128
#define ImageWidth 128
double MinRe = -1.9;
double MaxRe = 0.5;
double MinIm = -1.2;
double MaxIm = MinIm + (MaxRe - MinRe)*ImageHeight / ImageWidth;
double Re_factor = (MaxRe - MinRe) / (ImageWidth - 1);
double Im_factor = (MaxIm - MinIm) / (ImageHeight - 1);
unsigned n;
unsigned MaxIterations = 50;
int red;
int green;
int blue;
// MPI variables ****
int processorNumber;
int processorRank;
//*******************//
int main(int argc, char** argv) {
// Initialise MPI
MPI_Init(NULL, NULL);
// Get the number of procesors
MPI_Comm_size(MPI_COMM_WORLD, &processorNumber);
// Get the rank of this processor
MPI_Comm_rank(MPI_COMM_WORLD, &processorRank);
// Get the name of this processor
char processorName[MPI_MAX_PROCESSOR_NAME];
int name_len;
MPI_Get_processor_name(processorName, &name_len);
// A barrier just to sync all the processors, make timing more accurate
MPI_Barrier(MPI_COMM_WORLD);
// Make an array that stores whether each point is in the Mandelbrot Set
int inside[ImageWidth / processorNumber][ImageHeight / processorNumber];
if(processorRank == 0) {
printf("Generating Mandelbrot Set\n");
}
// We don't want the master to process the Mandelbrot Set, only the slaves
if(processorRank != 0) {
// Determine which coordinates to test on each processor
int xMin = (ImageWidth / (processorNumber - 1)) * (processorRank - 1);
int xMax = ((ImageWidth / (processorNumber - 1)) * (processorRank - 1)) - 1;
int yMin = (ImageHeight / (processorNumber - 1)) * (processorRank - 1);
int yMax = ((ImageHeight / (processorNumber - 1)) * (processorRank - 1)) - 1;
// Check each value to see if it's in the Mandelbrot Set
for (int y = yMin; y <= yMax; y++) {
double c_im = MaxIm - y *Im_factor;
for (int x = xMin; x <= xMax; x++) {
double c_re = MinRe + x*Re_factor;
double Z_re = c_re, Z_im = c_im;
int isInside = 1;
for (n = 0; n <= MaxIterations; ++n) {
double Z_re2 = Z_re * Z_re, Z_im2 = Z_im * Z_im;
if (Z_re2 + Z_im2 > 10) {
isInside = 0;
break;
}
Z_im = 2 * Z_re * Z_im + c_im;
Z_re = Z_re2 - Z_im2 + c_re;
}
if (isInside == 1) {
inside[x][y] = 1;
}
else{
inside[x][y] = 0;
}
}
}
}
// Wait for all processors to finish computing
MPI_Barrier(MPI_COMM_WORLD);
int insideFull[ImageWidth][ImageHeight];
if(processorRank == 0) {
printf("Sending parts of set to master\n");
}
// Send all the arrays to the master
MPI_Gather(&inside[0][0], 1, MPI_INT, &insideFull[0][0], 1, MPI_INT, 0, MPI_COMM_WORLD);
// Output the data to an image
if(processorRank == 0) {
printf("Generating image\n");
FILE * image = fopen("mandelbrot_set.ppm", "wb");
fprintf(image, "P6 %d %d 255\n", ImageHeight, ImageWidth);
for(int y = 0; y < ImageHeight; y++) {
for(int x = 0; x < ImageWidth; x++) {
if(insideFull[x][y]) {
putc(0, image);
putc(0, image);
putc(255, image);
}
else {
putc(0, image);
putc(0, image);
putc(0, image);
}
// Just to see what values return, no actual purpose
printf("%d, %d, %d\n", x, y, insideFull[x][y]);
}
}
fclose(image);
printf("Complete\n");
}
MPI_Barrier(MPI_COMM_WORLD);
// Finalise MPI
MPI_Finalize();
}
You call MPI_Gather with the following parameters:
const void* sendbuf : &inside[0][0] Starting address of send buffer
int sendcount : 1 Number of elements in send buffer
const MPI::Datatype& sendtype : MPI_INT Datatype of send buffer elements
void* recvbuf : &insideFull[0][0]
int recvcount : 1 Number of elements for any single receive
const MPI::Datatype& recvtype : MPI_INT Datatype of recvbuffer elements
int root : 0 Rank of receiving process
MPI_Comm comm : MPI_COMM_WORLD Communicator (handle).
Sending/receiving only one element is not sufficient. Instead of 1 use
(ImageWidth / processorNumber)*(ImageHeight / processorNumber)
Then think about the different memory layout of your source and target 2D arrays:
int inside[ImageWidth / processorNumber][ImageHeight / processorNumber];
vs.
int insideFull[ImageWidth][ImageHeight];
As the copy is a memory block copy, and not an intelligent 2D array copy, all your source integers will be transferred contiguously to the target address, regardless of the different row lengths.
I'd recommend sending the data first into an array of the same size as the source, and then, in the receiving process, copying the elements to the right rows and columns in the full array, for example with a small function like:
// assemble2D():
// copies a source int sarr[sli][sco] into a destination int darr[dli][dco],
// starting at darr[doffli][doffco].
// Elements that fall out of bounds are ignored. Negative offsets are possible.
void assemble2D(int *darr, int dli, int dco, int *sarr, int sli, int sco, int doffli = 0, int doffco = 0)
{
    for (int i = 0; i < sli; i++)
        for (int j = 0; j < sco; j++)
            if ((i + doffli >= 0) && (j + doffco >= 0) && (i + doffli < dli) && (j + doffco < dco))
                darr[(i + doffli) * dco + j + doffco] = sarr[i * sco + j];
}
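Put together, the gather side could then look roughly like this (just a sketch; gatherBuf is a temporary buffer introduced here, and the row/column offsets are only a guess since they depend on how you actually split the image across ranks):
const int chunkW = ImageWidth / processorNumber;
const int chunkH = ImageHeight / processorNumber;
const int chunkSize = chunkW * chunkH;
int *gatherBuf = NULL;
if (processorRank == 0)
    gatherBuf = new int[chunkSize * processorNumber];   // one chunk per rank, contiguous
MPI_Gather(&inside[0][0], chunkSize, MPI_INT,
           gatherBuf, chunkSize, MPI_INT, 0, MPI_COMM_WORLD);
if (processorRank == 0) {
    for (int r = 0; r < processorNumber; r++)
        assemble2D(&insideFull[0][0], ImageWidth, ImageHeight,   // destination: full image
                   gatherBuf + r * chunkSize, chunkW, chunkH,    // source: rank r's chunk
                   r * chunkW, 0);                               // offsets: assumed layout
    delete[] gatherBuf;
}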

Code not compiling - ends up in xmemory

I just got started trying to learn how to code graphics using C++. When I compile my linear interpolation code, it does not run and sends VC++ to the xmemory file. No errors or warnings are given, thus leaving me with nothing to work on. What did I do wrong? I suspect the problem is connected to the way I assign the vectors, yet none of my changes have worked.
Here is the code:
#include "SDL.h"
#include <iostream>
#include <glm/glm.hpp>
#include <vector>
#include "SDLauxiliary.h"
using namespace std;
using glm::vec3;
using std::vector;
const int SCREEN_WIDTH = 640;
const int SCREEN_HEIGHT = 480;
SDL_Surface* screen;
void Draw();
void Interpolate( float a, float b, vector<float>& result ) {
int i = 0;
for ( float x=a;x < b+1; ++x )
{
result[i] = x;
i = i + 1;
}
}
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
int i = 0;
for (int add=0; add < 4; ++add) {
float count1 = (b[add]-a[add])/resultvec.size() + a[add];
float count2 = (b[add]-a[add])/resultvec.size() + a[add];
float count3 = (b[add]-a[add])/resultvec.size() + a[add];
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
i = i + 1;
}
}
int main( int argc, char* argv[] )
{
vector<float> result(10); // Create a vector width 10 floats
Interpolate(5, 14, result); // Fill it with interpolated values
for( int i=0; i < result.size(); ++i )
cout << result[i] << " "; // Print the result to the terminal
vector<vec3> resultvec( 4 );
vec3 a(1,4,9.2);
vec3 b(4,1,9.8);
InterpolateVec( a, b, resultvec );
for( int i=0; i<resultvec.size(); ++i )
{
cout << "( "
<< resultvec[i].x << ", "
<< resultvec[i].y << ", "
<< resultvec[i].z << " ) ";
}
screen = InitializeSDL( SCREEN_WIDTH, SCREEN_HEIGHT );
while( NoQuitMessageSDL() )
{
Draw();
}
SDL_SaveBMP( screen, "screenshot.bmp" );
return 0;
}
void Draw()
{
for( int y=0; y<SCREEN_HEIGHT; ++y )
{
for( int x=0; x<SCREEN_WIDTH; ++x )
{
vec3 color(1,0,1);
PutPixelSDL( screen, x, y, color );
}
}
if( SDL_MUSTLOCK(screen) )
SDL_UnlockSurface(screen);
SDL_UpdateRect( screen, 0, 0, 0, 0 );
}
I cannot post a comment to the question, so I'll write my thoughts as an answer.
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
It looks like you (or one of your libraries) overload the comma operator for float to build a vec2 and then a vec3. Nice solution, but if I am right, then there is no reason to assign each component that value, and your code would be similar to:
resultvec[i] = (count1, count2, count3);
Again, this is just a hypothesis! I cannot compile your code and see the error.
Also, I do not understand why you are using i, which is equal to add.
It's strange that some of you could not compile the code; it may be that you have not installed the libraries (the n00b speculating, yay ...).
So here is what I did to make it work (less code is better in this case, as the first comment stated):
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
resultvec[0].x = a[0];
resultvec[0].y = a[1];
resultvec[0].z = a[2];
float count1 = (b[0]-a[0])/(resultvec.size() - 1);
float count2 = (b[1]-a[1])/(resultvec.size() - 1);
float count3 = (b[2]-a[2])/(resultvec.size() - 1);
for (int add=1; add < 5; ++add) {
a[0] = a[0] + count1;
a[1] = a[1] + count2;
a[2] = a[2] + count3;
resultvec[add].x = a[0];
resultvec[add].y = a[1];
resultvec[add].z = a[2];
}
}
What I discovered (after many an hour ...) was that I did not need to add count1, count2 and count3 component by component; vec3 is such a type that adding count1 does what I wanted it to, like assigning a color (i.e. something like (0,0,1)). Am I making sense? My vocabulary is not that technical, I know.
Or, you could save some time and let glm::vec3 do what glm::vec3 is supposed to do.
In the meantime: here, have a cookie (cookie.png).
void Interpolate(vec3 a, vec3 b, vector<vec3>& result) {
vec3 diffStep = (b-a) * (1.0f / (result.size() - 1)); // Operator overloading
result[0] = vec3(a);
for(int i = 1; i < result.size(); i++) {
result[i] = result[i-1] + diffStep;
}
}
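For example, called with the vectors from the question, a usage sketch looks like this (the commented output is what the linear steps should come out to, up to floating-point rounding):
vector<vec3> resultvec(4);
vec3 a(1, 4, 9.2);
vec3 b(4, 1, 9.8);
Interpolate(a, b, resultvec);
for (int i = 0; i < resultvec.size(); ++i)
    cout << "( " << resultvec[i].x << ", " << resultvec[i].y << ", " << resultvec[i].z << " ) ";
// ( 1, 4, 9.2 ) ( 2, 3, 9.4 ) ( 3, 2, 9.6 ) ( 4, 1, 9.8 )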