MPI receiving in multiple parts - c++

Usually when I want to send a buffer to next processor and receive another one from previous one I use the following:
MPI_Irecv(rcv_buff,rcv_size,
MPI_DOUBLE,rcv_p,0,world,
&request);
MPI_Send(snd_buff,snd_size,
MPI_DOUBLE,snd_p,0,world);
MPI_Wait(&request,&status);
Suppose that I want to put the first rcv_size0 elements of rcv_buff in array0 and the rest (rcv_size1 elements) in array1, where:
rcv_size1=rcv_size-rcv_size0;
Normally I first receive into a temporary array such as rcv_buff and then copy the values into array0 and array1. Is there any way in MPI to receive the incoming data in two or more pieces, for example receiving the first rcv_size0 elements directly into array0 and the rest into array1?

You can do this - receiving into two buffers - by creating a type specific to that pair of buffers:
#include <stdio.h>
#include <mpi.h>
#include <stdlib.h>
int recv_split(const int total, const int src, const int tag,
double *buffA, const int sizeA, double *buffB) {
if (total <= 0) return -1;
if (sizeA > total) return -1;
if (buffA == NULL) return -2;
if (buffB == NULL) return -2;
const int sizeB = total - sizeA;
int blocksizes[2] = {sizeA, sizeB};
MPI_Datatype types[2] = {MPI_DOUBLE, MPI_DOUBLE};
MPI_Aint displacements[2], addrA, addrB;
MPI_Datatype splitbuffer;
MPI_Status status;
displacements[0] = 0;
MPI_Get_address(buffA, &addrA);
MPI_Get_address(buffB, &addrB);
displacements[1] = addrB - addrA;
MPI_Type_create_struct(2, blocksizes, displacements, types, &splitbuffer);
MPI_Type_commit(&splitbuffer);
MPI_Recv(buffA, 1, splitbuffer, src, tag, MPI_COMM_WORLD, &status);
MPI_Type_free(&splitbuffer);
return 0;
}
int main(int argc, char **argv) {
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
const int sendSize = 15;
const int tag = 1;
if (rank == 0 && size >= 2) {
double sendbuff[sendSize];
for (int i=0; i<sendSize; i++)
sendbuff[i] = 1.*i;
MPI_Send(sendbuff, sendSize, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
}
if (rank == 1) {
const int buffLen = 12;
const int recvIntoA = 10;
double buffA[buffLen];
double buffB[buffLen];
for (int i=0; i<buffLen; i++) {
buffA[i] = buffB[i] = -1.;
}
recv_split(sendSize, 0, tag, buffA, recvIntoA, buffB);
printf("---Buffer A--\n");
for (int i=0; i<buffLen; i++)
printf("%5.1lf ", buffA[i]);
printf("\n---Buffer B--\n");
for (int i=0; i<buffLen; i++)
printf("%5.1lf ", buffB[i]);
printf("\n");
}
MPI_Finalize();
return 0;
}
compiling and running gives
$ mpicc -o recvsplit recvsplit.c -std=c99
$ mpirun -np 2 ./recvsplit
---Buffer A--
0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 -1.0 -1.0
---Buffer B--
10.0 11.0 12.0 13.0 14.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0 -1.0
Just note that this type will only work for this pair of buffers; different pairs will generally have different relative displacements. You can also of course always receive into one large staging buffer and manually unpack into different buffers, using your own code or MPI_Unpack.
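For comparison, the staging-buffer alternative mentioned above can be sketched in plain C. This is only a minimal illustration, assuming the whole message has already been received into one contiguous array; the helper name unpack_split is hypothetical and not part of MPI.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the first sizeA doubles of a staging buffer
 * into buffA and the remaining (total - sizeA) doubles into buffB.
 * Returns 0 on success, -1 if the sizes are inconsistent. */
int unpack_split(const double *staging, int total,
                 double *buffA, int sizeA, double *buffB)
{
    if (total < 0 || sizeA < 0 || sizeA > total)
        return -1;
    memcpy(buffA, staging, (size_t)sizeA * sizeof(double));
    memcpy(buffB, staging + sizeA, (size_t)(total - sizeA) * sizeof(double));
    return 0;
}
```

Unlike the derived datatype, this costs one extra copy, but the same helper works for any pair of destination buffers.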

There's nothing that I know of directly in MPI that would allow you to do that, though there's probably some nasty pointer magic you could use to make it work. In general, it would be much cleaner to do it as two sends if you wanted to do things that way.
Another thing, which is not a direct answer to your question but which you might not know about: the three calls above can be combined into one using MPI_Sendrecv. Try this:
MPI_Sendrecv(snd_buff, snd_size, MPI_DOUBLE, snd_p, 0,
rcv_buff, rcv_size, MPI_DOUBLE, rcv_p, 0,
world, &status);


MPI_Scatterv (c) will give segmentation fault

I've built a fairly simple C program that reads a pgm image, splits it into sections and sends them to various cores for processing.
In order to account for some processing margins (each core has to access a larger area of the image than it needs to write to), I can't simply split the image; I first have to create an array to which I add the aforementioned margins.
As a quick example: the image is 1600x1200 (width x height), I have 2 cores, each output pixel needs a 3x3 area centered on it, and I'm splitting the image by horizontal lines. The subdivision is then: the first core gets the pixels from 0 to 601*1600, and the second core gets the pixels from 599*1600 to 1200*1600.
Now, I believe there is nothing wrong in how I implemented this in my program, still I get this error:
[ct1pt-tnode003:22389:0:22389] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7ffe7f60ead8)
==== backtrace (tid: 22389) ====
0 0x000000000004ee05 ucs_debug_print_backtrace() ???:0
1 0x0000000000402624 main() ???:0
2 0x0000000000022505 __libc_start_main() ???:0
3 0x0000000000400d99 _start() ???:0
This is my code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <math.h>
#include <time.h>
#include "testlibscatter.h"
#include <mpi.h>
#define MSGLEN 2048
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
int m = atoi(argv[1]), n = atoi(argv[2]), kern_type = atoi(argv[3]);
double kernel[m*n];
int i_rank, ranks;
int param, symm;
MPI_Comm_rank( MPI_COMM_WORLD, &i_rank);
MPI_Comm_size( MPI_COMM_WORLD, &ranks);
int xsize, ysize, maxval;
xsize = 0;
ysize = 0;
maxval = 0;
void * ptr;
switch (kern_type){
case 1:
meankernel(m, n, kernel);
break;
case 2:
weightkernel(m, n, param, kernel);
break;
case 3:
gaussiankernel(m, n, param, symm, kernel);
break;
}
if (i_rank == 0){
read_pgm_image(&ptr, &maxval, &xsize, &ysize, "check_me2.pgm");
}
MPI_Bcast(&xsize, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&ysize, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(&maxval, 1, MPI_INT, 0, MPI_COMM_WORLD);
int flo, start, end, i;
flo = floor(ysize/ranks);
int first, last;
first = start - (m - 1)/2;
last = end + (m - 1)/2;
if (start == 0){
first = 0;
}
if (end == ysize){
last = ysize;
}
int sendcounts[ranks];
int displs[ranks];
int first2[ranks];
int last2[ranks];
int c_start2[ranks];
int c_end2[ranks];
int num;
num = (ranks - 1) * (m-1);
printf("num is %d\n", num);
unsigned short int bigpic[xsize*(ysize + num)];
if (i_rank == 0){
for(i = 0; i < ranks; i++){
c_start2[i] = i * flo;
c_end2[i] = (i + 1) * flo;
if ( i == ranks - 1){
c_end2[i] = ysize;
}
first2[i] = c_start2[i] - (m - 1)/2;
last2[i] = c_end2[i] + (m - 1)/2;
if (c_start2[i] == 0){
first2[i] = 0;
}
if (c_end2[i] == ysize){
last2[i] = ysize;
}
sendcounts[i] = (last2[i] - first2[i]) * xsize;
}
int i, j, k, index, index_disp = 0;
index = 0;
displs[0] = 0;
for (k = 0; k < ranks; k++){
for (i = first2[k]*xsize; i < last2[k]*xsize; i++){
bigpic[index] = ((unsigned short int *)ptr)[i];
index++;
}
printf("%d\n", displs[index_disp]);
index_disp++;
displs[index_disp] = index;
}
}
MPI_Bcast(displs, ranks, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast(sendcounts, ranks, MPI_INT, 0, MPI_COMM_WORLD);
unsigned short int minipic[xsize*(last-first)];
MPI_Barrier(MPI_COMM_WORLD);
MPI_Scatterv(&bigpic[0], sendcounts, displs, MPI_UNSIGNED_SHORT, minipic, (last-first)*xsize, MPI_UNSIGNED_SHORT, 0, MPI_COMM_WORLD);
MPI_Finalize();
}
The kernel functions simply fill an array of m*n doubles used to edit the image, while read_pgm_image returns a void pointer to the pixel values it read.
I've tried printing the values of bigpic and they show no problem.
In the code shown here, start and end are used uninitialised in the computations of first and last:
int flo, start, end, i;
~~~~~~~~~~
flo = floor(ysize/ranks);
int first, last;
first = start - (m - 1)/2; // <---- start has a random value here
last = end + (m - 1)/2; // <---- end has a random value here
If the values are very large, the size of minipic may become larger than the stack size:
unsigned short int minipic[xsize*(last-first)];
^^^^^^^^^^ random (possibly large) value
A strong indication that this is indeed the cause is the fact that the address of the fault 0x7ffe7f60ead8 is very close to the end of the positive part of the virtual address space, which is where most 64-bit OSes allocate the stack area of the main thread.
Always compile with -Wall in order to get back as many diagnostic messages from the compiler as possible.
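One way to avoid the uninitialised values is to let every rank compute its own row range from its rank number once xsize and ysize have been broadcast. The following is only a sketch under that assumption; the helper name halo_rows is hypothetical, and it clamps the margins to the image bounds, which is slightly more defensive than the checks in the question.

```c
/* Hypothetical helper: compute the half-open row range [*first, *last)
 * that a rank needs, including a margin of (m - 1) / 2 rows on each
 * interior side. ysize is the image height, m the kernel height. */
void halo_rows(int rank, int ranks, int ysize, int m, int *first, int *last)
{
    int rows_per_rank = ysize / ranks;   /* integer division already floors */
    int half = (m - 1) / 2;
    int start = rank * rows_per_rank;
    /* the last rank also takes any leftover rows */
    int end = (rank == ranks - 1) ? ysize : start + rows_per_rank;

    *first = (start - half < 0) ? 0 : start - half;
    *last = (end + half > ysize) ? ysize : end + half;
}
```

With the 1600x1200 example and two ranks, rank 0 gets rows 0 to 601 and rank 1 gets rows 599 to 1200, so (last - first) * xsize is a well-defined receive count on every rank.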

Matrix calculation error appears when dimensions become large [duplicate]

I am running code that simply creates 2 matrices: one of dimensions arows x nsame and the other of dimensions nsame x bcols. The result is a matrix of dimensions arows x bcols. This is fairly simple to implement using BLAS, and the following code appears to work as intended when using the master-slave model below with OpenMPI:
#include <iostream>
#include <stdio.h>
#include <iostream>
#include <cmath>
#include <mpi.h>
#include <gsl/gsl_blas.h>
using namespace std;
int main(int argc, char** argv){
int noprocs, nid;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &nid);
MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
int master = 0;
const int nsame = 500; //must be same if matrices multiplied together = acols = brows
const int arows = 500;
const int bcols = 527; //works for 500 x 500 x 527 and 6000 x 100 x 36
int rowsent;
double buff[nsame];
double b[nsame*bcols];
double c[arows][bcols];
double CC[1*bcols]; //here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < bcols; i++){
CC[i] = 0.;
};
// Master part
if (nid == master ) {
double a [arows][nsame]; //creating identity matrix of dimensions arows x nsame (it is I if arows = nsame)
for (int i = 0; i < arows; i++){
for (int j = 0; j < nsame; j++){
if (i == j)
a[i][j] = 1.;
else
a[i][j] = 0.;
}
}
double b[nsame*bcols];//here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < (nsame*bcols); i++){
b[i] = (10.*i + 3.)/(3.*i - 2.) ;
};
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
rowsent=0;
for (int i=1; i < (noprocs); i++) {
// Note A is a 2D array so A[rowsent]=&A[rowsent][0]
MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
rowsent++;
}
for (int i=0; i<arows; i++) {
MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);
int sender = status.MPI_SOURCE;
int anstype = status.MPI_TAG; //row number+1
int IND_I = 0;
while (IND_I < bcols){
c[anstype - 1][IND_I] = CC[IND_I];
IND_I++;
}
if (rowsent < arows) {
MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
rowsent++;
}
else { // tell sender no more work to do via a 0 TAG
MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
}
}
}
// Slave part
else {
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
while(status.MPI_TAG != 0) {
int crow = status.MPI_TAG;
gsl_matrix_view AAAA = gsl_matrix_view_array(buff, 1, nsame);
gsl_matrix_view BBBB = gsl_matrix_view_array(b, nsame, bcols);
gsl_matrix_view CCCC = gsl_matrix_view_array(CC, 1, bcols);
/* Compute C = A B */
gsl_blas_dgemm (CblasNoTrans, CblasNoTrans, 1.0, &AAAA.matrix, &BBBB.matrix,
0.0, &CCCC.matrix);
MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
}
}
// output c here on master node //uncomment the below lines if I wish to see the output
// if (nid == master){
// if (rowsent == arows){
// // cout << rowsent;
// int IND_F = 0;
// while (IND_F < arows){
// int IND_K = 0;
// while (IND_K < bcols){
// cout << "[" << IND_F << "]" << "[" << IND_K << "] = " << c[IND_F][IND_K] << " ";
// IND_K++;
// }
// cout << "\n";
// IND_F++;
// }
// }
// }
MPI_Finalize();
//free any allocated space here
return 0;
};
Now what appears odd is that when I increase size of the matrices (e.g. from nsame = 500 to nsame = 501), the code no longer works. I receive the following error:
mpirun noticed that process rank 0 with PID 0 on node Users-MacBook-Air exited on signal 11 (Segmentation fault: 11).
I have tried this with other combinations of sizes for the matrices and there always appears to be an upper limit for the size of the matrices themselves (which seems to vary based on how I vary the different dimensions themselves). I have also tried modifying the values of the matrices themselves although this does not appear to change anything. I realize there are alternative ways to initialize the matrices in my example (e.g. using vector) but am simply wondering why my current scheme of multiplying matrices of arbitrary size seems to only work to a certain extent.
You're declaring too many big local variables, which overflows the stack. a, in particular, is 500x500 doubles (250,000 8-byte elements, or 2 million bytes), and b is even larger.
You'll need to dynamically allocate space for some or all of those arrays.
There might be a compiler option to increase the initial stack space but that isn't a good long term solution.
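A minimal sketch of the dynamic allocation suggested above, written in C for brevity. The helper name alloc_matrix is hypothetical; it stores the matrix as one contiguous block, so row i starts at m + i * cols and can still be passed to MPI_Send as a single buffer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate a rows x cols matrix of doubles on the
 * heap as one contiguous, zero-initialised block. Element (i, j) lives
 * at m[i * cols + j]. Returns NULL on failure or if a dimension is 0. */
double *alloc_matrix(size_t rows, size_t cols)
{
    if (rows == 0 || cols == 0)
        return NULL;
    /* reject sizes whose byte count would overflow size_t */
    if (rows > (size_t)-1 / cols / sizeof(double))
        return NULL;
    return (double *)calloc(rows * cols, sizeof(double));
}
```

The caller releases the block with free() once the matrix is no longer needed, for example just before MPI_Finalize().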

finding global maxima of a function from comparing each processor's local maxima using MPI ring topology

I wish to use the MPI ring topology, passing each processor's maxima around the ring, comparing the local maxima and then output the global maxima for all processors.
I am using a 10-dimensional Monte Carlo integration function. My first idea was to build an array of each processor's local maximum, pass those values around, compare them, and output the highest one. But I couldn't find an elegant way to build an array that stores each processor's maximum at the index given by its rank, which would also let me keep track of which processor holds the global maximum.
I haven't finished my code yet; right now I am interested in whether an array of the processors' local maxima can be created. The way I coded it is very tedious: with a lot of processors I have to write out a branch for each one, and I still couldn't produce the array I am looking for.
I am sharing the code here:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <mpi.h>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
double y;
int j;
y = 0.0;
for (j = 0; j < n-1; j = j+1)
{
y = y + exp(-pow((1-x[j]),2)-100*(pow((x[j+1] - pow(x[j],2)),2)));
}
y = y;
return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int),double a[], double b[], int n, int m)
{
double r, x[n], v;
int i, j;
r = 0.0;
v = 1.0;
// initial seed value (use system time)
//srand(time(NULL));
// step 1: calculate the common factor V
for (j = 0; j < n; j = j+1)
{
v = v*(b[j]-a[j]);
}
// step 2: integration
for (i = 1; i <= m; i=i+1)
{
// calculate random x[] points
for (j = 0; j < n; j = j+1)
{
x[j] = a[j] + (rand()) /( (RAND_MAX/(b[j]-a[j])));
}
r = r + fn(x,n);
}
r = r*v/m;
return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
int rank, size;
MPI_Init (&argc, &argv); // initializes MPI
MPI_Comm_rank (MPI_COMM_WORLD, &rank); // get current MPI-process ID. O, 1, ...
MPI_Comm_size (MPI_COMM_WORLD, &size); // get the total number of processes
/* define how many integrals */
const int n = 10;
double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0,5.0};
double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0,-5.0};
double result, mean;
int m;
const unsigned int N = 5;
double max = -1;
double max_store[4];
cout.precision(6);
cout.setf(ios::fixed | ios::showpoint);
srand(time(NULL) * rank); // each MPI process gets a unique seed
m = 4; // initial number of intervals
// convert command-line input to N = number of points
//N = atoi( argv[1] );
for (unsigned int i=0; i <=N; i++)
{
result = int_mcnd(f, a, b, n, m);
mean = result/(pow(10,10));
if( mean > max)
{
max = mean;
}
//cout << setw(10) << m << setw(10) << max << setw(10) << mean << setw(10) << rank << setw(10) << size <<endl;
m = m*4;
}
//cout << setw(30) << m << setw(30) << result << setw(30) << mean <<endl;
printf("Process %d of %d mean = %1.5e\n and local max = %1.5e\n", rank, size, mean, max );
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
for( int k = 0; k < 4; k++ )
{
printf( "%1.5e\n", max_store[k]);
}
//double max_store[4] = {4.43095e-02, 5.76586e-02, 3.15962e-02, 4.23079e-02};
double send_junk = max_store[0];
double rec_junk;
MPI_Status status;
// This next if-statment implemeents the ring topology
// the last process ID is size-1, so the ring topology is: 0->1, 1->2, ... size-1->0
// rank 0 starts the chain of events by passing to rank 1
if(rank==0) {
// only the process with rank ID = 0 will be in this block of code.
MPI_Send(&send_junk, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); // send data to process 1
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, size-1, 0, MPI_COMM_WORLD, &status); // receive data from process size-1
}
else if( rank == size-1) {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // recieve data from process rank-1 (it "left" neighbor")
MPI_Send(&send_junk, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); // send data to its "right neighbor", rank 0
}
else {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // recieve data from process rank-1 (it "left" neighbor")
MPI_Send(&send_junk, 1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD); // send data to its "right neighbor" (rank+1)
}
printf("Process %d send %1.5e\n and recieved %1.5e\n", rank, send_junk, rec_junk );
MPI_Finalize(); // programs should always perform a "graceful" shutdown
return 0;
}
compile with :
mpiCC -o gd test_code.cpp
mpirun -np 4 ./gd
I would appreciate suggestions:
Is there a more elegant way to build the array of local maxima?
How can I compare the local maxima and determine the global maximum while passing the values around a ring?
Also, feel free to modify the code to give me a better example to work with. Thanks.
For this sort of thing, it is better to use either MPI_Reduce() or MPI_Allreduce() with MPI_MAX as the operator. The former computes the max over the values contributed by all processes and gives the result to the "root" process only, while the latter does the same but gives the result to all processes.
// Only the process of rank 0 gets the global max
MPI_Reduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
// All processes get the global max
MPI_Allreduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
// All processes get the global max, stored in place of the local max
// after the call ends - this might be the most interesting one for you
MPI_Allreduce( MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
As you can see, you could just insert the 3rd example into your code to solve your problem.
BTW, unrelated remark, but this hurts my eyes:
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
What about something like this:
if ( rank < 4 && rank >= 0 ) {
max_store[rank] = max;
}
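If the ring itself is the point of the exercise, the comparison each rank performs can be sketched as follows. This is a serial stand-in written under assumptions, not MPI code: the names MaxLoc, ring_step and ring_max are hypothetical, and in a real program ring_step would sit between the MPI_Recv from rank-1 and the MPI_Send to rank+1, with the token carrying both the value and the rank that owns it.

```c
/* Hypothetical token passed around the ring: the best value so far
 * and the rank that produced it. */
typedef struct {
    double value;
    int rank;
} MaxLoc;

/* What a single rank does on one hop: keep the incoming candidate
 * unless the local maximum is larger. */
MaxLoc ring_step(MaxLoc incoming, double local_max, int my_rank)
{
    if (local_max > incoming.value) {
        incoming.value = local_max;
        incoming.rank = my_rank;
    }
    return incoming;
}

/* Serial stand-in for one lap of the ring 0 -> 1 -> ... -> size-1 -> 0.
 * local_max[r] plays the role of rank r's local maximum. */
MaxLoc ring_max(const double *local_max, int size)
{
    MaxLoc token = { local_max[0], 0 };
    for (int r = 1; r < size; r++)
        token = ring_step(token, local_max[r], r);
    return token;   /* what rank 0 receives from rank size-1 */
}
```

After one lap only rank 0 knows the result, so a second lap (or a broadcast) would be needed to tell everyone. MPI also provides the MPI_MAXLOC operator with the MPI_DOUBLE_INT type, which lets MPI_Allreduce return the maximum together with the owning rank in one call.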

MPI - sending parts of image to different processes

I'm writing a program in which process 0 sends parts of an image to other processes, which transform their part (a long operation) and send it back to rank 0. I have a problem with one thing, and to reproduce it I wrote a simple example. An image of size 512x512 px is split into 4 parts (vertical stripes) by process 0. The other processes then save their part to disk. The problem is that each process saves the same part. I found that the image is split into parts correctly, so the problem is probably in sending the data. What's wrong with my code?
Run:
mpirun -np 5 ./example
Main:
int main(int argc, char **argv) {
int size, rank;
MPI_Request send_request, rec_request;
MPI_Status status;
ostringstream s;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
if (rank == 0) {
Mat mat = imread("/home/user/original.jpg", CV_LOAD_IMAGE_COLOR);
if (!mat.data) exit(-1);
int idx = 1;
for (int c = 0; c < 512; c += 128) {
Mat slice = mat(Rect(c, 0, 128, 512)).clone();
MPI_Isend(slice.data, 128 * 512 * 3, MPI_BYTE, idx, 0, MPI_COMM_WORLD, &send_request);
idx++;
}
}
if (rank != 0) {
Mat test = Mat(512, 128, CV_8UC3);
MPI_Irecv(test.data, 128 * 512 * 3, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &rec_request);
MPI_Wait(&rec_request, &status);
s << "/home/user/p" << rank << ".jpg";
imwrite(s.str(), test);
}
MPI_Finalize();
return 0;
}
If you insist on using non-blocking operations, then the proper way to issue multiple of them at the same time is:
MPI_Request *send_reqs = new MPI_Request[4];
Mat slices[4]; // each slice must stay alive until its send has completed
int idx = 1;
for (int c = 0; c < 512; c += 128) {
    slices[idx-1] = mat(Rect(c, 0, 128, 512)).clone();
    MPI_Isend(slices[idx-1].data, 128 * 512 * 3, MPI_BYTE, idx, 0, MPI_COMM_WORLD, &send_reqs[idx-1]);
    idx++;
}
MPI_Waitall(4, send_reqs, MPI_STATUSES_IGNORE);
delete [] send_reqs;
Another (and IMHO better) option would be to use MPI_Scatterv to scatter the original data buffer. That way you could even avoid cloning parts of the image matrix.
if (rank == 0) {
Mat mat = imread("/home/user/original.jpg", CV_LOAD_IMAGE_COLOR);
if (!mat.data) exit(-1);
int *send_counts = new int[size];
int *displacements = new int[size];
// The following calculations assume row-major storage
for (int i = 0; i < size; i++) {
send_counts[i] = displacements[i] = 0;
}
int idx = 1;
for (int c = 0; c < 512; c += 128) {
displacements[idx] = displacements[idx-1] + send_counts[idx-1];
send_counts[idx] = 128 * 512 * 3;
idx++;
}
MPI_Scatterv(mat.data, send_counts, displacements, MPI_BYTE,
NULL, 0, MPI_BYTE, 0, MPI_COMM_WORLD);
delete [] send_counts;
delete [] displacements;
}
if (1 <= rank && rank <= 4) {
Mat test = Mat(512, 128, CV_8UC3);
MPI_Scatterv(NULL, NULL, NULL, MPI_BYTE,
test.data, 128 * 512 * 3, MPI_BYTE, 0, MPI_COMM_WORLD);
s << "/home/user/p" << rank << ".jpg";
imwrite(s.str(), test);
}
Note how the arguments to MPI_Scatterv are prepared. Since you are scattering to 4 MPI processes only, setting certain elements of send_counts[] to zero allows the program to function correctly with more than 5 MPI processes. Also, the root rank in your original code doesn't send to itself, therefore send_counts[0] must be zero.
The problem is that you are not waiting until the send operation completes before the Mat holding the slice is destroyed. Use MPI_Send instead of MPI_Isend.
If you really want to use non-blocking communication, you have to keep all MPI_Request objects and all Mat slices alive until the sends are complete.

mpirun was unable to find the specified executable file

I have problems running this code with OpenMPI. Since I am a bit new to the concepts of OpenMPI, it would be great if someone could give me a hint about the mistake here.
Compiling works just fine, but when I run the code I get this message:
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
I am compiling using:
mpic++ matmult.cpp -o matmult
and running it with:
mpirun -n 2 matmult
... and here is the used code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define MASTER 0
#define FROM_MASTER 1
#define FROM_WORKER 2
// ---------------------------------------------------------------------------
// allocate space for empty matrix A[row][col]
// access to matrix elements possible with:
// - A[row][col]
// - A[0][row*col]
float **alloc_mat(int row, int col)
{
float **A1, *A2;
A1 = (float **)calloc(row, sizeof(float *)); // pointer on rows
A2 = (float *)calloc(row*col, sizeof(float)); // all matrix elements
for (int i = 0; i < row; i++)
A1[i] = A2 + i*col;
return A1;
}
// ---------------------------------------------------------------------------
// random initialisation of matrix with values [0..9]
void init_mat(float **A, int row, int col)
{
for (int i = 0; i < row*col; i++)
A[0][i] = (float)(rand() % 10);
}
// ---------------------------------------------------------------------------
// DEBUG FUNCTION: printout of all matrix elements
void print_mat(float **A, int row, int col, char *tag)
{
int i, j;
printf("Matrix %s:\n", tag);
for (i = 0; i < row; i++)
{
for (j = 0; j < col; j++)
printf("%6.1f ", A[i][j]);
printf("\n");
}
}
// ---------------------------------------------------------------------------
int main(int argc, char *argv[]) {
int numtasks;
int taskid;
int numworkers;
int source;
int dest;
int mtype;
int rows;
int averow, extra, offset;
double starttime, endtime;
float **A, **B, **C; // matrices
int d1, d2, d3; // dimensions of matrices
int i, j, k, rc; // loop variables
MPI_Status status;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
if (argc != 4) {
printf ("Matrix multiplication: C = A x B\n");
printf ("Usage: %s <NumRowA> <NumColA> <NumColB>\n", argv[0]);
return 0;
}
if (numtasks < 2 ) {
printf("Need at least two MPI tasks. Quitting...\n");
MPI_Abort(MPI_COMM_WORLD,rc);
exit(1);
}
/* read user input */
d1 = atoi(argv[1]); // rows of A and C d1
d2 = atoi(argv[2]); // cols of A and rows of B d2
d3 = atoi(argv[3]); // cols of B and C d3
printf("Matrix sizes C[%d][%d] = A[%d][%d] x B[%d][%d]\n", d1, d3, d1, d2, d2, d3);
/* prepare matrices */
A = alloc_mat(d1, d2);
init_mat(A, d1, d2);
B = alloc_mat(d2, d3);
init_mat(B, d2, d3);
C = alloc_mat(d1, d3);
/* Code for the manager */
if (taskid == MASTER) {
/*printf("matrix multiplikation withMPI\n");
printf("initializing arrays ...\n");
for (i=0; i<d1; i++)
for (j=0; j<d2; j++)
A[i][j]=i+j;
for (i=0; i<d2; i++)
for (j=0; j<d3; j++)
B[i][j]=i*j;*/
/* Send matrices */
averow = d1/numworkers;
extra = d1%numworkers;
offset = 0;
mtype = FROM_MASTER;
starttime=MPI_Wtime();
for (dest=1;dest<=numworkers;dest++) {
rows = (dest <= extra) ? averow+1 :averow;
printf("Sending %drows to task %doffset=%d\n",rows,dest,offset);
MPI_Send(&offset, 1, MPI_INT,dest,mtype, MPI_COMM_WORLD);
MPI_Send(&rows, 1, MPI_INT,dest,mtype, MPI_COMM_WORLD);
MPI_Send(&A[offset][0],rows*d2, MPI_DOUBLE,dest,mtype, MPI_COMM_WORLD);
MPI_Send(&B, d2*d3, MPI_DOUBLE,dest,mtype, MPI_COMM_WORLD);
offset =offset+rows;
}
/* Receive results */
mtype = FROM_WORKER;
for (i=1; i<=numworkers; i++) {
source = i;
MPI_Recv(&offset, 1, MPI_INT,source,mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&rows, 1, MPI_INT,source,mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&C[offset][0],rows*d3,
MPI_DOUBLE,source,mtype,MPI_COMM_WORLD,&status);
printf("Received results from task %d\n",source);
}
endtime=MPI_Wtime();
printf("\nIt took %fseconds.\n",endtime-starttime);
}
/* Code for the workers */
if (taskid > MASTER) {
mtype = FROM_MASTER;
MPI_Recv(&offset, 1, MPI_INT, MASTER,mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&d1, 1, MPI_INT, MASTER,mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&A,rows*d2, MPI_DOUBLE, MASTER,mtype, MPI_COMM_WORLD, &status);
MPI_Recv(&B, d2*d3, MPI_DOUBLE, MASTER,mtype, MPI_COMM_WORLD, &status);
/* print user instruction */
// no initialisation of C, because it gets filled by matmult
/* serial version of matmult */
printf("Perform matrix multiplication...\n");
for (i = 0; i < d1; i++)
for (j = 0; j < d3; j++)
for (k = 0; k < d2; k++)
C[i][j] += A[i][k] * B[k][j];
mtype = FROM_WORKER;
MPI_Send(&offset, 1, MPI_INT, MASTER,mtype, MPI_COMM_WORLD);
MPI_Send(&d1, 1, MPI_INT, MASTER,mtype, MPI_COMM_WORLD);
MPI_Send(&C,rows*d3, MPI_DOUBLE, MASTER,mtype, MPI_COMM_WORLD);
}
MPI_Finalize();
/* test output
print_mat(A, d1, d2, "A");
print_mat(B, d2, d3, "B");
print_mat(C, d1, d3, "C"); */
printf ("\nDone.\n");
//return 0;
}
Results of running mpirun matmult (default settings, single process):
mpirun has exited due to process rank 0 with PID 77202 on node
juliuss-mbp-3 exiting improperly. There are three reasons this could
occur:
this process did not call "init" before exiting, but others in the
job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls
"init", then ALL processes must call "init" prior to termination.
this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call
"finalize" prior to exiting or it will be considered an "abnormal
termination"
this process called "MPI_Abort" or "orte_abort" and the mca parameter orte_create_session_dirs is set to false. In this case,
the run-time cannot detect that the abort call was an abnormal
termination. Hence, the only error message you will receive is this
one. This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here). You can
avoid this message by specifying -quiet on the mpirun command line.
Secondary Issue (still important):
Your program expects an argument count of 4, i.e. the program name plus 3 arguments, as this code shows:
if (argc != 4) {
printf ("Matrix multiplication: C = A x B\n");
printf ("Usage: %s <NumRowA> <NumColA> <NumColB>\n", argv[0]);
return 0;
}
Since this branch returns 0 without calling MPI_Abort(...) or MPI_Finalize(), you will receive the MPI error:
mpirun has exited due to process rank 0 with PID 77202 on node juliuss-mbp-3 exiting improperly.
By adding MPI_Abort(MPI_COMM_WORLD, 1); before return 0 I believe your program will be in the clear (rc is uninitialized at this point in your program, so pass an explicit error code):
if (argc != 4) {
printf ("Matrix multiplication: C = A x B\n");
printf ("Usage: %s <NumRowA> <NumColA> <NumColB>\n", argv[0]);
MPI_Abort(MPI_COMM_WORLD, 1);
return 0;
}
Primary Issue:
However, we should address the main cause of the issue: you need to pass 3 arguments to your program when you run mpirun -np 2 matmult or mpirun matmult, in this format:
mpirun -np 2 matmult parameter1 parameter2 parameter3
or
mpirun matmult parameter1 parameter2 parameter3
From your code the parameters (arguments) should be:
parameter1 = rows of A and C
parameter2 = cols of A and rows of B
parameter3 = cols of B and C
and your run command could look like:
mpirun -np 2 matmult 2 2 2
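The argument check can also be made stricter than atoi allows. The sketch below is an assumption about how one might do it, not part of the original program: the helper names parse_positive and parse_dims are hypothetical, and they reject missing, non-numeric or non-positive dimensions.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse a strictly positive int from s.
 * Returns 0 on success, -1 on any malformed or out-of-range input. */
static int parse_positive(const char *s, int *out)
{
    char *end;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0 || v > INT_MAX)
        return -1;
    *out = (int)v;
    return 0;
}

/* Hypothetical helper: expect exactly three dimensions after the
 * program name (rows of A, cols of A / rows of B, cols of B). */
int parse_dims(int argc, char **argv, int *d1, int *d2, int *d3)
{
    if (argc != 4)
        return -1;
    if (parse_positive(argv[1], d1) != 0) return -1;
    if (parse_positive(argv[2], d2) != 0) return -1;
    if (parse_positive(argv[3], d3) != 0) return -1;
    return 0;
}
```

If parse_dims fails after MPI_Init has been called, the program could print the usage text and call MPI_Finalize() before returning, so every rank still shuts down cleanly.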