MPI_Bcast, broadcasting array buffer to specific locations in the receive buffer - c++

Using MPI, I want to perform a broadcast operation by all processes in the communicator such that, at the end of the broadcasts by all processes, the buffer in every process holds the same data.
Here is a fragment of code depicting what I want to do:
//assume there are 10 processes
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
double globalArray[100];
for(int i=0; i<100; ++i) {
globalArray[i] = (double)i + 1.0;
}
double buffer[10]; //assume all entries in buffer are zero
//only one array location in each rank is initialized; the rest remain zero
buffer[rank] = globalArray[(rank + 1)*10 - 1];
MPI_Bcast(&buffer, 1, MPI_DOUBLE, rank, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
for(int i=0; i<10; ++i) {
std::cout << buffer[i] << "\t";
}
//expect to get [10 20 ... 100] in all the processes
MPI_Finalize();
I know there is MPI_Scatterv, which, called on each process, could do the job, but that means I have to create two additional arrays, one for the send counts and one for the displacements, which will always be the same for every scatterv operation.
Is there an easier way to do this?

If I understand your question correctly, what you really need is MPI_Allgather():
double myvalue = (rank + 1) * 10;
MPI_Allgather(&myvalue, 1, MPI_DOUBLE, buffer, 1, MPI_DOUBLE, MPI_COMM_WORLD);
MWE:
#include <mpi.h>
#include <iostream>
int main(int argc, char* argv[])
{
int size, rank;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
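// Note: the fixed-size buffer and the output loop below assume the program is run with exactly 10 processes.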
double buffer[10];
double myvalue = (rank + 1) * 10;
MPI_Allgather(&myvalue, 1, MPI_DOUBLE, buffer, 1, MPI_DOUBLE, MPI_COMM_WORLD);
if (rank == 0) {
for(int i=0; i<10; ++i) {
std::cout << buffer[i] << "\t";
}
}
// answer for all processes: 10 20 ... 100
MPI_Finalize();
return 0;
}
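If the per-rank contributions were not all the same size, the same pattern would need MPI_Allgatherv, which takes per-rank receive counts and displacements (similar to the arrays the question mentions for MPI_Scatterv). A minimal sketch, assuming one double per rank just to keep it comparable to the MWE above:
#include <mpi.h>
#include <vector>
#include <iostream>
int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // How many elements each rank contributes and where they land in the result.
    std::vector<int> counts(size, 1);   // here: one element per rank, but they may differ
    std::vector<int> displs(size);
    for (int r = 0; r < size; ++r) displs[r] = r;
    std::vector<double> buffer(size);
    double myvalue = (rank + 1) * 10;
    MPI_Allgatherv(&myvalue, 1, MPI_DOUBLE,
                   buffer.data(), counts.data(), displs.data(), MPI_DOUBLE,
                   MPI_COMM_WORLD);
    if (rank == 0) {
        for (double v : buffer) std::cout << v << "\t";
        std::cout << std::endl;
    }
    MPI_Finalize();
    return 0;
}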

Related

Poor scaling when running code in parallel using MPI and openMP

I have the following implementation:
int main(int argc, char **argv)
{
int n_runs = 100; // Number of runs
int seed = 1;
int arraySize = 400;
/////////////////////////////////////////////////////////////////////
// initialise the random number generator using a fixed seed for reproducibility
srand(seed);
MPI_Init(nullptr, nullptr);
int rank, n_procs;
MPI_Comm_size(MPI_COMM_WORLD, &n_procs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// Initialise the probability step and results vectors.
// We have 21 probabilities between 0 and 1 (inclusive).
double prob_step = 0.05;
std::vector<double> avg_steps_over_p(21,0);
std::vector<double> trans_avg_steps_over_p(21,0);
std::vector<int> min_steps_over_p(21,0);
std::vector<int> trans_min_steps_over_p(21,0);
std::vector<int> max_steps_over_p(21,0);
std::vector<int> trans_max_steps_over_p(21,0);
std::vector<double> prob_reached_end(21,0);
std::vector<double> trans_prob_reached_end(21,0);
// Loop over probabilities and compute the number of steps before the model burns out,
// averaged over n_runs.
for (int i = rank; i < 21; i+=n_procs)
{
double prob = i*prob_step;
int min_steps = std::numeric_limits<int>::max();
int max_steps = 0;
for (int i_run = 0; i_run < n_runs; ++i_run)
{
Results result = forest_fire(arraySize, prob);
avg_steps_over_p[i] += result.stepCount;
if (result.fireReachedEnd) ++prob_reached_end[i];
if (result.stepCount < min_steps) min_steps = result.stepCount;
if (result.stepCount > max_steps) max_steps = result.stepCount;
}
avg_steps_over_p[i] /= n_runs;
min_steps_over_p[i] = min_steps;
max_steps_over_p[i] = max_steps;
prob_reached_end[i] = 1.0*prob_reached_end[i] / n_runs;
}
// Worker processes communicate their results to the master process.
if (rank > 0)
{
MPI_Send(&avg_steps_over_p[0], 21, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD);
MPI_Send(&min_steps_over_p[0], 21, MPI_INT, 0, rank, MPI_COMM_WORLD);
MPI_Send(&max_steps_over_p[0], 21, MPI_INT, 0, rank, MPI_COMM_WORLD);
MPI_Send(&prob_reached_end[0], 21, MPI_DOUBLE, 0, rank, MPI_COMM_WORLD);
} else
{
for (int i = 1; i < n_procs; ++i)
{
MPI_Status status;
MPI_Recv(&trans_avg_steps_over_p[0], 21, MPI_DOUBLE, i, i, MPI_COMM_WORLD, &status);
for (int j = i; j < 21; j += n_procs) {
avg_steps_over_p[j] = trans_avg_steps_over_p[j];
}
MPI_Recv(&trans_min_steps_over_p[0], 21, MPI_INT, i, i, MPI_COMM_WORLD, &status);
for (int j = i; j < 21; j += n_procs) {
min_steps_over_p[j] = trans_min_steps_over_p[j];
}
MPI_Recv(&trans_max_steps_over_p[0], 21, MPI_INT, i, i, MPI_COMM_WORLD, &status);
for (int j = i; j < 21; j += n_procs) {
max_steps_over_p[j] = trans_max_steps_over_p[j];
}
MPI_Recv(&trans_prob_reached_end[0], 21, MPI_DOUBLE, i, i, MPI_COMM_WORLD, &status);
for (int j = i; j < 21; j += n_procs) {
prob_reached_end[j] = trans_prob_reached_end[j];
}
}
// Master process outputs the final result.
std::cout << "Probability, Avg. Steps, Min. Steps, Max Steps" << std::endl;
for (int i = 0; i < 21; ++i)
{
double prob = i * prob_step;
std::cout << prob << "," << avg_steps_over_p[i]
<< "," << min_steps_over_p[i] << ","
<< max_steps_over_p[i] << ","
<< prob_reached_end[i] << std::endl;
}
}
MPI_Finalize();
return 0;
}
I have tried the following parameters: scaling analysis
I'm new to parallelisation and HPC, so forgive me if I'm wrong, but I was expecting a speed-up ratio greater than 3 when increasing the tasks per node and CPUs per task. I haven't yet tried all the possibilities, but I believe the behaviour here is odd, especially when keeping CPUs per task at 1 and increasing tasks per node from 2 -> 3 -> 4. I know it's not as simple as greater core usage = greater speed-up, but from what I've gathered these changes should give a speed-up.
Is there a possible inefficiency in my code that is leading to this, or is this expected behaviour? My full code is here, which includes the OpenMP parallelisation: https://www.codedump.xyz/cpp/Y5Rr68L8Mncmx1Sd.
Many thanks.
I don't know how many operations are in the forest_fire routine, but it had better be at least a couple of tens of thousands, otherwise you don't have enough work to overcome the parallelisation overhead.
Rank 0 handles all the other processes sequentially. You should use MPI_Irecv, and I wonder whether a collective operation would not be preferable (see the sketch below).
You are indexing with [i], which is a strided operation. That is space-wasting, as I pointed out in another question you posted: every process should only allocate as much space as is needed on that process.
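As an illustration of the collective-operation suggestion, here is a minimal sketch (untested, reusing the vector names from the code above). It relies on the fact that each index of the result vectors is written by exactly one rank and stays zero everywhere else, so a sum-reduction onto rank 0 reproduces the same result as the send/receive loop, including for the min/max vectors:
// Replace the whole if (rank > 0) { MPI_Send... } else { MPI_Recv... } block with:
if (rank == 0) {
    MPI_Reduce(MPI_IN_PLACE, avg_steps_over_p.data(), 21, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(MPI_IN_PLACE, min_steps_over_p.data(), 21, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(MPI_IN_PLACE, max_steps_over_p.data(), 21, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(MPI_IN_PLACE, prob_reached_end.data(), 21, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
} else {
    MPI_Reduce(avg_steps_over_p.data(), nullptr, 21, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(min_steps_over_p.data(), nullptr, 21, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(max_steps_over_p.data(), nullptr, 21, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(prob_reached_end.data(), nullptr, 21, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
}
// The trans_* vectors and the manual copy loops are then no longer needed.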

How to exchange data between different processes using MPI

I'm trying to use MPI_Sendrecv to exchange data between processes, but I'm stuck on the send and receive logic.
Assume that we have four processes, and each of them has two 1D arrays of size 4: Params and Params_. The entries of Params are initialized to rank * 4 + i (e.g. in the process with rank 0, Params = {0,1,2,3}), while Params_ is used to store the Params sent by the other processes. Every process sends its Params to the others and receives the Params from the others. The code is as follows.
#include<mpi.h>
#include<stdio.h>
int main(int argc, char **argv){
int my_rank, ncpus;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
double *params;
double *params_;
int Nparams = 4;
params = new double[Nparams];
params_ = new double[Nparams];
// Initialize params and print it
printf("ID %d: ", my_rank);
for (int i = 0; i < Nparams; i++) {
params[i] = my_rank * ncpus + i;
printf("%g ", params[i]);
}
printf("\n");
// Receive params from the other processes, store it in params_, and print it
for (int i = 0; i < ncpus; i++) {
MPI_Sendrecv(params, Nparams, MPI_DOUBLE, i, 0, params_, Nparams, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &status);
printf("Receive params from %d: ", i);
for (int i = 0; i < Nparams; i++)
printf("%g ", params_[i]);
printf("\n");
}
delete[] params;
delete[] params_;
MPI_Finalize();
return 0;
}
The code prints nothing (it cannot even print Params), and there are a lot of warnings like "WARNING: There was an error initializing an OpenFabrics device." Could you help me figure out why this doesn't work and what I should do? Thanks a lot!
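For reference, here is a minimal sketch of the exchange pattern being described, using the variable names from the code above. It does not address the OpenFabrics warning, which usually comes from the MPI runtime/network configuration rather than from the program logic:
// Each process exchanges its params with every other process in turn.
for (int partner = 0; partner < ncpus; partner++) {
    if (partner == my_rank) continue;   // nothing to exchange with ourselves
    MPI_Sendrecv(params,  Nparams, MPI_DOUBLE, partner, 0,
                 params_, Nparams, MPI_DOUBLE, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // params_ now holds partner's values; use or copy them before the next iteration.
    printf("ID %d received params from %d\n", my_rank, partner);
}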

Sum of the numbers 1 to 1000 in parallel

The following code uses 2n CPUs to calculate the sum of the numbers 1 to 1000. Each processor calculates a portion of this sum and prints its own output independently.
The partial results of all processors are collected by the first processor, which aggregates them and displays the final result.
#include <iostream>
#include <stdio.h>
#include <mpi.h>
static int MyNode, Nodes;
using namespace std;
int main(int* argc, char** argv[])
{
MPI_Init(argc, argv);
MyNode = MPI_Comm_rank(MPI_COMM_WORLD, &MyNode);
Nodes = MPI_Comm_size(MPI_COMM_WORLD, &Nodes);
MPI_Status status;
int sum = 0;
int accum = 0;
int FIndex = 1000 * MyNode / Nodes + 1;
int LIndex = 1000 * (MyNode + 1) /
Nodes;
for (int I = FIndex; I <= LIndex; I = I + 1)
sum += I;
if (MyNode != 0)
MPI_Send(&sum, 1, MPI_INT, 0, 1,
MPI_COMM_WORLD);
else
for (int J = 1; J < Nodes; J = J + 1) {
MPI_Recv(&accum, 1, MPI_INT,
J, 1, MPI_COMM_WORLD,
&status);
sum += accum;
}
if (MyNode == 0) {
cout << "Total Nodes is " << Nodes << ".The sum from 1 to 1000 is: " << sum << endl;
}
MPI_Finalize();
return 0;
}
After running, I encounter a problem: integer division by zero in MyNode / Nodes.
Why are MyNode and Nodes zero?
Just pass references to MyNode and Nodes:
MPI_Comm_rank(MPI_COMM_WORLD, &MyNode);
MPI_Comm_size(MPI_COMM_WORLD, &Nodes);
MPI_Comm_size returns MPI_SUCCESS on success. Otherwise, the return value is an error code.
The following calls return an error code, if any:
MyNode = MPI_Comm_rank(MPI_COMM_WORLD, &MyNode);
Nodes = MPI_Comm_size(MPI_COMM_WORLD, &Nodes);
Since you are storing that return value in MyNode and Nodes, and in this case there is no error (MPI_SUCCESS is 0), the correct values written by the calls are immediately overwritten, so MyNode and Nodes end up as 0.
Change it to this
int err;
err = MPI_Comm_rank(MPI_COMM_WORLD, &MyNode);
err = MPI_Comm_size(MPI_COMM_WORLD, &Nodes);
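If you do want to check those return codes, a minimal sketch of one possible style is below (by default MPI aborts on errors anyway, so this is mostly illustrative):
int err;
err = MPI_Comm_rank(MPI_COMM_WORLD, &MyNode);
if (err != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, err);   // bail out on failure
err = MPI_Comm_size(MPI_COMM_WORLD, &Nodes);
if (err != MPI_SUCCESS) MPI_Abort(MPI_COMM_WORLD, err);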

MPI - How to partition and communicate my array portions between master and worker processes

I am having a problem executing my master/worker MPI program.
The goal is to have the master pass portions of the integer array to the workers, have the workers sort their portions, and then return each sorted portion to the master process, which then combines the portions into finalArray[].
I think it has something to do with how I'm passing the portions of the array between processes, but I can't seem to think of anything new to try.
My code:
int compare(const void * a, const void * b) // used for quick sort method
{
if (*(int*)a < *(int*)b) return -1;
if (*(int*)a > *(int*)b) return 1;
return 0;
}
const int arraySize = 10000;
int main(int argc, char ** argv)
{
int rank;
int numProcesses;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numProcesses);
const int PART = floor(arraySize / (numProcesses - 1));
auto start = std::chrono::high_resolution_clock::now(); //start timer
//================================= MASTER PROCESS =================================
if (rank == 0)
{
int bigArray[arraySize];
int finalArray[arraySize];
for (int i = 0; i < arraySize; i++) //random number generator
{
bigArray[i] = rand();
}
for (int i = 0; i < numProcesses - 1; i++)
{
MPI_Send(&bigArray, PART, MPI_INT, i + 1, 0, MPI_COMM_WORLD); // send elements of the array
}
for (int i = 0; i < numProcesses - 1; i++)
{
std::unique_ptr<int[]> tmpArray(new int[PART]);
MPI_Recv(&tmpArray, PART, MPI_INT, i + 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); //receive sorted array from workers
for (int k = 0; k < PART; k++)
{
finalArray[PART * i + k] = tmpArray[k];
}
}
for (int m = 0; m < arraySize; m++)
{
printf(" Sorted Array: %d \n", finalArray[m]); //print my sorted array
}
}
//================================ WORKER PROCESSES ===============================
if (rank != 0)
{
std::unique_ptr<int[]> tmpArray(new int[PART]);
MPI_Recv(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); //receive data into locally initialized array
qsort(&tmpArray, PART, sizeof(int), compare); // quick sort
MPI_Send(&tmpArray, PART, MPI_INT, 0, 0, MPI_COMM_WORLD); //send sorted array back to rank 0
}
MPI_Barrier(MPI_COMM_WORLD);
auto end = std::chrono::high_resolution_clock::now(); //end timer
std::cout << "process took: "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count() //prints timer
<< " nanoseconds\n ";
MPI_Finalize();
return 0;
}
I am fairly new to MPI and C++, so any advice on either subject related to this problem is extremely helpful. I realize there may be many problems with this code, so thank you in advance for any help.
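As a general note on the buffer arguments (a sketch, not a verified fix for this particular program): MPI calls expect a pointer to the data itself, so with a std::unique_ptr<int[]> the raw buffer is usually passed via .get(), for example:
// Worker side, passing the raw int* owned by the unique_ptr to MPI and qsort.
std::unique_ptr<int[]> tmpArray(new int[PART]);
MPI_Recv(tmpArray.get(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
qsort(tmpArray.get(), PART, sizeof(int), compare);
MPI_Send(tmpArray.get(), PART, MPI_INT, 0, 0, MPI_COMM_WORLD);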

What is the easier way to split an array

I have a problem when trying to split an array into several subarrays.
To be more exact, I have an array, let's say int a[10]={1,3,2,7,8,12,5,7,68,10}, and I'm running my program on X processes (at the moment I'm using 8, but it could be more or fewer).
I want to send each process a part of this array; for example, with my array each process would receive something like process0 = {1, 3}, process1 = {2, 7} and so on, until process7 = {68, 10}.
After I've sent each subarray, I will do some operations on each subarray and then merge all the subarrays back into one.
I've searched on Google a lot and I saw some examples using MPI_Send and MPI_Recv or MPI_Scatter and MPI_Gather, and I've tried some methods, but everything I've tried was without success: I get errors or a
null pointer...
My Code:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 32
int A[N];
int main(int argc, char *argv[]) {
int size;
int rank;
const int ROOT = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int count = N / (size - 1);
int *localArray = (int *) malloc(count * sizeof(int));
if (rank == ROOT) {
for (int i = 0; i < N; i++) {
A[i] = rand() % 10;
}
for (int dest = 1; dest < size; ++dest) {
MPI_Send(&A[(dest - 1) * count], count, MPI_INT, dest, tag, MPI_COMM_WORLD);
printf("P0 sent a %d elements to P%d.\n", count, dest);
}
for (int source = 1; source < size; source++) {
MPI_Recv(localArray, count, MPI_INT, source, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
//--------------------------------MERGE THE ALL RESULTS INTO A SORTED ARRAY-------------------------------------
printf("Received results from task %d\n", source);
}
}
else {
MPI_Recv(localArray, count, MPI_INT, ROOT, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
//---------------SORT THE localArray-------------------
MPI_Send(localArray, count, MPI_INT, ROOT, tag, MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
Whatever I've tried, I can't get results where I've put the comment. What am I doing wrong?
While dreamcrash has already suggested that you could clean up your code using scatter & gather, I would put even more emphasis on this: use the built-in collective operations wherever possible. Do not try to rebuild them on your own. Not only is the code cleaner and easier to understand, it will also be significantly faster and allows all sorts of optimizations by the MPI implementation. Your example (assuming N is divisible by size) becomes:
if (rank == ROOT) {
for (int i = 0; i < N; i++) {
A[i] = rand() % 10;
}
}
MPI_Scatter(A, count, MPI_INT, localArray, count, MPI_INT, ROOT, MPI_COMM_WORLD);
//---------------SORT THE localArray-------------------
MPI_Gather(localArray, count, MPI_INT, A, count, MPI_INT, ROOT, MPI_COMM_WORLD);
MPI_Finalize();
Note that the ROOT rank correctly participates in the computation and sends data to itself via scatter/gather without any additional code path.
Now, since your example explicitly uses N=10, which is not divisible by size=8, here is a version that works correctly. The idea is to distribute the remainder of the integer division evenly across the first remainder ranks (each of them gets one additional element to work on). You have to do that regardless of whether you use send/recv or scatter/gather. With scatter/gather you use the MPI_Scatterv / MPI_Gatherv variants, which take an array of send counts (how many elements each rank gets) and displacements (the offset of each local part within the global array):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 32
int A[N];
int main(int argc, char *argv[]) {
int size;
int rank;
const int ROOT = 0;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// compute the work distribution
int remainder = N % size;
int local_counts[size], offsets[size];
int sum = 0;
for (int i = 0; i < size; i++) {
local_counts[i] = N / size;
if (remainder > 0) {
local_counts[i] += 1;
remainder--;
}
offsets[i] = sum;
sum += local_counts[i];
}
int localArray[local_counts[rank]];
if (rank == ROOT) {
for (int i = 0; i < N; i++) {
A[i] = rand() % 10;
}
}
MPI_Scatterv(A, local_counts, offsets, MPI_INT, localArray, local_counts[rank], MPI_INT, ROOT, MPI_COMM_WORLD);
//---------------SORT THE localArray-------------------
MPI_Gatherv(localArray, local_counts[rank], MPI_INT, A, local_counts, offsets, MPI_INT, ROOT, MPI_COMM_WORLD);
MPI_Finalize();
return 0;
}
Change your code to something like this:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define N 32
int A[N]; // this should be global
int main(int argc, char *argv[]) {
int size;
int rank;
const int VERY_LARGE_INT = 999999;
const int ROOT = 0;
int tag = 1234;
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int count = N / size ;
int *localArray = (int *) malloc(count * sizeof(int));
int localMin; // minimum computed on rank i
int globalMin; // will only be valid on rank == ROOT
if (rank == ROOT) {
for (int i = 0; i < N; i++) {
A[i] = rand() % 10;
}
// master local copy
for (int i = 0; i < count; i++)
localArray[i] = A[i];
for (int dest = 1; dest < size; ++dest) {
MPI_Send(&A[dest* count], count, MPI_INT, dest, tag, MPI_COMM_WORLD);
printf("P0 sent a %d elements to P%d.\n", count, dest);
}
localMin = VERY_LARGE_INT;
for (int source = 1; source < size; source++)
{
MPI_Recv(localArray, count, MPI_INT, source, 2, MPI_COMM_WORLD,
MPI_STATUS_IGNORE);
//--------------------------------I CANT GET RESULT HERE-------------------------------------
printf("Received results from task %d\n", source);
}
}
else
{
MPI_Recv(localArray, count, MPI_INT, ROOT, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
//.. do something
MPI_Send(localArray, count, MPI_INT, ROOT, 2, MPI_COMM_WORLD);
}
MPI_Finalize();
return 0;
}
Some mistakes:
Array A is global, therefore all processes will have it; you most likely want to allocate it only on the master process.
I changed N / (size - 1) to N / size; however, be aware that this only works when N % size == 0, so you might want to handle the case where it does not divide evenly.
Since the master keeps a sub-copy of the global array, I perform this local copy from A to localArray before sending the data to the slaves:
// master local copy
for (int i = 0; i < count; i++)
localArray[i] = A[i];
You have a small mistake in the merging part: the master and the slaves were using different tags, and that was causing a deadlock. That is why I also changed this:
MPI_Send(localArray, count, MPI_INT, ROOT, tag, MPI_COMM_WORLD);
to
MPI_Send(localArray, count, MPI_INT, ROOT, 2, MPI_COMM_WORLD);
Both now use the same tag (2).
You could implement this code with scatter and gather and it would be a lot cleaner; see here for some examples.
Another minor issue: if you are using C instead of C++, then instead of int *localArray = (int *) malloc(count * sizeof(int)); you should write int *localArray = malloc(count * sizeof(int)); see here why (in C the cast of malloc's return value is unnecessary).
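And if the code really is C++, a common alternative (just a sketch of the idiom, not something MPI requires) is to skip malloc entirely and use std::vector, whose data() pointer can be passed to the MPI calls. With the same count, ROOT, and tag as in the code above, the worker-side lines would become:
// needs #include <vector>
std::vector<int> localArray(count);   // zero-initialized, freed automatically
MPI_Recv(localArray.data(), count, MPI_INT, ROOT, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
//.. do something with localArray
MPI_Send(localArray.data(), count, MPI_INT, ROOT, 2, MPI_COMM_WORLD);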