How to use shared global datasets in MPI? - c++

I am an MPI beginner. I have a large array gmat of numbers (type double, dimensions 1x14000000) which is precomputed and stored in a binary file. It will use approximately 100 MB in memory (14000000 x8 bytes /1024 /1024). I want to write a MPI code which will do some computation on this array (for example, multiply all elements of gmat by rank number of the process). This array gmat itself stays constant during run-time.
The code is supposed to be something like this
#include <iostream>
#include "mpi.h"
double* gmat;
long int imax;
int main(int argc, char* argv[])
{
void performcomputation(int rank); // this function performs the computation and will be called by all processes
imax=atoi(argv[1]); // user inputs the length of gmat
MPI::Init();
rank = MPI::COMM_WORLD.Get_rank();
size = MPI::COMM_WORLD.Get_size(); //i will use -np 16 = 4 processors x 4 cores
if rank==0 // read the gmat array using one of the processes
{
gmat = new double[imax];
// read values of gmat from a file
// next line is supposed to broadcast values of gmat to all processes which will use it
MPI::COMM_WORLD.Bcast(&gmat,imax,MPI::DOUBLE,1);
}
MPI::COMM_WORLD.Barrier();
performcomputation(rank);
MPI::Finalize();
return 0;
}
void performcomputation(int rank)
{
int i;
for (i=0;i <imax; i++)
cout << "the new value is" << gmat[i]*rank << endl;
}
My question is when I run this code using 16 processes (-np 16), is the gmat same for all of them ? I mean, will the code use 16 x 100 MB in memory to store gmat for each process or will it use only 100 MB since I have defined gmat to be global ? And I don't want different processes to read gmat separately from the file, since reading so many numbers takes time. What is a better way to do this ? Thanks.

First of all, please do not use the MPI C++ bindings. These were deprecated in MPI-2.2 and then deleted in MPI-3.0 and therefore no longer part of the specification, which means that future MPI implementations are not even required to provide C++ bindings, and that if they do, they will probably diverge in the way the interface looks like.
That said, your code contains a very common mistake:
if rank==0 // read the gmat array using one of the processes
{
gmat = new double[imax];
// read values of gmat from a file
// next line is supposed to broadcast values of gmat to all processes which will use it
MPI::COMM_WORLD.Bcast(&gmat,imax,MPI::DOUBLE,1);
}
This won't work as there are four errors here. First, gmat is only allocated at rank 0 and not allocated in the others ranks, which is not what you want. Second, you are giving Bcast the address of the pointer gmat and not the address of the data pointed by it (i.e. you should not use the & operator). You also broadcast from rank 0 but put 1 as the broadcast root argument. But the most important error is that MPI_BCAST is a collective communication call and all ranks are required to call it with the same value of the root argument in order for it to complete successfully. The correct code (using the C bindings instead of the C++ ones) is:
gmat = new double[imax];
if (rank == 0)
{
// read values of gmat from a file
}
MPI_Bcast(gmat, imax, MPI_DOUBLE, 0, MPI_COMM_WORLD);
// ^^^^ ^^^
// no & root == 0
Each rank has its own copy of gmat. Initially all values are different (e.g. random or all zeros, depends on the memory allocator). After the broadcast all copies will become identical to the copy of gmat at rank 0. After the call to performcomputation() each copy will be different again since each rank multiplies the elements of gmat with a different number. The answer to your question is: the code will use 100 MiB in each rank, therefore 16 x 100 MiB in total.
MPI deals with distributed memory - processes do not share variables, no matter if they are local or global ones. The only way to share data is to use MPI calls like point-to-point communication (e.g. MPI_SEND / MPI_RECV), collective calls (e.g. MPI_BCAST) or one-sided communication (e.g. MPI_PUT / MPI_GET).

Related

Passing large 2d dimentional array in MPI C++

I have a task to speed up a program using MPI.
Let's assume I have a large 2d array (1000x1000 or bigger) on the input. I have a working sequential program that divides, so the 2d array into chunks (for example 10x10) and calculates the result which is double for each chuck. (so we have a function which argument is 2d array 10x10 and a result is a double number).
My first idea to speed up:
Create 1d array of size N*N (for example 10x10 = 100) and Send array to another process
double* buffer = new double[dataPortionSize];
//copy some data to buffer
MPI_Send(buffer, dataPortionSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
Recieve it in another process, calculate result, send back the result
double* buf = new double[dataPortionSize];
MPI_Recv(buf, dataPortionSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
double result = function->calc(buf);
MPI_Send(&result, 1, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
This program was much slower than the sequential version. It looks like MPI needs a lot of time to pass an array to another process.
My second idea:
Pass the whole 2d input array to all processes
// data is protected field in base class, it is injected during runtime
MPI_Send(&(data[0][0]), dataSize * dataSize, MPI_DOUBLE, currentProcess, 1, MPI_COMM_WORLD);
And receive data like this
double **arrayAlloc( int size ) {
double **result; result = new double [ size ];
for ( int i = 0; i < size; i++ )
result[ i ] = new double[ size ];
return result;
}
double **data = arrayAlloc(dataSize);
MPI_Recv(&data[0][0], dataSize * dataSize, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, status);
Unfortunately, I got a bunch of errors during execution:
Those crashes are pretty random. It happened 2 times that the program ended successfully
My third idea:
Pass memory address to all processes, but I found this:
MPI processes cannot read each others' memory, and virtual addressing makes one process' pointer completely meaningless to another.
Does anyone have an idea how to speed up it? I understand that the key thing for speed to is pass array/arrays to processes in an efficient way, but I don't have an idea how to do this.
You have multiple issues here. I'll try to go through them in some arbitrary order.
As someone else explained, your second attempt fails because MPI expects you to work with a single consecutive array, not an array of pointers. So you want to allocate something like matrix = new double[rows * cols] and then access individual rows as &matrix[row * cols] or an individual value as matrix[row * cols + col]
This would be a data structure that you can send, receive, scatter, and gather with MPI. It would also be faster in general.
You are correct to assume that MPI takes time to transfer data. Even best case it is the cost of a memcpy. Usually significantly more. If your program is doing too little work before transferring data, it will not be faster.
Your first attempt may have failed because the first process doesn't do anything useful while waiting for the result. You didn't include the receive operation in your code sample. However, if you wrote something like this:
for(int block = 0; block < nblocks; ++block) {
generate_data(buf);
MPI_Send(buf, ...);
MPI_Recv(buf, ...);
}
Then you cannot expect a speedup because the process is not doing anything useful while waiting for the result. You can avoid this with double buffering. Let the first process generate the next data block before waiting in the receive operation for the result. Something like this:
generate_data(0, input); /* 0-th block */
MPI_Send(input, ...);
for(int block = 1; block < nblocks; ++block) {
generate_data(block, input); /* 1st up to nth block */
MPI_Recv(output, ...); /* 0-th up to n-1-th block */
MPI_Send(input, ...);
}
MPI_Recv(output, ...); /* n-th block */
Now calculations in both processes can overlap.
You shouldn't use MPI_Send and MPI_Recv to begin with! MPI is designed for collective operations like MPI_Scatter and MPI_Gather. What you should do, is generate N blocks for N processes, MPI_Scatter them across all processes. Then let each process compute their result. Then MPI_Gather them back at the root process.
Even better, let every process work independently, if possible. Of course this depends on your data but if you can generate and process data blocks independently from one another, don't do any communication. Just let them all work alone. Something like this:
int rank, worldsize;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &worldsize);
for(int block = rank; block < nblocks; block += worldsize) {
process_data(block);
}

Read from 5x10^8 different array elements, 4 bytes each time

So I'm taking an assembly course and have been tasked with making a benchmark program for my computer - needless to say, I'm a bit stuck on this particular piece.
As the title says, we're supposed to create a function to read from 5x108 different array elements, 4 bytes each time. My only problem is, I don't even think it's possible for me to create an array of 500 million elements? So what exactly should I be doing? (For the record, I'm trying to code this in C++)
//Benchmark Program in C++
#include <iostream>
#include <time.h>
using namespace std;
int main() {
clock_t t1,t2;
int readTemp;
int* arr = new int[5*100000000];
t1=clock();
cout << "Memory Test"
<< endl;
for(long long int j=0; j <= 500000000; j+=1)
{
readTemp = arr[j];
}
t2=clock();
float diff ((float)t2-(float)t1);
float seconds = diff / CLOCKS_PER_SEC;
cout << "Time Taken: " << seconds << " seconds" <<endl;
}
Your system tries to allocate 2 billion bytes (1907 MiB), while the maximum available memory for Windows is 2 gigabytes (2048 MiB). These numbers are very close. It's likely your system has allocated the remaining 141 MiB for other stuff. Even though your code is very small, OS is pretty liberal in allocation of the 2048 MiB address space, wasting large chunks for e.g. the following:
C++ runtime (standard library and other libraries)
Stack: OS allocates a lot of memory to support recursive functions; it doesn't matter that you don't have any
Paddings between virtual memory pages
Paddings used just to make specific sections of data appear at specific addresses (e.g. 0x00400000 for lowest code address, or something like that, is used in Windows)
Padding used to randomize the values of pointers
There's a Windows application that shows a memory map of a running process. You can use it by adding a delay (e.g. getchar()) before the allocation and looking at the largest contiguous free block of memory at that point, and which allocations prevent it from being large enough.
The size is possible :
5 * 10^8 * 4 = ~1.9 GB.
First you will need to allocate your array (dynamically only ! There's no such stack memory).
For your task the 4 bytes is the size of an interger, so you can do it
int* arr = new int[5*100000000];
Alternatively, if you want to be more precise, you can allocate it as bytes
int* arr = new char[5*4*100000000];
Next, you need to make the memory dirty (meaning write something into it) :
memset(arr,0,5*100000000*sizeof(int));
Now, you can benchmark cache misses (I'm guessing that's what it's intended in such a huge array) :
int randomIndex= GetRandomNumberBetween(0,5*100000000-1); // make your own random implementation
int bytes = arr[randomIndex]; // access 4 bytes through integer
If you want 5* 10 ^8 accesses randomly you can make a knuth shuffle inside your getRandomNumber instead of using pure random.

MPI_Exchanging sides of vector between processes

In my code I have an arbitrary number of processes exchanging some parts of their local vectors. The local vectors are vectors of pairs and for this reason I've been using an MPI derived datatype. In principle I don't know how many elements each process sends to the others and for this reason I also have to send the size of the buffer. In particular, each process exchanges data with the process with rank: myrank-1 and with the process with rank: myrank+1. In case of process 0 instead of myrank-1 it exchanges with process with rank comm_size-1. And as well in case of process comm_size-1 instead of myrank+1 it exchanges with process with rank 0.
This is my code:
unsigned int size1tobesent;
size1tobesent=last.size();//Buffer size
int slpartner = (rank + 1) % p;
int rlpartner = (rank - 1 + p) % p;
unsigned int sizereceived1;
MPI_Sendrecv(&size1tobesent, 1, MPI_UNSIGNED, slpartner, 0,&sizereceived1,1,
MPI_UNSIGNED, rlpartner, 0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
Vect first1(sizereceived1);
MPI_Sendrecv(&last[0], last.size(), mytype, slpartner, 0,&first1[0],sizereceived1,
mytype, rlpartner, 0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
unsigned int size2tobesent;
size2tobesent=first.size();//Buffer size2
unsigned int sizereceived2;
MPI_Sendrecv(&size2tobesent, 1, MPI_UNSIGNED, rlpartner, 0,
&sizereceived2,1,MPI_UNSIGNED, slpartner, 0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
Vect last1(sizereceived2);
MPI_Sendrecv(&first[0], first.size(), mytype, rlpartner, 0,&last1[0],
sizereceived2 ,mytype, slpartner, 0,MPI_COMM_WORLD,MPI_STATUS_IGNORE);
Now when I run my code with 2 or 3 processes all works as expected. With more than 3 the results are unpredictable. I don't know if this is due to a particular combination of the input data or if there are some theoretical errors that I'm missing.
Finally consider that this code is part of a for cycle.

Simple MPI_Scatter try

I am just learning OpenMPI. Tried a simple MPI_Scatter example:
#include <mpi.h>
using namespace std;
int main() {
int numProcs, rank;
MPI_Init(NULL, NULL);
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int* data;
int num;
data = new int[5];
data[0] = 0;
data[1] = 1;
data[2] = 2;
data[3] = 3;
data[4] = 4;
MPI_Scatter(data, 5, MPI_INT, &num, 5, MPI_INT, 0, MPI_COMM_WORLD);
cout << rank << " recieved " << num << endl;
MPI_Finalize();
return 0;
}
But it didn't work as expected ...
I was expecting something like
0 received 0
1 received 1
2 received 2 ...
But what I got was
32609 received
1761637486 received
1 received
33 received
1601007716 received
Whats with the weird ranks? Seems to be something to do with my scatter? Also, why is the sendcount and recvcount the same? At first I thought since I'm scattering 5 elements to 5 processors, each will get 1? So I should be using:
MPI_Scatter(data, 5, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
But this gives an error:
[JM:2861] *** An error occurred in MPI_Scatter
[JM:2861] *** on communicator MPI_COMM_WORLD
[JM:2861] *** MPI_ERR_TRUNCATE: message truncated
[JM:2861] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
I am wondering though, why doing I need to differentiate between root and child processes? Seems like in this case, the source/root will also get a copy? Another thing is will other processes run scatter too? Probably not, but why? I thought all processes will run this code since its not in the typical if I see in MPI programs?
if (rank == xxx) {
UPDATE
I noticed to run, send and receive buffer must be of same length ... and the data should be declared like:
int data[5][5] = { {0}, {5}, {10}, {3}, {4} };
Notice the columns is declared as length 5 but I only initialized 1 value? What is actually happening here? Is this code correct? Suppose I only want each process to receive 1 value only.
sendcount is the number of elements you want to send to each process, not the count of elements in the send buffer. MPI_Scatter will just take sendcount * [number of processes in the communicator] elements from the send buffer from the root process and scatter it to all processes in the communicator.
So to send 1 element to each of the processes in the communicator (assume there are 5 processes), set sendcount and recvcount to be 1.
MPI_Scatter(data, 1, MPI_INT, &num, 1, MPI_INT, 0, MPI_COMM_WORLD);
There are restrictions on the possible datatype pairs and they are the same as for point-to-point operations. The type map of recvtype should be compatible with the type map of sendtype, i.e. they should have the same list of underlying basic datatypes. Also the receive buffer should be large enough to hold the received message (it might be larger, but not smaller). In most simple cases, the data type on both send and receive sides are the same. So sendcount - recvcount pair and sendtype - recvtype pair usually end up the same. An example where they can differ is when one uses user-defined datatype(s) on either side:
MPI_Datatype vec5int;
MPI_Type_contiguous(5, MPI_INT, &vec5int);
MPI_Type_commit(&vec5int);
MPI_Scatter(data, 5, MPI_INT, local_data, 1, vec5int, 0, MPI_COMM_WORLD);
This works since the sender constructs messages of 5 elements of type MPI_INT while each receiver interprets the message as a single instance of a 5-element integer vector.
(Note that you specify the maximum number of elements to be received in MPI_Recv and the actual amount received might be less, which can be obtained by MPI_Get_count. In contrast, you supply the expected number of elements to be received in recvcount of MPI_Scatter so error will be thrown if the message length received is not exactly the same as promised.)
Probably you know by now that the weird rank printed out is caused by stack corruption, since num can only contains 1 int but 5 int are received in MPI_Scatter.
I am wondering though, why doing I need to differentiate between root and child processes? Seems like in this case, the source/root will also get a copy? Another thing is will other processes run scatter too? Probably not, but why? I thought all processes will run this code since its not in the typical if I see in MPI programs?
It is necessary to differentiate between root and other processes in the communicator (they are not child process of the root since they can be in a separate computer) in some operations such as Scatter and Gather, since these are collective communication (group communication) but with a single source/destination. The single source/destination (the odd one out) is therefore called root. It is necessary for all the processes to know the source/destination (root process) to set up send and receive correctly.
The root process, in case of Scatter, will also receive a piece of data (from itself), and in case of Gather, will also include its data in the final result. There is no exception for the root process, unless "in place" operations are used. This also applies to all collective communication functions.
There are also root-less global communication operations like MPI_Allgather, where one does not provide a root rank. Rather all ranks receive the data being gathered.
All processes in the communicator will run the function (try to exclude one process in the communicator and you will get a deadlock). You can imagine processes on different computer running the same code blindly. However, since each of them may belong to different communicator group and has different rank, the function will run differently. Each process knows whether it is member of the communicator, and each knows the rank of itself and can compare to the rank of the root process (if any), so they can set up the communication or do extra actions accordingly.

High number causes seg fault

This bit of code is from a program I am writing to take in x col and x rows to run a matrix multiplication on CUDA, parallel processing. The larger the sample size, the better.
I have a function that auto generates x amount of random numbers.
I know the answer is simple but I just wanted to know exactly why. But when I run it with say 625000000 elements in the array, it seg faults. I think it is because I have gone over the size allowed in memory for an int.
What data type should I use in place of int for a larger number?
This is how the data is being allocated, then passed into the function.
a.elements = (float*) malloc(mem_size_A);
where
int mem_size_A = sizeof(float) * size_A; //for the example let size_A be 625,000,000
Passed:
randomInit(a.elements, a.rowSize,a.colSize, oRowA, oColA);
What the randomInit is doing is say I enter a 2x2 but I am padding it up to a multiple of 16. So it takes the 2x2 and pads the matrix to a 16x16 of zeros and the 2x2 is still there.
void randomInit(float* data, int newRowSize,int newColSize, int oldRowSize, int oldColSize)
{
printf("Initializing random function. The new sized row is %d\n", newRowSize);
for (int i = 0; i < newRowSize; i++)//go per row of new sized row.
{
for(int j=0;j<newColSize;j++)
{
printf("This loop\n");
if(i<oldRowSize&&j<oldColSize)
{
data[newRowSize*i+j]=rand() / (float)RAND_MAX;//brandom();
}
else
data[newRowSize*i+j]=0;
}
}
}
I've even ran it with the printf in the loop. This is the result I get:
Creating the random numbers now
Initializing random function. The new sized row is 25000
This loop
Segmentation fault
Your memory allocation for data is probably failing.
Fortunately, you almost certainly don't need to store a large collection of random numbers.
Instead of storing:
data[n]=rand() / (float)RAND_MAX
for some huge collection of n, you can run:
srand(n);
value = rand() / (float)RAND_MAX;
when you need a particular number and you'll get the same value every time, as if they were all calculated in advance.
I think you're going past the value you allocated for data. when you're newrowsize is too large, you're accessing unallocated memory.
remember, data isn't infinitely big.
Well the real problem is that, if the problem is really the integer size used for your array access, you will be not able to fix it. I think you probably just have not enough space in your memory so as to store that huge number of data.
If you want to extends that, just define a custom structure or class if you are in C++. But you will loose the O(1) time access complexity involves with array.