Why I get the following error for the following code with mpirun -np 2 ./out command? I called make_layout() after resizing the std::vector so normally I should not get this error. It works if I do not resize. What is the reason?
main.cpp:
#include <iostream>
#include <vector>
#include "mpi.h"
MPI_Datatype MPI_CHILD;
struct Child
{
std::vector<int> age;
void make_layout();
};
void Child::make_layout()
{
int nblock = 1;
int age_size = age.size();
int block_count[nblock] = {age_size};
MPI_Datatype block_type[nblock] = {MPI_INT};
MPI_Aint offset[nblock] = {0};
MPI_Type_struct(nblock, block_count, offset, block_type, &MPI_CHILD);
MPI_Type_commit(&MPI_CHILD);
}
int main()
{
int rank, size;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
Child kid;
kid.age.resize(5);
kid.make_layout();
int datasize;
MPI_Type_size(MPI_CHILD, &datasize);
std::cout << datasize << std::endl; // output: 20 (5x4 seems OK).
if (rank == 0)
{
MPI_Send(&kid, 1, MPI_CHILD, 1, 0, MPI_COMM_WORLD);
}
if (rank == 1)
{
MPI_Recv(&kid, 1, MPI_CHILD, 0, 0, MPI_COMM_WORLD, NULL);
}
MPI_Finalize();
return 0;
}
Error message:
*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x14ae7b8
[ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x113d0)[0x7fe1ad91c3d0]
[ 1] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x22)[0x7fe1ad5c5a92]
[ 2] ./out[0x400de4]
[ 3] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7fe1ad562830]
[ 4] ./out[0x400ec9]
*** End of error message ***
Here is an example with several std::vector members that uses MPI datatypes with absolute addresses:
struct Child
{
int foo;
std::vector<float> bar;
std::vector<int> baz;
Child() : dtype(MPI_DATATYPE_NULL) {}
~Child() { if (dtype != MPI_DATATYPE_NULL) MPI_Type_free(dtype); }
const MPI_Datatype mpi_dtype();
void invalidate_dtype();
private:
MPI_Datatype dtype;
void make_dtype();
};
const MPI_Datatype Child::mpi_dtype()
{
if (dtype == MPI_DATATYPE_NULL)
make_dtype();
return dtype;
}
void Child::invalidate_dtype()
{
if (dtype != MPI_DATATYPE_NULL)
MPI_Datatype_free(&dtype);
}
void Child::make_dtype()
{
const int nblock = 3;
int block_count[nblock] = {1, bar.size(), baz.size()};
MPI_Datatype block_type[nblock] = {MPI_INT, MPI_FLOAT, MPI_INT};
MPI_Aint offset[nblock];
MPI_Get_address(&foo, &offset[0]);
MPI_Get_address(&bar[0], &offset[1]);
MPI_Get_address(&baz[0], &offset[2]);
MPI_Type_struct(nblock, block_count, offset, block_type, &dtype);
MPI_Type_commit(&dtype);
}
Sample use of that class:
Child kid;
kid.foo = 5;
kid.bar.resize(5);
kid.baz.resize(10);
if (rank == 0)
{
MPI_Send(MPI_BOTTOM, 1, kid.mpi_dtype(), 1, 0, MPI_COMM_WORLD);
}
if (rank == 1)
{
MPI_Recv(MPI_BOTTOM, 1, kid.mpi_dtype(), 0, 0, MPI_COMM_WORLD, NULL);
}
Notice the use of MPI_BOTTOM as the buffer address. MPI_BOTTOM specifies the bottom of the address space, which is 0 on architectures with flat address space. Since the offsets passed to MPI_Type_create_struct are the absolute addresses of the structure members, when those are added to 0, the result is again the absolute address of each structure member. Child::mpi_dtype() returns a lazily constructed MPI datatype specific to that instance.
Since resize() reallocates memory, which could result in the data being moved to a different location in memory, the invalidate_dtype() method should be used to force the recreation of the MPI datatype after resize() or any other operation that might trigger memory reallocation:
// ...
kid.bar.resize(100);
kid.invalidate_dtype();
// MPI_Send / MPI_Recv
Please excuse any sloppy C++ code above.
The problem here is that you're telling MPI to send a block of integers from &kid, but that's not where your data is. &kid points to an std::vector object, which has an internal pointer to your block of integers allocated somewhere on the heap.
Replace &kid with kid.age.data() and it should work. The reason it "works" when you don't resize is that the vectors will be of 0 size, so MPI will try to send an empty message and no actual memory access takes place.
Be careful, you faced several problems.
First std::vector stores object in heap, so data is not really stored inside your struct.
Second you are not able to send STL containers even between dynamic libraries, also for app instances this is also true. Because they may be compiled with different versions of STL and work on different architectures differently.
Here is good answer about this part of question: https://stackoverflow.com/a/22797419/440168
Related
Continuing on my CUDA beginner's adventure, I've been introduced to Thrust, which seems a convenient lib that saves me the hassle of explicit memory (de-)allocation.
I've already tried combining it with a few cuBLAS routines, e.g. gemv, by generating a raw pointer to the underlying storage with thrust::raw_pointer_cast(array.data()) and then feeding this to the routines, and it works just fine.
The current task is to get the inverse of a matrix, and for that I'm using getrfBatched and getriBatched. From the documentation:
cublasStatus_t cublasDgetrfBatched(cublasHandle_t handle,
int n,
double *Aarray[],
int lda,
int *PivotArray,
int *infoArray,
int batchSize);
where
Aarray - device - array of pointers to <type> array
Naturally I thought I could use another layer of Thrust vector to express this array of pointers and again feed its raw pointer to cuBLAS, so here's what I did:
void test()
{
thrust::device_vector<double> in(4);
in[0] = 1;
in[1] = 3;
in[2] = 2;
in[3] = 4;
cublasStatus_t stat;
cublasHandle_t handle;
stat = cublasCreate(&handle);
thrust::device_vector<double> out(4, 0);
thrust::device_vector<int> pivot(2, 0);
int info = 0;
thrust::device_vector<double*> in_array(1);
in_array[0] = thrust::raw_pointer_cast(in.data());
thrust::device_vector<double*> out_array(1);
out_array[0] = thrust::raw_pointer_cast(out.data());
stat = cublasDgetrfBatched(handle, 2,
(double**)thrust::raw_pointer_cast(in_array.data()), 2,
thrust::raw_pointer_cast(pivot.data()), &info, 1);
stat = cublasDgetriBatched(handle, 2,
(const double**)thrust::raw_pointer_cast(in_array.data()), 2,
thrust::raw_pointer_cast(pivot.data()),
(double**)thrust::raw_pointer_cast(out_array.data()), 2, &info, 1);
}
When executed, stat says CUBLAS_STATUS_SUCCESS (0) and info says 0 (execution successful), yet if I try to access the elements of in, pivot or out with standard bracket notation, I hit a thrust::system::system_error. Seems to me that the corresponding memory got corrupted somehow.
Anything obvious that I'm missing here?
The documentation for cublas<t>getrfBatched indicates that the infoArray parameter is expected to be a pointer to device memory.
Instead you have passed a pointer to host memory:
int info = 0;
...
stat = cublasDgetrfBatched(handle, 2,
(double**)thrust::raw_pointer_cast(in_array.data()), 2,
thrust::raw_pointer_cast(pivot.data()), &info, 1);
^^^^^
If you run your code with cuda-memcheck (always a good practice, in my opinion, any time you are having trouble with a CUDA code, before asking others for help) you will receive an error of "invalid global write of size 4". This is due to the fact that a kernel launched by cublasDgetrfBatched() is attempting to write the info data to device memory using an ordinary host pointer that you provided, which is always illegal in CUDA.
CUBLAS itself does not trap errors like this for performance reasons. However the thrust API uses more rigorous synchronization and error checking, in some cases. Therefore, the use of thrust code after this error reports the error, even though the error had nothing to do with thrust (it was an asynchronously reported error from a previous kernel launch).
The solution is straightforward; provide device storage for info:
$ cat t329.cu
#include <thrust/device_vector.h>
#include <cublas_v2.h>
#include <iostream>
void test()
{
thrust::device_vector<double> in(4);
in[0] = 1;
in[1] = 3;
in[2] = 2;
in[3] = 4;
cublasStatus_t stat;
cublasHandle_t handle;
stat = cublasCreate(&handle);
thrust::device_vector<double> out(4, 0);
thrust::device_vector<int> pivot(2, 0);
thrust::device_vector<int> info(1, 0);
thrust::device_vector<double*> in_array(1);
in_array[0] = thrust::raw_pointer_cast(in.data());
thrust::device_vector<double*> out_array(1);
out_array[0] = thrust::raw_pointer_cast(out.data());
stat = cublasDgetrfBatched(handle, 2,
(double**)thrust::raw_pointer_cast(in_array.data()), 2,
thrust::raw_pointer_cast(pivot.data()), thrust::raw_pointer_cast(info.data()), 1);
stat = cublasDgetriBatched(handle, 2,
(const double**)thrust::raw_pointer_cast(in_array.data()), 2,
thrust::raw_pointer_cast(pivot.data()),
(double**)thrust::raw_pointer_cast(out_array.data()), 2, thrust::raw_pointer_cast(info.data()), 1);
for (int i = 0; i < 4; i++) {
double test = in[i];
std::cout << test << std::endl;
}
}
int main(){
test();
}
$ nvcc -o t329 t329.cu -lcublas
t329.cu(12): warning: variable "stat" was set but never used
$ cuda-memcheck ./t329
========= CUDA-MEMCHECK
3
0.333333
4
0.666667
========= ERROR SUMMARY: 0 errors
$
You'll note this change in the above code is applied to usage for both cublas calls, as the infoArray parameter has the same expectations for both.
I'm finishing off a simple MPI program and I'm struggling on the last part of the project.
I send 2 ints containing a start point and end point to the slave node. And using these I need to create an array and populate it. I need to send this back to the Master node. Slave code below:
printf("Client waiting for start point and endpoint array\n");fflush(stdout);
int startEnd [2];
MPI_Recv(startEnd, 2, MPI_INT, 0, 100, MPI_COMM_WORLD, &status);
int end = startEnd[1];
int start = startEnd[0];
printf("Recieved Start End of %d \t %d\n", startEnd[0], startEnd[1]);fflush(stdout);
unsigned char TargetHash[MAX_HASH_LEN];
MPI_Recv(TargetHash, MAX_HASH_LEN, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
int sizeToCompute = (end - start);
uint64* pStartPosIndexE = new uint64[sizeToCompute];
int iterator = 0;
for (int nPos = end; nPos >= start; nPos--)
{
cwc.SetHash(TargetHash);
cwc.HashToIndex(nPos);
int i;
for (i = nPos + 1; i <= cwc.GetRainbowChainLength() - 2; i++)
{
cwc.IndexToPlain();
cwc.PlainToHash();
cwc.HashToIndex(i);
}
pStartPosIndexE[iterator] = cwc.GetIndex();
}
Is this the correct way to create the array of dynamic length and how would I send this array back to the master node?
Sending dynamically allocated arrays is no different than sending static arrays. When the array size varies, the receive code gets a bit more complicated, but not that much more complicated:
// ---------- Sender code ----------
MPI_Send(pStartPosIndexE, sizeToCompute, MPI_UINT64, 99, ...);
// --------- Receiver code ---------
// Wait for a message with tag 99
MPI_Status status;
MPI_Probe(MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &status);
// Get the number of elements in the message
int nElems;
MPI_Get_elements(&status, MPI_UINT64_T, &nElems);
// Allocate buffer of appropriate size
uint64 *result = new uint64[nElems];
// Receive the message
MPI_Recv(result, nElems, MPI_UINT64_T, status.MPI_SOURCE, 99, ...);
Using MPI_Probe with source rank of MPI_ANY_SOURCE is what is usually done in master/worker applications where workers are processed on a first-come-first-served basis.
I have a serial C++ program that I wish to parallelize. I know the basics of MPI, MPI_Send, MPI_Recv, etc. Basically, I have a data generation algorithm that runs significantly faster than the data processing algorithm. Currently they run in series, but I was thinking that running the data generation in the root process, having the data processing done on the slave processes, and sending a message from the root to a slave containing the data to be processed. This way, each slave processes a data set and then waits for its next data set.
The problem is that, once the root process is done generating data, the program hangs because the slaves are waiting for more.
This is an example of the problem:
#include "mpi.h"
#include <cassert>
#include <cstdio>
class Generator {
public:
Generator(int min, int max) : value(min - 1), max(max) {}
bool NextValue() {
++value;
return value < max;
}
int Value() { return value; }
private:
int value, max;
Generator() {}
Generator(const Generator &other) {}
Generator &operator=(const Generator &other) { return *this; }
};
long fibonnaci(int n) {
assert(n > 0);
if (n == 1 || n == 2) return 1;
return fibonnaci(n-1) + fibonnaci(n-2);
}
int main(int argc, char **argv) {
MPI_Init(&argc, &argv);
int rank, num_procs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_procs);
if (rank == 0) {
Generator generator(1, 2 * num_procs);
int proc = 1;
while (generator.NextValue()) {
int value = generator.Value();
MPI_Send(&value, 1, MPI_INT, proc, 73, MPI_COMM_WORLD);
printf("** Sent %d to process %d.\n", value, proc);
proc = proc % (num_procs - 1) + 1;
}
} else {
while (true) {
int value;
MPI_Status status;
MPI_Recv(&value, 1, MPI_INT, 0, 73, MPI_COMM_WORLD, &status);
printf("** Received %d from process %d.\n", value, status.MPI_SOURCE);
printf("Process %d computed %d.\n", rank, fibonnaci(2 * (value + 10)));
}
}
MPI_Finalize();
return 0;
}
Obviously not everything above is "good practice", but it is sufficient to get the point across.
If I remove the while(true) from the slave processes, then the program exits when each of the slaves have exited. I would like the program to exit only after the root process has done its job AND all of the slaves have processed everything that has been sent.
If I knew how many data sets would be generated, I could have that many process running and everything would exit nicely, but that isn't the case here.
Any suggestions? Is there anything in the API that will do this? Could this be solved better with a better topology? Would MPI_Isend or MPI_IRecv do this better? I am fairly new to MPI so bear with me.
Thanks
The usual practice is to send to all worker processes an empty message with a special tag that signals them to exit the infinite processing loop. Let's say this tag is 42. You would do something like that in the worker loop:
while (true) {
int value;
MPI_Status status;
MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
if (status.MPI_TAG == 42) {
printf("Process %d exiting work loop.\n", rank);
break;
}
printf("** Received %d from process %d.\n", value, status.MPI_SOURCE);
printf("Process %d computed %d.\n", rank, fibonnaci(2 * (value + 10)));
}
The manager process would do something like this after the generator loop:
for (int i = 1; i < num_procs; i++)
MPI_Send(&i, 0, MPI_INT, i, 42, MPI_COMM_WORLD);
Regarding your next question. Using MPI_Isend() in the master process would deserialise the execution and increase the performance. The truth however is that you are sending very small messages and those are typically internally buffered (WARNING - implementation dependent!) so your MPI_Send() is actually non-blocking and you already have non-serial execution. MPI_Isend() returns an MPI_Request handle that you need to take care of later. You could either wait for it to finish with MPI_Wait() or MPI_Waitall() but you could also just call MPI_Request_free() on it and it will be automatically freed when the operation is over. This is usually done when you'd like to send many messages asynchronously and would not care on when the sends will be completed, but it's a bad practice nevertheless since having a large number of outstanding requests can consume lots of precious memory. As for the worker processes - they need the data in order to proceed with the computation so using MPI_Irecv() is not necessary.
Welcome to the wonderful world of MPI programming!
I had a problem with a program that uses MPI and I have just fixed it, however, I don't seem to understand what was wrong in the first place. I'm quite green with programming relates stuff, so please be forgiving.
The program is:
#include <iostream>
#include <cstdlib>
#include <mpi.h>
#define RNumber 3
using namespace std;
int main() {
/*Initiliaze MPI*/
int my_rank; //My process rank
int comm_sz; //Number of processes
MPI_Comm GathComm; //Communicator for MPI_Gather
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
/*Initialize an array for results*/
long rawT[RNumber];
long * Times = NULL; //Results from threads
if (my_rank == 0) Times = (long*) malloc(comm_sz*RNumber*sizeof(long));
/*Fill rawT with results at threads*/
for (int i = 0; i < RNumber; i++) {
rawT[i] = i;
}
if (my_rank == 0) {
/*Main thread recieves data from other threads*/
MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber, MPI_LONG, 0, GathComm);
}
else {
/*Other threads send calculation results to main thread*/
MPI_Gather(rawT, RNumber, MPI_LONG, Times, RNumber, MPI_LONG, 0, GathComm);
}
/*Finalize MPI*/
MPI_Finalize();
return 0;
};
On execution the program returns the following message:
Fatal error in PMPI_Gather: Invalid communicator, error stack:
PMPI_Gather(863): MPI_Gather(sbuf=0xbf824b70, scount=3, MPI_LONG,
rbuf=0x98c55d8, rcount=3, MPI_LONG, root=0, comm=0xe61030) failed
PMPI_Gather(757): Invalid communicator Fatal error in PMPI_Gather:
Invalid communicator, error stack: PMPI_Gather(863):
MPI_Gather(sbuf=0xbf938960, scount=3, MPI_LONG, rbuf=(nil), rcount=3,
MPI_LONG, root=0, comm=0xa6e030) failed PMPI_Gather(757): Invalid
communicator
After I remove GathComm altogether and substitute it with MPI_COMM_WORLD default communicator everything works fine.
Could anyone be so kind to explain what was I doing wrong and how did this adjustment made everything work?
That's because GathComm has not been assigned a valid communicator. "MPI_Comm GathComm;" only declares the variable to hold a communicator but doesn't create one.
You can use the default communicator (MPI_COMM_WORLD) if you simply want to include all procs in the operation.
Custom communicators are useful when you want to organised your procs in separate groups or when using virtual communication topologies.
To find out more, check out this article which describes Groups, Communicator and Topologies.
This code sampler is used to learn MPI programming. The MPI package I use is MPICH2 1.3.1. The code below is my first step to learn MPI_Isend(), MPI_Irecv() and MPI_Wait(). The code has a master and several workers. Master receives data from workers while workers send data to master. As usual, the data size is very large, workers split data into trunks and send trunks sequentially. I use some tricks to overlap the computation and communication when sending trunks. The method is very simple, just keeping two buffers to hold two trunks for each sending cycle.
int test_mpi_wait_2(int argc, char* argv[])
{
int rank;
int numprocs;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
int trunk_num = 6;// assume there are six trunks
int trunk_size = 10000;// assume each trunk has 10,000 data points
if(rank == 0)
{
//allocate receiving buffer for all workers
int** recv_buf = new int* [numprocs];
for(int i=0;i<numprocs;i++)
recv_buf[i] = new int [trunk_size];
//collecting first trunk from all workers
MPI_Request* requests = new MPI_Request[numprocs];
for(int i=1;i<numprocs;i++)
MPI_Irecv(recv_buf[i], trunk_size, MPI_INT, i, 0, MPI_COMM_WORLD, &requests[i]);
//define send_buf counter used to record how many trunks have been collected
vector<int> counter(numprocs);
MPI_Status status;
//assume therer are N-1 workers, then the total trunks will be collected is (N-1)*trunk_num
for(int i=0;i<(numprocs-1)*trunk_num;i++)
{
//wait until receive one trunk from any worker
int active_index;
MPI_Waitany(numprocs-1, requests+1, &active_index, &status);
int request_index = active_index + 1;
int procs_index = active_index + 1;
//check wheather all trunks from this worker have been collected
if(++counter[procs_index] != trunk_num)
{
//receive next trunk from this worker
MPI_Irecv(recv_buf[procs_index], trunk_size, MPI_INT, procs_index, 0, MPI_COMM_WORLD, &requests[request_index]);
}
}
for(int i=0;i<numprocs;i++)
delete [] recv_buf[i];
delete [] recv_buf;
delete [] requests;
cout<<rank<<" done"<<endl;
}
else
{
//for each worker, the worker first fill one trunk and send it to master
//for efficiency, the computation of trunk and communication to master is overlapped.
//two buffers are allocated to implement the overlapped computation
int* send_buf[2];
send_buf[0] = new int [trunk_size];//Buffer A
send_buf[1] = new int [trunk_size];//Buffer B
MPI_Request requests[2];
//file first trunk
for(int i=0;i<trunk_size;i++)
send_buf[0][i] = 0;
//send this trunk
MPI_Isend(send_buf[0], trunk_size, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[0]);
if(trunk_num > 1)
{
//file second trunk
for(int i=0;i<trunk_size;i++)
send_buf[1][i] = i;
//send this trunk
MPI_Isend(send_buf[1], trunk_size, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[1]);
}
//for remained trunks, keep cycle until all trunks are sent
for(int i=2;i<trunk_num;i+=2)
{
//wait till trunk data at buffer A is sent
MPI_Wait(&requests[0], MPI_STATUS_IGNORE);
//fill buffer A with next trunk data
for(int j=0;j<trunk_size;j++)
send_buf[0][j] = j * i;
//send buffer A
MPI_Isend(send_buf[0], trunk_size, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[0]);
//if more trunks are remained, fill buffer B and sent it
if(i+ 1 < trunk_num)
{
MPI_Wait(&requests[1], MPI_STATUS_IGNORE);
for(int j=0;j<trunk_size;j++)
send_buf[1][j] = j * (i + 1);
MPI_Isend(send_buf[1], trunk_size, MPI_INT, 0, 0, MPI_COMM_WORLD, &requests[1]);
}
}
//wait until last two trunks have been sent
if(trunk_num == 1)
{
MPI_Wait(&requests[0], MPI_STATUS_IGNORE);
}
else
{
MPI_Wait(&requests[0], MPI_STATUS_IGNORE);
MPI_Wait(&requests[1], MPI_STATUS_IGNORE);
}
delete [] send_buf[0];
delete [] send_buf[1];
cout<<rank<<" done"<<endl;
}
MPI_Finalize();
return 0;
}
Not much of an answer but this compiles and runs on my version of MPI, with up to 4 processors. The code does seem a bit involved, but I also cannot see any reason why it should not work.
I see several obvious ones: some for loops are not terminated, some cout statements aren't terminated, etc. I believe the code wasn't formatted properly...