After much Googling, I have no idea what's causing this issue. Here it is:
I have a simple call to MPI_Allgather in my code, which I have double-, triple-, and quadruple-checked to be correct (the send/receive buffers are properly sized, and the send/receive sizes in the call are correct), but for 'large' numbers of processes I get either a deadlock or an MPI_ERR_TRUNCATE. The communicator used for the Allgather is split from MPI_COMM_WORLD using MPI_Comm_split. For my current testing, rank 0 goes to one communicator and the remaining ranks go to a second communicator. With 6 total ranks or fewer, the Allgather works just fine. With 7 ranks, I get an MPI_ERR_TRUNCATE; with 8 ranks, a deadlock. I have verified that the communicators were split correctly (MPI_Comm_rank and MPI_Comm_size are correct on all ranks for both communicators).
I have manually verified the size of each send and receive buffer and the maximal number of receives. My first workaround was to swap the MPI_Allgather for a for-loop of MPI_Gather calls, one to each process. That worked for that one case, but changing the meshes given to my code (CFD grids being partitioned using METIS) brought the problem back. Now my solution, which I haven't been able to break (yet), is to replace the Allgather with an Allgatherv, which I suppose is more efficient anyway since I send a different number of pieces of data from each process.
Here's the (I hope) relevant offending code in context; if I've missed something, the Allgather in question is on line 599 of this file.
// Get the number of mpiFaces on each processor (for later communication)
// 'nProcGrid' is the size of the communicator 'gridComm'
vector<int> nMpiFaces_proc(nProcGrid);
// This MPI_Allgather works just fine, every time
// int nMpiFaces is assigned on preceding lines
MPI_Allgather(&nMpiFaces,1,MPI_INT,nMpiFaces_proc.data(),1,MPI_INT,gridComm);
int maxNodesPerFace = (nDims==2) ? 2 : 4;
int maxNMpiFaces = getMax(nMpiFaces_proc);
// The matrix class is just a fancy wrapper around std::vector that
// allows for (i,j) indexing. The getSize() and getData() methods just
// call the size() and data() methods, respectively, of the underlying
// vector<int> object.
matrix<int> mpiFaceNodes_proc(nProcGrid,maxNMpiFaces*maxNodesPerFace);
// This is the MPI_Allgather which (sometimes) doesn't work.
// vector<int> mpiFaceNodes is assigned in preceding lines
MPI_Allgather(mpiFaceNodes.data(),mpiFaceNodes.size(),MPI_INT,
              mpiFaceNodes_proc.getData(),maxNMpiFaces*maxNodesPerFace,
              MPI_INT,gridComm);
I am currently using OpenMPI 1.6.4, g++ 4.9.2, and an AMD FX-8350 8-core processor with 16GB of RAM, running the latest updates of Elementary OS Freya 0.3 (basically Ubuntu 14.04). However, I have also had this issue on another machine using CentOS, Intel hardware, and MPICH2.
Any ideas? I have heard that it could be possible to change MPI's internal buffer size(s) to fix similar issues, but a quick try to do so (as shown in http://www.caps.ou.edu/pipermail/arpssupport/2002-May/000361.html) had no effect.
For reference, this issue is very similar to the one shown here: https://software.intel.com/en-us/forums/topic/285074, except that in my case, I have only 1 processor with 8 cores, on a single desktop computer.
UPDATE
I've managed to put together a minimalist example of this failure:
#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include "mpi.h"
using namespace std;
int main(int argc, char* argv[])
{
    MPI_Init(&argc,&argv);

    int rank, nproc, newID, newRank, newSize;
    MPI_Comm newComm;

    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&nproc);

    newID = rank%2;
    MPI_Comm_split(MPI_COMM_WORLD,newID,rank,&newComm);
    MPI_Comm_rank(newComm,&newRank);
    MPI_Comm_size(newComm,&newSize);

    srand(time(NULL));

    // Get a different 'random' number for each rank on newComm
    //int nSend = rand()%10000;
    //for (int i=0; i<newRank; i++) nSend = rand()%10000;
    /*! -- Found a set of #'s which fail for nproc=8: -- */
    int badSizes[4] = {2695,7045,4256,8745};
    int nSend = badSizes[newRank];

    cout << "Comm " << newID << ", rank " << newRank << ": nSend = " << nSend << endl;

    vector<int> send(nSend);
    for (int i=0; i<nSend; i++)
        send[i] = rand();

    vector<int> nRecv(newSize);
    MPI_Allgather(&nSend,1,MPI_INT,nRecv.data(),1,MPI_INT,newComm);

    int maxNRecv = 0;
    for (int i=0; i<newSize; i++)
        maxNRecv = max(maxNRecv,nRecv[i]);

    vector<int> recv(newSize*maxNRecv);

    MPI_Barrier(MPI_COMM_WORLD);
    cout << "rank " << rank << ": Allgather-ing data for communicator " << newID << endl;

    MPI_Allgather(send.data(),nSend,MPI_INT,recv.data(),maxNRecv,MPI_INT,newComm);

    cout << "rank " << rank << ": Done Allgathering-data for communicator " << newID << endl;

    MPI_Finalize();
    return 0;
}
The above code was compiled and run as:
mpicxx -std=c++11 mpiTest.cpp -o mpitest
mpirun -np 8 ./mpitest
with the following output on both my 16-core CentOS and my 8-core Ubuntu machines:
Comm 0, rank 0: nSend = 2695
Comm 1, rank 0: nSend = 2695
Comm 0, rank 1: nSend = 7045
Comm 1, rank 1: nSend = 7045
Comm 0, rank 2: nSend = 4256
Comm 1, rank 2: nSend = 4256
Comm 0, rank 3: nSend = 8745
Comm 1, rank 3: nSend = 8745
rank 5: Allgather-ing data for communicator 1
rank 6: Allgather-ing data for communicator 0
rank 7: Allgather-ing data for communicator 1
rank 0: Allgather-ing data for communicator 0
rank 1: Allgather-ing data for communicator 1
rank 2: Allgather-ing data for communicator 0
rank 3: Allgather-ing data for communicator 1
rank 4: Allgather-ing data for communicator 0
rank 5: Done Allgathering-data for communicator 1
rank 3: Done Allgathering-data for communicator 1
rank 4: Done Allgathering-data for communicator 0
rank 2: Done Allgathering-data for communicator 0
Note that only 2 of the ranks from each communicator exit the Allgather; this isn't what happens in my actual code (no ranks on the 'broken' communicator exit the Allgather), but the end result is the same - the code hangs until I kill it.
I'm guessing this has something to do with the differing number of sends on each process, but as far as I can tell from the MPI documentation and tutorials I've seen, this is supposed to be allowed, correct? Of course, MPI_Allgatherv is a little more applicable here, but for simplicity I have been using Allgather instead.
You must use MPI_Allgatherv if the input counts are not identical across all processes.
To be precise, what must match is the type signature (count, type), since technically you can arrive at the same fundamental representation with different datatypes (e.g. N elements versus 1 element that is a contiguous type of N elements). But if you use the same argument everywhere, which is the common usage of MPI collectives, then your counts must match everywhere.
The relevant portion of the latest MPI standard (3.1) is on page 165:
The type signature associated with sendcount, sendtype, at a process must be equal to the type signature associated with recvcount, recvtype at any other process.
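Concretely, for the minimal example in your update, a sketch of the Allgatherv-based fix could look like this (I'm reusing send, nSend, newSize, and newComm from that example; the contiguous displacement layout is just one reasonable choice):
// Gather every rank's count first, then gather the variable-sized blocks.
std::vector<int> counts(newSize), displs(newSize);
MPI_Allgather(&nSend, 1, MPI_INT, counts.data(), 1, MPI_INT, newComm);

int total = 0;
for (int i = 0; i < newSize; i++) {
    displs[i] = total;        // each block starts right after the previous one
    total += counts[i];
}

std::vector<int> recv(total);
MPI_Allgatherv(send.data(), nSend, MPI_INT,
               recv.data(), counts.data(), displs.data(), MPI_INT, newComm);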
I am currently working on a P300 detection system in C++ using the Emotiv EPOC (the P300 is basically a detectable increase in a brain wave when a user sees something they are interested in). The system works, but to improve accuracy I'm attempting to use Wekinator for machine learning, with a support vector machine (SVM).
For my P300 system I have three stimuli (left, right, and forward arrows). My program keeps track of the stimulus index, performs some filtering on the incoming "brain wave", and then calculates which index has the highest average area under the curve to determine which stimulus the user is looking at.
For my integration with Wekinator: I have set up Wekinator to receive a custom OSC message with 64 features (the length of the brain wave related to the P300) and set up three parameters with discrete values of 0 or 1. For training, I have been sending the "brain wave" for each stimulus index in a trial and setting the relevant parameters to 0 or 1, then training it and running it. The issue is that when the OSC message is received by the program from Wekinator, it returns 4 messages rather than just the single most likely one.
Here is the code for the training (and input to Wekinator during run time):
for(int s=0; s < stimCount; s++){
    for(int i=0; i < stimIndexes[s].size(); i++) {
        int eegIdx = stimIndexes[s][i];
        ofxOscMessage wek;
        wek.setAddress("/oscCustomFeatures");
        if (eegIdx + winStart + winLen < sig.size()) {
            int winIdx = 0;
            for(int e=eegIdx + winStart; e < eegIdx + winStart + winLen; e++) {
                wek.addFloatArg(sig[e]);
                //stimAvgWins[s][winIdx++] += sig[e];
            }
            validWindowCount[s]++;
        }
        std::cout << "Num args: " << wek.getNumArgs() << std::endl;
        wekinator.sendMessage(wek);
    }
}
Here is the receipt of messages from Wekinator:
if(receiver.hasWaitingMessages()){
    ofxOscMessage msg;
    while(receiver.getNextMessage(&msg)) {
        std::cout << "Wek Args: " << msg.getNumArgs() << std::endl;
        if (msg.getAddress() == "/OSCSynth/params"){
            resultReceived = true;
            if(msg.getArgAsFloat(0) == 1){
                result = 0;
            } else if(msg.getArgAsFloat(1) == 1){
                result = 1;
            } else if(msg.getArgAsFloat(2) == 1){
                result = 2;
            }
            std::cout << "Wek Result: " << result << std::endl;
        }
    }
}
Full code for both is at the following Gist:
https://gist.github.com/cilliand/f716c92933a28b0bcfa4
My main query is basically whether something is wrong with the code: should I send the full "brain wave" for a trial to Wekinator, or should I train Wekinator on different features? Does the code look right, or should it be amended? Is there a way to only receive one OSC message back from Wekinator, based on smaller feature sizes, i.e. 64 rather than 4 x 64 per stimulus or 9 x 64 per stimulus index?
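To clarify the "only one message" part of the question, here is a rough sketch (using the same ofxOsc calls as above; I haven't verified this is the right approach) of what I mean by draining the queue and acting only on the most recent /OSCSynth/params message:
// Drain all waiting OSC messages and keep only the newest result.
ofxOscMessage msg, latest;
bool haveResult = false;
while(receiver.getNextMessage(&msg)) {
    if (msg.getAddress() == "/OSCSynth/params") {
        latest = msg;          // overwrite any older result
        haveResult = true;
    }
}
if (haveResult) {
    resultReceived = true;
    for (int i = 0; i < 3; i++) {
        if (latest.getArgAsFloat(i) == 1) {
            result = i;
            break;
        }
    }
    std::cout << "Wek Result: " << result << std::endl;
}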
What I'm trying to do is broadcast a value (my pivot) to a sub-domain of my hypercube communicator.
So that, for example, process 0 sends to processes 1, 2 & 3 while process 4 sends to 4, 5 & 6.
Does this require that I create the communicators beforehand, or is there a way to do a broadcast/send to selected processes only?
int broadcaster = 0;
if(isBroadcaster)
{
    pivot = currentValues[0];
    broadcaster = mpiRank;
    cout << "rank " << mpiRank << " currentd:" << currentd << " selecting pivot: " << pivot << endl;
}
//TODO: Broadcast to processes 0 to 4 only.
//here, MPI_COMM_HYPERCUBE contains process 0 to 8
MPI_Bcast(&pivot, 1, MPI_INT, broadcaster, MPI_COMM_HYPERCUBE);
The best solution is probably to use MPI_COMM_SPLIT to break up your processes into sub-communicators. That is the standard way of describing communication domains.
The MPI_GROUP object is used for describing groups, but for the most part can't be used to perform communication.
Another option would be to use MPI_ALLTOALLV, but that's pretty nasty and massive overkill here.
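For example, a minimal sketch of the split-plus-broadcast approach applied to your snippet (I'm assuming groups of four consecutive ranks, which is my reading of your example; adapt the grouping rule as needed):
// Split MPI_COMM_HYPERCUBE into sub-communicators of 4 consecutive ranks,
// then broadcast the pivot within each sub-communicator.
int mpiRank;
MPI_Comm_rank(MPI_COMM_HYPERCUBE, &mpiRank);

int color = mpiRank / 4;            // ranks 0-3 form group 0, ranks 4-7 form group 1
MPI_Comm subComm;
MPI_Comm_split(MPI_COMM_HYPERCUBE, color, mpiRank, &subComm);

// Rank 0 of each sub-communicator (i.e. the old ranks 0 and 4) acts as root.
MPI_Bcast(&pivot, 1, MPI_INT, 0, subComm);

MPI_Comm_free(&subComm);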
You can use MPI_Comm_split, but if the group members change too often you have to repeat the split each time.
Another solution (dirty but effective, in my opinion) would be to broadcast something like a process mask before issuing a compute command. So process 0 would broadcast an array of 8 bool-like values, with true set only for mask[1], mask[2], mask[3], and so on.
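A rough sketch of that mask idea, just to make it concrete (the mask layout and target ranks are my assumptions):
// Rank 0 broadcasts a mask saying which ranks should use the pivot; the pivot
// itself is still broadcast to everyone on the full communicator, and the
// unselected ranks simply ignore it.
int nproc, mpiRank;
MPI_Comm_size(MPI_COMM_HYPERCUBE, &nproc);
MPI_Comm_rank(MPI_COMM_HYPERCUBE, &mpiRank);

std::vector<int> mask(nproc, 0);
if (mpiRank == 0) {
    for (int i = 1; i <= 3; i++) mask[i] = 1;   // e.g. select ranks 1, 2, 3
}
MPI_Bcast(mask.data(), nproc, MPI_INT, 0, MPI_COMM_HYPERCUBE);
MPI_Bcast(&pivot, 1, MPI_INT, 0, MPI_COMM_HYPERCUBE);

if (mask[mpiRank]) {
    // only the selected ranks act on the received pivot
}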
Assume we have an array or vector of length 256 (it can be more or less) and the number of pthreads to create is 4 (again, can be more or less).
I need to figure out how to assign each pthread to process a section of the vector.
So the following code dispatches the threads:
for(int i = 0; i < thread_count; i++)
{
    int *arg = (int *) malloc(sizeof(*arg));
    *arg = i;
    thread_err = pthread_create(&(threads[i]), NULL, &multiThread_Handler, arg);
    if (thread_err != 0)
        printf("\nCan't create thread :[%s]", strerror(thread_err));
}
As you can tell from the above code, each thread passes an argument value to the start routine. In the case of four threads, the argument values range from 0 to 3; with 5 threads, from 0 to 4; and so forth.
Now the starting function does the following:
void* multiThread_Handler(void *arg)
{
    int thread_index = *((int *)arg);
    free(arg);   // the argument was malloc'd in the dispatch loop above

    unsigned int start_index = (thread_index*(list_size/thread_count));
    unsigned int end_index = ((thread_index+1)*(list_size/thread_count));

    std::cout << "Start Index: " << start_index << std::endl;
    std::cout << "End Index: " << end_index << std::endl;
    std::cout << "i: " << thread_index << std::endl;

    for(int i = start_index; i < end_index; i++)
    {
        std::cout << "Processing array element at: " << i << std::endl;
    }

    return NULL;   // a void* thread function must return a value
}
So in the above code, the thread whose argument is 0 should process the section 0 - 63 (in the case of an array size of 256 and a thread count of 4), the thread whose argument is 1 should process the section 64 - 127, and so forth, with the last thread processing 192 - 255.
Each of these four sections should be processed in parallel.
Also, the pthread_join() calls are present in the original main code to make sure each thread finishes before the main thread terminates.
The problem is that the value of i in the above for-loop takes on suspiciously large values. I'm not sure why this happens, since I am fairly new to pthreads.
Sometimes it works perfectly fine, and other times the value of i becomes so large that it causes the program to either abort or hit a segmentation fault.
The problem is indeed a data race caused by lack of synchronization, and the shared variable being used (and modified) by multiple threads is std::cout.
When using streams such as std::cout concurrently, you need to synchronize all operations on the stream with a mutex. Otherwise, depending on the platform and your luck, you might get output from multiple threads mixed together (which can sometimes look like printed values being larger than you expect), the program might crash, or you might get other sorts of undefined behavior.
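For example, a minimal sketch of guarding the stream with a pthread mutex (matching the pthread usage in the rest of your code; the helper function is just for illustration):
#include <pthread.h>
#include <iostream>

// One mutex shared by all threads to serialize access to std::cout.
pthread_mutex_t cout_mutex = PTHREAD_MUTEX_INITIALIZER;

void print_element(int thread_index, int i)
{
    pthread_mutex_lock(&cout_mutex);
    std::cout << "Thread " << thread_index
              << ": processing array element at " << i << std::endl;
    pthread_mutex_unlock(&cout_mutex);
}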
// Incorrect code
unsigned int start_index = (thread_index*(list_size/thread_count));
unsigned int end_index = ((thread_index+1)*(list_size/thread_count));
The above code is the critical region of your program, and it is wrong as written: no synchronization mechanism is used, so there is a data race. This leads to wrong (random garbage) values for the start_index and end_index counters, and hence the for-loop variable i goes haywire. You should use the following code to synchronize the critical region of your program:
// Correct code
s = pthread_mutex_lock(&mutexhandle);
start_index = (thread_index*(list_size/thread_count));
end_index = ((thread_index+1)*(list_size/thread_count));
s = pthread_mutex_unlock(&mutexhandle);
I'm currently trying to code a certain dynamic programming approach for a vehicle routing problem. At a certain point, I have a partial route that I want to add to a min-max heap in order to keep the best 100 partial routes at the same stage. Most of the program runs smoothly, but when I actually want to insert a partial route into the heap, things tend to go a bit slow. That particular code is shown below:
clock_t insert_start, insert_finish, check1_finish, check2_finish;
insert_start = clock();
check2_finish = clock();

if(heap.get_vector_size() < 100) {
    check1_finish = clock();
    heap.insert(expansion);
    cout << "node added" << endl;
}
else {
    check1_finish = clock();
    if(expansion.get_cost() < heap.find_max().get_cost() ) {
        check2_finish = clock();
        heap.delete_max();
        heap.insert(expansion);
        cout << "worst node deleted and better one added" << endl;
    }
    else {
        check2_finish = clock();
        cout << "cost too high check" << endl;
    }
}

number_expansions++;

cout << "check 1 takes " << check1_finish - insert_start << " ms" << endl;
cout << "check 2 takes " << check2_finish - check1_finish << "ms " << endl;

insert_finish = clock();
cout << "Inserting an expanded state into the heap takes " << insert_finish - insert_start << " clocks" << endl;
A typical output is this:
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 16 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
I know it's hard to say much about the code when this block uses functions that are implemented elsewhere, but I'm flabbergasted as to why this sometimes takes less than a millisecond and sometimes takes up to 16 ms. The program should execute this block thousands of times, so these small hiccups are really slowing things down enormously.
My only guess is that something happens with the vector in the heap class that stores all these states, but I reserve space for 100 items in the constructor using vector::reserve, so I don't see how this could still be a problem.
Thanks!
Preemption. Your program may be preempted by the operating system so that some other program can run for a bit.
Also, it's not 16 ms. It's 16 clock ticks: http://www.cplusplus.com/reference/clibrary/ctime/clock/
If you want ms, you need to do:
cout << "Inserting an expanded state into the heap takes "
<< (insert_finish - insert_start) * 1000 / CLOCKS_PER_SEC
<< " ms " << endl;
Finally, you're setting insert_finish after printing out the other results. Try setting it immediately after your if/else block. A cout call is a good place to get preempted by another process.
My only guess is that something happens with the vector in the heap class that stores all these states, but I reserve space for 100 items in the constructor using vector::reserve, so I don't see how this could still be a problem.
Are you using std::vector to implement it? Insertion takes linear time for std::vector, and delete-max can also take time if you are not using a sorted container.
I would suggest using a std::set or std::multiset instead. Insert, delete, and find always take O(log n).
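For example, a sketch of keeping the best 100 partial routes in a std::multiset ordered by cost (Route and get_cost() here are stand-ins for your partial-route class):
#include <set>
#include <iterator>
#include <cstddef>

// Hypothetical partial-route type; replace with your own class.
struct Route {
    double cost;
    double get_cost() const { return cost; }
};

struct ByCost {
    bool operator()(const Route& a, const Route& b) const {
        return a.get_cost() < b.get_cost();
    }
};

// Keep at most maxSize routes, dropping the current worst when full.
void insert_route(std::multiset<Route, ByCost>& best, const Route& expansion,
                  std::size_t maxSize = 100)
{
    if (best.size() < maxSize) {
        best.insert(expansion);                            // O(log n)
    } else if (expansion.get_cost() < std::prev(best.end())->get_cost()) {
        best.erase(std::prev(best.end()));                 // remove the worst
        best.insert(expansion);                            // insert the better one
    }
}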
Try measuring the time with QueryPerformanceCounter, because the clock function may not be very accurate. clock probably has the same granularity as the Windows scheduler: 10 ms on a single-CPU machine and 15 or 16 ms on a multicore one. QueryPerformanceCounter together with QueryPerformanceFrequency can give you much finer (sub-microsecond) resolution.
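A sketch of what the QueryPerformanceCounter-based timing could look like (Windows-only; wrap it around whichever call you want to time):
// Assumes <windows.h> is included.
LARGE_INTEGER freq, t0, t1;
QueryPerformanceFrequency(&freq);

QueryPerformanceCounter(&t0);
// ... the code you want to time, e.g. heap.insert(expansion) ...
QueryPerformanceCounter(&t1);

double ms = 1000.0 * (t1.QuadPart - t0.QuadPart) / freq.QuadPart;
cout << "Insert took " << ms << " ms" << endl;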
It looks like you are measuring "wall time", not CPU time. Windows itself is not a real-time OS; occasional large hiccups from high-priority things like device drivers are not at all uncommon.
On Windows, if I'm manually trying to look for bottlenecks in code, I use RDTSC instead. Even better would be to not do it manually, but to use a profiler.
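For reference, reading the cycle counter manually looks something like this (MSVC's __rdtsc intrinsic; it counts CPU cycles and is still wall-clock-like, so treat it only as a rough tool):
// Assumes #include <intrin.h> (MSVC).
unsigned __int64 c0 = __rdtsc();
// ... the code under test ...
unsigned __int64 c1 = __rdtsc();
// (c1 - c0) is the elapsed cycle count for the block above.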