MPI_send and MPI_recv functions for multiple processors C++ - c++

I`m pretty new in MPI programming and got stuck in the middle of my project.
I want to write an MPI code for the following problem. I am not sure which functions from MPI is appropriate.
Here is the problem:
Processor 0 has a 2D vector or array of Edges={(0,4),(1,5)}. It needs to get some information from the other processors, which is not always a fixed processor, it depends on set Edges. Therefore, I need a for loop as follows:
if(my_rank==0)
{
for(all pairs (i,j) in Edges)
{
send i (or j) to Processor r (r depends on the index i)
recieve L_r from Processor r
create (L_i, L_j, min(L_i,L_j)) // want to broadcast to all later.
}
}
Now, I am not sure how to do it for processor r, should I do in a for loop?
Note that I can not do it in an if statement since I dont know which processor would be and so based on the number of processors I need an if statement which I don`t think is a right way. I might have so many processors which each holds some part of a matrix.
Need to point that I cannot communicate with a subgroup of communicators, since it all depends on the indices, basically, I want the labels for example indices (0,4) which need to communicate with P4 that holds it.
Any ideas are appreciated.

I would do it as follow:
1) Proc 0 construct a list of every processes it has to comunicate with.
2) Proc 0 broadcast this list to all processes (or only to the one he have to communicate with, but that will be more complicated, can be done once you got a version which works)
3) You perform your comm:
If(rank==0){...}
else if (rank in the list){...}

Related

How does one send custom MPI_Datatype over to a different process?

Suppose that I create custom MPI_Datatypes for subarrays of different sizes on each of the MPI processes allocated to a program. Now I wish to send these subarrays to the master process and assemble them into a bigger array block by block. The master process is unaware of the individual datatypes (defined by the local sizes) on the other processes. Naively, therefore, I might attempt to send over these custom datatypes to the master process in the following manner.
MPI_Datatype localarr_type;
MPI_Type_create_subarray( NDIMS, array_size, local_size, box_corner, MPI_ORDER_C, MPI_FLOAT, &localarr_type );
MPI_Type_Commit(&localarr_type);
if (rank == master)
{
for (int id = 1; id < nprocs; ++id)
{
MPI_Recv( &localarr_type, 1, MPI_Datatype, id, tag1[id], comm_cart, MPI_STATUS_IGNORE );
MPI_Recv( big_array, 1, localarray_type, id, tag2[id], comm_cart, MPI_STATUS_IGNORE );
}
}
else
{
MPI_Send( &localarr_type, 1, MPI_Datatype, master, tag1[rank], comm_cart );
MPI_Send( local_box, 1, localarr_type, master, tag2[rank], comm_cart );
}
However, this results in a compilation error with the following error message from the GNU and CLANG compilers, and the latter error message from the Intel compiler.
/* GNU OR CLANG COMPILER */
error: unexpected type name 'MPI_Datatype': expected expression
/* INTEL COMPILER */
error: type name is not allowed
This means that either (1) I am attempting to send a custom MPI_Datatype over to a different process in the wrong way or that (2) this is not possible at all. I would like to know which it is, and if it is (1), I would like to know what the correct way of communicating a custom MPI_Datatype is. Thank you.
Note.
I am aware of other ways of solving the above problem without needing to communicate MPI_Datatypes. For example, one could communicate the local array sizes and manually reconstruct the MPI_Datatype from other processes inside the master process before using it in the subsequent communication of subarrays. This is not what I am looking for.
I wish to communicate the custom MPI_Datatype itself (as shown in the example above), not something that is an instance of the datatype (which is doable, as also shown in the example code above).
First of all: You can not send a datatype like that. The value MPI_Datatype is not a value of type MPI_Datatype. (It's a cute idea though.) You could send the parameters with which it is constructed, and the reconstruct it on the sending type.
However, you are probably misunderstanding the nature of MPI. In your code, with the same datatype on workers and manager, you are sort of assuming that everyone has data of the same size/shape. That is not compatible with the manager gathering everything together.
If you're gathering data on a manager process (usually not a good idea: are you really sure you need that?) then the contributing processes have the data in a small array, say at index 0..99. So you can send them as an ordinary contiguous buffer. The "manager" has a much larger array, and places all the contributions in disjoint locations. So at most the manager needs to create subarray types to indicate where the received data goes in the big array.

Dealing with complex send recv message within a for loop

I am trying to parallelise a biological model in C++ with boost::mpi. It is my first attempt, and I am entirely new to the boost library (I have started from the Boost C++ Libraries book by Schaling). The model consists of grid cells and cohorts of individuals living within each grid cell. The classes are nested, such that a vector of Cohorts* belongs to a GridCell. The model runs for 1000 years, and at each time step, there is dispersal such that the cohorts of individuals move randomly between grid cells. I want to parallelise the content of the for loop, but not the loop itself as each time step depends on the state of the previous time.
I use world.send() and world.recv() to send the necessary information from one rank to another. Because sometimes there is nothing to send between ranks I use with mpi::status and world.iprobe() to make sure the code does not hang waiting for a message that was never sent (I followed this tutorial)
The first part of my code seems to work fine but I am having troubles with making sure all the sent messages have been received before moving on to the next step in the for loop. In fact, I noticed that some ranks move on to the following time step before the other ranks have had the time to send their messaages (or at least that what it looks like from the output)
I am not posting the code because it consists of several classes and it’s quite long. If interested the code is on github. I write here roughly the pseudocode. I hope this will be enough to understand the problem.
int main()
{
// initialise the GridCells and Cohorts living in them
//depending on the number of cores requested split the
//grid cells that are processed by each core evenly, and
//store the relevant grid cells in a vector of GridCell*
// start to loop through each time step
for (int k = 0; k < (burnIn+simTime); k++)
{
// calculate the survival and reproduction probabilities
// for each Cohort and the dispersal probability
// the dispersing Cohorts are sorted based on the rank of
// the destination and stored in multiple vector<Cohort*>
// I send the vector<Cohort*> with
world.send(…)
// the receiving rank gets the vector of Cohorts with:
mpi::status statuses[world.size()];
for(int st = 0; st < world.size(); st++)
{
....
if( world.iprobe(st, tagrec) )
statuses[st] = world.recv(st, tagrec, toreceive[st]);
//world.iprobe ensures that the code doesn't hang when there
// are no dispersers
}
// do some extra calculations here
//wait that all processes are received, and then the time step ends.
//This is the bit where I am stuck.
//I've seen examples with wait_all for the non-blocking isend/irecv,
// but I don't think it is applicable in my case.
//The problem is that I noticed that some ranks proceed to the next
//time step before all the other ranks have sent their messages.
}
}
I compile with
mpic++ -I/$HOME/boost_1_61_0/boost/mpi -std=c++11 -Llibdir \-lboost_mpi -lboost_serialization -lboost_locale -o out
and execute with mpirun -np 5 out, but I would like to be able to execute with a higher number of cores on an HPC cluster later on (the model will be run at the global scale, and the number of cells might depend on the grid cell size chosen by the user).
The compilers installed are g++ (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, Open MPI: 2.1.1
The fact that you have nothing to send is an important piece of information in your scenario. You can not deduce that fact from only the absence of a message. The absence of a message only means nothing was sent yet.
Simply sending a zero-sized vector and skipping the probing is the easiest way out.
Otherwise you would probably have to change your approach radically or implement a very complex speculative execution / rollback mechanism.
Also note that the linked tutorial uses probe in a very different fashion.

Monitor buffers in GNU Radio

I have a question regarding buffering in between blocks in GNU Radio. I know that each block in GNU (including custom blocks) have buffers to store items that are going to be sent or received items. In my project, there is a certain sequence I have to maintain to synchronize events between blocks. I am using GNU radio on the Xilinx ZC706 FPGA platform with the FMCOMMS5.
In the GNU radio companion I created a custom block that controls a GPIO Output port on the board. In addition, I have an independent source block that is feeding information into the FMCOMMS GNU block. The sequence I am trying to maintain is that, in GNU radio, I first send data to the FMCOMMS block, second I want to make sure that the data got consumed by the FMCOMMS block (essentially by checking buffer), then finally I want to control the GPIO output.
From my observations, the source block buffer doesn’t seem to send the items until it’s full. This will cause a major issue in my project because this means that the GPIO data will be sent before or in parallel with sending the items to the other GNU blocks. That’s because I’m setting the GPIO value through direct access to its address in the ‘work’ function of my custom block.
I tried to use pc_output_buffers_full() in the ‘work’ function of my custom source in order to monitor the buffer, but I’m always getting 0.00. I’m not sure if it’s supposed to be used in custom blocks or if the ‘buffer’ in this case is something different from where the output items are stored. Here's a small code snippet which shows the problem:
char level_count = 0, level_val = 1;
vector<float> buff (1, 0.0000);
for(int i=0; i< noutput_items; i++)
{
if(level_count < 20 && i< noutput_items)
{
out[i] = gr_complex((float)level_val,0);
level_count++;
}
else if(i<noutput_items)
{
level_count = 0;
level_val ^=1;
out[i] = gr_complex((float)level_val,0);
}
buff = pc_output_buffers_full();
for (int n = 0; n < buff.size(); n++)
cout << fixed << setw(5) << setprecision(2) << setfill('0') << buff[n] << " ";
cout << "\n";
}
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent? Or is there a way to make sure that the each single output item is being sent like a continuous stream to the next block(s)?
GNU Radio Companion version: 3.7.8
OS: Linaro 14.04 image running on the FPGA
Or is there a way to make sure that the each single output item is being sent like a continuous stream to the next block(s)?
Nope, that's not how GNU Radio works (at all!):
A while back I wrote an article that explains how GNU Radio deals with buffers, and what these actually are. While the in-memory architecture of GNU Radio buffers might be of lesser interest to you, let me quickly summarize the dynamics of it:
The buffers that (general_)work functions are called with behave for all that's practical like linearly addressable ring buffers. You get a random number of samples at once (restrictable to minimum numbers, multiples of numbers), and all that you not consume will be handed to you the next time work is called.
These buffers hence keep track of how much you've consumed, and thus, how much free space is in a buffer.
The input buffer a block sees is actually the output buffer of the "upstream" block in the flow graph.
GNU Radio's computation is backpressure-controlled: Any block's work method will immediately be called in an endless loop given that:
There's enough input for the block to do work,
There's enough output buffer space to write to.
Therefore, as soon as one block finishes its work call, the upstream block is informed that there's new free output space, thus typically leading to it running
That leads to high parallelity, since even adjacent blocks can run simultaneously without conflicting
This architecture favors large chunks of input items, especially for blocks that take a relative long time to computer: while the block is still working, its input buffer is already being filled with chunks of samples; when it's finished, chances are it's immediately called again with all the available input buffer being already filled with new samples.
This architecture is asynchronous: even if two blocks are "parallel" in your flow graph, there's no defined temporal relation between the numbers of items they produce.
I'm not even convinced switching GPIOs at times based on the speed computation in this completely non-deterministic timing data flow graph model is a good idea to start with. Maybe you'd rather want to calculate "timestamps" at which GPIOs should be switched, and send (timestamp, gpio state) command tuples to some entity in your FPGA that keeps absolute time? On the scale of radio propagation and high-rate signal processing, CPU timing is really inaccurate, and you should use the fact that you have an FPGA to actually implement deterministic timing, and use the software running on the CPU (i.e. GNU Radio) to determine when that should happen.
Is there a way to monitor the buffer so that I can determine when my first part of data bits have been sent?
Other than that, a method to asynchronously tell another another block that, yes, you've processed N samples, would be either to have a single block that just observes the outputs of both blocks that you want to synchronize and consumes an identical number of samples from both inputs, or to implement something using message passing. Again, my suspicion is that this is not a solution to your actual problem.

Concatenate data in an array in C ++

I'm working on software for processing audio in real time in C++ with Qt. I need that requirements are minimized.
Defining a temporary buffer 40ms, launching our device with a sampling frequency Fs = 8000Hz, every 320 samples entered a feature called Data Processing ().
The idea is to have a global buffer that stores the 10s last recorded, 80000 samples.
This Buffer in each iteration eliminates the initial 320 samples and looped at the end, 320 new samples. Thus the buffer is updated and the user can observe the real-time graphical representation of the recorded signal.
At first I thought of using QVector (equivalent to std::vector but for Qt) for this deployment, thus we reduce the process a few lines of code
int NUM_POINTS=320;
DatosTemporales.erase(DatosTemporales.begin(),DatosTemporales.begin()+NUM_POINTS);
DatosTemporales+= (DatosNuevos); // Datos Nuevos con un tamaño de NUM_POINTS
In each iteration we create a vector of 80000 samples in addition to free some positions so requires some processing time. An alternative for opting was the use of * double, and iterations a loop:
for(int i=0;i<80000;i++){
if(i<80000-NUM_POINTS){
aux=DatosTemporales[i];
DatosTemporales[i+NUM_POINTS]=aux;
}else{
DatosTemporales[i]=DatosNuevos[i-NUN_POINTS];
}
}
Does fails. I think the best way is to use dynamic memory. Implementing this process by pointers. Could anyone give me some idea how to implement it?
It sounds like what you are looking for is a circular buffer.
https://www.google.com/search?q=qcircularbuffer
https://qt.gitorious.org/qt/qtbase/merge_requests/60
And it looks like you only need the header file and you should be good to go.
A similar tool that is already in the Qt data set is found here:
http://doc.qt.io/qt-5/qcontiguouscache.html#details
The advantage of using a system like these presented, is that they don't need to have dynamic memory, it just needs to move the head and the tail pointers.
Hope that helps.

OpenCL SHA1 Throughput Optimisation

Hoping someone more experienced in OpenCL usage may be able to help me here! I'm doing a project (to help me learn a bit more crypto and to try my hand at GPGPU programming) where I'm trying to implement my own SHA-1 algorighm.
Ultimately my question is about maximizing my throughput rates. At present I'm seeing something like 56.1 MH/sec, which compares very badly to open source programs I've looked at, such as John the Ripper and OCLHashcat, which are giving 1,000 and 1,500 MH/sec respectively (heck, I'd be well-chuffed with a 3rd of that!).
So, what I'm doing
I've written a SHA-1 implementation in an OpenCL kernel and a C++ host application to load data to the GPU (using CL 1.2 C++ wrapper). I'm generating blocks of candidate data to hash in a threaded fashion on the CPU and loading this data onto the global GPU memory using the CL C++ call to enqueueWriteBuffer (using uchars to represent the bytes to hash):
errorCode = dispatchQueue->enqueueWriteBuffer(
inputBuffer,
CL_FALSE,//CL_TRUE,
0,
sizeof(cl_uchar) * inputBufferSize,
passwordBuffer,
NULL,
&dispatchDelegate);
I'm en-queuing data using enqueueNDRangeKernel in the following manner (where global worksize is a user-defined variable, at present I've set this to my GPUs maximum flattened global worksize of 16.777 million per run):
errorCode = dispatchQueue->enqueueNDRangeKernel(
*kernel,
NullRange,
NDRange(globalWorkgroupSize, 1),
NullRange,
NULL,
NULL);
This means that (per dispatch) I load 16.777 million items in a 1D array and index from my kernel into this using get_global_offset(0).
My Kernel signature:
__kernel void sha1Crack(__global uchar* out, __global uchar* in,
__constant int* passLen, __constant int* targetHash,
__global bool* collisionFound)
{
//Kernel Instance Global GPU Mem IO Mapping:
__private int id = get_global_id(0);
__private int inputIndexStart = id * passwordLen;
//Select Password input key space:
#pragma unroll
for (i = 0; i < passwordLen; i++)
{
inputMem[i] = in[inputIndexStart + i];
}
//SHA1 Code omitted for brevity...
}
So, given all this: am I doing something fundamentally wrong in the way I'm loading data? I.e. 1 call to enqueueNDrange for 16.7 million kernel executions over a 1D input vector? Should I be using a 2-D space and sub-dividing into localworkgroup ranges? I tried playing with this but it didn't seem quicker.
Or, perhaps as likely is my algorithm itself the source of slowness? I've spent a good while optimizing it and manually unrolling all of the loop stages using pre-processor directives.
I've read about memory coalescing on the hardware. Could that be my issue? :S
Any advice at all appreciated! If I've missed anything important please let me know and I'll update.
Thanks in advance! ;)
Update: 16,777,216 is the device maximum reported workgroup size; 256**3. The global array of boolean values is one boolean. It's set to false at the start of the kernel enqueue, then a branching statement sets this to true if a collision is found only - will that force a convergence? passwordLen is the length of the current input value and target hash is an int[4] encoded hash to check against.
Your 'maximum flattened global worksize' should be multiplied by passwordLen. It is the number of kernels you can run, not the maximal length of an input array. You can most likely send much more data than this to the GPU.
Other potential issues: the 'generating blocks of candidate data to hash in a threaded fashion on the CPU', try doing this in advance of the kernel iterations to see whether the delay is in the generation of the data blocks or in the processing of the kernels; your sha1 algorithm is the other obvious potential issue. I'm not sure how much you've really optimised it by 'unrolling' the loops, usually the bigger optimisation issue is 'if' statements (if a single kernel instance within a workgroup tests to true then all of the lockstepped workgroup instances must follow that branch in parallel).
And DarkZeros is correct, you should manually play with the local workgroup size making it the highest common multiple of the global size and the number of kernels which can be run at once on the card. The easiest way to do this is to round up the global work group size to the next multiple of the card capacity and use an external if{} statement in the kernel only running the kernel for global_id less than the actual number of kernels you want to run.
Dave.