Is there something wrong in my MPI algorithm? - fortran

I set up this algorithm to share data between different processors, and it has worked so far, but I'm trying to throw a much larger problem at it and I'm witnessing some very strange behavior: I'm losing pieces of data between MPI_Isend's and MPI_Recv's.
I present a snippet of the code below. It basically comprises three stages. First, a processor loops over all elements in a given array. Each element represents a cell in a mesh. The processor checks whether the element is used on other processors. If yes, it does a non-blocking send to that process using the cell's unique global ID as the tag. If no, it checks the next element, and so on.
Second, the processor then loops over all elements again, this time checking if the processor needs to update the data in that cell. If yes, then the data has already been sent out by another process. The current process simply does a blocking receive, knowing who owns the data and the unique global ID for that cell.
Finally, MPI_Waitall is used for the request codes that were stored in the 'req' array during the non-blocking sends.
The issue I'm having is that this entire process completes; there is no hang in the code. But some of the data received by some of the cells just isn't correct. I check that all the data being sent is right by printing each piece of data prior to the send operation. Note that I'm sending and receiving a slice of an array: each send passes 31 elements. When I print the array on the process that received it, 3 out of the 31 elements are garbage; all other elements are correct. The strange thing is that it is always the same three elements that are garbage: the first, the second, and the last.
I want to rule out that something is drastically wrong in my algorithm that would explain this. Or perhaps it is related to the cluster I'm working on? As I mentioned, this worked on all other models I threw at it, using up to 31 cores; I'm seeing this behavior when I throw 56 cores at the problem. If nothing pops out as wrong, can you suggest a way to test why certain pieces of a send are not making it to their destination?
do i = 1, num_cells
   ! skip cells with data that isn't needed by other processors
   if (.not. needed(i)) cycle
   tag = gid(i)        ! The unique ID of this cell in the entire system
   ghoster = ghosts(i) ! The processor that needs data from this cell
   call MPI_Isend(data(i,1:tot_levels), tot_levels, mpi_datatype, ghoster, tag, MPI_COMM, req(send), mpierr)
   send = send + 1
end do
sends = send - 1

do i = 1, num_cells
   ! skip cells that don't need a data update
   if (.not. needed_here(i)) cycle
   tag = gid(i)
   owner = owner(i)
   call MPI_Recv(data(i,1:tot_levels), tot_levels, mpi_datatype, owner, tag, MPI_COMM, MPI_STATUS_IGNORE, mpierr)
end do

call MPI_Waitall(sends, req, MPI_STATUSES_IGNORE, mpierr)

Is your problem that you're not receiving all of the messages? Note that just because an MPI_SEND or MPI_ISEND completes, doesn't mean that the corresponding MPI_RECV was actually posted/completed. The return of the send call only means that the buffer can be reused by the sender. That data may still be buffered internally somewhere on either the sender or the receiver.
If it's critical that you know that the message was actually received, you need to use a different variety of the send like MPI_SSEND or MPI_RSEND (or the nonblocking versions if you prefer). Note that this won't actually solve your problem. It will probably just make it easier for you to figure out which messages aren't showing up.
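For example (a hedged C++ sketch against the MPI C API; the names are placeholders, not the poster's variables), temporarily switching to MPI_Issend makes an unmatched message show up as a hang rather than as silently lost data:

#include <mpi.h>

// Debugging sketch: a synchronous-mode nonblocking send. The request
// completes only once the matching receive has been posted, so a message
// that is never received turns into a visible hang at MPI_Waitall.
void debugSend(double *buf, int count, int dest, int tag,
               MPI_Request *reqs, int &nsends)
{
    MPI_Issend(buf, count, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD,
               &reqs[nsends++]);
}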

I figured out a way to get my code to work, but I'm not entirely sure why, so I'm going to post the solution here and maybe somebody could comment on why this is the case and possibly offer a better solution.
As I indicated in my question and as we discussed in the comments, it appeared that pieces of data were being lost between sends and receives. The concept of the buffer is a mystery to me, but I thought that maybe there wasn't enough space to hold my Isends, allowing them to get lost before they could be received. So I swapped the MPI_Isend calls for MPI_Bsend calls. I work out how big my buffer needs to be using MPI_Pack_size, so I know I will have ample space for all the messages I send, allocate a buffer of that size, and attach it with MPI_Buffer_attach. I got rid of the MPI_Waitall, since it is no longer needed, and replaced it with a call to MPI_Buffer_detach.
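In outline, the buffered-send pattern looks something like the sketch below (a hedged C++ sketch against the MPI C API, not my actual code; the names and per-message layout are placeholders). Note that MPI_BSEND_OVERHEAD must be charged once per buffered message:

#include <mpi.h>

// Sketch: attach a buffer large enough for 'nmsg' buffered sends of
// 'count' doubles each, do the sends, then detach.
void bsendPhase(double *data, int count, int nmsg, int dest, int tag)
{
    int oneMsg = 0;
    MPI_Pack_size(count, MPI_DOUBLE, MPI_COMM_WORLD, &oneMsg);
    int bufSize = nmsg * (oneMsg + MPI_BSEND_OVERHEAD);

    char *buf = new char[bufSize];
    MPI_Buffer_attach(buf, bufSize);

    for (int m = 0; m < nmsg; ++m)
        MPI_Bsend(data + m * count, count, MPI_DOUBLE, dest, tag + m,
                  MPI_COMM_WORLD);

    // Detach blocks until every buffered message has been delivered
    // (or at least handed off), replacing the old MPI_Waitall.
    void *detached = nullptr;
    int detachedSize = 0;
    MPI_Buffer_detach(&detached, &detachedSize);
    delete[] buf;
}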
With this change, my code runs without issue and arrives at results identical to the serial case. I'm able to scale the problem up to the size I tried before, and it works now. So based on these results, I'd have to assume that pieces of messages were being lost due to insufficient buffer space.
I have concerns about the impact on code performance, so I did a scaling study on different problem sizes. See the image below. The x-axis gives the size of the problem (a size of 5 means the problem is 5 times bigger than size 1). The y-axis gives the time to finish executing the program. There are three lines shown: running the program in serial is shown in blue; the size=1 case extrapolated out linearly is the green line, and it shows that serial execution time is linearly correlated with problem size. The red line shows running the program in parallel, using a number of processors that matches the problem size (e.g. 2 cores for size=2, 4 cores for size=4, etc.).
You can see that the parallel execution time increases very slowly with problem size, which is expected, except for the largest case. I suspect the poor performance in the largest case is caused by an increased amount of message buffering, which was not needed in the smaller cases.

Related

Check if adjacent slave process is ended in MPI

In my MPI program, I want to send and receive information to adjacent processes. But if a process ends and doesn't send anything, its neighbors will wait forever. How can I resolve this issue? Here is what I am trying to do:
if (rank == 0) {
    // don't do anything until all slaves are done
} else {
    while (condition) {
        // send info to rank-1 and rank+1
        // if can receive info from rank-1, receive it, store received info locally
        // if cannot receive info from rank-1, use locally stored info
        // do the same for process rank+1
        // MPI_Barrier(slaves); (wait for other slaves to finish this iteration)
    }
}
I am going to check the boundaries, of course: I won't check rank-1 when the process number is 1, and I won't check rank+1 when the process is the last one. But how can I achieve this? Should I wrap it in another while? I am confused.
I'd start by saying that MPI wasn't originally designed with your use case in mind. In general, MPI applications all start together and all end together. Not all applications fit into this model though, so don't lose hope!
There are two relatively easy ways of doing this and probably thousands of hard ones:
Use RMA to set flags on neighbors.
As has been pointed out in the comments, you can set up a tiny RMA window that exposes a single value to each neighbor. When a process is done working, it can do an MPI_Put on each neighbor to indicate that it's done and then MPI_Finalize. Before sending/receiving data to/from the neighbors, check to see if the flag is set.
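A minimal sketch of that flag window (with illustrative names like doneFlag and nbr, not taken from the question):

#include <mpi.h>

// Every rank exposes one int flag through an RMA window; a rank that
// finishes MPI_Puts a 1 into each neighbour's flag before finalizing.
static int doneFlag = 0;   // neighbours set this to 1 when they quit

MPI_Win makeFlagWindow()
{
    MPI_Win win;
    MPI_Win_create(&doneFlag, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    return win;
}

void notifyNeighbour(MPI_Win win, int nbr)
{
    int one = 1;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, nbr, 0, win);
    MPI_Put(&one, 1, MPI_INT, nbr, 0, 1, MPI_INT, win);
    MPI_Win_unlock(nbr, win);
}
// Before each send/recv with 'nbr', the main loop checks doneFlag
// (after locking its own window) and skips neighbours that have left.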
Use a special tag when detecting shutdowns.
The tag value often gets ignored when sending and receiving messages, but this is a great time to use it. You can have two tag values in your application. The first (we'll call it DATA) just indicates that the message contains data and can be processed as normal. The second (DONE) indicates that the sender is done and is leaving the application. When receiving messages, you'll have to change the tag you pass from whatever you're using to MPI_ANY_TAG. Then, when a message arrives, check which tag it carries; if it's DONE, stop communicating with that process.
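For instance (a hedged sketch; the tag values and names are arbitrary choices, and the receive count is only an upper bound, so a zero-length DONE message still matches):

#include <mpi.h>

enum { TAG_DATA = 0, TAG_DONE = 1 };

// Receive from neighbour 'nbr' with MPI_ANY_TAG, then branch on the
// tag to distinguish a payload from a shutdown notice.
void recvFromNeighbour(int nbr, double *buf, int count, bool &nbrAlive)
{
    MPI_Status status;
    MPI_Recv(buf, count, MPI_DOUBLE, nbr, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    if (status.MPI_TAG == TAG_DONE)
        nbrAlive = false;    // stop communicating with this process
    // otherwise TAG_DATA: process 'buf' as usual
}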
There's another problem with the pseudo-code that you posted however. If you expect to perform an MPI_Barrier at the end of every iteration, you can't have processes leaving early. When that happens, the MPI_Barrier will hang. There's not much you can do to avoid this unfortunately. However, given the code you posted, I'm not sure that the barrier is really necessary. It seems to me that the only inter-loop dependency is between neighboring processes. If that's the case, then the sends and receives will accomplish all of the necessary synchronization.
If you still need a way to track when all of the ranks are done, you can have each process alert a single rank (say rank 0) when it leaves. When rank 0 detects that everyone is done, it can just exit. Or, if you want to leave after some other number of processes has finished, you can have rank 0 send a message to all other ranks with a special tag as above (receive with MPI_ANY_SOURCE so the message from rank 0 can also be matched).

GNU Radio general_work() function

I have trouble using the general_work function for a block which takes a vector as an input and outputs a message.
The block is a kind of demodulator. In fact, it works great if I keep sending data periodically.
But I need to create only one piece of data (a frame) with a predefined size and send it to this block. And I want the block to handle all of the items in its buffer without waiting for more data.
As I understand it, this has to do with the buffering and scheduler structure of GNU Radio, but I couldn't figure out how to get this block to handle all the symbols of the frame I've sent without waiting for another frame.
For example, let's say my frame has 150 symbols. The scheduler calls my general_work function two, three, or four times (I don't know how it decides the number of calls to my general_work).
However, it stops at, let's say, symbol #141 or #143. Every time I run it, it stops at a different symbol number. If I send another frame, it finishes handling the remaining items (symbols) in its buffer.
Does anybody know how I can tell the scheduler not to wait for another frame before processing the remaining items in its buffer from the previously sent data?
First of all, thank you for your advice. In fact, I am working on a link-layer protocol and its implementation using SDR for my graduate thesis. Because I'm not a DSP expert, I need a WiFi PHY layer (transceiver), so I decided to use an OOT module, the "802.11 a/g/p Transceiver" project developed by Bastian Bloessl, which is available at https://github.com/bastibl/gr-ieee802-11.git. He provides an example flow-graph (wifi_loopback.grc) to simulate the transceiver. Besides the transceiver (DSP stuff) itself, he also developed parts of the data-link layer for 802.11, such as framing and error control.
In the example flow-graph, the "Message Strobe" block acts as a kind of application layer, producing data periodically and sending it to a block called "OFDM MAC", which has 4 message ports (app_in, app_out, phy_in, and phy_out). In this block, the raw data coming from the "Message Strobe" is encapsulated by adding a header and FCS information. The encapsulated data is then sent (phy_out) to a hierarchical block called "Wifi PHY Hier", which does the DSP work such as scrambling, coding, interleaving, symbol mapping, modulation, etc. The data is converted to a signal, received back by the same block ("Wifi PHY Hier"), and the opposite process is performed: descrambling, decoding, and so on. Finally, the decoded frame is handed to the "OFDM MAC" block (phy_in). If you run this flow-graph, everything is normal; the data sent by the "Message Strobe" is received correctly.
However, because I am trying to implement a link-layer protocol, I need some feedback from the destination to the source, such as an ACK message. So I decided to start by implementing a simple stop-and-wait protocol: the source sends a message and waits for an ACK from the destination (DATA -> ACK -> DATA -> ACK... and so on). To do that, I created a simple source block which sends one piece of data and waits for an ACK message before sending another. The data I produce with my source block is the same as the data produced by "Message Strobe". When I replaced the "Message Strobe" block with my source block, I realized something was wrong because I couldn't receive my data, so I followed my data to find which step caused this. There is no problem on the transmit side. On the receive side, I found the problematic block: it is inside the "Wifi PHY Hier" block and is the last block before this hierarchical block hands its data to the "OFDM MAC" block. This block, called "OFDM Decode MAC", has two ports: the output port is a message port and the input port takes a complex vector. So I reviewed its code, especially its general_work() function. For my particular test data, in order to complete its job correctly it should consume 177 items to produce an output for "OFDM MAC". However, it stops after consuming 172 items. I overrode the forecast() method and set ninput_items_required[0] = 177, but nothing happened, because, as I understand it, the scheduler never sees 177 items in the input buffer. As you said, this is because the block that writes into this block's input buffer ("OFDM Decode Signal") produces only 172 items.
I have not dug deeper yet, but the interesting point is that when I send a second piece of data at runtime, after a delay and without waiting for an ACK, the remaining 5 items of the first data are somehow consumed and received correctly by the "OFDM MAC" block, and the second piece of data ends up in the same problematic situation the previous one experienced. If I send a third, the second is also received correctly. I'm really confused. How can this be?
I'll comment quickly on your text, and then advise below:
I have trouble using the general_work function for a block which takes a vector as an input and outputs a message.
That block is, from a sample stream perspective, a sink. You will find that when using sink as a block type in gr_modtool, you will get a sync_block, which means you only have to implement a work, not a general_work plus a forecast.
The block is a kind of demodulator. In fact, it works great if I keep sending data periodically.
So that's great!
But I need to create only one piece of data (a frame) with a predefined size and send it to this block. And I want the block to handle all of the items in its buffer without waiting for more data.
That sounds like your block doesn't actually take streams of samples, but whole blocks of them. That is a job for either
message passing (so your block would have no input stream, just a message port) or
tagged stream blocks.
Sounds like the second to me.
As I understand it, this has to do with the buffering and scheduler structure of GNU Radio, but I couldn't figure out how to get this block to handle all the symbols of the frame I've sent without waiting for another frame.
Frame is what you make of this – to GNU Radio, your samples are just items that get written to and read from a buffer.
For example, let's say my frame has 150 symbols. The scheduler calls my general_work function two, three, or four times (I don't know how it decides the number of calls to my general_work).
It doesn't decide -- that's probably the chunks in which the symbols get written into the input buffer of your block. You don't have to consume all of these (or any of these) if your block isn't able to produce output with the input given. Just let GNU Radio know how many items were consumed (in the sync block case, it's implicitly done with the return value; in the general_work case, you might have to manually call consume – another reason to change your block type!).
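To illustrate the consume mechanics (a hedged fragment of a hypothetical general block; d_frame_len and d_received are illustrative member variables, and the enclosing class is omitted):

#include <algorithm>   // std::min; GNU Radio headers omitted for brevity

int my_demod_impl::general_work(int noutput_items,
                                gr_vector_int &ninput_items,
                                gr_vector_const_void_star &input_items,
                                gr_vector_void_star &output_items)
{
    const int avail = ninput_items[0];
    // only take what still belongs to the current frame
    const int usable = std::min(avail, d_frame_len - d_received);

    // ... demodulate 'usable' items from input_items[0] here ...
    d_received += usable;

    consume(0, usable);  // report exactly what was used this call
    return 0;            // output leaves via a message port, not a stream
}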
However, it stops at, let's say, symbol #141 or #143. Every time I run it, it stops at a different symbol number. If I send another frame, it finishes handling the remaining items (symbols) in its buffer.
That sounds like a bug in your algorithm, not in GNU Radio. Maybe your input buffer is simply full, or maybe the block that writes into it simply doesn't provide more data?
Does anybody know how I can tell the scheduler not to wait for another frame before processing the remaining items in its buffer from the previously sent data?
The scheduler doesn't wait; as soon as there is data to be processed, it instantly "wakes" your block, and asks it to process the items.
I reached out to Bastian, the developer of this OOT module. He said the cause of the problem was a kind of padding issue: if a block called "Packet Padding2", found in another OOT module he also develops, is used after "Wifi PHY Hier", and the Pad Tail parameter of that block is set to an appropriate value, the problem is solved.

OpenGL, measuring rendering time on gpu

I have some big performance issues here, so I would like to take some measurements on the GPU side.
After reading this thread, I wrote this code around my draw functions, including the GL error check and the swapBuffers() call (auto-swapping is indeed disabled):
gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]);
{
    draw(gl4);
    checkGlError(gl4);
    glad.swapBuffers();
}
gl4.glEndQuery(GL4.GL_TIME_ELAPSED);
gl4.glGetQueryObjectiv(queryId[0], GL4.GL_QUERY_RESULT, frameGpuTime, 0);
And since OpenGL rendering commands are supposed to be asynchronous (the driver can buffer up commands before sending them to the GPU in one batch), my questions are essentially:
Is the code above correct?
Am I right in assuming that at the beginning of a new frame all the GL commands from the previous frame have been sent, executed, and completed on the GPU?
Am I right in assuming that when I fetch the query result with glGetQueryObjectiv and GL_QUERY_RESULT, all the GL commands issued so far have completed? That is, will OpenGL block the calling thread until the result becomes available?
Yes, when you query the timer it will block until the data is available, i.e. until the GPU has finished everything that happened between the beginning and the end of the query. To avoid synchronising with the GPU, you can use GL_QUERY_RESULT_AVAILABLE to check whether the result is already available, and only read it then. That may require less straightforward code (keeping tabs on open queries and checking them periodically), but it will have the least performance impact; waiting for the value every time is a sure way to kill your performance.
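For illustration, the non-blocking readback could look like this (a C++ sketch against raw OpenGL rather than the JOGL bindings used in the question; the loader/header setup is omitted):

// Polls a timer query without stalling: returns true and fills
// 'elapsedNs' only once the GPU has produced the result.
bool tryReadTimerQuery(GLuint queryId, GLuint64 &elapsedNs)
{
    GLuint available = 0;
    glGetQueryObjectuiv(queryId, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        return false;              // GPU still busy; try again next frame
    glGetQueryObjectui64v(queryId, GL_QUERY_RESULT, &elapsedNs);
    return true;                   // query can now be reused
}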
Edit: To address your second question, swapping the buffers doesn't necessarily block until the operation succeeds. You may see that behaviour, but it's just as likely that the swap is merely an implicit glFlush and the command buffer is not empty yet. That is also the more desirable behaviour, because ideally you want to start on the next frame right away and keep the CPU-side command buffer filled. Check your implementation's documentation for details, though, as this is implementation-defined.
Edit 2: Checking for errors might end up being an implicit synchronization by the way, so you will probably see the command buffer emptying when you wait for error checking in the command stream.

Unbalanced load (v2.0) using MPI

(the problem is embarrassingly parallel)
Consider an array of 12 cells:
|__|__|__|__|__|__|__|__|__|__|__|__|
and four (4) CPUs.
Naively, I would run 4 parallel jobs and feed 3 cells to each CPU.
|__|__|__|__|__|__|__|__|__|__|__|__|
=========|========|========|========|
1 CPU 2 CPU 3 CPU 4 CPU
BUT it appears that each cell has a different evaluation time: some cells are evaluated very quickly, and some are not.
So, instead of wasting a "relaxed" CPU, I am thinking of feeding cells to CPUs one at a time, continuing until the entire job is done.
Namely:
at the beginning:
|____|____|____|____|____|____|____|____|____|____|____|____|
1cpu 2cpu 3cpu 4cpu
if 2cpu finishes its job at cell "2", it can jump to the first empty cell, "5", and continue working:
|____|done|____|____|____|____|____|____|____|____|____|____|
1cpu 3cpu 4cpu 2cpu
|-------------->
if 1cpu finishes, it can take the sixth cell:
|done|done|____|____|____|____|____|____|____|____|____|____|
3cpu 4cpu 2cpu 1cpu
|------------------------>
and so on, until the full array is done.
QUESTION:
I do not know a priori which cells are "quick" and which are "slow", so I cannot distribute the CPUs according to the load (more CPUs for slow cells, fewer for quick ones).
How can one implement such an algorithm for dynamic evaluation with MPI?
Thanks!!!!!
UPDATE
I use a very simple approach to divide the entire job into chunks, with MPI-IO:
given: array[NNN] and nprocs - number of available working units:
for (int i = 0; i < NNN/nprocs; ++i)
{
    do_what_I_need(start + i);
}
MPI_File_write(...);
where "start" corresponds to particular rank number. In simple words, I divide the entire NNN array into fixed size chunk according to the number of available CPU and each CPU performs its chunk, writes the result to (common) output and relaxes.
IS IT POSSIBLE to change the code (without completely rewriting it in terms of the master/slave paradigm) in such a way that each CPU gets only ONE iteration (instead of NNN/nprocs) and, after it completes its job and writes its part to the file, continues to the next unprocessed cell instead of relaxing?
Thanks!
There is a well known parallel programming pattern, known under many names, some of which are: bag of tasks, master / worker, task farm, work pool, etc. The idea is to have a single master process, which distributes cells to the other processes (workers). Each worker runs an infinite loop in which it waits for a message from the master, computes something and then returns the result. The loop is terminated by having the master send a message with a special tag. The wildcard tag value MPI_ANY_TAG can be used by the worker to receive messages with different tags.
The master is more complex. It also runs a loop, but until all cells have been processed. Initially it sends each worker a cell and then starts the loop. In this loop it receives a message from any worker using the wildcard source value MPI_ANY_SOURCE and, if there are more cells to be processed, sends one of them to the worker that returned the result. Otherwise it sends a message with the tag set to the termination value.
There are many readily available implementations of this model on the Internet, and even some on Stack Overflow (for example this one). Mind that this scheme requires one additional MPI process that often does very little work. If that is unacceptable, one can run the worker loop in a separate thread.
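A minimal sketch of the pattern (assuming the OP's do_what_I_need(cell) kernel; the tags, names, and integer ack payload are illustrative):

#include <mpi.h>

void do_what_I_need(int cell);  // the OP's per-cell kernel, defined elsewhere

enum { TAG_WORK = 1, TAG_STOP = 2 };

// Rank 0 hands out one cell index at a time; workers loop until TAG_STOP.
void master(int ncells, int nworkers)
{
    int next = 0, active = 0;
    for (int w = 1; w <= nworkers && next < ncells; ++w) {
        MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
        ++next;
        ++active;
    }
    while (active > 0) {
        int finished;
        MPI_Status st;
        MPI_Recv(&finished, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        if (next < ncells) {               // more cells: reuse this worker
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                     MPI_COMM_WORLD);
            ++next;
        } else {                           // nothing left: release worker
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                     MPI_COMM_WORLD);
            --active;
        }
    }
}

void worker()
{
    for (;;) {
        int cell;
        MPI_Status st;
        MPI_Recv(&cell, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
        if (st.MPI_TAG == TAG_STOP)
            break;
        do_what_I_need(cell);
        MPI_Send(&cell, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
    }
}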
You want to implement a kind of client-server architecture where you have workers asking the server for work whenever they are out of work.
Depending on the size of the chunks and the speed of your communication between workers and server, you may want to adjust the size of the chunks sent to workers.
To answer your updated question:
Under the master/slave (or worker pool if that's how you prefer it to be labelled) model, you will basically need a task scheduler. The master should have information about what work has been done and what still needs to be done. The master will give each process some work to be done, then sit and wait until a process completes (using nonblocking receives and a wait_all). Once a process completes, have it send the data to the master then wait for the master to respond with more work. Continue this until the work is done.
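If dedicating a rank to bookkeeping is unattractive, MPI-3 one-sided operations allow a masterless variant of the same dynamic scheduling: a shared counter that every rank atomically increments to claim the next cell. A hedged sketch (illustrative names; do_what_I_need is the OP's kernel):

#include <mpi.h>

void do_what_I_need(int cell);  // the OP's per-cell kernel, defined elsewhere

// Self-scheduling loop: rank 0 hosts a shared "next cell" counter, and
// every rank (rank 0 included) fetches-and-increments it when idle.
void processAll(int ncells, int rank)
{
    int counter = 0;                        // lives on rank 0 only
    MPI_Win win;
    MPI_Win_create(rank == 0 ? &counter : nullptr,
                   rank == 0 ? sizeof(int) : 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    const int one = 1;
    for (;;) {
        int cell;
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &cell, MPI_INT, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);
        if (cell >= ncells)
            break;                          // every cell has been claimed
        do_what_I_need(cell);
    }
    MPI_Win_free(&win);
}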

Methodology for debugging serial poll

I'm reading values from a sensor via serial (accelerometer values), in a loop similar to this:
while (1) {
    vector values = getAccelerometerValues();
    // Calculate velocity
    // Calculate total displacement
    if (displacement == 0)
        print("Back at origin");
}
I know the time that it takes per sample, which is taken care of in the getAccelerometerValues(), so I have a time-period to calculate velocity, displacement etc. I sample at approximately 120 samples / second.
This works, but there are bugs (imprecise accelerometer values, floating-point errors, etc.), and calibrating and compensating to get reasonably accurate displacement values is proving difficult.
I'm having great amounts of trouble finding a process to debug the loop. If I use a debugger (my code happens to be written in C++, and I'm slowly learning to use gdb rather than print statements), I have issues stepping through and pushing my sensor around to get an accelerometer reading at the point in time that the debugger executes the line. That is, it's very difficult to get the timing of "continue to next line" and "pushing sensor so it's accelerating" right.
I can use lots of print statements, which tend to fly past on-screen, but with this many samples that gets tedious, and it is difficult to work out where the problems are, particularly if there's more than one print statement per loop tick.
I can decrease the number of samples, which improves the readability of the programs output, but drastically decreases the reliability of the acceleration values I poll from the sensor; if I poll at 1Hz, the chances of polling the accelerometer value while it's undergoing acceleration drop considerably.
Overall, I'm having issues stepping through the code and using realistic data; I can step through it with non-useful data, or I can not really step through it with better data.
I'm assuming print statements aren't the best debugging method for this scenario. What's a better option? Are there any kinds of resources that I would find useful (am I missing something with gdb, or are there other tools that I could use)? I'm struggling to develop a methodology to debug this.
A sensible approach would be to use some interface for getAccelerometerValues() that you can substitute at either runtime or build-time, such as passing in a base-class pointer with a virtual method to override in the concrete derived class.
I can describe the mechanism in more detail if you need, but the ideal is to be able to run the same loop against:
real live data
real live data (and save it to file)
canned real data saved from a previous run
fake data you cooked up as a test case
Note in particular that the 'replay' version(s) should be easy to debug, if each call just returns the next data from the file.
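One possible shape for that interface (a C++ sketch; Vec3 and the file format are placeholders, and LiveSource should be adapted to whatever getAccelerometerValues() really returns):

#include <fstream>

struct Vec3 { double x, y, z; };          // placeholder sample type

Vec3 getAccelerometerValues();            // the existing poll, elsewhere

struct AccelSource {
    virtual ~AccelSource() = default;
    virtual Vec3 next() = 0;              // one sample per call
};

// Live sensor: wraps the existing poll.
struct LiveSource : AccelSource {
    Vec3 next() override { return getAccelerometerValues(); }
};

// Replay: feeds back samples recorded by a previous run, making every
// debugging session exactly reproducible.
struct ReplaySource : AccelSource {
    std::ifstream in;
    explicit ReplaySource(const char *path) : in(path) {}
    Vec3 next() override {
        Vec3 v{};
        in >> v.x >> v.y >> v.z;          // whitespace-separated triples
        return v;
    }
};

The main loop then works against an AccelSource&, so stepping through gdb with a ReplaySource is painless: the data no longer depends on when you happen to push the sensor.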
Create if blocks for the exact conditions you want to debug. For example, if you only care about the case where the accelerometer reads that you are moving left:
if (movingLeft(values)) {
    print("left");
}
The usual solution to this problem is recording. You capture sample sequences from your sensor in real time and store them in files. Then you train your system, debug your code, and so on, using the recorded data. Finally, you connect the working code to the real data stream flowing directly from the sensor.
I would debug your code with fake (i.e. random) values before anything else.
If the computations work as expected, then I would use the values read from the port.
Also, isn't there a way to read those values in a callback/push fashion, that is, to have your function called only when there is new, reliable data?
edit: I don't know what libraries you are using, but in the .NET framework you can use the SerialPort class with its DataReceived event. That way you are sure to be using the most recent, reliable data.