MPI send-receive issue in Fortran

I am currently starting to develop a parallel code for scientific applications. I have to exchange some buffers from p0 to p1 and from p1 to p0 (I am creating ghost points at the boundaries between processors).
The error can be summarized by this sample code:
program test
use mpi
implicit none
integer id, ids, idr, ierr, tag, istat(MPI_STATUS_SIZE)
real sbuf, rbuf

call mpi_init(ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD,id,ierr)

if(id.eq.0) then
   ids=0
   idr=1
   sbuf=1.5
   tag=id
else
   ids=1
   idr=0
   sbuf=3.5
   tag=id
endif

call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)

call mpi_finalize(ierr)
end program test
What is wrong with this?

Coding with MPI can be difficult at first, and it's good that you're going through the steps of making a sample code. Your sample code as posted hangs due to deadlock. Both processes are busy MPI_SEND-ing, and the send cannot complete until it has been MPI_RECV-ed. So the code is stuck.
There are two common ways around this problem.
Send and Receive in a Particular Order
This is the simple and easy-to-understand solution. Code your send and receive operations such that nobody ever gets stuck. For your 2-process test case, you could do:
if (id==0) then
   call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
   call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)
else
   call mpi_recv(rbuf,1,MPI_REAL,idr,tag,MPI_COMM_WORLD,istat,ierr)
   call mpi_send(sbuf,1,MPI_REAL,ids,tag,MPI_COMM_WORLD,ierr)
endif
Now, process 1 receives first, so there is never a deadlock. This particular example is not extensible, but there are various looping structures that can help. You can imagine a routine to send data from every process to every other process as:
do sending_process = 0, nproc-1
   if (id == sending_process) then
      ! -- I am sending
      do destination_process = 0, nproc-1
         if (sending_process == destination_process) cycle
         call MPI_SEND ! Send to destination_process
      enddo
   else
      ! -- I am receiving
      call MPI_RECV ! Receive from sending_process
   endif
enddo
This works reasonably well and is easy to follow. I recommend this structure for beginners.
However, it has several issues for truly large problems. You are sending a number of messages equal to the number of processes squared, which can overload a large network. Also, depending on your operation, you probably do not need to send data from every process to every other process. (I suspect this is true for you given you mentioned ghosts.) You can modify the above loop to only send if data are required, but for those cases there is a better option.
Use Non-Blocking MPI Operations
For many-core problems, this is often the best solution. I recommend sticking to the simple MPI_ISEND and MPI_IRECV. Here, you start all necessary sends and receives, and then wait.
Here, I am using a list structure, set up beforehand, which defines the complete list of necessary destinations for each process.
! -- Open sends
do d = 1, Number_Destinations
   idest = Destination_List(d)
   call MPI_ISEND ! To destination idest
enddo

! -- Open receives
do s = 1, Number_Senders
   isend = Senders_List(s)
   call MPI_IRECV ! From source isend
enddo

call MPI_WAITALL
This option may look simpler but it is not. You must set up all necessary lists beforehand, and there are a variety of potential problems with buffer size and data alignment. Even still, it is typically the best answer for big codes.
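For the two-process exchange in the original question, a minimal non-blocking sketch (assuming the partner rank is computed as other = 1 - id, and using tag 0 on both sides) could look like:

integer other, ireq(2), istats(MPI_STATUS_SIZE,2)
other = 1 - id   ! partner rank in a two-process run
call mpi_isend(sbuf,1,MPI_REAL,other,0,MPI_COMM_WORLD,ireq(1),ierr)
call mpi_irecv(rbuf,1,MPI_REAL,other,0,MPI_COMM_WORLD,ireq(2),ierr)
call mpi_waitall(2,ireq,istats,ierr)

Both calls return immediately, and MPI_WAITALL blocks until the exchange has completed, so neither rank can deadlock regardless of message size.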

As pointed out by Vladimir, your code is too incomplete to provide a definitive answer.
That being said, this could be a well-known error.
MPI_Send() might block. From a pragmatic point of view, MPI_Send() is likely to return immediately when sending a short message, but is likely to block when sending a large message. Note that what counts as small or large depends on your MPI library, the interconnect you are using, plus other runtime parameters. MPI_Send() might block until a matching MPI_Recv() is posted on the other end.
It seems you call MPI_Send() and MPI_Recv() in the same block of code, so you can try using MPI_Sendrecv() to do both in one shot. MPI_Sendrecv() will issue a non-blocking send under the hood, so that will help if your issue is really an MPI_Send() deadlock.
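Applied to the code from the question, the combined call could look like this sketch (other = 1 - id is the partner rank, and istat is declared as in the original program):

other = 1 - id
call mpi_sendrecv(sbuf, 1, MPI_REAL, other, 0, &
                  rbuf, 1, MPI_REAL, other, 0, &
                  MPI_COMM_WORLD, istat, ierr)

The send and receive halves make progress independently, so the symmetric exchange completes without either rank blocking the other.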

Related

Isend/Irecv doesn't work but Send/Recv does

When I use Send/Recv my code works, but when I replace Send/Recv with Isend/Irecv it yields a segmentation fault. Before going anywhere else, I wanted to verify whether the following snippet seems all right or not.
The rest of the code should be fine, as Send/Recv works; I haven't pasted it here as it's a long code.
INTEGER :: IERR,TASKID,NUMTASKS,SPANX,SPANY,SPANZ,PROCSX,PROCSY,PROCSZ,STAT,STATUS(MPI_STATUS_SIZE),ISTAT(MPI_STATUS_SIZE,52)
INTEGER,DIMENSION(1:52) :: REQ

ALLOCATE(RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND),STAT=IERR)
IF (IERR /= 0) PRINT*,'ERROR IN RCC BY',TASKID

DO I=1,52
   REQ(I)=MPI_REQUEST_NULL
ENDDO

IF (TASKID.NE.0) THEN
   NT=TASKID
   CALL MPI_ISEND(RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND),SIZE(RCC),MPI_DOUBLE_PRECISION,0,8,MPI_COMM_WORLD,REQ(NT),IERR)
ENDIF

IF (TASKID.EQ.0) THEN
   DO NT = 1,26
      CALL MPI_IRECV(CC(RSPANX(NT):RSPANXE(NT),RSPANY(NT):RSPANYE(NT),RSPANZ(NT):RSPANZE(NT)),SIZECC(NT),MPI_DOUBLE_PRECISION,NT,8,MPI_COMM_WORLD,REQ(NT+26),IERR)
   ENDDO
ENDIF

CALL MPI_WAITALL(52,REQ,ISTAT,IERR)

DEALLOCATE(RCC,STAT=IERR)
IF (IERR /= 0) PRINT*,'ERROR IN DEALLOCATE RCC BY',TASKID

CALL MPI_FINALIZE(IERR)
RETURN
END
However, when I use Isend/Irecv, the following variant of the receive line doesn't give a segmentation fault:
CALL MPI_IRECV(CC(RSPANX(NT),RSPANY(NT),RSPANZ(NT)),SIZECC(NT),MPI_DOUBLE_PRECISION,NT,8,MPI_COMM_WORLD,REQ(NT+26),IERR)
Calling asynchronous communication routines like MPI_ISEND and MPI_IRECV with array sections, e.g. RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND), is very dangerous. The reason is that due to limitations in the older Fortran standards most MPI implementations do not provide proper interfaces for those routines and the compiler copies the data from the array section into a temporary contiguous storage, which then gets passed to the subroutine. The segmentation fault probably occurs due to this temporary storage being freed on return from MPI_ISEND/MPI_IRECV before the actual data transfer takes place. You can prevent this from happening by manually allocating the contiguous array and copying the data there.
On the other hand, CC(RSPANX(NT),RSPANY(NT),RSPANZ(NT)) does not refer to a section of the array but rather to the location of a single element. No temporary copy of the data is created in this case.
MPI-3.0 provides an improved set of Fortran bindings, mpi_f08, which uses modern features of Fortran 2008 and TS 29113 to mark such arguments with the ASYNCHRONOUS attribute and to enable safe passing of arrays with different dimensions (TYPE(*), DIMENSION(..)).
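Until you can use mpi_f08, one workaround is the manual staging described above. A sketch (SENDBUF is a hypothetical buffer name; the key point is that it stays allocated until the wait completes):

DOUBLE PRECISION, ALLOCATABLE :: SENDBUF(:,:,:)
ALLOCATE(SENDBUF(IIST:IIEND,JJST:JJEND,KKST:KKEND))
SENDBUF = RCC(IIST:IIEND,JJST:JJEND,KKST:KKEND)   ! explicit contiguous copy
CALL MPI_ISEND(SENDBUF,SIZE(SENDBUF),MPI_DOUBLE_PRECISION,0,8,MPI_COMM_WORLD,REQ(NT),IERR)
! ... SENDBUF may only be deallocated after MPI_WAITALL has completed ...

Because SENDBUF is an allocatable array passed as a whole, no compiler-generated temporary is involved, and its lifetime is under your control.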

How to make something lwt supported?

I am trying to understand the term lwt supported.
So assume I have a piece of code which connects to a database and writes some data: Db.write conn data. It has nothing to do with Lwt yet, and each write takes 10 seconds.
Now, I would like to use lwt. Can I directly code like below?
let write_all data_list = Lwt_list.iter (Db.write conn) data_list
let _ = Lwt_main.run(write_all my_data_list)
Suppose there are 5 data items in my_data_list; will all 5 data items be written into the database sequentially or in parallel?
Also, in the Lwt manual and http://ocsigen.org/tutorial/application, they say
Using Lwt is very easy and does not cause troubles, provided you never
use blocking functions (non cooperative functions). Blocking functions
can cause the entire server to hang!
I don't quite understand how to avoid using blocking functions. For each of my own functions, can I just use Lwt.return to make it Lwt supported?
Yes, your code is correct. The principle of Lwt supported is that everything that can potentially take time in your code should return an Lwt value.
About Lwt_list.iter, you can choose whether you want the treatment to be parallel or sequential by choosing between iter_p and iter_s:
In iter_s f l, iter_s will call f on each element
of l, waiting for completion between each element. On the
contrary, in iter_p f l, iter_p will call f on all
elements of l, then wait for all the threads to terminate.
About the non-blocking functions, the principle of the lightweight threads is that they keep running until they reach a "cooperation point", i.e. a point where the thread can be safely interrupted or has nothing to do, like in a sleep.
But you have to declare that you are entering a "cooperation point" before actually doing the sleep. This is why the whole Unix library has been wrapped, so that when you want to do an operation that takes time (e.g. a write), a cooperation point is automatically reached.
For your own functions, if you use I/O operations from Unix, you should instead use the Lwt version (Lwt_unix.sleep instead of Unix.sleep).
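As an illustration, here is a minimal sketch (db_write is a stand-in for Db.write that just sleeps to simulate slow I/O; Lwt_unix.sleep is a cooperation point, unlike Unix.sleep):

(* Stand-in for Db.write: pretend each write takes 1 second of I/O. *)
let db_write data =
  Lwt.bind (Lwt_unix.sleep 1.0) (fun () ->
    Lwt_io.printf "wrote %s\n" data)

let my_data_list = ["a"; "b"; "c"; "d"; "e"]

(* Sequential: each write finishes before the next starts (~5 s total). *)
let write_all_s () = Lwt_list.iter_s db_write my_data_list

(* Parallel: all writes are started at once, then awaited (~1 s total). *)
let write_all_p () = Lwt_list.iter_p db_write my_data_list

let () = Lwt_main.run (write_all_p ())

Had db_write used Unix.sleep instead, even iter_p would run for about 5 seconds, because the blocking call never yields to the Lwt scheduler.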

Huge difference in MPI_Wtime() after using MPI_Barrier()?

Here is the relevant part of the code.
if(rank==0) {
    temp=10000;
    var=new char[temp];
    MPI_Send(&temp,1,MPI_INT,1,tag,MPI_COMM_WORLD);
    MPI_Send(var,temp,MPI_BYTE,1,tag,MPI_COMM_WORLD);
    //MPI_Wait(&req[0],&sta[1]);
}
if(rank==1) {
    MPI_Irecv(&temp,1,MPI_INT,0,tag,MPI_COMM_WORLD,&req[0]);
    MPI_Wait(&req[0],&sta[0]);
    var=new char[temp];
    MPI_Irecv(var,temp,MPI_BYTE,0,tag,MPI_COMM_WORLD,&req[1]);
    MPI_Wait(&req[0],&sta[0]);
}
//I am talking about this MPI_Barrier
MPI_Barrier(MPI_COMM_WORLD);
cout << MPI_Wtime()-t1 << endl;
cout << "hello " << rank << " " << temp << endl;
MPI_Finalize();
}
1. When using MPI_Barrier: as expected, all processes take almost the same amount of time, on the order of 0.02 seconds.
2. When not using MPI_Barrier(): the root process (sending the message) waits for some extra time, (MPI_Wtime() - t1) varies a lot, and the time taken by the root process is on the order of 2 seconds.
If I am not really mistaken, MPI_Barrier is only used to bring all the running processes to the same point. So why isn't the time when I am using MPI_Barrier() also 2 seconds (the minimum over all processes, i.e. that of the root process)? Please explain.
Thanks to Wesley Bland for noticing that you are waiting twice on the same request. Here is an explanation of what actually happens.
There is something called progression of asynchronous (non-blocking) operations in MPI. That is when the actual transfer happens. Progression could happen in many different ways and at many different points within the MPI library. When you post an asynchronous operation, its progression could be deferred indefinitely, even until the point that one calls MPI_Wait, MPI_Test or some call that would result in new messages being pushed to or pulled from the transmit/receive queue. That's why it is very important to call MPI_Wait or MPI_Test as quickly as possible after the initiation of a non-blocking operation.
Open MPI supports a background progression thread that takes care to progress the operations even if the condition in the previous paragraph is not met, e.g. if MPI_Wait or MPI_Test is never called on the request handle. This has to be explicitly enabled when the library is being built. It is not enabled by default since background progression increases the latency of the operations.
What happens in your case is that you are waiting on the incorrect request the second time you call MPI_Wait in the receiver, therefore the progression of the second MPI_Irecv operation is postponed. The message is more than 40 KiB in size (10000 times 4 bytes + envelope overhead) which is above the default eager limit in Open MPI (32 KiB). Such messages are sent using the rendezvous protocol that requires both the send and the receive operations to be posted and progressed. The receive operation doesn't get progressed and hence the send operation in rank 0 blocks until at some point in time the clean-up routines that MPI_Finalize in rank 1 calls eventually progress the receive.
When you put the call to MPI_Barrier, it leads to the progression of the outstanding receive, acting almost like an implicit call to MPI_Wait. That's why the send in rank 0 completes quickly and both processes move on in time.
Note that MPI_Irecv, immediately followed by MPI_Wait is equivalent to simply calling MPI_Recv. The latter is not only simpler, but also less prone to simple typos like the one that you've made.
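For reference, here is the receiver block rewritten with blocking receives, which both fixes the double wait and simplifies the code (a sketch based on the snippet above):

if (rank == 1) {
    // Blocking receive: equivalent to MPI_Irecv immediately followed by
    // MPI_Wait on the matching request.
    MPI_Recv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    var = new char[temp];
    MPI_Recv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}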
You're waiting on the same request twice for your Irecvs. The second one is the one that would take all of the time, and since it's getting skipped, rank 0 is getting way ahead.
MPI_BARRIER can be implemented such that some processes can leave the algorithm before the rest of the processes enter it. That's probably what's happening here.
In the tests that I ran, I see almost no difference in the runtimes. The main difference is that you seem to be running your code one time, whereas I looped over your code thousands of times and then took the average. My output is below:
With the barrier
[0]: 1.65071e-05
[1]: 1.66872e-05
Without the barrier
[0]: 1.35653e-05
[1]: 1.30711e-05
So I would assume any variation you are seeing is a result of your operating system more than your program.
Also, why are you using MPI_Irecv coupled with MPI_Wait rather than just using MPI_Recv?

mpi alters a variable it shouldn't [duplicate]

This question already has an answer here: MPI_Recv overwrites parts of memory it should not access (1 answer). Closed 3 years ago.
I have some Fortran code that I'm parallelizing with MPI which is doing truly bizarre things. First, there's a variable nstartg that I broadcast from the boss process to all the workers:
call mpi_bcast(nstartg,1,mpi_integer,0,mpi_comm_world,ierr)
The variable nstartg is never altered again in the program. Later on, I have the boss process send eproc elements of an array edge to the workers:
if (me==0) then
   do n=1,ntasks-1
      ! (determine the starting point estart and the number eproc
      !  of values to send)
      call mpi_send(edge(estart),eproc,mpi_integer,n,n,mpi_comm_world,ierr)
   enddo
endif
with a matching receive statement if me is non-zero. (I've left out some other code for readability; there's a good reason I'm not using scatterv.)
Here's where things get weird: the variable nstartg gets altered to n instead of keeping its actual value. For example, on process 1, after the mpi_recv, nstartg = 1, and on process 2 it's equal to 2, and so forth. Moreover, if I change the code above to
call mpi_send(edge(estart),eproc,mpi_integer,n,n+1234567,mpi_comm_world,ierr)
and change the tag accordingly in the matching call to mpi_recv, then on process 1, nstartg = 1234568; on process 2, nstartg = 1234569, etc.
What on earth is going on? All I've changed is the tag that mpi_send/recv are using to identify the message; provided the tags are unique so that the messages don't get mixed up, this shouldn't change anything, and yet it's altering a totally unrelated variable.
On the boss process, nstartg is unaltered, so I can fix this by broadcasting it again, but that's hardly a real solution. Finally, I should mention that compiling and running this code using electric fence hasn't picked up any buffer overflows, nor did -fbounds-check throw anything at me.
The most probable cause is that you pass an INTEGER scalar as the actual status argument to MPI_RECV, when it should really be declared as an array with an implementation-specific size, available as the MPI_STATUS_SIZE constant:
INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status
or
INTEGER status(MPI_STATUS_SIZE)
The message tag is written to one of the status fields by the receive operation (its implementation-specific index is available as the MPI_TAG constant, and the field value can be accessed as status(MPI_TAG)), and if your status is simply a scalar INTEGER, then several other local variables would get overwritten. In your case it simply happens that nstartg falls just above status in the stack.
If you do not care about the receive status, you can pass the special constant MPI_STATUS_IGNORE instead.
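For example (a sketch with hypothetical buffer and count names, matching the receive in the question, where the boss sends with tag n to rank n):

call mpi_recv(edge_local, eproc, mpi_integer, 0, me, mpi_comm_world, &
              MPI_STATUS_IGNORE, ierr)

With MPI_STATUS_IGNORE the library writes no status at all, so there is nothing to overflow.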

How/why do functional languages (specifically Erlang) scale well?

I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand the lack of state and mutability makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) that all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?
A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message passing system that allows everything to operate on an event based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned that the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet. No shared code, no shared state, and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared nothing system like Erlang.
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.
The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
EDIT:
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.
It's likely that you're mixing up synchronous with sequential.
The body of a function in Erlang is processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for Erlang. You could model this behaviour with Erlang though.
For example, you could spawn a process that calculates the number of words in a line.
Since we have several lines, we spawn one such process for each line and receive the answers to calculate a sum from them.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in Lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
            % Split the line into words.
            ListOfWords = string:tokens(Line, " "),
            Length = length(ListOfWords),
            io:format("process ~p calculated ~p words~n", [self(), Length]),
            % Send a tuple containing our pid and Length to S.
            S ! {self(), Length}
        end,
    % There is no return in Erlang; instead, the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The variable Pid gets bound in the function head.
% In Erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern matching: the block behind "->" will execute only if we
        % receive a tuple that matches the one below. The variable Pid is
        % already bound, so we are waiting here for the answer of a specific
        % process.
        % N is unbound so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like, when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
{ok,countwords}
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
ok
4>
The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang, one operating system process is used by the virtual machine, and the VM provides concurrency to the Erlang programme not by using operating system threads but by providing Erlang processes – that is, Erlang implements its own timeslicer.
These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID) which has a three-part address <<N3.N2.N1>>:
process no N1 on
VM N2 on
physical machine N3
Two processes on the same VM, on different VMs on the same machine, or on two different machines communicate in the same way – your scaling is therefore independent of the number of physical machines you deploy your application on (to a first approximation).
Erlang is only threadsafe in a trivial sense – it doesn't have threads. (The language that is, the SMP/multi-core VM uses one operating system thread per core).
You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.
Erlang messages are purely asynchronous; if you want a synchronous reply to your message, you need to explicitly code for that. What was possibly said is that messages in a process's mailbox are processed sequentially. Any message sent to a process sits in that process's mailbox, and the process picks one message from that box, processes it, and then moves on to the next one, in the order it sees fit. This is a very sequential act, and the receive block does exactly that.
It looks like you have mixed up synchronous and sequential, as chris mentioned.
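As a small sketch of that sequential mailbox processing (an illustrative loop, not from the question):

%% Messages accumulate in the mailbox while the process is busy;
%% receive pulls them out one at a time, so handling is sequential
%% even though the senders never block.
loop(Count) ->
    receive
        {From, Msg} ->
            From ! {self(), {ack, Msg, Count}},  % the reply is itself an async message
            loop(Count + 1)
    end.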
Referential transparency: See http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)
In a purely functional language, order of evaluation doesn't matter: in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run on the same virtual machine or on a different processor – there is no way to tell. That is only possible because messages are copied between processes; there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading, since threads depend upon shared memory: there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.