How/why do functional languages (specifically Erlang) scale well? - concurrency

I have been watching the growing visibility of functional programming languages and features for a while. I looked into them and didn't see the reason for the appeal.
Then, recently I attended Kevin Smith's "Basics of Erlang" presentation at Codemash.
I enjoyed the presentation and learned that a lot of the attributes of functional programming make it much easier to avoid threading/concurrency issues. I understand that the lack of mutable state makes it impossible for multiple threads to alter the same data, but Kevin said (if I understood correctly) that all communication takes place through messages and the messages are processed synchronously (again avoiding concurrency issues).
But I have read that Erlang is used in highly scalable applications (the whole reason Ericsson created it in the first place). How can it be efficient handling thousands of requests per second if everything is handled as a synchronously processed message? Isn't that why we started moving towards asynchronous processing - so we can take advantage of running multiple threads of operation at the same time and achieve scalability? It seems like this architecture, while safer, is a step backwards in terms of scalability. What am I missing?
I understand the creators of Erlang intentionally avoided supporting threading to avoid concurrency problems, but I thought multi-threading was necessary to achieve scalability.
How can functional programming languages be inherently thread-safe, yet still scale?

A functional language doesn't (in general) rely on mutating a variable. Because of this, we don't have to protect the "shared state" of a variable, because the value is fixed. This in turn avoids the majority of the hoop jumping that traditional languages have to go through to implement an algorithm across processors or machines.
Erlang takes it further than traditional functional languages by baking in a message-passing system that allows everything to operate on an event-based system where a piece of code only worries about receiving messages and sending messages, not worrying about a bigger picture.
What this means is that the programmer is (nominally) unconcerned with whether the message will be handled on another processor or machine: simply sending the message is good enough for it to continue. If it cares about a response, it will wait for it as another message.
The end result of this is that each snippet is independent of every other snippet. No shared code, no shared state, and all interactions coming from a message system that can be distributed among many pieces of hardware (or not).
Contrast this with a traditional system: we have to place mutexes and semaphores around "protected" variables and code execution. We have tight binding in a function call via the stack (waiting for the return to occur). All of this creates bottlenecks that are less of a problem in a shared-nothing system like Erlang.
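A minimal Erlang sketch of that shape (module and message names are hypothetical, just to illustrate): the receiving code is nothing but a receive loop, and the sender fires a message off asynchronously and only waits if it actually wants a reply.
-module(echo).
-export([start/0]).

%% The spawned process knows nothing about the bigger picture: it only
%% receives messages and sends messages.
loop() ->
    receive
        {From, Msg} ->
            From ! {self(), {echoed, Msg}},
            loop()
    end.

start() ->
    Pid = spawn(fun loop/0),
    Pid ! {self(), hello},      % asynchronous send: we continue immediately
    receive
        {Pid, Reply} -> Reply   % we only block here because we want the response
    end.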
EDIT: I should also point out that Erlang is asynchronous. You send your message and maybe/someday another message arrives back. Or not.
Spencer's point about out of order execution is also important and well answered.

The message queue system is cool because it effectively produces a "fire-and-wait-for-result" effect which is the synchronous part you're reading about. What makes this incredibly awesome is that it means lines do not need to be executed sequentially. Consider the following code:
r = methodWithALotOfDiskProcessing();
x = r + 1;
y = methodWithALotOfNetworkProcessing();
w = x * y;
Consider for a moment that methodWithALotOfDiskProcessing() takes about 2 seconds to complete and that methodWithALotOfNetworkProcessing() takes about 1 second to complete. In a procedural language this code would take about 3 seconds to run because the lines would be executed sequentially. We're wasting time waiting for one method to complete that could run concurrently with the other without competing for a single resource. In a functional language lines of code don't dictate when the processor will attempt them. A functional language would try something like the following:
Execute line 1 ... wait.
Execute line 2 ... wait for r value.
Execute line 3 ... wait.
Execute line 4 ... wait for x and y value.
Line 3 returned ... y value set, message line 4.
Line 1 returned ... r value set, message line 2.
Line 2 returned ... x value set, message line 4.
Line 4 returned ... done.
How cool is that? By going ahead with the code and only waiting where necessary we've reduced the waiting time to two seconds automagically! :D So yes, while the code is synchronous it tends to have a different meaning than in procedural languages.
EDIT:
Once you grasp this concept in conjunction with Godeke's post it's easy to imagine how simple it becomes to take advantage of multiple processors, server farms, redundant data stores and who knows what else.

It's likely that you're mixing up synchronous with sequential.
The body of a function in Erlang is processed sequentially.
So what Spencer said about this "automagical effect" doesn't hold true for Erlang. You could model this behaviour with Erlang though.
For example, you could spawn a process that calculates the number of words in a line.
Since we have several lines, we spawn one such process per line and receive the answers to calculate a sum from them.
That way, we spawn processes that do the "heavy" computations (utilizing additional cores if available) and later we collect the results.
-module(countwords).
-export([count_words_in_lines/1]).

count_words_in_lines(Lines) ->
    % For each line in Lines run spawn_summarizer with the process id (pid)
    % and a line to work on as arguments.
    % This is a list comprehension and spawn_summarizer will return the pid
    % of the process that was created. So the variable Pids will hold a list
    % of process ids.
    Pids = [spawn_summarizer(self(), Line) || Line <- Lines],
    % For each pid receive the answer. This will happen in the same order in
    % which the processes were created, because we saved [pid1, pid2, ...] in
    % the variable Pids and now we consume this list.
    Results = [receive_result(Pid) || Pid <- Pids],
    % Sum up the results.
    WordCount = lists:sum(Results),
    io:format("We've got ~p words, Sir!~n", [WordCount]).

spawn_summarizer(S, Line) ->
    % Create an anonymous function and save it in the variable F.
    F = fun() ->
            % Split the line into words.
            ListOfWords = string:tokens(Line, " "),
            Length = length(ListOfWords),
            io:format("process ~p calculated ~p words~n", [self(), Length]),
            % Send a tuple containing our pid and Length to S.
            S ! {self(), Length}
        end,
    % There is no return in Erlang; instead the last value in a function is
    % returned implicitly.
    % Spawn the anonymous function and return the pid of the new process.
    spawn(F).

% The variable Pid gets bound in the function head.
% In Erlang, you can only assign to a variable once.
receive_result(Pid) ->
    receive
        % Pattern matching: the block behind "->" will execute only if we receive
        % a tuple that matches the one below. The variable Pid is already bound,
        % so we are waiting here for the answer of a specific process.
        % N is unbound so we accept any value.
        {Pid, N} ->
            io:format("Received \"~p\" from process ~p~n", [N, Pid]),
            N
    end.
And this is what it looks like when we run this in the shell:
Eshell V5.6.5 (abort with ^G)
1> Lines = ["This is a string of text", "and this is another", "and yet another", "it's getting boring now"].
["This is a string of text","and this is another",
"and yet another","it's getting boring now"]
2> c(countwords).
{ok,countwords}
3> countwords:count_words_in_lines(Lines).
process <0.39.0> calculated 6 words
process <0.40.0> calculated 4 words
process <0.41.0> calculated 3 words
process <0.42.0> calculated 4 words
Received "6" from process <0.39.0>
Received "4" from process <0.40.0>
Received "3" from process <0.41.0>
Received "4" from process <0.42.0>
We've got 17 words, Sir!
ok
4>

The key thing that enables Erlang to scale is related to concurrency.
An operating system provides concurrency by two mechanisms:
operating system processes
operating system threads
Processes don't share state – one process can't crash another by design.
Threads share state – one thread can crash another by design – that's your problem.
With Erlang, one operating system process is used by the virtual machine, and the VM provides concurrency to the Erlang program not by using operating system threads but by providing Erlang processes – that is, Erlang implements its own time-slicer.
These Erlang processes talk to each other by sending messages (handled by the Erlang VM, not the operating system). The Erlang processes address each other using a process ID (PID) which has a three-part address <<N3.N2.N1>>:
process no N1 on
VM N2 on
physical machine N3
Two processes on the same VM, on different VMs on the same machine, or on two different machines communicate in the same way – your scaling is therefore independent of the number of physical machines you deploy your application on (to a first approximation).
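A small sketch of that location transparency (the node name 'node2@host' and the module mymod are hypothetical; the remote spawn assumes the node is connected and has the module loaded):
%% Spawning locally and on another node looks almost the same, and sending
%% a message to either pid is exactly the same.
LocalPid  = spawn(mymod, loop, []),
RemotePid = spawn('node2@host', mymod, loop, []),
LocalPid  ! {self(), ping},
RemotePid ! {self(), ping}.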
Erlang is only thread-safe in a trivial sense – it doesn't have threads. (The language, that is; the SMP/multi-core VM uses one operating system thread per core.)

You may have a misunderstanding of how Erlang works. The Erlang runtime minimizes context-switching on a CPU, but if there are multiple CPUs available, then all are used to process messages. You don't have "threads" in the sense that you do in other languages, but you can have a lot of messages being processed concurrently.

Erlang messages are purely asynchronous; if you want a synchronous reply to your message, you need to explicitly code for that. What was possibly said is that messages in a process's mailbox are processed sequentially. Any message sent to a process sits in that process's mailbox, and the process gets to pick one message from that box, process it, and then move on to the next one, in the order it sees fit. This is a very sequential act and the receive block does exactly that.
Looks like you have mixed up synchronous and sequential as chris mentioned.
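Explicitly coding for a synchronous reply usually looks something like this (a minimal sketch inside some module, assuming the receiving process answers with {Ref, Reply}):
call(Pid, Request) ->
    Ref = make_ref(),
    Pid ! {self(), Ref, Request},   % the send itself is asynchronous
    receive
        {Ref, Reply} -> Reply       % block here until the matching reply arrives
    end.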

Referential transparency: See http://en.wikipedia.org/wiki/Referential_transparency_(computer_science)

In a purely functional language, order of evaluation doesn't matter - in a function application fn(arg1, .. argn), the n arguments can be evaluated in parallel. That guarantees a high level of (automatic) parallelism.
Erlang uses a process model where a process can run in the same virtual machine, or on a different processor -- there is no way to tell. That is only possible because messages are copied between processes; there is no shared (mutable) state. Multi-processor parallelism goes a lot farther than multi-threading, since threads depend upon shared memory: there can only be 8 threads running in parallel on an 8-core CPU, while multi-processing can scale to thousands of parallel processes.

Related

Best way to implement a periodic linux task in c++20

I have a periodic task in C++, running on an embedded Linux platform, which has to run at 5 ms intervals. It seems to be working as expected, but is my current solution good enough?
I have implemented the scheduler using sleep_until(), but some comments I have received say that setitimer() is better. As I would like the application to be at least somewhat portable, I would prefer the C++ standard library... of course unless there are other problems.
I have found plenty of sites that show an implementation with each, but I have not found any arguments for why one solution is better than the other. As I see it, sleep_until() will implement an "optimal" sleep on any (supported) platform, and I'm getting a feeling the comments I have received are focused more on usleep() (which I do not use).
My implementation looks a little like this:
bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

int main() {
    if (not is_submilli_capable())
        exit(1);
    while (true) {
        auto next_time = next_period_start();
        do_the_magic();
        std::this_thread::sleep_until(next_time);
    }
}
A short summary of the issue:
I have an embedded Linux platform, built with Yocto and with RT capabilities
The application needs to read and process incoming data every 5 ms
Building with GCC 11.2.0
Using C++20
All the "hard work" is done in separate threads, so this question only regards triggering the task periodically and with minimal jitter
Since the application is supposed to read and process the data every 5 ms, it is possible that it sometimes does not perform the required operations. What I mean to say is that in a time interval of 20 ms, do_the_magic() is supposed to be invoked 4 times; but if the time taken to execute do_the_magic() is 10 ms, it will get invoked only 2 times. If that is an acceptable outcome, the current implementation is good enough.
Since the application is reading data, it probably receives it from the network or disk. Adding the overhead of processing it, it may well take more than 5 ms to do so (depending on the size of the data). If it is not acceptable to miss any invocation of do_the_magic, the current implementation is not good enough.
What you could probably do is create a few threads. Each thread executes the do_the_magic function and then goes to sleep. Every 5 ms, you wake a sleeping thread, which will most likely take less than 5 ms to happen. This way no invocation of do_the_magic is missed. Also, the number of threads depends on how long do_the_magic takes to execute.
#include <chrono>
#include <cstdlib>
#include <ratio>
#include <semaphore>
#include <thread>
#include <vector>

void do_the_magic();                                        // from the question
std::chrono::system_clock::time_point next_period_start();  // from the question

constexpr int NUM_THREADS = 4;
// One binary semaphore per worker; worker i sleeps until semaphore i is released.
std::binary_semaphore semaphores[NUM_THREADS] = {
    std::binary_semaphore{0}, std::binary_semaphore{0},
    std::binary_semaphore{0}, std::binary_semaphore{0}};

bool is_submilli_capable() {
    return std::ratio_greater<std::milli,
                              std::chrono::system_clock::period>::value;
}

void wake_some_thread() {
    static int i = 0;
    semaphores[i].release();     // Release the semaphore associated with thread i
    i = (i + 1) % NUM_THREADS;
}

void thread_func(int i) {
    while (true) {
        semaphores[i].acquire(); // Wait for this thread's semaphore
        do_the_magic();
    }
}

int main() {
    if (not is_submilli_capable())
        exit(1);
    std::vector<std::thread> workers;
    for (int i = 0; i < NUM_THREADS; ++i)
        workers.emplace_back(thread_func, i);
    while (true) {
        auto next_time = next_period_start();
        wake_some_thread();      // Releases a semaphore to wake one worker
        std::this_thread::sleep_until(next_time);
    }
}
Create as many semaphores as the number of threads where thread i is waiting for semaphore i. wake_some_thread can then release a semaphore starting from index 0 till NUM_THREADS and start again.
5ms is a pretty tight timing.
You can get a jitter-free 5ms tick only if you do the following:
Isolate a CPU for this thread. Configure it with nohz_full and rcu_nocbs
Pin your thread to this CPU, assign it a real-time scheduling policy (e.g., SCHED_FIFO)
Do not let any other threads run on this CPU core.
Do not allow any context switches in this thread. This includes avoiding system calls altogether. I.e., you cannot use std::this_thread::sleep_until(...) or anything else.
Do a busy wait in between processing (ensure 100% CPU utilisation)
Use lock-free communication to transfer data from this thread to other, non-real-time threads, e.g., for storing the data to files, accessing network, logging to console, etc.
Now, the question is how you're going to "read and process data" without system calls. It depends on your system. If you can do any user-space I/O (map the physical register addresses into your process address space, use DMA without interrupts, etc.), you'll have perfectly real-time processing. Otherwise, any system call will trigger a context switch, and the latency of this context switch will be unpredictable.
For example, you can do this with certain Ethernet devices (SolarFlare, etc.), with 100% user-space drivers. For anything else you're likely to have to write your own user-space driver, or even implement your own interrupt-free device (e.g., if you're running on an FPGA SoC).
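A rough sketch of the pinning and busy-waiting part (Linux-specific; the core number 3 and the priority 80 are arbitrary placeholder values, SCHED_FIFO requires appropriate privileges, and do_the_magic() stands in for the question's processing):
#include <chrono>
#include <pthread.h>
#include <sched.h>

void do_the_magic();   // the question's processing function

void realtime_loop() {
    // Pin the calling thread to CPU core 3 (assumed isolated via isolcpus/nohz_full).
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // Give the thread a real-time scheduling policy and priority.
    sched_param sp{};
    sp.sched_priority = 80;
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);

    using clock = std::chrono::steady_clock;
    auto next = clock::now();
    while (true) {
        next += std::chrono::milliseconds(5);
        do_the_magic();
        // Busy-wait instead of sleeping; on Linux, clock::now() goes through
        // the vDSO, so it does not normally enter the kernel.
        while (clock::now() < next) { /* spin */ }
    }
}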

Running multiple while true loops independently in python

Essentially I have 2 "while True:" loops in my code. Both of the loops are right at the end. However when I run the code, only the first while True: loop gets run, and the second one gets ignored.
For example:
while True:
print "hi"
while True:
print "bye"
Here, it will continuously print hi, but won't print bye at all (the actual code has a tracer.execute() for one loop, and the other is listening to a port, and they both work on their own).
Is there any way to get both loops to work at the same time independently?
Yes. A way to get both loops to work at the same time independently:
Your initial surprise was related to the nature of how finite-state automata actually work.
[0]: any-processing-will-always-<START>-here
[1]: Read a next instruction
[2]: Execute the instruction
[3]: GO TO [1]
The stream of abstract instructions is executed in a pure-[SERIAL] manner, one after another. There has been no other way in the CPU since uncle Turing.
Your desire to have more streams-of-instructions run at the same time independently is called [CONCURRENT] process-scheduling.
You have several tools for achieving a wanted modus-operandi:
Read about the weaker form first, using just thread-based concurrency. Due to the Python-specific GIL lock, this still executes on the physical hardware as [CONCURRENT] processing, but GIL-interleaved: the GIL (knowingly implemented as a very cheap form of collision avoidance for every case such [CONCURRENCY] might introduce) interleaves the (now) [CONCURRENT] streams so as to avoid colliding access to any Python object at the same time. If you are fine with executing just one instruction-stream fragment at a time (with the actual order of GIL-stepped execution round-robined), you can live in a safe and collision-free world.
Another tool Python may use is joblib.Parallel() ( joblib.delayed() ), where you will have to master a bit more to make it work: it spawns a set of full subprocesses, each (yes, each) having a full copy of the Python state plus all variables (read: a lot of time and memory is needed to spawn them) and no mutual coordination.
So decide which form is just enough for your kind of use case, and check the new Amdahl's Law re-formulation carefully (for its implications on the costs of going distributed or parallel).
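For the plain "run both loops at once" case, a minimal sketch using the standard threading module (the time.sleep calls are placeholders for the real tracer.execute() / port-listening work):
import threading
import time

def loop_one():
    while True:
        print("hi")
        time.sleep(1)    # stand-in for tracer.execute()

def loop_two():
    while True:
        print("bye")
        time.sleep(1)    # stand-in for listening on a port

# Run the first loop in a background thread and the second in the main thread.
threading.Thread(target=loop_one, daemon=True).start()
loop_two()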

Implementing an Interrupt driven Sampling Profiler

I am trying to create a sampling profiler that works on Linux. I am unsure how to send an interrupt, or how to get the program counter (PC), so I can find out the point the program is at when I interrupt it.
I have tried using signal(SIGUSR1, Foo*) and calling backtrace, but I get the stack for the thread I am in when I raise(SIGUSR1) rather than the thread the program is being run on.
I am not really sure if this is even the right way of going about it...
Any advice?
If you must write a profiler, let me suggest you use a good one (Zoom) as your model, not a bad one (gprof).
These are its characteristics.
There are two phases. First is the data-gathering phase:
When it takes a sample, it reads the whole call stack, not just the program counter.
It can take samples even when the process is blocked due to I/O, sleep, or anything else.
You can turn sampling on/off, so as to only take samples during times you care about. For example, while waiting for the user to type something, it is pointless to be sampling.
Second is the data-presentation phase.
What you have is a collection of stack samples, where a stack sample is a vector of memory addresses, which are almost all return addresses.
Each return address indicates a line of code in a function, unless it's in some system routine you don't have symbolic information for.
The key piece of useful information is residency fraction (usually expressed as a percent).
If there are a total of m stack samples, and line of code L is present anywhere on n of the samples, then its residency fraction is n/m.
This is true even if L appears more than once on a sample; that is still just one sample it appears on.
The importance of residency fraction is it directly indicates what fraction of time statement L is responsible for.
If you have taken m=1000 samples, and L appears on n=300 of them, then L's residency fraction is 300/1000 or 30%.
This means that if L could be removed, total time would decrease by 30%.
It is typically known as inclusive percent.
You can determine residency fraction not just for lines of code, but for anything else you can describe. For example, line of code L is inside some function F.
So you can determine the residency fraction for functions, as opposed to lines of code.
That would give you inclusive percent by function.
You could look at function pairs, like on what fraction of samples do you see function F calling function G.
That would give you the information that makes up call graphs.
There are all kinds of information you can get out of the stack samples.
One that is often seen is a "butterfly view", where you have a "focus" on one line L or function F, and on one side you show all the lines or functions immediately above it in the stack samples, and on the other side all the lines or functions immediately below it.
On each of these, you can show the residency fraction.
You can click around in this to try to find lines of code with high residency fraction that you can find a way to eliminate or reduce.
That's how you speed up the code.
Whatever you do for output, I think it is very important to allow the user to actually examine a small number of the stack samples themselves, randomly selected.
They convey far more insight than can be gotten from any method that condenses the information.
As important as it is to know what the profiler should do, it is also important to know what not to do, even if lots of other profilers do them:
self time. A useless number. Look at some reasonable-size programs and you will see why.
invocation counts. Of no help in finding code with high residency fraction, and you can't get it with samples alone anyway.
high-frequency sampling. It's amazing how many people, certainly profiler builders, think it is important to get lots of samples. Suppose line L is on 30% of 1000 samples. Then its true inclusive percent is 30 +/- 1.4 percent. On the other hand, if it is on 30% of 10 samples, its inclusive percent is 30 +/- 14 percent. It's still pretty big - big enough to fix. What happens in most profilers is people think they need "numerical precision", so they take lots of samples and accumulate what they call "statistics", and then throw away the samples. That's like digging up diamonds, weighing them, and throwing them away. The real value is in the samples themselves, because they tell you what the problem is.
You can send a signal to a specific thread using pthread_kill (with the pthread_t of the target thread) or tgkill (with the tid from gettid()).
The right way to create a simple profiler is to use setitimer, which can send a periodic signal (SIGALRM or SIGPROF), for example every 10 ms; or POSIX timers (timer_create, timer_settime, or timerfd), with no need for a separate thread to send the profiling signals. Check the sources of google-perftools (gperftools); they use setitimer or POSIX timers and collect profiles with backtraces.
gprof also uses setitimer for implementing CPU-time profiling (9.1 Implementation of Profiling: "Linux 2.0 ... arrangements are made for the kernel to periodically deliver a signal to the process (typically via setitimer())").
For example: result of codesearch for setitimer in gperftools's sources: https://code.google.com/p/gperftools/codesearch#search/&q=setitimer&sq=package:gperftools&type=cs
void ProfileHandler::StartTimer() {
  if (!allowed_) {
    return;
  }
  struct itimerval timer;
  timer.it_interval.tv_sec = 0;
  timer.it_interval.tv_usec = 1000000 / frequency_;
  timer.it_value = timer.it_interval;
  setitimer(timer_type_, &timer, 0);
}
You should know that setitimer has problems with fork and clone; it doesn't work well with multithreaded applications. There is an attempt at a helper wrapper: http://sam.zoy.org/writings/programming/gprof.html (a flawed one), but I don't remember whether it works correctly (setitimer usually sends a process-wide signal, not a thread-wide one). UPD: it seems that since Linux kernel 2.6.12, setitimer's signal is directed to the process as a whole (any thread may get it).
To direct the signal from timer_create to a specific thread, you need gettid() (#include <sys/syscall.h>, syscall(__NR_gettid)) and the SIGEV_THREAD_ID flag. I haven't checked how to create a periodic POSIX timer with timer_create (probably with timer_settime and a non-zero it_interval).
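To make the PC part concrete, here is a minimal sketch of a setitimer/SIGPROF sampler that reads the interrupted program counter from the ucontext (Linux/x86-64 assumed; REG_RIP is architecture-specific, and the fprintf is only for illustration since it is not async-signal-safe):
#include <csignal>
#include <cstdio>
#include <sys/time.h>
#include <ucontext.h>

static void prof_handler(int sig, siginfo_t* info, void* ctx) {
    ucontext_t* uc = static_cast<ucontext_t*>(ctx);
    // Program counter of the code that was interrupted by the signal.
    void* pc = reinterpret_cast<void*>(uc->uc_mcontext.gregs[REG_RIP]);
    // A real profiler would push pc (or a whole backtrace) into a lock-free buffer.
    fprintf(stderr, "sample at %p\n", pc);
}

int main() {
    struct sigaction sa = {};
    sa.sa_sigaction = prof_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigaction(SIGPROF, &sa, nullptr);

    struct itimerval timer = {};
    timer.it_interval.tv_usec = 10000;   // a sample every 10 ms of CPU time
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, nullptr);

    for (volatile long i = 0; i < 2000000000L; ++i) {}   // busy work to sample
    return 0;
}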
PS: there is some overview of profiling in wikibooks: http://en.wikibooks.org/wiki/Introduction_to_Software_Engineering/Tools/Profiling

How to make something lwt supported?

I am trying to understand the term lwt supported.
So assume I have a piece of code which connects to a database and writes some data: Db.write conn data. It has nothing to do with Lwt yet and each write will cost 10 sec.
Now, I would like to use lwt. Can I directly code like below?
let write_all data_list = Lwt_list.iter (Db.write conn) data_list
let _ = Lwt_main.run(write_all my_data_list)
Suppose there are 5 data items in my_data_list; will all 5 data items be written into the database sequentially or in parallel?
Also, in the Lwt manual or http://ocsigen.org/tutorial/application, they say
Using Lwt is very easy and does not cause troubles, provided you never
use blocking functions (non cooperative functions). Blocking functions
can cause the entire server to hang!
I don't quite understand how to avoid using blocking functions. For each of my own functions, can I just use Lwt.return to make it Lwt-supported?
Yes, your code is correct. The principle of being lwt supported is that everything that can potentially take time in your code should return an Lwt value.
About Lwt_list.iter, you can choose whether you want the treatment to be parallel or sequential, by choosing between iter_p and iter_s:
In iter_s f l, iter_s will call f on each element
of l, waiting for completion between each element. On the
contrary, in iter_p f l, iter_p will call f on all
elements of l, then wait for all the threads to terminate.
About the non-blocking functions, the principle of the Light Weight Threads is that they keep running until they reach a "cooperation point", i.e. a point where the thread can be safely interrupted or has nothing to do, like in a sleep.
But you have to declare you enter a "cooperation point" before actually doing the sleep. This is why the whole Unix library has been wrapped, so that when you want to do an operation that takes time (e.g. a write), a cooperation point is automatically reached.
For your own functions, if you use I/O operations from the Unix module, you should use the Lwt versions instead (Lwt_unix.sleep instead of Unix.sleep).
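A small sketch of the difference (assuming the lwt and lwt.unix packages are installed; Lwt_unix.sleep stands in for a slow Db.write):
(* Each fake "write" takes 1 second. With Lwt_list.iter_s the five writes run
   one after another (about 5 s total); with Lwt_list.iter_p they all start at
   once and we wait for them together (about 1 s total). *)
let fake_write item =
  Lwt.bind (Lwt_unix.sleep 1.0) (fun () ->
    Lwt_io.printf "wrote %d\n" item)

let () =
  Lwt_main.run (Lwt_list.iter_p fake_write [1; 2; 3; 4; 5])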

Huge difference in MPI_Wtime() after using MPI_Barrier()?

This is the part of the code.
if (rank == 0) {
    temp = 10000;
    var = new char[temp];
    MPI_Send(&temp, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    MPI_Send(var, temp, MPI_BYTE, 1, tag, MPI_COMM_WORLD);
    //MPI_Wait(&req[0], &sta[1]);
}
if (rank == 1) {
    MPI_Irecv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &req[0]);
    MPI_Wait(&req[0], &sta[0]);
    var = new char[temp];
    MPI_Irecv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &req[1]);
    MPI_Wait(&req[0], &sta[0]);
}
// I am talking about this MPI_Barrier
MPI_Barrier(MPI_COMM_WORLD);
cout << MPI_Wtime() - t1 << endl;
cout << "hello " << rank << " " << temp << endl;
MPI_Finalize();
}
1. When using MPI_Barrier - as expected, all the processes take almost the same amount of time, which is of order 0.02.
2. When not using MPI_Barrier() - the root process (the one sending the message) waits for some extra time,
and the (MPI_Wtime() - t1) varies a lot; the time taken by the root process is of order 2 seconds.
If I am not really mistaken, MPI_Barrier is only used to bring all the running processes to the same point. So why isn't the time 2 seconds when I am using MPI_Barrier() (the minimum over all processes, i.e. the root process)? Please explain.
Thanks to Wesley Bland for noticing that you are waiting twice on the same request. Here is an explanation of what actually happens.
There is something called progression of asynchronous (non-blocking) operations in MPI. That is when the actual transfer happens. Progression could happen in many different ways and at many different points within the MPI library. When you post an asynchronous operation, its progression could be deferred indefinitely, even until the point that one calls MPI_Wait, MPI_Test or some call that would result in new messages being pushed to or pulled from the transmit/receive queue. That's why it is very important to call MPI_Wait or MPI_Test as quickly as possible after the initiation of a non-blocking operation.
Open MPI supports a background progression thread that takes care to progress the operations even if the condition in the previous paragraph is not met, e.g. if MPI_Wait or MPI_Test is never called on the request handle. This has to be explicitly enabled when the library is being built. It is not enabled by default since background progression increases the latency of the operations.
What happens in your case is that you are waiting on the incorrect request the second time you call MPI_Wait in the receiver, therefore the progression of the second MPI_Irecv operation is postponed. The message is more than 40 KiB in size (10000 times 4 bytes + envelope overhead) which is above the default eager limit in Open MPI (32 KiB). Such messages are sent using the rendezvous protocol that requires both the send and the receive operations to be posted and progressed. The receive operation doesn't get progressed and hence the send operation in rank 0 blocks until at some point in time the clean-up routines that MPI_Finalize in rank 1 calls eventually progress the receive.
When you put the call to MPI_Barrier, it leads to the progression of the outstanding receive, acting almost like an implicit call to MPI_Wait. That's why the send in rank 0 completes quickly and both processes move on in time.
Note that MPI_Irecv, immediately followed by MPI_Wait is equivalent to simply calling MPI_Recv. The latter is not only simpler, but also less prone to simple typos like the one that you've made.
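Along the lines of that last point, the receiving branch could be written with plain blocking receives (a sketch reusing the question's variable names):
if (rank == 1) {
    MPI_Status status;
    // Each blocking receive completes before the next statement runs, so there
    // is no request handle to wait on (and no way to wait on the wrong one).
    MPI_Recv(&temp, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
    var = new char[temp];
    MPI_Recv(var, temp, MPI_BYTE, 0, tag, MPI_COMM_WORLD, &status);
}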
You're waiting on the same request twice for your Irecvs. The second one is the one that would take all of the time, and since it's getting skipped, rank 0 is getting way ahead.
MPI_BARRIER can be implemented such that some processes can leave the algorithm before the rest of the processes enter it. That's probably what's happening here.
In the tests that I have run, I see almost no difference in the runtimes. The main difference is that you seem to be running your code one time, whereas I looped over your code thousands of times and then took the average. My output is below:
With the barrier
[0]: 1.65071e-05
[1]: 1.66872e-05
Without the barrier
[0]: 1.35653e-05
[1]: 1.30711e-05
So I would assume any variation you are seeing is a result of your operating system more than your program.
Also, why are you using MPI_Irecv coupled with an MPI_Wait rather than just using MPI_Recv?