RMA MPI window access latency

RMA MPI window access latency - fortran

I use Fortran (with gfortran) and MPI 2 (OpenMPI). Through MPI_Win_lock and MPI_Win_unlock together with put and get operations (in non-overlapping regions of memory) all processes update a variable on my master process, which is exposed through a window.
In a test case I have noticed, however, that the unlock operations from processes that are not the master do not return until the master has finished some task, in this case sleeping for some seconds.
If instead of sleeping the master, I make it wait for some seconds using a while loop and a timer, and in the meantime I make the master lock and unlock the window, everything goes much faster:
call start_time(time_left)
do while (time_left .gt. 0)
call MPI_Win_lock(...)
call MPI_Win_unlock(...)
call update_time(time_left)
end do
However, in my real code the master performs operations just as the other processes, so it is impossible for it to continuously lock and unlock the window. Also, it seems to me rather wasteful.
My question is therefore how to decrease this latency?
Do I really need to sprinkle my code with locks and unlocks for the master? Or is there another solution? Is this compiler / implementation dependent?

The behaviour is implementation-dependent. Most MPI libraries do not perform asynchronous progression of the operations and progression only happens while the execution control is explicitly transferred to the library by calling MPI_Something. A relatively lightweight and portable hack is to call MPI_Iprobe periodically, which should enable the library to process any outstanding non-blocking send and receive operations used to implement the RMA.

Related

Boost: Single Threaded IO Service

In my app I will receive various events that I would like to process asynchronously in a prioritised order.
I could do this with a boost::asio::io_service, but my application is single threaded. I don't want to pay for locks and mallocs you might need for a multi threaded program (the performance cost really is significant to me). I'm basically looking for a boost::asio::io_service that is written for single threaded execution.
I'm pretty sure I could implement this myself using boost::coroutine, but before I do, does something like a boost::asio::io_service that is written for single threaded execution exist already? I scanned the list of boost libraries already and nothing stood out to me

Be aware that you have to pay for synchronization as soon as you use any non-blocking calls of Asio.
Even though you might use a single thread for scheduling work and processing the resulting callbacks, Asio might still have to spawn additional threads internally for executing asynchronous calls. Those will access the io_service concurrently.
Think of an async_read on a socket: As soon as the received data becomes available, the socket has to notify the io_service. This happens concurrent to your main thread, so additional synchronization is required.
For blocking I/O this problem goes away in theory, but since asynchronous I/O is sort of the whole point of the library, I would not expect to find too many optimizations for this case in the implementation.
As was pointed out in the comments already, the contention on the io_service will be very low with only one main thread, so unless profiling indicates a clear performance bottleneck there, you should not worry about it too much.

I suggest to use boost::asio together with boost::coroutine -> boost::asio::yield_context (does already the coupling between coroutine + io_service). If you detect an task with higher priority you could suspend the current task and start processing the task with higher priority.
The problem is that you have to define/call certain check-points in the code of your task in order to suspend the task if the condition (higher prio task enqueued) is given.

Does endless While loop take up CPU resources?

From what I understand, you write your Linux Daemon that listens to a request in an endless loop.
Something like..
int main() {
while(1) {
//do something...
}
}
ref: http://www.thegeekstuff.com/2012/02/c-daemon-process/
I read that sleeping a program makes it go into waiting mode so it doesn't eat up resources.
1.If I want my daemon to check for a request every 1 second, would the following be resource consuming?
int main() {
while(1) {
if (request) {
//do something...
}
sleep(1)
}
}
2.If I were to remove the sleep, does it mean the CPU consumption will go up 100%?
3.Is it possible to run an endless loop without eating resources? Say..if it does nothing but just loops itself. Or just sleep(1).
Endless loops and CPU resources is a mystery to me.

Is it possible to run an endless loop without eating resources? Say..if it does nothing but just loops itself. Or just sleep(1).
There ia a better option.
You can just use a semaphore, which remains blocked at the begining of loop and you can signal the semaphore whenever you want the loop to execute.
Note that this will not eat any resources.

The poll and select calls (mentioned by Basile Starynkevitch in a comment) or a semaphore (mentioned by Als in an answer) are the correct ways to wait for requests, depending on circumstances. On operating systems without poll or select, there should be something similar.
Neither sleep, YieldProcessor, nor sched_yield are proper ways to do this, for the following reasons.
YieldProcessor and sched_yield merely move the process to the end of the runnable queue but leave it runnable. The effect is that they allow other processes at the same or higher priority to execute, but, when those processes are done (or if there are none), then the process that called YieldProcessor or sched_yield continues to run. This causes two problems. One is that lower priority processes still will not run. Another is that this causes the processor to be always running, using energy. We would prefer the operating system to recognize when no process needs to be running and to put the processor into a low-power state.
sleep may permit this low-power state, but it plays a guessing game about how long it will be until the next request comes in, it wakes the processor repeatedly when there is no need, and it makes the process less responsive to requests, since the process will continue sleeping until the expiration of the requested time even if there is a request to be serviced.
The poll and select calls are designed for exactly this situation. They tell the operating system that this process wants to service a request coming in on one of its I/O channels but otherwise has no work to do. This allows the operating system to mark the process as not runnable and to put the processor in a low-power state if suitable.
Using a semaphore provides the same behavior, except that the signal to wake the process comes from another process raising the semaphore instead of activity arising in an I/O channel. Semaphores are suitable when the signal to do some work arrives in this way; simply use whichever of poll or a semaphore is more appropriate for your situation.
The criticism that poll, select, or a semaphore causes a kernel-mode call is irrelevant, because the other methods also cause kernel-mode calls. A process cannot sleep on its own; it has to call the operating system to request it. Similarly, YieldProcessor and sched_yield make requests to the operating system.

The short answer is yes -- removing sleep gives 100% CPU -- but the answer does depend on some additional details. It consumes all CPU it can get, unless...
The loop body is trivial, and optimised away.
The loop contains a blocking operation (like a file or network operation). The link you provide suggests to avoid this, but it is often a good idea to block until something relevant happens.
EDIT : For your scenario, I support the suggestion made by #Als.
EDIT 2: I expect this answer has received a -1 because I claim blocking operations can actually be a good idea. [If you -1, you should leave a motivation in a comment so that we all may learn something.]
Current popular thinking is that non-block (event-based) IO is good and blocking is bad. This view is oversimplified because it assumes all software that performs IO can improve throughput by using non-blocking operations.
What? Am I really suggesting that using non-blocking IO can actually reduce throughput? Yes it can. When a process serves a single activity it is actually better to use blocking IO because blocking IO only burns resources that have already been paid for in the existence of the process.
In contrast, non-blocking IO can carry a greater fixed overhead than simple blocking IO. If the process isn't able to supply additional IO that can be interleaved, then there is nothing gained by paying for non-blocking setup. (In practice, the greatest cost of innapropriate non-blocking IO is simply in the added code complexity. Beyond that, this topic is largely a thought exercise.)
Under blocking IO we rely upon the operating system to schedule those processes that can make progress. That's what the OS is designed to do.
Under non-blocking IO we have greater setup costs but can share the resources of the process and its threads between interleaved work. The non-blocking IO is therefor ideal for any process that serves multiple independent activities, such as a web server. The throughput gained is vastly superior to the fixed cost overheads of non-blocking IO.

Does msleep() give cycles to other threads?

In a multi threaded app, is
while (result->Status == Result::InProgress) Sleep(50);
//process results
better than
while (result->Status == Result::InProgress);
//process results
?
By that, I'm asking will the first method be polite to other threads while waiting for results rather than spinning constantly? The operation I'm waiting for usually takes about 1-2 seconds and is on a different thread.

I would suggest using semaphores for such case instead of polling. If you prefer active waiting, the sleep is much better solution than evaluating the loop condition constantly.

It's better, but not by much.
As long as result->Status is not volatile, the compiler is allowed to reduce
while(result->Status == Result::InProgress);
to
if(result->Status == Result::InProgress) for(;;) ;
as the condition does not change inside the loop.
Calling the external (and hence implicitly volatile) function Sleep changes this, because this may modify the result structure, unless the compiler is aware that Sleep never modifies data. Thus, depending on the compiler, the second implementation is a lot less likely to go into an endless loop.
There is also no guarantee that accesses to result->Status will be atomic. For specific memory layouts and processor architectures, reading and writing this variable may consist of multiple steps, which means that the scheduler may decide to step in in the middle.
As all you are communicating at this point is a simple yes/no, and the receiving thread should also wait on a negative reply, the best way is to use the appropriate thread synchronisation primitive provided by your OS that achieves this effect. This has the advantage that your thread is woken up immediately when the condition changes, and that it uses no CPU in the meantime as the OS is aware what your thread is waiting for.
On Windows, use CreateEvent and co. to communicate using an event object; on Unix, use a pthread_cond_t object.

Yes, sleep and variants give up the processor. Other threads can take over. But there are better ways to wait on other threads.
Don't use the empty loop.

That depends on your OS scheduling policy too.For example Linux has CFS schedular by default and with that it will fairly distribute the processor to all the tasks. But if you make this thread as real time thread with FIFO policy then code without sleep will never relenquish the processor untill and unless a higher priority thread comes, same priority or lower will never get scheduled untill you break from the loop. if you apply SCHED_RR then processes of same priority and higher will get scheduled but not lower.

Scheduling of Process(s) waiting for Semaphore

It is always said when the count of a semaphore is 0, the process requesting the semaphore are blocked and added to a wait queue.
When some process releases the semaphore, and count increases from 0->1, a blocking process is activated. This can be any process, randomly picked from the blocked processes.
Now my question is:
If they are added to a queue, why is the activation of blocking processes NOT in FIFO order? I think it would be easy to pick next process from the queue rather than picking up a process at random and granting it the semaphore. If there is some idea behind this random logic, please explain. Also, how does the kernel select a process at random from queue? getting a random process that too from queue is something complex as far as a queue data structure is concerned.
tags: various OSes as each have a kernel usually written in C++ and mutex shares similar concept

A FIFO is the simplest data structure for the waiting list in a system
that doesn't support priorities, but it's not the absolute answer
otherwise. Depending on the scheduling algorithm chosen, different
threads might have different absolute priorities, or some sort of
decaying priority might be in effect, in which case, the OS might choose
the thread which has had the least CPU time in some preceding interval.
Since such strategies are widely used (particularly the latter), the
usual rule is to consider that you don't know (although with absolute
priorities, it will be one of the threads with the highest priority).

When a process is scheduled "at random", it's not that a process is randomly chosen; it's that the selection process is not predictable.
The algorithm used by Windows kernels is that there is a queue of threads (Windows schedules "threads", not "processes") waiting on a semaphore. When the semaphore is released, the kernel schedules the next thread waiting in the queue. However, scheduling the thread does not immediately make that thread start executing; it merely makes the thread able to execute by putting it in the queue of threads waiting to run. The thread will not actually run until a CPU has no threads of higher priority to execute.
While the thread is waiting in the scheduling queue, another thread that is actually executing may wait on the same semaphore. In a traditional queue system, that new thread would have to stop executing and go to the end of the queue waiting in line for that semaphore.
In recent Windows kernels, however, the new thread does not have to stop and wait for that semaphore. If the thread that has been assigned that semaphore is still sitting in the run queue, the semaphore may be reassigned to the old thread, causing the old thread to go back to waiting on the semaphore again.
The advantage of this is that the thread that was about to have to wait in the queue for the semaphore and then wait in the queue to run will not have to wait at all. The disadvantage is that you cannot predict which thread will actually get the semaphore next, and it's not fair so the thread waiting on the semaphore could potentially starve.

It is not that it CAN'T be FIFO; in fact, I'd bet many implementations ARE, for just the reasons that you state. The spec isn't that the process is chosen at random; it is that it isn't specified, so your program shouldn't rely on it being chosen in any particular way. (It COULD be chosen at random; just because it isn't the fastest approach doesn't mean it can't be done.)

All of the other answers here are great descriptions of the basic problem - especially around thread priorities and ready queues. Another thing to consider however is IO. I'm only talking about Windows here, since it is the only platform I know with any authority, but other kernels are likely to have similar issues.
On Windows, when an IO completes, something called a kernel-mode APC (Asynchronous Procedure Call) is queued against the thread which initiated the IO in order to complete it. If the thread happens to be waiting on a scheduler object (such as the semaphore in your example) then the thread is removed from the wait queue for that object which causes the (internal kernel mode) wait to complete with (something like) STATUS_ALERTED. Now, since these kernel-mode APCs are an implementation detail, and you can't see them from user mode, the kernel implementation of WaitForMultipleObjects restarts the wait at that point which causes your thread to get pushed to the back of the queue. From a kernel mode perspective, the queue is still in FIFO order, since the first caller of the underlying wait API is still at the head of the queue, however from your point of view, way up in user mode, you just got pushed to the back of the queue due to something you didn't see and quite possibly had no control over. This makes the queue order appear random from user mode. The implementation is still a simple FIFO, but because of IO it doesn't look like one from a higher level of abstraction.
I'm guessing a bit more here, but I would have thought that unix-like OSes have similar constraints around signal delivery and places where the kernel needs to hijack a process to run in its context.
Now this doesn't always happen, but the documentation has to be conservative and unless the order is explicitly guaranteed to be FIFO (which as described above - for windows at least - it can't be) then the ordering is described in the documentation as being "random" or "undocumented" or something because a random process controls it. It also gives the OS vendors lattitude to change the ordering at some later time.

Process scheduling algorithms are very specific to system functionality and operating system design. It will be hard to give a good answer to this question. If I am on a general PC, I want something with good throughput and average wait/response time. If I am on a system where I know the priority of all my jobs and know I absolutely want all my high priority jobs to run first (and don't care about preemption/starvation), then I want a Priority algorithm.
As far as a random selection goes, the motivation could be for various reasons. One being an attempt at good throughput, etc. as mentioned above above. However, it would be non-deterministic (hypothetically) and impossible to prove. This property could be an exploitation of probability (random samples, etc.), but, again, the proofs could only be based on empirical data on whether this would really work.

What happens when pthreads wait in mutex_lock/cond_wait?

I have a program that should get the maximum out of my cpu.
It is multithreaded via pthreads that do their job well apart from the fact that they "only" get my cores to about 60% load which is not enough in my opinion.
I am searching for the reason and am asking myself (and hereby you) if the blocking functions mutex_lock/cond_wait are candidates?
What happens when a thread cannot run on in such a function?
Does pthread switch to another thread it handles or
does the thread yield its time to the system and if the latter is the case, can I change this behavior?
Regards,
Nobody
More Information
The setting is one mainthread that fills the taskpool and countless workers that fetch jobs from there and wait on a conditional that is signaled via broadcast when a serialized calculation is done. They go on with the values from this calculation until they are done, deliver their mail and fetch the next job...

On a typical modern pthreads implementation, each thread is managed by the kernel not unlike a separate process. Any blocking call like pthread_mutex_lock or pthread_cond_wait (but also, say, read) will yield its time to the system. The system will then find another eligible thread to schedule, whether in your process or another process, and run it.
If your program is only taking 60% of the CPU, it is more likely blocked on I/O than on pthread operations, unless you have done something way too granular with your pthread operations.

If a thread is waiting on a mutex/condition, it doesn't use resources (well, uses just a tiny amount). Whenever the thread enters waiting state, control switches to other threads. When the mutex is released (or condition variable signalled), the thread wakes up and may acquire the mutex (if no other thread grabs it first), and continue to run. If however some other thread acquires the mutex (this can happen if several threads are waiting for it), the thread returns to sleeping state.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js