Fastest method to wait under thread contention

I'm using pthreads on Linux. I have a circular buffer to pass data from one thread to another. Maybe the circular buffer is not the best structure to use here, but changing that would not make my problem go away, so we'll just refer to it as a queue.
Whenever my queue is either full or empty, pop/push operations return NULL. This is problematic since my threads fire periodically. Waiting for another thread loop would take too long.
I've tried using semaphores (sem_post, sem_wait), but unlocking under contention takes up to 25 ms, which is about as long as one pass of my loop. I've tried waiting with pthread_cond_t, but unlocking takes between 10 and 15 ms.
Is there a faster mechanism I could use to wait for data?
EDIT:
OK, I used condition variables. I'm on an embedded device, so adding "more cores or CPU power" is not an option. This made me realise I had all sorts of thread priorities set all over the place, so I'll sort this out before going further.

You should use condition variables. The only faster ways are platform-specific, and they're only negligibly faster.
You're seeing what you think is poor performance simply because your threads are being de-scheduled. You're seeing long "delays" when your thread is near the end of its timeslice and the scheduler allows the unblocked thread to pre-empt the running thread. If you have more cores than threads or set your thread to a higher priority, you won't see these delays.
But these delays are actually a good thing, and you shouldn't be concerned about them. Other threads just get a chance to run too.
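As a rough illustration of the condition-variable approach recommended above, here is a minimal sketch of a bounded queue whose push and pop block instead of returning NULL. It uses the C++ standard library wrappers rather than raw pthread calls; the names BlockingQueue and run_queue_demo are hypothetical and not from the question.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical bounded queue: push() blocks while full, pop() blocks while empty.
template <typename T>
class BlockingQueue {
public:
    explicit BlockingQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate form re-checks the condition after spurious wakeups.
        not_full_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push_back(std::move(value));
        lock.unlock();  // release first so the woken thread can take the mutex
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop_front();
        lock.unlock();
        not_full_.notify_one();
        return value;
    }

private:
    std::size_t capacity_;
    std::deque<T> items_;
    std::mutex mutex_;
    std::condition_variable not_full_;
    std::condition_variable not_empty_;
};

// Demo: one producer sends 1..count through a 4-slot queue.
// Returns the sum of everything the consumer received.
long long run_queue_demo(int count) {
    BlockingQueue<int> queue(4);
    std::thread producer([&] {
        for (int i = 1; i <= count; ++i) queue.push(i);
    });
    long long sum = 0;
    for (int i = 1; i <= count; ++i) {
        int value = queue.pop();
        assert(value == i);  // single producer, single consumer: FIFO order
        sum += value;
    }
    producer.join();
    return sum;
}
```

Calling run_queue_demo(1000) from main exercises both the full and the empty case, since the queue only holds four items at a time.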

Related

Pros and Cons of Busy Waiting on Modern Processors

I'm using busy waiting to synchronize access to critical regions, like this:
while (p1_flag != T_ID);
/* begin: critical section */
for (int i=0; i<N; i++) {
...
}
/* end: critical section */
p1_flag++;
p1_flag is a global volatile variable that is updated by another concurrent thread. As a matter of fact, I have two critical sections inside a loop and two threads (both executing the same loop) that alternate execution of these critical regions. For instance, if the critical regions are named A and B:
Thread 1    Thread 2
A
B           A
A           B
B           A
A           B
B           A
            B
The parallel code executes faster than the serial one, but not by as much as I expected. Profiling the parallel program using VTune Amplifier, I noticed that a large amount of time is being spent in the synchronization directives, that is, the while(...) and the flag update. I'm not sure why I'm seeing such a large overhead on these "instructions", since region A is exactly the same as region B. My best guess is that this is due to cache coherence latency: I'm using an Intel i7 Ivy Bridge machine, and this micro-architecture resolves cache coherence at the L3. VTune also reports that the while (...) instruction is consuming all front-end bandwidth, but why?
To make the question(s) clear: Why are while(...) and update flag instructions taking so much execution time? Why would the while(...) instruction saturate the front-end bandwidth?

The overhead you're paying may very well be due to passing the sync variable back and forth between the core caches.
Cache coherency dictates that when you modify the cache line (p1_flag++) you need to have ownership on it. This means it would invalidate any copy existing in other cores, waiting for it to write back any changes made by that other core to a shared cache level. It would then provide the line to the requesting core in M state and perform the modification.
However, the other core would by then be constantly reading this line, and each read would snoop the first core and ask whether it has a copy of that line. Since the first core is holding an M copy of that line, it would get written back to the shared cache and the core would lose ownership.
Now this depends on the actual implementation in HW, but if the line was snooped before the change was actually made, the first core would have to attempt to get ownership of it again. In some cases I'd imagine this might take several attempts.
If you're set on using busy wait, you should at least use some pause inside it: the _mm_pause intrinsic, or just __asm("pause"). This would both give the other thread a chance to get the lock and release you from waiting, and reduce the CPU effort spent busy waiting (an out-of-order CPU would fill all pipelines with parallel instances of this busy wait, consuming lots of power; a pause would serialize it so only a single iteration can run at any given time, which consumes much less and has the same effect).
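A minimal sketch of a spin-wait with a pause, under a few assumptions: std::atomic replaces the volatile flag from the question, _mm_pause is used only when compiling for x86, and the loop falls back to yielding after a bounded number of spins so that it still makes progress on a machine with fewer cores than threads. The names cpu_relax and run_pingpong_demo are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define HAVE_X86_PAUSE 1
#endif

// Hint to the CPU that this is a spin-wait loop.
inline void cpu_relax() {
#ifdef HAVE_X86_PAUSE
    _mm_pause();
#else
    std::this_thread::yield();  // assumption: no pause intrinsic on this target
#endif
}

// Two threads alternate: thread `id` may enter its critical section
// only while turn % 2 == id. Returns the final value of the shared counter.
int run_pingpong_demo(int rounds) {
    assert(rounds >= 0);
    std::atomic<int> turn{0};
    int shared_counter = 0;  // protected by the turn protocol, not by a mutex

    auto worker = [&](int id) {
        for (int r = 0; r < rounds; ++r) {
            int spins = 0;
            while (turn.load(std::memory_order_acquire) % 2 != id) {
                if (++spins < 1000) {
                    cpu_relax();
                } else {
                    std::this_thread::yield();  // stop burning the core
                }
            }
            ++shared_counter;  // critical section
            // The release pairs with the other thread's acquire load.
            turn.fetch_add(1, std::memory_order_release);
        }
    };

    std::thread t0(worker, 0);
    std::thread t1(worker, 1);
    t0.join();
    t1.join();
    return shared_counter;
}
```

The spin limit of 1000 is an arbitrary illustrative value; a suitable limit depends on how long the critical sections are.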

A busy-wait is almost never a good idea in multithreaded applications.
When you busy-wait, thread scheduling algorithms will have no way of knowing that your loop is waiting on another thread, so they must allocate time as if your thread is doing useful work. And it does take processor time to check that variable over, and over, and over, and over, and over, and over...until it is finally "unlocked" by the other thread. In the meantime, your other thread will be preempted by your busy-waiting thread again and again, for no purpose at all.
This is an even worse problem if the scheduler is a priority-based one, and the busy-waiting thread is at a higher priority. In this situation, the lower-priority thread will NEVER preempt the higher-priority thread, thus you have a deadlock situation.
You should ALWAYS use semaphores or mutex objects or messaging to synchronize threads. I've never seen a situation where a busy-wait was the right solution.
When you use a semaphore or mutex, then the scheduler knows never to schedule that thread until the semaphore or mutex is released. Thus your thread will never be taking time away from threads that do real work.

Does msleep() give cycles to other threads?

In a multi threaded app, is
while (result->Status == Result::InProgress) Sleep(50);
//process results
better than
while (result->Status == Result::InProgress);
//process results
?
By that, I'm asking whether the first method will be polite to other threads while waiting for results, rather than spinning constantly. The operation I'm waiting for usually takes about 1-2 seconds and is on a different thread.

I would suggest using semaphores in such a case instead of polling. If you prefer active waiting, sleeping is a much better solution than evaluating the loop condition constantly.

It's better, but not by much.
As long as result->Status is not volatile, the compiler is allowed to reduce
while(result->Status == Result::InProgress);
to
if(result->Status == Result::InProgress) for(;;) ;
as the condition does not change inside the loop.
Calling an external function such as Sleep changes this: the compiler cannot see into it and must assume it may modify the result structure, unless it knows that Sleep never modifies data. Thus, depending on the compiler, the version that calls Sleep is a lot less likely to go into an endless loop.
There is also no guarantee that accesses to result->Status will be atomic. For specific memory layouts and processor architectures, reading and writing this variable may consist of multiple steps, which means that the scheduler may decide to step in in the middle.
As all you are communicating at this point is a simple yes/no, and the receiving thread should also wait on a negative reply, the best way is to use the appropriate thread synchronisation primitive provided by your OS that achieves this effect. This has the advantage that your thread is woken up immediately when the condition changes, and that it uses no CPU in the meantime as the OS is aware what your thread is waiting for.
On Windows, use CreateEvent and co. to communicate using an event object; on Unix, use a pthread_cond_t object.
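As a portable sketch of that idea, the following uses std::mutex and std::condition_variable (which typically wrap pthread_cond_t on Unix). The Result type here is a hypothetical stand-in for the one in the question, and the 20 ms sleep only simulates the worker's job.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the question's Result type.
class Result {
public:
    enum Status { InProgress, Done };

    // Called once by the worker thread when the job is finished.
    void complete(int value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(status_ == InProgress);  // completing twice would be a bug
            value_ = value;
            status_ = Done;
        }
        done_.notify_all();
    }

    // Blocks without using CPU until complete() has been called.
    int wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return status_ == Done; });
        return value_;
    }

private:
    std::mutex mutex_;
    std::condition_variable done_;
    Status status_ = InProgress;
    int value_ = 0;
};

// Demo: the worker finishes after a short delay; the caller sleeps until then.
int run_result_demo() {
    Result result;
    std::thread worker([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        result.complete(42);
    });
    int value = result.wait();  // replaces the polling loop
    worker.join();
    return value;
}
```

Because the status is only read and written under the mutex, this also sidesteps the volatile and atomicity concerns raised above.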

Yes, sleep and variants give up the processor. Other threads can take over. But there are better ways to wait on other threads.
Don't use the empty loop.

That depends on your OS scheduling policy too. For example, Linux uses the CFS scheduler by default, which distributes the processor fairly among all tasks. But if you make this thread a real-time thread with the FIFO policy, then the code without sleep will never relinquish the processor unless a higher-priority thread comes along; threads of the same or lower priority will never get scheduled until you break out of the loop. If you apply SCHED_RR, then threads of the same or higher priority will get scheduled, but not lower-priority ones.

fastest way to wake up a thread without using condition variable

I am trying to speed up a piece of code by having background threads already setup to solve one specific task. When it is time to solve my task I would like to wake up these threads, do the job and block them again waiting for the next task. The task is always the same.
I tried using condition variables (and mutex that need to go with them), but I ended up slowing my code down instead of speeding it up; mostly it happened because the calls to all needed functions are very expensive (pthread_cond_wait/pthread_cond_signal/pthread_mutex_lock/pthread_mutex_unlock).
There is no point in using a thread pool (which I don't have anyway) because it is too generic a construct; here I want to address only my specific task. Depending on the implementation, I would also pay a performance penalty for the queue.
Do you have any suggestions for a quick wake-up without using a mutex or condition variable?
I was thinking of setting up the threads like timers that read an atomic variable: if the variable is set to 1, the threads do the job; if it is set to 0, they go to sleep for a few microseconds (I would start with a microsecond sleep, since I would like to avoid spinlocks that might be too expensive for the CPU). What do you think about it? Any suggestions are much appreciated.
I am using Linux, gcc, C and C++.

These functions should be fast. If they are taking a large fraction of your time, it is quite possible that you are trying to switch threads too often.
Try buffering up a work queue, and send the signal once a significant amount of work has accumulated.
If this is impossible due to dependencies between the tasks, then your application is not amenable to multithreading at all.
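One way to picture the batching advice is a channel where the producer publishes a whole batch under a single lock acquisition and sends a single notification, and the consumer takes everything pending in one swap. This is only a sketch; BatchChannel and run_batch_demo are hypothetical names, and the batch sizes are arbitrary.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

class BatchChannel {
public:
    // Publish a whole batch with one lock acquisition and one notification.
    void publish(std::vector<int> batch) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!closed_);  // publishing after close() would be a bug
            pending_.insert(pending_.end(), batch.begin(), batch.end());
        }
        ready_.notify_one();
    }

    // Tell the consumer that no more work will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_one();
    }

    // Blocks until work is pending or the channel is closed, then takes
    // everything at once. Returns false once closed and fully drained.
    bool take_all(std::vector<int>& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !pending_.empty() || closed_; });
        out.clear();
        out.swap(pending_);
        return !out.empty();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::vector<int> pending_;
    bool closed_ = false;
};

// Demo: the producer sends `batches` batches of `batch_size` ones;
// the consumer returns how many items it processed in total.
long long run_batch_demo(int batches, int batch_size) {
    BatchChannel channel;
    std::thread producer([&] {
        for (int b = 0; b < batches; ++b) {
            std::vector<int> batch(batch_size, 1);
            channel.publish(std::move(batch));
        }
        channel.close();
    });

    long long processed = 0;
    std::vector<int> work;
    while (channel.take_all(work)) {
        for (int item : work) processed += item;  // lock is not held here
    }
    producer.join();
    return processed;
}
```

The cost of the lock and the wake-up is then paid once per batch rather than once per item.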

In order to gain performance in a multithreaded application, spawn as many threads as there are CPUs, not a separate thread for each task. Otherwise you end up with a lot of overhead from context switching.
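A small sketch of sizing the worker set to the machine, assuming the work can be split into independent chunks. std::thread::hardware_concurrency() is only a hint and may return 0, so the sketch falls back to one worker; worker_count and parallel_sum are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// One worker per hardware thread; the call may return 0 if unknown.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Sums `data` by giving each worker one contiguous chunk.
long long parallel_sum(const std::vector<int>& data) {
    const unsigned workers = worker_count();
    assert(workers > 0);
    const std::size_t chunk = (data.size() + workers - 1) / workers;

    std::vector<long long> partial(workers, 0);  // one slot per worker
    std::vector<std::thread> threads;
    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t begin = std::min(data.size(), w * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            partial[w] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& t : threads) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Each worker writes only to its own slot in partial, so no locking is needed until the final reduction after the joins.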
You may also consider making your algorithm more linear (i.e. by using non-blocking calls).

C++ Multi-Thread Execution Speed Slow-Down

I am writing a multi-threaded C++ application. When thread A has a very computationally expensive operation to perform, it slows down threads B, C, and D. How can I prevent this?

On Windows you can use Sleep(0) to release the remainder of your timeslice for other threads that are waiting.

Hard to tell without seeing code so I can only give you the advice to lower Thread A's priority. This can be done using the SetThreadPriority function.

Note that you can set the thread priorities (SetThreadPriority).
Also, I advise having the background worker pick its work from a queue. The queue can then be used as a way to throttle the calculations:
you can configure how many 'tasks' are taken from the queue for processing in one swoop
you can lock the queue (use semaphores + condition event) so you can temporarily prevent new tasks from being picked up.
you can now distribute the load across more workers (say if thread B, C, D are temporarily idle, they can start to lift the work off thread A; very useful on a Quad-core + desktop)
$0.02

There are a couple of ways:
As RedX suggested, add Sleep(0) in thread A's inner loop to have it yield time more frequently. This is the cheap and lazy solution.
Better would be to change the thread priority. When you call CreateThread, pass CREATE_SUSPENDED so that the thread does not start immediately. Then call SetThreadPriority to set the thread to a lower priority, followed by ResumeThread.

You might also want to look at having your compute-bound thread yield the processor to other threads. See this post for various ways to do this.

Significance of Sleep(0)

I have seen Sleep(0) in parts of my code that contain infinite or long-running while loops. I was told that it makes the time slice available to other waiting processes. Is this true? Does Sleep(0) have any significance?

According to MSDN's documentation for Sleep:
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run. If there are no other threads ready to run, the function returns immediately, and the thread continues execution.
The important thing to realize is that yes, this gives other threads a chance to run, but if there are none ready to run, then your thread continues -- leaving the CPU usage at 100% since something will always be running. If your while loop is just spinning while waiting for some condition, you might want to consider using a synchronization primitive like an event to sleep until the condition is satisfied or sleep for a small amount of time to prevent maxing out the CPU.

Yes, it gives other threads the chance to run.
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run. If there are no other threads ready to run, the function returns immediately, and the thread continues execution.
Source

I'm afraid I can't improve on the MSDN docs here:
A value of zero causes the thread to relinquish the remainder of its time slice to any other thread that is ready to run. If there are no other threads ready to run, the function returns immediately, and the thread continues execution.
Windows XP/2000: A value of zero causes the thread to relinquish the remainder of its time slice to any other thread of equal priority that is ready to run. If there are no other threads of equal priority ready to run, the function returns immediately, and the thread continues execution. This behavior changed starting with Windows Server 2003.
Please also note (via upvote) the two useful answers regarding efficiency problems here.

Be careful with Sleep(0): if the execution time of one loop iteration is short, calling it on every iteration can slow the loop down significantly. If you must use it, you can call Sleep(0) only once per 100 iterations, for example.
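A sketch of the "once per N iterations" idea, using std::this_thread::yield() as a portable and only roughly analogous substitute for Sleep(0). The function name, the LoopStats struct and the dummy workload are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

struct LoopStats {
    std::uint64_t iterations;
    std::uint64_t yields;
};

// Runs a short loop body `iterations` times and offers the rest of the
// time slice to other threads only once every `yield_every` iterations.
LoopStats run_with_periodic_yield(std::uint64_t iterations,
                                  std::uint64_t yield_every) {
    assert(yield_every > 0);
    LoopStats stats{0, 0};
    volatile std::uint64_t sink = 0;  // stands in for the real, short work item
    for (std::uint64_t i = 1; i <= iterations; ++i) {
        sink = sink + i;
        ++stats.iterations;
        if (i % yield_every == 0) {
            std::this_thread::yield();  // amortized over yield_every iterations
            ++stats.yields;
        }
    }
    return stats;
}
```

With 1000 iterations and yield_every set to 100, the loop yields 10 times instead of 1000, which keeps the scheduler call from dominating a short loop body.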

Sleep(0): at that instruction, the system scheduler will check for any other runnable threads and possibly give them a chance to use the system resources, depending on thread priorities.
On Linux there's a specific call for this: sched_yield()
From the man pages:
sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run.
If the calling thread is the only thread in the highest priority list at that time, it will continue to run after a call to sched_yield().
and also:
Strategic calls to sched_yield() can improve performance by giving other threads or processes a chance to run when (heavily) contended resources (e.g., mutexes) have been released by the caller. Avoid calling sched_yield() unnecessarily or inappropriately (e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.

In one app, the main thread looked for things to do, then launched the "work" via a new thread. In this case, you should call sched_yield() (or sleep(0)) in the main thread so that you do not make the "looking" for work more important than the "work". I prefer sleep(0), but sometimes this is excessive (because you are sleeping a fraction of a second).

Sleep(0) is a powerful tool and it can improve performance in certain cases. Using it in a fast loop might be considered in special cases. When a set of threads must be as responsive as possible, they should all call Sleep(0) frequently. But it is crucial to find a measure for what responsive means in the context of the code.
I've given some details in https://stackoverflow.com/a/11456112/1504523

I am using pthreads, and for some reason on my Mac the compiler does not find a declaration for pthread_yield(). But it seems that sleep(0) is the same thing.
