notify_one performance impact

notify_one performance impact - c++

I was reading a bit about std::condition_variable and more particularly on how to notify a waiting thread using std::condition_variable::notify_one.
I came across a few questions I will be happy to get answers on:
What exactly happens when a thread calls notify_one (OS-wise)? I guess this is OS-specific, so for the sake of argument - I'm working in Windows.
What happens if a thread calls notify_one when there is no waiting thread? Does this call have any performance impact (CPU-cycles, power etc)?
Thanks

On windows, std::condition_variable is likely to be implemented in terms of native Windows condition variables.
See: https://msdn.microsoft.com/en-us/library/windows/desktop/ms682052(v=vs.85).aspx
On unix-like systems they're normally implemented in terms of a pthreads semaphore/mutex pair.
The entire operation should take place in user space so you don't pay to switch to kernel mode, but you will be working with two synchronisation primitives under the covers. This will mean that memory fences will be issued, so there is always some price to pay.
To cut a long story short, calling notify_one when you should, i.e. after changing the state of the condition and releasing the lock, it's a reasonably cheap operation. Calling notify_one in a tight loop for no good reason is probably not going to be a good idea.
What happens if a thread calls notify_one when there is no waiting thread?
Take a mutex, check whether there are threads waiting, release the mutex. end.
Does this call have any performance impact (CPU-cycles, power etc)?
Yes of course, it consumes a few cycles and requires that the CPU is operating. Doing it once in a while won't hurt. Doing it continuously in a tight loop will consume power.
I guess my question for you is, "what's the use case"? If you're adding a million items a second to a producer/consumer queue then you're going to spend a lot of time and energy notifying nonexistent consumers. If you're adding 10 a second, time spent in notify_one probably won't even show up on any performance trace.

These questions are extremely implementation-specific. Just saying you're on Windows is not enough; each standard library may have different implementations, and a debug version could have a different implementation than a release version.
The semantic effect of notify_one when no thread is waiting is a no-op. In implementation terms, at the very least the thread has to check an atomic variable to determine if any threads are waiting. So there is a bit of overhead.
The Microsoft standard library's condition_variable is implemented in terms of the concurrency runtime's condition variable which, starting from Windows Vista, is implemented in terms of the WinAPI RTL_CONDITION_VARIABLE. The implementation of that is not accessible. However, there's a reasonable chance that its implementation is based on this Microsoft research paper:
http://research.microsoft.com/pubs/64242/implementingcvs.pdf

Related

C++20 mutex with atomic wait [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 months ago.
Improve this question
From C++20 std::atomics have wait and notify operations. With is_always_lock_free we can ensure that the implementation is lock free. With these bricks building a lock-free mutex is not so difficult. In trivial cases locking would be a compare exchange operation, or a wait if the mutex is locked. The big question here is if it is worth it or not. If I can create such an implementation most probably the STL version is much better and faster. However I still remember how surprised I was when I saw how QMutex outperformed std::mutex QMutex vs std::mutex in 2016. So what do you think, should I experiment with such an implementation or the current implementation of std::mutex is matured enough to be optimized far beyond these tricks?
UPDATE
My wording wasn't the best, I mean that the implementation could be lock free on the happy path (locking from not locked state). Of course we should be blocked and re-scheduled if we need to wait to acquire the lock. Most probably atomic::wait is not implemented by a simple spinlock on most of the platforms (let's ignore the corner cases now), so basically it achieves the very same thing mutex::lock does. So basically if I implement such a class it would do exactly the same std::mutex does (again on most of the popular platforms). It means that STL could use the same tricks in their mutex implementations on the platforms that support these tricks. Like this spinlock, but I would use atomic wait instead of spinning. Should I trust my STL implementation that they did so?

A lock-free mutex is a contradiction.
You can build a lock out of lock-free building-blocks, and in fact that's the normal thing to do whether it's hand-written in asm or with std::atomic.
But the overall locking algorithm is by definition not lock-free. (https://en.wikipedia.org/wiki/Non-blocking_algorithm). The entire point is to stop other threads from making forward progress while one thread is in the critical section, even if it unfortunately sleeps while it's there.
I mean that the implementation could be lock free on the happy path (locking from not locked state)
std::mutex::lock() is that way too: it doesn't block if it doesn't have to! It might need to make a system call like futex(FUTEX_WAIT_PRIVATE) if there's a thread waiting for the lock. But so does an implementation that used std::notify.
Perhaps you haven't understood what "lock-free" means: it never blocks, regardless of what other threads are doing. That is all. It doesn't mean "faster in the simple/easy case". For a complex algorithm (e.g. a queue), it's often faster to just give up and block if the circular buffer is full, rather than adding overhead to the simple case to allow other threads to "assist" or cancel a stuck partial operation. (Lock-free Progress Guarantees in a circular buffer queue)
There is no inherent advantage to rolling your own std::mutex out of std::atomic. The compiler-generated machine code has to do approximately the same things either way, and I'd expect the fast path to be about the same. The only difference would be in choice of what to do in the already-locked case. (But maybe tricky to design a way to avoid calling notify iff there were no waiters, something that actual std::mutex manages in the glibc / pthreads implementation Linux.)
(I'm assuming the overhead of the library function call is negligible compared to the cost of an atomic RMW to take the mutex. Having that inline into your code is a pretty minor advantage.)
A mutex implementation can be tuned for certain use-cases, in terms of how long it spin-waits before sleeping (using an OS-assisted mechanism like futex to enable other threads to wake it when releasing the lock), and in exponential backoff for the spin-wait portion.
If std::mutex doesn't perform well for your application on the hardware you care about, it's worth considering an alternative. Although IDK exactly how you'd go about measuring whether it worked well or not. Perhaps if you could figure out that it was deciding to sleep
And yes, you could consider rolling your own with std::atomic now that there's a portable mechanism to hopefully expose a way to fall-back to OS-assisted sleep/wake mechanisms like futex. You'd still want to manually use system-specific things like x86 _mm_pause() inside a spin-wait loop, though, since I don't think C++ has anything equivalent to Rust's std::hint::spin_loop() that an implementation can use to expose things like the x86 pause instruction, intended for use in the body of a spin-loop. (See Locks around memory manipulation via inline assembly re: such considerations, and spinning read-only instead of spamming atomic RMW attempts. And for a look at the necessary parts of a spinlock in x86 assembly language, which are the same whether you get a C++ compiler to generate that machine code for you or not.)
See also https://rigtorp.se/spinlock/ re: implementing a mutex in C++ with std::atomic.
Linux / libstdc++ behaviour of notify/wait
I tested what system calls std::wait makes when it has to wait for a long time, on Arch Linux (glibc 2.33).
std::mutex lock no-contention fast path, and unlock with no waiters: zero system calls, purely user-space atomic operations. Notably is able to detect that there are no waiters when unlocking, so it doesn't make a FUTEX_WAKE system call (which otherwise would maybe a hundred times more than taking and releasing an uncontended mutex that was still hot in this cores L1d cache.)
std::mutex lock() on already locked: only a futex(0x55be9461b0c0, FUTEX_WAIT_PRIVATE, 2, NULL) system call. Possibly some spinning in user-space before that; I didn't single-step into it with GDB, but if so probably with a pause instruction.
std::mutex unlock() with a waiter: uses futex(0x55ef4af060c0, FUTEX_WAKE_PRIVATE, 1) = 1. (After an atomic RMW, IIRC; not sure why it doesn't just use a release store.)
std::notify_one: always a futex(address, FUTEX_WAKE, 1) even if there are no waiters, so it's up to you to avoid it if there are no waiters when you unlock a lock.
std::wait: spinning a few times in user-space, including 4x sched_yield() calls before a futex(addr, FUTEX_WAIT, old_val, NULL).
Note the use of FUTEX_WAIT instead of FUTEX_WAIT_PRIVATE by the wait/notify functions: these should work across processes on shared memory. The futex(2) man page says the _PRIVATE versions (only for threads of a single process) allow some additional optimizations.
I don't know about other systems, although I've heard some kinds of locks on Windows / MSVC have to suck (always a syscall even on the fast path) for backwards-compat with some ABI choices, or something. Like perhaps std::lock_guard is slow on MSVC, but std::unique_lock isn't?
Test code:
#include <atomic>
#include <thread>
#include <unistd.h> // for quick & dirty sleep and usleep. TODO: port to non-POSIX.
#include <mutex>
#if 1 // test std::atomic
//std::atomic_unsigned_lock_free gvar;
//std::atomic_uint_fast_wait_t gvar;
std::atomic<unsigned> gvar;
void waiter(){
volatile unsigned sink;
while (1) {
sink = gvar;
gvar.wait(sink); // block until it's not the old value.
// on Linux/glibc, 4x sched_yield(), then futex(0x562c3c4881c0, FUTEX_WAIT, 46, NULL ) or whatever the current counter value is
}
}
void notifier(){
while(1){
sleep(1);
gvar.store(gvar.load(std::memory_order_relaxed)+1, std::memory_order_relaxed);
gvar.notify_one();
}
}
#else
std::mutex mut;
void waiter(){
unsigned sink = 0;
while (1) {
mut.lock(); // Linux/glibc2.33 - just a futex system call, maybe some spinning in user-space first. But no sched_yield
mut.unlock();
sink++;
usleep(sink); // give the other thread plenty of time to take the lock, so we don't get it twice.
}
}
void notifier(){
while(1){
mut.lock();
sleep(1);
mut.unlock();
}
}
#endif
int main(){
std::thread t (waiter); // comment this to test the no-contention case
notifier();
// loops forever: kill it with control-C
}
Compile with g++ -Og -std=gnu++20 notifywait.cpp -pthread, run with strace -f ./a.out to see the system calls. (A couple or a few per second, since I used a nice long sleep.)
If there's any spin-waiting in user-space, it's negligible compared to the 1 second sleep interval, so it uses about a millisecond of CPU time including startup, to run for 19 iterations. (perf stat ./a.out)
Usually your time would be better spent trying to reduce the amount of locking involved, or the amount of contention, rather than trying to optimize the locks themselves. Locking is an extremely important thing, and lots of engineering has gone into tuning it for most use-cases.
If you're rolling your own lock, you probably want to get your hands dirty with system-specific stuff, because it's all a matter of tuning choices. Different systems are unlikely to have made the same tuning choices for std::mutex and wait as Linux/glibc. Unless std::wait's retry strategy on the only system you care about happens to be perfect for your use-case.
It doesn't make sense to roll your own mutex without first investigating exactly what std::mutex does on your system, e.g. single-step the asm for the already-locked case to see what retries it makes. Then you'll have a better idea whether you can do any better.

You may want to distinguish between a mutex (which is generally a sleeping lock that interacts with the scheduler) and a spinlock (which does not put the current thread to sleep, and makes sense only when a thread on a different CPU is likely to be holding the lock).
Using C++20 atomics, you can definitely implement a spinlock, but this won't be directly comparable to std::mutex, which puts the current thread to sleep. Mutexes and spinlocks are useful in different situations. When successful, the spinlock is probably going to be faster--after all the mutex implementation likely contains a spinlock. It's also the only kind of lock you can acquire in an interrupt handler (though this is less relevant to user-level code). But if you hold the spinlock for a long time and there is contention, you will waste a huge amount of CPU time.

Why is there no std:: equivalent to pthread_spinlock_t like there is for pthread_mutex_t & std::mutex?

I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and using std::mutex, and I noticed that there doesn't seem to be an equivalent to spinlock in pthreads.
Anyone know why this is?

there doesn't seem to be an equivalent to spinlock in pthreads.
Spinlocks are often considered a wrong tool in user-space because there is no way to disable thread preemption while the spinlock is held (unlike in kernel). So that a thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily (and if those threads are of higher priority that may cause a deadlock (threads waiting for I/O may get a priority boost on wake up)). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there has been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications along with lockless (but not wait-free) data structures in user-space.
On Linux, there is adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through std::mutex interface because there is no option to customise non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
One can neither enable process-sharing, robostness, error-checking or priority-inversion prevention through std::mutex interface. In practice, people write their own wrappers of pthread_mutex_t which allows to set desirable mutex attributes; along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
IMO, there could be provisions to set desirable mutex and condition variable properties in std:: APIs, like providing a protected constructor for derived classes that would initialize that native_handle, but there aren't any. That native_handle looks like a good idea to do platform specific stuff, however, there must be a constructor for the derived class to be able to initialize it appropriately. After the mutex or condition variable is initialized that native_handle is pretty much useless. Unless the idea was only to be able to pass that native_handle to (C language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.

You are right there's no spin lock implementation in the std namespace. A spin lock is a great concept but in user space is generally quite poor. OS doesn't know your process wants to spin and usually you can have worse results than using a mutex. To be noted that on several platforms there's the optimistic spinning implemented so a mutex can do a really good job. In addition adjusting the time to "pause" between each loop iteration can be not trivial and portable and a fine tuning is required. TL;DR don't use a spinlock in user space unless you are really really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea

Spin locks have two advantages:
They require much fewer storage as a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop with a compare-and-set assembler instructions will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for, that they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor, is not a reason to not put spin locks into the library. Rather, it is a reason to put it there, to stop users from trying it themselves.
There is a second problem, that arises from the scheduler: If thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way, how userland code can tell the kernel "please don't preempt me in this critical code section". But if we know, that under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily, if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A. But it will stop the waste of CPU cycles, that otherwise would take place. And in most scenarios, where thread A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B, if B called std::this_thread::yield().
So, I am thinking about a template spin lock class, that takes a single unsigned integer as a parameter, which is the number of memory reads in the critical section. This parameter is then used in the library to calculate the appropriate number of spins, before a yield() is performed. With a zero count, yield() would never be called.

How is std::condition_variable::wait implemented?

I was trying to search for how std::conidition_variable::wait is implemented in the standard library on my local machine, I can see wait_unitl but I cannot find wait.
My question is, how is the wait function implemented internally, how would one make a thread sleep indefinitely, is it using some long timed sleep or something entirely different that is OS-specific?
Thanks!

Pre-emptive multithreading is a process governed largely by the operating system. It decides which threads get timeslices and/or assigned to which cores, and so forth. As such, for most low-level threading primitives (mutexes, conditional variables, etc), the real work is done inside OS calls.
Yes, you could in theory implement something like a conditional variable with nothing more than atomic accesses and timed thread suspension. However, it would perform extremely poorly. Modern OS's know when a thread is waiting on a condition and can wake that thread up "immediately" when the condition is satisfied. Your mechanism requires that the waiting thread wait until some specific time has passed.
Plus, you'd have a whole bunch of spurious wake-ups that you have to check for, thus using thread time for no reason. The OS-based implementation will have far fewer spurious wake-ups.

c/c++ maximum number of mutexes allowed in Linux

I've been searching trying to find out what is the maximum number of mutexes in Linux for a c/c++ process without success. Also, is there a way to modify this number. The book I'm reading mentions how to find the max number of threads allowed in Linux and how to modify this number but no mention of mutexes.

Check this pthread_mutex_init.
Why No Limits are Defined
Defining symbols for the maximum number of mutexes and condition variables was considered but rejected because the number of these objects may change dynamically. Furthermore, many implementations place these objects into application memory; thus, there is no explicit maximum.
EDIT: In the comments you asked about the costs a mutex may have other than memory. Well, I don't know, but I found some interesting material about that:
This article on How does a Mutex Work says this about the costs:
The Costs
There are a few points of interest when it comes to the cost of a mutex. The first, and very vital point, is waiting time. Your threads should spend only a fraction of their time waiting on mutexes. If they are waiting too often then you are losing concurrency. In a worst case scenario many threads always trying to lock the same mutex may result in performance worse than a single thread serving all requests. This really isn’t a cost of the mutex itself, but a serious concern with concurrent programming.
The overhead costs of a mutex relate to the test-and-set operation and the system call that implements a mutex. The test-and-set is likely very low cost; being essential to concurrent processing the CPUs have good reason to make it efficient. We’ve kind of omitted another important instruction however: the fence. This is used in all high-level mutexes and may have a higher cost than the test-and-set operation. More costlier than even that however is the system call. Not only do you suffer the context switch overhead of the system call, the kernel now spends some time in its scheduling code.
So I'm guessing the costs they talk about on the EAGAIN error involves either the CPU or internal kernel structures. Maybe both. Maybe some kernel error... I honestly don't know.
StackOverflow resources
I picked some SO Q&A that might interest you. Good reading!
How efficient is locking an unlocked mutex? What is the cost of a mutex?
How pthread_mutex_lock is implemented
How do mutexes really work?
When should we use mutex and when should we use semaphore

Choosing between Critical Sections, Mutex and Spin Locks

What are the factors to keep in mind while choosing between Critical Sections, Mutex and Spin Locks? All of them provide for synchronization but are there any specific guidelines on when to use what?
EDIT: I did mean the windows platform as it has a notion of Critical Sections as a synchronization construct.

In Windows parlance, a critical section is a hybrid between a spin lock and a non-busy wait. It spins for a short time, then--if it hasn't yet grabbed the resource--it sets up an event and waits on it. If contention for the resource is low, the spin lock behavior is usually enough.
Critical Sections are a good choice for a multithreaded program that doesn't need to worry about sharing resources with other processes.
A mutex is a good general-purpose lock. A named mutex can be used to control access among multiple processes. But it's usually a little more expensive to take a mutex than a critical section.

General points to consider:
The performance cost of using the mechanism.
The complexity introduced by using the mechanism.
In any given situation 1 or 2 may be more important.
E.g.
If you using multi-threading to write a high performance algorithm by making use of many cores and need to guard some data for safe access then 1 is probably very important.
If you have an application where a background thread is used to poll for some information on a timer and on the rare occasion it notices an update you need to guard some data for access then 2 is probably more important than 1.
1 will be down to the underlying implementation and probably scales with the scope of the protection e.g. a lock that is internal to a process is normally faster than a lock across all processes on a machine.
2 is easy to misjudge. First attempts to use locks to write thread safe code will normally miss some cases that lead to a deadlock. A simple deadlock would occur for example if thread A was waiting on a lock held by thread B but thread B was waiting on a lock held by thread A. Surprisingly easy to implement by accident.
On any given platform the naming and qualities of locking mechanisms may vary.
On windows critical sections are fast and process specific, mutexes are slower but cross process. Semaphores offer more complicated use cases. Some problems e.g. allocation from a pool may be solved very efficently using atomic functions rather than locks e.g. on windows InterlockedIncrement which is very fast indeed.

A Mutex in Windows is actually an interprocess concurrency mechanism, making it incredibly slow when used for intraprocess threading. A Critical Section is the Windows analogue to the mutex you normally think of.
Spin Locks are best used when the resource being contested is usually not held for a significant number of cycles, meaning the thread that has the lock is probably going to give it up soon.
EDIT : My answer is only relevant provided you mean 'On Windows', so hopefully that's what you meant.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js