What is the overhead associated with std::condition_variable_any

I have read in many places that there is some overhead associated with std::condition_variable_any. Just wondering, what is this overhead?
My guess here is that since this is a generic condition variable that can work with any type of lock, it requires a manually rolled implementation of waiting (perhaps with another condition_variable and mutex or futex, or something similar) so the extra overhead probably comes from that? But not sure... As opposed to just being a native wrapper around pthread_cond_wait() (and equivalent on other systems) etc.
As a followup, if I was say implementing something that waits on, say, a shared mutex, then is this type of condition variable a bad choice because of the performance overhead? What else can I do in this situation?

pthread_cond_wait() / SleepConditionVariableSRW(), like the plain std::condition_variable::wait(), require just a single, atomic syscall for releasing the mutex, waiting on the condition variable and re-acquiring the mutex. The thread immediately goes to sleep and another thread, ideally one which was blocked on the mutex, can take over immediately on the same core.
With std::condition_variable_any, unlocking the passed BasicLockable and starting to wait on the native event / condition is more than a single syscall: it invokes the unlock() method on the BasicLockable first and only then issues the syscall for waiting. So you have at least the overhead of the separate unlock(), plus you are more likely to trigger a less than ideal scheduling decision on the OS side. In the worst case, the unlock even causes a waiting thread to continue on a different core, with all the associated overhead.
The other way around, e.g. on spurious wakes, there are also OS side scheduling optimizations possible when dealing with a native mutex (as used in std::mutex) which don't apply with a generic BasicLockable.
Both involve some bookkeeping in order to provide the notify_all() logic (it's actually one event / condition per waiting thread) as well as the guarantees about all methods being atomic, so both come with a small overhead anyway.
The real overhead comes from how well the OS can make a good scheduling decision on the combined signal-and-wait-and-lock syscall. If the OS isn't smart about the scheduling, then it makes virtually no difference.
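For the follow-up about waiting on a shared mutex, here is a minimal sketch of what using std::condition_variable_any with std::shared_mutex could look like. SharedState, wait_and_read, publish and run_demo are made-up names for illustration. Whether the extra overhead matters for you is something only a measurement on your platform can tell.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Hypothetical shared state guarded by a reader/writer lock.
struct SharedState {
    std::shared_mutex mtx;
    std::condition_variable_any cv;  // accepts any BasicLockable, e.g. std::shared_lock
    bool ready = false;
    int value = 0;
};

// Reader: waits while holding only a shared (read) lock.
int wait_and_read(SharedState& s) {
    std::shared_lock<std::shared_mutex> lock(s.mtx);
    s.cv.wait(lock, [&s] { return s.ready; });
    return s.value;
}

// Writer: takes the exclusive lock, publishes the value, then notifies.
void publish(SharedState& s, int v) {
    {
        std::unique_lock<std::shared_mutex> lock(s.mtx);
        s.value = v;
        s.ready = true;
    }
    s.cv.notify_all();
}

// Starts one reader, publishes 42 and returns what the reader saw.
int run_demo() {
    SharedState s;
    int seen = 0;
    std::thread reader([&] { seen = wait_and_read(s); });
    publish(s, 42);
    reader.join();
    return seen;
}
```

The plain std::condition_variable would not compile here, because its wait() only accepts std::unique_lock<std::mutex>.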

Related

Why is there no std:: equivalent to pthread_spinlock_t like there is for pthread_mutex_t & std::mutex?

I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and using std::mutex, and I noticed that there doesn't seem to be an equivalent to spinlock in pthreads.
Anyone know why this is?

there doesn't seem to be an equivalent to spinlock in pthreads.
Spinlocks are often considered the wrong tool in user space because, unlike in the kernel, there is no way to disable thread preemption while the spinlock is held. A thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily. If those threads are of higher priority, that may even cause a deadlock (threads waiting for I/O may get a priority boost on wake-up). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::lockfree::spsc_queue).
In the kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. That is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there have been recent kernel changes towards such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of that configuration. Nevertheless, people do use such a setup for business-critical applications along with lockless (but not wait-free) data structures in user space.
On Linux, there is the adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through the std::mutex interface because there is no option to customise the non-portable pthread_mutexattr_t before initialising pthread_mutex_t.
Nor can one enable process-sharing, robustness, error-checking or priority-inversion prevention through the std::mutex interface. In practice, people write their own wrappers of pthread_mutex_t which allow setting the desired mutex attributes, along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
IMO, there could be provisions to set desirable mutex and condition variable properties in std:: APIs, like providing a protected constructor for derived classes that would initialize that native_handle, but there aren't any. That native_handle looks like a good idea to do platform specific stuff, however, there must be a constructor for the derived class to be able to initialize it appropriately. After the mutex or condition variable is initialized that native_handle is pretty much useless. Unless the idea was only to be able to pass that native_handle to (C language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard those are probably right decisions because educating users to use spinlocks and semaphores correctly with all the nuances is a difficult task. Whereas advanced users can whip out a wrapper for pthread_spinlock_t with little effort.
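To illustrate the "write your own wrapper" point, here is a rough sketch of a pthread_mutex_t wrapper that sets a mutex attribute at construction and still works with std::lock_guard. The class name PosixMutex and the helper relock_is_detected are made up for this example. It uses the POSIX error-checking type, and a wrapper for pthread_spinlock_t or PTHREAD_MUTEX_ADAPTIVE_NP would follow the same shape.

```cpp
#include <cassert>
#include <cerrno>
#include <mutex>
#include <pthread.h>
#include <system_error>

// Hypothetical wrapper (not part of any library): a pthread mutex whose type
// attribute is chosen at construction, which std::mutex does not allow.
class PosixMutex {
public:
    explicit PosixMutex(int type = PTHREAD_MUTEX_ERRORCHECK) {
        pthread_mutexattr_t attr;
        check(pthread_mutexattr_init(&attr));
        int rc = pthread_mutexattr_settype(&attr, type);
        if (rc == 0) rc = pthread_mutex_init(&mtx_, &attr);
        pthread_mutexattr_destroy(&attr);
        check(rc);
    }
    ~PosixMutex() { pthread_mutex_destroy(&mtx_); }
    PosixMutex(const PosixMutex&) = delete;
    PosixMutex& operator=(const PosixMutex&) = delete;

    // lock()/unlock()/try_lock() make the class usable with std::lock_guard
    // and std::unique_lock.
    void lock() { check(pthread_mutex_lock(&mtx_)); }
    void unlock() { pthread_mutex_unlock(&mtx_); }  // errors ignored: runs in destructors
    bool try_lock() {
        int rc = pthread_mutex_trylock(&mtx_);
        if (rc == EBUSY) return false;
        check(rc);
        return true;
    }
    pthread_mutex_t* native_handle() { return &mtx_; }

private:
    static void check(int rc) {
        if (rc != 0) throw std::system_error(rc, std::generic_category());
    }
    pthread_mutex_t mtx_;
};

// An error-checking mutex reports a relock by the owning thread as EDEADLK
// instead of deadlocking.
bool relock_is_detected() {
    PosixMutex m;
    std::lock_guard<PosixMutex> guard(m);
    try {
        m.lock();
    } catch (const std::system_error& e) {
        return e.code().value() == EDEADLK;
    }
    m.unlock();  // only reached if the second lock unexpectedly succeeded
    return false;
}
```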

You are right that there's no spin lock implementation in the std namespace. A spin lock is a great concept, but in user space it generally performs quite poorly. The OS doesn't know your process wants to spin, and you can often get worse results than with a mutex. Note that several platforms implement optimistic spinning, so a mutex can do a really good job. In addition, choosing how long to "pause" between loop iterations is neither trivial nor portable, and fine tuning is required. TL;DR: don't use a spinlock in user space unless you are really, really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea

Spin locks have two advantages:
They require much less storage than a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They perform much better than std::mutex if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop around a compare-and-set instruction will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for: they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor is not a reason to keep spin locks out of the library. Rather, it is a reason to put them there, to stop users from trying it themselves.
There is a second problem, that arises from the scheduler: If thread A acquires the lock and then gets preempted by the scheduler before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, before thread A gets scheduled again) on that lock.
Unfortunately, there is no way for userland code to tell the kernel "please don't preempt me in this critical code section". But if we know that under normal circumstances the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A, but it will stop the waste of CPU cycles that would otherwise take place. And in most scenarios where threads A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B if B called std::this_thread::yield().
So, I am thinking about a template spin lock class, that takes a single unsigned integer as a parameter, which is the number of memory reads in the critical section. This parameter is then used in the library to calculate the appropriate number of spins, before a yield() is performed. With a zero count, yield() would never be called.
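A rough sketch of that idea, simplified so that the template parameter is the spin count itself rather than something derived from the number of memory reads. The names YieldingSpinLock, SpinsBeforeYield and count_with_spinlock are invented for this example. It spins on plain loads between acquisition attempts to avoid the unnecessary writes mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical spin lock: SpinsBeforeYield is an illustrative tuning knob.
// With 0, yield() is never called.
template <unsigned SpinsBeforeYield>
class YieldingSpinLock {
public:
    void lock() {
        unsigned spins = 0;
        for (;;) {
            // exchange() returns the previous value: false means we got the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
            // Wait using loads only, so the cache line is not hammered with writes.
            while (locked_.load(std::memory_order_relaxed)) {
                if (SpinsBeforeYield != 0 && ++spins >= SpinsBeforeYield) {
                    std::this_thread::yield();  // give the lock holder a chance to run
                    spins = 0;
                }
            }
        }
    }
    bool try_lock() { return !locked_.exchange(true, std::memory_order_acquire); }
    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Increments a plain counter from several threads under the spin lock.
long count_with_spinlock(int threads, int iterations) {
    YieldingSpinLock<64> lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<YieldingSpinLock<64>> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The value 64 is arbitrary. A real implementation would have to tune it per platform, which is exactly the portability problem raised in the other answers.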

How is std::condition_variable::wait implemented?

I was trying to find out how std::condition_variable::wait is implemented in the standard library on my local machine. I can see wait_until but I cannot find wait.
My question is, how is the wait function implemented internally? How would one make a thread sleep indefinitely? Is it using some long timed sleep or something entirely different that is OS-specific?
Thanks!

Pre-emptive multithreading is a process governed largely by the operating system. It decides which threads get timeslices and/or are assigned to which cores, and so forth. As such, for most low-level threading primitives (mutexes, condition variables, etc.), the real work is done inside OS calls.
Yes, you could in theory implement something like a condition variable with nothing more than atomic accesses and timed thread suspension. However, it would perform extremely poorly. Modern OSes know when a thread is waiting on a condition and can wake that thread up "immediately" when the condition is satisfied. Your mechanism requires that the waiting thread wait until some specific time has passed.
Plus, you'd have a whole bunch of spurious wake-ups that you have to check for, thus using thread time for no reason. The OS-based implementation will have far fewer spurious wake-ups.
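Even with the OS-based implementation, spurious wake-ups are still allowed, so callers re-check their condition in a loop. A small sketch of that calling pattern follows (Flag, wait_until_set, set_flag and run_flag_demo are names made up here). The predicate overload wait(lock, pred) is specified to behave like this loop.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical flag that one thread waits on and another sets.
struct Flag {
    std::mutex m;
    std::condition_variable cv;
    bool set = false;
};

// Blocks inside the OS-backed wait(); the loop re-checks the condition
// because a wake-up does not guarantee that the flag was actually set.
void wait_until_set(Flag& f) {
    std::unique_lock<std::mutex> lock(f.m);
    while (!f.set) {
        f.cv.wait(lock);
    }
}

// Sets the flag under the mutex, then wakes the waiter.
void set_flag(Flag& f) {
    {
        std::lock_guard<std::mutex> lock(f.m);
        f.set = true;
    }
    f.cv.notify_one();
}

// Returns true once the waiter has observed the flag and exited.
bool run_flag_demo() {
    Flag f;
    std::thread waiter([&f] { wait_until_set(f); });
    set_flag(f);
    waiter.join();
    return f.set;
}
```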

Is there a C++ design pattern that implements a mechanism or mutex that controls the amount of time a thread can own a locked resource?

I am looking for a way to guarantee that any time a thread locks a specific resource, it is forced to release that resource after a specific period of time (if it has not already released it). Envision a connection where you need to limit the amount of time any specific thread can own that connection for.
I envision this is how it could be used:
{
    std::lock_guard<std::TimeLimitedMutex> lock(this->myTimeLimitedMutex, timeout);
    try {
        // perform some operation with the resource that myTimeLimitedMutex guards.
    }
    catch (MutexTimeoutException ex) {
        // perform cleanup
    }
}
I see that there is a timed_mutex that lets the program timeout if a lock cannot be acquired. I need the timeout to occur after the lock is acquired.
There are already some situations where you get a resource that can be taken away unexpectedly. For instance, a TCP socket: once a socket connection is made, code on each side needs to handle the case where the other side drops the connection.
I am looking for a pattern that handles types of resources that normally time out on their own, but need to be reset when they don't. This does not have to handle every type of resource.

This can't work, and it will never work. In other words, this can never be made. It goes against every concept of ownership and atomic transactions. When a thread acquires the lock and performs two transactions in a row, it expects them to become atomically visible to the outside world. In this scenario, it would be very possible for the transaction to be torn: the first part of it would be performed, but the second would not.
What's worse is that, since the lock would be forcefully removed, the partially executed transaction would become visible to the outside world before the interrupted thread has any chance to roll back.
This idea runs contrary to every school of multi-threaded thinking.

I support SergeyA's answer. Releasing a locked mutex after a timeout is a bad idea and cannot work. Mutex stands for mutual exclusion, and this is a rock-hard contract which cannot be violated.
But you can do what you want:
Problem: You want to guarantee that your threads do not hold the mutex longer than a certain time T.
Solution: Never lock the mutex for longer than time T. Instead write your code so that the mutex is locked only for the absolutely necessary operations. It is always possible to give such a time T (modulo the uncertainties and limits given by a multitasking and multiuser operating system, of course).
To achieve that (examples):
Never do file I/O inside a locked section.
Never call a system call while a mutex is locked.
Avoid sorting a list while a mutex is locked (*).
Avoid doing a slow operation on each element of a list while a mutex is locked (*).
Avoid memory allocation/deallocation while a mutex is locked (*).
There are exceptions to these rules, but the general guideline is:
Make your code slightly less optimal (e.g. do some redundant copying inside the critical section) to make the critical section as short as possible. This is good multithreading programming.
(*) These are just examples for operations where it is tempting to lock the entire list, do the operations and then unlock the list. Instead it is advisable to just take a local copy of the list and clear the original list while the mutex is locked, ideally by using the swap() operation offered by most STL containers. And then do the slow operation on the local copy outside of the critical section. This is not always possible but always worth considering. Sorting has square complexity in the worst case and usually needs random access to the entire list. It is useful to sort (a copy of) the list outside of the critical section and later check whether elements need to be added or removed. Memory allocations also have quite some complexity behind them, so massive memory allocations/deallocations should be avoided.
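A minimal sketch of the swap() technique described in (*), assuming a std::vector<int> as the shared list. WorkList, drain and drain_sorted are names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical shared list guarded by a mutex.
class WorkList {
public:
    void add(int v) {
        std::lock_guard<std::mutex> lock(m_);
        items_.push_back(v);
    }

    // Takes everything queued so far. The mutex is held only for the swap,
    // which is a constant-time exchange of internal pointers.
    std::vector<int> drain() {
        std::vector<int> local;
        {
            std::lock_guard<std::mutex> lock(m_);
            local.swap(items_);
        }
        return local;
    }

private:
    std::mutex m_;
    std::vector<int> items_;
};

// The slow operation (sorting) runs on the local copy, outside the
// critical section, so other threads can keep adding items meanwhile.
std::vector<int> drain_sorted(WorkList& list) {
    std::vector<int> local = list.drain();
    std::sort(local.begin(), local.end());
    return local;
}
```

After drain() the shared list is empty, so anything added while the copy is being sorted simply ends up in the next batch.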

You can't do that with only C++.
If you are using a Posix system, it can be done.
You'll have to trigger a SIGALRM signal that's only unmasked for the thread that'll time out. In the signal handler, you'll have to set a flag and use longjmp to return to the thread code.
In the thread code, on the setjmp position, you can only be called if the signal was triggered, thus you can throw the Timeout exception.
Please see this answer for how to do that.
Also, on Linux, it seems you can directly throw from the signal handler (so no longjmp/setjmp here).
BTW, if I were you, I would code the opposite. Think about it: You want to tell a thread "hey, you're taking too long, so let's throw away all the (long) work you've done so far so I can make progress".
Ideally, you should have your long thread be more cooperative, doing something like "I've done A of an ABCD task, let's release the mutex so others can progress on A. Then let's check if I can take it again to do B and so on."
You probably want to be more fine-grained (have more mutexes on smaller objects, but make sure you're locking in the same order) or use RW locks (so that other threads can use the objects if you're not modifying them), etc.
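As a sketch of the finer-grained variant: one mutex per object, with both taken together when an operation spans two objects. Instead of a hand-maintained lock order, this uses C++17's std::scoped_lock, which acquires several mutexes using a deadlock-avoidance algorithm. Account and transfer are hypothetical names.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical objects, each guarded by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves `amount` between two accounts while holding both mutexes.
// std::scoped_lock locks them without deadlocking, whatever order
// other threads pass the same two accounts in.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // locking the same std::mutex twice is undefined behaviour
    std::scoped_lock lock(from.m, to.m);
    from.balance -= amount;
    to.balance += amount;
}
```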

Such an approach cannot be enforced because the holder of the mutex needs the opportunity to clean up anything which is left in an invalid state part way through the transaction. This can take an unknown arbitrary amount of time.
The typical approach is to release the lock when doing long tasks, and re-acquire it as needed. You have to manage this yourself as everyone will have a slightly different approach.
The only situation I know of where this sort of thing is accepted practice is at the kernel level, especially with respect to microcontrollers (which either have no kernel, or are all kernel, depending on who you ask). You can set an interrupt which modifies the call stack, so that when it is triggered it unwinds the particular operations you are interested in.

"Condition" variables can have timeouts. This allows you to wait until a thread voluntarily releases a resource (with notify_one() or notify_all()), but the wait itself will timeout after a specified fixed amount of time.
Examples in the Boost documentation for "conditions" might make this more clear.
If you want to force a release, you have to write the code which will force it though. This could be dangerous. The code written in C++ can be doing some pretty close-to-the-metal stuff. The resource could be accessing real hardware and it could be waiting on it to finish something. It may not be physically possible to end whatever the program is stuck on.
However, if it is possible, then you can handle it in the thread in which the wait() times out.
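Something along these lines, using the standard library rather than Boost. This is only a sketch: Resource, acquire_for and release are made-up names. The waiter gives up after the timeout but nothing is taken away from the current owner, and what to do on timeout is left to the caller.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical resource whose ownership is tracked by a flag.
struct Resource {
    std::mutex m;
    std::condition_variable cv;
    bool in_use = false;
};

// Waits up to `timeout` for the current owner to release the resource
// voluntarily. Returns true if the caller now owns it, false on timeout.
bool acquire_for(Resource& r, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(r.m);
    if (!r.cv.wait_for(lock, timeout, [&r] { return !r.in_use; })) {
        return false;  // timed out; the owner still holds the resource
    }
    r.in_use = true;
    return true;
}

// Voluntary release by the owner; wakes one waiter.
void release(Resource& r) {
    {
        std::lock_guard<std::mutex> lock(r.m);
        r.in_use = false;
    }
    r.cv.notify_one();
}
```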

Do mutexes guarantee ordering of acquisition? Unlocking thread takes it again while others are still waiting

A coworker had an issue recently that boiled down to what we believe was the following sequence of events in a C++ application with two threads:
Thread A holds a mutex.
While thread A is holding the mutex, thread B attempts to lock it. Since it is held, thread B is suspended.
Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
It appears that thread A is given the mutex again; thread B is still waiting, even though it "asked" for the lock first.
Does this sequence of events fit with the semantics of, say, C++11's std::mutex and/or pthreads? I can honestly say I've never thought about this aspect of mutexes before.
Are there any fairness guarantees to prevent starvation of other threads for too long, or any way to get such guarantees?

Known problem. C++ mutexes are a thin layer on top of OS-provided mutexes, and OS-provided mutexes are often not fair. They do not care about FIFO order.
The other side of the same coin is that threads are usually not pre-empted until they run out of their time slice. As a result, thread A in this scenario was likely to continue to be executed, and got the mutex right away because of that.

The guarantee of a std::mutex is to enable exclusive access to shared resources. Its sole purpose is to eliminate the race condition when multiple threads attempt to access shared resources.
The implementer of a mutex may choose to favor the current thread acquiring a mutex (over another thread) for performance reasons. Allowing the current thread to acquire the mutex and make forward progress without requiring a context switch is often a preferred implementation choice supported by profiling/measurements.
Alternatively, the mutex could be constructed to prefer another (blocked) thread for acquisition (perhaps chosen according to FIFO order). This likely requires a thread context switch (on the same or another processor core), increasing latency/overhead. NOTE: FIFO mutexes can behave in surprising ways. E.g. thread priorities must be considered in FIFO support, so acquisition won't be strictly FIFO unless all competing threads have the same priority.
Adding a FIFO requirement to a mutex's definition constrains implementers to provide suboptimal performance in nominal workloads. (see above)
Protecting a queue of callable objects (std::function) with a mutex would enable sequenced execution. Multiple threads can acquire the mutex, enqueue a callable object, and release the mutex. The callable objects can be executed by a single thread (or a pool of threads if synchrony is not required).
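A minimal sketch of such a queue, assuming a single consumer that drains it. TaskQueue, post, run_pending and run_in_order are hypothetical names. A real version would usually add a condition variable so the consumer can sleep while the queue is empty.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical queue of callables guarded by one mutex.
class TaskQueue {
public:
    // Any thread may enqueue; the mutex is held only for the push.
    void post(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push(std::move(task));
    }

    // Runs all queued tasks in FIFO order on the calling thread and
    // returns how many were executed.
    std::size_t run_pending() {
        std::size_t ran = 0;
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> lock(m_);
                if (tasks_.empty()) return ran;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // executed without holding the mutex
            ++ran;
        }
    }

private:
    std::mutex m_;
    std::queue<std::function<void()>> tasks_;
};

// Posts two tasks and shows that they run in the order they were enqueued.
std::string run_in_order() {
    TaskQueue q;
    std::string log;
    q.post([&log] { log += 'a'; });
    q.post([&log] { log += 'b'; });
    q.run_pending();
    return log;
}
```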

• Thread A finishes the work that it was holding the mutex for, thus releasing the mutex.
• Very shortly thereafter, thread A needs to touch a resource that is protected by the mutex, so it locks it again.
In the real world, while the program is running, no threading library or OS provides such a guarantee. Here "shortly thereafter" may mean a lot to the OS and the hardware. If you say 2 minutes, then thread B would definitely get it. If you say 200 ms or less, there is no promise about whether A or B gets it.
The number of cores, the load on different processors/cores/threading units, contention, thread switching, kernel/user switches, pre-emption, priorities, deadlock detection schemes et al. will make a lot of difference. Just by seeing a green light from afar you cannot guarantee that it will still be green when you reach it.
If thread B must get the resource, you may use an IPC mechanism to instruct thread B to acquire it.

You are inadvertently suggesting that threads should synchronise access to the synchronisation primitive. Mutexes are, as the name suggests, about Mutual Exclusion. They are not designed for control flow. If you want to signal a thread to run from another thread you need to use a synchronisation primitive designed for control flow i.e. a signal.

You can use a fair mutex to solve your task, i.e. a mutex that guarantees FIFO order of acquisition. Unfortunately, the C++ standard library doesn't have a fair mutex.
Thankfully, there are open-source implementations, for example yamc (a header-only library).
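To show the idea (this is not how yamc implements it, just a rough sketch of one common approach), a ticket-based mutex hands out numbers and serves them in order. TicketMutex and count_fairly are hypothetical names, and thread priorities are ignored entirely.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical FIFO mutex: each lock() call draws a ticket and waits
// until that ticket is being served.
class TicketMutex {
public:
    void lock() {
        std::unique_lock<std::mutex> lk(m_);
        const unsigned long my_ticket = next_ticket_++;
        cv_.wait(lk, [this, my_ticket] { return now_serving_ == my_ticket; });
    }

    void unlock() {
        {
            std::lock_guard<std::mutex> lk(m_);
            ++now_serving_;
        }
        // Every waiter wakes up, but only the holder of the next ticket proceeds.
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    unsigned long next_ticket_ = 0;
    unsigned long now_serving_ = 0;
};

// Increments a plain counter from several threads under the ticket mutex.
long count_fairly(int threads, int iterations) {
    TicketMutex mtx;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&mtx, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<TicketMutex> guard(mtx);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With this, thread A re-locking right after unlock() draws a later ticket than the already waiting thread B, so B goes first. The price is a context switch on nearly every handover, which is the cost described in the other answers.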

The logic here is very simple - the thread is not preempted based on mutexes, because that would require a cost incurred for each mutex operation, which is definitely not what you want. The cost of grabbing a mutex is high enough without forcing the scheduler to look for other threads to run.
If you want to fix this you can always yield the current thread. You can use std::this_thread::yield() - http://en.cppreference.com/w/cpp/thread/yield - and that might offer the chance to thread B to take over the mutex. But before you do that, allow me to tell you that this is a very fragile way of doing things, and offers no guarantee. You could, alternatively, investigate the issue deeper:
Why is it a problem that the B thread is not started when A releases the resource? Your code should not depend on such logic.
Consider using alternative thread synchronization objects like barriers (boost::barrier or http://linux.die.net/man/3/pthread_barrier_wait ) instead, if you really need this sort of logic.
Investigate if you really need to release the mutex from A at that point. I find the practice of rapidly locking and releasing a mutex more than once a code smell, and it usually hurts performance terribly. See if you can group the extraction of data into immutable structures which you can play around with.
Ambitious, but try to work without mutexes - use instead lock-free structures and a more functional approach, including using a lot of immutable structures. I often found quite a performance gain from updating my code to not use mutexes (and still work correctly from the mt point of view)

How do you know this:
While thread A is holding the mutex, thread B attempts to lock it.
Since it is held, thread B is suspended.
How do you know thread B is suspended? How do you know that it has not just finished the line of code before trying to grab the lock, but not yet grabbed the lock:
Thread B:
x = 17; // is the thread here?
// or here? ('between' lines of code)
mtx.lock(); // or suspended in here?
// how can you tell?
You can't tell. At least not in theory.
Thus the order of acquiring the lock is, to the abstract machine (ie the language), not definable.

C++ Atomic/Mutex What way to follow?

I was wondering what the better choice is. Assume there is a trivially copyable object, let's say a queue data structure, that is used by several threads to pop/push data. The object provides only the methods pop/push, which can't be accessed by more than one thread at the same time. Obviously, while pop is being called, push can't be called either.
Would you suggest wrapping the model in an atomic type (if possible), or rather using mutexes?
Regards!

Atomic is a hardware thing, whereas mutex is an OS thing. A mutex will end up suspending the task, even though in some cases it will behave as a spinlock for a short period of time, aka "optimistic spin"; see https://lore.kernel.org/all/56C2673F.6070202@hpe.com/T/
So, if you have small operations like incrementing a variable, aka "atomic", without waiting for other things which might take longer, then atomic is for you.
If you want to (indefinitely) wait for some things to happen in other threads, polling for results via atomics, aka spinlock, might be a waste of CPU cycles therefore less cooperative, so it's better to use a mutex/condition variable which would suspend the task at a price of context switch latency.
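To make the distinction concrete, here is a small sketch with made-up names (count_atomically, LockedQueue): a single counter is updated with one atomic operation, while a queue, where push and pop each touch several pieces of state, is guarded by a mutex.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Single-word state: one atomic read-modify-write per update, no lock needed.
long count_atomically(int threads, int iterations) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// Multi-step state: a hypothetical wrapper that serialises push and pop
// with one mutex, so the two can never run at the same time.
class LockedQueue {
public:
    void push(int v) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(v);
    }

    // Stores the front element in `out` and returns true,
    // or returns false if the queue was empty.
    bool pop(int& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::queue<int> q_;
};
```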

Atomic is preferable for those kinds of cases. The atomic is a kind of operation supported by the CPU specifically whereas the other kinds of thread control tend to be implemented by the OS or other measures and incur more overhead.
EDIT: A quick search shows up this which has more info and is basically the same kind of question: Which is more efficient, basic mutex lock or atomic integer?
EDIT 2: And a more detailed article here http://www.informit.com/articles/article.aspx?p=1832575
