C++ library version of upcoming C++20 wait/notify

C++ library version of upcoming C++20 wait/notify - c++

C++20 is introducing https://en.cppreference.com/w/cpp/atomic/atomic/wait and https://en.cppreference.com/w/cpp/atomic/atomic/notify_one, which introduces atomic waiting and notification functionality.
I'd like to use this but my compiler doesn't support it yet.
Are there any reliable library implementations of this functionality?
I'm envisioning using this functionality in lock-free producer-consumer situation where the consumer goes to sleep and is woken up when there's something to consume. There are no mutexes in my current implementation. Am I correct that such wait/notify functionality would allow me to do what I want to do (i.e., let my consumer thread go to sleep and let the producer efficiently wake it up without acquiring a mutex (something that would be necessary, I believe, with a condition variable).
Actually, is there an even better way to do this that I'm not currently thinking of?

Are there any reliable library implementations of this functionality?
You can use Boost.Atomic if your compiler's standard library implementation does not support it yet.
Am I correct that such wait/notify functionality would allow me to do what I want to do (i.e., let my consumer thread go to sleep and let the producer efficiently wake it up without acquiring a mutex (something that would be necessary, I believe, with a condition variable).
Correct.
Actually, is there an even better way to do this that I'm not currently thinking of?
Atomic wait is a good way. You may also want to use it the best way:
Avoid notifying when you know that the other party does not wait. Some implementations would check it for you, some wouldn't.
Use the same atomic size as underlying OS primitive uses. This means using 32-bit queue counters on Linux, where the native primitive is futex, and it takes 32-bit integer. On Windows, WaitOnAddress can take any size of 8, 16, 32, or 64. On other systems consider their implementation of atomic wait.
As other good way you can consider C++20 semaphores.

Related

Is C++ std::atomic compatible with pthreads?

I have 2 pthread threads where one is writing a bool value and another is reading it.
I dont care for portability. Its x86 architecture. The only which concerns me is writing thread sets bool to true and starts doing its own work (which happens once a day at midnight) closing a file. And the other thread had read the bool as false and proceeds with its work (writing to a file) at the same time. Its very difficult to reproduce this scenario so I better get best possible theoretical solution.
Can I use std::atomic in case of pthreads?

Can I use std::atomic in case of pthreads?
Yes, that's what std::atomic is for.
It works with std::thread, POSIX threads, and any other kind of threads. Behind the scenes it uses "magical" compiler annotations to prevent certain thread-incompatible optimizations, and processor-specific locking instructions to guarantee that thread-safe code is generated1.
It makes (almost) no sense to use std::atomic without threads (you could use std::atomic instead of volatile for signal handlers, but there is no advantage in doing so).
The only which concerns me ...
The rest of your question makes no sense to me.
1 When used correctly, which is often non-trivial thing to do, and which is why you generally should try not to use std::atomic unless you are an expert.

Does Boost have support for Windows EnterCriticalSection API?

I know Boost has support for mutexes and lock_guard, which can be used to implement critical sections.
But Windows has a special API for critical sections (see EnterCriticalSection and LeaveCriticalSection) which is a LOT faster than a mutex (for rarely contended, short sections of code).
Hence my question - it is possible in Boost to take advantage of this API, and fallback to spinlock/mutex/futex-based implementation on other platforms?

The simple answer is no.
Here's some relevant background from an old mailing list thread:
BTW. I am agree that mutex is more universal solution from a
performance point of view. But to be fair - CS are faster in simple
design. I believe that possibility to support them should be at
least
taken in account.
This was the article that someone pointed me to. The conclusion was
that CS are only faster if:
There are less than 8 threads total in the process.
You weren't running in the background.
You weren't on an dual processor machine.
To me this means that simple testing yields good CS performance
results, but any real world program is better off with a full blown
mutex.
I'm not adverse to supporting a CS implementation. However, I
originally chose not to for the following reasons:
You get either construction and destruction hits from using a PIMPL
idiom or you must include Windows.h in the Boost.Threads headers,
which I simply don't want to do. (This can be worked around by
emulating a CS ala OPTEX from the MSDN.)
According to this research paper most programs won't benefit from
a CS design.
It's trivial to code a (non-portable) critical_section class that
follows the Mutex model if you truly can make use of this.
For now I think I've made the right choice, though down the road we
may change the implementation to use a critical section or OPTEX.
Bill Kempf

Speaking as someone who helps out maintaining Boost.Thread, and as someone who failed to get an event object into Boost.Thread, I don't think critical sections have ever been added nor would be added to Boost for these reasons:
A Win32 critical section is trivially easy to build using a boost::atomic and a boost::condition_variable, so much so it isn't really worth having an official one. Here is probably the most complex one you could imagine, but extremely configurable including being constexpr ready (don't ask!): https://github.com/ned14/boost.outcome/blob/master/include/boost/outcome/v1/spinlock.hpp#L331
You can build your own simply by matching (Basic)Lockable concept and using atomic compare_exchange (non-x86/x64) or atomic exchange (x86/x64) and then grab it using a lock_guard around the critical section.
Some may object that a win32 critical section is not this. I am afraid it is: it simply spins on an atomic for a spin count, and then lazily tries to allocate a win32 event object which it then waits upon. Nothing special.
As much as you might think critical sections (really user mode mutexes) are better/faster/whatever, they probably are not as great as you might think. boost::mutex is a big vast heavyweight thing on Windows internally using a win32 semaphore as the kernel wait object because of the need to emulate thread cancellation and to behave well in a general purpose use context. It's easy to write a concurrency structure which is faster than another for some single use case, but it is very very hard to write a concurrency structure which is all of:
Faster than a standard implementation in the uncontended case.
Faster than a standard implementation in the lightly contended case.
Faster than a standard implementation in the heavily contended case.
Even if you manage all three of the above, that still isn't enough: you also need some guarantees on worst case progression ordering, so whether certain patterns of locks, waits and unlocks produce predictable outcomes. This is why threading facilities can appear to look slow in narrow use case scenarios, so Boost.Thread much as the STL can appear to be much slower than hand rolled locking code in say an uncontended use case.
Boost.Thread already does substantial work in user mode to avoid going to kernel sleep on Windows. On POSIX any of the major pthreads implementations also does substantial work to avoid kernel sleeps and hence Boost.Thread doesn't replicate that work. In other words, critical sections don't gain you anything in terms of scaling to load behaviours, though inevitably Boost.Thread v4 especially on Windows does a ton load of work a naive implementation does not (the planned rewrite of Boost.Thread is vastly more efficient on Windows as it can assume Windows Vista or above).

So, it looks like the default Boost mutex doesn't support it, but asio::detail::mutex does.
So I ended up using that:
#include <boost/asio/detail/mutex.hpp>
#include <boost/thread.hpp>
using boost::asio::detail::mutex;
using boost::lock_guard;
int myFunc()
{
static mutex mtx;
lock_guard<mutex> lock(mtx);
. . .
}

Reader-writer lock with condition variable

I find neither boost nor tbb library's condition variable has the interface of working with reader-writer lock (ie. shared mutex in boost). condition_variable::wait() only accepts mutex lock. But I think it's quite reasonable to have it work with reader-writer lock. Can anyone tell me the reason why they don't support that, or why people don't do that?
Thanks,
Cui

The underlying platform's native threading API might not be able to support it easily. For example, on a POSIX platform where a condition variable is implemented in terms of pthread_cond_t it can only be used with pthread_mutex_t. In order to get maximum performance the basic condition variable type is a lightweight wrapper over the native types, with no additional overhead.
If you want to use other types of mutex you should use std::condition_variable_any or boost::condition_variable_any, which work with any type of mutex. This has a small additional overhead due to using an internal mutex of the native plaform's type in addition to the user-supplied mutex. (I don't know if TBB offers an equivalent type.)
It's a design trade-off that allows either performance or flexibility. If you want maximum performance you get it with condition_variable but can only use simple mutexes. If you want more flexibility you can get that with condition_variable_any but you must sacrifice a little performance.

How to write your own condition variable using atomic primitives

I need to write my own implementation of a condition variable much like pthread_cond_t.
I know I'll need to use the compiler provided primitives like __sync_val_compare_and_swap etc.
Does anyone know how I'd go about this please.
Thx

Correct implementation of condition variables is HARD. Use one of the many libraries out there instead (e.g. boost, pthreads-win32, my just::thread library)
You need to:
Keep a list of waiting threads (this might be a "virtual" list rather than an actual data structure)
Ensure that when a thread waits you atomically unlock the mutex owned by the waiting thread and add it to the list before that thread goes into a blocking OS call
Ensure that when the condition variable is notified then one of the threads waiting at that time is woken, and not one that waits later
Ensure that when the condition variable is broadcast then all of the threads waiting at that time are woken, and not any threads that wait later.
plus other issues that I can't think of just now.
The details vary with OS, as you are dependent on the OS blocking/waking primitives.

I need to write my own implementation of a condition variable much like pthread_cond_t.
The condition variables cannot be implemented using only the atomic primitives like compare-and-swap.
The purpose in life of the cond vars is to provide flexible mechanism for application to access the process/thread scheduler: put a thread into sleep and wake it up.
Atomic ops are implemented by the CPU, while process/thread scheduler is an OS territory. Without some supporting system call (or emulation using existing synchronization primitives) implementing cond vars is impossible.
Edit1. The only sensible example I know and can point you to is the implementation of the historical Linux pthread library which can be found here - e.g. version from 1997. The implementation (found in condvar.c file) is rather easy to read but also highlights the requirements for implementation of the cond vars. Spinlocks (using test-and-set op) are used for synchronizations and POSIX signals are used to put threads into sleep and to wake them up.

It depends on your requirements. IF you have no further requirements, and if your process may consume 100% of available CPU time, then you have the rare chance to experiment and try out different mutex and condition variables - just try it out, and learn about the details. Great thing.
But in reality, you are uusally bound to an operating system, and so you are captivated on the OSs threading primitives, because they represent the only kind of control to - yeah - process/threading/cpu ressource usage! So, in that case, you will not even have the chance to implement your OWN condition variables - if they are not based on the primites, that the OS provides you!
So... double check your environment, what do you control? What don't you control? And what makes sense?

Why do libraries implement their own basic locks on windows?

Windows provides a number of objects useful for synchronising threads, such as event (with SetEvent and WaitForSingleObject), mutexes and critical sections.
Personally I have always used them, especially critical sections since I'm pretty certain they incur very little overhead unless already locked. However, looking at a number of libraries, such as boost, people then to go to a lot of trouble to implement their own locks using the interlocked methods on Windows.
I can understand why people would write lock-less queues and such, since thats a specialised case, but is there any reason why people choose to implement their own versions of the basic synchronisation objects?

Libraries aren't implementing their own locks. That is pretty much impossible to do without OS support.
What they are doing is simply wrapping the OS-provided locking mechanisms.
Boost does it for a couple of reasons:
They're able to provide a much better designed locking API, taking advantage of C++ features. The Windows API is C only, and not very well-designed C, at that.
They are able to offer a degree of portability. the same Boost API can be used if you run your application on a Linux machine or on Mac. Windows' own API is obviously Windows-specific.
The Windows-provided mechanisms have a glaring disadvantage: They require you to include windows.h, which you may want to avoid for a large number of reasons, not least its extreme macro abuse polluting the global namespace.

One particular reason I can think of is portability. Windows locks are just fine on their own but they are not portable to other platforms. A library which wishes to be portable must implement their own lock to guarantee the same semantics across platforms.

In many libraries (aka Boost) you need to write corss platform code. So, using WaitForSingleObject and SetEvent are no-go. Also, there common idioms, like Monitors, Conditions that Win32 API misses, (but it can be implemented using these basic primitives)
Some lock-free data structures like atomic counter are very useful; for example: boost::shared_ptr uses them in order to make it thread safe without overhead of critical section, most compilers (not msvc) use atomic counters in order to implement thread safe copy-on-write std::string.
Some things like queues, can be implemented very efficiently in thread safe way without locks at all that may give significant perfomance boost in certain applications.

There may occasionally be good reasons for implementing your own locks that don't use the Windows OS synchronization objects. But doing so is a "sharp stick." It's easy to poke yourself in the foot.
Here's an example: If you know that you are running the same number of threads as there are hardware contexts, and if the latency of waking up one of those threads which is waiting for a lock is very important to you, you might choose a spin lock implemented completely in user space. If the waiting thread is the only thread spinning on the lock, the latency of transferring the lock from the thread that owns it to the waiting thread is just the latency of moving the cache line to the owner thread and back to the waiting thread -- orders of magnitude faster than the latency of signaling a thread with an OS lock under the same circumstances.
But the scenarios where you want to do this is pretty narrow. As soon as you start having more software threads than hardware threads, you'll likely regret it. In that scenario, you could spend entire OS scheduling quanta doing nothing but spinning on your spin lock. And, if you care about power, spinlocks are bad because they prevent the processor from going into a low-power state.
I'm not sure I buy the portability argument. Portable libraries often have an OS portability layer that abstracts the different OS APIs for synchronization. If you're dealing with locks, a pthread_mutex can be made semantically the same as a Windows Mutex or Critical Section under an abstraction layer. There's some exceptions here, but for most people this is true. If you're dealing with Windows Events or POSIX condition variables, well, those are tougher to abstract. (Vista did introduce POSIX-style condition variables, but not many Windows software developers are in a position to require Vista...)

Writing locking code for a library is useful if that library is meant to be cross platform. Users of the library can use the library's locking functionality and not have to care about the underlying platform implementation. Assuming the library has versions for all the platforms being targetted it's one less bit of code that has to be ported.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js