I need to write my own implementation of a condition variable much like pthread_cond_t.
I know I'll need to use the compiler provided primitives like __sync_val_compare_and_swap etc.
Does anyone know how I'd go about this, please?
Thx
Correct implementation of condition variables is HARD. Use one of the many libraries out there instead (e.g. boost, pthreads-win32, my just::thread library)
You need to:
Keep a list of waiting threads (this might be a "virtual" list rather than an actual data structure)
Ensure that when a thread waits you atomically unlock the mutex owned by the waiting thread and add it to the list before that thread goes into a blocking OS call
Ensure that when the condition variable is notified then one of the threads waiting at that time is woken, and not one that waits later
Ensure that when the condition variable is broadcast then all of the threads waiting at that time are woken, and not any threads that wait later.
plus other issues that I can't think of just now.
The details vary with OS, as you are dependent on the OS blocking/waking primitives.
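To make the "atomically unlock and add to the list" requirement concrete, here is a rough Linux-only sketch built on the futex system call. This is NOT a correct general-purpose condition variable (it ignores requeueing, fairness, and several of the issues above); the class and member names are my own illustrative choices.

#include <atomic>
#include <climits>
#include <pthread.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(void* addr, int op, unsigned val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

class naive_condvar {
    std::atomic<unsigned> seq_{0};  // generation counter: the "virtual" list of waiters
public:
    void wait(pthread_mutex_t* m) {
        unsigned snapshot = seq_.load();  // remember the generation while still holding the mutex
        pthread_mutex_unlock(m);
        // FUTEX_WAIT sleeps only if seq_ still equals snapshot, so a notify
        // that lands between the unlock and this call is not lost.
        futex(&seq_, FUTEX_WAIT, snapshot);
        pthread_mutex_lock(m);            // caller re-checks its predicate in a loop
    }
    void notify_one() {
        seq_.fetch_add(1);                // invalidate outstanding snapshots
        futex(&seq_, FUTEX_WAKE, 1);      // wake one thread waiting *now*, not later
    }
    void notify_all() {
        seq_.fetch_add(1);
        futex(&seq_, FUTEX_WAKE, INT_MAX); // wake everyone waiting *now*
    }
};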
I need to write my own implementation of a condition variable much like pthread_cond_t.
Condition variables cannot be implemented using only atomic primitives like compare-and-swap.
The purpose in life of condition variables is to give an application a flexible way to reach the process/thread scheduler: put a thread to sleep and wake it up.
Atomic ops are implemented by the CPU, while the process/thread scheduler is OS territory. Without some supporting system call (or an emulation using existing synchronization primitives), implementing condition variables is impossible.
Edit1. The only sensible example I know of and can point you to is the implementation in the historical Linux pthread library, which can be found here - e.g. the version from 1997. The implementation (in the condvar.c file) is rather easy to read and also highlights the requirements for an implementation of condition variables: spinlocks (using a test-and-set op) are used for synchronization, and POSIX signals are used to put threads to sleep and to wake them up.
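Roughly in that spirit, a heavily simplified sketch (SIGUSR1 as the "restart" signal and all function names here are my own illustrative choices; the real condvar.c also handles cancellation, signal restarting, and more):

#include <atomic>
#include <csignal>
#include <deque>
#include <pthread.h>

static std::atomic_flag guard = ATOMIC_FLAG_INIT;  // test-and-set spinlock
static std::deque<pthread_t> waiters;              // explicit list of waiting threads

static void on_restart(int) {}  // empty: delivery alone interrupts sigsuspend()

void cond_init() { std::signal(SIGUSR1, on_restart); }  // install once, before use

void cond_wait(pthread_mutex_t* m) {
    sigset_t block_restart, suspend_mask;
    sigemptyset(&block_restart);
    sigaddset(&block_restart, SIGUSR1);
    // Block the restart signal first: if a wake arrives early it stays
    // pending and is delivered inside sigsuspend() - no lost wake-up.
    pthread_sigmask(SIG_BLOCK, &block_restart, &suspend_mask);
    sigdelset(&suspend_mask, SIGUSR1);  // unblock it only while suspended

    while (guard.test_and_set(std::memory_order_acquire)) {}  // spin-lock the queue
    waiters.push_back(pthread_self());
    guard.clear(std::memory_order_release);

    pthread_mutex_unlock(m);
    sigsuspend(&suspend_mask);          // atomically unblock the signal and sleep
    pthread_sigmask(SIG_UNBLOCK, &block_restart, nullptr);
    pthread_mutex_lock(m);
}

void cond_signal() {
    pthread_t t{};
    bool found = false;
    while (guard.test_and_set(std::memory_order_acquire)) {}
    if (!waiters.empty()) { t = waiters.front(); waiters.pop_front(); found = true; }
    guard.clear(std::memory_order_release);
    if (found) pthread_kill(t, SIGUSR1); // wake exactly one thread that is waiting now
}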
It depends on your requirements. IF you have no further requirements, and if your process may consume 100% of available CPU time, then you have the rare chance to experiment and try out different mutex and condition variable designs - just try it out and learn about the details. Great thing.
But in reality you are usually bound to an operating system, and so you are captive to the OS's threading primitives, because they represent the only kind of control over - yeah - process/threading/CPU resource usage! So in that case you will not even have the chance to implement your OWN condition variables - at least not ones that aren't based on the primitives the OS provides you!
So... double check your environment, what do you control? What don't you control? And what makes sense?
I've used pthreads a fair bit for concurrent programs, mainly utilising spinlocks, mutexes, and condition variables.
I started looking into multithreading using std::thread and std::mutex, and I noticed that there doesn't seem to be an equivalent of the pthreads spinlock.
Anyone know why this is?
there doesn't seem to be an equivalent of the pthreads spinlock.
Spinlocks are often considered the wrong tool in user space because there is no way to disable thread preemption while the spinlock is held (unlike in the kernel). So a thread can acquire a spinlock and then get preempted, causing all other threads trying to acquire the spinlock to spin unnecessarily (and if those threads are of higher priority, that may even cause a deadlock; threads waiting for I/O may get a priority boost on wake-up). This reasoning also applies to all lockless data structures, unless the data structure is truly wait-free (there aren't many practically useful ones, apart from boost::spsc_queue).
In kernel, a thread that has locked a spinlock cannot be preempted or interrupted before it releases the spinlock. And that is why spinlocks are appropriate there (when RCU cannot be used).
On Linux, one can prevent preemption (not sure if completely, but there have been recent kernel changes toward such a desirable effect) by using isolated CPU cores and FIFO real-time threads pinned to those isolated cores. But that requires a deliberate kernel/machine configuration and an application designed to take advantage of it. Nevertheless, people do use such a setup for business-critical applications, along with lockless (but not wait-free) data structures in user space.
On Linux, there is the adaptive mutex PTHREAD_MUTEX_ADAPTIVE_NP, which spins for a limited number of iterations before blocking in the kernel (similar to InitializeCriticalSectionAndSpinCount). However, that mutex cannot be used through the std::mutex interface because there is no option to customise the non-portable pthread_mutexattr_t before initialising the pthread_mutex_t.
One can enable neither process-sharing, robustness, error-checking nor priority-inversion prevention through the std::mutex interface. In practice, people write their own wrappers around pthread_mutex_t that allow setting the desirable mutex attributes, along with a corresponding wrapper for condition variables. Standard locks like std::unique_lock and std::lock_guard can be reused.
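A minimal sketch of such a wrapper (assuming glibc, where PTHREAD_MUTEX_ADAPTIVE_NP is available; error handling omitted). It models the Lockable requirements, so the standard locks just work:

#include <pthread.h>
#include <mutex>

class adaptive_mutex {
    pthread_mutex_t m_;
public:
    adaptive_mutex() {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP); // spin a bit, then block
        pthread_mutex_init(&m_, &attr);
        pthread_mutexattr_destroy(&attr);
    }
    ~adaptive_mutex() { pthread_mutex_destroy(&m_); }
    adaptive_mutex(const adaptive_mutex&) = delete;
    adaptive_mutex& operator=(const adaptive_mutex&) = delete;

    void lock()     { pthread_mutex_lock(&m_); }
    void unlock()   { pthread_mutex_unlock(&m_); }
    bool try_lock() { return pthread_mutex_trylock(&m_) == 0; }

    pthread_mutex_t* native_handle() { return &m_; } // e.g. for pthread_cond_wait
};

// usage: adaptive_mutex mtx; std::lock_guard<adaptive_mutex> g{mtx};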
IMO, there could be provisions in the std:: APIs to set the desirable mutex and condition variable properties, for example a protected constructor for derived classes that would initialize the native_handle, but there aren't any. That native_handle looks like a good hook for doing platform-specific stuff; however, a derived class would need a constructor to be able to initialize it appropriately. After the mutex or condition variable is initialized, the native_handle is pretty much useless - unless the idea was only to be able to pass it to (C language) APIs that expect a pointer or reference to an initialized pthread_mutex_t.
There is another example of Boost/C++ standard not accepting semaphores on the basis that they are too much of a rope to hang oneself, and that mutex (a binary semaphore, essentially) and condition variable are more fundamental and more flexible synchronisation primitives, out of which a semaphore can be built.
From the point of view of the C++ standard, these are probably the right decisions, because educating users to use spinlocks and semaphores correctly, with all the nuances, is a difficult task, whereas advanced users can whip up a wrapper for pthread_spinlock_t with little effort.
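To illustrate that last point about building a semaphore out of a mutex and a condition variable, here is the textbook construction, sketched with standard C++ primitives (class and member names are illustrative):

#include <condition_variable>
#include <mutex>

class counting_semaphore {
    std::mutex m_;
    std::condition_variable cv_;
    unsigned count_;
public:
    explicit counting_semaphore(unsigned initial) : count_(initial) {}

    void acquire() {                          // P: wait until a permit is free
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return count_ > 0; });
        --count_;
    }
    void release() {                          // V: return a permit, wake one waiter
        { std::lock_guard<std::mutex> lk(m_); ++count_; }
        cv_.notify_one();
    }
};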
You are right, there's no spin lock implementation in the std namespace. A spin lock is a great concept, but in user space it is generally quite poor: the OS doesn't know your process wants to spin, and you can usually get worse results than with a mutex. Note that several platforms implement optimistic spinning, so a mutex can do a really good job. In addition, adjusting the time to "pause" between loop iterations is neither trivial nor portable, and fine tuning is required. TL;DR: don't use a spinlock in user space unless you are really, really sure about what you are doing.
C++ Thread discussion
Article explaining how to write a spin lock with benchmark
Reply by Linus Torvalds about the above article explaining why it's a bad idea
Spin locks have two advantages:
They require much less storage than a std::mutex, because they do not need a queue of threads waiting for the lock. On my system, sizeof(pthread_spinlock_t) is 4, while sizeof(std::mutex) is 40.
They are much more performant than std::mutex, if the protected code region is small and the contention level is low to moderate.
On the downside, a poorly implemented spin lock can hog the CPU. For example, a tight loop with a compare-and-set assembler instruction will spam the cache system with loads and loads of unnecessary writes. But that's what we have libraries for: they implement best practice and avoid common pitfalls. That most user implementations of spin locks are poor is not a reason to not put spin locks into the library. Rather, it is a reason to put them there, to stop users from trying it themselves.
There is a second problem, which arises from the scheduler: if thread A acquires the lock and then gets preempted before it finishes executing the critical section, another thread B could spin "forever" (or at least for many milliseconds, until thread A gets scheduled again) on that lock.
Unfortunately, there is no way for userland code to tell the kernel "please don't preempt me in this critical code section". But if we know that, under normal circumstances, the critical code section executes within 10 ns, we could at least tell thread B: "preempt yourself voluntarily if you have been spinning for over 30 ns". This is not guaranteed to return control directly back to thread A, but it will stop the waste of CPU cycles that would otherwise take place. And in most scenarios where threads A and B run in the same process at the same priority, the scheduler will usually schedule thread A before thread B if B has called std::this_thread::yield().
So I am thinking about a template spin lock class that takes a single unsigned integer as a parameter: the number of memory reads in the critical section. This parameter is then used by the library to calculate the appropriate number of spins before a yield() is performed. With a zero count, yield() would never be called. Something like the sketch below.
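A minimal sketch of that idea (the spin-budget formula is a placeholder assumption; a real implementation would need measurement-based tuning):

#include <atomic>
#include <thread>

template <unsigned MemReads>  // expected number of memory reads in the critical section
class spin_lock {
    std::atomic<bool> locked_{false};
    // Placeholder heuristic: allow ~3 spins' worth of waiting per memory read.
    static constexpr unsigned spin_budget = MemReads * 3;
public:
    void lock() {
        unsigned spins = 0;
        for (;;) {
            // Test-and-test-and-set: spin on plain loads so we don't spam
            // the cache coherency system with useless writes.
            while (locked_.load(std::memory_order_relaxed)) {
                if (spin_budget != 0 && ++spins >= spin_budget) {
                    std::this_thread::yield();  // voluntarily preempt ourselves
                    spins = 0;
                }
            }
            if (!locked_.exchange(true, std::memory_order_acquire))
                return;  // acquired the lock
        }
    }
    void unlock() { locked_.store(false, std::memory_order_release); }
};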
I was trying to find how std::condition_variable::wait is implemented in the standard library on my local machine; I can see wait_until, but I cannot find wait.
My question is: how is the wait function implemented internally? How would one make a thread sleep indefinitely - is it some long timed sleep, or something entirely different that is OS-specific?
Thanks!
Pre-emptive multithreading is a process governed largely by the operating system. It decides which threads get timeslices and/or are assigned to which cores, and so forth. As such, for most low-level threading primitives (mutexes, condition variables, etc.), the real work is done inside OS calls.
Yes, you could in theory implement something like a condition variable with nothing more than atomic accesses and timed thread suspension. However, it would perform extremely poorly. Modern OSes know when a thread is waiting on a condition and can wake that thread up "immediately" when the condition is satisfied. Your mechanism requires that the waiting thread wait until some specific time has passed.
Plus, you'd have a whole bunch of spurious wake-ups that you have to check for, thus using thread time for no reason. The OS-based implementation will have far fewer spurious wake-ups.
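For contrast, this is the kind of mechanism the answer warns about - a toy "wait" built from nothing but an atomic flag and timed sleep (names are illustrative):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> ready{false};

void naive_wait() {
    // Wakes at best ~1 ms after ready becomes true, and burns a scheduler
    // slot on every poll; a real condition variable blocks in the kernel
    // and is woken "immediately" on notify.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}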
Could you help me to understand how to use mutexes in multithread Linux application, where:
during a data write, the variable must be locked against both writes and reads
during a data read, the variable must be locked against writes only.
So simultaneous reads are possible, but a write is a single operation at a time: while a write is in progress, all other operations should wait for it to finish.
You're asking about something that is a bit higher-level than mutexes. A mutex is a simple, low-level device. When one thread has locked a mutex, all the other threads that belong to the same (heavyweight) process are locked out of the protected code: the CPU is either executing the thread that obtained the lock, or it is executing some other process entirely.
You are asking about a read-write lock. Read-write locks use mutexes under the hood. The POSIX functions that deal with read-write locks start with pthread_rwlock_. Since you are on a Linux machine, just type man pthread and look for the section marked "READ/WRITE LOCK ROUTINES".
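A minimal sketch of the pattern (function and variable names are just for illustration):

#include <pthread.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value = 0;

int read_value() {
    pthread_rwlock_rdlock(&lock);   // many readers may hold this concurrently
    int v = shared_value;
    pthread_rwlock_unlock(&lock);
    return v;
}

void write_value(int v) {
    pthread_rwlock_wrlock(&lock);   // exclusive: waits out readers and writers
    shared_value = v;
    pthread_rwlock_unlock(&lock);
}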
You need a reader/writer lock to allow multiple readers/single writer.
Boost.Thread has one of these (boost::shared_mutex), if you have no other preferred threading library. This uses PThreads primitives under the covers, and will probably save you time in wrapping the raw APIs yourself.
I would not recommend implementing this yourself - it's easy to get something that appears to work, but that under load either crashes, kills performance, or (worst of all) silently modifies your data in a way it should not, so you get bad results.
A simple boost::mutex can also be used here, as noted by @Als, but it won't allow multiple concurrent reads. That is simpler to implement and may be sufficient for your needs, depending on your read/write access profile.
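A sketch of the Boost.Thread variant, assuming the library is available (names are illustrative):

#include <boost/thread/shared_mutex.hpp>
#include <boost/thread/locks.hpp>

boost::shared_mutex rw;
int data = 0;

int reader() {
    boost::shared_lock<boost::shared_mutex> rl(rw);  // shared: many readers at once
    return data;
}

void writer(int v) {
    boost::unique_lock<boost::shared_mutex> wl(rw);  // exclusive: one writer
    data = v;
}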
You will need to use mutexes if you have global or static objects which are accessed (read and written) from different threads.
I'm trying to make a C++ API (for Linux and Solaris) thread-safe, so that its functions can be called from different threads without breaking internal data structures. In my current approach I'm using pthread mutexes to protect all accesses to member variables. This means that a simple getter function now locks and unlocks a mutex, and I'm worried about the overhead of this, especially as the API will mostly be used in single-threaded apps where any mutex locking seems like pure overhead.
So, I'd like to ask:
do you have any experience with performance of single-threaded apps that use locking versus those that don't?
how expensive are these lock/unlock calls, compared to e.g. a simple "return this->isActive" access for a bool member variable?
do you know better ways to protect such variable accesses?
All modern thread implementations can handle an uncontended mutex lock entirely in user space (with just a couple of machine instructions) - only when there is contention, the library has to call into the kernel.
Another point to consider is that if an application doesn't explicitly link to the pthread library (because it's a single-threaded application), it will only get dummy pthread functions (which don't do any locking at all) - only if the application is multi-threaded (and links to the pthread library), the full pthread functions will be used.
And finally, as others have already pointed out, there is no point in protecting a getter method for something like isActive with a mutex - once the caller gets a chance to look at the return value, the value might already have been changed (as the mutex is only locked inside the getter method).
"A mutex requires an OS context switch. That is fairly expensive. "
This is not true on Linux, where mutexes are implemented using futexes. Acquiring an uncontended (i.e., not already locked) mutex is, as cmeerw points out, a matter of a few simple instructions, and is typically in the area of 25 nanoseconds on current hardware. A rough sketch of that fast path follows the links below.
For more info:
Futex
Numbers everybody should know
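Illustrative, Linux-only sketch of the futex fast path, in the style of Ulrich Drepper's "Futexes Are Tricky" paper (state values: 0 = unlocked, 1 = locked, 2 = locked with possible waiters):

#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static std::atomic<int> state{0};

static long futex(void* addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void lock() {
    int c = 0;
    // Fast path: one CAS in user space, no kernel involvement at all.
    if (state.compare_exchange_strong(c, 1, std::memory_order_acquire))
        return;
    // Slow path: advertise contention, then sleep in the kernel.
    if (c != 2)
        c = state.exchange(2, std::memory_order_acquire);
    while (c != 0) {
        futex(&state, FUTEX_WAIT, 2);  // sleep only while state is still 2
        c = state.exchange(2, std::memory_order_acquire);
    }
}

void unlock() {
    // If someone may be sleeping (state was 2), wake one waiter.
    if (state.exchange(0, std::memory_order_release) != 1)
        futex(&state, FUTEX_WAKE, 1);
}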
This is a bit off-topic, but you seem to be new to threading, so: for one thing, only lock where threads can overlap, and try to minimize those places. Also, instead of trying to lock every method, think of what the thread is doing (overall) with an object, make that a single call, and lock that. Try to get your locks as high up as possible (this again increases efficiency and may help avoid deadlocking). But locks don't 'compose'; you have to organize your code, at least mentally, around where the threads are and where they overlap.
I did a similar library and didn't have any trouble with lock performance. (I can't tell you exactly how they're implemented, so I can't say conclusively that it's not a big deal.)
I'd go for getting it right first (i.e. use locks) then worry about performance. I don't know of a better way; that's what mutexes were built for.
An alternative for single thread clients would be to use the preprocessor to build a non-locked vs locked version of your library. E.g.:
#ifdef BUILD_SINGLE_THREAD
inline void lock () {}
inline void unlock () {}
#else
inline void lock () { doSomethingReal(); }
inline void unlock () { doSomethingElseReal(); }
#endif
Of course, that adds an additional build to maintain, as you'd distribute both single and multithread versions.
I can tell you from Windows that a mutex is a kernel object and as such incurs a (relatively) significant locking overhead. To get a better-performing lock when all you need is one that works between threads, use a critical section. This does not work across processes, just between the threads of a single process.
However, Linux is quite a different beast when it comes to multi-process locking. I know that a mutex there is implemented using atomic CPU instructions and only applies within a process - so it has about the same performance as a Win32 critical section, i.e. it is very fast.
Of course, the fastest locking is not to have any at all, or to use locks as little as possible. But if your lib is to be used in a heavily threaded environment, you will want to hold locks for as short a time as possible: lock, do something, unlock, do something else, then lock again is better than holding the lock across the whole task. The cost of locking isn't the time taken to lock, but the time a thread sits around twiddling its thumbs waiting for another thread to release a lock it wants!
A mutex requires an OS context switch. That is fairly expensive. The CPU can still do it hundreds of thousands of times per second without too much trouble, but it is a lot more expensive than not having the mutex there. Putting it on every variable access is probably overkill.
It also probably is not what you want. This kind of brute-force locking tends to lead to deadlocks.
do you know better ways to protect such variable accesses?
Design your application so that as little data as possible is shared. Some sections of code should be synchronized, probably with a mutex, but only those that are actually necessary. And typically not individual variable accesses, but tasks containing groups of variable accesses that must be performed atomically. (perhaps you need to set your is_active flag along with some other modifications. Does it make sense to set that flag and make no further changes to the object?)
I was curious about the expense of using pthread_mutex_lock/unlock. I had a scenario where I needed to either copy anywhere from 1500-65K bytes without using a mutex, or to use a mutex and do a single write of a pointer to the needed data.
I wrote a short loop to test each:

gettimeofday(&starttime, NULL);
/* COPY DATA */
gettimeofday(&endtime, NULL);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */

or

gettimeofday(&starttime, NULL);
pthread_mutex_lock(&mutex);
gettimeofday(&endtime, NULL);
pthread_mutex_unlock(&mutex);
timersub(&endtime, &starttime, &timediff);
/* print out timediff data */
If I was copying less than 4000 or so bytes, then the straight copy operation took less time. If however I was copying more than 4000 bytes, then it was less costly to do the mutex lock/unlock.
The timing on the mutex lock/unlock ran between 3 and 5 usec, including the time for the gettimeofday call for the current time, which took about 2 usec.
For member variable access, you should use read/write locks, which have slightly less overhead and allow multiple concurrent reads without blocking.
In many cases you can use atomic builtins, if your compiler provides them (with gcc or icc, __sync_fetch*() and the like), but they are notoriously hard to handle correctly.
If you can guarantee that the access is atomic (for example, on x86 a dword read or write is always atomic if it is aligned, but a read-modify-write is not), you can often avoid locks altogether and use volatile instead, but this is non-portable and requires knowledge of the hardware.
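For illustration, a tiny counter using those gcc builtins (names are illustrative; modern code would prefer std::atomic):

/* GCC/ICC only: __sync builtins issue full memory barriers. */
static volatile int active_count = 0;

void activate(void)   { __sync_fetch_and_add(&active_count, 1); }
void deactivate(void) { __sync_fetch_and_sub(&active_count, 1); }
int  is_active(void)  { return __sync_fetch_and_add(&active_count, 0) > 0; } // atomic read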
Well a suboptimal but simple approach is to place macros around your mutex locks and unlocks. Then have a compiler / makefile option to enable / disable threading.
Ex.
#ifdef THREAD_ENABLED
#define pthread_mutex_lock(x) ... //actual mutex call
#else
#define pthread_mutex_lock(x) ... //do nothing
#endif
Then when compiling do a gcc -DTHREAD_ENABLED to enable threading.
Again, I would NOT use this method in any large project - only if you want something fairly simple.
What is the common theory behind thread communication? I have some primitive idea about how it should work but something doesn't settle well with me. Is there a way of doing it with interrupts?
Really, it's just the same as any concurrency problem: you've got multiple threads of control, and it's indeterminate which statements on which threads get executed when. That means there are a large number of POTENTIAL execution paths through the program, and your program must be correct under all of them.
In general, the place where trouble can occur is where state is shared among the threads (aka "lightweight processes" in the old days). That happens when there are shared memory areas, which is the usual situation for threads of one process.
To ensure correctness, what you need to do is ensure that these data areas get updated in a way that can't cause errors. To do this, you need to identify "critical sections" of the program, where sequential operation must be guaranteed. Those can be as little as a single instruction or line of code; if the language and architecture ensure that these are atomic, that is, can't be interrupted, then you're golden.
Otherwise, you identify that section and put some kind of guard onto it. The classic way is to use a semaphore, which is an atomic operation that only allows one thread of control past at a time. Semaphores were invented by Edsger Dijkstra, and so their operations have names that come from the Dutch, P and V. When you come to a P, only one thread can proceed; all other threads are queued and wait until the executing thread comes to the associated V operation.
Because these primitives are, well, primitive, and because the Dutch names aren't very intuitive, some other larger-scale approaches have been developed.
Per Brinch Hansen invented the monitor, which is basically just a data structure whose operations are guaranteed atomic; monitors can be implemented with semaphores. Monitors are pretty much what Java synchronized statements are based on; they give an object or code block that particular behavior - that is, only one thread can be "in" them at a time - with simpler syntax.
There are other possible models. Haskell and Erlang solve the problem by being functional languages that never allow a variable to be modified once it's created; this means they naturally don't need to worry about synchronization. Some newer languages, like Clojure, instead have a structure called "transactional memory", which basically means that when there is an assignment, you're guaranteed the assignment is atomic and reversible.
So that's it in a nutshell. To really learn about it, the best places to look are operating systems texts, like, e.g., Andy Tanenbaum's text.
The two most common mechanisms for thread communication are shared state and message passing.
The most common way for threads to communicate is via some shared data structure, typically a queue. Some threads put information into the queue while others take it out. The queue must be protected by operating system facilities such as mutexes and semaphores. Interrupts have nothing to do with it.
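A minimal sketch of such a protected queue, using standard C++ primitives (names are illustrative):

#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class shared_queue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void put(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();                  // wake one waiting consumer
    }
    T take() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });  // sleep, don't busy-wait
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
};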
If you're really interested in a theory of thread communications, you may want to look into formalisms like the pi Calculus.
To communicate between threads, you'll need to use whatever mechanism is supplied by your operating system and/or runtime. Interrupts would be unusually low level, although they might be used implicitly if your threads communicate using sockets or named pipes.
A common pattern would be to implement shared state using a shared memory block, relying on an OS-supplied synchronization primitive such as a mutex to spare you from busy-waiting when you read from the block. Remember that if you have threads at all, then you must have some kind of scheduler already (whether native to the OS or emulated in your language runtime), so this scheduler can provide synchronization objects and a "sleep" function without necessarily having to rely on hardware support.
Sockets, pipes, and shared memory work between processes too. Sometimes a runtime will give you a lighter-weight way of doing synchronization for threads within the same process. Shared memory is cheaper within a single process. And sometimes your runtime will also give you an atomic message-passing mechanism.