Thread-local acquire/release synchronization - C++

In general, load-acquire/store-release synchronization is one of the most common forms of memory-ordering based synchronization in the C++11 memory model. It's basically how a mutex provides memory ordering. The "critical section" between a load-acquire and a store-release is always synchronized among different observer threads, in the sense that all observer threads will agree on what happens after the acquire and before the release.
Generally, this is achieved with a read-modify-write instruction, such as a compare-exchange, with acquire ordering when entering the critical section, and a store with release ordering when exiting it.
But there are some situations where you might have a similar critical section[1] between a load-acquire and a release-store, except only one thread actually modifies the synchronization variable. Other threads may read the synchronization variable, but only one thread actually modifies it. In this case, when entering the critical section, you don't need a read-modify-write instruction. You would just need a simple store, since you are not racing with other threads that are attempting to modify the synchronization flag. (This may seem odd, but note that many lock-free memory reclamation deferral patterns, like user-space RCU or epoch based reclamation, use thread-local synchronization variables that are written to only by one thread, but read by many threads, so this isn't too weird of a situation.)
So, when entering the critical section, you could just do something like:
sync_var.store(true, ...);
.... critical section ....
sync_var.store(false, std::memory_order_release);
There is no race, because, again, there is no need for a read-modify-write when only one thread needs to set/unset the critical section variable. Other threads can simply read the critical section variable with a load-acquire.
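For concreteness, here is the reader side of that pattern as a minimal sketch (the function name is illustrative; sync_var is the flag from above):

#include <atomic>

std::atomic<bool> sync_var{false}; // written by one thread, read by many

// Illustrative reader-side check, e.g. a reclaimer scanning per-thread flags.
bool in_critical_section()
{
    // Pairs with the owner's store-release on exit: once this returns
    // false, everything the owner did inside the section is visible here.
    return sync_var.load(std::memory_order_acquire);
}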
The problem is, when you're entering the critical section, you need an acquire operation or fence. But you don't need to do a LOAD, only a STORE. So what is a good way to produce acquire ordering when you only really need a STORE? I see only two real options that fall within the C++ memory model. Either:
Use an exchange instead of a store, so you can do sync_var.exchange(true, std::memory_order_acquire). The downside here is that exchange is a more heavy-weight read-modify-write operation, when all you really need is a simple store.
Insert a "dummy" load-acquire, like:
(void)sync_var.load(std::memory_order_acquire);
sync_var.store(true, std::memory_order_relaxed);
The "dummy" load-acquire seems better. Presumably, the compiler can't optimize away the unused load, because it's an atomic instruction that has the side-effect of producing a "synchronizes-with" relationship with a release operation on sync_var. But it also seems very hacky, and the intention is unclear without comments explaining what's going on.
So what is the best way to produce acquire semantics when all we need to do is a simple store?
[1] I use the term "critical section" loosely. I don't necessarily mean a section that is always accessed via mutual exclusion. Rather, I just mean any section where memory ordering is synchronized via acquire-release semantics. This could refer to a mutex, or it could just mean something like RCU, where the critical section can be accessed concurrently by multiple readers.

The flaw in your logic is the assumption that an atomic RMW is not required just because the data in the critical section is modified by a single thread while all other threads only have read access.
This is not true; there still needs to be a well-defined ordering between reading and writing.
You don't want data to be modified while another thread is still reading it. Therefore, each thread needs to inform other threads when it has finished accessing the data.
If you only use an atomic store to enter the critical section, the 'synchronizes-with' relationship cannot be established.
Acquire/release synchronization is based on a runtime relationship where the acquirer knows that synchronization is complete only after observing a particular value returned by the atomic load.
This can never be achieved by a single atomic store, since the one modifying thread can change the atomic variable sync_var at any time,
and as such it has no way of knowing whether another thread is still reading the data.
The option with a 'dummy' load/acquire is also invalid, because it fails to inform other threads that it wants exclusive access. You attempt to solve that with a single (relaxed) store,
but the load and the store are separate operations that can be interleaved with operations from other threads (i.e. multiple threads simultaneously entering the critical area).
An atomic RMW must be used by each thread to load a particular value and, at the same time, update the variable to inform all other threads that it now has exclusive access
(regardless of whether that access is for reading or writing).
#include <atomic>

std::atomic<bool> sync_var{false};

void lock()
{
    // RMW: atomically observe the old value and claim the lock.
    while (sync_var.exchange(true, std::memory_order_acquire));
}

void unlock()
{
    // Publish everything done inside the critical section.
    sync_var.store(false, std::memory_order_release);
}
Optimizations are possible where multiple threads have read access at the same time (e.g. std::shared_mutex).
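For example, with std::shared_mutex (C++17) many readers can hold the lock at once while a writer gets exclusive access; a minimal sketch (the names are illustrative):

#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_data = 0;

int read_data()
{
    std::shared_lock lock(rw_mutex); // shared: many readers at once
    return shared_data;
}

void write_data(int value)
{
    std::unique_lock lock(rw_mutex); // exclusive: one writer at a time
    shared_data = value;
}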

Related


Memory fences: acquire/load and release/store

My understanding of std::memory_order_acquire and std::memory_order_release is as follows:
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence.
Release means that no memory accesses which appear before the release fence can be reordered to after the fence.
What I don't understand is why with the C++11 atomics library in particular, the acquire fence is associated with load operations, while the release fence is associated with store operations.
To clarify, the C++11 <atomic> library enables you to specify memory fences in two ways: either you can specify a fence as an extra argument to an atomic operation, like:
x.load(std::memory_order_acquire);
Or you can use std::memory_order_relaxed and specify the fence separately, like:
x.load(std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_acquire);
What I don't understand is, given the above definitions of acquire and release, why does C++11 specifically associate acquire with load, and release with store? Yes, I've seen many of the examples that show how you can use an acquire/load with a release/store to synchronize between threads, but in general it seems that the idea of acquire fences (prevent memory reordering after statement) and release fences (prevent memory reordering before statement) is orthogonal to the idea of loads and stores.
So, why, for example, won't the compiler let me say:
x.store(10, std::memory_order_acquire);
I realize I can accomplish the above by using memory_order_relaxed, and then a separate atomic_thread_fence(memory_order_acquire) statement, but again, why can't I use store directly with memory_order_acquire?
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
Say I write some data, and then I write an indication that the data is now ready. It's imperative that any other thread that sees the indication that the data is ready also sees the write of the data itself. So prior writes cannot move past that write.
Say I read that some data is ready. It's imperative that any reads I issue after seeing that take place after the read that saw the data was ready. So subsequent reads cannot be reordered before that read.
So when you do a synchronized write, you typically need to make sure that all writes you did before that are visible to anyone who sees the synchronized write. And when you do a synchronized read, it's typically imperative that any reads you do after that take place after the synchronized read.
Or, to put it another way, an acquire is typically reading that you can take or access the resource, and subsequent reads and writes must not be moved before it. A release is typically writing that you are done with the resource, and preceding writes must not be moved to after it.
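That data-ready pattern, sketched with std::atomic (variable names are illustrative):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0; // plain data, published via the flag

void producer()
{
    payload = 42;                                 // write the data first...
    ready.store(true, std::memory_order_release); // ...then release: publish it
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire)) // acquire: wait for the flag
        ;
    int v = payload; // guaranteed to read 42: the acquire load synchronized
                     // with the release store
}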
(Partial answer correcting a mistake in the early part of the question. David Schwartz's answer already nicely covers the main question you're asking. Jeff Preshing's article on acquire / release is also good reading for another take on it.)
The definitions you gave for acquire / release are wrong for fences; they only apply to acquire operations and release operations, like x.store(mo_release), not std::atomic_thread_fence(mo_release).
Acquire means that no memory accesses which appear after the acquire fence can be reordered to before the fence. [wrong, would be correct for acquire operation]
Release means that no memory accesses which appear before the release fence can be reordered to after the fence. [wrong, would be correct for release operation]
They're insufficient for fences, which is why ISO C++ has stronger ordering rules for acquire fences (blocking LoadStore / LoadLoad reordering) and release fences (LoadStore / StoreStore).
Of course ISO C++ doesn't define "reordering"; that would imply there is some global coherent state that you're accessing. ISO C++ instead defines ordering in terms of synchronizes-with and happens-before relationships, which constrain the values that loads are allowed to observe.
Jeff Preshing's articles are relevant here:
Acquire and Release Semantics (acquire / release operations such as loads, stores, and RMWs)
Acquire and Release Fences Don't Work the Way You'd Expect explains why those one-way barrier definitions are incorrect and insufficient for fences, unlike for operations. (Because it would let the fence reorder all the way to one end of your program and leave all the operations unordered wrt. each other, because it's not tied to an operation itself.)
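Concretely, a fence only takes effect through the atomic operations around it: a release fence must be followed by an atomic store, and an acquire fence must be preceded by an atomic load. A sketch of the standard pairing (ready is an illustrative flag, not taken from the answer):

#include <atomic>

std::atomic<bool> ready{false};
int data = 0;

void publish()
{
    data = 42;
    // Release fence: orders the write of 'data' before the *subsequent*
    // relaxed store, giving that store release semantics.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

void observe()
{
    if (ready.load(std::memory_order_relaxed))
    {
        // Acquire fence: orders the *prior* relaxed load before later
        // accesses, giving that load acquire semantics.
        std::atomic_thread_fence(std::memory_order_acquire);
        int v = data; // sees 42 if the load above returned true
    }
}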
A possible use case for this might be if I want to ensure that some store, say x = 10, happens before some other statement executes that might affect other threads.
If that "other statement" is a load from an atomic shared variable, you actually need std::memory_order_seq_cst to avoid StoreLoad reordering. acquire / release / acq_rel won't block that.
If you mean make sure the atomic store is visible before some other atomic store, the normal way is to make the 2nd atomic store use mo_release.
If the 2nd store isn't atomic, it's unlikely any reader could safely sync with anything in a way that it could observe the value without data-race UB.
(Although you do run into a use case for a release fence when hacking up a SeqLock that uses plain non-atomic objects for the payload, to allow a compiler to optimize. But that's an implementation-specific behaviour that depends on knowing how std::atomic stuff compiles for real CPUs. See Implementing 64 bit atomic counter with 32 bit atomics for example.)
A std::memory_order_acquire fence only ensures that loads after the fence are not reordered before loads that precede it, so memory_order_acquire cannot ensure that a store becomes visible to other threads before subsequent loads execute. This is why memory_order_acquire is not supported for store operations; you may need memory_order_seq_cst to achieve acquire-like ordering for a store.
As an alternative, you may write
x.store(10, std::memory_order_relaxed);
x.load(std::memory_order_acquire); // this introduces a dependency on the store
to ensure that later loads are not reordered before the store. Again, a fence would not work here.
Besides, a memory order attached to an atomic operation can be cheaper than a standalone memory fence, because it only constrains ordering relative to that atomic instruction, not all instructions before and after the fence.
See also the formal description and explanation for details.

Atomic action - mutex

I heard that there is something known as an "atomic action" which is faster than using a mutex with a critical section.
Does somebody know what is it, and how do I use it?
An atomic operation is an operation where the CPU reads and writes memory during the same bus access, which prevents other CPUs or system devices from modifying the memory simultaneously. E.g. a "test and set" operation, which could do "read memory at location X; if it is 0, set it to 1, and return an indicator telling whether the value was set" without any chance of simultaneous access.
Wikipedia's http://en.wikipedia.org/wiki/Linearizability describes atomic operations.
If you're on Windows, have a look at e.g. InterlockedExchange or InterlockedIncrement, which are wrappers for atomic operations.
EDIT: Sample usage
A spinlock could be implemented using a test-and-set atomic operation:
while (test_and_set(&x) == 1) ;
This will keep looping until the current thread is the one that sets x to 1. If all other threads treat x in the same way, its effect is the same as a mutex.
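In portable C++11 the same spinlock can be written with std::atomic_flag, whose test_and_set is guaranteed to be lock-free; a minimal sketch:

#include <atomic>

std::atomic_flag flag = ATOMIC_FLAG_INIT;

void spin_lock()
{
    // test_and_set returns the previous value: keep spinning while it was
    // already set, i.e. while another thread holds the lock.
    while (flag.test_and_set(std::memory_order_acquire))
        ;
}

void spin_unlock()
{
    flag.clear(std::memory_order_release);
}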
"Atomic action" refers only to the fact that an action will be performed atomically, uninterrupted by co-running threads/processes.
What you are probably looking for are atomic built-ins in compilers. For example GCC provides this set: http://gcc.gnu.org/onlinedocs/gcc-4.5.1/gcc/Atomic-Builtins.html
These are usually implemented very efficiently using CPU support.
It is a tradeoff. As other posters have stated, an atomic operation is "try to grab this flag and return true on success". It is fast, but there are downsides.
A proper mutex blocks the threads that need to get into the critical section. With only atomic operations, the waiting threads have to loop until they get the flag - this wastes CPU cycles. Another difference is that mutexes can provide fair access - usually by queueing the waiting threads in FIFO order - whereas with spinlocks there's a risk of resource starvation.
So bare atomic operations are faster, but only when there aren't too many threads trying to grab the critical section.
Since a GCC-specific answer was given, here's the VC++-specific link -- all of the intrinsics listed that begin with _Interlocked are relevant: http://msdn.microsoft.com/en-us/library/hd9bdb82.aspx. Also note that there are more intrinsics available for x64 than for x86: http://msdn.microsoft.com/en-us/library/azcs88h2.aspx.
The critical section implementation (certainly on Windows) uses an atomic variable to detect whether the critical section (CS) is already held by another thread, and enters a kernel-side synchronization primitive only if a real collision occurs. So if you need to protect a small piece of code and the collision probability is small enough, a critical section is a good solution.
However, if the protected code does nothing except increment/decrement or test and modify a single variable, then it's the right case for an atomic operation.
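For instance, a shared counter needs no lock at all; a sketch with std::atomic (the names are illustrative):

#include <atomic>

std::atomic<int> counter{0};

void on_event()
{
    // One lock-free RMW instruction instead of lock/increment/unlock.
    counter.fetch_add(1, std::memory_order_relaxed);
}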
It is also possible to protect more complicated code with atomics (search for "lock-free structures" and "transactional memory" for more info).
That is interesting but very complicated stuff, and may not be recommended if some simple solution (like a critical section) also works.

Do I need a mutex for reading?

I have a class that has a state (a simple enum) and that is accessed from two threads. For changing state I use a mutex (boost::mutex). Is it safe to check the state (e.g. compare state_ == ESTABLISHED) or do I have to use the mutex in this case too? In other words do I need the mutex when I just want to read a variable which could be concurrently written by another thread?
It depends.
The C++ language itself (prior to C++11) says nothing about threads or atomicity.
But on most modern CPUs, reading an integer is an atomic operation, which means that you will always read a consistent value, even without a mutex.
However, without a mutex, or some other form of synchronization, the compiler and CPU are free to reorder reads and writes, so anything more complex, anything involving accessing multiple variables, is still unsafe in the general case.
Assuming the writer thread updates some data, and then sets an integer flag to inform other threads that data is available, this could be reordered so the flag is set before updating the data. Unless you use a mutex or another form of memory barrier.
So if you want correct behavior, you don't need a mutex as such, and it's no problem if another thread writes to the variable while you're reading it. It'll be atomic unless you're working on a very unusual CPU. But you do need a memory barrier of some kind to prevent reordering in the compiler or CPU.
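In C++11 and later, the simplest correct form of this is to make the state itself atomic; a sketch assuming an illustrative State enum:

#include <atomic>

enum class State { Init, Established, Closed };
std::atomic<State> state_{State::Init};

bool is_established()
{
    // Atomic load: no torn read, and acquire ordering makes data written
    // before the state change visible to this thread.
    return state_.load(std::memory_order_acquire) == State::Established;
}

void establish()
{
    state_.store(State::Established, std::memory_order_release);
}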
You have two threads exchanging information, so yes, you need a mutex, and you probably also need a condition variable to wait on.
In your example, the comparison state_ == ESTABLISHED indicates that thread #2 is waiting for thread #1 to initiate a connection/state. Without a mutex or condition variables/events, thread #2 has to poll the status continuously.
Threads are used to increase performance (or improve responsiveness), but polling usually decreases performance, either by consuming a lot of CPU or by introducing latency due to the poll interval.
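A condition variable lets thread #2 sleep instead of polling; a minimal sketch (the names are illustrative):

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;
bool established = false;

void wait_for_established() // thread #2: sleeps instead of polling
{
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return established; });
}

void set_established() // thread #1
{
    {
        std::lock_guard<std::mutex> lock(m);
        established = true;
    }
    cv.notify_all();
}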
Yes. If thread A reads a variable while thread B is writing to it, you can read an undefined value. The read and write operations are not atomic, especially on a multi-processor system.
Generally speaking you don't, if your variable is declared with "volatile". And ONLY if it is a single variable - otherwise you should be really careful about possible races.
Actually, there is no reason to lock access to the object for reading; you only want to lock it while writing to it. This is exactly what a reader-writer lock does: it doesn't lock the object as long as there are no write operations, which improves performance and prevents deadlocks. See the following links for more elaborate explanations:
Wikipedia
CodeProject
Access to the enum (read or write) should be guarded.
Another thing:
if thread contention is low and the threads belong to the same process, then a critical section would be better than a mutex.

Why do locks work?

If the locks make sure only one thread accesses the locked data at a time, then what controls access to the locking functions?
I thought that boost::mutex::scoped_lock should be at the beginning of each of my functions so the local variables don't get modified unexpectedly by another thread. Is that correct? What if two threads are trying to acquire the lock at very close times? Won't the lock's internal local variables be corrupted by the other thread?
My question is not boost-specific but I'll probably be using that unless you recommend another.
You're right, when implementing locks you need some way of guaranteeing that two processes don't get the lock at the same time. To do this, you need to use an atomic instruction - one that's guaranteed to complete without interruption. One such instruction is test-and-set, an operation that will get the state of a boolean variable, set it to true, and return the previously retrieved state.
This allows you to write code that continually tests whether it can get the lock. Assume x is a variable shared between threads:
while(testandset(x));
// ...
// critical section
// this code can only be executed by one thread at a time
// ...
x = 0; // set x to 0, allow another process into critical section
Since the other threads continually test the lock until they're let into the critical section, this is a very inefficient way of guaranteeing mutual exclusion. However, using this simple concept, you can build more complicated control structures like semaphores, which are much more efficient (because the waiting processes sleep instead of looping).
You only need to have exclusive access to shared data. Unless they're static or on the heap, local variables inside functions will have different instances for different threads and there is no need to worry. But shared data (stuff accessed via pointers, for example) should be locked first.
As for how locks work, they're carefully designed to prevent race conditions and often have hardware-level support to guarantee atomicity; that is, there are machine-language constructs guaranteed to be atomic. Semaphores (and mutexes) may be implemented via these.
The simplest explanation is that the locks, way down underneath, are based on a hardware instruction that is guaranteed to be atomic and can't clash between threads.
Ordinary local variables in a function are already specific to an individual thread. It's only statics, globals, or other data that can be simultaneously accessed by multiple threads that needs to have locks protecting it.
The mechanism that operates the lock controls access to it.
Any locking primitive needs to be able to communicate changes between processors, so it's usually implemented on top of bus operations, i.e., reading and writing to memory. It also needs to be structured such that two threads attempting to claim it won't corrupt its state. It's not easy, but you can usually trust that any OS-implemented lock will not get corrupted by multiple threads.