In a lock-free queue.pop(), I read a trivially copyable variable (of integral type) after synchronization with an atomic acquire inside a loop.
Minimized pseudocode:
// somewhere else: writePosition.store(..., std::memory_order_release)
bool pop(size_t& returnValue) {
    auto writePos = writePosition.load(std::memory_order_acquire);
    auto oldReadPosition = readPosition.load(std::memory_order_relaxed);
    size_t value{};
    size_t newReadPosition{};
    do {
        if (oldReadPosition == writePos)
            return false; // queue empty (check implied by the acquire load above)
        value = data[oldReadPosition];
        newReadPosition = oldReadPosition + 1;
    } while (!readPosition.compare_exchange_weak(oldReadPosition, newReadPosition,
                                                 std::memory_order_relaxed));
    // here we are the owner of the value
    returnValue = value;
    return true;
}
The memory of data[oldReadPosition] can only be changed if this value was already read by another thread before.
The read and write positions are ABA-safe.
With a simple copy, value = data[oldReadPosition], the memory of data[oldReadPosition] will not be changed.
But a writer thread calling queue.push(...) can change data[oldReadPosition] while we are reading it, if another thread has already read oldReadPosition and advanced readPosition.
It would be a race condition if you used the value, but is it also a race condition, and thus undefined behavior, when we leave value untouched? The standard is not specific enough, or I don't understand it.
IMO, this should be possible, because it has no effect.
I would be very happy to get a qualified answer and gain deeper insight.
Thanks a lot.
Yes, it's UB in ISO C++; value = data[oldReadPosition] in the C++ abstract machine involves reading the value of that object. (Usually that means lvalue to rvalue conversion, IIRC.)
But it's mostly harmless, probably only going to be a problem on machines with hardware race detection (not normal mainstream CPUs, but possibly on C++ implementations like clang with ThreadSanitizer).
Another use-case for a non-atomic read followed by a check for possible tearing is the SeqLock, where readers can prove no tearing occurred by reading the same value from an atomic counter before and after the non-atomic read. It's UB in C++, even with volatile for the non-atomic data, although volatile may be helpful in making sure the compiler-generated asm is safe. (With memory barriers and current handling of atomics by existing compilers, even non-volatile makes working asm.) See Optimal way to pass a few variables between 2 threads pinning different CPUs.
atomic_thread_fence is still necessary for a SeqLock to be safe, and some of the necessary ordering of atomic loads wrt. the non-atomic ones is effectively an implementation detail, since ISO C++ can't create a happens-before unless the loads sync-with something.
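Here's a minimal sketch of that SeqLock pattern (names invented for illustration; the non-atomic payload accesses are exactly the part that is UB in ISO C++ but works on current compilers):

#include <atomic>
#include <cstdint>

struct SeqLockData {
    std::atomic<uint32_t> seq{0};
    uint64_t payload[4];   // non-atomic data, written by a single writer
};

void publish(SeqLockData& d, const uint64_t (&src)[4]) {
    uint32_t s = d.seq.load(std::memory_order_relaxed);
    d.seq.store(s + 1, std::memory_order_relaxed);        // odd: write in progress
    std::atomic_thread_fence(std::memory_order_release);  // keep payload stores after the odd store
    for (int i = 0; i < 4; ++i) d.payload[i] = src[i];    // non-atomic stores
    d.seq.store(s + 2, std::memory_order_release);        // even: payload visible before this
}

bool try_read(const SeqLockData& d, uint64_t (&dst)[4]) {
    uint32_t s1 = d.seq.load(std::memory_order_acquire);
    if (s1 & 1) return false;                             // writer active, caller retries
    for (int i = 0; i < 4; ++i) dst[i] = d.payload[i];    // non-atomic reads: the UB part
    std::atomic_thread_fence(std::memory_order_acquire);  // keep the re-check after the copy
    return d.seq.load(std::memory_order_relaxed) == s1;   // counter unchanged => no tearing
}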
People do use SeqLocks in real life, depending on the fact that real-life compilers de-facto define a bit more behaviour than ISO C++ does. Or, to put it another way, they happen to work for now; if you're careful about what code you put around the non-atomic read, it's unlikely that a compiler can do anything problematic.
But you're definitely venturing out past the safe area of guaranteed behaviour, and probably need to understand how C++ compiles to asm, and how asm works on the target platforms you care about; see also Who's afraid of a big bad optimizing compiler? on LWN; it's aimed at Linux kernel code, which is the main user of hand-rolled atomics and stuff like that.
I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
#include <atomic>
std::atomic<int> y(0);
void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}
Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?
Here's the code in compiler explorer.
The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
It doesn't depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.
So why don't compilers do this optimization?
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
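For example, a hypothetical progress-reporting loop (names invented) where sinking the stores would produce exactly that behaviour:

#include <atomic>

std::atomic<int> progress{0};   // polled by a UI thread

void do_work(int total) {
    for (int i = 0; i < total; ++i) {
        // ... pure computation on private data, no other atomics or fences ...
        progress.store(i * 100 / total, std::memory_order_relaxed);
    }
    progress.store(100, std::memory_order_relaxed);
    // A compiler permitted to coalesce atomics could legally sink every
    // store out of the loop and perform only the final progress = 100.
}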
There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if (x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0);    // release a lock before a long-running loop
    for () {...}   // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a single y.store(0) here.
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).
You are referring to dead-store elimination.
It is not forbidden to eliminate an atomic dead store, but it is harder to prove that an atomic store qualifies as such.
Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.
from N4455 No Sane Compiler Would Optimize Atomics
The problem with atomic DSE, in the general case, is that it involves looking for synchronization points. In my understanding, this term means points in the code where there is a happens-before relationship between an instruction on a thread A and an instruction on another thread B.
Consider this code executed by a thread A:
y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);
Can it be optimised as y.store(3, std::memory_order_seq_cst)?
If a thread B is waiting to see y = 2 (e.g. with a CAS) it would never observe that if the code gets optimised.
However, in my understanding, having B looping and CASsing on y = 2 is a data race as there is not a total order between the two threads' instructions.
An execution where A's instructions are executed before B's loop is observable (i.e. allowed), and thus the compiler can optimise to a single y.store(3, std::memory_order_seq_cst).
If threads A and B are synchronized, somehow, between the stores in thread A then the optimisation would not be allowed (a partial order would be induced, possibly leading to B potentially observing y = 2).
Proving that there is no such synchronization is hard, as it involves considering a broader scope and taking into account all the quirks of an architecture.
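For instance, a sketch (hypothetical; any operation that creates a happens-before between the stores would do) of a synchronization point that forbids folding away the middle store:

#include <atomic>
#include <mutex>

std::atomic<int> y{0};
std::mutex m;

void thread_A() {
    y.store(1, std::memory_order_seq_cst);
    y.store(2, std::memory_order_seq_cst);
    m.lock();    // synchronization point between the stores
    m.unlock();
    y.store(3, std::memory_order_seq_cst);
}

void thread_B() {
    std::lock_guard<std::mutex> lk(m);
    int v = y.load(std::memory_order_seq_cst);
    // If B acquires the mutex after A released it, A's y=2 store
    // happens-before this load, so v must be at least 2. Eliding that
    // store could let B observe 1 here, which the abstract machine
    // forbids, so the three stores may no longer be folded into one.
    (void)v;
}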
To my understanding, due to the relative youth of atomic operations and the difficulty of reasoning about memory ordering, visibility and synchronization, compilers don't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.
I believe your example is a simplification of the counting thread given above; as it doesn't have any other thread or any synchronization point, as far as I can see, I suppose the compiler could have optimised the three stores.
While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on its value. The example you gave is so specific that compiler developers don't see it as worth optimizing. However, if one thread is setting, e.g., consecutive values for an atomic (0, 1, 2, etc.), the other thread may be putting something in the slots indicated by the value of the atomic, as in the sketch below.
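A minimal sketch of that slot pattern (all names invented). Note that a consumer which drains every slot up to the latest published index still works even if it never observes some intermediate values of the atomic:

#include <atomic>
#include <array>

std::atomic<int> ready{-1};        // index of the last slot that is ready
std::array<int, 3> slots{};

void producer() {
    for (int i = 0; i < 3; ++i) {
        slots[i] = i * 10;                         // stand-in for real work
        ready.store(i, std::memory_order_release); // publish slot i
    }
}

int consumer() {
    int sum = 0, done = -1;
    while (done < 2) {
        int cur = ready.load(std::memory_order_acquire);
        for (int i = done + 1; i <= cur; ++i)      // drain all published slots
            sum += slots[i];
        done = cur;
    }
    return sum;
}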
NB: I was going to comment this but it's a bit too wordy.
One interesting fact is that this behavior isn't, in the terms of C++, a data race.
Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):
The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic.
Also on p.11 note 5 :
“Relaxed” atomic operations are not synchronization operations even though, like synchronization operations, they cannot contribute to data races.
So a conflicting action on an atomic is never a data race - in terms of the C++ standard.
These operations are all atomic (and specifically relaxed) but no data race here folks!
I agree there's no reliable/predictable difference between these two on any (reasonable) platform:

#include <atomic>

std::atomic<int> y(0);

void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}

and

#include <atomic>

std::atomic<int> y(0);

void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
}
But within the definition provided by the C++ memory model, it isn't a data race.
I can't easily understand why that definition is provided, but it does hand the developer a few cards with which to engage in haphazard communication between threads that they may know (on their platform) will statistically work.
For example, setting a value 3 times then reading it back will show some degree of contention for that location. Such approaches aren't deterministic but many effective concurrent algorithms aren't deterministic.
For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.
What the C++ standard appears to provide is certainty around 'data races', while permitting certain fun-and-games with race conditions, which are, on final analysis, different things.
In short, the standard appears to specify that where other threads may see the 'hammering' effect of a value being set 3 times, other threads must be able to see that effect (even if they sometimes may not!).
And on pretty much all modern platforms it's the case that another thread may, under some circumstances, see the hammering.
In short, because the standard (for example the paragraphs around and below 20 in [intro.multithread]) disallows it.
There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).
If your thread writes three values to memory (let's say 1, 2, and 3) one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the same order as they happen (either by scheduling or coincidence, or whatever reason). That's a guarantee.
How is this possible if you only do half of the writes (or even only a single one)? It isn't.
What if your thread instead writes out 1, -1, -1, but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?
It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.
A practical use case for the pattern, if the thread does something important between updates that does not depend on or modify y, might be: Thread 2 reads the value of y to check how much progress Thread 1 has made.
So, maybe Thread 1 is supposed to load the configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 is waiting on step 2 to complete so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)
I'm pretty sure a conforming implementation allows Thread 1 not to update y at any intermediate step; while I haven't pored over the language standard, I would be shocked if it did not support hardware on which another thread polling y might never see the value 2.
However, that is a hypothetical instance where it might be pessimal to optimize away the status updates. Maybe a compiler dev will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub your toe.
The compiler writer cannot just perform the optimisation. They must also convince themselves that the optimisation is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimisation itself.
On the other hand, I could imagine that in practice (that is in programs that are supposed to do a job, and not benchmarks), this optimisation will save very little in execution time.
So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.
Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Let's assume there's some non-trivial work being done between the stores, and that such work does not involve y at all (so that data path analysis can determine that the three stores are in fact redundant, at least within this thread), and does not itself introduce any memory barriers (so that something else doesn't force the stores to be visible to other threads). Now it is quite possible that other threads have an opportunity to get work done between the stores, and perhaps those other threads manipulate y and that this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.
Since variables contained within a std::atomic object are expected to be accessed from multiple threads, one should expect them to behave, at a minimum, as if they were declared with the volatile keyword.
That was the standard and recommended practice before CPU architectures introduced cache lines, etc.
[EDIT2] One could argue that std::atomic<> variables are the volatile variables of the multicore age. As defined in C/C++, volatile is only good enough to synchronize atomic reads from a single thread with an ISR modifying the variable (which, in this case, is effectively an atomic write as seen from the main thread).
I personally am relieved that no compiler would optimize away writes to an atomic variable. If the write is optimized away, how can you guarantee that each of these writes could potentially be seen by readers in other threads? Don't forget that that is also part of the std::atomic<> contract.
Consider this piece of code, where the result would be greatly affected by wild optimization by the compiler.
#include <atomic>
#include <thread>

static const int N{ 1000000 };
std::atomic<int> flag{ 1 };
std::atomic<bool> do_run{ true };

void write_1()
{
    while (do_run.load())
    {
        // hammer the flag with 1, sixteen stores per iteration
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
    }
}

void write_0()
{
    while (do_run.load())
    {
        // hammer the flag with -1, four stores per iteration
        flag = -1; flag = -1; flag = -1; flag = -1;
    }
}

int main(int argc, char** argv)
{
    int counter{};
    std::thread t0(&write_0);
    std::thread t1(&write_1);
    for (int i = 0; i < N; ++i)
    {
        counter += flag;            // sample whichever value won the race
        std::this_thread::yield();
    }
    do_run = false;
    t0.join();
    t1.join();
    return counter;
}
[EDIT] At first, I was not claiming that volatile was central to the implementation of atomics, but...
Since there seemed to be doubts as to whether volatile had anything to do with atomics, I investigated the matter. Here's the atomic implementation from the VS2017 STL. As I surmised, the volatile keyword is everywhere.
// from file atomic, line 264...
// TEMPLATE CLASS _Atomic_impl
template<unsigned _Bytes>
struct _Atomic_impl
{ // struct for managing locks around operations on atomic types
typedef _Uint1_t _My_int; // "1 byte" means "no alignment required"
constexpr _Atomic_impl() _NOEXCEPT
: _My_flag(0)
{ // default constructor
}
bool _Is_lock_free() const volatile
{ // operations that use locks are not lock-free
return (false);
}
void _Store(void *_Tgt, const void *_Src, memory_order _Order) volatile
{ // lock and store
_Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
}
void _Load(void *_Tgt, const void *_Src,
memory_order _Order) const volatile
{ // lock and load
_Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
}
void _Exchange(void *_Left, void *_Right, memory_order _Order) volatile
{ // lock and exchange
_Atomic_exchange(&_My_flag, _Bytes, _Left, _Right, _Order);
}
bool _Compare_exchange_weak(
void *_Tgt, void *_Exp, const void *_Value,
memory_order _Order1, memory_order _Order2) volatile
{ // lock and compare/exchange
return (_Atomic_compare_exchange_weak(
&_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
}
bool _Compare_exchange_strong(
void *_Tgt, void *_Exp, const void *_Value,
memory_order _Order1, memory_order _Order2) volatile
{ // lock and compare/exchange
return (_Atomic_compare_exchange_strong(
&_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
}
private:
mutable _Atomic_flag_t _My_flag;
};
All of the specializations in the MS stl use volatile on the key functions.
Here's the declaration of one of such key function:
inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt, _Uint8_t *_Exp, _Uint8_t _Value, memory_order _Order1, memory_order _Order2)
You will notice the required volatile uint8_t* holding the value contained in the std::atomic. This pattern can be observed throughout the MS std::atomic<> implementation. There is no reason for the gcc team, nor any other STL provider, to have done it differently.
In a multi-producer, multi-consumer situation, if producers are writing into an int a and consumers are reading from int a, do I need memory barriers around int a?
We all learned that shared resources should always be protected, and that the standard does not guarantee proper behavior otherwise.
However, on cache-coherent architectures visibility is ensured automatically, and atomicity of aligned 8-, 16-, 32- and 64-bit MOV operations is guaranteed.
Therefore, why protect int a at all?
At least in C++11 (or later), you don't need to (explicitly) protect your variable with a mutex or memory barriers.
You can use std::atomic to create an atomic variable. Changes to that variable are guaranteed to propagate across threads.
std::atomic<int> a;
// thread 1:
a = 1;
// thread 2 (later):
std::cout << a; // shows `a` has the value 1.
Of course, there's a little more to it than that--for example, there's no guarantee that std::cout works atomically, so you probably will have to protect that (if you try to write from more than one thread, anyway).
It's then up to the compiler/standard library to figure out the best way to handle the atomicity requirements. On a typical architecture that ensures cache coherence, it may mean nothing more than "don't allocate this variable in a register". It could impose memory barriers, but is only likely to do so on a system that really requires them.
On real world C++ implementations where volatile worked as a pre-C++11 way to roll your own atomics (i.e. all of them), no barriers are needed for inter-thread visibility, only for ordering wrt. operations on other variables. Most ISAs do need special instructions or barriers for the default memory_order_seq_cst.
On the other hand, explicitly specifying memory ordering (especially acquire and release) for an atomic variable may allow you to optimize the code a bit. By default, an atomic uses sequential ordering, which basically acts like there are barriers before and after access--but in a lot of cases you only really need one or the other, not both. In those cases, explicitly specifying the memory ordering can let you relax the ordering to the minimum you actually need, allowing the compiler to improve optimization.
(Not all ISAs actually need separate barrier instructions even for seq_cst; notably AArch64 just has a special interaction between stlr and ldar to stop seq_cst stores from reordering with later seq_cst loads, on top of acquire and release ordering. So it's as weak as the C++ memory model allows, while still complying with it. But weaker orders, like memory_order_acquire or relaxed, can avoid even that blocking of reordering when it's not needed.)
However, on cache-coherent architectures visibility is ensured automatically, and atomicity of aligned 8-, 16-, 32- and 64-bit MOV operations is guaranteed.
Unless you strictly adhere to the requirements of the C++ spec to avoid data races, the compiler is not obligated to make your code function the way it appears to. For example:
int a = 0, b = 0; // shared variables, initialized to zero
a = 1;
b = 1;
Say you do this on your fully cache-coherent architecture. On such hardware it would seem that since a is written before b no thread will ever be able to see b with a value of 1 without a also having that value.
But this is not the case. If you have failed to strictly adhere to the requirements of the C++ memory model for avoiding data races, e.g. you read these variables without the correct synchronization primitives being inserted anywhere, then your program may in fact observe b being written before a. The reason is that you have introduced "undefined behavior", and the C++ implementation has no obligation to do anything that makes sense to you.
What may be going on in practice, is that the compiler may reorder writes even if the hardware works very hard to make it seem as if all writes occur in the order of the machine instructions performing the writes. You need the entire toolchain to cooperate, and cooperation from just the hardware, such as strong cache coherency, is not sufficient.
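A hedged sketch of what "the correct synchronization primitives" look like in C++11 for this exact example (the atomic flag and the names are illustrative):

#include <atomic>

int a = 0;                        // plain data, published via the flag
std::atomic<bool> b{ false };     // publication flag

void writer() {
    a = 1;                                     // plain store
    b.store(true, std::memory_order_release);  // a happens-before this store
}

void reader() {
    if (b.load(std::memory_order_acquire)) {   // pairs with the release store
        // Guaranteed here: a == 1. Without the release/acquire pair, both
        // the compiler and the CPU were free to let us see b == true, a == 0.
        int local = a;
        (void)local;
    }
}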
The book C++ Concurrency in Action is a good source if you care to learn about the details of the C++ memory model and writing portable, concurrent code in C++.
I have a really simple question. I have a simple-type variable (like an int). I have one process, one writer thread, and several "readonly" threads. How should I declare the variable?
volatile int
std::atomic<int>
int
I expect that when the "writer" thread modifies the value, all "reader" threads should see the fresh value ASAP.
It's OK to read and write the variable at the same time, but I expect the reader to obtain either the old value or the new value, not some "intermediate" value.
I'm using a single-CPU Xeon E5 v3 machine. I do not need to be portable; I run the code only on this server, and I compile with -march=native -mtune=native. Performance is very important, so I do not want to add "synchronization overhead" unless absolutely required.
If I just use int and one thread writes a value, is it possible that in another thread I do not see the "fresh" value for a while?
Just use std::atomic.
Don't use volatile, and don't use a plain int as it is; neither gives the necessary synchronisation. Modifying the variable in one thread and accessing it from another without synchronisation gives undefined behaviour.
If you have unsynchronized access to a variable with one or more writers, then your program has undefined behavior. Somehow you have to guarantee that while a write is happening, no other write or read can happen. This is called synchronization. How you achieve this synchronization depends on the application.
For something like this, where we have one writer and several readers and are using a TriviallyCopyable datatype, a std::atomic<> will work. The atomic variable will make sure, under the hood, that only one thread can access the variable at a time.
If you do not have a TriviallyCopyable type, or you do not want to use a std::atomic, you can also use a conventional std::mutex and a std::lock_guard to control access:
std::mutex mtx;
{ // enter locking scope
    std::lock_guard<std::mutex> lock(mtx); // lock the mutex for this scope
    some_variable = some_value; // do work
} // end scope: the lock is destroyed and mtx is released
An important thing to keep in mind with this approach is that you want to keep the // do work section as short as possible, since while the mutex is locked no other thread can enter that section.
Another option would be to use a std::shared_timed_mutex (C++14) or std::shared_mutex (C++17), which allows multiple readers to share the mutex; when you need to write, you can still lock the mutex exclusively and write the data, as in the sketch below.
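A minimal sketch of that reader/writer split, assuming C++17 std::shared_mutex (the names are illustrative):

#include <shared_mutex>

std::shared_mutex rw;
int shared_value = 0;

int read_value() {
    std::shared_lock<std::shared_mutex> lock(rw);   // many readers at once
    return shared_value;
}

void write_value(int v) {
    std::unique_lock<std::shared_mutex> lock(rw);   // exclusive: blocks all readers
    shared_value = v;
}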
You do not want to use volatile to control synchronization as jalf states in this answer:
For thread-safe accesses to shared data, we need a guarantee that:

the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)

that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?

volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes. All volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware, but it doesn't help us in multithreaded code where the volatile object is often only used to synchronize access to non-volatile data. Those accesses can still be reordered relative to the volatile ones.
As always, if you measure the performance and find it lacking, you can try a different solution, but make sure to re-measure and compare after changing anything.
Lastly, Herb Sutter has an excellent presentation he did at C++ and Beyond 2012 called Atomic Weapons:
This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET which have similar memory models, but not all the features of C++ (such as relaxed atomics).
I'll complete the previous answers a little bit.
As explained previously, just using int or even volatile int is not enough, for various reasons (even with the memory-ordering constraints of Intel processors).
So, yes, you should use atomic types for that, but you need an extra consideration: atomic types guarantee coherent access, but if you have visibility concerns you need to specify a memory barrier (memory order).
Barriers enforce visibility and coherency between threads; on Intel and most modern architectures, they enforce cache synchronization so that updates are visible to every core. The problem is that they can be expensive if you're not careful.
The possible memory orders are:
relaxed: no special barrier; only coherent reads/writes are enforced;
sequential consistency: the strongest possible constraint (the default);
acquire: enforces that no loads after the current one are reordered before it, and adds the required barrier to ensure that released stores are visible;
consume: a simplified version of acquire that mostly only constrains reordering;
release: enforces that all stores before the current one are complete, and that the memory writes are done and visible to loads performing an acquire barrier.
So, if you want to be sure that updates to the variable are visible to readers, you need to flag your store with (at least) a release memory order, and on the reader side you need (again, at least) an acquire memory order. Otherwise, readers may not see the actual version of the integer (they'll see a coherent version at least, that is, the old value or the new one, but not an ugly mix of the two).
Of course, the default behavior (full sequential consistency) will also give you the correct behavior, but at the expense of a lot of synchronization. In short, each time you add a barrier it forces a cache synchronization, which is almost as expensive as several cache misses (and thus reads/writes to main memory).
So, in short, you should declare your int as atomic and use the following code for stores and loads:

// Your variable
std::atomic<int> v;
int x;

// Read
x = v.load(std::memory_order_acquire);

// Write
v.store(x, std::memory_order_release);
And just to complete the picture: sometimes (and more often than you think) you don't really need sequential consistency (or even the partial release/acquire consistency), since visibility of updates is pretty relative. When dealing with concurrent operations, an update takes place not when the write is performed but when others see the change, so reading the old value is probably not a problem!
I strongly recommend reading articles related to relativistic programming and RCU, here are some interesting links:
Relativistic Programming wiki: http://wiki.cs.pdx.edu/rp/
Structured Deferral: Synchronization via Procrastination: https://queue.acm.org/detail.cfm?id=2488549
Introduction to RCU Concepts: http://www.rdrop.com/~paulmck/RCU/RCU.LinuxCon.2013.10.22a.pdf
Let's start with a plain int. In general, when used on a single-processor, single-core machine, this should be sufficient, assuming the int is the same size as, or smaller than, the CPU word (like a 32-bit int on a 32-bit CPU). In this case, assuming correctly aligned addresses (a high-level language should ensure this by default), the write/read operations should be atomic. This is guaranteed by Intel, as stated in [1]. However, in the C++ specification, simultaneous reading and writing from different threads is undefined behaviour.
§1.10/6: Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
Now, volatile. This keyword disables almost every optimization on the variable, which is why it was used. For example, when optimizing, the compiler may decide that a variable you only ever read in one thread is constant there and simply replace it with its initial value; volatile solves such problems. However, it does not make access to the variable atomic. Also, in most cases, it is simply unnecessary, because using proper multithreading tools, like a mutex or memory barrier, will achieve the same effect as volatile on its own, as described for instance in [2].
While this may be sufficient for most uses, there are other operations that are not guaranteed to be atomic; incrementation is one of them. This is where std::atomic comes in. It has those operations well defined, such as the mentioned increments in [3], and it is also well defined when reading and writing from different threads [4].
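For example (a minimal sketch), the increment that is unsafe on a plain int becomes a single indivisible read-modify-write on std::atomic<int>:

#include <atomic>

std::atomic<int> counter{ 0 };

void worker() {
    counter.fetch_add(1, std::memory_order_relaxed); // one indivisible RMW, explicit ordering
    ++counter;                                       // equivalent increment, seq_cst ordering
}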
In addition, as stated in the answers in [5], there exist a lot of other factors that may negatively influence the atomicity of operations, from losing cache coherency between multiple cores down to hardware details.
To summarize, std::atomic is created to support access from different threads, and it is highly recommended to use it when multithreading.
[1] http://www.intel.com/Assets/PDF/manual/253668.pdf see section 8.1.1.
[2] https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
[3] http://en.cppreference.com/w/cpp/atomic/atomic/operator_arith
[4] http://en.cppreference.com/w/cpp/atomic/atomic
[5] Are C++ Reads and Writes of an int Atomic?
The other answers, which say to use atomic and not volatile, are correct when portability matters. If you’re asking this question, and it’s a good question, that’s the practical answer for you, not, “But, if the standard library doesn’t provide one, you can implement a lock-free, wait-free data structure yourself!” Nevertheless, if the standard library doesn’t provide one, you can implement a lock-free data structure yourself that works on a particular compiler and a particular architecture, provided that there’s only one writer. (Also, somebody has to implement those atomic primitives in the standard library.) If I’m wrong about this, I’m sure someone will kindly inform me.
If you absolutely need an algorithm guaranteed to be lock-free on all platforms, you might be able to build one with atomic_flag. If even that doesn’t suffice, and you need to roll your own data structure, you can do that.
Since there's only one writer thread, your CPU might guarantee that certain operations on your data will still work atomically even if you just use normal accesses instead of locks or even compare-and-swaps. This is not safe according to the language standard, because C++ has to work on architectures where it isn't, but it can be safe, for example, on an x86 CPU if you guarantee that the variable you're updating fits into a single cache line that it doesn't share with anything else, and you might be able to ensure this with nonstandard extensions such as __attribute__((aligned(x))).
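A sketch of that alignment trick in standard C++11 spelling (the 64-byte cache line size is an assumption about the target, not a portable fact):

#include <cstdint>

// Give the variable a cache line of its own, so a store that is atomic
// on the target cannot be torn or mixed with neighbouring data.
struct alignas(64) LoneValue {
    volatile std::uint64_t value;  // sole occupant of its cache line
};

static_assert(alignof(LoneValue) == 64, "expects 64-byte alignment");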
Similarly, your compiler might provide some guarantees: g++ in particular guarantees that the compiler will not assume that memory referenced through a volatile* is unchanged merely because the current thread couldn't have changed it. It will actually re-read the variable from memory each time you dereference it. That is in no way sufficient to ensure thread safety, but it can be handy if another thread is updating the variable.
A real-world example might be: the writer thread maintains some kind of pointer (on its own cache line) which points to a consistent view of the data structure that will remain valid through all future updates. It updates its data with the RCU pattern, making sure to use a release operation (implemented in an architecture-specific way) after updating its copy of the data and before making the pointer to that data globally visible, so that any other thread that sees the updated pointer is guaranteed to see the updated data as well. The reader then makes a local copy (not volatile) of the current value of the pointer, getting a view of the data which will stay valid even after the writer thread updates again, and works with that. You want to use volatile on the single variable that notifies the readers of the updates, so they can see those updates even if the compiler “knows” your thread couldn’t have changed it. In this framework, the shared data just needs to be constant, and readers will use the RCU pattern. That’s one of the two ways I’ve seen volatile be useful in the real world (the other being when you don’t want to optimize out your timing loop).
There also needs to be some way, in this scheme, for the program to know when nobody is using an old view of the data structure any longer. If that's a count of readers, the count needs to be modified atomically in a single operation at the same time as the pointer is read (so getting the current view of the data structure involves an atomic CAS). Or it might be a periodic tick at which all the threads are guaranteed to be done with the data they're working with now. Or it might be a generational data structure where the writer rotates through pre-allocated buffers.
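A hedged sketch of the publish step in that scheme (all names invented; the reclamation strategy is deliberately left abstract):

#include <atomic>

struct Snapshot {
    int data[16];                 // immutable once published
};

std::atomic<const Snapshot*> current{ nullptr };

void writer_publish(Snapshot* fresh) {
    // ... fill in *fresh completely ...
    current.store(fresh, std::memory_order_release);  // data writes before the pointer
    // The old snapshot may be reclaimed only when no reader can still
    // hold it: a reader count, an epoch tick, or buffer rotation.
}

const Snapshot* reader_view() {
    // Pairs with the writer's release store, so the pointed-to data is
    // fully visible; the local copy of the pointer stays valid for as
    // long as the reclamation scheme promises.
    return current.load(std::memory_order_acquire);
}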
Also observe that a lot of things your program might do could implicitly serialize the threads: those atomic hardware instructions lock the processor bus and force other CPUs to wait, those memory fences could stall your threads, or your threads might be waiting in line to allocate memory from the heap.
Unfortunately, it depends.
When a variable is read and written in multiple threads, there are two possible failure modes:
1) Tearing, where half the data is pre-change and half the data is post-change.
2) Stale data, where the data read has some older value.
int, volatile int and std::atomic<int> all don't tear.
Stale data is a different issue; however, all values read have existed at some point, so they can be conceived of as correct.
volatile: this tells the compiler neither to cache the data nor to re-order operations around it. This improves coherence between threads by ensuring all operations in a thread occur either before the variable access, at the variable access, or after it.
This means that
volatile int x;
int y;
y = 5;
x = 7;
the store for x = 7 will be issued after the one for y = 5.
Unfortunately, the CPU is also capable of re-ordering operations. This can mean that another thread sees x == 7 before y == 5.
std::atomic<int> x; would give the guarantee that, after seeing x == 7, another thread will also see y == 5 (assuming other threads are not modifying y).
So all reads of int, volatile int and std::atomic<int> will show previous valid values of x; using volatile and atomic increases the ordering guarantees between values.
See the kernel.org memory-barriers documentation.
I have a simple-type variable (like an int). I have one process, one writer thread, several "readonly" threads. How should I declare the variable?

volatile int
std::atomic<int>
int
Use std::atomic with memory_order_relaxed for the store and load
It's quick, and from your description of your problem, safe. E.g.
void func_fast()
{
    std::atomic<int> a;
    a.store(1, std::memory_order_relaxed);
}
Compiles to:
func_fast():
movl $1, -24(%rsp)
ret
This assumes you don't need to guarantee that any other data is seen to be written before the integer is updated, and therefore the slower and more complicated synchronisation is unnecessary.
If you use the atomic naively like this:
void func_slow()
{
    std::atomic<int> b;
    b = 1;
}
Because no memory_order* is specified, you get the default sequentially-consistent store, which emits an MFENCE instruction and is massively slower (100 cycles or more, vs. just 1 or 2 for the bare MOV):
func_slow():
movl $1, -24(%rsp)
mfence
ret
See http://goo.gl/svPpUa
(Interestingly, on Intel, using memory_order_release and _acquire for this code results in the same assembly language. Intel guarantees that writes and reads happen in order when using the standard MOV instruction.)
Here is my attempt at the bounty:
a. The general answer already given above says "use atomics". This is the correct answer. volatile is not enough.
b. If you dislike that answer, and you are on Intel, and you have a properly aligned int, and you love unportable solutions, you can get away with a plain volatile, using Intel's strong memory-ordering guarantees.
TL;DR: Use std::atomic<int> with a mutex around it if you read multiple times.
It depends on how strong a guarantee you want.
First, volatile is a compiler hint, and you shouldn't count on it doing anything helpful.
If you use a plain int, you can suffer from memory aliasing. Say you have something like

struct {
    int x;
    bool q;
};

Depending on how this is aligned in memory, and on the exact implementation of the CPU and memory bus, it's possible that writing to q will actually overwrite x when the page is copied from the CPU cache back to RAM. So unless you know how much to allocate around your int, it's not guaranteed that your writer will be able to write without being overwritten by some other thread.
Also, even after you write, you depend on the processor to reload the data into the caches of other cores, so there's no guarantee that your other thread will see a new value.
std::atomic<int> basically guarantees that you will always allocate sufficient, properly aligned memory, so that you don't suffer from aliasing. Depending on the memory order requested, you will also disable a bunch of optimizations, like caching, so everything will run slightly slower.
This still doesn't guarantee that if you read the variable multiple times you'll get the same value. The only way to do that is to put a mutex around it to block the writer from changing it.
Still, it's better to find a library that already solves the problem you have and that has already been tested by others to make sure it works well.
Does anyone know whether a primitive global variable is thread-safe or not?
// global variable
int count = 0;

void thread1()
{
    count++;
}

void thread2()
{
    count--;
    if (count == 0) print("Stuff thing");
}
Can I do it this way without any lock protection for count?
Thank you.
This is not thread-safe. You have a race condition here. The reason is that count++ is not necessarily atomic (i.e. not a single processor operation): the value is first loaded, then incremented, and then written back. Between each of these steps, the other thread can also modify the value.
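A C++11 sketch of the usual fix, using std::atomic so the test and the decrement form one indivisible step:

#include <atomic>
#include <cstdio>

std::atomic<int> count{ 0 };

void thread1() {
    count.fetch_add(1, std::memory_order_relaxed);
}

void thread2() {
    // fetch_sub returns the value *before* the subtraction, so checking
    // the returned value avoids a second, racy read of count.
    if (count.fetch_sub(1, std::memory_order_acq_rel) == 1)
        std::printf("Stuff thing\n");
}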
No, it's not. It may be, depending on the implementation, the compile-time options and even the phase of the moon.
But the standard doesn't mandate something is thread-safe, specifically because there's nothing about threading in the current standard.
See also here for a more detailed analysis of this sort of issue.
If you're using an environment where threads are supported, you can use a mutex: examine the pthread_mutex_* calls under POSIX threads, for example.
If you're coding for C++0x/C++11, use either a mutex or one of the atomic operations detailed in that standard.
It will be thread-safe only if you have a single CPU whose ++ and -- operations are atomic.
If you want to make it thread safe this is the way for Windows:
#include <windows.h>
#include <cstdio>

LONG volatile count = 0;

void Thread1()
{
    ::InterlockedIncrement(&count);
}

void Thread2()
{
    if (::InterlockedDecrement(&count) == 0)
    {
        printf("Stuff thing\n");
    }
}
You need two things to safely use an object concurrently by two threads or more: atomicity of operations and ordering guarantees.
Some people will claim that on some platforms what you're attempting here is safe because, e.g., operations on whatever type int stands for on those platforms are atomic (even increments, or whatever). The problem with this is that you don't necessarily have ordering guarantees. So while you want, and know, that this particular variable is going to be accessed concurrently, the compiler doesn't. (And the compiler is right to assume that a variable is going to be used by only one thread at a time: you wouldn't want every variable to be treated as potentially shared. The performance consequences would be terrible.)
So don't use primitive types this way. You have no guarantees from the language and even if some platforms have their own guarantees (e.g. atomicity) you have no way of telling the compiler that the variable is shared with C++. Either use compiler extensions for atomic types, the C++0x atomic types, or library solutions (e.g. mutexes). And don't let the name mislead you: to be truly useful, an atomic type has to provide ordering guarantees along with the atomicity that comes with the name.
It is a global variable, and hence multiple threads can race to change it. It is not thread-safe.
Use a mutex lock.
In general, no, you can't get away with that. In your trivial example case, it might happen to work but it's not reliable.