Why don't compilers merge redundant std::atomic writes? - c++

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
#include <atomic>
std::atomic<int> y(0);
void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}
Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?
Here's the code in compiler explorer.

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a race bug, but only the garden-variety race-condition kind of bug, not the C++ Undefined Behaviour kind of data race (that kind of UB only exists with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
It doesn't depend on the target architecture or hardware; compile-time reordering of relaxed atomic operations is allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile to zero asm instructions.
So why don't compilers do this optimization?
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
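For concreteness, here is a minimal sketch of that scenario (the names progress, do_work_item and worker are invented for illustration):

#include <atomic>

std::atomic<int> progress(0);      // polled by a GUI thread to draw the bar

void do_work_item(int i);          // hypothetical unit of work; touches no atomics

void worker(int total) {
    for (int i = 0; i < total; ++i) {
        do_work_item(i);
        progress.store(i + 1, std::memory_order_relaxed);  // one update per item
    }
}

Sinking the store out of this loop would be legal on paper, but would make progress sit at 0 and then jump straight to total.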
There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
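For instance, here is a hypothetical sketch (use() is an invented stand-in for arbitrary read-only work): the copy of sp made each iteration does an atomic ref-count increment and a matching decrement, even though sp is obviously alive for the whole loop, and today's compilers will not remove that pair.

#include <memory>

int use(const int&);                        // hypothetical read-only work

long total(std::shared_ptr<int> sp, int n) {
    long sum = 0;
    for (int i = 0; i < n; ++i) {
        std::shared_ptr<int> local = sp;    // atomic ref-count increment
        sum += use(*local);
    }                                       // atomic ref-count decrement each iteration
    return sum;
}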
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be a full barrier against runtime and compile-time reordering, even if it no longer touched the memory at num.
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if(x) {
    foo();
    y.store(0);
} else {
    bar();
    y.store(0); // release a lock before a long-running loop
    for() {...} // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated compiler code that would allow these optimizations to be done safely (without ever getting it wrong). That alone isn't a sufficient explanation, though, because N4455 says LLVM already implements, or could easily implement, several of the optimizations it mentions.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

You are referring to dead-store elimination.
It is not forbidden to eliminate an atomic dead store, but it is harder to prove that an atomic store qualifies as such.
Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.
from N4455 No Sane Compiler Would Optimize Atomics
The problem with atomic DSE, in the general case, is that it involves looking for synchronization points; in my understanding this term means points in the code where there is a happens-before relationship between an instruction on a thread A and an instruction on another thread B.
Consider this code executed by a thread A:
y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);
Can it be optimised as y.store(3, std::memory_order_seq_cst)?
If a thread B is waiting to see y == 2 (e.g. spinning with a CAS), it would never observe that value if the code gets optimised.
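For concreteness, thread B might look something like this hypothetical sketch (assuming the same std::atomic<int> y that thread A stores to):

#include <atomic>

extern std::atomic<int> y;   // the variable thread A stores 1, 2, 3 to

void thread_b() {
    int expected = 2;
    // Spin until we observe y == 2, then claim it by writing 4.
    while (!y.compare_exchange_weak(expected, 4, std::memory_order_seq_cst)) {
        expected = 2;        // compare_exchange_weak overwrites expected on failure
    }
    // If thread A's stores were coalesced into a single y.store(3),
    // this loop could legitimately spin forever.
}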
However, in my understanding, having B looping and CASsing on y == 2 is a race condition, as there is not a total order between the two threads' instructions.
An execution where A's instructions are executed before B's loop is observable (i.e. allowed), and thus the compiler can optimise to y.store(3, std::memory_order_seq_cst).
If threads A and B were synchronized, somehow, between the stores in thread A, then the optimisation would not be allowed (a partial order would be induced, possibly requiring that B observe y == 2).
Proving that there is no such synchronization is hard, as it involves considering a broader scope and taking into account all the quirks of an architecture.
In my understanding, due to the relative newness of atomic operations and the difficulty of reasoning about memory ordering, visibility and synchronization, compilers won't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.
I believe your example is a simplification of the counting-thread scenario given above; since it doesn't have any other thread or any synchronization point, as far as I can see, I suppose the compiler could have optimised the three stores.

While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on the value of the atomic. The example you gave is so specific that compiler developers don't see it as worth optimizing. However, if one thread is setting e.g. consecutive values for an atomic (0, 1, 2, etc.), the other thread may be putting something in the slots indicated by the value of the atomic.
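A hypothetical sketch of that pattern (the names current_slot, data, announcer and worker are invented for illustration):

#include <atomic>

std::atomic<int> current_slot(-1);
int data[3];

void announcer() {                  // writer thread: announces which slot to fill next
    current_slot.store(0, std::memory_order_relaxed);
    current_slot.store(1, std::memory_order_relaxed);
    current_slot.store(2, std::memory_order_relaxed);
}

void worker() {                     // other thread: fills whatever slot it observes
    int s = current_slot.load(std::memory_order_relaxed);
    if (s >= 0)
        data[s] = 42;
}

Note that even without any coalescing the worker is not guaranteed to observe every intermediate value, which is exactly the point the answer above makes; but folding the three stores into one would make the intermediate slots unreachable by construction.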

NB: I was going to comment this but it's a bit too wordy.
One interesting fact is that this behavior isn't, in C++ terms, a data race.
Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is
not atomic
Also on p.11, note 5:
“Relaxed” atomic operations are not synchronization operations even
though, like synchronization operations, they cannot contribute to
data races.
So a conflicting action on an atomic is never a data race - in terms of the C++ standard.
These operations are all atomic (and specifically relaxed) but no data race here folks!
I agree there's no reliable/predictable difference between these two on any (reasonable) platform:
#include <atomic>
std::atomic<int> y(0);
void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
    y.store(1, order);
    y.store(1, order);
}
and
#include <atomic>
std::atomic<int> y(0);
void f() {
    auto order = std::memory_order_relaxed;
    y.store(1, order);
}
But within the definition provided by the C++ memory model it isn't a data race.
I can't easily understand why that definition is provided, but it does hand the developer a few cards for engaging in haphazard communication between threads that they may know (on their platform) will statistically work.
For example, setting a value 3 times then reading it back will show some degree of contention for that location. Such approaches aren't deterministic but many effective concurrent algorithms aren't deterministic.
For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.
It appears the C++ Standard is providing you with certainty around 'data races' while permitting certain fun-and-games with race conditions, which are, in the final analysis, different things.
In short, the standard appears to specify that where other threads may see the 'hammering' effect of a value being set 3 times, other threads must be able to see that effect (even if they sometimes may not!).
It's the case that on pretty much all modern platforms another thread may, under some circumstances, see the hammering.

In short, because the standard (for example the paragraphs around and below 20 in [intro.multithread]) disallows it.
There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).
If your thread writes three values to memory (let's say 1, 2, and 3) one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the same order as they happen (either by scheduling or coincidence, or whatever reason). That's a guarantee.
How is this possible if you only do half of the writes (or even only a single one)? It isn't.
What if your thread instead writes out 1, 1, 1 but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?
It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.

A practical use case for the pattern, if the thread does something important between updates that does not depend on or modify y, might be: "Thread 2 reads the value of y to check how much progress Thread 1 has made."
So, maybe Thread 1 is supposed to load the configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 is waiting on step 2 to complete so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)
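A rough sketch of that scenario (the step functions are invented; as noted, acquire/release is used here rather than relaxed):

#include <atomic>

std::atomic<int> y(0);             // which step Thread 1 has completed

void load_config_file();           // hypothetical step 1
void build_data_structure();       // hypothetical step 2
void show_main_window();           // hypothetical step 3
void use_data_structure();         // hypothetical work done by Thread 2

void thread1() {
    load_config_file();
    y.store(1, std::memory_order_release);
    build_data_structure();
    y.store(2, std::memory_order_release);
    show_main_window();
    y.store(3, std::memory_order_release);
}

void thread2() {
    while (y.load(std::memory_order_acquire) < 2) { /* spin or sleep */ }
    use_data_structure();          // step 2's writes happen-before this
}

Note that Thread 2 polls for y >= 2 rather than y == 2, which also sidesteps the concern in the next paragraph that the exact intermediate value might never be observed.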
I’m pretty sure a conforming implementation allows Thread 1 not to update y at any intermediate step—while I haven’t pored over the language standard, I would be shocked if it does not support hardware on which another thread polling y might never see the value 2.
However, that is a hypothetical instance where it might be pessimal to optimize away the status updates. Maybe a compiler dev will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub your toe.

The compiler writer cannot just perform the optimisation. They must also convince themselves that the optimisation is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, and that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimisation itself.
On the other hand, I could imagine that in practice (that is in programs that are supposed to do a job, and not benchmarks), this optimisation will save very little in execution time.
So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.

Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Let's assume there's some non-trivial work being done between the stores, and that such work does not involve y at all (so that data path analysis can determine that the three stores are in fact redundant, at least within this thread), and does not itself introduce any memory barriers (so that something else doesn't force the stores to be visible to other threads). Now it is quite possible that other threads have an opportunity to get work done between the stores, and perhaps those other threads manipulate y and that this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.

Since variables contained within an std::atomic object are expected to be accessed from multiple threads, one should expect that they behave, at a minimum, as if they were declared with the volatile keyword.
That was the standard and recommended practice before CPU architectures introduced cache lines, etc.
[EDIT2] One could argue that std::atomic<> variables are the volatile variables of the multicore age. As defined in C/C++, volatile is only good enough to synchronize atomic reads from a single thread with an ISR modifying the variable (which in this case is effectively an atomic write as seen from the main thread).
I personally am relieved that no compiler would optimize away writes to an atomic variable. If the write is optimized away, how can you guarantee that each of these writes could potentially be seen by readers in other threads? Don't forget that that is also part of the std::atomic<> contract.
Consider this piece of code, where the result would be greatly affected by wild optimization by the compiler.
#include <atomic>
#include <thread>
static const int N{ 1000000 };
std::atomic<int> flag{1};
std::atomic<bool> do_run { true };
void write_1()
{
    while (do_run.load())
    {
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
        flag = 1; flag = 1; flag = 1; flag = 1;
    }
}
void write_0()
{
    while (do_run.load())
    {
        flag = -1; flag = -1; flag = -1; flag = -1;
    }
}
int main(int argc, char** argv)
{
    int counter{};
    std::thread t0(&write_0);
    std::thread t1(&write_1);
    for (int i = 0; i < N; ++i)
    {
        counter += flag;
        std::this_thread::yield();
    }
    do_run = false;
    t0.join();
    t1.join();
    return counter;
}
[EDIT] At first, I was not claiming that volatile was central to the implementation of atomics, but...
Since there seemed to be doubts as to whether volatile had anything to do with atomics, I investigated the matter. Here's the atomic implementation from the VS2017 stl. As I surmised, the volatile keyword is everywhere.
// from file atomic, line 264...
// TEMPLATE CLASS _Atomic_impl
template<unsigned _Bytes>
struct _Atomic_impl
    {   // struct for managing locks around operations on atomic types
    typedef _Uint1_t _My_int;   // "1 byte" means "no alignment required"

    constexpr _Atomic_impl() _NOEXCEPT
        : _My_flag(0)
        {   // default constructor
        }

    bool _Is_lock_free() const volatile
        {   // operations that use locks are not lock-free
        return (false);
        }

    void _Store(void *_Tgt, const void *_Src, memory_order _Order) volatile
        {   // lock and store
        _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
        }

    void _Load(void *_Tgt, const void *_Src,
        memory_order _Order) const volatile
        {   // lock and load
        _Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
        }

    void _Exchange(void *_Left, void *_Right, memory_order _Order) volatile
        {   // lock and exchange
        _Atomic_exchange(&_My_flag, _Bytes, _Left, _Right, _Order);
        }

    bool _Compare_exchange_weak(
        void *_Tgt, void *_Exp, const void *_Value,
        memory_order _Order1, memory_order _Order2) volatile
        {   // lock and compare/exchange
        return (_Atomic_compare_exchange_weak(
            &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
        }

    bool _Compare_exchange_strong(
        void *_Tgt, void *_Exp, const void *_Value,
        memory_order _Order1, memory_order _Order2) volatile
        {   // lock and compare/exchange
        return (_Atomic_compare_exchange_strong(
            &_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
        }

private:
    mutable _Atomic_flag_t _My_flag;
    };
All of the specializations in the MS stl use volatile on the key functions.
Here's the declaration of one of such key function:
inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt, _Uint8_t *_Exp, _Uint8_t _Value, memory_order _Order1, memory_order _Order2)
You will notice the required volatile uint8_t* holding the value contained in the std::atomic. This pattern can be observed throughout the MS std::atomic<> implementation; there is no reason for the gcc team, nor any other STL provider, to have done it differently.

Related

Does statement re-ordering apply to conditional/control statements?

As described in other posts, without any sort of volatile or std::atomic qualification, the compiler and/or processor is at liberty to re-order the sequence of statements (e.g. assignments):
// this code
int a = 2;
int b = 3;
int c = a;
b = c;
// could be re-ordered/re-written as the following by the compiler/processor
int c = 2;
int a = c;
int b = a;
However, are conditionals and control statements (e.g. if, while, for, switch, goto) also allowed to be re-ordered, or are they considered essentially a "memory fence"?
int* a = &previously_defined_int;
int b = *a;
if (a == some_predefined_ptr)
{
    a = some_other_predefined_ptr; // is this guaranteed to occur after "int b = *a"?
}
If the above statements could be re-ordered (e.g. store a in a temporary register, update a, then populate b by dereferencing the "old" a in the temporary register), which I guess they could be while still satisfying the same "abstract machine" behavior in a single-threaded environment, then why aren't there problems when using locks/mutexes?
bool volatile locked = false; // assume on given implementation, "locked" reads/writes occur in 1 CPU instruction
// volatile so that compiler doesn't optimize out
void thread1(void)
{
    while (locked) {}
    locked = true;
    // do thread1 things // couldn't these be re-ordered in front of "locked = true"?
    locked = false;
}
void thread2(void)
{
    while (locked) {}
    locked = true;
    // do thread2 things // couldn't these be re-ordered in front of "locked = true"?
    locked = false;
}
Even if std::atomic was used, non-atomic statements can still be re-ordered around atomic statements, so that wouldn't help ensure that "critical section" statements (i.e. the "do threadX things") were contained within their intended critical section (i.e. between the locking/unlocking).
Edit: Actually, I realize the lock example doesn't actually have anything to do with the conditional/control statement question I asked. It would still be nice to have clarification on both of the questions asked though:
re-ordering within and around conditional/control statements
are locks/mutexes of the form given above robust?
so far the answer from the comments is, "no, because there is a race condition between the while() check and claiming the lock", but apart from that, I am also wondering about the placement of thread function code outside of the "critical section"
The "as-if" rule ([intro.abstract]) is important to note here:
The semantic descriptions in this document define a parameterized nondeterministic abstract machine. This document places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below
Anything* can be reordered so long as the implementation can guarantee that the observable behavior of the resulting program is unchanged.
Thread synchronization constructs in general cannot be implemented properly without fences and preventing reordering. For example, the standard guarantees that lock/unlock operations on mutexes shall behave as atomic operations. Atomics also explicitly introduce fences, especially with respect to the memory_order specified. This means that statements depending on an (unrelaxed) atomic operation cannot be reordered, otherwise the observable behavior of the program may change.
[intro.races] talks a great deal about data races and ordering.
Branches are very much the opposite of a memory fence in assembly. Speculative execution + out-of-order exec means that control dependencies are not data dependencies, so for example if(x) tmp = y; can load y without waiting for a cache miss on x. See memory model, how load acquire semantic actually works?
Of course in C++, this just means that no, if() doesn't help. The definition of (essentially deprecated) memory_order_consume might even specify that an if isn't a data dependency. Real compilers promote consume to acquire because it's too hard to implement as originally specified, though.
So TL:DR: you still need atomics with mo_acquire and mo_release (or stronger) if you want to establish a happens-before between two threads. Using a relaxed variable in an if() doesn't help at all, and in fact makes reordering in practice on real CPUs easier.
And of course non-atomic variables aren't safe without synchronization. An if(data_ready.load(acquire)) is sufficient to protect access to a non-atomic variable, though. So is a mutex; mutex lock/unlock count as acquire and release operations on the mutex object, according to C++ definitions. (Many practical implementations involve full barriers, but formally C++ only guarantees acq and rel for mutexes)
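A minimal sketch of that acquire/release publication pattern (payload and the function names are invented):

#include <atomic>

int payload;                           // non-atomic shared data
std::atomic<bool> data_ready(false);

void producer() {
    payload = 42;                                        // (1) plain store
    data_ready.store(true, std::memory_order_release);   // (2) publish
}

void consumer() {
    if (data_ready.load(std::memory_order_acquire)) {    // pairs with the release store
        int x = payload;   // (1) happens-before this read: no data race, no stale value
        (void)x;
    }
}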
Conditions can also be re-ordered so long as no program that follows the rules will have its behavior affected by the re-ordering. There aren't problems with locks/mutexes because optimizations are only legal if they don't break programs that follow the rules. Programs that properly use locks and mutexes follow the rules, so the implementation has to be careful not to break them.
Your example code using while (locked) {} where locked is volatile either follows the platform's rules or it doesn't. There are some platforms where volatile has guaranteed semantics that make this code work, and that code would be safe on those platforms. But on platforms where volatile has no specified multi-thread semantics, all bets are off. Optimizations are allowed to break the code since the code relies on behavior the platform does not guarantee.
I am also wondering about the placement of thread function code outside of the "critical section"
Check your platform's documentation. It either guarantees that operations won't be re-ordered around accesses to volatile objects or it doesn't. If it doesn't, then it is free to re-order the operations and you would be foolish to rely on it not doing so.
Be careful. There was a time when you often had pretty much no choice but to do things like this. But that was a long time ago and modern platforms provide sensible atomic operations, operations with well-defined memory visibility, and so on. There is a history of new optimizations coming out that significantly improve performance of realistic code but break code that relied on assumed, but not guaranteed, semantics. Don't add to the problem without reason.
The compiler is free to 'optimize' the source code without affecting the result of the program.
The optimization can be re-ordering or removal of unnecessary statements, as long as the net result of the program does not get affected.
For example, the following assignments and 'if' condition can be optimized into a single statement:
Before optimization:
int a = 0;
int b = 20;
// ...
if (a == 0) {
    b = 10;
}
Optimized code
int b = 10;
At 41:27 of Herb Sutter's talk, he states that "opaque function calls" (I would presume functions for which the compiler can see only the declaration, but not the definition) require a full memory barrier. Therefore, while conditional/control statements are fully transparent and can be re-ordered by the compiler/processor, a workaround could be to use opaque functions (e.g. #include a library with functions that have "NOP" implementations in a source file).

what makes c++ atomic atomic

I just know that even code like int64_t_variable = 10 is not an atomic operation. For example:
int64_t i = 0;
i = 100;
in another thread (say T2) it may read the value which is not 0 or 100.
1. Is the above true?
2. std::atomic<int64_t> i; i.store(100, std::memory_order_relaxed) is atomic. So what magic does atomic use to make that happen, given that Q1 is true?
3. I always thought any operation which handles less than 64 bits is atomic (assuming a 64-bit CPU), but it looks like I was wrong. So for v = n, how can I know whether it is atomic or not? For example, if v is void *, is it atomic or not?
==================================
Update: In my question, when T2 reads i, both 0 and 100 are OK for me, but any other result is not reasonable. This is the point. So I don't think CPU cache or compiler stuff can make this happen.
Is the above true?
Yes. If you do not use synchronization (std::atomic<>, std::mutex, ...) then any change you make in one thread cannot be assumed to show up in other threads. It could even be that the compiler optimizes something away because there is no way the value can change within the function. For example:
bool flag = true;
void stop_stuff()
{
    flag = false;
}
void do_stuff()
{
    while (flag)
        std::cout << "never stop";
}
Since there is no synchronization for flag, the compiler is free to assume it never changes and to optimize away even the check of flag in the loop condition. If it does, then no matter how many times you call stop_stuff, do_stuff will never end.
If you change flag to std::atomic<bool> flag, then the compiler can no longer make such an assumption as you are telling it that this variable can change outside the scope of the function and it needs to be checked.
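A sketch of the fixed version; relaxed ordering is enough here because only the flag itself is being communicated, and the store is expected to become visible to the loop in a finite amount of time:

#include <atomic>
#include <iostream>

std::atomic<bool> flag(true);

void stop_stuff() {
    flag.store(false, std::memory_order_relaxed);
}

void do_stuff() {
    while (flag.load(std::memory_order_relaxed))   // re-loaded every iteration
        std::cout << "never stop";
}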
Do note that not providing synchronization when you have more than one thread with shared data and at least one of those threads writes to the shared data is called a data race and per the standard is undefined behavior. The above example is just one possible outcome of that undefined behavior.
std::atomic<int64_t> i; i.store(100, std::memory_order_relaxed) is atomic. So what magic does atomic use to make that happen, given that Q1 is true?
It either uses an atomic primitive your system provides, or it uses a locking mechanism like std::mutex to guard the access.
I always thought any operation which handles less than 64 bits is atomic (assuming a 64-bit CPU), but it looks like I was wrong. So for v = n, how can I know whether it is atomic or not? For example, if v is void *, is it atomic or not?
While this may be true on some systems, it is not true for the C++ memory model. The only things that are atomic are std::atomic<T>.
Yes. Reasons include "it requires multiple instructions", "memory caching can mess with it" and "the compiler may do surprising things (aka UB) if you fail to mention that the variable might be changed elsewhere".
The above reasons all need to be addressed, and std::atomic gives the compiler/C++ library implementation the information to do so. The compiler won't be allowed to do any surprising optimizations, it might issue cache flushes where necessary, and the implementation may use a locking mechanism to prevent different operations on the same object from interleaving. Note that not all of these may be necessary: x86 has some atomicity guarantees built in. You can find out whether a given std::atomic type is always lock-free on your platform with std::atomic<T>::is_always_lock_free (C++17), and whether a given instance is lock-free with its is_lock_free() member (which can depend on e.g. alignment).
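For example (a small sketch; is_always_lock_free requires C++17):

#include <atomic>
#include <cstdint>

static_assert(std::atomic<std::int64_t>::is_always_lock_free,
              "std::atomic<int64_t> uses a lock on this target");

bool instance_is_lock_free() {
    std::atomic<std::int64_t> v(0);
    return v.is_lock_free();   // per-object answer; can depend on placement/alignment
}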
If you don't use std::atomic, you have no atomicity guarantees due to the above reasons. It might be atomic on the instruction-level on your CPU, but C++ is defined on an abstract machine without such guarantees. If your code relies on atomicity on the abstract machine but fails to specify this, the compiler might make invalid optimizations and produce UB.
Yes; for example, if memory is written 8 bits at a time, another thread may read a partially written variable.
Whatever it takes to guarantee correct results: that can be nothing at all, built-in atomic operations, locks, etc.
void* is not atomic; std::atomic<void*> is.
Same for char: to be atomic, it needs to be explicitly declared as atomic.

Should i specify volatile keyword for every object that shares its memory between different threads

I just read the 'Do not use volatile as a synchronization primitive' article on the CERT site and noticed that a compiler can theoretically optimize the following code in a way that stores the flag variable in a register instead of modifying the actual memory shared between different threads:
bool flag = false; // Not declaring it volatile is wrong. But even declaring it volatile, this code is still erroneous
void test() {
    while (!flag) {
        Sleep(1000); // sleeps for 1000 milliseconds
    }
}
void Wakeup() {
    flag = true;
}
void debit(int amount){
    test();
    account_balance -= amount; // We think it is safe to go inside the critical section
}
Am I right?
Is it true that I need to use the volatile keyword for every object in my program that shares its memory between different threads? Not because it does some kind of synchronization for me (I need to use mutexes or other synchronization primitives to accomplish such a task anyway), but just because of the fact that a compiler can possibly optimize my code and store all shared variables in registers, so other threads would never see the updated values?
It's not just about storing them in the registers, there are all sorts of levels of caching between the shared main memory and the CPU. Much of that caching is per CPU-core so any change made there will not be seen by other cores for a long time (or potentially if other cores are modifying the same memory then those changes may be lost completely).
There are no guarantees about how that caching will behave and even if something is true for current processors it may well not be true for older processors or for the next generation of processors. In order to write safe multi threading code you need to do it properly. The easiest way is to use the libraries and tools provided in order to do so. Trying to do it yourself using low level primitives like volatile is a very hard thing involving a lot of in-depth knowledge.
It is actually very simple, but confusing at the same time. At a high level, there are two optimizing entities at play when you write C++ code: the compiler and the CPU. And within the compiler, there are two major optimization techniques with regard to variable access: omitting a variable access even though it is written in the code, and moving other instructions around this particular variable access.
In particular, the following example demonstrates those two techniques:
int k;
bool flag;
void foo() {
    flag = true;
    int i = k;
    k++;
    k = i;
    flag = false;
}
In the code provided, the compiler is free to skip the first modification of flag - leaving only the final assignment to false - and to completely remove any modifications to k. If you make k volatile, you will require the compiler to preserve all accesses to k: it will be incremented, and then the original value put back. If you make flag volatile as well, both assignments, first to true, then to false, will remain in the code. However, reordering would still be possible, and the effective code might look like:
void foo() {
    flag = true;
    flag = false;
    int i = k;
    k++;
    k = i;
}
This will have an unpleasant effect if another thread expects flag to indicate whether k is currently being modified.
One of the ways to achieve the desired effect would be to define both variables as atomic. This would prevent the compiler from doing either optimization, ensuring that the code executed is the same as the code written. Note that atomic is, in effect, a 'volatile plus': it does everything volatile does, and more.
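A sketch of what that might look like (same variables as above, now atomic, using the default sequentially consistent ordering):

#include <atomic>

std::atomic<int> k{0};
std::atomic<bool> flag{false};

void foo() {
    flag = true;    // seq_cst store: emitted and kept before the k accesses in practice
    int i = k;      // seq_cst load
    k++;            // atomic read-modify-write
    k = i;          // seq_cst store
    flag = false;   // emitted and kept after the k accesses
}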
Another thing to notice is that compiler optimizations are, indeed, a very powerful and desired tool. One should not impede them just for the fun of it, so atomicity should be used only when it is required.
On your particular
bool flag = false;
example, declaring it as volatile will universally work and is 100% correct.
But it will not buy you that all the time.
volatile IMPOSES on the compiler that each and every evaluation of an object (or mere C variable) is either done directly on memory/registers or preceded by a retrieval from the external-memory medium into internal memory/registers. In some cases the code and memory footprint can be quite a bit larger, but the real issue is that it's not enough.
When some time-based context-switching is going on (e.g. threads), and your volatile object/variable is aligned and fits in a CPU register, you get what you intended. Under these strict conditions, a change or evaluation is atomically done, so in a context switching scenario the other thread will be immediately "aware" of any changes.
However, if your object or big variable does not fit in a CPU register (because of its size or lack of alignment), a thread context switch on a volatile may still be a no-no... an evaluation in the concurrent thread may catch it mid-change... e.g. while a 5-member struct copy is being changed, the concurrent thread is invoked in the middle of changing the 3rd member. Boom!
The conclusion is (back to "Operating Systems 101"): you need to identify your shared objects, choose a preemptive+blocking, non-preemptive, or other concurrent-resource access strategy, and make your evaluators/changers atomic. The access methods (change/eval) usually incorporate the make-atomic strategy, or (if the object is aligned and small) you simply declare it as volatile.

Volatile vs. memory fences

The code below is used to assign work to multiple threads, wake them up, and wait until they are done. The "work" in this case consists of "cleaning a volume". What exactly this operation does is irrelevant for this question -- it just helps with the context. The code is part of a huge transaction processing system.
void bf_tree_cleaner::force_all()
{
    for (int i = 0; i < vol_m::MAX_VOLS; i++) {
        _requested_volumes[i] = true;
    }
    // fence here (seq_cst)
    wakeup_cleaners();
    while (true) {
        usleep(10000); // 10 ms
        bool remains = false;
        for (int vol = 0; vol < vol_m::MAX_VOLS; ++vol) {
            // fence here (seq_cst)
            if (_requested_volumes[vol]) {
                remains = true;
                break;
            }
        }
        if (!remains) {
            break;
        }
    }
}
A value in a boolean array _requested_volumes[i] tells whether thread i has work to do. When it is done, the worker thread sets it to false and goes back to sleep.
The problem I am having is that the compiler generates an infinite loop, where the variable remains is always true, even though all values in the array have been set to false. This only happens with -O3.
I have tried two solutions to fix that:
Declare _requested_volumes volatile
(EDIT: this solution does work actually. See edit below)
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses. But there's a lot of dispute over this on the Internet. The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access. In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
Introduce memory fences
The method wakeup_cleaners() acquires a pthread_mutex_t internally in order to set a wake-up flag in the worker threads, so it should implicitly produce proper memory fences. But I'm not sure if those fences affect memory accesses in the caller method (force_all()). Therefore, I manually introduced fences in the locations specified by the comments above. This should make sure that writes performed by the worker thread in _requested_volumes are visible in the main thread.
What puzzles me is that none of these solutions works, and I have absolutely no idea why. The semantics and proper use of memory fences and volatile is confusing me right now. The problem is that the compiler is applying an undesired optimization -- hence the volatile attempt. But it could also be a problem of thread synchronization -- hence the memory fence attempt.
I could try a third solution in which a mutex protects every access to _requested_volumes, but even if that works, I would like to understand why, because as far as I understand, it's all about memory fences. Thus, it should make no difference whether it's done explicitly or implicitly via a mutex.
EDIT: My assumptions were wrong and Solution 1 actually does work. However, my question remains in order to clarify the use of volatile vs. memory fences. If volatile is such a bad thing that it should never be used in multithreaded programming, what else should I use here? Do memory fences also affect compiler optimizations? Because I see these as two orthogonal issues, and therefore orthogonal solutions: fences for visibility across threads and volatile for preventing optimizations.
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses.
Yes.
But there's a lot of dispute over this on the Internet.
Not, generally, between "the experts".
The way I understand it, volatile is the only way to refrain the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access.
Nope.
Non-pure, non-constexpr non-inlined function calls (getters/accessors) also necessarily have this effect. Admittedly link-time optimization confuses the issue of which functions may really get inlined.
In C, and by extension C++, volatile affects memory access optimization. Java took this keyword, and since it can't (or couldn't) do the tasks C uses volatile for in the first place, altered it to provide a memory fence.
The correct way to get the same effect in C++ is using std::atomic.
In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
No, it may have the desired effect, depending on how it interacts with your platform's cache hardware. This is brittle - it could change any time you upgrade a CPU, or add another one, or change your scheduler behaviour - and it certainly isn't portable.
If you're really just tracking how many workers are still working, sane methods might be a semaphore (synchronized counter), or mutex+condvar+integer count. Either is likely more efficient than busy-looping with a sleep.
If you're wedded to the busy loop, you could still reasonably have a single counter, such as std::atomic<size_t>, which is set by wakeup_cleaners and decremented as each cleaner completes. Then you can just wait for it to reach zero.
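A rough sketch of that counter approach (the names pending, wakeup_cleaners_sketch and worker_done are invented, and it assumes the caller knows how many cleaners it is waking):

#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

std::atomic<std::size_t> pending(0);

void wakeup_cleaners_sketch(std::size_t n) {
    pending.store(n, std::memory_order_release);
    // ... wake the n worker threads ...
}

void worker_done() {                        // called by each cleaner when it finishes
    pending.fetch_sub(1, std::memory_order_acq_rel);
}

void force_all_sketch(std::size_t n) {
    wakeup_cleaners_sketch(n);
    while (pending.load(std::memory_order_acquire) != 0)
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
}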
If you really want a busy loop and really prefer to scan the array each time, it should be an array of std::atomic<bool>. That way you can decide what consistency you need from each load, and it will control both the compiler optimizations and the memory hardware appropriately.
Apparently, volatile does what is necessary for your example. The topic of the volatile qualifier itself is too broad: you can start by searching "C++ volatile vs atomic" etc. There are a lot of articles and questions & answers on the internet, e.g. Concurrency: Atomic and volatile in C++11 memory model.
Briefly, volatile tells the compiler to disable some aggressive optimizations, particularly, to read the variable each time it is accessed (rather than storing it in a register or cache). There are compilers which do more, making volatile act more like std::atomic: see the Microsoft Specific section here. In your case, disabling an aggressive optimization is exactly what was necessary.
However, volatile doesn't define the order of execution of the statements around it. That is why you need memory ordering in case you need to do something else with the data after the flags you check have been set.
For inter-thread communication it is appropriate to use std::atomic, particularly, you need to refactor _requested_volumes[vol] to be of type std::atomic<bool> or even std::atomic_flag: http://en.cppreference.com/w/cpp/atomic/atomic .
An article that discourages usage of volatile and explains that volatile can be used only in rare special cases (connected with hardware I/O): https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt

Does as-if rule prevent compiler reordering of accesses to global/member variables?

I'm studying the effects that compiler optimizations (specifically, instruction reordering here) may have for a multi-threaded program.
Let's say we have a reader thread and a writer thread.
// Global shared data between threads
bool data;
bool flag = false;
// writer.cpp
void writer()
{
    data = true; // (1)
    flag = true; // (2)
}
// reader.cpp
void reader()
{
    if (flag)
    {
        cout << data;
    }
}
May a C++11-compliant compiler reorder instruction (1) and (2)?
According to C++ "as-if" rule, the transformation shouldn't change a program's observable behavior. Apparently, when compiling the writer, a compiler generally can't be sure whether reordering (1) and (2) will change the program's observable behavior or not, because data and flag are both globals which may affect another thread's observable behavior.
But it states here that this kind of reordering can happen, see memory ordering at compile time.
So do we need a compiler barrier between (1) and (2)? (I'm well aware of possible CPU reordering. This question is only on compiler reordering)
Absolutely they may. The compiler has no obligation whatsoever to consider side effects to other threads or hardware.
The compiler is only forced to consider this if you use volatile or synchronization (and those two are not interchangable).
The Standard memory model is known as SC-DRF, or Sequentially Consistent Data Race Free. A data race would be exactly the scenario you've just described - where one thread is observing non-synchronized variables whilst another is mutating them. This is undefined behaviour. Effectively, the Standard explicitly gives the compiler the freedom to assume that there are no other threads or hardware reading non-volatile, non-synchronized variables. It is absolutely legitimate for a compiler to optimize on that basis.
By the way, that link is kinda crap. His "fix" does not fix anything at all. The only correct fix for multi-threaded code is to use synchronization.
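One conventional way to synchronize this, sketched with the question's variables: make flag an atomic and use release/acquire so the store to data is ordered before the flag update, and the read of data after it.

#include <atomic>
#include <iostream>

bool data;
std::atomic<bool> flag(false);

void writer() {
    data = true;                                   // (1)
    flag.store(true, std::memory_order_release);   // (2) cannot be moved before (1)
}

void reader() {
    if (flag.load(std::memory_order_acquire)) {    // pairs with the release store
        std::cout << data;                         // guaranteed to read true
    }
}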
You have already asked this question on comp.lang.c++.moderated a week and a half ago, and you were given a complete explanation about the relationship between the C++11 requirements for sequencing within a single thread and for "happens before" relationships on synchronization between threads, here: https://groups.google.com/forum/#!topic/comp.lang.c++.moderated/43_laZwVXYg
What part of that reply didn't you understand, and we can try again?