Is write guaranteed with one thread writing and another reading non-atomic

Is write guaranteed with one thread writing and another reading non-atomic - c++

Say I have
bool unsafeBool = false;
int main()
{
std::thread reader = std::thread([](){
std::this_thread::sleep_for(1ns);
if(unsafeBool)
std::cout << "unsafe bool is true" << std::endl;
});
std::thread writer = std::thread([](){
unsafeBool = true;
});
reader.join();
writer.join();
}
Is it guaranteed that unsafeBool becomes true after writer finishes. I know that it is undefined behavior what reader outputs but the write should be fine as far as I understand.

UB is and stays UB, there can be reasoning about why things are happinging the way they happen, though, you ain't allowed to rely on that.
You have a race condition, fix it by either: adding a lock or changing the type to an atomic.
Since you have UB in your code, your compiler is allowed to assume that doesn't happen. If it can detect this, it can change your complete function to a noop, as it can never be called in a valid program.
If it doesn't do so, the behaviour will depend on your processor and the caching linked to it. There, if the code after the joins uses the same core as the thread that read the boolean (before the join), you might even still have false in there without the need to invalidate the cache.
In practice, using Intel X86 processors, you will not see a lot of side effects from the race conditions as it has been made to invalidate the caches on write.

After writer.join() it guaranteed that unsafeBool == true. But in reader thread the access to it is a data race.

Some implementations guarantee that any attempt to read the value of a word-size-or-smaller object that isn't qualified volatile around the time that it changes will either yield an old or new value, chosen arbitrarily. In cases where this guarantee would be useful, the cost for a compiler to consistently uphold it would generally be less than the cost of working around its absence (among other things, because any ways by which programmers could work around its absence would restrict a compiler's freedom to choose between an old or new value).
In some other implementations, however, even operations that would seem like they should involve a single read of a value might yield code that combines the results from multiple reads. When ARM gcc 9.2.1 is invoked with command-line arguments -xc -O2 -mcpu=cortex-m0 and given:
#include <stdint.h>
#include <string.h>
#if 1
uint16_t test(uint16_t *p)
{
uint16_t q = *p;
return q - (q >> 15);
}
it generates code which reads from *p and then from *(int16_t*)p, shifts the latter right by 15, and adds that to the former. If the value of *p were to change between the two reads, this could cause the function to return 0xFFFF, a value which should be impossible.
Unfortunately, many people who design compilers so that they will always refrain from "splitting" reads in such fashion think such behavior is sufficiently natural and obvious that there's no particular reason to expressly document the fact that they never do anything else. Meanwhile, some other compiler writers figure that because the Standard allows compilers to split reads even when there's no reason to (splitting the read in the above code makes it bigger and slower than it would be if it simply read the value once) any code that would rely upon compilers refraining from such "optimizations" is "broken".

Related

is a concurent write and read to a non-atomic variable of fundamental type without using it undefined behavior?

in a lock-free queue.pop(), I read a trivialy_copyable variable (of integral type) after synchronization with an atomic aquire inside a loop.
Minimized pseudo code:
//somewhere else writePosition.store(...,release)
bool pop(size_t & returnValue){
writePosition = writePosition.load(aquire)
oldReadPosition = readPosition.load(relaxed)
size_t value{};
do{
value = data[oldReadPosition]
newReadPosition = oldReadPosition+1
}while(readPosition.compare_exchange(oldReadPosition, newReadPosition, relaxed)
// here we are owner of the value
returnValue = value;
return true;
}
the memory of data[oldReadPosition] can only be changed iff this value was read from another thread bevor.
read and write Positions are ABA safe.
with a simple copy, value = data[oldReadPosition] the memory of data[oldReadPosition] will not be changed.
but a write thread queue.push(...) can change data[oldReadPosition] while reading, iff another thread has already read oldPosition and changed the readPosition.
it would be a race condition, if you use the value, but is it also a race condition, and thus undefined behavior, when we leave value untouched? the standard is not spezific enough or I don´t understand it.
imo, this should be possible, because it has no effect.
I would be very happy to get an qualified answer to get deeper insights
thanks a lot

Yes, it's UB in ISO C++; value = data[oldReadPosition] in the C++ abstract machine involves reading the value of that object. (Usually that means lvalue to rvalue conversion, IIRC.)
But it's mostly harmless, probably only going to be a problem on machines with hardware race detection (not normal mainstream CPUs, but possibly on C implementations like clang with threadsanitizer).
Another use-case for non-atomic read and then checking for possible tearing is the SeqLock, where readers can prove no tearing by reading the same value from an atomic counter before and after the non-atomic read. It's UB in C++, even with volatile for the non-atomic data, although that may be helpful in making sure the compiler-generated asm is safe. (With memory barriers and current handling of atomics by existing compilers, even non-volatile makes working asm). See Optimal way to pass a few variables between 2 threads pinning different CPUs
atomic_thread_fence is still necessary for a SeqLock to be safe, and some of the necessary ordering of atomic loads wrt. non-atomic may be an implementation detail if it can't sync with something and create a happens-before.
People do use Seq Locks in real life, depending on the fact that real-life compilers de-facto define a bit more behaviour than ISO C++. Or another way to put it is that happen to work for now; if you're careful about what code you put around the non-atomic read it's unlikely for a compiler to be able to do anything problematic.
But you're definitely venturing out past the safe area of guaranteed behaviour, and probably need to understand how C++ compiles to asm, and how asm works on the target platforms you care about; see also Who's afraid of a big bad optimizing compiler? on LWN; it's aimed at Linux kernel code, which is the main user of hand-rolled atomics and stuff like that.

C++ Are copies of variables optimised out?

Given a single core CPU embedded environment where reading and writing of variables is guaranteed to be atomic, and the following example:
struct Example
{
bool TheFlag;
void SetTheFlag(bool f) {
Theflag = f;
}
void UseTheFlag() {
if (TheFlag) {
// Do some stuff that has no effect on TheFlag
}
// Do some more stuff that has no effect on TheFlag
if (TheFlag) {
...
}
}
};
It is clear that if SetTheFlag was called by chance on another thread (or interrupt) between the two uses of TheFlag in UseTheFlag, there could be unexpected behavior (or some could argue it is expected behavior in this case!).
Can the following workaround be used to guarantee behavior?
void UseTheFlag() {
auto f = TheFlag;
if (f) {
// Do some stuff that has no effect on TheFlag
}
// Do some more stuff that has no effect on TheFlag
if (f) {
...
}
}
My practical testing showed the variable f is never optimised out and copied once from TheFlag (GCC 10, ARM Cortex M4). But, I would like to know for sure is it guaranteed by the compiler that f will not be optimised out?
I know there are better design practices, critical sections, disabling interrupts etc, but this question is about the behavior of compiler optimisation in this use case.

You might consider this from the point of view of the "as-if" rule, which, loosely stated, states that any optimisations applied by the compiler must not change the original meaning of the code.
So, unless the compiler can prove that TheFlag doesn't change during the lifetime of f, it is obliged to make a local copy.
That said, I'm not sure if 'proof' extends to modifications made to TheFlag in another thread or ISR. Marking TheFlag as atomic (or volatile, for an ISR) might help there.

The C++ standard does not say anything about what will happen in this case. It's just UB, since an object can be modified in one thread while another thread is accessing it.
You only say the platform specifies that these operations are atomic. Obviously, that isn't enough to ensure this code operates correctly. Atomicity only guarantees that two concurrent writes will leave the value as one of the two written values and that a read during one or more writes will never see a value not written. It says nothing about what happens in cases like this.
There is nothing wrong with any optimization that breaks this code. In particularly, atomicity does not prevent a read operation in another thread from seeing a value written before that read operation unless something known to synchronize was used.
If the compiler sees register pressure, nothing prevents it from simply reading TheFlag twice rather than creating a local copy. If the compile can deduce that the intervening code in this thread cannot modify TheFlag, the optimization is legal. Optimizers don't have to take into account what other threads might do unless you follow the rules and use things defined to synchronize or only require the explicit guarantees atomicity replies.
You go beyond that, so all bets are off. You need more than atomicity for TheFlag, so don't use a type that is merely atomic -- it isn't enough.

Once more volatile: necessary to prevent optimization?

I've been reading a lot about the 'volatile' keyword but I still don't have a definitive answer.
Consider this code:
class A
{
public:
void work()
{
working = true;
while(working)
{
processSomeJob();
}
}
void stopWorking() // Can be called from another thread
{
working = false;
}
private:
bool working;
}
As work() enters its loop the value of 'working' is true.
Now I'm guessing the compiler is allowed to optimize the while(working) to while(true) as the value of 'working' is true when starting the loop.
If this is not the case, that would mean something like this would be quite inefficient:
for(int i = 0; i < someOtherClassMember; i++)
{
doSomething();
}
...as the value of someOtherClassMember would have to be loaded each iteration.
If this is the case, I would think 'working' has to be volatile in order to prevent the compiler from optimising it.
Which of these two is the case? When googling the use of volatile I find people claiming it's only useful when working with I/O devices writing to memory directly, but I also find claims that it should be used in a scenario like mine.

Your program will get optimized into an infinite loop†.
void foo() { A{}.work(); }
gets compiled to (g++ with O2)
foo():
sub rsp, 8
.L2:
call processSomeJob()
jmp .L2
The standard defines what a hypothetical abstract machine would do with a program. Standard-compliant compilers have to compile your program to behave the same way as that machine in all observable behaviour. This is known as the as-if rule, the compiler has freedom as long as what your program does is the same, regardless of how.
Normally, reading and writing to a variable doesn't constitute as observable, which is why a compiler can elide as much reads and writes as it likes. The compiler can see working doesn't get assigned to and optimizes the read away. The (often misunderstood) effect of volatile is exactly to make them observable, which forces the compilers to leave the reads and writes alone‡.
But wait you say, another thread may assign to working. This is where the leeway of undefined behaviour comes in. The compiler may do anything when there is undefined behaviour, including formatting your hard drive and still be standard-compliant. Since there are no synchronization and working isn't atomic, any other thread writing to working is a data race, which is unconditionally undefined behaviour. Therefore, the only time an infinite loop is wrong is when there is undefined behaviour, by which the compiler decided your program might as well keep on looping.
TL;DR Don't use plain bool and volatile for multi-threading. Use std::atomic<bool>.
†Not in all situations. void bar(A& a) { a.work(); } doesn't for some versions.
‡Actually, there is some debate around this.

Now I'm guessing the compiler is allowed to optimize the while(working) to while(true)
Potentially, yes. But only if it can prove that processSomeJob() does not modify the working variable i.e. if it can prove that the loop is infinite.
If this is not the case, that would mean something like this would be quite inefficient ... as the value of someOtherClassMember would have to be loaded each iteration
Your reasoning is sound. However, the memory location might remain in cache, and reading from CPU cache isn't necessarily significantly slow. If doSomething is complex enough to cause someOtherClassMember to be evicted from the cache, then sure we'd have to load from memory, but on the other hand doSomething might be so complex that a single memory load is insignificant in comparison.
Which of these two is the case?
Either. The optimiser will not be able to analyse all possible code paths; we cannot assume that the loop could be optimised in all cases. But if someOtherClassMember is provably not modified in any code paths, then proving it would be possible in theory, and therefore the loop can be optimised in theory.
but I also find claims that [volatile] should be used in a scenario like mine.
volatile doesn't help you here. If working is modified in another thread, then there is a data race. And data race means that the behaviour of the program is undefined.
To avoid a data race, you need synchronisation: Either use a mutex, or atomic operations to share access across threads.

Volatile will make the while loop reload the working variable on every check. Practically that will often allow you to stop the working function with a call to stopWorking made from an asynchronous signal handler or another thread, but as per the standard it's not enough. The standard requires lock-free atomics or variables of type volatile sig_atomic_t for sighandler <-> regular context communication and atomics for inter-thread communication.

Why don't compilers merge redundant std::atomic writes?

I'm wondering why no compilers are prepared to merge consecutive writes of the same value to a single atomic variable, e.g.:
#include <atomic>
std::atomic<int> y(0);
void f() {
auto order = std::memory_order_relaxed;
y.store(1, order);
y.store(1, order);
y.store(1, order);
}
Every compiler I've tried will issue the above write three times. What legitimate, race-free observer could see a difference between the above code and an optimized version with a single write (i.e. doesn't the 'as-if' rule apply)?
If the variable had been volatile, then obviously no optimization is applicable. What's preventing it in my case?
Here's the code in compiler explorer.

The C++11 / C++14 standards as written do allow the three stores to be folded/coalesced into one store of the final value. Even in a case like this:
y.store(1, order);
y.store(2, order);
y.store(3, order); // inlining + constant-folding could produce this in real code
The standard does not guarantee that an observer spinning on y (with an atomic load or CAS) will ever see y == 2. A program that depended on this would have a data race bug, but only the garden-variety bug kind of race, not the C++ Undefined Behaviour kind of data race. (It's UB only with non-atomic variables). A program that expects to sometimes see it is not necessarily even buggy. (See below re: progress bars.)
Any ordering that's possible on the C++ abstract machine can be picked (at compile time) as the ordering that will always happen. This is the as-if rule in action. In this case, it's as if all three stores happened back-to-back in the global order, with no loads or stores from other threads happening between the y=1 and y=3.
It doesn't depend on the target architecture or hardware; just like compile-time reordering of relaxed atomic operations are allowed even when targeting strongly-ordered x86. The compiler doesn't have to preserve anything you might expect from thinking about the hardware you're compiling for, so you need barriers. The barriers may compile into zero asm instructions.
So why don't compilers do this optimization?
It's a quality-of-implementation issue, and can change observed performance / behaviour on real hardware.
The most obvious case where it's a problem is a progress bar. Sinking the stores out of a loop (that contains no other atomic operations) and folding them all into one would result in a progress bar staying at 0 and then going to 100% right at the end.
There's no C++11 std::atomic way to stop them from doing it in cases where you don't want it, so for now compilers simply choose never to coalesce multiple atomic operations into one. (Coalescing them all into one operation doesn't change their order relative to each other.)
Compiler-writers have correctly noticed that programmers expect that an atomic store will actually happen to memory every time the source does y.store(). (See most of the other answers to this question, which claim the stores are required to happen separately because of possible readers waiting to see an intermediate value.) i.e. It violates the principle of least surprise.
However, there are cases where it would be very helpful, for example avoiding useless shared_ptr ref count inc/dec in a loop.
Obviously any reordering or coalescing can't violate any other ordering rules. For example, num++; num--; would still have to be full barrier to runtime and compile-time reordering, even if it no longer touched the memory at num.
Discussion is under way to extend the std::atomic API to give programmers control of such optimizations, at which point compilers will be able to optimize when useful, which can happen even in carefully-written code that isn't intentionally inefficient. Some examples of useful cases for optimization are mentioned in the following working-group discussion / proposal links:
http://wg21.link/n4455: N4455 No Sane Compiler Would Optimize Atomics
http://wg21.link/p0062: WG21/P0062R1: When should compilers optimize atomics?
See also discussion about this same topic on Richard Hodges' answer to Can num++ be atomic for 'int num'? (see the comments). See also the last section of my answer to the same question, where I argue in more detail that this optimization is allowed. (Leaving it short here, because those C++ working-group links already acknowledge that the current standard as written does allow it, and that current compilers just don't optimize on purpose.)
Within the current standard, volatile atomic<int> y would be one way to ensure that stores to it are not allowed to be optimized away. (As Herb Sutter points out in an SO answer, volatile and atomic already share some requirements, but they are different). See also std::memory_order's relationship with volatile on cppreference.
Accesses to volatile objects are not allowed to be optimized away (because they could be memory-mapped IO registers, for example).
Using volatile atomic<T> mostly fixes the progress-bar problem, but it's kind of ugly and might look silly in a few years if/when C++ decides on different syntax for controlling optimization so compilers can start doing it in practice.
I think we can be confident that compilers won't start doing this optimization until there's a way to control it. Hopefully it will be some kind of opt-in (like a memory_order_release_coalesce) that doesn't change the behaviour of existing code C++11/14 code when compiled as C++whatever. But it could be like the proposal in wg21/p0062: tag don't-optimize cases with [[brittle_atomic]].
wg21/p0062 warns that even volatile atomic doesn't solve everything, and discourages its use for this purpose. It gives this example:
if(x) {
foo();
y.store(0);
} else {
bar();
y.store(0); // release a lock before a long-running loop
for() {...} // loop contains no atomics or volatiles
}
// A compiler can merge the stores into a y.store(0) here.
Even with volatile atomic<int> y, a compiler is allowed to sink the y.store() out of the if/else and just do it once, because it's still doing exactly 1 store with the same value. (Which would be after the long loop in the else branch). Especially if the store is only relaxed or release instead of seq_cst.
volatile does stop the coalescing discussed in the question, but this points out that other optimizations on atomic<> can also be problematic for real performance.
Other reasons for not optimizing include: nobody's written the complicated code that would allow the compiler to do these optimizations safely (without ever getting it wrong). This is not sufficient, because N4455 says LLVM already implements or could easily implement several of the optimizations it mentioned.
The confusing-for-programmers reason is certainly plausible, though. Lock-free code is hard enough to write correctly in the first place.
Don't be casual in your use of atomic weapons: they aren't cheap and don't optimize much (currently not at all). It's not always easy easy to avoid redundant atomic operations with std::shared_ptr<T>, though, since there's no non-atomic version of it (although one of the answers here gives an easy way to define a shared_ptr_unsynchronized<T> for gcc).

You are referring to dead-stores elimination.
It is not forbidden to eliminate an atomic dead store but it is harder to prove that an atomic store qualifies as such.
Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones.
Optimizers have to be careful to avoid doing so across synchronization points because another thread of execution can observe or modify memory, which means that the traditional optimizations have to consider more intervening instructions than they usually would when considering optimizations to atomic operations.
In the case of dead store elimination it isn’t sufficient to prove that an atomic store post-dominates and aliases another to eliminate the other store.
from N4455 No Sane Compiler Would Optimize Atomics
The problem of atomic DSE, in the general case, is that it involves looking for synchronization points, in my understanding this term means points in the code where there is happen-before relationship between an instruction on a thread A and instruction on another thread B.
Consider this code executed by a thread A:
y.store(1, std::memory_order_seq_cst);
y.store(2, std::memory_order_seq_cst);
y.store(3, std::memory_order_seq_cst);
Can it be optimised as y.store(3, std::memory_order_seq_cst)?
If a thread B is waiting to see y = 2 (e.g. with a CAS) it would never observe that if the code gets optimised.
However, in my understanding, having B looping and CASsing on y = 2 is a data race as there is not a total order between the two threads' instructions.
An execution where the A's instructions are executed before the B's loop is observable (i.e. allowed) and thus the compiler can optimise to y.store(3, std::memory_order_seq_cst).
If threads A and B are synchronized, somehow, between the stores in thread A then the optimisation would not be allowed (a partial order would be induced, possibly leading to B potentially observing y = 2).
Proving that there is not such a synchronization is hard as it involves considering a broader scope and taking into account all the quirks of an architecture.
As for my understanding, due to the relatively small age of the atomic operations and the difficulty in reasoning about memory ordering, visibility and synchronization, compilers don't perform all the possible optimisations on atomics until a more robust framework for detecting and understanding the necessary conditions is built.
I believe your example is a simplification of the counting thread given above, as it doesn't have any other thread or any synchronization point, for what I can see, I suppose the compiler could have optimised the three stores.

While you are changing the value of an atomic in one thread, some other thread may be checking it and performing an operation based on the value of the atomic. The example you gave is so specific that compiler developers don't see it worth optimizing. However, if one thread is setting e.g. consecutive values for an atomic: 0, 1, 2, etc., the other thread may be putting something in the slots indicated by the value of the atomic.

NB: I was going to comment this but it's a bit too wordy.
One interesting fact is that this behavior isn't in the terms of C++ a data race.
Note 21 on p.14 is interesting: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3690.pdf (my emphasis):
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is
not atomic
Also on p.11 note 5 :
“Relaxed” atomic operations are not synchronization operations even
though, like synchronization operations, they cannot contribute to
data races.
So a conflicting action on an atomic is never a data race - in terms of the C++ standard.
These operations are all atomic (and specifically relaxed) but no data race here folks!
I agree there's no reliable/predictable difference between these two on any (reasonable) platform:
include <atomic>
std::atomic<int> y(0);
void f() {
auto order = std::memory_order_relaxed;
y.store(1, order);
y.store(1, order);
y.store(1, order);
}
and
include <atomic>
std::atomic<int> y(0);
void f() {
auto order = std::memory_order_relaxed;
y.store(1, order);
}
But within the definition provided C++ memory model it isn't a data race.
I can't easily understand why that definition is provided but it does hand the developer a few cards to engage in haphazard communication between threads that they may know (on their platform) will statistically work.
For example, setting a value 3 times then reading it back will show some degree of contention for that location. Such approaches aren't deterministic but many effective concurrent algorithms aren't deterministic.
For example, a timed-out try_lock_until() is always a race condition but remains a useful technique.
What it appears the C++ Standard is providing you with certainty around 'data races' but permitting certain fun-and-games with race conditions which are on final analysis different things.
In short the standard appears to specify that where other threads may see the 'hammering' effect of a value being set 3 times, other threads must be able to see that effect (even if they sometimes may not!).
It's the case where pretty much all modern platforms that other thread may under some circumstances see the hammering.

In short, because the standard (for example the paragaraphs around and below 20 in [intro.multithread]) disallows for it.
There are happens-before guarantees which must be fulfilled, and which among other things rule out reordering or coalescing writes (paragraph 19 even says so explicitly about reordering).
If your thread writes three values to memory (let's say 1, 2, and 3) one after another, a different thread may read the value. If, for example, your thread is interrupted (or even if it runs concurrently) and another thread also writes to that location, then the observing thread must see the operations in exactly the same order as they happen (either by scheduling or coincidence, or whatever reason). That's a guarantee.
How is this possible if you only do half of the writes (or even only a single one)? It isn't.
What if your thread instead writes out 1 -1 -1 but another one sporadically writes out 2 or 3? What if a third thread observes the location and waits for a particular value that just never appears because it's optimized out?
It is impossible to provide the guarantees that are given if stores (and loads, too) aren't performed as requested. All of them, and in the same order.

A practical use case for the pattern, if the thread does something important between updates that does not depend on or modify y, might be: *Thread 2 reads the value of y to check how much progress Thread 1 has made.`
So, maybe Thread 1 is supposed to load the configuration file as step 1, put its parsed contents into a data structure as step 2, and display the main window as step 3, while Thread 2 is waiting on step 2 to complete so it can perform another task in parallel that depends on the data structure. (Granted, this example calls for acquire/release semantics, not relaxed ordering.)
I’m pretty sure a conforming implementation allows Thread 1 not to update y at any intermediate step—while I haven’t pored over the language standard, I would be shocked if it does not support hardware on which another thread polling y might never see the value 2.
However, that is a hypothetical instance where it might be pessimal to optimize away the status updates. Maybe a compiler dev will come here and say why that compiler chose not to, but one possible reason is letting you shoot yourself in the foot, or at least stub yourself in the toe.

The compiler writer cannot just perform the optimisation. They must also convince themselves that the optimisation is valid in the situations where the compiler writer intends to apply it, that it will not be applied in situations where it is not valid, that it doesn't break code that is in fact broken but "works" on other implementations. This is probably more work than the optimisation itself.
On the other hand, I could imagine that in practice (that is in programs that are supposed to do a job, and not benchmarks), this optimisation will save very little in execution time.
So a compiler writer will look at the cost, then look at the benefit and the risks, and probably will decide against it.

Let's walk a little further away from the pathological case of the three stores being immediately next to each other. Let's assume there's some non-trivial work being done between the stores, and that such work does not involve y at all (so that data path analysis can determine that the three stores are in fact redundant, at least within this thread), and does not itself introduce any memory barriers (so that something else doesn't force the stores to be visible to other threads). Now it is quite possible that other threads have an opportunity to get work done between the stores, and perhaps those other threads manipulate y and that this thread has some reason to need to reset it to 1 (the 2nd store). If the first two stores were dropped, that would change the behaviour.

Since variables contained within an std::atomic object are expected to be accessed from multiple threads, one should expect that they behave, at a minimum, as if they were declared with the volatile keyword.
That was the standard and recommended practice before CPU architectures introduced cache lines, etc.
[EDIT2] One could argue that std::atomic<> are the volatile variables of the multicore age. As defined in C/C++, volatile is only good enough to synchronize atomic reads from a single thread, with an ISR modifying the variable (which in this case is effectively an atomic write as seen from the main thread).
I personally am relieved that no compiler would optimize away writes to an atomic variable. If the write is optimized away, how can you guarantee that each of these writes could potentially be seen by readers in other threads? Don't forget that that is also part of the std::atomic<> contract.
Consider this piece of code, where the result would be greatly affected by wild optimization by the compiler.
#include <atomic>
#include <thread>
static const int N{ 1000000 };
std::atomic<int> flag{1};
std::atomic<bool> do_run { true };
void write_1()
{
while (do_run.load())
{
flag = 1; flag = 1; flag = 1; flag = 1;
flag = 1; flag = 1; flag = 1; flag = 1;
flag = 1; flag = 1; flag = 1; flag = 1;
flag = 1; flag = 1; flag = 1; flag = 1;
}
}
void write_0()
{
while (do_run.load())
{
flag = -1; flag = -1; flag = -1; flag = -1;
}
}
int main(int argc, char** argv)
{
int counter{};
std::thread t0(&write_0);
std::thread t1(&write_1);
for (int i = 0; i < N; ++i)
{
counter += flag;
std::this_thread::yield();
}
do_run = false;
t0.join();
t1.join();
return counter;
}
[EDIT] At first, I was not advancing that the volatile was central to the implementation of atomics, but...
Since there seemed to be doubts as to whether volatile had anything to do with atomics, I investigated the matter. Here's the atomic implementation from the VS2017 stl. As I surmised, the volatile keyword is everywhere.
// from file atomic, line 264...
// TEMPLATE CLASS _Atomic_impl
template<unsigned _Bytes>
struct _Atomic_impl
{ // struct for managing locks around operations on atomic types
typedef _Uint1_t _My_int; // "1 byte" means "no alignment required"
constexpr _Atomic_impl() _NOEXCEPT
: _My_flag(0)
{ // default constructor
}
bool _Is_lock_free() const volatile
{ // operations that use locks are not lock-free
return (false);
}
void _Store(void *_Tgt, const void *_Src, memory_order _Order) volatile
{ // lock and store
_Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
}
void _Load(void *_Tgt, const void *_Src,
memory_order _Order) const volatile
{ // lock and load
_Atomic_copy(&_My_flag, _Bytes, _Tgt, _Src, _Order);
}
void _Exchange(void *_Left, void *_Right, memory_order _Order) volatile
{ // lock and exchange
_Atomic_exchange(&_My_flag, _Bytes, _Left, _Right, _Order);
}
bool _Compare_exchange_weak(
void *_Tgt, void *_Exp, const void *_Value,
memory_order _Order1, memory_order _Order2) volatile
{ // lock and compare/exchange
return (_Atomic_compare_exchange_weak(
&_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
}
bool _Compare_exchange_strong(
void *_Tgt, void *_Exp, const void *_Value,
memory_order _Order1, memory_order _Order2) volatile
{ // lock and compare/exchange
return (_Atomic_compare_exchange_strong(
&_My_flag, _Bytes, _Tgt, _Exp, _Value, _Order1, _Order2));
}
private:
mutable _Atomic_flag_t _My_flag;
};
All of the specializations in the MS stl use volatile on the key functions.
Here's the declaration of one of such key function:
inline int _Atomic_compare_exchange_strong_8(volatile _Uint8_t *_Tgt, _Uint8_t *_Exp, _Uint8_t _Value, memory_order _Order1, memory_order _Order2)
You will notice the required volatile uint8_t* holding the value contained in the std::atomic. This pattern can be observed throughout the MS std::atomic<> implementation, Here is no reason for the gcc team, nor any other stl provider to have done it differently.

Reading shared variables with relaxed ordering: is it possible in theory? Is it possible in C++?

Consider the following pseudocode:
expected = null;
if (variable == expected)
{
atomic_compare_exchange_strong(
&variable, expected, desired(), memory_order_acq_rel, memory_order_acq);
}
return variable;
Observe there are no "acquire" semantics when the variable == expected check is performed.
It seems to me that desired will be called at least once in total, and at most once per thread.
Furthermore, if desired never returns null, then this code will never return null.
Now, I have three questions:
Is the above necessarily true? i.e., can we really have well-ordered reads of shared variables even in the absence of fences on every read?
Is it possible to implement this in C++? If so, how? If not, why?
(Hopefully with a rationale, not just "because the standard says so".)
If the answer to (2) is yes, then is it also possible to implement this in C++ without requiring variable == expected to perform an atomic read of variable?
Basically, my goal is to understand if it is possible to perform lazy-initialization of a shared variable in a manner that has performance identical to that of a non-shared variable once the code has been executed at least once by each thread?
(This is somewhat of a "language-lawyer" question. So that implies the question isn't about whether this is a good or useful idea, but rather about whether it's technically possible to do this correctly.)

Regarding the question whether it is possible to perform lazy initialisation of a shared variable in C++, that has a performance (almost) identical to that of a non-shared variable:
The answer is, that it depends on the hardware architecture, and the implementation of the compiler and run-time environment. At least, it is possible in some environments. In particular on x86 with GCC and Clang.
On x86, atomic reads can be implemented without memory fences. Basically, an atomic read is identical to a non-atomic read. Take a look at the following compilation unit:
std::atomic<int> global_value;
int load_global_value() { return global_value.load(std::memory_order_seq_cst); }
Although I used an atomic operation with sequential consistency (the default), there is nothing special in the generated code. The assembler code generated by GCC and Clang looks as follows:
load_global_value():
movl global_value(%rip), %eax
retq
I said almost identical, because there are other reasons that might impact the performance. For example:
although there is no fence, the atomic operations still prevent some compiler optimisations, e.g. reordering instructions and elimination of stores and loads
if there is at least one thread, that writes to a different memory location on the same cache line, it will have a huge impact on the performance (known as false sharing)
Having said that, the recommended way to implement lazy initialisation is to use std::call_once. That should give you the best result for all compilers, environments and target architectures.
std::once_flag _init;
std::unique_ptr<gadget> _gadget;
auto get_gadget() -> gadget&
{
std::call_once(_init, [this] { _gadget.reset(new gadget{...}); });
return *_gadget;
}

This is undefined behavior. You're modifying variable, at
least in some thread, which means that all accesses to
variable must be protected. In particular, when you're
executing the atomic_compare_exchange_strong in one thread,
there is nothing to guarantee that another thread might see the
new value of variable before it sees the writes that might
have occurred in desired(). (atomic_compare_exchange_strong
only guarantees any ordering in the thread that executes it.)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js