Is store reordering allowed by C++ as-if rule? - c++

The "as-if" rule basically defines what transformations an implementation is allowed to perform on a legal C++ program. In short, all transformations that do not affect a program's observable behavior are allowed.
As to what exactly "observable behavior" means, cppreference.com seems to give a different definition from the one in the Standard, regarding input/output. I'm not sure if that's a reinterpretation of the Standard or a mistake.
"as-if" rule by cppreference.com:
All input and output operations occur in the same order and with the
same content as if the program was executed as written.
"as-if" rule by the Standard:
The input and output dynamics of interactive devices shall take place
in such a fashion that prompting output is actually delivered before a
program waits for input. What constitutes an interactive device is
implementation-defined
This difference is important to me because I want to know if a normal store reordering is a valid compiler optimization or not. Per cppreference's wording, a memory store should count as one of the output operations it mentions. But according to the Standard, a memory store doesn't seem to be part of the output dynamics of interactive devices. (What are interactive devices anyway?)
An example follows.
int A = 0;
int B = 0;
void foo()
{
A = B + 1; // (1)
B = 1; // (2)
}
A modern compiler may generate the following code for function foo:
mov 0x804a018, %eax  ; load B into %eax
movl $0x1, 0x804a018 ; store 1 to B
add $0x1, %eax       ; %eax = B + 1
mov %eax, 0x804a01c  ; store B + 1 (here 1) to A
ret
As seen, the store to A is reordered with the store to B. Is it compliant with the "as-if" rule? Is this kind of reordering permitted by the Standard?

If cppreference.com disagrees with the actual text of the C++ standard, cppreference.com is wrong. The only things that can supersede the text of the standard are a newer version of the standard, and official resolutions of defect reports (which sometimes get rolled up into documents called "technical corrigenda", which is a fancy name for a minor release of the standard).
However, in this case, you have misunderstood what cppreference.com means by "input and output operations". (If memory serves, that text is taken verbatim from an older version of the standard.) Stores to memory are NOT output operations. Only writing to a file (that is, any stdio.h or iostream output stream, or other implementation-defined mechanism, e.g. a Unix file descriptor) counts as output for purposes of this rule.
The C and C++ standards, prior to their 2011 revisions, assumed a single-threaded abstract machine, and therefore did not bother specifying anything about store ordering, because there was no way to observe stores out of program order. C(++)11 added a whole bunch of rules for store ordering as part of the new multithreading specification.

The real articulation of the as-if rule is found at §1.9/8 in the standard:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
At program termination, all data written into files shall be identical to one of the possible results that
execution of the program according to the abstract semantics would have produced.
The input and output dynamics of interactive devices shall take place in such a fashion that prompting
output is actually delivered before a program waits for input. What constitutes an interactive device
is implementation-defined.
Since A and B are not volatile, this reordering is allowed.
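If the ordering of the two stores actually needs to be visible to another thread, the program has to say so with the tools the standard provides. A minimal sketch, assuming std::atomic is acceptable here (the reader function and the chosen memory orderings are illustrative additions, not part of the question):
#include <atomic>

int A = 0;                 // plain object, published via the release store below
std::atomic<int> B{0};

void foo()
{
    A = B.load(std::memory_order_relaxed) + 1;  // (1)
    B.store(1, std::memory_order_release);      // (2) release: makes the preceding store to A
                                                //     visible to any thread that acquire-loads
                                                //     B and sees the value 1
}

int reader()
{
    // A reader that acquire-loads B and sees 1 is guaranteed to also see A == 1.
    if (B.load(std::memory_order_acquire) == 1)
        return A;          // no data race: the acquire synchronizes with the release
    return -1;
}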

Related

C++ volatile: guaranteed 32-bit accesses?

In my Linux C++ project, I have a hardware memory region mapped somewhere in the physical address space which I access using uint32_t pointers, after doing a mmap.
The release build of the application crashes with a SIGBUS (bus error).
This is happening because the compiler optimizes accesses to the aforementioned hardware memory into 64-bit accesses instead of sticking to 32-bit ones, hence the bus error: the hardware memory can only be accessed using 32-bit reads/writes.
I marked the uint32_t pointer as volatile.
It works, at least for this one specific code portion, because the compiler is told not to reorder the accesses, and most of the time it would have to reorder to optimize.
I know that volatile controls when the compiler accesses the memory. The question is: does volatile also tell the compiler how to access the memory, i.e. to access it exactly as the programmer instructs? Am I guaranteed that the compiler will always stick to doing 32-bit accesses to volatile uint32_t buffers?
e.g. Does volatile also guarantee that the compiler will be accessing the 2 consecutive writes to the 2 consecutive 32-bit values in the following code-snippet using 32-bit reads/writes as well?
void aFunction(volatile uint32_t* hwmem_array)
{
    [...]
    // Are we guaranteed by volatile that the following 2 consecutive writes, in consecutive memory regions,
    // are not merged into a single 64-bit write by the compiler?
    hwmem_array[0] = 0x11223344u;
    hwmem_array[1] = 0xaabbccddu;
    [...]
}
I think I answered my own question, please correct me if I'm wrong.
C99 standard draft: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
Quotes:
“
6 An object that has volatile-qualified type may be modified in ways
unknown to the implementation or have other unknown side effects.
Therefore any expression referring to such an object shall be
evaluated strictly according to the rules of the abstract machine, as
described in 5.1.2.3. Furthermore, at every sequence point the value
last stored in the object shall agree with that prescribed by the
abstract machine, except as modified by the unknown factors mentioned
previously.
”
Section 5.1.2.3:
“
2 Accessing a volatile object, modifying an object, modifying a file,
or calling a function that does any of those operations are all side
effects, which are changes in the state of the execution environment.
Evaluation of an expression may produce side effects. At certain
specified points in the execution sequence called sequence points, all
side effects of previous evaluations shall be complete and no side
effects of subsequent evaluations shall have taken place. (A summary
of the sequence points is given in annex C.)
5 The least requirements on a conforming implementation are: — At
sequence points, volatile objects are stable in the sense that
previous accesses are complete and subsequent accesses have not yet
occurred.
”
Annex C (informative) Sequence points:
“
1 The following are the sequence points described in 5.1.2.3:
[…]
— The end of a full expression: an initializer (6.7.8); the expression
in an expression statement (6.8.3); the controlling expression of a
selection statement (if or switch) (6.8.4); the controlling expression
of a while or do statement (6.8.5); each of the expressions of a for
statement (6.8.5.3); the expression in a return statement (6.8.6.4).
”
So theoretically we are guaranteed that at the end of any full expression involving a volatile object, the volatile object is written/read, as the compiler was instructed to do.
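As a practical illustration, a sketch only: the quoted wording pins down when volatile accesses happen, and in practice compilers also access a volatile object with loads/stores of the object's own width, which is what memory-mapped I/O code relies on. The helper names below are made up for illustration:
#include <cstddef>
#include <cstdint>

// Typical MMIO helpers over a base pointer obtained from mmap().
static inline void reg_write32(volatile std::uint32_t* base, std::size_t index,
                               std::uint32_t value)
{
    base[index] = value;   // a single volatile 32-bit store, complete at the
                           // sequence point that ends this full expression
}

static inline std::uint32_t reg_read32(const volatile std::uint32_t* base,
                                       std::size_t index)
{
    return base[index];    // a single volatile 32-bit load
}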

What formally guarantees that non-atomic variables can't see out-of-thin-air values and create a data race like atomic relaxed theoretically can?

This is a question about the formal guarantees of the C++ standard.
The standard points out that the rules for std::memory_order_relaxed atomic variables allow "out of thin air" / "out of the blue" values to appear.
But for non-atomic variables, can this example have UB? Is r1 == r2 == 42 possible in the C++ abstract machine? Neither variable is 42 initially, so you'd expect that neither if body executes, meaning no writes to the shared variables.
// Global state
int x = 0, y = 0;
// Thread 1:
r1 = x;
if (r1 == 42) y = r1;
// Thread 2:
r2 = y;
if (r2 == 42) x = 42;
The above example is adapted from the standard, which explicitly says such behavior is allowed by the specification for atomic objects:
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);
However, implementations should not allow such behavior. – end note]
What part of the so-called "memory model" protects non-atomic objects from these interactions caused by reads seeing out-of-thin-air values?
When a race condition would exist with different values for x and y, what guarantees that a read of a shared (normal, non-atomic) variable cannot see such values?
Can not-executed if bodies create self-fulfilling conditions that lead to a data-race?
The text of your question seems to be missing the point of the example and out-of-thin-air values. Your example does not contain data-race UB. (It might if x or y were set to 42 before those threads ran, in which case all bets are off and the other answers citing data-race UB apply.)
There is no protection against real data races, only against out-of-thin-air values.
I think you're really asking how to reconcile that mo_relaxed example with sane and well-defined behaviour for non-atomic variables. That's what this answer covers.
The note is pointing out a hole in the atomic mo_relaxed formalism, not warning you of a real possible effect on some implementations.
This gap does not (I think) apply to non-atomic objects, only to mo_relaxed.
They say "However, implementations should not allow such behavior. – end note". Apparently the standards committee couldn't find a way to formalize that requirement, so for now it's just a note, but it is not intended to be optional.
It's clear that even though this isn't strictly normative, the C++ standard intends to disallow out-of-thin-air values for relaxed atomics (and in general, I assume). Later standards discussion, e.g. 2018's p0668r5: Revising the C++ memory model (which doesn't "fix" this, it's an unrelated change) includes juicy side-notes like:
We still do not have an acceptable way to make our informal (since C++14) prohibition of out-of-thin-air results precise. The primary practical effect of that is that formal verification of C++ programs using relaxed atomics remains unfeasible. The above paper suggests a solution similar to http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html . We continue to ignore the problem here ...
So yes, the normative parts of the standard are apparently weaker for relaxed atomics than they are for non-atomic objects. This seems to be an unfortunate side effect of how they define the rules.
AFAIK no implementations can produce out-of-thin-air values in real life.
Later versions of the standard phrase the informal recommendation more clearly, e.g. in the current draft: https://timsong-cpp.github.io/cppwp/atomics.order#8
Implementations should ensure that no “out-of-thin-air” values are computed that circularly depend on their own computation.
...
[ Note: The recommendation [of 8.] similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order::relaxed);
if (r1 == 42) y.store(42, memory_order::relaxed);
// Thread 2:
r2 = y.load(memory_order::relaxed);
if (r2 == 42) x.store(42, memory_order::relaxed);
— end note ]
(This rest of the answer was written before I was sure that the standard intended to disallow this for mo_relaxed, too.)
I'm pretty sure the C++ abstract machine does not allow r1 == r2 == 42.
Every possible ordering of operations in the C++ abstract machine leads to r1 = r2 = 0 without UB, even without synchronization. Therefore the program has no UB and any non-zero result would violate the "as-if" rule.
Formally, ISO C++ allows an implementation to implement functions / programs in any way that gives the same result as the C++ abstract machine would. For multi-threaded code, an implementation can pick one possible abstract-machine ordering and decide that's the ordering that always happens. (e.g. when reordering relaxed atomic stores when compiling to asm for a strongly-ordered ISA. The standard as written even allows coalescing atomic stores but compilers choose not to). But the result of the program always has to be something the abstract machine could have produced. (Only the Atomics chapter introduces the possibility of one thread observing the actions of another thread without mutexes. Otherwise that's not possible without data-race UB).
I think the other answers didn't look carefully enough at this. (And neither did I when it was first posted). Code that doesn't execute doesn't cause UB (including data-race UB), and compilers aren't allowed to invent writes to objects. (Except in code paths that already unconditionally write them, like y = (x==42) ? 42 : y; which would obviously create data-race UB.)
For any non-atomic object, if we don't actually write it then other threads might also be reading it, regardless of code inside not-executed if blocks. The standard allows this and doesn't allow a variable to suddenly read as a different value when the abstract machine hasn't written it. (And for objects we don't even read, like neighbouring array elements, another thread might even be writing them.)
Therefore we can't do anything that would let another thread temporarily see a different value for the object, or step on its write. Inventing writes to non-atomic objects is basically always a compiler bug; this is well known and universally agreed upon because it can break code that doesn't contain UB (and has done so in practice in a few cases where compiler bugs did create such writes, e.g. IA-64 GCC I think had such a bug at one point that broke the Linux kernel). IIRC, Herb Sutter mentioned such bugs in part 1 or 2 of his talk "atomic<> Weapons: The C++ Memory Model and Modern Hardware", saying that it was already usually considered a compiler bug before C++11, but C++11 codified that and made it easier to be sure.
Or another recent example with ICC for x86:
Crash with icc: can the compiler invent writes where none existed in the abstract machine?
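To make the forbidden transformation concrete, a sketch (the reference parameters are just to keep the snippet self-contained; the question uses globals):
// Thread 1 as written: the abstract machine only ever stores to y on the path
// where r1 == 42, so a not-taken branch touches nothing.
void thread1(int& x, int& y)
{
    int r1 = x;
    if (r1 == 42)
        y = r1;
}

// A rewrite a compiler must NOT make for non-atomic y: it writes y even when
// the abstract machine doesn't, which can race with (or be observed by)
// another thread that only reads y.
void thread1_with_invented_write(int& x, int& y)
{
    int old_y = y;          // invented read
    y = 42;                 // invented write: a compiler bug for non-atomic objects
    if (x != 42)
        y = old_y;
}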
In the C++ abstract machine, there's no way for execution to reach either y = r1; or x = r2;, regardless of sequencing or simultaneity of the loads for the branch conditions. x and y both read as 0 and neither thread ever writes them.
No synchronization is required to avoid UB because no order of abstract-machine operations leads to a data-race. The ISO C++ standard doesn't have anything to say about speculative execution or what happens when mis-speculation reaches code. That's because speculation is a feature of real implementations, not of the abstract machine. It's up to implementations (HW vendors and compiler writers) to ensure the "as-if" rule is respected.
It's legal in C++ to write code like if (global_id == mine) shared_var = 123; and have all threads execute it, as long as at most one thread actually runs the shared_var = 123; statement. (And as long as synchronization exists to avoid a data race on non-atomic int global_id). If things like this broke down, it would be chaos. For example, you could apparently draw wrong conclusions like reordering atomic operations in C++
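A minimal sketch of the if (global_id == mine) pattern just described, assuming global_id is written before the threads are launched so that thread creation supplies the synchronization the parenthetical asks for (the 4-thread harness is illustrative):
#include <thread>
#include <vector>

int global_id = 2;     // written before the threads start; read-only afterwards
int shared_var = 0;    // non-atomic; at most one thread ever writes it

void worker(int mine)
{
    // Every thread executes the 'if'; only the matching thread performs the store.
    // Threads whose branch is not taken never touch shared_var, so there is no
    // data race even though shared_var is not atomic.
    if (global_id == mine)
        shared_var = 123;
}

int main()
{
    std::vector<std::thread> threads;
    for (int id = 0; id < 4; ++id)
        threads.emplace_back(worker, id);
    for (auto& t : threads)
        t.join();
}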
Observing that a non-write didn't happen isn't data-race UB.
It's also not UB to run if(i<SIZE) return arr[i]; because the array access only happens if i is in bounds.
I think the "out of the blue" value-invention note only applies to relaxed-atomics, apparently as a special caveat for them in the Atomics chapter. (And even then, AFAIK it can't actually happen on any real C++ implementations, certainly not mainstream ones. At this point implementations don't have to take any special measures to make sure it can't happen for non-atomic variables.)
I'm not aware of any similar language outside the atomics chapter of the standard that allows an implementation to allow values to appear out of the blue like this.
I don't see any sane way to argue that the C++ abstract machine causes UB at any point when executing this, yet seeing r1 == r2 == 42 would imply that an unsynchronized read and write had happened, which is data-race UB. If that can happen, can an implementation invent UB because of speculative execution (or some other reason)? The answer has to be "no" for the C++ standard to be usable at all.
For relaxed atomics, inventing the 42 out of nowhere wouldn't imply that UB had happened; perhaps that's why the standard says it's allowed by the rules? As far as I know, nothing outside the Atomics chapter of the standard allows it.
A hypothetical asm / hardware mechanism that could cause this
(Nobody wants this, hopefully everyone agrees that it would be a bad idea to build hardware like this. It seems unlikely that coupling speculation across logical cores would ever be worth the downside of having to roll back all cores when one detects a mispredict or other mis-speculation.)
For 42 to be possible, thread 1 has to see thread 2's speculative store and the store from thread 1 has to be seen by thread 2's load. (Confirming the branch speculation as correct, allowing this path of execution to become the real path that was actually taken.)
i.e. speculation across threads: Possible on current HW if they ran on the same core with only a lightweight context switch, e.g. coroutines or green threads.
But on current HW, memory reordering between threads is impossible in that case. Out-of-order execution of code on the same core gives the illusion of everything happening in program order. To get memory reordering between threads, they need to be running on different cores.
So we'd need a design that coupled together speculation between two logical cores. Nobody does that because it means more state needs to rollback if a mispredict is detected. But it is hypothetically possible. For example an OoO SMT core that allows store-forwarding between its logical cores even before they've retired from the out-of-order core (i.e. become non-speculative).
PowerPC allows store-forwarding between logical cores for retired stores, meaning that threads can disagree about the global order of stores. But waiting until they "graduate" (i.e. retire) and become non-speculative means it doesn't tie together speculation on separate logical cores. So when one is recovering from a branch miss, the others can keep the back-end busy. If they all had to rollback on a mispredict on any logical core, that would defeat a significant part of the benefit of SMT.
I thought for a while I'd found an ordering that leads to this on a single core of a real weakly-ordered CPU (with user-space context switching between the threads), but the final step's store can't forward to the first step's load, because they're in program order and OoO exec preserves that.
T2: r2 = y; stalls (e.g. cache miss)
T2: branch prediction predicts that r2 == 42 will be true (i.e. that x = 42 should run).
T2: x = 42 runs. (Still speculative; r2 = y hasn't obtained a value yet, so the r2 == 42 compare/branch is still waiting to confirm that speculation.)
a context switch to Thread 1 happens without rolling back the CPU to retirement state or otherwise waiting for speculation to be confirmed as good or detected as mis-speculation.
This part won't happen on real C++ implementations unless they use an M:N thread model, not the more common 1:1 C++ thread to OS thread. Real CPUs don't rename the privilege level: they don't take interrupts or otherwise enter the kernel with speculative instructions in flight that might need to rollback and redo entering kernel mode from a different architectural state.
T1: r1 = x; takes its value from the speculative x = 42 store
T1: r1 == 42 is found to be true. (Branch speculation happens here, too, not actually waiting for store-forwarding to complete. But along this path of execution, where the x = 42 did happen, this branch condition will execute and confirm the prediction).
T1: y = 42 runs.
this was all on the same CPU core so this y=42 store is after the r2=y load in program-order; it can't give that load a 42 to let the r2==42 speculation be confirmed. So this possible ordering doesn't demonstrate this in action after all. This is why threads have to be running on separate cores with inter-thread speculation for effects like this to be possible.
Note that x = 42 doesn't have a data dependency on r2 so value-prediction isn't required to make this happen. And the y=r1 is inside an if(r1 == 42) anyway so the compiler can optimize to y=42 if it wants, breaking the data dependency in the other thread and making things symmetric.
Note that the arguments about Green Threads or other context switch on a single core isn't actually relevant: we need separate cores for the memory reordering.
I commented earlier that I thought this might involve value-prediction. The ISO C++ standard's memory model is certainly weak enough to allow the kinds of crazy "reordering" that value-prediction can create, but it's not necessary for this reordering. y=r1 can be optimized to y=42, and the original code includes x=42 anyway, so there's no data dependency of that store on the r2=y load. Speculative stores of 42 are easily possible without value prediction. (The problem is getting the other thread to see them!)
Speculating because of branch prediction instead of value prediction has the same effect here. And in both cases the loads need to eventually see 42 to confirm the speculation as correct.
Value-prediction doesn't even help make this reordering more plausible. We still need inter-thread speculation and memory reordering for the two speculative stores to confirm each other and bootstrap themselves into existence.
ISO C++ chooses to allow this for relaxed atomics, but AFAICT it disallows it for non-atomic variables. I'm not sure I see exactly what in the standard does allow the relaxed-atomic case beyond the note saying it's not explicitly disallowed. If there was any other code that did anything with x or y then maybe, but I think my argument applies to the relaxed-atomic case as well. No path through the source in the C++ abstract machine can produce it.
As I said, it's not possible in practice AFAIK on any real hardware (in asm), or in C++ on any real C++ implementation. It's more of an interesting thought-experiment into crazy consequences of very weak ordering rules, like C++'s relaxed-atomic. (Those ordering rules don't disallow it, but I think the as-if rule and the rest of the standard does, unless there's some provision that allows relaxed atomics to read a value that was never actually written by any thread.)
If there is such a rule, it would only be for relaxed atomics, not for non-atomic variables. Data-race UB is pretty much all the standard needs to say about non-atomic vars and memory ordering, and this example doesn't contain a data race.
When a race condition potentially exists, what guarantees that a read of a shared variable (normal, non-atomic) cannot see a write?
There is no such guarantee.
When a race condition exists, the behaviour of the program is undefined:
[intro.races]
Two actions are potentially concurrent if
they are performed by different threads, or
they are unsequenced, at least one is performed by a signal handler, and they are not both performed by the same signal handler invocation.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior. ...
The special case is not very relevant to the question, but I'll include it for completeness:
Two accesses to the same object of type volatile std::sig_atomic_t do not result in a data race if both occur in the same thread, even if one or more occurs in a signal handler. ...
What part of the so-called "memory model" protects non-atomic objects from these interactions caused by reads that see the interaction?
None. In fact, you get the opposite and the standard explicitly calls this out as undefined behavior. In [intro.races]/21 we have
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, except for the special case for signal handlers described below. Any such data race results in undefined behavior.
which covers your second example.
The rule is that if you have data shared between multiple threads, and at least one of those threads writes to that shared data, then you need synchronization. Without it you have a data race and undefined behavior. Do note that volatile is not a valid synchronization mechanism. You need atomics/mutexes/condition variables to protect shared access.
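For the question's snippet, a hedged sketch of the two usual fixes (the names ax/ay/px/py are mine, just to keep both variants in one place):
#include <atomic>
#include <mutex>

// Option 1: make the shared variables atomic. Even memory_order_relaxed removes
// the data race (atomic accesses never race), leaving only the weak-ordering
// questions discussed above.
std::atomic<int> ax{0}, ay{0};

void thread1_atomic()
{
    int r1 = ax.load(std::memory_order_relaxed);
    if (r1 == 42)
        ay.store(r1, std::memory_order_relaxed);
}

// Option 2: keep plain ints but guard every access with a mutex; each access
// then happens-before or happens-after every other, so there is no race at all.
int px = 0, py = 0;
std::mutex m;

void thread1_locked()
{
    std::lock_guard<std::mutex> lock(m);
    int r1 = px;
    if (r1 == 42)
        py = r1;
}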
Note: The specific examples I give here are apparently not accurate. I've assumed the optimizer can be somewhat more aggressive than it's apparently allowed to be. There is some excellent discussion about this in the comments. I'm going to have to investigate this further, but wanted to leave this note here as a warning.
Other people have given you answers quoting the appropriate parts of the standard that flat out state that the guarantee you think exists, doesn't. It appears that you're interpreting a part of the standard that says a certain weird behavior is permitted for atomic objects if you use memory_order_relaxed as meaning that this behavior is not permitted for non-atomic objects. This is a leap of inference that is explicitly addressed by other parts of the standard that declare the behavior undefined for non-atomic objects.
In practical terms, here is an order of events that might happen in thread 1 that would be perfectly reasonable, but result in the behavior you think is barred even if the hardware guaranteed that all memory access was completely serialized between CPUs. Keep in mind that the standard has to not only take into account the behavior of the hardware, but the behavior of optimizers, which often aggressively re-order and re-write code.
Thread 1 could be re-written by an optimizer to look this way:
old_y = y; // old_y is a hidden variable (perhaps a register) created by the optimizer
y = 42;
if (x != 42) y = old_y;
There might be perfectly reasonable reasons for an optimizer to do this. For example, it may decide that it's far more likely than not for 42 to be written into y, and for dependency reasons, the pipeline might work a lot better if the store into y occurs sooner rather than later.
The rule is that the apparent result must look as if the code you wrote is what was executed. But there is no requirement that the code you write bears any resemblance at all to what the CPU is actually told to do.
The atomic variables impose constraints on the ability of the compiler to re-write code as well as instructing the compiler to issue special CPU instructions that impose constraints on the ability of the CPU to re-order memory accesses. The constraints involving memory_order_relaxed are much stronger than what is ordinarily allowed. The compiler would generally be allowed to completely get rid of any reference to x and y at all if they weren't atomic.
Additionally, if they are atomic, the compiler must ensure that other CPUs see the entire variable as either with the new value or the old value. For example, if the variable is a 32-bit entity that crosses a cache line boundary and a modification involves changing bits on both sides of the cache line boundary, one CPU may see a value of the variable that is never written because it only sees an update to the bits on one side of the cache line boundary. But this is not allowed for atomic variables modified with memory_order_relaxed.
That is why data races are labeled as undefined behavior by the standard. The space of the possible things that could happen is probably a lot wilder than your imagination could account for, and certainly wider than any standard could reasonably encompass.
(Stackoverflow complains about too many comments I put above, so I gathered them into an answer with some modifications.)
The excerpt you cite from the C++ standard working draft N3337 was wrong.
[Note: The requirements do allow r1 == r2 == 42 in the following
example, with x and y initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
x.store(42, memory_order_relaxed);
A programming language should never allow this "r1 == r2 == 42" to happen.
This has nothing to do with the memory model. It is required by causality, which is basic logical methodology and a foundation of any programming language design. It is the fundamental contract between human and computer. Any memory model should abide by it; otherwise it is a bug.
The causality here is reflected by the intra-thread dependences between operations within a thread, such as data dependence (e.g., read after write in same location) and control dependence (e.g., operation in a branch), etc. They cannot be violated by any language specification. Any compiler/processor design should respect the dependence in its committed result (i.e., externally visible result or program visible result).
The memory model is mainly about memory operation ordering among multiple processors, which should never violate intra-thread dependences, although a weak model may allow the causality established in one processor to be violated (or unseen) by another processor.
In your code snippet, both threads have (intra-thread) data dependence (load->check) and control dependence (check->store) that ensure their respective executions (within a thread) are ordered. That means, we can check the later op's output to determine if the earlier op has executed.
Then we can use simple logic to deduce that, if both r1 and r2 are 42, there must be a dependence cycle, which is impossible, unless you remove one condition check, which essentially breaks the dependence cycle. This has nothing to do with memory model, but intra-thread data dependence.
Causality (or more accurately, intra-thread dependence here) is defined in the C++ standard, but not so explicitly in early drafts, because dependence is more of a micro-architecture and compiler term. In a language spec it is usually expressed as operational semantics. For example, the control dependence formed by an if statement is defined in the same version of the draft you cited as "If the condition yields true the first substatement is executed." That defines the sequential execution order.
That said, the compiler and processor can schedule one or more operations of the if-branch to be executed before the if-condition is resolved. But no matter how the compiler and processor schedule the operations, the result of the if-branch cannot be committed (i.e., become visible to the program) before the if-condition is resolved. One should distinguish between semantics requirement and implementation details. One is language spec, the other is how the compiler and processor implement the language spec.
Actually the current C++ standard draft has corrected this bug in https://timsong-cpp.github.io/cppwp/atomics.order#9 with a slight change.
[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y again initially zero:
// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42)
y.store(42, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42)
x.store(42, memory_order_relaxed);


Sequence points, conditionals and optimizations

I had an argument today with one of my colleagues regarding the fact that a compiler could change the semantics of a program when aggressive optimizations are enabled.
My colleague states that when optimizations are enabled, a compiler might change the order of some instructions. So that:
void foo(int a, int b)
{
    if (a > 5)
    {
        if (b < 6)
        {
            // Do something
        }
    }
}
Might be changed to:
void foo(int a, int b)
{
    if (b < 6)
    {
        if (a > 5)
        {
            // Do something
        }
    }
}
Of course, in this case, it doesn't change the program general behavior and isn't really important.
From my understanding, I believe that the two if (condition) belong to two different sequence points and that the compiler can't change their order, even if changing it would keep the same general behavior.
So, dear SO users, what is the truth regarding this?
If the compiler can verify that there is no observable difference between those two, then it is free to make such optimizations.
Sequence points are a conceptual thing: the compiler has to generate code such that it behaves as if all the semantic rules like sequence points were followed. The generated code doesn't actually have to follow those rules if not following them produces no observable difference in the behavior of the program.
Even if you had:
if (a > 5 && b < 6)
the compiler could freely rearrange this to be
if (b < 6 && a > 5)
because there is no observable difference between the two (in this specific case where a and b are both int values). [This assumes that it is safe to read both a and b; if reading one of them could cause some error (e.g., one has a trap value), then the compiler would be more restricted in what optimizations it could make.]
As there is no observable difference between the two program snippets - provided the implementation is one that doesn't use trap values or anything else that might cause the inner comparison to do something other than just evaluate to true or false - the compiler could optimize one to the other under the "as if" rule. If there was some observable difference or some way that a conforming program might behave differently then the compiler would be non-conforming if it changed one form to the other.
For C++, see 1.9 [intro.execution] / 5.
A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible execution sequences of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution sequence contains an undefined
operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).
[This provision is sometimes called the "as-if" rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.]
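To illustrate where that freedom ends, a small sketch (the noisy() helper is made up; it stands in for any condition with observable side effects):
#include <cstdio>

// Plain value tests on ints: no observable behaviour is involved, so the
// compiler may evaluate them in either order (or fuse them into one branch).
void plain(int a, int b)
{
    if (a > 5)
        if (b < 6)
            std::puts("both");
}

// If the tests themselves produce observable behaviour (here: output), the
// as-if rule forbids any reordering that changes what is printed; e.g.
// "checking b" must never appear when a <= 5.
bool noisy(const char* label, bool value)
{
    std::printf("checking %s\n", label);
    return value;
}

void observable(int a, int b)
{
    if (noisy("a", a > 5))
        if (noisy("b", b < 6))
            std::puts("both");
}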
Yes, the if statement is a sequence point.
However, a smart and aggressive compiler can still reorder the different expressions and statements and alter the sequence points, provided no observable side effects are involved.
Sequence points only apply to the abstract machine.
If the target specific optimizer can prove that reversing the order of two instructions has no side effects, it can change them at will.
The end of a full expression (including those that control logical constructs like if, while, et cetera) is a sequence point. However, the sequence point really only provides a guarantee that side-effects of previously-evaluated statements have completed.
If a statement has no observable side-effects the compiler can do what it feels is best.
The truth is that if a > 5 is false more often than b < 6 (or vice versa), then the ordering makes a very minor performance difference, since one order has to evaluate both conditions on more occasions than the other.
In reality though it is so trivial it is not worth bothering about in this particular case.
There are cases where it actually does make a difference, i.e. when you are filtering a large collection of data on several criteria and have to decide which filter to apply first, particularly if only one of them is O(log N) or constant and the subsequent checks are linear through what is left.
Lots of PC programmer replies =)
The compiler may, and likely would, optimize across the sequence points for speed if "b" is passed to the function in a quickly-accessed register while "a" is passed on the stack. That's quite a common case for many compilers for 8-bit and 16-bit MCUs.
With that optimization it doesn't need to first stack "b", then load "a" into a register, evaluate "a", load "b" back into a register, and finally evaluate "b". That's quite a mess, and one I'd rather hope the compiler handles by rearranging the sequence points.
Though of course as already mentioned, to be standard compliant the compiler needs to ensure that it doesn't change the program behavior by the optimization.

Compilers and argument order of evaluation in C++

Okay, I'm aware that the standard dictates that a C++ implementation may choose in which order arguments of a function are evaluated, but are there any implementations that actually 'take advantage' of this in a scenario where it would actually affect the program?
Classic Example:
int i = 0;
foo(i++, i++);
Note: I'm not looking for someone to tell me that the order of evaluation can't be relied on, I'm well aware of that. I'm only interested in whether any compilers actually do evaluate out of a left-to-right order because my guess would be that if they did lots of poorly written code would break (rightly so, but they would still probably complain).
It depends on the argument type, the called function's calling convention, the architecture and the compiler. On x86, the Pascal calling convention evaluates arguments left to right, whereas in the C calling convention (__cdecl) it is right to left. Most programs which run on multiple platforms do take the calling conventions into account to avoid surprises.
There is a nice article on Raymond Chen's blog if you are interested. You may also want to take a look at the Stack and Calling section of the GCC manual.
Edit: So long as we are splitting hairs: My answer treats this not as a language question but as a platform one. The language standard does not guarantee or prefer one over the other and leaves it as unspecified. Note the wording. It does not say this is undefined. Unspecified in this sense means something you cannot count on, non-portable behavior. I don't have the C spec/draft handy but it should be similar to that from my n2798 draft (C++)
Certain other aspects and operations of the abstract machine are described in this International Standard as unspecified (for example, order of evaluation of arguments to a function). Where possible, this International Standard defines a set of allowable behaviors. These define the nondeterministic aspects of the abstract machine. An instance of the abstract machine can thus have more than one possible execution sequence for a given program and a given input.
I found the answer in the C++ standard.
Paragraph 5.2.2.8:
The order of evaluation of arguments is unspecified. All side effects of argument expression evaluations take effect before the function is entered. The order of evaluation of the postfix expression and the argument expression list is
unspecified.
In other words, it depends only on the compiler.
Read this
It's not an exact copy of your question, but my answer (and a few others) cover your question as well.
There are very good optimization reasons why the compiler might not just choose right-to-left but also interleave them.
The standard doesn't even guarantee a sequential ordering. It only guarantees that when the function gets called, all arguments have been fully evaluated.
And yes, I have seen a few versions of GCC do exactly this. For your example, foo(0,0) would be called, and i would be 2 afterwards. (I can't give you the exact version number of the compiler. It was a while ago - but I wouldn't be surprised to see this behavior pop up again. It's an efficient way to schedule instructions)
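If you want to see what your own compiler does, here is a small experiment. Note that under C++17 the two increments are indeterminately sequenced, so only their order is unspecified; under earlier standards they are unsequenced modifications of the same scalar, i.e. undefined behaviour, so treat the output as a curiosity, not a guarantee:
#include <iostream>

void foo(int a, int b)
{
    std::cout << "foo(" << a << ", " << b << ")\n";
}

int main()
{
    int i = 0;
    foo(i++, i++);                 // which increment runs first is unspecified
    std::cout << "i = " << i << "\n";
}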
All arguments are evaluated. The order is not defined (per the standard). But all implementations of C/C++ (that I know of) evaluate function arguments from right to left. EDIT: Clang is an exception (see comment below).
I believe that the right-to-left evaluation order has been very very old (since the first C compilers). Certainly way before C++ was invented, and most implementations of C++ would be keeping the same evaluation order because early C++ implementations simply translated into C.
There are some technical reasons for evaluating function arguments right-to-left. In stack architectures, arguments are typically pushed onto the stack. In C/C++, you can call a function with more arguments than actually specified -- the extra arguments are simply ignored. If arguments are evaluated left-to-right, and pushed left-to-right, then the stack slot right under the stack pointer will hold the last argument, and there is no way for the function to get at the offset of any particular argument (because the actual number of arguments pushed depends on the caller).
In a right-to-left push order, the stack slot right under the stack pointer will always hold the first argument, and the next slot holds the second argument etc. Argument offsets will always be deterministic for the function (which may be written and compiled elsewhere into a library, separately from where it is called).
Now, right-to-left push order does not mandate right-to-left evaluation order, but in early compilers memory was scarce. In right-to-left evaluation order, the same stack can be used in place (essentially, after evaluating an argument -- which may be an expression or a function call! -- its value is already at the right position on the stack). In left-to-right evaluation, the argument values must be stored separately and then pushed back onto the stack in reverse order.
Last time I saw differences was between VS2005 and GCC 3.x on an x86 hardware in 2007.
So it's (was?) a very likely situation. So I never rely on evaluation order anymore. Maybe it's better now.
I expect that most modern compilers would attempt to interleave the instructions computing the arguments, given that the C++ standard does not sequence them relative to one another, so portable code cannot rely on any interdependencies between them. Doing this should help to keep a deeply-pipelined CPU's execution units full and thereby increase throughput. (At least I would expect that a compiler that claims to be an optimising compiler would do so when optimisation flags are given.)