Is the following code guaranteed to return the expected value of counter (40,000,000), according to the C++11 memory model? (NOT limited to x86).
#include <atomic>
#include <cstdio>   // for printf
#include <thread>
using namespace std;

void ThreadProc(atomic<int>& counter)
{
    for (int i = 0; i < 10000000; i++)
        counter.fetch_add(1, memory_order_relaxed);
}

int main()
{
#define COUNT 4
    atomic<int> counter = { 0 };
    thread threads[COUNT] = {};
    for (size_t i = 0; i < COUNT; i++)
        threads[i] = thread(ThreadProc, ref(counter));
    for (size_t i = 0; i < COUNT; i++)
        threads[i].join();
    printf("Counter: %i\n", counter.load(memory_order_relaxed));
    return 0;
}
In particular, will relaxed atomics coordinate such that two threads will not simultaneously read the current value, independently increment it, and both write their incremented value, effectively losing one of the writes?
Some lines from the spec seem to indicate that counter must consistently be 40,000,000 in the above example.
[Note: operations specifying memory_order_relaxed are relaxed with
respect to memory ordering. Implementations must still guarantee that
any given atomic access to a particular atomic object be indivisible
with respect to all other atomic accesses to that object. — end note
.
Atomic read-modify-write operations shall always read the last value
(in the modification order) written before the write associated with the
read-modify-write operation.
.
All modifications to a particular atomic object M occur in some
particular total order, called the modification order of M. If A and B
are modifications of an atomic object M and A happens before (as
defined below) B, then A shall precede B in the modification order of
M, which is defined below.
This talk also supports the notion that the above code is race free.
https://www.youtube.com/watch?v=KeLBd2EJLOU&feature=youtu.be&t=1h9m30s
It appears to me that there is an indivisible ordering of the atomic operations, but that we have no guarantee of what that order is. So all increments must take place 'one before the other', without the race I described above.
But then a few things potentially point in the other direction:
Implementations should make atomic stores visible to atomic loads
within a reasonable amount of time.
I've been informed by a coworker that there are known mistakes in Sutter's talk. Though I've yet to find any sources for this.
Multiple members of the C++ community smarter than I have implied that a relaxed atomic add could be buffered such that a subsequent relaxed atomic add could read and operate on the stale value.
The code in your question is race free; all increments are ordered and the outcome of 40000000 is guaranteed.
The references in your question contain all the relevant quotes from the standard.
The part where it says that atomic stores should be visible within a reasonable amount of time applies only to plain stores, not to read-modify-write operations.
In your case, the counter is incremented with atomic read-modify-write operations, and those are guaranteed to read the latest value in the modification order.
Multiple members of the C++ community (...) have implied that a relaxed atomic add could be buffered such that a subsequent relaxed atomic add could read and operate on the stale value.
This is not possible, as long as the modifications are based on atomic read-modify-write operations.
Atomic increments would be useless if a reliable outcome were not guaranteed by the standard.
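To make the distinction concrete, here is a minimal sketch (mine, not part of the original answer) contrasting the guaranteed read-modify-write increment with a load-then-store version that can lose updates even though each individual access is atomic:
#include <atomic>

std::atomic<int> counter{0};

void good() {
    // One indivisible read-modify-write: it reads the latest value in the
    // modification order and writes its successor atomically.
    counter.fetch_add(1, std::memory_order_relaxed);
}

void bad() {
    // Two separate atomic operations: another thread can increment between
    // the load and the store, and that increment is then overwritten.
    int tmp = counter.load(std::memory_order_relaxed);
    counter.store(tmp + 1, std::memory_order_relaxed);
}
With four threads running bad() ten million times each, the final count can fall well short of 40,000,000; with fetch_add it is exactly 40,000,000.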
Consider the following situation
// Global
int x = 0; // not atomic
// Thread 1
x = 1;
// Thread 2
if (false)
x = 2;
Does this constitute a data race according to the standard?
[intro.races] says:
Two expression evaluations conflict if one of them modifies a memory location (4.4) and the other one reads
or modifies the same memory location.
The execution of a program contains a data race if it contains two potentially concurrent conflicting actions,
at least one of which is not atomic, and neither happens before the other, except for the special case for
signal handlers described below. Any such data race results in undefined behavior.
Is it safe from a language-lawyer perspective, because the program can never be allowed to perform the "expression evaluation" x = 2;?
From a technical standpoint, what if some weird, stupid compiler decided to perform a speculative execution of this write, rolling it back after checking the actual condition?
What inspired this question is the fact that (at least in Standard 11), the following program was allowed to have its result depend entirely on reordering/speculative execution:
// Thread 1:
r1 = y.load(std::memory_order_relaxed);
if (r1 == 42) x.store(r1, std::memory_order_relaxed);
// Thread 2:
r2 = x.load(std::memory_order_relaxed);
if (r2 == 42) y.store(42, std::memory_order_relaxed);
// This is allowed to result in r1==r2==42 in c++11
(compare https://en.cppreference.com/w/cpp/atomic/memory_order)
The key term is "expression evaluation". Take the very simple example:
int a = 0;
for (int i = 0; i != 10; ++i)
++a;
There's one expression ++a, but 10 evaluations. These are all ordered: the 5th evaluation happens-before the 6th evaluation. And the evaluations of ++a are interleaved with the evaluations of i!=10.
So, in
int a = 0;
for (int i = 0; i != 0; ++i)
++a;
there are 0 evaluations. And by a trivial rewrite, that gets us
int a = 0;
if (false)
++a;
Now, if there are 10 evaluations of ++a, we need to worry for all 10 evaluations if they race with another thread (in more complex cases, the answer might vary - say if you start a thread when a==5). But if there are no evaluations at all of ++a, then there's clearly no racing evaluation.
Does this constitute a data race according to the standard?
No, data races are concerned with access to storage locations in expressions which are actually evaluated as your quote states. In if (false) x = 2; the expression x = 2; is never evaluated. Hence it doesn't matter at all to determining the presence of data races.
Is it safe from a language-lawyer perspective, because the program can never be allowed to perform the "expression evaluation" x = 2;?
Yes.
From a technical standpoint, what if some weird, stupid compiler decided to perform a speculative execution of this write, rolling it back after checking the actual condition?
It is not allowed to do that if it could affect the observable behavior of the program. Otherwise it may do that, but it is impossible to observe the difference.
What inspired this question is the fact that (at least in Standard 11), the following program was allowed to have its result depend entirely on reordering/speculative execution:
That's a completely different situation. This program also doesn't have any data races, since the only variables that are accessed in both threads are atomics, which can never have data races. It merely has potentially multiple valid results, meaning a race condition. A data race would always imply undefined behavior, not merely unspecified behavior.
Also the out-of-thin-air issue appears only as a result of the circular dependence of the accesses between multiple atomics. In your initial example there is only one variable, non-atomic and without any such circular dependence.
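To make that distinction concrete, here is a hedged sketch (my example, not from the answer): the atomic version has a race condition (the final value is 1 or 2, unspecified) but no data race; the commented-out non-atomic version would be a data race and hence undefined behavior:
#include <atomic>
#include <thread>

std::atomic<int> a{0};
int plain = 0;

int main() {
    // Race condition, but NOT a data race: both accesses are atomic, so the
    // behavior is well-defined; the final value is 1 or 2.
    std::thread t1([] { a.store(1, std::memory_order_relaxed); });
    std::thread t2([] { a.store(2, std::memory_order_relaxed); });
    t1.join();
    t2.join();

    // Data race (undefined behavior) if uncommented: concurrent, conflicting,
    // non-atomic accesses with no happens-before between them.
    // std::thread t3([] { plain = 1; });
    // std::thread t4([] { plain = 2; });
    return 0;
}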
As known, std::atomic and volatile are different things.
There are 2 main differences:
Two optimizations that are allowed for std::atomic<int> a; but not for volatile int a;:
fused operations: a = 1; a = 2; can be replaced by the compiler with a = 2;
constant propagation: a = 1; local = a; can be replaced by the compiler with a = 1; local = 1;
Reordering of ordinary reads/writes across atomic/volatile operations:
for volatile int a; volatile reads/writes cannot be reordered with respect to one another, but nearby ordinary reads/writes can still be reordered around them.
for std::atomic a; reordering of nearby ordinary reads/writes is restricted, based on the memory order used for the atomic operation, e.g. a.load(std::memory_order_...);
I.e. volatile doesn't introduce memory fences, but std::atomic can.
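A small sketch of difference 1 (my illustration, not from the original): the compiler may legally coalesce and constant-fold the atomic accesses, but it must emit every volatile access:
#include <atomic>

std::atomic<int> a;
volatile int v;

void atomic_case() {
    a = 1;              // a compiler may fuse this pair...
    a = 2;              // ...into just a = 2
    int local = a;      // ...and may propagate the constant: local = 2
    (void)local;
}

void volatile_case() {
    v = 1;              // must be emitted
    v = 2;              // must also be emitted, after the first write:
                        // volatile accesses are observable behavior
}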
As is well described in the article:
Herb Sutter, January 08, 2009 - part 1: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484
Herb Sutter, January 08, 2009 - part 2: http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2
For example, std::atomic should be used in concurrent multi-threaded programs (CPU-core <-> CPU-core), whereas volatile should be used for access to memory-mapped regions of devices (CPU-core <-> device).
But sometimes both sets of semantics are required at once, i.e. volatile std::atomic<>, for several reasons:
ordering: to prevent reordering of ordinary reads/writes, for example, for reads from CPU-RAM to which the data has been written by the device's DMA controller
For example:
char cpu_ram_data_written_by_device[1024];
device_dma_will_write_here( cpu_ram_data_written_by_device );
// physically mapped to device register
volatile bool *device_ready = get_pointer_device_ready_flag();
//... somewhere much later
while (!*device_ready); // spin-lock (here there should be a memory fence!!!)
for(auto &i : cpu_ram_data_written_by_device) std::cout << i;
spilling: the CPU writes to CPU-RAM and then the device's DMA controller reads from this memory: https://en.wikipedia.org/wiki/Register_allocation#Spilling
example:
char cpu_ram_data_will_read_by_device[1024];
device_dma_will_read_it( cpu_ram_data_will_read_by_device );
// physically mapped to device register
volatile bool *data_ready = get_pointer_data_ready_flag();
//... somewhere much later
for(auto &i : cpu_ram_data_will_read_by_device) i = 10;
*data_ready = true; // spilling cpu_ram_data_will_read_by_device to RAM; there should be a memory fence here
atomic: to guarantee that the volatile operation will be atomic, i.e. it will consist of a single operation instead of multiple, e.g. one 8-byte operation instead of two 4-byte operations
For this, Herb Sutter said the following about volatile atomic<T> (January 08, 2009): http://www.drdobbs.com/parallel/volatile-vs-volatile/212701484?pgno=2
Finally, to express a variable that both has unusual semantics and has
any or all of the atomicity and/or ordering guarantees needed for
lock-free coding, only the ISO C++0x draft Standard provides a direct
way to spell it: volatile atomic.
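To illustrate the atomicity point with a sketch (mine; it assumes a 32-bit target where a plain 8-byte access may be split into two 4-byte accesses):
#include <atomic>
#include <cstdint>

volatile std::int64_t raw = 0;            // a write may be two 4-byte stores
volatile std::atomic<std::int64_t> both;  // indivisible AND never elided

void update() {
    raw = 0x1122334455667788;             // possibly torn into two halves
    both.store(0x1122334455667788);       // atomic; may fall back to a lock
                                          // if the platform lacks 8-byte atomics
}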
But do modern standards C++11 (not C++0x draft), C++14, and C++17 guarantee that volatile atomic<T> has both semantics (volatile + atomic)?
Does volatile atomic<T> guarantee the most stringent guarantees from both volatile and atomic?
As in volatile: Avoids fused-operations and constant-propagation as described in the beginning of the question
As in std::atomic: Introduces memory fences to provide ordering, spilling, and being atomic.
And can we do reinterpret_cast from volatile int *ptr; to volatile std::atomic<int>*?
Yes, it does.
Section 29.6.5, "Requirements for operations on atomic types"
Many operations are volatile-qualified. The “volatile as device register” semantics have not changed in the standard. This qualification means that volatility is preserved when applying these operations to volatile objects.
I checked the working drafts from 2008 through 2016, and the same text is in all of them. Therefore it should apply to C++11, C++14, and C++17.
And can we do reinterpret_cast from volatile int *ptr; to volatile
std::atomic<int>*?
You can do such casts if and only if the ABI says that both types (here int and std::atomic<int>) have the same representation and restrictions: the same size, alignment, and possible bit patterns; the same meaning for the same bit patterns.
Everything that is volatile is directly connected with the ABI: variables that are volatile qualified must have the canonical ABI representation at sequence points and operations on volatile objects only assume they follow their ABI requirements and nothing else. So whenever volatile is used in C or C++, you can rely alternatively on the language standard or the platform ABI.
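As a hedged sketch of what such a cast can look like (the register address and the checks are mine; as said above, validity rests on the platform ABI, not on the standard alone):
#include <atomic>

// Hypothetical memory-mapped register address; illustration only.
volatile int *ptr = reinterpret_cast<volatile int *>(0x4000);

// Compile-time checks for the size/alignment half of the ABI requirement;
// matching bit patterns and semantics must be guaranteed by the ABI itself.
static_assert(sizeof(std::atomic<int>) == sizeof(int),
              "int and atomic<int> differ in size on this ABI");
static_assert(alignof(std::atomic<int>) == alignof(int),
              "int and atomic<int> differ in alignment on this ABI");

volatile std::atomic<int> *aptr =
    reinterpret_cast<volatile std::atomic<int> *>(ptr);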
(I hope this answer is not deleted because some people despise volatile semantic and depending on the ABI and platform specific notions.)
I have a question regarding a GCC-Wiki article. Under the headline "Overall Summary" the following code example is given:
Thread 1:
y.store (20);
x.store (10);
Thread 2:
if (x.load() == 10) {
assert (y.load() == 20);
y.store (10);
}
It is said that, if all stores are release and all loads are acquire, the assert in thread 2 cannot fail. This is clear to me (because the store to x in thread 1 synchronizes with the load from x in thread 2).
But now comes the part that I don't understand. It is also said that, if all stores are release and all loads are consume, the results are the same. Wouldn't it be possible that the load from y is hoisted before the load from x (because there is no dependency between these variables)? That would mean that the assert in thread 2 actually can fail.
The C11 Standard's ruling is as follows.
5.1.2.4 Multi-threaded executions and data races
An evaluation A is dependency-ordered before an evaluation B if:
— A performs a release operation on an atomic object M, and, in another thread, B performs a consume operation on M and reads a value written by any side effect in the release sequence headed by A, or
— for some evaluation X, A is dependency-ordered before X and X carries a dependency to B.
An evaluation A inter-thread happens before an evaluation B if A synchronizes with B, A is dependency-ordered before B, or, for some evaluation X:
— A synchronizes with X and X is sequenced before B,
— A is sequenced before X and X inter-thread happens before B, or
— A inter-thread happens before X and X inter-thread happens before B.
NOTE 7 The ‘‘inter-thread happens before’’ relation describes arbitrary concatenations of ‘‘sequenced before’’, ‘‘synchronizes with’’, and ‘‘dependency-ordered before’’ relationships, with two exceptions. The first exception is that a concatenation is not permitted to end with ‘‘dependency-ordered before’’ followed by ‘‘sequenced before’’. The reason for this limitation is that a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency. The reason that this limitation applies only to the end of such a concatenation is that any subsequent release operation will provide the required ordering for a prior consume operation. The second exception is that a concatenation is not permitted to consist entirely of ‘‘sequenced before’’. The reasons for this limitation are (1) to permit ‘‘inter-thread happens before’’ to be transitively closed and (2) the ‘‘happens before’’ relation, defined below, provides for relationships consisting entirely of ‘‘sequenced before’’.
An evaluation A happens before an evaluation B if A is sequenced before B or A inter-thread happens before B.
A visible side effect A on an object M with respect to a value computation B of M satisfies the conditions:
— A happens before B, and
— there is no other side effect X to M such that A happens before X and X happens before B.
The value of a non-atomic scalar object M, as determined by evaluation B, shall be the value stored by the visible side effect A.
(emphasis added)
In the commentary below, I'll abbreviate as follows:
Dependency-ordered before: DOB
Inter-thread happens before: ITHB
Happens before: HB
Sequenced before: SeqB
Let us review how this applies. We have 4 relevant memory operations, which we will name Evaluations A, B, C and D:
Thread 1:
y.store (20); // Release; Evaluation A
x.store (10); // Release; Evaluation B
Thread 2:
if (x.load() == 10) { // Consume; Evaluation C
assert (y.load() == 20); // Consume; Evaluation D
y.store (10);
}
To prove the assert never trips, we in effect seek to prove that A is always a visible side-effect at D. In accordance with 5.1.2.4 (15), we have:
A SeqB B DOB C SeqB D
which is a concatenation ending in DOB followed by SeqB. Paragraph (17) explicitly rules that this is not an ITHB concatenation, despite what (16) says.
We know that since A and D are not in the same thread of execution, A is not SeqB D; hence neither of the two conditions in (18) for HB is satisfied, and A does not HB D.
It follows that A is not guaranteed to be a visible side effect at D, since one of the conditions of (19) is not met. The assert may fail.
How this could play out, then, is described in discussions of the C++ memory model and of hardware control dependencies (Section 4.2, "Control Dependencies"):
(Some time ahead) Thread 2's branch predictor guesses that the if will be taken.
Thread 2 approaches the predicted-taken branch and begins speculative fetching.
Thread 2 out-of-order and speculatively loads 0xGUNK from y (Evaluation D). (Maybe it was not yet evicted from cache?).
Thread 1 stores 20 into y (Evaluation A)
Thread 1 stores 10 into x (Evaluation B)
Thread 2 loads 10 from x (Evaluation C)
Thread 2 confirms the if is taken.
Thread 2's speculative load of y == 0xGUNK is committed.
Thread 2 fails assert.
The reason why it is permitted for Evaluation D to be reordered before C is because a consume does not forbid it. This is unlike an acquire-load, which prevents any load/store after it in program order from being reordered before it. Again, 5.1.2.4(15) states, a consume operation participating in a ‘‘dependency-ordered before’’ relationship provides ordering only with respect to operations to which this consume operation actually carries a dependency, and there most definitely is not a dependency between the two loads.
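To see what consume does order, here is a sketch (mine, not from the original answer) in which the second load actually carries a dependency from the consume load, so the assert cannot fire; note that in practice most compilers implement consume as acquire:
#include <atomic>
#include <cassert>

int payload;                      // plain, non-atomic data
std::atomic<int*> ptr{nullptr};

void producer() {
    payload = 20;
    ptr.store(&payload, std::memory_order_release);
}

void consumer() {
    int *p = ptr.load(std::memory_order_consume);
    if (p) {
        // Reading *p carries a dependency from the consume load, so this
        // read is dependency-ordered after the producer's stores.
        assert(*p == 20);
    }
}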
CppMem verification
CppMem is a tool that helps explore shared data access scenarios under the C11 and C++11 memory models.
For the following code that approximates the scenario in the question:
int main() {
atomic_int x, y;
y.store(30, mo_seq_cst);
{{{ { y.store(20, mo_release);
x.store(10, mo_release); }
||| { r3 = x.load(mo_consume).readsvalue(10);
r4 = y.load(mo_consume); }
}}};
return 0; }
The tool reports two consistent, race-free executions:
one in which y=20 is successfully read, and
one in which the "stale" initialization value y=30 is read.
By contrast, when mo_acquire is used for the loads, CppMem reports only one consistent, race-free execution: the correct one, in which y=20 is read.
Both acquire and consume establish a transitive "visibility" order on atomic stores, unless they have been issued with memory_order_relaxed. If a thread reads an atomic object x with one of these modes, it can be sure that it sees all modifications to all atomic objects y that were known to be done before the write to x.
The difference between "acquire" and "consume" is in the visibility of non-atomic writes to some variable z, say. For acquire all writes, atomic or not, are visible. For consume only the atomic ones are guaranteed to be visible.
thread 1                               thread 2
z = 5 ... store(&x, 3, release) ...... load(&x, acquire) ... z == 5 // we know that z is written
z = 5 ... store(&x, 3, release) ...... load(&x, consume) ... z == ? // we may not have the last value of z
Am I right in the following assumptions:
I don't need to explicitly synchronize access to the std::atomic<T> objects from different threads on any platform with my own synchronization objects
std::atomic<T> operations could be lock-free or non-lock-free, dependent on the platform
std::atomic_bool and std::atomic<bool> (and the other types like these) are the same things actually
std::atomic_flag is the only class that guarantees platform-independent lock-free operations by the standard
Also where can I find a useful info about std::memory_order and how to use it properly?
Let's go through one by one.
I don't need to explicitly synchronize access to the std::atomic<T> objects from different threads on any platform with my own synchronization objects
Yes, atomic objects are fully synchronized on all of their accessor methods.
The only time a data race can occur with an atomic type is during construction: it involves constructing an atomic object A, passing its address to another thread via an atomic pointer using memory_order_relaxed (to intentionally work around the sequential consistency of std::atomic), and then accessing A from that second thread. So, don't do that? :)
Speaking of construction, there are three ways to initialize your atomic types:
// Method 1: constructor
std::atomic<int> my_int(5);
// Method 2: atomic_init
std::atomic<int> my_int; // must be default constructed
std::atomic_init(&my_int, 5); // only allowed once
// Method 3: ATOMIC_VAR_INIT
// may be implemented using locks even if std::atomic<int> is lock-free
std::atomic<int> my_int = ATOMIC_VAR_INIT(5);
With either of the latter two methods, the same data-race possibility applies.
std::atomic<T> operations could be lock-free or non-lock-free, dependent on the platform
Correct. For all integral types, there are macros you can check that tell you whether a given atomic specialization is not, sometimes, or always lock-free. The value of the macros is 0, 1, or 2 respectively for these three cases. The full list of macros is taken from §29.4 of the standard, where unspecified is their stand-in for "0, 1, or 2":
#define ATOMIC_BOOL_LOCK_FREE unspecified
#define ATOMIC_CHAR_LOCK_FREE unspecified
#define ATOMIC_CHAR16_T_LOCK_FREE unspecified
#define ATOMIC_CHAR32_T_LOCK_FREE unspecified
#define ATOMIC_WCHAR_T_LOCK_FREE unspecified
#define ATOMIC_SHORT_LOCK_FREE unspecified
#define ATOMIC_INT_LOCK_FREE unspecified
#define ATOMIC_LONG_LOCK_FREE unspecified
#define ATOMIC_LLONG_LOCK_FREE unspecified
#define ATOMIC_POINTER_LOCK_FREE unspecified
Note that these defines apply to both the unsigned and signed variants of the corresponding types.
In the case that the #define is 1, you have to check at runtime. This is accomplished as follows:
std::atomic<int> my_int;
if (my_int.is_lock_free()) {
// do lock-free stuff
}
if (std::atomic_is_lock_free(&my_int)) {
// also do lock-free stuff
}
std::atomic_bool and std::atomic<bool> (and the other types like these) are the same things actually
Yes, these are just typedefs for your convenience. The full list is found in Table 194 of the standard:
Named type | Integral argument type
----------------+-----------------------
atomic_char | char
atomic_schar | signed char
atomic_uchar | unsigned char
atomic_short | short
atomic_ushort | unsigned short
atomic_int | int
atomic_uint | unsigned int
atomic_long | long
atomic_ulong | unsigned long
atomic_llong | long long
atomic_ullong | unsigned long long
atomic_char16_t | char16_t
atomic_char32_t | char32_t
atomic_wchar_t | wchar_t
std::atomic_flag is the only class that guarantees platform-independent lock-free operations by the standard
Correct, as guaranteed by §29.7/2 in the standard.
Note that there's no guarantee on the initialization state of atomic_flag unless you initialize it with the macro as follows:
std::atomic_flag guard = ATOMIC_FLAG_INIT; // guaranteed to be initialized cleared
The analogous macro for the other atomic types is ATOMIC_VAR_INIT, shown earlier.
The standard does not specify whether atomic_flag can experience the same construction-time data race as the other atomic types.
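Since test_and_set and clear are essentially all that atomic_flag offers, its classic use is a spinlock; a minimal sketch:
#include <atomic>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;  // starts cleared (unlocked)

void lock() {
    // test_and_set returns the previous value: spin until we are the ones
    // who flipped the flag from clear to set.
    while (lock_flag.test_and_set(std::memory_order_acquire))
        ;  // busy-wait
}

void unlock() {
    lock_flag.clear(std::memory_order_release);
}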
Also where can I find a useful info about std::memory_order and how to use it properly?
As suggested by @WhozCraig, cppreference.com has the best reference.
And as @erenon suggests, Boost.Atomic has a great essay on how to use memory fences for lock-free programming.
From the C++ (C++11) standard, §1.9.15 which discusses ordering of evaluation, is the following code example:
void g(int i, int* v) {
i = v[i++]; // the behavior is undefined
}
As noted in the code sample, the behavior is undefined.
(Note: The answer to another question with the slightly different construct i + i++, Why is a = i + i++ undefined and not unspecified behaviour, might apply here: The answer is essentially that the behavior is undefined for historical reasons, and not out of necessity. However, the standard seems to imply some justification for this being undefined - see quote immediately below. Also, that linked question indicates agreement that the behavior should be unspecified, whereas in this question I am asking why the behavior is not well-specified.)
The reasoning given by the standard for the undefined behavior is as follows:
If a side effect on a scalar object is unsequenced relative to either
another side effect on the same scalar object or a value computation
using the value of the same scalar object, the behavior is undefined.
In this example I would think that the subexpression i++ would be completely evaluated before the subexpression v[...] is evaluated, and that the result of evaluation of the subexpression is i (before the increment), but that the value of i is the incremented value after that subexpression has been completely evaluated. I would think that at that point (after the subexpression i++ has been completely evaluated), the evaluation v[...] takes place, followed by the assignment i = ....
Therefore, although the incrementing of i is pointless, I would nonetheless think that this should be defined.
Why is this undefined behavior?
I would think that the subexpression i++ would be completely evaluated before the subexpression v[...] is evaluated
But why would you think that?
One historical reason for this code being UB is to allow compiler optimizations to move side-effects around anywhere between sequence points. The fewer sequence points, the more potential opportunities to optimize but the more confused programmers. If the code says:
a = v[i++];
The intention of the standard is that the code emitted can be:
a = v[i];
++i;
which might be two instructions where:
tmp = i;
++i;
a = v[tmp];
would be more than two.
The "optimized code" breaks when a is i, but the standard permits the optimization anyway, by saying that behavior of the original code is undefined when a is i.
The standard easily could say that i++ must be evaluated before the assignment as you suggest. Then the behavior would be fully defined and the optimization would be forbidden. But that's not how C and C++ do business.
Also beware that many examples raised in these discussions make it easier to tell that there's UB around than it is in general. This leads to people saying that it's "obvious" the behavior should be defined and the optimization forbidden. But consider:
void g(int *i, int* v, int *dst) {
*dst = v[(*i)++];
}
The behavior of this function is defined when i != dst, and in that case you'd want all the optimization you can get (which is why C99 introduces restrict, to allow more optimizations than C89 or C++ do). In order to give you the optimization, behavior is undefined when i == dst. The C and C++ standards tread a fine line when it comes to aliasing, between undefined behavior that's not expected by the programmer, and forbidding desirable optimizations that fail in certain cases. The number of questions about it on SO suggests that the questioners would prefer a bit less optimization and a bit more defined behavior, but it's still not simple to draw the line.
Aside from the question of whether the behavior is fully defined, there is the issue of whether it should be UB, or merely an unspecified order of execution of certain well-defined operations corresponding to the sub-expressions. The reason C goes for UB is all to do with the idea of sequence points, and the fact that the compiler need not actually have a notion of the value of a modified object until the next sequence point. So rather than constrain the optimizer by saying that "the" value changes at some unspecified point, the standard just says (to paraphrase): (1) any code that relies on the value of a modified object prior to the next sequence point has UB; (2) any code that modifies a modified object has UB. Where a "modified object" is any object that would have been modified since the last sequence point in one or more of the legal orders of evaluation of the subexpressions.
Other languages (e.g. Java) go the whole way and completely define the order of expression side-effects, so there's definitely a case against C's approach. C++ just doesn't accept that case.
I'm going to design a pathological computer¹. It is a multi-core, high-latency, single-thread system with in-thread joins that operates with byte-level instructions. So you make a request for something to happen, then the computer runs (in its own "thread" or "task") a byte-level set of instructions, and a certain number of cycles later the operation is complete.
Meanwhile, the main thread of execution continues:
void foo(int v[], int i){
i = v[i++];
}
becomes in pseudo-code:
input variable i // = 0x00000000
input variable v // = &[0xBAADF00D, 0xABABABAB, 0x10101010]
task get_i_value: GET_VAR_VALUE<int>(i)
reg indx = WAIT(get_i_value)
task write_i++_back: WRITE(i, INC(indx))
task get_v_value: GET_VAR_VALUE<int*>(v)
reg arr = WAIT(get_v_value)
task get_v[i]_value = CALC(arr + sizeof(int)*indx)
reg pval = WAIT(get_v[i]_value)
task read_v[i]_value = LOAD_VALUE<int>(pval)
reg got_value = WAIT(read_v[i]_value)
task write_i_value_again = WRITE(i, got_value)
(discard, discard) = WAIT(write_i++_back, write_i_value_again)
So you'll notice that I didn't wait on write_i++_back until the very end, at the same time as I was waiting on write_i_value_again (whose value I loaded from v[]). And, in fact, those writes are the only writes back to memory.
Imagine that writes to memory are the really slow part of this computer's design, and that they get batched up into a queue of things that get processed by a parallel memory-modifying unit that does things on a per-byte basis.
So the write(i, 0x00000001) and write(i, 0xBAADF00D) execute unordered and in parallel. Each gets turned into byte-level writes, and they are randomly ordered.
We end up writing 0x00 then 0xBA to the high byte, then 0xAD and 0x00 to the next byte, then 0xF0 0x00 to the next byte, and finally 0x0D 0x01 to the low byte. The resulting value in i is 0xBA000001, which few would expect, yet would be a valid result to your undefined operation.
Now, all I did there was result in an unspecified value. We haven't crashed the system. But the compiler would be free to make it completely undefined -- maybe sending two such requests to the memory controller for the same address in the same batch of instructions actually crashes the system. That would still be a "valid" way to compile C++, and a "valid" execution environment.
Remember, this is a language where restricting the size of pointers to 8 bits is still a valid execution environment. C++ allows for compiling to rather wonky targets.
¹ As noted in @SteveJessop's comment below, the joke is that this pathological computer behaves a lot like a modern desktop computer, until you get down to the byte-level operations. Non-atomic int writing by a CPU isn't all that rare on some hardware (such as when the int isn't aligned the way the CPU wants it to be aligned).
The reason is not just historical. Example:
int f(int& i0, int& i1) {
return i0 + i1++;
}
Now, what happens with this call:
int i = 3;
int j = f(i, i);
It's certainly possible to put requirements on the code in f so that the result of this call is well defined (Java does this), but C and C++ don't impose constraints; this gives more freedom to optimizers.
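To see the freedom involved, here are two plausible expansions (mine) of i0 + i1++ when both references name the same i == 3. Since the read and the modification are unsequenced, the behavior is actually undefined and any result is permitted; these are merely the two "natural" outcomes:
int expansion_a(int &i) {  // read the left operand first
    int lhs = i;           // 3
    int rhs = i;           // 3: the value of i1++ is the old value
    i = i + 1;             // side effect of i1++
    return lhs + rhs;      // 6
}

int expansion_b(int &i) {  // apply the side effect first
    int rhs = i;           // 3
    i = i + 1;             // side effect of i1++
    int lhs = i;           // 4: the left operand now sees the increment
    return lhs + rhs;      // 7
}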
You specifically refer to the C++11 standard so I'm going to answer with the C++11 answer. It is, however, very similar to the C++03 answer, but the definition of sequencing is different.
C++11 defines a sequenced before relation between evaluations on a single thread. It is asymmetric, transitive and pair-wise. If some evaluation A is not sequenced before some evaluation B and B is also not sequenced before A, then the two evaluations are unsequenced.
Evaluating an expression includes both value computations (working out the value of some expression) and side effects. One instance of a side effect is the modification of an object, which is the most important one for answering question. Other things also count as side effects. If a side effect is unsequenced relative to another side effect or value computation on the same object, then your program has undefined behaviour.
So that's the set up. The first important rule is:
Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.
So any full expression is fully evaluated before the next full expression. In your question, we're only dealing with one full expression, namely i = v[i++], so we don't need to worry about this. The next important rule is:
Except where noted, evaluations of operands of individual operators and of subexpressions of individual expressions are unsequenced.
That means that in a + b, for example, the evaluation of a and b are unsequenced (they may be evaluated in any order). Now for our final important rule:
The value computations of the operands of an operator are sequenced before the value computation of the result of the operator.
So for a + b, the sequenced before relationships can be represented by a tree where a directed arrow represents the sequenced before relationship:
a + b (value computation)
^ ^
| |
a b (value computation)
If two evaluations occur in separate branches of the tree, they are unsequenced, so this tree shows that the evaluations of a and b are unsequenced relative to each other.
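A quick sketch of the consequence (my example): because the operands of + are unsequenced relative to each other (the two calls themselves are indeterminately sequenced), either output order is conforming:
#include <cstdio>

int f() { std::printf("f "); return 1; }
int g() { std::printf("g "); return 2; }

int main() {
    int sum = f() + g();  // may print "f g" or "g f"
    std::printf("= %d\n", sum);
}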
Now, let's do the same thing to your i = v[i++] example. We make use of the fact that v[i++] is defined to be equivalent to *(v + (i++)). We also use some extra knowledge about the sequencing of postfix increment:
The value computation of the ++ expression is sequenced before the modification of the operand object.
So here we go (a node of the tree is a value computation unless specified as a side effect):
i = v[i++]
^ ^
| |
i★ v[i++] = *(v + (i++))
^
|
v + (i++)
^ ^
| |
v ++ (side effect on i)★
^
|
i
Here you can see that the side effect on i, i++, is in a separate branch to the usage of i in front of the assignment operator (I marked each of these evaluations with a ★). So we definitely have undefined behaviour! I highly recommend drawing these diagrams if you ever wonder if your sequencing of evaluations is going to cause you trouble.
So now we get the question about the fact that the value of i before the assignment operator doesn't matter, because we write over it anyway. But actually, in the general case, that's not true. We can override the assignment operator and make use of the value of the object before the assignment. The standard doesn't care that we don't use that value - the rules are defined such that having any value computation unsequenced with a side effect will be undefined behaviour. No buts. This undefined behaviour is there to allow the compiler to emit more optimized code. If we add sequencing for the assignment operator, this optimization cannot be employed.
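A sketch of the overloaded-assignment point above (my example): for a class type, operator= is an ordinary function and may legitimately observe the value being overwritten, so the language cannot simply assume the old value is dead:
#include <cstdio>

struct Tracked {
    int value = 0;
    Tracked &operator=(int v) {
        // A user-defined assignment may read the old value before storing.
        std::printf("overwriting %d with %d\n", value, v);
        value = v;
        return *this;
    }
};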
In this example I would think that the subexpression i++ would be completely evaluated before the subexpression v[...] is evaluated, and that the result of evaluation of the subexpression is i (before the increment), but that the value of i is the incremented value after that subexpression has been completely evaluated.
The increment in i++ must be evaluated before indexing v and thus before assigning to i, but storing the value of that increment back to memory need not happen before. In the statement i = v[i++] there are two suboperations that modify i (i.e. will end up causing a store from a register into the variable i). The expression i++ is equivalent to x=i+1, i=x, and there is no requirement that both operations need to take place sequentially:
x = i+1;
y = v[i];
i = y;
i = x;
With that expansion, the final value of i is unrelated to the value in v[i]. In a different expansion, the i = x assignment could take place before the i = y assignment, and the result would be i = v[i].
There are two rules.
The first rule is about multiple writes which give rise to a "write-write hazard": the same object cannot be modified more than once between two sequence points.
The second rule is about "read-write hazards". It is this: if an object is modified in an expression, and also accessed, then all accesses to its value must be for the purpose of computing the new value.
Expressions like i++ + i++ and your expression i = v[i++] violate the first rule. They modify an object twice.
An expression like i + i++ violates the second rule. The subexpression i on the left observes the value of a modified object, without being involved in the calculation of its new value.
So, i = v[i++] violates a different rule (bad write-write) from i + i++ (bad read-write).
The rules are too simplistic, which gives rise to classes of puzzling expressions. Consider this:
p = p->next = q
This appears to have a sane data-flow dependency that is free of hazards: the assignment p = cannot take place until the new value is known. The new value is the result of p->next = q. The value q should not "race ahead" and get inside p, such that p->next is affected.
Yet, this expression breaks the second rule: p is modified, and also used for a purpose not related to computing its new value, namely determining the storage location where the value of q is placed!
So, perversely, compilers are allowed to partially evaluate p->next = q to determine that the result is q, and store that into p, and then go back and complete the p->next = assignment. Or so it would seem.
A key issue here is, what is the value of an assignment expression? The C standard says that the value of an assignment expression is that of the lvalue, after the assignment. But that is ambiguous: it could be interpreted as meaning "the value which the lvalue will have, once the assignment takes place" or as "the value which can be observed in the lvalue after the assignment has taken place". In C++ this is made clear by the wording "[i]n all cases, the assignment is sequenced after the value computation of the right and left operands, and before the value computation of the assignment expression.", so p = p->next = q appears to be valid C++, but dubious C.
I would share your arguments if the example were v[++i], but since i++ modifies i as a side-effect, it is undefined as to when the value is modified. The standard could probably mandate a result one way or the other, but there's no true way of knowing what the value of i should be: (i + 1) or (v[i + 1]).
Think about the sequences of machine operations necessary for each of the following assignment statements, assuming the given declarations are in effect:
extern int *foo(void);
extern int *p;
*p = *foo();
*foo() = *p;
If the evaluations of the left-hand lvalue and of the right-hand value are unsequenced, the most efficient ways to process the two assignments would likely be something like:
[For *p = *foo()]
call foo (which yields result in r0 and trashes r1)
load r0 from address held in r0
load r1 from address held in p
store r0 to address held in r1
[For *foo() = *p]
call foo (which yields result in r0 and trashes r1)
load r1 from address held in p
load r1 from address held in r1
store r1 to address held in r0
In either case, if p or *p were read into a register before the call to foo, then unless "foo" promises not to disturb that register, the compiler would need to add an extra step to save its value before calling "foo", and another extra step to restore the value afterward. That extra step might be avoided by using a register that "foo" won't disturb, but that would only help if there were such a register that didn't hold a value needed by the surrounding code.
Letting the compiler read the value of "p" before or after the function call, at its leisure, will allow both patterns above to be handled efficiently. Requiring that the address of the left-hand operand of "=" always be evaluated before the right hand side would likely make the first assignment above less efficient than it otherwise could be, and requiring that the address of the left-hand operand be evaluated after the right-hand side would make the second assignment less efficient.