Is the concept of release-sequence useful in practice?

C++ atomic semantics only guarantee visibility (through the happens-before relation) of memory operations performed by the last thread that did a release write (a simple store or a read-modify-write operation).
Consider
int x, y;
atomic<int> a;
Thread 1:
x = 1;
a.store(1,memory_order_release);
Thread 2:
y = 2;
if (a.load(memory_order_relaxed) == 1)
a.store(2,memory_order_release);
Then observing a == 2 implies visibility of thread 2's operations (y == 2) but not thread 1's (one cannot even legally read x).
As far as I know, real implementations of multithreading use the concepts of fences (and sometimes release stores) but not happens-before or release-sequence, which are high-level C++ concepts; I fail to see what real hardware details these concepts map to.
How can a real implementation not guarantee visibility of thread 1 memory operations when the value of 2 in a is globally visible?
In other words, is there any good in the release-sequence definition? Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Consider in particular silly-thread 3:
if (a.load(memory_order_relaxed) == 2)
a.store(2,memory_order_relaxed);
Can silly-thread 3 ever suppress any visibility guarantee on any real hardware? In other words, if value 2 is globally visible, how would making it again globally visible break any ordering?
Is my mental model of real multiprocessing incorrect? Can a value be partially visible, on some CPU but not on another?
(Of course I assume non-crazy semantics for relaxed writes, as writes that go back in time make the language semantics of C++ absolutely nonsensical, unlike safe languages like Java that always have bounded semantics. No real implementation can have crazy, non-causal relaxed semantics.)

Let's first answer your question:
Why wouldn't the release-sequence extend to every subsequent modification in the modification order?
Because if so, we would lose some potential optimization. For example, consider the thread:
x = 1; // #1
a.store(1,memory_order_relaxed); // #2
Under the current rules, the compiler is able to reorder #1 and #2. However, with the extended release-sequence, the compiler would not be allowed to reorder the two lines, because another thread like your thread 2 could introduce a release sequence headed by #2 and tailed by a release operation; it would thus be possible for some read-acquire operation in another thread to synchronize with #2.
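To make that concrete, here is a hedged sketch of the chain the compiler would have to protect against under the extension (the thread names and the acquire reader are hypothetical, added for illustration):

#include <atomic>
#include <cassert>

int x;
std::atomic<int> a{0};

void thread_A() {
    x = 1;                                      // #1
    a.store(1, std::memory_order_relaxed);      // #2: relaxed, so today #1 may sink below it
}

void thread_B() {                               // like "thread 2" in the question
    if (a.load(std::memory_order_relaxed) == 1)
        a.store(2, std::memory_order_release);  // under the extension, this would tail a sequence reaching back to #2
}

void thread_C() {
    if (a.load(std::memory_order_acquire) == 2) // would then synchronize with #2
        assert(x == 1);                         // so the compiler could no longer reorder #1 and #2
}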
You give a specific example, and claim that all implementations would produce a specific outcome while the language rules do not guarantee this outcome. This is not a problem, because the language rules are intended to handle all cases, not only your specific example. Of course the language rules could be improved to guarantee the expected outcome for your specific example, but that is not trivial work. At least, as argued above, simply extending the definition of release-sequence is not an acceptable solution.

Related

If a RMW operation changes nothing, can it be optimized away, for all memory orders?

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations, such as:
x++,
x--;
or even simply
x+=0; // return value is ignored
For an atomic scalar x?
Does that hold for sequential consistency or just weaker memory orders?
(Note: this is about weaker memory orders that still do something; for relaxed, there is no real question here. EDIT AGAIN: no, actually there is a serious question in that special case. See my own answer. Not even relaxed is cleared for removal.)
EDIT:
The question is not about code generation for a particular access: if I wanted to see two lock add instructions generated on Intel for the first example, I would have made x volatile.
The question is whether these C/C++ instructions have any impact whatsoever: can the compiler just filter out and remove these null operations (those that are not relaxed-order operations), as a sort of source-to-source transformation? (Or abstract-tree-to-abstract-tree transformation, perhaps in the compiler "front end".)
EDIT 2:
Summary of the hypotheses:
not all operations are relaxed
nothing is volatile
atomic objects are really potentially accessible by multiple functions and threads (no automatic atomic whose address isn't shared)
Optional hypotheses:
If you want, you may assume that the address of the atomic is not taken, that all accesses are by name, and that all accesses have one of these properties:
(1) No access of that variable, anywhere, has a relaxed load/store element: all load operations should have acquire and all stores should have release (so all RMW should be at least acq_rel).
(2) For those accesses that are relaxed, the access code doesn't read the value for a purpose other than changing it: a relaxed RMW does not conserve the value further (and does not test the value to decide what to do next). In other words, there is no data or control dependency on the value of the atomic object unless the load has an acquire.
(3) All accesses of the atomic are sequentially consistent.
That is, I'm especially curious about these (I believe quite common) use cases.
Note: an access is not considered "completely relaxed", even if it's done with a relaxed memory order, when the code makes sure observers have the same memory visibility. So this is considered valid for (1) and (2):
atomic_thread_fence(std::memory_order_release);
x.store(1,std::memory_order_relaxed);
as the memory visibility is at least as good as with just x.store(1,std::memory_order_release);
This is considered valid for (1) and (2):
int v = x.load(std::memory_order_relaxed);
atomic_thread_fence(std::memory_order_acquire);
for the same reason.
This is stupidly, trivially valid for (2) (i is just an int)
i=x.load(std::memory_order_relaxed),i=0; // useless
as no information from a relaxed operation was kept.
This is valid for (2):
(void)x.fetch_add(1, std::memory_order_relaxed);
This is not valid for (2):
if (x.load(std::memory_order_relaxed))
f();
else
g();
as a consequential decision was based on a relaxed load. Neither is this:
i += x.fetch_add(1, std::memory_order_release);
Note: (2) covers one of the most common uses of an atomic: the thread-safe reference counter. (CORRECTION: it isn't clear that all thread-safe counters technically fit the description, as the acquire can be done only on the 0 post-decrement, and then a decision was taken based on counter > 0 without an acquire; a decision to not do something, but still...)
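For reference, here is a minimal sketch of that reference-counter pattern (the RefCounted type is made up for illustration; the idiom follows the widely documented shared_ptr-style counter, with the acquire happening only on the final decrement, as in the CORRECTION above):

#include <atomic>

struct RefCounted {
    std::atomic<int> refs{1};
    void add_ref() {
        refs.fetch_add(1, std::memory_order_relaxed);  // value ignored: fits (2)
    }
    void release() {
        if (refs.fetch_sub(1, std::memory_order_release) == 1) {
            std::atomic_thread_fence(std::memory_order_acquire); // acquire only on 0 post-decrement
            delete this;
        }
    }
};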
No, definitely not entirely. It's at least a memory barrier within the thread for stronger memory orders.
For mo_relaxed atomics, yes I think it could in theory be optimized away completely, as if it wasn't there in the source. It's equivalent for a thread to simply not be part of a release-sequence it might have been part of.
If you used the result of the fetch_add(0, mo_relaxed), then I think collapsing them together and just doing a load instead of an RMW of 0 might not be exactly equivalent. Barriers in this thread surrounding the relaxed RMW still have an effect on all operations, including ordering the relaxed operation wrt. non-atomic operations. With a load+store tied together as an atomic RMW, things that order stores could order an atomic RMW when they wouldn't have ordered a pure load.
But I don't think any C++ ordering is like that: mo_release stores order earlier loads and stores, and atomic_thread_fence(mo_release) is like an asm StoreStore + LoadStore barrier. (Preshing on fences). So yes, given that any C++-imposed ordering would also apply to a relaxed load equally to a relaxed RMW, I think int tmp = shared.fetch_add(0, mo_relaxed) could be optimized to just a load.
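In other words, the transformation under discussion is something like this sketch (shared is a hypothetical variable):

std::atomic<int> shared;
int tmp = shared.fetch_add(0, std::memory_order_relaxed);  // RMW that changes nothing
// ... could plausibly be compiled as if it were written:
int tmp2 = shared.load(std::memory_order_relaxed);         // plain relaxed load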
(In practice compilers don't optimize atomics at all, basically treating them all like volatile atomic, even for mo_relaxed. Why don't compilers merge redundant std::atomic writes? and http://wg21.link/n4455 + http://wg21.link/p0062. It's too hard / no mechanism exists to let compilers know when not to.)
But yes, the ISO C++ standard on paper makes no guarantee that other threads can actually observe any given intermediate state.
Thought experiment: Consider a C++ implementation on a single-core cooperative multi-tasking system. It implements std::thread by inserting yield calls where needed to avoid deadlocks, but not between every instruction. Nothing in the standard requires a yield between num++ and num-- to let other threads observe that state.
The as-if rule basically allows a compiler to pick a legal/possible ordering and decide at compile-time that it's what happens every time.
In practice this can create fairness problems: an unlock/re-lock never actually gives other threads the chance to take the lock if the --/++ are combined into just a memory barrier with no modification of the atomic object! This, among other things, is why compilers don't optimize.
Any stronger ordering for one or both of the operations can begin or be part of a release-sequence that synchronizes-with a reader. A reader that does an acquire load of a release store/RMW Synchronizes-With this thread, and must see all previous effects of this thread as having already happened.
IDK how the reader would know that it was seeing this thread's release-store instead of some previous value, so a real example is probably hard to cook up. At least we could create one without possible UB, e.g. by reading the value of another relaxed atomic variable so we avoid data-race UB if we didn't see this value.
Consider the sequence:
// broken code where optimization could fix it
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
done.fetch_add(-1, mo_relaxed);
done.fetch_add(+1, mo_release); // release-store publishes the result
This could optimize to just done.store(1, mo_release); which correctly publishes a 1 to the other thread without the risk of the 1 being visible too soon, before the updated buf values.
But it could also optimize just the cancelling pair of RMWs into a fence after the relaxed store, which would still be broken. (And not the optimization's fault.)
// still broken
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
atomic_thread_fence(mo_release);
I haven't thought of an example where safe code becomes broken by a plausible optimization of this sort. Of course just removing the pair entirely even when they're seq_cst wouldn't always be safe.
A seq_cst increment and decrement does still create a sort of memory barrier. If they weren't optimized away, it would be impossible for earlier stores to interleave with later loads. To preserve this, compiling for x86 would probably still need to emit mfence.
Of course the obvious thing would be a lock add [x], 0 which does actually do a dummy RMW of the shared object that we did x++/x-- on. But I think the memory barrier alone, not coupled to an access to that actual object or cache line is sufficient.
And of course it has to act as a compile-time memory barrier, blocking compile-time reordering of non-atomic and atomic accesses across it.
For acq_rel or weaker fetch_add(0) or cancelling sequence, the run-time memory barrier might happen for free on x86, only needing to restrict compile-time ordering.
See also a section of my answer on Can num++ be atomic for 'int num'?, and in comments on Richard Hodges' answer there. (But note that some of that discussion is confused by arguments about when there are modifications to other objects between the ++ and --. Of course all ordering of this thread's operations implied by the atomics must be preserved.)
As I said, this is all hypothetical and real compilers aren't going to optimize atomics until the dust settles on N4455 / P0062.
The C++ memory model provides four coherence requirements for all atomic accesses to the same atomic object. These requirements apply regardless of the memory order. As stated in a non-normative note:
The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads.
Emphasis added.
Given that both operations are happening to the same atomic variable, and the first definitely happens before the second (due to being sequenced before it), there can be no reordering of these operations. Again, even if relaxed operations are used.
If this pair of operations were removed by a compiler, that would guarantee that no other threads would ever see the incremented value. So the question now becomes whether the standard would require some other thread to be able to see the incremented value.
It does not. Without some way for something to guarantee-ably "happen after" the increment and "happen before" the decrement, there is no guarantee that any operation on any other thread will certainly see the incremented value.
This leaves one question: does the second operation always undo the first? That is, does the decrement undo the increment? That depends on the scalar type in question. ++ and -- are only defined for the pointer and integer specializations of atomic. So we only need to consider those.
For pointers, the decrement undoes the increment. The reason being that the only way incrementing+decrementing a pointer would not result in the same pointer to the same object is if incrementing the pointer was itself UB. That is, if the pointer is invalid, NULL, or is the past-the-end pointer to an object/array. But compilers don't have to consider UB cases since... they're undefined behavior. In all of the cases where incrementing is valid, pointer decrementing must also be valid (or UB, perhaps due to someone freeing the memory or otherwise making the pointer invalid, but again, the compiler doesn't have to care).
For unsigned integers, the decrement always undoes the increment, since wraparound behavior is well-defined for unsigned integers.
That leaves signed integers. C++ usually makes signed integer over/underflow into UB. Fortunately, that's not the case for atomic math; the standard explicitly says:
For signed integer types, arithmetic is defined to use two's complement representation. There are no undefined results.
Wraparound behavior for two's complement atomics is well-defined. That means an increment followed by a decrement always recovers the original value.
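As a quick illustration (a sketch; x is hypothetical), even at the edge of the value range the round trip is well-defined:

#include <atomic>
#include <climits>

std::atomic<int> x{INT_MAX};

void round_trip() {
    x.fetch_add(1, std::memory_order_relaxed); // wraps INT_MAX to INT_MIN: defined for atomics, not UB
    x.fetch_sub(1, std::memory_order_relaxed); // wraps back to INT_MAX: the decrement undoes the increment
}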
So there does not appear to be anything in the standard which would prevent compilers from removing such operations. Again, regardless of the memory ordering.
Now, if you use non-relaxed memory ordering, the implementation cannot completely remove all traces of the atomics. The actual memory barriers behind those orderings still have to be emitted. But the barriers can be emitted without emitting the actual atomic operations.
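As a sketch of what that could look like (hypothetical; as noted elsewhere, real compilers currently don't do this):

// Source: a seq_cst pair that cancels out.
x.fetch_add(1, std::memory_order_seq_cst);
x.fetch_sub(1, std::memory_order_seq_cst);
// Could in principle be compiled as just the barrier, with no access to x at all:
std::atomic_thread_fence(std::memory_order_seq_cst);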
In the C/C++ memory model, can a compiler just combine and then remove
redundant/NOP atomic modification operations,
No, the removal part is not allowed, at least not in the specific way the question suggests: the intent here is to describe valid source-to-source transformations (abstract-tree-to-abstract-tree, or rather transformations of a higher-level description of the source code that encodes all the relevant semantic elements that might be needed by later phases of compilation).
The hypothesis is that code generation can be done on the transformed program, without ever checking with the original one. So only safe transformations that cannot break any code are allowed.
(Note: For weaker memory orders that still do something; for relaxed,
there is no real question here.)
No. Even that is wrong: even for relaxed operations, unconditional removal isn't a valid transformation (in most practical cases it is certainly valid, but mostly correct is still wrong, and "true in >99% of practical cases" has nothing to do with the question):
Before the introduction of standard threads, a stuck program was an infinite loop performing no externally visible side effects: no input, no output, no volatile operation, and in practice no system call. A program that will never perform something visible is stuck, and its behavior is not defined; that allows the compiler to assume pure algorithms terminate: loops containing only invisible computations must exit somehow (including by throwing an exception).
With threads, that definition is obviously not usable: a loop in one thread isn't the whole program. A stuck program is really one in which no thread can make progress toward anything useful, and forbidding that would be sound.
But the very problematic standard definition of stuck doesn't describe a program execution but a single thread: a thread is stuck if it will perform no side effect that could potentially have an effect on observable side effects, that is:
1. nothing observable, obviously (no I/O);
2. no action that might interact with another thread.
The standard definition of 2 is extremely broad and simplistic: all operations on an inter-thread communication device count, that is any atomic operation and any action on any mutex. Full text of the requirement (the relevant part is the last item):
[intro.progress]
The implementation may assume that any thread will eventually do one
of the following:
terminate,
make a call to a library I/O function,
perform an access through a volatile glvalue, or
perform a synchronization operation or an atomic operation.
[ Note: This is intended to allow compiler transformations such as removal of
empty loops, even when termination cannot be proven. — end note ]
That definition does not even specify:
an inter-thread communication (from one thread to another),
a shared state (visible by multiple threads),
a modification of some state,
an object that is not thread-private.
That means that all these silly operations count:
for fences:
performing an acquire fence (even when followed by no atomic operation) in a thread that has at least once done an atomic store, which can synchronize with another fence or atomic operation;
for mutexes:
locking a recently created, patently useless, function-private mutex;
locking a mutex just to unlock it, doing nothing while the mutex is locked;
for atomics:
reading an atomic variable declared as const-qualified (not a const reference to a non-const atomic);
reading an atomic and ignoring the value, even with relaxed memory ordering;
setting a non-const-qualified atomic to its own immutable value (setting a variable to zero when nothing in the whole program sets it to a non-zero value), even with relaxed ordering;
doing operations on a local atomic variable not accessible by other threads;
for thread operations:
creating a thread (that might do nothing) and joining it, which seems to create a (NOP) synchronization operation.
It means that no early, local transformation of program code that removes even the most silly and useless inter-thread primitive, leaving no trace of the transformation for later compiler phases, is absolutely, unconditionally valid according to the standard, as it might remove the last potentially useful (but actually useless) operation in a loop (a loop doesn't have to be spelled for or while; it's any looping construct, e.g. a backward goto).
This however doesn't apply if other operations on inter-thread primitives are left in the loop, or obviously if I/O is done.
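As a concrete illustration (a sketch; the variable is hypothetical), here is a loop that does nothing useful yet still "counts" under [intro.progress], so an early pass that silently removed the load could make removal of the whole loop appear legal to later phases:

#include <atomic>

std::atomic<int> private_flag{0}; // never shared with any other thread

void stuck_but_counts() {
    for (;;)
        private_flag.load(std::memory_order_relaxed); // an atomic operation, so the thread "makes progress"
}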
This looks like a defect.
A meaningful requirement should:
not be based merely on the use of thread primitives;
not be about any thread in isolation (you can't tell whether a thread in isolation is contributing to anything, but at least requiring a meaningful interaction with another thread, rather than use of a private atomic or mutex, would be better than the current requirement);
be based on doing something useful (program observables) and on inter-thread interactions that contribute to something being done.
I'm not proposing a replacement right now, as the rest of the thread specification isn't even clear to me.

confused about atomic class: memory_order_relaxed

I am studying this site: https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync, which is very helpful for understanding the topic of atomics.
But this example about relaxed mode is hard to understand:
/* Thread 1 */
y.store (20, memory_order_relaxed);
x.store (10, memory_order_relaxed);
/* Thread 2 */
if (x.load (memory_order_relaxed) == 10)
{
    assert (y.load(memory_order_relaxed) == 20); /* assert A */
    y.store (10, memory_order_relaxed);
}
/* Thread 3 */
if (y.load (memory_order_relaxed) == 10)
    assert (x.load(memory_order_relaxed) == 10); /* assert B */
To me, assert B should never fail, since x must be 10 and y == 10 because thread 2 has conditioned on this.
But the website says either assert in this example can actually FAIL.
To me, assert B should never fail, since x must be 10 and y == 10 because thread 2 has conditioned on this.
In effect, your argument is that since in thread 2 the store of 10 into x occurred before the store of 10 into y, in thread 3 the same must be the case.
However, since you are only using relaxed memory operations, there is nothing in the code that requires two different threads to agree on the ordering between modifications of different variables. So indeed thread 2 may see the store of 10 into x before the store of 10 into y while thread 3 sees those two operations in the opposite order.
In order to ensure that assert B succeeds, you would, in effect, need to ensure that when thread 3 sees the value 10 of y, it also sees any other side effects performed by the thread that stored 10 into y before the time of the store. That is, you need the store of 10 into y to synchronize with the load of 10 from y. This can be done by having the store perform a release and the load perform an acquire:
// thread 2
y.store (10, memory_order_release);
// thread 3
if (y.load (memory_order_acquire) == 10)
A release operation synchronizes with an acquire operation that reads the value stored. Now because the store in thread 2 synchronizes with the load in thread 3, anything that happens after the load in thread 3 will see the side effects of anything that happens before the store in thread 2. Hence the assertion will succeed.
Of course, we also need to make sure assertion A succeeds, by making the x.store in thread 1 use release and the x.load in thread 2 use acquire.
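Putting both fixes together, a sketch of the fully repaired example (same variables as above):

/* Thread 1 */
y.store (20, memory_order_relaxed);
x.store (10, memory_order_release);   // publishes the store to y

/* Thread 2 */
if (x.load (memory_order_acquire) == 10)  // synchronizes with thread 1
{
    assert (y.load(memory_order_relaxed) == 20); /* assert A: cannot fail */
    y.store (10, memory_order_release);   // publishes the store to x
}

/* Thread 3 */
if (y.load (memory_order_acquire) == 10)  // synchronizes with thread 2
    assert (x.load(memory_order_relaxed) == 10); /* assert B: cannot fail */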
I find it much easier to understand atomics with some knowledge of what might be causing these effects, so here's some background. Know that these concepts are in no way stated in the C++ language itself, but they are some of the possible reasons why things are the way they are.
Compiler reordering
Compilers, often when optimizing, will choose to refactor the program as long as its effects are the same on a single-threaded program. This is circumvented by the use of atomics, which tell the compiler (among other things) that the variable might change at any moment and that its value might be read elsewhere.
Formally, atomics ensure one thing: there will be no data races. That is, accessing the variable will not make your computer explode.
CPU reordering
The CPU might reorder instructions as it is executing them, which means the instructions might get reordered at the hardware level, independent of how you wrote the program.
Caching
Finally, there are the effects of caches, which are faster memories that sort of contain a partial copy of the global memory. Caches are not always in sync, meaning they don't always agree on what is "correct". Different threads may not be using the same cache, and because of this, they may not agree on what values variables have.
Back to the problem
What the above amounts to is pretty much what C++ says about the matter: unless explicitly stated otherwise, the ordering of the side effects of each instruction is totally and completely unspecified. It might not even be the same viewed from different threads.
Formally, the guarantee of an ordering of side effects is called a happens-before relation. Unless one side effect happens-before another, no ordering is guaranteed. Loosely, we just call it synchronization.
Now, what is memory_order_relaxed? It tells the compiler to stop meddling, but not to worry about how the CPU and cache (and possibly other things) behave. Therefore, one possibility of why you see the "impossible" assert failure might be:
Thread 1 stores 20 to y and then 10 to x to its cache.
Thread 2 reads the new values and stores 10 to y to its cache.
Thread 3 doesn't read the values from thread 1, but reads those of thread 2, and then the assert fails.
This might be completely different from what happens in reality; the point is that anything can happen.
To ensure a happens-before relation between the multiple reads and writes, see Brian's answer.
Another construct that provides happens-before relations is std::mutex, which is why mutexes are free from such insanities.
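For comparison, a minimal sketch of the mutex version of the same guarantee (names made up): unlock() in one thread synchronizes-with the next lock() of the same mutex, so everything written before the unlock is visible after the lock.

#include <mutex>

std::mutex m;
int shared_value; // non-atomic, protected by m

void writer() {
    std::lock_guard<std::mutex> g(m);
    shared_value = 42;        // happens-before any later critical section on m
}

void reader() {
    std::lock_guard<std::mutex> g(m);
    int v = shared_value;     // sees 42 if writer's critical section came first
    (void)v;
}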
The answer to your question is the C++ standard.
The section [intro.races] is surprisingly clear (which is not typical of normative text: formal consistency often hurts readability).
I have read many books and tutorials that treat the subject of memory ordering, but they just confused me.
Finally I read the C++ standard; the section [intro.multithread] is the clearest explanation I have found. Taking the time to read it carefully (twice!) may save you some time!
The answer to your question is in [intro.races]/4:
All modifications to a particular atomic object M occur in some particular total order, called the modification
order of M. [ Note: There is a separate order for each atomic object. There is no requirement that these can
be combined into a single total order for all objects. In general this will be impossible since different threads
may observe modifications to different objects in inconsistent orders. — end note ]
You were expecting a single total order on all atomic operations. There is such an order, but only for atomic operations that are memory_order_seq_cst as explained in [atomics.order]/3:
There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens
before” order and modification orders for all affected locations [...]

Using an atomic read-modify-write operation in a release sequence

Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:
std::atomic<int> sync{10};
Foo *fp;
// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);
// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);
// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();
The store/release in thread #1 orders Foo with the update to 11
thread #2a non-atomically increments the value of sync to 12
the synchronizes-with relationship between thread #1 and #3 is only established when #3 loads 11
The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (wrt 11) and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive since the modification order of sync is 10 → 11 → 12
The standard says (§ 1.10.1-6):
an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]
It also says in (§ 1.10.1-5):
A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.
Now, thread #2a is modified to use an atomic read-modify-write operation:
// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed));
If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12.
My questions about the use of an atomic read-modify-write are:
Does the scenario with thread #2b constitute a correct release sequence ?
And if so:
What are the specific properties of a read-modify-write operation that ensure this scenario is correct ?
Does the scenario with thread #2b constitute a correct release sequence ?
Yes, per your quote from the standard.
What are the specific properties of a read-modify-write operation that ensure this scenario is correct?
Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".
As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.
They already give a performance-related reason as to why they didn't choose the latter approach:
Such a requirement would sometimes interfere with efficient implementation.
In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.
So why didn't they then take the other "consistent" approach of not requiring mo_relaxed RMWs to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically1 stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.
Once you accept that this is how hardware works, it makes sense from a performance angle to align the standard: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release-sequence guarantees, but on real hardware that has weakly ordered CAS, this wouldn't come for free.
1 The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I am at the point where he describes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How does dependency-ordered-before differ from synchronizes-with? They both create a happens-before relationship. What is the exact difference?
I am confused about the following example:
int global_data[] = { … };
std::atomic<int> index;

void f()
{
    int i = index.load(std::memory_order_consume);
    do_something_with(global_data[std::kill_dependency(i)]);
}
What does kill_dependency exactly do? Which dependency does it kill? Between which entities? And how can the compiler exploit that knowledge?
Can all occurrences of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5];
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release with some examples of men in cubicles. Are there similar simple descriptions of seq_cst and consume?
As to the next-to-last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3"; the code still handles 1 and 2. When you do need number 3, you'd use acquire and release to ensure proper memory ordering.
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurrences of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments; kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be synchronized. In your example, do_something_with is allowed to assume any cached value of global_data[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.
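To tie the pieces together, here is a sketch of the usual consume use case (names made up): the consumer's dereference carries a data dependency on the loaded pointer, which is exactly the relation dependency-ordered-before protects.

#include <atomic>

struct Payload { int value; };
std::atomic<Payload*> ptr{nullptr};

void producer() {
    Payload* p = new Payload{42};
    ptr.store(p, std::memory_order_release);     // publish the fully built object
}

void consumer() {
    Payload* p;
    while (!(p = ptr.load(std::memory_order_consume)))
        ;                                        // spin until published
    int v = p->value; // data-dependent on the consume load: guaranteed to see 42
    (void)v;
}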

Combining stores/loads of consecutive atomic variables

Referring to a (slightly dated) paper by Hans Boehm, under "Atomic Operations": it mentions that the memory model (proposed at the time) would not prevent an optimizing compiler from combining a sequence of loads, or stores, on the same variable into a single load or store. His example is as follows (updated to what is hopefully correct current syntax):
Given
atomic<int> v;
The code
while( v.load( memory_order_acquire ) ) { ... }
Could be optimized to:
int a = v.load(memory_order_acquire);
while(a) { ... }
Obviously this would be bad, as he states. Now my question is, as the paper is a bit old, does the current C++0x memory model prevent this type of optimization, or is it still technically allowed?
My reading of the standard would seem to lean towards it being disallowed, but the use of "acquire" semantics makes it less clear. For example, if it were "seq_cst", the model seems to guarantee that the load must partake in a total ordering on the accesses, and loading the value only once would thus seem to violate the ordering (as it breaks the sequenced-before relationship).
For acquire, I interpret 29.3.2 to mean that this optimization cannot occur, since any "release" operation must be observed by the "acquire" operation. Doing only one acquire would seem invalid.
So my question is whether the current model (in the pending standard) would disallow this type of optimization? And if yes, then which part specifically forbids it? If no, does using a volatile atomic solve the problem?
And for bonus, if the load operation has a "relaxed" ordering is the optimization then allowed?
The C++0x standard attempts to outlaw this optimization.
The relevant words are from 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
If the thread doing the load only ever issues one load instruction, then this is violated: if it misses the write the first time, it will never see it. It doesn't matter which memory ordering is used for the load; it is the same for both memory_order_seq_cst and memory_order_relaxed.
However, the following optimization is allowed, unless there is something in the loop that forces an ordering:
while( v.load( memory_order_acquire ) ) {
for(unsigned __temp=0;__temp<100;++__temp) {
// original loop body goes here
}
}
i.e. the compiler can generate code that executes the actual loads arbitrarily infrequently, provided it still executes them. This is even permitted for memory_order_seq_cst unless there are other memory_order_seq_cst operations in the loop, since this is equivalent to running 100 iterations between any memory accesses by other threads.
As an aside, the use of memory_order_acquire doesn't have the effect you describe --- it is not required to see release operations (other than by 29.3p13 quoted above), just that if it does see the release operation then it imposes visibility constraints on other accesses.
Right from the very paper you're linking:
Volatiles guarantee that the right number of memory operations are
performed.
The standard says essentially the same:
Access to volatile objects are evaluated strictly according to the
rules of the abstract machine.
This has always been the case, since the very first C compiler by Dennis Ritchie, I think. It has to be this way because memory-mapped I/O registers won't work otherwise. To read two characters from your keyboard, you need to read the corresponding memory-mapped register twice. If the compiler had a different idea about the number of reads it has to perform, that would be too bad!
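A sketch of that keyboard example (the register address is made up): with volatile, each source-level read must actually be performed.

// Hypothetical memory-mapped keyboard data register:
volatile unsigned char* const KBD_DATA =
    reinterpret_cast<volatile unsigned char*>(0x60);

void read_two_chars(char out[2]) {
    out[0] = *KBD_DATA; // first read: first character
    out[1] = *KBD_DATA; // second read must not be merged with the first
}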