C++ atomic load ordering efficiency

C++ atomic load ordering efficiency - c++

I have a memory variable that is updated in thread A and read in other threads. The reader only cares if the value is non-zero. I am guaranteed that once the value is incremented, it never goes back to zero.
Does it make sense to optimize as below?
In other words, on the reader side, I dont need "fence" once I got my condition satisfied.
std::atomic<int> counter;
writer:
increment()
{
counter.store(counter+1, std:memory_order_release)
}
reader:
iszero()
{
if (counter.load(std::memory_order_relaxed) > 0) return false;
// memory fence only if condition not yet reached
return (counter.load(std::memory_order_acquire) == 0);
}

First, if you've not actually tried using the default (sequentially consistent) atomics, measured the performance of your app, profiled it, and shown observed them causing a performance problem, I'd suggest turning back now.
However, if you really do need to start reasoning about relaxed atomics...
That is not guaranteed to do what you expect, although it will almost certainly work on x86.
I'm guessing that you're using this to guard the publication of some other non-atomic data.
In that case, you need the guarantee that if you read a non-zero value in the reader thread, then various other side-effects to non-atomic memory locations (i.e. initializing the data you're publishing) that you made in the writer thread prior to the store will be visible to the reader thread.
Reading non-zero with std::memory_order_relaxed does not synchronize with the std::memory_order_release store, so your code above does not have this guarantee.
To get the behaviour I've described, you need to use std::memory_order_acquire. If you're on x86, then acquire doesn't produce any memory fence instructions, so the only way it will differ in performance from memory_order_relaxed is via preventing some compiler optimizations.

Related

std::atomic - behaviour of relaxed ordering

Can the following call to print result in outputting stale/unintended values?
std::mutex g;
std::atomic<int> seq;
int g_s = 0;
int i = 0, j = 0, k = 0; // ignore fact that these could easily made atomic
// Thread 1
void do_work() // seldom called
{
// avoid over
std::lock_guard<std::mutex> lock{g};
i++;
j++;
k++;
seq.fetch_add(1, std::memory_order_relaxed);
}
// Thread 2
void consume_work() // spinning
{
const auto s = g_s;
// avoid overhead of constantly acquiring lock
g_s = seq.load(std::memory_order_relaxed);
if (s != g_s)
{
// no lock guard
print(i, j, k);
}
}

TL:DR: this is super broken; use a Seq Lock instead. Or RCU if your data structure is bigger.
Yes, you have data-race UB, and in practice stale values are likely; so are inconsistent values (from different increments). ISO C++ has nothing to say about what will happen, so it depends on how it happens to compile for some real machine, and interrupts / context switches in the reader that happen in the middle of reading some of these multiple vars. e.g. if the reader sleeps for any reason between reading i and j, you could miss many updates, or at least get a j that doesn't match your i.
Relaxed seq with writer+reader using lock_guard
I'm assuming the writer would look the same, so the atomic RMW increment is inside the critical section.
I'm picturing the reader checking seq like it is now, and only taking a lock after that, inside the block that runs print.
Even if you did use lock_guard to make sure the reader got a consistent snapshot of all three variables (something you couldn't get from making each of them separately atomic), I'm not sure relaxed would be sufficient in theory. It might be in practice on most real implementations for real machines (where compilers have to assume there might be a reader that synchronizes a certain way, even if there isn't in practice). I'd use at least release/acquire for seq, if I was going to take a lock in the reader.
Taking a mutex is an acquire operation, same as a std::memory_order_acquire load on the mutex object. A relaxed increment inside a critical section can't become visible to other threads until after the writer has taken the lock.
But in the reader, with if( xyz != seq.load(relaxed) ) { take_lock; ... }, the load is not guaranteed to "happen before" taking the lock. In practice on many ISAs it will, especially x86 where all atomic RMWs are full memory barriers. But in ISO C++, and maybe some real implementations, it's possible for the relaxed load to reorder into the reader's critical section. Of course, ISO C++ doesn't define things in terms of "reordering", only in terms of syncing with and values loads are allowed to see.
(This reordering may not be fully plausible; it would mean the read side would have to actually take the lock based on branch prediction / speculation on the load result. Maybe with lock elision like x86 did with transactional memory, except without x86's strong memory ordering?)
Anyway, it's pretty hairly to reason about, and release / acquire ops are quite cheap on most CPUs. If you expected it to be expensive, and for the check to often be false, you could check again with an acquire load, or put an acquire fence inside the if so it doesn't happen on the no-new-work path.
Use a Seq Lock
Your problem is better solved by using your sequence counter as part of a Seq Lock, so neither reader nor writer needs a mutex. (Summary: increment before writing, then touch the payload, then increment again. In the reader, read i, j, and k into local temporaries, then check the sequence number again to make sure it's the same, and an even number. With appropriate memory barriers.
See the wikipedia article and/or link below for actual details, but the real change from what you have now is that the sequence number has to increment by 2. If you can't handle that, use a separate counter for the actual lock, with seq as part of the payload.)
If you don't want to use a mutex in the reader, using one in the writer only helps in terms of implementation-detail side-effects, like making sure stores to memory actually happen, not keeping i in a register across calls if do_work inlines into some caller.
BTW, updating seq doesn't need to be an atomic RMW if there's only one writer. You can relaxed load and separately store an incremented temporary (with release semantics).
A Seq Lock is good for cheap reads and occasional writes that make the reader retry. Implementing 64 bit atomic counter with 32 bit atomics shows appropriate fencing.
It relies on non-atomic reads that may have a data race, but not using the result if your sequence counter detects tearing. C++ doesn't define the behaviour in that case, but it works in practice on real implementations. (C++ is mostly keeping its options open in case of hardware race detection, which normal CPUs don't do.)
If you have multiple writers, you'd still use a normal lock to give mutual exclusion between them. or use the sequence counter as a spinlock, as a writer acquires it by making the count odd. Otherwise you just need the sequence counter.
Your global g_s is just to track the latest sequence number the reader has seen? Storing it next to the data defeats some of the purpose/benefit, since it means the reader is writing the same cache line as the writer, assuming that variables declared near each other all end up together. Consider making it static inside the function, or separate it with other stuff, or with padding, like alignas(64) or 128. (That wouldn't guarantee that a compiler doesn't put it right before the other vars, though; a struct would let you control the layout of all of them. With enough alignment, you can make sure they're not in the same aligned pair of cache lines.)

Even ignoring the staleness, this is causes a data race and UB.
Thread 2 can read i,j,k while thread 1 is modifying them, you don't synchronize the access to those variables. If thread 2 doesn't respect the g, there's no point in locking it in thread 1.

Yes, it can.
First of all, the lock guard does not have any effect on your code. A lock has to be used by at least two threads to have any effect.
Thread 2 can read at any moment. It can read an incremented i and not incremented j and k. In theory, it can even read a weird partial value obtained by reading in between updating the various bytes that compose i - for example incrementing from 0xFF to 0x100 results reading 0x1FF or 0x0 - but not on x86 where these updates happen to be atomic.

memory model, how load acquire semantic actually works?

From very nice Paper and article about memory reordering.
Q1: I understand that cache-coherence, store buffer and invalidation queue is root cause of memory reordering ?
Store release is quite understandable, have to wait for all load and store are completed before set flag to true.
About load acquire, typical use of atomic load is waiting for a flag. Suppose we have 2 threads:
int x = 0;
std::atomic<bool> ready_flag = false;
// thread-1
if(ready_flag.load(std::memory_order_relaxed))
{
// (1)
// load x here
}
// (2)
// load x here
// thread-2
x = 100;
ready_flag.store(true, std::memory_order_release);
EDIT: in thread-1, it should be a while loop, but I copied the logic from article above. So, assume memory-reorder is occurred just in time.
Q2: Because (1) and (2) depends on if condition, CPU have to wait for ready_flag, does it mean write-release is enough ? How memory-reordering can happens with this context ?
Q3: Obviously we have load-acquire, so I guess mem-reorder is possible, then where should we place the fence, (1) or (2) ?

Accessing an atomic variable is not a mutex operation; it merely accesses the stored value atomically, with no chance for any CPU operation to interrupt the access such that no data races can occur with regard to accessing that value (it can also issue barriers with regard to other accesses, which is what the memory orders provide). But that's it; it doesn't wait for any particular value to appear in the atomic variable.
As such, your if statement will read whatever value happens to be there at the time. If you want to guard access to x until the other statement has written to it and signaled the atomic, you must:
Not allow any code to read from x until the atomic flag has returned the value true. Simply testing the value once won't do that; you must loop over repeated accesses until it is true. Any other attempt to read from x results in a data race and is therefore undefined behavior.
Whenever you access the flag, you must do so in a way that tells the system that values written by the thread setting that flag should be visible to subsequent operations that see the set value. That requires a proper memory order, one which must be at least memory_order_acquire.
To be technical, the read from the flag itself doesn't have to do the acquire. You could perform an acquire operation after having read the proper value from the flag. But you need to have an acquire-equivalent operation happen before reading x.
The writing statement must set the flag using a releasing memory order that must be at least as powerful as memory_order_release.

Because (1) and (2) depends on if condition, CPU have to wait for ready_flag
There are 2 showstopper flaws in that reasoning:
Branch prediction + speculative execution is a real thing in real CPUs. Control dependencies behave differently from data dependencies. Speculative execution breaks control dependencies.
In most (but not all) real CPUs, data dependencies do work like C++ memory_order_consume. A typical use-case is loading a pointer and then dereferencing it. That's still not safe in C++'s very weak memory model, but will happen to compile to asm that works for most ISAs other than DEC Alpha. Alpha can (in practice on some hardware) even manage to violate causality and load a stale value when dereferencing a just-loaded pointer, even if the stores were correctly ordered.
Compilers can break control and even data dependencies. C++ source logic doesn't always translate directly to asm. In this case a compiler could emit asm that works like this:
tmp = load(x); // compile time reordering before the relaxed load
if (load(ready_flag)
actually use tmp;
It's data-race UB in C++ to read x while it might still be being written, but for most specific ISAs there's no problem with that. You just have to avoid actually using any load results that might be bogus.
This might not be a useful optimization for most ISAs but nothing rules it out. Hiding load latency on in-order pipelines by doing the load earlier might actually be useful sometimes, (if it wasn't being written by another thread, and the compiler might guess that wasn't happening because there's no acquire load).
By far your best bet is to use ready_flag.load(mo_acquire).
A separate problem is that you have commented out code that reads x after the if(), which will run even if the load didn't see the data ready. As #Nicol explained in an answer, this means data-race UB is possible because you might be reading x while the producer is writing it.
Perhaps you wanted to write a spin-wait loop like while(!ready_flag){ _mm_pause(); }? Generally be careful of wasting huge amounts of CPU time spinning; if it might be a long time, use a library-supported thing like maybe a condition variable that gives you efficient fallback to OS-supported sleep/wakeup (e.g. Linux futex) after spinning for a short time.
If you did want a manual barrier separate from the load, it would be
if (ready_flag.load(mo_relaxed))
atomic_thread_fence(mo_acquire);
int tmp = x; // now this is safe
}
// atomic_thread_fence(mo_acquire); // still wouldn't make it safe to read x
// because this code runs even after ready_flag == false
Using if(ready_flag.load(mo_acquire)) would lead to an unconditional fence before branching on the ready_flag load, when compiling for any ISA where acquire-load wasn't available with a single instruction. (On x86 all loads are acquire, on AArch64 ldar does an acquire load. ARM needs load + dsb ish)

The C++ standard doesn't specify the code generated by any particular construct; only correct combinations of thread communication tools product a guaranteed result.
You don't get guarantees from the CPU in C++ because C++ is not a kind of (macro) assembly, not even a "high level assembly", at least not when not all objects have a volatile type.
Atomic objects are communication tools to exchange data between threads. The correct use, for correct visibility of memory operations, is either a store operation with (at least) release followed by a load with acquire, the same with RMW in between, either the store (resp. the load) replaced by RMW with (at least) a release (resp. acquire), on any variant with a relaxed operation and a separate fence.
In all cases:
the thread "publishing" the "done" flag must use a memory ordering at least release (that is: release, release+acquire or sequential consistency),
and the "subscribing" thread, the one acting on the flag must use at least acquire (that is: acquire, release+acquire or sequential consistency).
In practice with separately compiled code other modes might work, depending on the CPU.

Do we have the guarantee that any atomic write will store the new value of the atomic variable in the main memory immediately?

So, I was reading a lot about instruction and memory reordering and how we can prevent it, but i still have no answer to one qustion (probably because I'm not attentive enough). My question is: Do we have the guarantee that any atomic write will store the new value of the atomic variable in the main memory immediately?
Let's take a look at a small example:
std::atomic<bool> x;
std::atomic<bool> y;
std::atomic<int> count;
void WritingValues()
{
x.store(true, std::memory_order_relaxed);
y.store(true, std::memory_order_relaxed);
}
void ReadValues()
{
while( !y.load(std::memory_order_relaxed) );
if( x.load(std::memory_order_relaxed) )
++count;
}
int main()
{
x = false;
y = false;
count = 0;
std::thread tA(WritingValues);
std::thread tB(ReadValues);
tA.join();
tB.join();
assert( count.load() != 0 );
}
So, here our assert can definitely fire, as we use std::memory_order_relaxed and do not prevent any instruction reordering (or memory reordering at compile time, i suppose that is the same thing). But if we place some compiler barrier in WritingValues to prevent instruction reordering, will everything be OK? I mean, does x.store(true, std::memory_order_relaxed) guarantees, that the write of that particular atomic variable will be directly into the memory, without any latency? Or does x.load(std::memory_order_relaxed) guarantess, that the value would be readed from the memory, not the cache with invalid value? In other words, this store guarantees only atomicity of the operation and have the same memory behaviour as usual non-atomic variable, or it also has influence on memory behaviour?

I mean, does x.store(true, std::memory_order_relaxed) guarantees, that the
of that particular atomic variable will be directly into the memory,
without any latency?
No it doesn't and in fact given bool and memory order relaxed there is no 'invalid' value if you read it only once, both true and false are ok.
Since relaxed memory order explicitly stays that no ordering is performed. Basically in your case it only means that after flipping from false to true, at some point it will become true for all other processes, but doesn't state 'whent' it will happen. So the only thing you can be sure here is that it won't become false again after becoming true. But there is no constraints on how long it will be false in another thread.
Also it guarantees that you won't see any partially written variable in another thread, but that's hardly the case for the bools.
You need to use aquire and release here. And even that won't give any guarantees about actual memory itself, only about program behavior, cache synchronization may do the trick even without bouncing data back and froth to the memory.

Since all of the load and store instructions are atomic they each a single machine instruction so the two threads never "interrupt each other" in the "middle" of a load or store instruction.
The title of your question is "Do we have the guarantee that any atomic write will store the new value of the atomic variable in the main memory immediately?". But the very definition of an atomic instruction is that it cannot be interrupted by a context switch, hardware interrupt, software expection -- nothing!
std::memory_order_relaxed allows for some reordering of instructions in a single function. See for example this question. It is almost the same as your question but you have memory_order_relaxed in ReadValues() instead of memory_order_acquire. In this function it is possible that the spinlock on variable y is placed after the counter increment due to the relaxed condition (compiler reordering etc.). In any case the ASSERT may fail because y may be set to true before x is in WriteValues() due to the memory reordering allowed by memory_order_relaxed (referencing the answer in the similar question.

relaxed ordering and inter thread visibility

I learnt from relaxed ordering as a signal that a store on an atomic variable should be visible to other thread in a "within a reasonnable amount of time".
That say, I am pretty sure it should happen in a very short time (some nano second ?).
However, I don't want to rely on "within a reasonnable amount of time".
So, here is some code :
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_relaxed))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_relaxed);
}
Thread A and B are within a kind of ThreadPool, so there is no creation of thread or whatsoever in this problem.
I don't need to protect any data, so acquire / consume / release ordering on atomic store/load are not needed here (I think?).
We know for sure that the functionThatWillBeLaunchedInThreadAfunction will be launched AFTER the end of the functionThatWillBeLaunchedInThreadB.
However, in such a code, we don't have any guarantee that the store will be visible in the thread A, so the thread A can read a stale value (false).
Here are some solution I think about.
Solution 1 : Use volatility
Just declare volatile std::atomic_bool canBegin{false}; Here the volatileness guarantee us that we will not see stale value.
Solution 2 : Use mutex or spinlock
Here the idea is to protect the canBegin access via a mutex / spinlock that guarantee via a release/acquire ordering that we will not see a stale value.
I don't need canGo to be an atomic either.
Solution 3 : not sure at all, but memory fence?
Maybe this code will not work, so, tell me :).
bool canGo{false}; // not an atomic value now
// in thread A
std::atomic_thread_fence(std::memory_order_acquire);
if(canGo) produceData();
// in thread B
canGo = true;
std::atomic_thread_fence(std::memory_order_release);
On cpp reference, for this case, it is write that :
all non-atomic and relaxed atomic stores that are sequenced-before FB
in thread B will happen-before all non-atomic and relaxed atomic loads
from the same locations made in thread A after FA
Which solution would you use and why?

There's nothing you can do to make a store visible to other threads any sooner. See If I don't use fences, how long could it take a core to see another core's writes? - barriers don't speed up visibility to other cores, they just make this core wait until that's happened.
The store part of an RMW is not different from a pure store for this, either.
(Certainly on x86; not totally sure about other ISAs, where a relaxed LL/SC might possibly get special treatment from the store buffer, possibly being more likely to commit before other stores if this core can get exclusive ownership of the cache line. But I think it still would have to retire from out-of-order execution so the core knows it's not speculative.)
Anthony's answer that was linked in comment is misleading; as I commented there:
If the RMW runs before the other thread's store commits to cache, it doesn't see the value, just like if it was a pure load. Does that mean "stale"? No, it just means that the store hasn't happened yet.
The only reason RMWs need a guarantee about "latest" value is that they're inherently serializing operations on that memory location. This is what you need if you want 100 unsynchronized fetch_add operations to not step on each other and be equivalent to += 100, but otherwise best-effort / latest-available value is fine, and that's what you get from a normal atomic load.
If you require instant visibility of results (a nanosecond or so), that's only possible within a single thread, like x = y; x += z;
Also note, the C / C++ standard requirement (actually just a note) to make stores visible in a reasonable amount of time is in addition to the requirements on ordering of operations. It doesn't mean seq_cst store visibility can be delayed until after later loads. All seq_cst operations happen in some interleaving of program order across all threads.
On real-world C++ implementations, the visibility time is entirely up to hardware inter-core latency. But the C++ standard is abstract, and could in theory be implemented on a CPU that required manual flushing to make stores visible to other threads. Then it would be up to the compiler to not be lazy and defer that for "too long".
volatile atomic<T> is useless; compilers already don't optimize atomic<T>, so every atomic access done by the abstract machine will already happen in the asm. (Why don't compilers merge redundant std::atomic writes?). That's all that volatile does, so volatile atomic<T> compiles to the same asm as atomic<T> for anything you can with the atomic.
Defining "stale" is a problem because separate threads running on separate cores can't see each other's actions instantly. It takes tens of nanoseconds on modern hardware to see a store from another thread.
But you can't read "stale" values from cache; that's impossible because real CPUs have coherent caches. (That's why volatile int could be used to roll your own atomics before C++11, but is no longer useful.) You may need an ordering stronger than relaxed to get the semantics you want as far as one value being older than another (i.e. "reordering", not "stale"). But for a single value, if you don't see a store, that means your load executed before the other core took exclusive ownership of the cache line in order to commit its store. i.e. that the store hasn't truly happened yet.
In the formal ISO C++ rules, there are guarantees about what value you're allowed to see which effectively give you the guarantees you'd expect from cache coherency for a single object, like that after a reader sees a store, further loads in this thread won't see some older store and then eventually back to the newest store. (https://eel.is/c++draft/intro.multithread#intro.races-19).
(Note for 2 writers + 2 readers with non-seq_cst operations, it's possible for the readers to disagree about the order in which the stores happened. This is called IRIW reordering, but most hardware can't do it; only some PowerPC. Will two atomic writes to different locations in different threads always be seen in the same order by other threads? - so it's not always quite as simple as "the store hasn't happened yet", it be visible to some threads before others. But it's still true that you can't speed up visibility, only for example slow down the readers so none of them see it via the "early" mechanism, i.e. with hwsync for the PowerPC loads to drain the store buffer first.)

We know for sure that the functionThatWillBeLaunchedInThreadAfunction
will be launched AFTER the end of the
functionThatWillBeLaunchedInThreadB.
First of all, if this is the case then it's likely that your task queue mechanism takes care of the necessary synchronization already.
On to the answer...
By far the simplest thing to do is acquire/release ordering. All the solutions you gave are worse.
std::atomic_bool canBegin{false};
void functionThatWillBeLaunchedInThreadA() {
if(canBegin.load(std::memory_order_acquire))
produceData();
}
void functionThatWillBeLaunchedInThreadB() {
canBegin.store(true, std::memory_order_release);
}
By the way, shouldn't this be a while loop?
void functionThatWillBeLaunchedInThreadA() {
while (!canBegin.load(std::memory_order_acquire))
{ }
produceData();
}
I don't need to protect any data, so acquire / consume / release
ordering on atomic store/load are not needed here (I think?)
In this case, the ordering is required to keep the compiler/CPU/memory subsystem from ordering the canBegin store true before the previous reads/writes have completed. And it should actually stall the CPU until it can be guaranteed that every write that comes before in program order will propagate before the store to canBegin. On the load side it prevents memory from being read/written before canBegin is read as true.
However, in such a code, we don't have any guarantee that the store
will be visible in the thread A, so the thread A can read a stale
value (false).
You said yourself:
a store on an atomic variable should be visible to other thread in a
"within a reasonnable amount of time".
Even with relaxed memory order, a write is guaranteed to eventually reach the other cores and all cores will eventually agree on any given variable's store history, so there are no stale values. There are only values that haven't propagated yet. What's "relaxed" about it is the store order in relation to other variables. Thus, memory_order_relaxed solves the stale read problem (but doesn't address the ordering required as discussed above).
Don't use volatile. It doesn't provide all the guarantees required of atomics in the C++ memory model, so using it would be undefined behavior. See https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering at the bottom to read about it.
You could use a mutex or spinlock, but a mutex operation is much more expensive than a lock-free std::atomic acquire-load/release-store. A spinlock will do at least one atomic read-modify-write operation...and possibly many. A mutex is definitely overkill. But both have the benefit of simplicity in the C++ source. Most people know how to use locks so it's easier to demonstrate correctness.
A memory fence will also work but your fences are in the wrong spot (it's counter-intuitive) and the inter-thread communication variable should be std::atomic. (Careful when playing these games...! It's easy to get undefined behavior) Relaxed ordering is ok thanks to the fences.
std::atomic<bool> canGo{false}; // MUST be atomic
// in thread A
if(canGo.load(std::memory_order_relaxed))
{
std::atomic_thread_fence(std::memory_order_acquire);
produceData();
}
// in thread B
std::atomic_thread_fence(std::memory_order_release);
canGo.store(true, memory_order_relaxed);
The memory fences are actually more strict than acquire/release ordering on the std::atomicload/store so this gains nothing and could be more expensive.
It seems like you really want to avoid overhead with your signaling mechanism. This is exactly what the std::atomic acquire/release semantics were invented for! You are worrying too much about stale values. Yes, an atomic RMW will give you the "latest" value, but they're also very expensive operations themselves. I want to give you an idea of how fast acquire/release is. It's most likely that you're targeting x86. x86 has total store order and word-sized loads/stores are atomic, so an load acquire compiles to just a regular load and and a release store compiles to a regular store. So it turns out that almost everything in this long post will probably compile to exactly the same code anyway.

Memory barriers vs. interlocked operations

I am trying to improve my understanding of memory barriers. Suppose we have a weak memory model and we adapt Dekker's algorithm. Is it possible to make it work correctly under the weak memory model by adding memory barriers?
I think the answer is a surprising no. The reason (if I am correct) is that although a memory barrier can be used to ensure that a read is not moved past another, it cannot ensure that a read does not see stale data (such as that in a cache). Thus it could see some time in the past when the critical section was unlocked (per the CPU's cache) but at the current time other processors might see it as locked. If my understanding is correct, one must use interlocked operations such as those commonly called test-and-set or compare-and-swap to ensure synchronized agreement of a value at a memory location among multiple processors.
Thus, can we rightly expect that no weak memory model system would provide only memory barriers? The system must supply operations like test-and-set or compare-and-swap to be useful.
I realize that popular processors, including x86, provide memory models much stronger than a weak memory model. Please focus the discussion on weak memory models.
(If Dekker's algorithm is a poor choice, choose another mutual exclusion algorithm where memory barriers can successfully achieve correct synchronization, if possible.)

You are right that a memory barrier cannot ensure that a read sees up-to-date values. What it does do is enforce an ordering between operations, both on a single thread, and between threads.
For example, if thread A does a series of stores and then executes a release barrier before a final store to a flag location, and thread B reads from the flag location and then executes an acquire barrier before reading the other values then the other variables will have the values stored by thread A:
// initially x=y=z=flag=0
// thread A
x=1;
y=2;
z=3;
release_barrier();
flag=1;
// thread B
while(flag==0) ; // loop until flag is 1
acquire_barrier();
assert(x==1); // asserts will not fire
assert(y==2);
assert(z==3);
Of course, you need to ensure that the load and store to flag is atomic (which simple load and store instructions are on most common CPUs, provided the variables are suitably aligned). Without the while loop on thread B, thread B may well read a stale value (0) for flag, and thus you cannot guarantee any of the values read for the other variables.
Fences can thus be used to enforce synchronization in Dekker's algorithm.
Here's an example implementation in C++ (using C++0x atomic variables):
std::atomic<bool> flag0(false),flag1(false);
std::atomic<int> turn(0);
void p0()
{
flag0.store(true,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
while (flag1.load(std::memory_order_relaxed))
{
if (turn.load(std::memory_order_relaxed) != 0)
{
flag0.store(false,std::memory_order_relaxed);
while (turn.load(std::memory_order_relaxed) != 0)
{
}
flag0.store(true,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
}
}
std::atomic_thread_fence(std::memory_order_acquire);
// critical section
turn.store(1,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
flag0.store(false,std::memory_order_relaxed);
}
void p1()
{
flag1.store(true,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
while (flag0.load(std::memory_order_relaxed))
{
if (turn.load(std::memory_order_relaxed) != 1)
{
flag1.store(false,std::memory_order_relaxed);
while (turn.load(std::memory_order_relaxed) != 1)
{
}
flag1.store(true,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_seq_cst);
}
}
std::atomic_thread_fence(std::memory_order_acquire);
// critical section
turn.store(0,std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
flag1.store(false,std::memory_order_relaxed);
}
For a full analysis see my blog entry at http://www.justsoftwaresolutions.co.uk/threading/implementing_dekkers_algorithm_with_fences.html

Say you put in a load and store barrier after every statement, and in addition you ensured that the compiler didn't reorder your stores. Wouldn't this, on any reasonable architecture, provide strict consistency? Dekker's works on sequentially consistent architectures. Sequential consistency is a weaker condition than strict consistency.
http://www.cs.nmsu.edu/~pfeiffer/classes/573/notes/consistency.html
Even on a CPU that has a weak consistency model, you'd still expect cache coherence. I think that where things get derailed is the behavior of store buffers and speculated reads, and what operations are available flush stored writes and invalidate speculated reads. If there isn't a load fence that can invalidate speculated reads, or there isn't a write fence that flushes a store buffer, in addition to not being able to implement Dekker's, you won't be able to implement a mutex!
So here's my claim. If you have a write barrier available, and a read barrier, and the cache is coherent between CPUs, then you can trivially make all code sequentially consistent by flushing writes (store fence) after every instruction, and flushing speculations (read fence) before every instruction. So I claim that you don't need atomics for what you're talking about, and that you can do what you need with Dekker's only. Sure you wouldn't want to.
BTW, Corensic, the company I work for, writes cool tools for debugging concurrency issues. Check out http://www.corensic.com.

Some barriers (such as the powerpc isync, and a .acq load on ia64) also have an effect on ubsequent loads. ie: if a load was available before the isync due to prefetching it must be discarded. When used appropriately perhaps that's enough to make Dekker's algorithm work on a weak memory model.
You've also got cache invalidation logic working for you. If you know that your load is current due to something like an isync and that the cached version of the data is invalidated if another cpu touches it, is that enough?
Interesting questions aside, Dekker's algorithm is for all practical purposes dumb. You are going to want to use atomic hardware interfaces and memory barriers for any real application, so focusing on how to fix up Dekker's with atomics just doesn't seem worthwhile to me;)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js