Understanding std::atomic::compare_exchange_weak() in C++11 - c++

bool compare_exchange_weak (T& expected, T val, ..);
compare_exchange_weak() is one of compare-exchange primitives provided in C++11. It's weak in the sense that it returns false even if the value of the object is equal to expected. This is due to spurious failure on some platforms where a sequence of instructions (instead of one as on x86) are used to implement it. On such platforms, context switch, reloading of the same address (or cache line) by another thread, etc can fail the primitive. It's spurious as it's not the value of the object (not equal to expected) that fails the operation. Instead, it's kind of timing issues.
But what puzzles me is what's said in C++11 Standard (ISO/IEC 14882),
29.6.5
..
A consequence of spurious failure is that nearly all uses of weak
compare-and-exchange will be in a loop.
Why does it have to be in a loop in nearly all uses ? Does that mean we shall loop when it fails because of spurious failures? If that's the case, why do we bother use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong() which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
Another question related. In his book "C++ Concurrency In Action" Anthony says,
//Because compare_exchange_weak() can fail spuriously, it must typically
//be used in a loop:
bool expected=false;
extern atomic<bool> b; // set somewhere else
while(!b.compare_exchange_weak(expected,true) && !expected);
//In this case, you keep looping as long as expected is still false,
//indicating that the compare_exchange_weak() call failed spuriously.
Why is !expected there in the loop condition? Does it there to prevent that all threads may starve and make no progress for some time?
One last question
On platforms that no single hardware CAS instruction exist, both the weak and strong version are implemented using LL/SC (like ARM, PowerPC, etc). So is there any difference between the following two loops? Why, if any? (To me, they should have similar performance.)
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_weak(..))
{ .. }
// use LL/SC (or CAS on x86) and ignore/loop on spurious failures
while (!compare_exchange_strong(..))
{ .. }
I come up w/ this last question you guys all mention that there maybe a performance difference inside a loop. It's also mentioned by the C++11 Standard (ISO/IEC 14882):
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?

Why doing exchange in a loop?
Usually, you want your work to be done before you move on, thus, you put compare_exchange_weak into a loop so that it tries to exchange until it succeeds (i.e., returns true).
Note that also compare_exchange_strong is often used in a loop. It does not fail due to spurious failure, but it does fail due to concurrent writes.
Why to use weak instead of strong?
Quite easy: Spurious failure does not happen often, so it is no big performance hit. In constrast, tolerating such a failure allows for a much more efficient implementation of the weak version (in comparison to strong) on some platforms: strong must always check for spurious failure and mask it. This is expensive.
Thus, weak is used because it is a lot faster than strong on some platforms
When should you use weak and when strong?
The reference states hints when to use weak and when to use strong:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms. When a weak compare-and-exchange
would require a loop and a strong one would not, the strong one is
preferable.
So the answer seems to be quite simple to remember: If you would have to introduce a loop only because of spurious failure, don't do it; use strong. If you have a loop anyway, then use weak.
Why is !expected in the example
It depends on the situation and its desired semantics, but usually it is not needed for correctness. Omitting it would yield a very similar semantics. Only in a case where another thread might reset the value to false, the semantics could become slightly different (yet I cannot find a meaningful example where you would want that). See Tony D.'s comment for a detailed explanation.
It is simply a fast track when another thread writes true: Then the we abort instead of trying to write true again.
About your last question
But as analyzed above, two versions in a loop should give the same/similar performance.
What's the thing I miss?
From Wikipedia:
Real implementations of LL/SC do not always succeed if there are no
concurrent updates to the memory location in question. Any exceptional
events between the two operations, such as a context switch, another
load-link, or even (on many platforms) another load or store
operation, will cause the store-conditional to spuriously fail. Older
implementations will fail if there are any updates broadcast over the
memory bus.
So, LL/SC will fail spuriously on context switch, for example. Now, the strong version would bring its "own small loop" to detect that spurious failure and mask it by trying again. Note that this own loop is also more complicated than a usual CAS loop, since it must distinguish between spurious failure (and mask it) and failure due to concurrent access (which results in a return with value false). The weak version does not have such own loop.
Since you provide an explicit loop in both examples, it is simply not necessary to have the small loop for the strong version. Consequently, in the example with the strong version, the check for failure is done twice; once by compare_exchange_strong (which is more complicated since it must distinguish spurious failure and concurrent acces) and once by your loop. This expensive check is unnecessary and the reason why weak will be faster here.
Also note that your argument (LL/SC) is just one possibility to implement this. There are more platforms that have even different instruction sets. In addition (and more importantly), note that std::atomic must support all operations for all possible data types, so even if you declare a ten million byte struct, you can use compare_exchange on this. Even when on a CPU that does have CAS, you cannot CAS ten million bytes, so the compiler will generate other instructions (probably lock acquire, followed by a non-atomic compare and swap, followed by a lock release). Now, think of how many things can happen while swapping ten million bytes. So while a spurious error may be very rare for 8 byte exchanges, it might be more common in this case.
So in a nutshell, C++ gives you two semantics, a "best effort" one (weak) and a "I will do it for sure, no matter how many bad things might happen inbetween" one (strong). How these are implemented on various data types and platforms is a totally different topic. Don't tie your mental model to the implementation on your specific platform; the standard library is designed to work with more architectures than you might be aware of. The only general conclusion we can draw is that guaranteeing success is usually more difficult (and thus may require additional work) than just trying and leaving room for possible failure.

I'm trying to answer this myself, after going through various online resources (e.g., this one and this one), the C++11 Standard, as well as the answers given here.
The related questions are merged (e.g., "why !expected ?" is merged with "why put compare_exchange_weak() in a loop ?") and answers are given accordingly.
Why does compare_exchange_weak() have to be in a loop in nearly all uses?
Typical Pattern A
You need achieve an atomic update based on the value in the atomic variable. A failure indicates that the variable is not updated with our desired value and we want to retry it. Note that we don't really care about whether it fails due to concurrent write or spurious failure. But we do care that it is us that make this change.
expected = current.load();
do desired = function(expected);
while (!current.compare_exchange_weak(expected, desired));
A real-world example is for several threads to add an element to a singly linked list concurrently. Each thread first loads the head pointer, allocates a new node and appends the head to this new node. Finally, it tries to swap the new node with the head.
Another example is to implement mutex using std::atomic<bool>. At most one thread can enter the critical section at a time, depending on which thread first set current to true and exit the loop.
Typical Pattern B
This is actually the pattern mentioned in Anthony's book. In contrary to pattern A, you want the atomic variable to be updated once, but you don't care who does it. As long as it's not updated, you try it again. This is typically used with boolean variables. E.g., you need implement a trigger for a state machine to move on. Which thread pulls the trigger is regardless.
expected = false;
// !expected: if expected is set to true by another thread, it's done!
// Otherwise, it fails spuriously and we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
Note that we generally cannot use this pattern to implement a mutex. Otherwise, multiple threads may be inside the critical section at the same time.
That said, it should be rare to use compare_exchange_weak() outside a loop. On the contrary, there are cases that the strong version is in use. E.g.,
bool criticalSection_tryEnter(lock)
{
bool flag = false;
return lock.compare_exchange_strong(flag, true);
}
compare_exchange_weak is not proper here because when it returns due to spurious failure, it's likely that no one occupies the critical section yet.
Starving Thread?
One point worth mentioning is that what happens if spurious failures continue to happen thus starving the thread? Theoretically it could happen on platforms when compare_exchange_XXX() is implement as a sequence of instructions (e.g., LL/SC). Frequent access of the same cache line between LL and SC will produce continuous spurious failures. A more realistic example is due to a dumb scheduling where all concurrent threads are interleaved in the following way.
Time
| thread 1 (LL)
| thread 2 (LL)
| thread 1 (compare, SC), fails spuriously due to thread 2's LL
| thread 1 (LL)
| thread 2 (compare, SC), fails spuriously due to thread 1's LL
| thread 2 (LL)
v ..
Can it happen?
It won't happen forever, fortunately, thanks to what C++11 requires:
Implementations should ensure that weak compare-and-exchange
operations do not consistently return false unless either the atomic
object has value different from expected or there are concurrent
modifications to the atomic object.
Why do we bother use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong().
It depends.
Case 1: When both need to be used inside a loop. C++11 says:
When a compare-and-exchange is in a loop, the weak version will yield
better performance on some platforms.
On x86 (at least currently. Maybe it'll resort to a similiar scheme as LL/SC one day for performance when more cores are introduced), the weak and strong version are essentially the same because they both boil down to the single instruction cmpxchg. On some other platforms where compare_exchange_XXX() isn't implemented atomically (here meaning no single hardware primitive exists), the weak version inside the loop may win the battle because the strong one will have to handle the spurious failures and retry accordingly.
But,
rarely, we may prefer compare_exchange_strong() over compare_exchange_weak() even in a loop. E.g., when there is a lot of things to do between atomic variable is loaded and a calculated new value is exchanged out (see function() above). If the atomic variable itself doesn't change frequently, we don't need repeat the costly calculation for every spurious failure. Instead, we may hope that compare_exchange_strong() "absorb" such failures and we only repeat calculation when it fails due to a real value change.
Case 2: When only compare_exchange_weak() need to be used inside a loop. C++11 also says:
When a weak compare-and-exchange would require a loop and a strong one
would not, the strong one is preferable.
This is typically the case when you loop just to eliminate spurious failures from the weak version. You retry until exchange is either successful or failed because of concurrent write.
expected = false;
// !expected: if it fails spuriously, we should try again.
while (!current.compare_exchange_weak(expected, true) && !expected);
At best, it's reinventing the wheels and perform the same as compare_exchange_strong(). Worse? This approach fails to take full advantage of machines that provide non-spurious compare-and-exchange in hardware.
Last, if you loop for other things (e.g., see "Typical Pattern A" above), then there is a good chance that compare_exchange_strong() shall also be put in a loop, which brings us back to the previous case.

Why does it have to be in a loop in nearly all uses ?
Because if you don't loop and it fails spuriously your program hasn't done anything useful - you didn't update the atomic object and you don't know what its current value is (Correction: see comment below from Cameron). If the call doesn't do anything useful what's the point of doing it?
Does that mean we shall loop when it fails because of spurious failures?
Yes.
If that's the case, why do we bother use compare_exchange_weak() and write the loop ourselves? We can just use compare_exchange_strong() which I think should get rid of spurious failures for us. What are the common use cases of compare_exchange_weak()?
On some architectures compare_exchange_weak is more efficient, and spurious failures should be fairly uncommon, so it might be possible to write more efficient algorithms using the weak form and a loop.
In general it is probably better to use the strong version instead if your algorithm doesn't need to loop, as you don't need to worry about spurious failures. If it needs to loop anyway even for the strong version (and many algorithms do need to loop anyway), then using the weak form might be more efficient on some platforms.
Why is !expected there in the loop condition?
The value could have got set to true by another thread, so you don't want to keep looping trying to set it.
Edit:
But as analyzed above, two versions in a loop should give the same/similar performance. What's the thing I miss?
Surely it's obvious that on platforms where spurious failure is possible the implementation of compare_exchange_strong has to be more complicated, to check for spurious failure and retry.
The weak form just returns on spurious failure, it doesn't retry.

Alright, so I need a function which performs atomic left-shifting. My processor doesn't have a native operation for this, and the standard library doesn't have a function for it, so it looks like I'm writing my own. Here goes:
void atomicLeftShift(std::atomic<int>* var, int shiftBy)
{
do {
int oldVal = std::atomic_load(var);
int newVal = oldVal << shiftBy;
} while(!std::compare_exchange_weak(oldVal, newVal));
}
Now, there's two reasons that loop might be executed more than once.
Someone else changed the variable while I was doing my left shift. The results of my computation should not be applied to the atomic variable, because it would effectively erase that someone else's write.
My CPU burped and the weak CAS spuriously failed.
I honestly don't care which one. Left shifting is fast enough that I may as well just do it again, even if the failure was spurious.
What's less fast, though, is the extra code that strong CAS needs to wrap around weak CAS in order to be strong. That code doesn't do much when the weak CAS succeeds... but when it fails, strong CAS needs to do some detective work to determine whether it was Case 1 or Case 2. That detective work takes the form of a second loop, effectively inside my own loop. Two nested loops. Imagine your algorithms teacher glaring at you right now.
And as I previously mentioned, I don't care about the result of that detective work! Either way I'm going to be redoing the CAS. So using strong CAS gains me precisely nothing, and loses me a small but measurable amount of efficiency.
In other words, weak CAS is used to implement atomic update operations. Strong CAS is used when you care about the result of CAS.

I think most of the answers above address "spurious failure" as some kind of problem, performance VS correctness tradeoff.
It can be seen as the weak version is faster most of the times, but in case of spurious failure, it becomes slower. And the strong version is a version that has no possibility of spurious failure, but it is almost always slower.
For me, the main difference is how these two version handle the ABA problem:
weak version will succeed only if noone has touched the cache line between load and store, so it will 100% detect ABA problem.
strong version will fail only if the comparison fails, so it will not detect ABA problem without extra measures.
So, in theory, if you use weak version on weak-ordered architecture, you don't need ABA detection mechanism and the implementation will be much simpler, giving better performance.
But, on x86 (strong-ordered architecture), weak version and strong version are the same, and they both suffer from ABA problem.
So if you write a completely cross-platform algorithm, you need to address ABA problem anyway, so there is no performance benefit from using the weak version, but there is a performance penalty for handling spurious failures.
In conclusion - for portability and performance reasons, the strong version is always a better-or-equal option.
Weak version can only be a better option if it lets you skip ABA countermeasures completely or your algorithm doesn't care about ABA.

Related

Can/should non-lock-free atomics be implemented with a SeqLock?

In both MSVC STL and LLVM libc++ implementations std::atomic for non-atomic size is implemented using a spin lock.
libc++ (Github):
_LIBCPP_INLINE_VISIBILITY void __lock() const volatile {
while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
/*spin*/;
}
_LIBCPP_INLINE_VISIBILITY void __lock() const {
while(1 == __cxx_atomic_exchange(&__a_lock, _LIBCPP_ATOMIC_FLAG_TYPE(true), memory_order_acquire))
/*spin*/;
}
MSVC (Github) (recently discussed in this Q&A):
inline void _Atomic_lock_acquire(long& _Spinlock) noexcept {
#if defined(_M_IX86) || (defined(_M_X64) && !defined(_M_ARM64EC))
// Algorithm from Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, May 2020
// Example 2-4. Contended Locks with Increasing Back-off Example - Improved Version, page 2-22
// The code in mentioned manual is covered by the 0BSD license.
int _Current_backoff = 1;
const int _Max_backoff = 64;
while (_InterlockedExchange(&_Spinlock, 1) != 0) {
while (__iso_volatile_load32(&reinterpret_cast<int&>(_Spinlock)) != 0) {
for (int _Count_down = _Current_backoff; _Count_down != 0; --_Count_down) {
_mm_pause();
}
_Current_backoff = _Current_backoff < _Max_backoff ? _Current_backoff << 1 : _Max_backoff;
}
}
#elif
/* ... */
#endif
}
While thinking of a better possible implementation, I wonder if it is feasible to replace this with SeqLock? Advantage would be cheap reads if reads don't contend with writes.
Another thing I'm questioning is if SeqLock can be improved to use OS wait. It appears to me that if reader observes an odd count, it can wait with atomic wait underlying mechanism (Linux futex/Windows WaitOnAddress), thus avoiding the starvation problem of spinlock.
To me it looks like possible. Though C++ memory model doesn't cover Seqlock currently, types in std::atomic must be trivially copyable, so memcpy reads/writes in seqlock will work and will deal with races if sufficient barriers are used to get a volatile-equivalent without defeating optimizations too badly. This will be part of a specific C++ implementation's header files so it doesn't have to be portable.
Existing SO Q&As about implement a SeqLock in C++ (perhaps using other std::atomic operations)
Implementing 64 bit atomic counter with 32 bit atomics
how to implement a seqlock lock using c++11 atomic library
Yes, you can use a SeqLock as a readers/writers lock if you provide mutual exclusion between writers. You'd still get read-side scalability, while writes and RMWs would stay about the same.
It's not a bad idea, although it has potential fairness problems for readers if you have very frequent writes. Maybe not a good idea for a mainstream standard library, at least not without some testing with some different workloads / use-cases on a range of hardware, since working great on some machines but faceplanting on others is not what you want for standard library stuff. (Code that wants great performance for its special case often unfortunately has to use an implementation that's tuned for it, not the standard one.)
Mutual exclusion is possible with a separate spinlock, or just using the low bit of the sequence number. In fact I've seen other descriptions of a SeqLock that assumed you'd be using it with multiple writers, and didn't even mention the single-writer case that allows pure-load and pure-store for the sequence number to avoid the cost of an atomic RMW.
How to use the sequence number as a spinlock
A writer or RMWer attempts to atomically CAS the sequence number to increment (if it wasn't already odd). If the sequence number is already odd, writers just spin until they see an even value.
This would mean writers have to start by reading the sequence number before trying to write, which can cause extra coherency traffic (MESI Share request, then RFO). On a machine that actually had a fetch_or in hardware, you could use that to atomically make the count odd and see if you won the race to take it from even to odd.
On x86-64, you can use lock bts to set the low bit and find out what the old low bit was, then load the whole sequence number if it was previously even (because you won the race, no other writer is going to be modifying it). So you can do a release-store of that plus 1 to "unlock" instead of needing a lock add.
Making other writers faster at reclaiming the lock may actually be a bad thing, though: you want to give a window for readers to complete. Maybe just use multiple pause instructions (or equivalent on non-x86) in write-side spin loops, more than in read-side spins. If contention is low, readers probably had time to see it before writers got to it, otherwise writers will frequently see it locked and go into the slower spin loop. Maybe with faster-increasing backoff for writers, too.
An LL/SC machine could (in asm at least) test-and-increment just as easily as CAS or TAS. I don't know how to write pure C++ that would compile to just that. fetch_or could compile efficiently for LL/SC, but still to a store even if it was already odd. (If you have to LL separately from SC, you might as well make the most of it and not store at all if it will be useless, and hope that the hardware is designed to make the best of things.)
(It's critical to not unconditionally increment; you must not unlock another writer's ownership of the lock. But an atomic-RMW that leaves the value unchanged is always ok for correctness, if not performance.)
It may not be a good idea by default because of bad results with heavy write activity making it potentially hard for a reader to get a successful read done. As Wikipedia points out:
The reader never blocks, but it may have to retry if a write is in progress; this speeds up the readers in the case where the data was not modified, since they do not have to acquire the lock as they would with a traditional read–write lock. Also, writers do not wait for readers, whereas with traditional read–write locks they do, leading to potential resource starvation in a situation where there are a number of readers (because the writer must wait for there to be no readers). Because of these two factors, seqlocks are more efficient than traditional read–write locks for the situation where there are many readers and few writers. The drawback is that if there is too much write activity or the reader is too slow, they might livelock (and the readers may starve).
The "too slow reader" problem is unlikely, just a small memcpy. Code shouldn't expect good results from std::atomic<T> for very large T; the general assumption is that you'd only bother with std::atomic for a T that can be lock-free on some implementations. (Usually not including transactional memory since mainstream implementations don't do that.)
But the "too much write" problem could still be real: SeqLock is best for read-mostly data. Readers may have a bad time with a heavy write mix, retrying even more than with a simple spinlock or a readers-writers lock.
It would be nice if there was a way to make this an option for an implementation, like an optional template parameter such as std::atomic<T, true>, or a #pragma, or #define before including <atomic>. Or a command-line options.
An optional template param affects every use of the type, but might be slightly less clunky than a separate class name like gnu::atomic_seqlock<T>. An optional template param would still make std::atomic types be that class name, so e.g. matching specializations of other things for std::atomic. But might break other things, IDK.
Might be fun to hack something up to experiment with.

Is it possible that a store with memory_order_relaxed never reaches other threads?

Suppose I have a thread A that writes to an atomic_int x = 0;, using x.store(1, std::memory_order_relaxed);. Without any other synchronization methods, how long would it take before other threads can see this, using x.load(std::memory_order_relaxed);? Is it possible that the value written to x stays entirely thread-local given the current definition of the C/C++ memory model that the standard gives?
The practical case that I have at hand is where a thread B reads an atomic_bool frequently to check if it has to quit; Another thread, at some point, writes true to this bool and then calls join() on thread B. Clearly I do not mind to call join() before thread B can even see that the atomic_bool was set, nor do I mind when thread B already saw the change and exited execution before I call join(). But I am wondering: using memory_order_relaxed on both sides, is it possible to call join() and block "forever" because the change is never propagated to thread B?
Edit
I contacted Mark Batty (the brain behind mathematically verifying and subsequently fixing the C++ memory model requirements). Originally about something else (which turned out to be a known bug in cppmem and his thesis; so fortunately I didn't make a complete fool of myself, and took the opportunity to ask him about this too; his answer was:
Q: Can it theoretically be that such a store [memory_order_relaxed without (any following) release operation] never reaches the other thread?
Mark: Theoretically, yes, but I don't think that has been observed.
Q: In other words, do relaxed stores make no sense
whatsoever unless you combine them with some release operation (and
acquire on the other thread), assuming you want another thread to
see it?
Mark: Nearly all of the use cases for them do use release and acquire, yes.
This is all the standard has to say on the matter, I believe:
[intro.multithread]/25 An implementation should ensure that the last value (in modification order) assigned by an atomic or synchronization operation will become visible to all other threads in a finite period of time.
In practice
Without any other synchronization methods, how long would it take
before other threads can see this, using
x.load(std::memory_order_relaxed);?
No time. It's a normal write, it goes to the store buffer, so it will be available in the L1d cache in less time than a blink. But that's only when the assembly instruction is run.
Instructions can be reordered by the compiler, but no reasonable compiler would reorder atomic operation over arbitrarily long loops.
In theory
Q: Can it theoretically be that such a store [memory_order_relaxed
without (any following) release operation] never reaches the other
thread?
Mark: Theoretically, yes,
You should have asked him what would happen if the "following release fence" was added back. Or with atomic store release operation.
Why wouldn't these be reordered and delayed a loooong time? (so long that it seems like an eternity in practice)
Is it possible that the value written to x stays entirely thread-local
given the current definition of the C/C++ memory model that the
standard gives?
If an imaginary and especially perverse implementation wanted to delay the visibility of atomic operation, why would it do that only for relaxed operations? It could well do it for all atomic operations.
Or never run some threads.
Or run some threads so slowly that you would believe they aren't running.
This is what the standard says in 29.3.12:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
There is no guarantee a store will become visible in another thread, there is no guaranteed timing and there is no formal relationship with memory order.
Of course, on each regular architecture a store will become visible, but on rare platforms that do not support cache coherency, it may never become visible to a load.
In that case, you would have to reach for an atomic read-modify-write operation to get the latest value in the modification order.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I at point where he desribes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How dependency-ordered-before differs from synchronizes-with? They both create happens-before relationship. What is exact difference?
I am confused about following example:
int global_data[]={ … };
std::atomic<int> index;
void f()
{
int i=index.load(std::memory_order_consume);
do_something_with(global_data[std::kill_dependency(i)]);
}
What does kill_dependency exactly do? Which dependency it kills? Between which entities? And how compiler can exploit that knowladge?
Can all occurancies of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5]
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release by some examples with mans in cubicles. Are there some similar simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3. The code still handles 1 and 2. In that case, you'd use acquire and release to ensure proper memory ordering.
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurancies of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments. Kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be synchronized. In your example, do_something_with is allowed to assume any cached value of globalldata[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.

compare-and-swap atomic operation vs Load-link/store-conditional operation

Under an x86 processor I am not sure of the difference between compare-and-swap atomic operation and Load-link/store-conditional operation. Is the latter safer than the former? Is it the case that the first is better than the second?
There are three common styles of atomic primitive: Compare-Exchange, Load-Linked/Store-Conditional, and Compare-And-Swap.
A CompareExchange operation will atomically read a memory location and, if it matches a compare value, store a specified new value. If the value that was read does not match the compare value, no store takes place. In any case, the operation will report the original value read.
A Compare-And-Swap operation is similar to CompareExchange, except that it does not report what value was read--merely whether whatever value was read matched the compare value. Note that a CompareExchange may be used to implement Compare-And-Swap by having it report whether the value read from memory matched the specified compare value.
The LL/SC combination allows a store operation to be conditioned upon whether some outside influence might have affected the target since its value was loaded. In particular, it guarantees that if the store succeeds, the location has not been written at all by outside code. Even if outside code wrote a new value and then re-wrote the original value, that would be guaranteed to cause the conditional code to fail. Conceptually, this might make LL/SC seem more powerful than other methods, since it wouldn't have the "ABA" problem. Unfortunately, LL/SC semantics allow for stores to spontaneously fail, and the probability of spontaneously failure may go up rapidly as the complexity of the code between the load and store is increased. While using LL/SC to implement something like an atomic increment directly would be more efficient than using it to implement a compare-and-swap, and then implementing an atomic increment using that compare-and-swap implementation, in situations where one would need to do much between a load and store, one should generally use LL-SC to implement a compare-and-swap, and then use that compare-and-swap method in a load-modify-CompareAndSwap loop.
Of the three primitives, the Compare-And-Swap is the least powerful, but it can be implemented in terms of either of the other two. CompareAndSwap can do a pretty good job of emulating CompareExchange, though there are some corner cases where such emulation might live-lock. Neither CompareExchange nor Compare-And-Swap can offer guarantees quite as strong as LL-SC, though the limited amount of code one can reliably place within an LL/SC loop limits the usefulness of its guarantees.
x86 does not provide LL/SC instructions. Check out wikipedia for platforms that do. Also see this SO question.

Synchronizing access to variable

I need to provide synchronization to some members of a structure.
If the structure is something like this
struct SharedStruct {
int Value1;
int Value2;
}
and I have a global variable
SharedStruct obj;
I want that the write from a processor
obj.Value1 = 5; // Processor B
to be immediately visible to the other processors, so that when I test the value
if(obj.Value1 == 5) { DoSmth(); } // Processor A
else DoSmthElse();
to get the new value, not some old value from the cache.
First I though that if I use volatile when writing/reading the values, it is enough. But I read that volatile can't solve this kind o issues.
The members are guaranteed to be properly aligned on 2/4/8 byte boundaries, and writes should be atomic in this case, but I'm not sure how the cache could interfere with this.
Using memory barriers (mfence, sfence, etc.) would be enough ? Or some interlocked operations are required ?
Or maybe something like
lock mov addr, REGISTER
?
The easiest would obviously be some locking mechanism, but speed is critical and can't afford locks :(
Edit
Maybe I should clarify a bit. The value is set only once (behaves like a flag). All the other threads need just to read it. That's why I think that it may be a way to force the read of this new value without using locks.
Thanks in advance!
There Ain't No Such Thing As A Free Lunch. If your data is being accessed from multiple threads, and it is necessary that updates are immediately visible by those other threads, then you have to protect the shared struct by a mutex, or a readers/writers lock, or some similar mechanism.
Performance is a valid concern when synchronizing code, but it is trumped by correctness. Generally speaking, aim for correctness first and then profile your code. Worrying about performance when you haven't yet nailed down correctness is premature optimization.
Use explicitly atomic instructions. I believe most compilers offer these as intrinsics. Compare and Exchange is another good one.
If you intend to write a lockless algorithm, you need to write it so that your changes only take effect when conditions are as expected.
For example, if you intend to insert a linked list object, use the compare/exchange stuff so that it only inserts if the pointer still points at the same location when you actually do the update.
Or if you are going to decrement a reference count and free the memory at count 0, you will want to pre-free it by making it unavailable somehow, check that the count is still 0 and then really free it. Or something like that.
Using a lock, operate, unlock design is generally a lot easier. The lock-free algorithms are really difficult to get right.
All the other answers here seem to hand wave about the complexities of updating shared variables using mutexes, etc. It is true that you want the update to be atomic.
And you could use various OS primitives to ensure that, and that would be good
programming style.
However, on most modern processors (certainly the x86), writes of small, aligned scalar values is atomic and immediately visible to other processors due to cache coherency.
So in this special case, you don't need all the synchronizing junk; the hardware does the
atomic operation for you. Certainly this is safe with 4 byte values (e.g., "int" in 32 bit C compilers).
So you could just initialize Value1 with an uninteresting value (say 0) before you start the parallel threads, and simply write other values there. If the question is exiting the loop on a fixed value (e.g., if value1 == 5) this will be perfectly safe.
If you insist on capturing the first value written, this won't work. But if you have a parallel set of threads, and any value written other than the uninteresting one will do, this is also fine.
I second peterb's answer to aim for correctness first. Yes, you can use memory barriers here, but they will not do what you want.
You said immediately. However, how immediate this update ever can be, you could (and will) end up with the if() clause being executed, then the flag being set, and than the DoSmthElse() being executed afterwards. This is called a race condition...
You want to synchronize something, it seems, but it is not this flag.
Making the field volatile should make the change "immediately" visible in other threads, but there is no guarantee that the instant at which thread A executes the update doesn't occur after thread B tests the value but before thread B executes the body of the if/else statement.
It sounds like what you really want to do is make that if/else statement atomic, and that will require either a lock, or an algorithm that is tolerant of this sort of situation.