Does a compiler need to care about other threads during optimizations? - c++

This is a spin-off from a discussion about C# thread safety guarantees.
I had the following presupposition:
in the absence of thread-aware primitives (mutexes, std::atomic* etc.; let's exclude volatile as well for simplicity), a valid C++ compiler may perform any kind of transformation, including introducing reads from memory (or writes, if it wants to), as long as the semantics of the code in the current thread (that is, output and [excluded in this question] volatile accesses) remain the same from the current thread's point of view, i.e. disregarding the existence of other threads. The fact that introducing reads/writes may change other threads' behavior (e.g. because the other threads read the data without proper synchronization or perform other kinds of UB) can be totally ignored by a standard-conforming compiler.
Is this presupposition correct or not? I would expect it to follow from the as-if rule. (I believe it is correct, but other people seem to disagree with me.) If possible, please include the appropriate normative references.

Yes. C++ defines data-race UB as potentially-concurrent access to non-atomic objects when not all the accesses are reads. Another recent Q&A quotes the standard, including:
[intro.races]/2 - Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21 ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
That gives the compiler freedom to optimize code in ways that preserve the behaviour of the thread executing a function, but not what other threads (or a debugger) might see if they go looking at things they're not supposed to. (i.e. data race UB means that the order of reading/writing non-atomic variables is not part of the observable behaviour an optimizer has to preserve.)
introducing reads/writes may change other thread's behavior
The as-if rule allows you to invent reads, but no, you can't invent writes to objects this thread didn't already write. That's why if(a[i] > 10) a[i] = 10; is different from a[i] = a[i]>10 ? 10 : a[i];
It's legal for two different threads to write a[1] and a[2] at the same time, and one thread loading a[0..3] and then storing back some modified and some unmodified elements could step on the store by the thread that wrote a[2].
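As a minimal sketch of why those two forms differ under concurrency (the array and function names here are purely illustrative):
int a[4];                        // shared; another thread may write a[2] concurrently

void clamp_first_two() {
    for (int i = 0; i < 2; ++i)
        if (a[i] > 10) a[i] = 10;    // stores only to elements that were actually > 10
}
// Transformations the compiler must NOT make:
//  1) a[i] = (a[i] > 10) ? 10 : a[i];  - this stores a[i] even when it is unchanged, so it
//     can overwrite a concurrent store to that element with the stale value it loaded.
//  2) Auto-vectorizing by loading a[0..3] and unconditionally storing all four elements back
//     invents writes to a[2] and a[3], stepping on another thread's store to a[2].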
Crash with icc: can the compiler invent writes where none existed in the abstract machine? is a detailed look at a compiler bug where ICC did that when auto-vectorizing with SIMD blends. Including links to Herb Sutter's atomic weapons talk where he discusses the fact that compilers must not invent writes.
By contrast, AVX-512 masking and AVX vmaskmovps etc. (and, I think, ARM SVE and the RISC-V vector extension) do have proper masking with fault suppression, so they can genuinely avoid storing to some SIMD elements at all, without branching.
When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements? AVX-512 masking does indeed do fault-suppression for read-only or unmapped pages that masked-off elements extend into.
AVX-512 and Branching - auto-vectorizing with stores inside an if() vs. branchless.
It's legal to invent atomic RMWs (except without the Modify part), e.g. an 8-byte lock cmpxchg [rcx], rdx if you want to modify some of the bytes in that region. But in practice that's more costly than just storing modified bytes individually so compilers don't do that.
Of course a function that does unconditionally write a[2] can write it multiple times, and with different temporary values before eventually updating it to the final value. (Probably only a Deathstation 9000 would invent different-valued temporary contents, like turning a[2] = 3 into a[2] = 2; a[2]++;)
For more about what compilers can legally do, see Who's afraid of a big bad optimizing compiler? on LWN. The context for that article is Linux kernel development, where they rely on GCC to go beyond the ISO C standard and actually behave in sane ways that make it possible to roll their own atomics with volatile int* and inline asm. It explains many of the practical dangers of reading or writing a non-atomic shared variable.

Related

What is the difference between load/store relaxed atomic and normal variable?

As I see from a test-case: https://godbolt.org/z/K477q1
The generated assembly for a relaxed atomic load/store is the same as for the normal variable: ldr and str
So, is there any difference between relaxed atomic and normal variable?
The difference is that a normal load/store is not guaranteed to be tear-free, whereas a relaxed atomic read/write is. Also, the atomic guarantees that the compiler doesn't rearrange or optimise-out memory accesses in a similar fashion to what volatile guarantees.
(Pre-C++11, volatile was an essential part of rolling your own atomics. But now it's obsolete for that purpose. It does still work in practice but is never recommended: When to use volatile with multi threading? - essentially never.)
On most platforms it just happens that the architecture provides a tear-free load/store by default (for aligned int and long) so it works out the same in asm if loads and stores don't get optimized away. See Why is integer assignment on a naturally aligned variable atomic on x86? for example. In C++ it's up to you to express how the memory should be accessed in your source code instead of relying on architecture-specific features to make the code work as intended.
If you were hand-writing in asm, your source code would already nail down when values were kept in registers vs. loaded / stored to (shared) memory. In C++, telling the compiler when it can/can't keep values private is part of why std::atomic<T> exists.
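As a minimal sketch of that distinction (the names here are illustrative):
#include <atomic>

std::atomic<int> shared_flag{0};  // every access is a real load/store; can't be kept private
int plain_counter = 0;            // the compiler may keep this in a register for the whole loop

int spin_until_set() {
    while (shared_flag.load(std::memory_order_relaxed) == 0)  // must be reloaded every iteration
        ++plain_counter;          // increments can be done in a register and stored once at the end
    return plain_counter;
}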
If you read one article on this topic, take a look at the Preshing one here:
https://preshing.com/20130618/atomic-vs-non-atomic-operations/
Also try this presentation from CppCon 2017:
https://www.youtube.com/watch?v=ZQFzMfHIxng
Links for further reading:
Read a non-atomic variable, atomically?
https://en.cppreference.com/w/cpp/atomic/memory_order#Relaxed_ordering
Causing non-atomics to tear
https://lwn.net/Articles/793895/
What is the (slight) difference on the relaxing atomic rules? which includes a link to a Herb Sutter "atomic weapons" article which is also linked here:
https://herbsutter.com/2013/02/11/atomic-weapons-the-c-memory-model-and-modern-hardware/
Also see Peter Cordes' linked article: https://electronics.stackexchange.com/q/387181
And a related one about the Linux kernel: https://lwn.net/Articles/793253/
No tearing is only part of what you get with std::atomic<T> - you also avoid data race undefined behaviour.
atomic<T> constrains the optimizer to not assume the value is unchanged between accesses in the same thread.
atomic<T> also makes sure the object is sufficiently aligned: e.g. some C++ implementations for 32-bit ISAs have alignof(int64_t) = 4 but alignof(atomic<int64_t>) = 8 to enable lock-free 64-bit operations. (e.g. gcc for 32-bit x86 GNU/Linux). In that case, usually a special instruction is needed that the compiler might not use otherwise, e.g. ARMv8 32-bit ldp load-pair, or x86 SSE2 movq xmm before bouncing to integer regs.
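A quick way to see what your implementation does (a sketch; the printed values differ by target, e.g. 4 and 8 on gcc for 32-bit x86 GNU/Linux, typically 8 and 8 on 64-bit targets):
#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    std::atomic<std::int64_t> a{0};
    std::cout << alignof(std::int64_t) << '\n'                  // alignment of the plain type
              << alignof(std::atomic<std::int64_t>) << '\n'     // alignment atomic<> requires
              << a.is_lock_free() << '\n';                      // 1 if lock-free on this target
}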
In asm for most ISAs, pure-load and pure-store of naturally-aligned int and long are atomic for free, so atomic<T> with memory_order_relaxed can compile to the same asm as plain variables; atomicity (no tearing) doesn't require any special asm. For example: Why is integer assignment on a naturally aligned variable atomic on x86? Depending on surrounding code, the compiler might not manage to optimize out any accesses to non-atomic objects, in which case code-gen will be the same between plain T and atomic<T> with mo_relaxed.
The reverse is not true: It's not at all safe to write C++ as if you were writing in asm. In C++, multiple threads accessing the same object at the same time is data-race undefined behaviour, unless all the accesses are reads.
Thus C++ compilers are allowed to assume that no other threads are changing a variable in a loop, per the "as-if" optimization rule. If bool done is not atomic, a loop like while(!done) { } will compile into if(!done) infinite_loop;, hoisting the load out of the loop. See Multithreading program stuck in optimized mode but runs normally in -O0 for a detailed example with compiler asm output. (Compiling with optimization disabled is very similar to making every object volatile: memory in sync with the abstract machine between C++ statements for consistent debugging.)
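A minimal sketch of the fix described there, assuming a simple spin-wait (names are illustrative):
#include <atomic>
#include <thread>

std::atomic<bool> done{false};   // with a plain bool the load could be hoisted,
                                 // turning the loop into if(!done) infinite_loop;
void waiter() {
    while (!done.load(std::memory_order_relaxed)) {
        // spin; the atomic load has to be repeated every iteration
    }
}

int main() {
    std::thread t(waiter);
    done.store(true, std::memory_order_relaxed);
    t.join();
}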
Also obviously RMW operations like += or var.fetch_add(1, mo_seq_cst) are atomic and do have to compile to different asm than non-atomic +=. Can num++ be atomic for 'int num'?
The constraints on the optimizer placed by atomic operations are similar to what volatile does. In practice volatile is a way to roll your own mo_relaxed atomic<T>, but without any easy way to get ordering wrt. other operations. It's de-facto supported on some compilers, like GCC, because it's used by the Linux kernel. However, atomic<T> is guaranteed to work by the ISO C++ standard; When to use volatile with multi threading? - there's almost never a reason to roll your own, just use atomic<T> with mo_relaxed.
Also related: Why don't compilers merge redundant std::atomic writes? / Can and does the compiler optimize out two atomic loads? - compilers currently don't optimize atomics at all, so atomic<T> is currently equivalent to volatile atomic<T>, pending further standards work to provide ways for programmers to control when / what optimization would be ok.
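For example (a sketch: current mainstream compilers emit both of these stores rather than collapsing them into one, even though doing so would be allowed in principle):
#include <atomic>

std::atomic<int> progress{0};

void report() {
    progress.store(1, std::memory_order_relaxed);  // not currently merged away
    progress.store(2, std::memory_order_relaxed);  // both stores appear in the asm
}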
Very good question actually, and I asked the same question when I started learning concurrency.
I'll answer as simply as possible, even though the answer is a bit more complicated.
Reading and writing the same non-atomic variable from different threads without synchronization is undefined behavior - one thread is not guaranteed to read the value that the other thread wrote.
Using an atomic variable solves the problem - by using atomics, all threads are guaranteed to read the latest written value, even if the memory order is relaxed.
In fact, atomics are always thread safe, regardless of the memory order!
The memory order is not for the atomics themselves - it's for non-atomic data.
Here is the thing - if you use locks, you don't have to think about those low-level things. Memory orders are used in lock-free environments where we need to synchronize non-atomic data.
Here is the beautiful thing about lock-free algorithms: we use atomic operations that are always thread safe, but we "piggyback" those operations with memory orders to synchronize the non-atomic data used in those algorithms.
For example, a lock-free linked list. Usually, a lock-free linked list node looks something like this:
template <class T>
struct Node {
    std::atomic<Node*> next_node;
    T non_atomic_data;
};
Now, let's say I push a new node into the list. next_node is always thread safe; another thread will always see the latest atomic value.
But who guarantees that other threads see the correct value of non_atomic_data?
No one.
Here is a perfect example of the usage of memory orders - we "piggyback" the atomic stores and loads of next_node by also adding memory orders that synchronize the value of non_atomic_data.
So when we store a new node to the list, we use memory_order_release to "push" the non-atomic data out to main memory. When we read the new node by reading next_node, we use memory_order_acquire and then we "pull" the non-atomic data from main memory.
This way we ensure that both next_node and non_atomic_data are always synchronized across threads.
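A sketch of that piggybacking pattern for pushing onto the front of such a list (repeating the node layout above; the function names and CAS loop are illustrative, not a complete lock-free list):
#include <atomic>

template <class T>
struct Node {
    std::atomic<Node*> next_node{nullptr};
    T non_atomic_data;
};

template <class T>
void push_front(std::atomic<Node<T>*>& head, Node<T>* n) {
    // n->non_atomic_data was filled in by this thread before this call.
    Node<T>* old_head = head.load(std::memory_order_relaxed);
    do {
        n->next_node.store(old_head, std::memory_order_relaxed);
    } while (!head.compare_exchange_weak(old_head, n,
                                         std::memory_order_release,   // "push" the node's data out
                                         std::memory_order_relaxed));
}

template <class T>
T* front_data(std::atomic<Node<T>*>& head) {
    Node<T>* n = head.load(std::memory_order_acquire);  // pairs with the release above
    return n ? &n->non_atomic_data : nullptr;            // non_atomic_data is now visible
}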
memory_order_relaxed doesn't synchronize any non-atomic data; it synchronizes only itself - the atomic variable. When this is used, developers can assume that the atomic variable doesn't reference any non-atomic data published by the same thread that wrote the atomic variable. In other words, that atomic variable isn't, for example, an index into a non-atomic array, or a pointer to non-atomic data, or an iterator into some non-thread-safe collection. (It would be fine to use relaxed atomic stores and loads for an index into a constant lookup table, or one that's synchronized separately. You only need acq/rel synchronization if the pointed-to or indexed data was written by the same thread.)
This is faster (at least on some architectures) than using stronger memory orders, but it can be used in fewer cases.
Great, but even this is not the full answer. I said memory orders are not used for the atomics themselves. I was half-lying.
With a relaxed memory order, atomics are still thread safe, but they have a downside - they can be reordered. Look at the following snippet:
a.store(1, std::memory_order_relaxed);
b.store(2, std::memory_order_relaxed);
In reality, a.store can happen after b.store. The CPU does this all the time; it's called out-of-order execution, and it's one of the optimization techniques CPUs use to speed up execution. a and b are still thread-safe, even though the thread-safe stores might happen in reverse order.
Now, what happens if the order matters? Many lock-free algorithms depend on the order of atomic operations for their correctness.
Memory orders are also used to prevent reordering. This is why memory orders are so complicated: they do two things at the same time.
memory_order_acquire tells the compiler and CPU not to execute operations that come after it in program order before it.
Similarly, memory_order_release tells the compiler and CPU not to execute operations that come before it in program order after it.
memory_order_relaxed tells the compiler/CPU that the atomic operation can be reordered when possible, in a similar way to how non-atomic operations are reordered whenever possible.
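Putting the two rules together gives the classic publish pattern (a minimal sketch; the payload/ready names are illustrative):
#include <atomic>
#include <cassert>

int payload = 0;                  // non-atomic data being published
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // must not be reordered below the release store
    ready.store(true, std::memory_order_release);
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // later reads can't be reordered above this
    assert(payload == 42);        // guaranteed once the acquire load has seen true
}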

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
It's the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable, i.e. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former happens before the latter.
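A sketch of that reuse scenario (the placement-new details are illustrative, and as noted above the storage's lifetime and alignment are assumed to be taken care of):
#include <atomic>
#include <new>

alignas(std::atomic<int>) unsigned char storage[sizeof(std::atomic<int>)];

void reuse() {
    ::new (storage) int(42);                              // older, non-atomic object
    // ... its lifetime ends when the storage is reused just below ...
    auto* newer = ::new (storage) std::atomic<int>(0);    // newer, atomic object
    newer->store(1, std::memory_order_relaxed);
    // A non-atomic read or write of the old int by another thread, concurrent with
    // these atomic operations, conflicts with them: same memory location, not all
    // accesses atomic, no happens-before ordering => data race, undefined behaviour.
}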
Disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if so, how?
You can write it in the code and compile it, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kind of problems they solve.
As you might know, what we loosely call "memory" is a multi-layered hierarchy of storage: first there is RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problem. On multi-core processors we have all of them: every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering - the CPU may decide at runtime to shuffle some read/write instructions to make the code run faster. This may yield strange results that are completely invisible in the high-level code that produced those instructions. The most classic example of this phenomenon is the "two threads - two integers" example:
int i = 0;
int j = 0;
// thread a: i = 1; then print j
// thread b: j = 1; then print i
Logically, the result "00" should be impossible: if a finishes first the result may be "01", if b finishes first it may be "10", and if both finish at the same time it may be "11". Yet if you build a small program which imitates this situation and run it in a loop, you will very quickly see the result "00".
Another problem is memory visibility. As I mentioned before, a variable's value may be cached in one of the cache lines, or be stored in one of the registers. When the CPU updates a variable's value it may delay writing the new value back to RAM; it may keep the value in the cache/register because it was told (by the compiler's optimizations) that the value will be updated again soon, so in order to make the program faster it updates the value again and only then writes it back to RAM. This may cause undefined behaviour if another CPU (and consequently another thread or process) depends on the new value.
For example, look at this pseudo-code:
bool b = true;
while (b) { /* print 'a' */ }
// another thread: sleep 4 seconds, then b = false;
The character 'a' may be printed forever, because b may stay cached (or kept in a register) and never be re-read.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from RAM correctly, without doing unwanted reordering (read about memory orders). A memory order may force the CPU to write its values back to RAM, or to re-read values from RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic<bool> b = true;
while (b.load()) { /* print 'a' */ }          // atomic re-load each iteration
// another thread: writes false to b's memory non-atomically
So although one thread re-reads the value of b from RAM again and again, the other thread may never write false back to RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds of the fastest possible access (and much simpler syntax) in serial, and safely managed contention when in parallel.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

Concurrency: Atomic and volatile in C++11 memory model

A global variable is shared across 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. For an atomic variable, can one thread read a stale value? Each core might have a value of the shared variable in its cache, and when one thread writes to its copy in a cache, the other thread on a different core might read a stale value from its own cache. Or does the compiler do strong memory ordering to read the latest value from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
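A minimal sketch of the two scenarios (the thread functions are illustrative; imagine each running on its own std::thread):
#include <atomic>

std::atomic<int> x{0}, y{0};

// Scenario 1: two independent writer threads.
void writer_a() { x.store(1, std::memory_order_seq_cst); }
void writer_b() { y.store(2, std::memory_order_seq_cst); }

// Scenario 2: a single writer thread doing both stores in order.
void writer_ab() {
    x.store(1, std::memory_order_seq_cst);
    y.store(2, std::memory_order_seq_cst);
}

void reader() {
    int ry = y.load(std::memory_order_seq_cst);  // reads y before x
    int rx = x.load(std::memory_order_seq_cst);
    // Scenario 1: (rx,ry) may be (0,0), (1,0), (0,2) or (1,2).
    // Scenario 2: (0,2) is impossible - seeing y==2 implies the earlier write to x is visible.
    (void)rx; (void)ry;
}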
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
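For example, handing out tickets with fetch_add (a sketch; the thread creation is omitted):
#include <atomic>

std::atomic<int> next_ticket{0};

int take_ticket() {
    // Each RMW operates on the latest value in the modification order, so across all
    // threads the returned tickets are 0, 1, 2, ... with no duplicates and no gaps.
    return next_ticket.fetch_add(1, std::memory_order_relaxed);
}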
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent compiler optimizations when accessing memory mapped IO. Modern compilers tend to do no more than suppress optimizations for volatile, although on some machines this isn't sufficient even for memory mapped IO. Except for the special case of signal handlers and setjmp/longjmp sequences (where the C standard, and in the case of signals the Posix standard, gives additional guarantees), it must be considered useless on a modern machine, where without special additional instructions (fences or memory barriers) the hardware may reorder or even suppress certain accesses. Since you shouldn't be using setjmp et al. in C++, this more or less leaves signal handlers, and in a multithreaded environment, at least under Unix, there are better solutions for those as well. And possibly memory mapped IO, if you're working on kernel code and can ensure that the compiler generates whatever is needed for the platform in question. (According to the standard, volatile access is observable behavior, which the compiler must respect. But the compiler gets to define what is meant by “access”, and most seem to define it as “a load or store machine instruction was executed”. Which, on a modern processor, doesn't even mean that there is necessarily a read or write cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does provide a certain number of guarantees across threads; in particular, the code generated around an atomic access will contain the necessary additional instructions to prevent the hardware from reordering the accesses, and to ensure that the accesses propagate down to the global memory shared between cores on a multicore machine. (At one point in the standardization effort, Microsoft proposed adding these semantics to volatile, and I think some of their C++ compilers do. After discussion of the issues in the committee, however, the general consensus—including the Microsoft representative—was that it was better to leave volatile with its original meaning, and to define the atomic types.) Or just use the system-level primitives, like mutexes, which execute whatever instructions are needed in their code. (They have to. You can't implement a mutex without some guarantees concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could change at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C: "volatile" is basically the opposite of "register". Modern compilers now do by default the optimization that "register" used to explicitly request, so nowadays you only ever see 'volatile'. Using the volatile qualifier will guarantee that your processing never uses a stale value, but nothing more.
2) Atomic:
Atomic operations modify data in a single indivisible step, so that it is impossible for ANY other thread to access the data in the middle of such an update. They're usually limited to whatever single-instruction operations the hardware supports: things like ++, --, and swapping 2 pointers. Note that this says nothing about the ORDER in which the different threads will RUN the atomic instructions, only that no two of them will overlap on the same data. That's why you have all those additional options for forcing an ordering.
Volatile and Atomic serve different purposes.
Volatile :
Informs the compiler not to optimize away accesses to the variable. This keyword is used for variables that may change unexpectedly, so it can be used to represent hardware status registers, variables modified in an ISR, or variables shared in a multi-threaded application.
Atomic :
It is also used in multi-threaded applications. However, it ensures that the operation completes without locking or stalling other threads. Atomic operations are free of races and indivisible. A few key usage scenarios are checking whether a lock is free or taken, and atomically adding to a value and returning the result, in a multi-threaded application.

Combining stores/loads of consecutive atomic variables

Referring to a (slightly dated) paper by Hans Boehm, under "Atomic Operations": it mentions that the memory model (proposed at the time) would not prevent an optimizing compiler from combining a sequence of loads, or stores, on the same variable into a single load or store. His example is as follows (updated to hopefully match current syntax):
Given
atomic<int> v;
The code
while( v.load( memory_order_acquire ) ) { ... }
Could be optimized to:
int a = v.load(memory_order_acquire);
while(a) { ... }
Obviously this would be bad, as he states. Now my question is, as the paper is a bit old, does the current C++0x memory model prevent this type of optimization, or is it still technically allowed?
My reading of the standard would seem to lean towards it being disallowed, but the use of "acquire" semantics makes it less clear. For example, if it were "seq_cst" the model seems to guarantee that the load must partake in a total ordering on the access, and loading the value only once would thus seem to violate ordering (as it breaks the sequenced-before/happens-before relationship).
For acquire I interpret 29.3.2 to mean that this optimization cannot occur, since any "release" operation must be observed by the "acquire" operation. Doing only one acquire would seem invalid.
So my question is whether the current model (in the pending standard) would disallow this type of optimization? And if yes, then which part specifically forbids it? If no, does using a volatile atomic solve the problem?
And for bonus, if the load operation has a "relaxed" ordering is the optimization then allowed?
The C++0x standard attempts to outlaw this optimization.
The relevant words are from 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
If the thread that is doing the load only ever issues one load instruction then this is violated, as if it misses the write the first time, it will never see it. It doesn't matter which memory ordering is used for the load, it is the same for both memory_order_seq_cst and memory_order_relaxed.
However, the following optimization is allowed, unless there is something in the loop that forces an ordering:
while( v.load( memory_order_acquire ) ) {
    for(unsigned __temp=0;__temp<100;++__temp) {
        // original loop body goes here
    }
}
i.e. the compiler can generate code that executes the actual loads arbitrarily infrequently, provided it still executes them. This is even permitted for memory_order_seq_cst unless there are other memory_order_seq_cst operations in the loop, since this is equivalent to running 100 iterations between any memory accesses by other threads.
As an aside, the use of memory_order_acquire doesn't have the effect you describe --- it is not required to see release operations (other than by 29.3p13 quoted above), just that if it does see the release operation then it imposes visibility constraints on other accesses.
Right from the very paper you're linking:
Volatiles guarantee that the right number of memory operations are performed.
The standard says essentially the same:
Access to volatile objects are evaluated strictly according to the rules of the abstract machine.
This has always been the case, since the very first C compiler by Dennis Ritchie I think. It has to be this way because memory mapped I/O registers won't work otherwise. To read two characters from your keyboard, you need to read the corresponding memory mapped register twice. If the compiler had a different idea about the number of reads it has to perform, that would be too bad!
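A sketch of that memory-mapped case (the register address is made up; on real hardware it comes from the platform documentation):
#include <cstdint>

// Hypothetical memory-mapped UART receive-data register.
volatile std::uint8_t* const uart_rx = reinterpret_cast<volatile std::uint8_t*>(0x40001000);

void read_two_chars(char& c1, char& c2) {
    c1 = *uart_rx;   // first read: first character
    c2 = *uart_rx;   // the compiler must actually perform this second read too,
                     // even though no ordinary object changed in between
}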