Combining stores/loads of consecutive atomic variables - c++

I am referring to a (slightly dated) paper by Hans Boehm, in the section "Atomic Operations". It mentions that the memory model (as proposed at the time) would not prevent an optimizing compiler from combining a sequence of loads, or stores, on the same variable into a single load or store. His example is as follows (updated to what is hopefully correct current syntax):
Given
atomic<int> v;
The code
while( v.load( memory_order_acquire ) ) { ... }
Could be optimized to:
int a = v.load(memory_order_acquire);
while(a) { ... }
Obviously this would be bad, as he states. Now my question is, as the paper is a bit old, does the current C++0x memory model prevent this type of optimization, or is it still technically allowed?
My reading of the standard leans towards it being disallowed, but the use of "acquire" semantics makes it less clear. For example, if it were "seq_cst", the model seems to guarantee that each load must take part in a total ordering of accesses, so loading the value only once would seem to violate that ordering (as it breaks the happens-before relationship).
For acquire I interpret 29.3.2 to mean that this optimization cannot occur, since any "release" operation must be observed by the "acquire" operation. Doing only one acquire would not seem valid.
So my question is whether the current model (in the pending standard) would disallow this type of optimization? And if yes, then which part specifically forbids it? If no, does using a volatile atomic solve the problem?
And for bonus, if the load operation has a "relaxed" ordering is the optimization then allowed?

The C++0x standard attempts to outlaw this optimization.
The relevant words are from 29.3p13:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
If the thread that is doing the load only ever issues one load instruction then this is violated, as if it misses the write the first time, it will never see it. It doesn't matter which memory ordering is used for the load, it is the same for both memory_order_seq_cst and memory_order_relaxed.
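A minimal sketch of the loop in question, assuming hypothetical helper names (`spin_until_cleared`, `run_spin_demo`) that are not from the paper or the standard. It uses `memory_order_relaxed` to illustrate the point that the load must keep being re-issued regardless of the ordering chosen:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: spins on `v` until another thread clears it and
// returns how many times the loop body ran.
unsigned long spin_until_cleared(std::atomic<int>& v) {
    unsigned long iterations = 0;
    // The load is issued again on every iteration. Hoisting it out of the
    // loop would make the wait endless whenever the first load saw non-zero.
    while (v.load(std::memory_order_relaxed)) {
        ++iterations;
        std::this_thread::yield();
    }
    return iterations;
}

// Starts a writer that clears the flag, waits for it in the calling thread,
// and returns the final value of the flag.
int run_spin_demo() {
    std::atomic<int> v{1};
    std::thread writer([&v] { v.store(0, std::memory_order_relaxed); });
    spin_until_cleared(v);
    writer.join();
    return v.load(std::memory_order_relaxed);
}
```

If the compiler were allowed to perform only one load, `spin_until_cleared` could hang whenever the writer ran after that load.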
However, the following optimization is allowed, unless there is something in the loop that forces an ordering:
while( v.load( memory_order_acquire ) ) {
for(unsigned __temp=0;__temp<100;++__temp) {
// original loop body goes here
}
}
i.e. the compiler can generate code that executes the actual loads arbitrarily infrequently, provided it still executes them. This is even permitted for memory_order_seq_cst unless there are other memory_order_seq_cst operations in the loop, since this is equivalent to running 100 iterations between any memory accesses by other threads.
As an aside, the use of memory_order_acquire doesn't have the effect you describe --- it is not required to see release operations (other than by 29.3p13 quoted above), just that if it does see the release operation then it imposes visibility constraints on other accesses.
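To illustrate that last point, here is a small sketch (the names `payload`, `ready` and `run_publish_demo` are made up for this example). The acquire load is not guaranteed to see the release store at any particular moment, but once it does read the stored value, the earlier plain write to `payload` must be visible too:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value observed by the consumer thread.
int run_publish_demo() {
    int payload = 0;              // plain, non-atomic data
    std::atomic<int> ready{0};    // publication flag
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // written before the release store
        ready.store(1, std::memory_order_release);
    });
    std::thread consumer([&] {
        // Keep loading until the release store is actually observed.
        while (ready.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();
        }
        // The acquire load read the value written by the release store,
        // so the write to payload happens-before this read.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```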

Right from the very paper you're linking:
Volatiles guarantee that the right number of memory operations are
performed.
The standard says essentially the same:
Access to volatile objects are evaluated strictly according to the
rules of the abstract machine.
This has always been the case, since the very first C compiler by Dennis Ritchie I think. It has to be this way because memory mapped I/O registers won't work otherwise. To read two characters from your keyboard, you need to read the corresponding memory mapped register twice. If the compiler had a different idea about the number of reads it has to perform, that would be too bad!

Related

Does compiler need to care about other threads during optimizations?

This is a spin-off from a discussion about C# thread safety guarantees.
I had the following presupposition:
in the absence of thread-aware primitives (mutexes, std::atomic* etc., let's exclude volatile as well for simplicity) a valid C++ compiler may do any kind of transformation, including introducing reads from memory (or e.g. writes if it wants to), if the semantics of the code in the current thread (that is, output and [excluded in this question] volatile accesses) remain the same from the current thread's point of view, that is, disregarding the existence of other threads. The fact that introducing reads/writes may change other threads' behavior (e.g. because the other threads read the data without proper synchronization or perform other kinds of UB) can be totally ignored by a standard-conforming compiler.
Is this presupposition correct or not? I would expect this to follow from the as-if rule. (I believe it is, but other people seem to disagree with me.) If possible, please include the appropriate normative references.
Yes, C++ defines data race UB as potentially-concurrent access to non-atomic objects when not all the accesses are reads. Another recent Q&A quotes the standard, including:
[intro.races]/2 - Two expression evaluations conflict if one of them modifies a memory location ... and the other one reads or modifies the same memory location.
[intro.races]/21 ... The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, at least one of which is not atomic, and neither happens before the other, ...
Any such data race results in undefined behavior.
That gives the compiler freedom to optimize code in ways that preserve the behaviour of the thread executing a function, but not what other threads (or a debugger) might see if they go looking at things they're not supposed to. (i.e. data race UB means that the order of reading/writing non-atomic variables is not part of the observable behaviour an optimizer has to preserve.)
introducing reads/writes may change other thread's behavior
The as-if rule allows you to invent reads, but no, you can't invent writes to objects this thread didn't already write. That's why if(a[i] > 10) a[i] = 10; is different from a[i] = a[i]>10 ? 10 : a[i].
It's legal for two different threads to write a[1] and a[2] at the same time, and one thread loading a[0..3] and then storing back some modified and some unmodified elements could step on the store by the thread that wrote a[2].
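A minimal sketch of that situation, with hypothetical helper names (`clamp_conditional`, `run_adjacent_writes`) chosen for illustration. Each thread touches only its own array element, so the program has no data race as written. It would gain one if the compiler turned the conditional store into a load of a[0..3], a blend, and a store of all four elements:

```cpp
#include <array>
#include <cassert>
#include <thread>

// Stores only when the element actually exceeds the limit, so elements
// that are already in range are never written by this function.
void clamp_conditional(int& x) {
    if (x > 10) x = 10;
}

// Two threads write two different elements of the same array concurrently.
// Distinct elements are distinct memory locations, so this is well-defined.
std::array<int, 4> run_adjacent_writes() {
    std::array<int, 4> a{1, 50, 60, 4};
    std::thread t1([&a] { clamp_conditional(a[1]); });
    std::thread t2([&a] { a[2] = 7; });
    t1.join();
    t2.join();
    return a;
}
```

After both joins, the array must hold {1, 10, 7, 4}. An invented store to a[2] by the first thread could instead overwrite the 7 with the stale 60.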
Crash with icc: can the compiler invent writes where none existed in the abstract machine? is a detailed look at a compiler bug where ICC did that when auto-vectorizing with SIMD blends. It includes links to Herb Sutter's atomic weapons talk, where he discusses the fact that compilers must not invent writes.
By contrast, AVX-512 masking and AVX vmaskmovps etc, like ARM SVE and RISC-V vector extensions I think, do have proper masking with fault suppression to actually not store at all to some SIMD elements, without branching.
When using a mask register with AVX-512 load and stores, is a fault raised for invalid accesses to masked out elements? AVX-512 masking does indeed do fault-suppression for read-only or unmapped pages that masked-off elements extend into.
AVX-512 and Branching - auto-vectorizing with stores inside an if() vs. branchless.
It's legal to invent atomic RMWs (except without the Modify part), e.g. an 8-byte lock cmpxchg [rcx], rdx if you want to modify some of the bytes in that region. But in practice that's more costly than just storing modified bytes individually so compilers don't do that.
Of course a function that does unconditionally write a[2] can write it multiple times, and with different temporary values before eventually updating it to the final value. (Probably only a Deathstation 9000 would invent different-valued temporary contents, like turning a[2] = 3 into a[2] = 2; a[2]++;)
For more about what compilers can legally do, see Who's afraid of a big bad optimizing compiler? on LWN. The context for that article is Linux kernel development, where they rely on GCC to go beyond the ISO C standard and actually behave in sane ways that make it possible to roll their own atomics with volatile int* and inline asm. It explains many of the practical dangers of reading or writing a non-atomic shared variable.

If a RMW operation changes nothing, can it be optimized away, for all memory orders?

In the C/C++ memory model, can a compiler just combine and then remove redundant/NOP atomic modification operations, such as:
x++,
x--;
or even simply
x+=0; // return value is ignored
For an atomic scalar x?
Does that hold for sequential consistency or just weaker memory orders?
(Note: For weaker memory orders that still do something; for relaxed, there is no real question here. EDIT AGAIN: No actually there is a serious question in that special case. See my own answer. Not even relaxed is cleared for removal.)
EDIT:
The question is not about code generation for a particular access: if I wanted to see two lock add generated on Intel for the first example, I would have made x volatile.
The question is whether these C/C++ instructions have any impact whatsoever: can the compiler just filter and remove these null operations (that are not relaxed order operations), as a sort of source to source transformation? (or abstract tree to abstract tree transformation, perhaps in the compiler "front end")
EDIT 2:
Summary of the hypotheses:
not all operations are relaxed
nothing is volatile
atomic objects are really potentially accessible by multiple functions and threads (no automatic atomic whose address isn't shared)
Optional hypothesis:
If you want, you may assume that the address of the atomic is not taken, that all accesses are by name, and that all accesses have one of these properties:
(1) No access of that variable, anywhere, has a relaxed load/store element: all load operations should have acquire and all stores should have release (so all RMW should be at least acq_rel).
(2) Or, for those accesses that are relaxed, the access code doesn't read the value for a purpose other than changing it: a relaxed RMW does not conserve the value further (and does not test the value to decide what to do next). In other words, no data or control dependency on the value of the atomic object unless the load has an acquire.
(3) Or all accesses of the atomic are sequentially consistent.
That is I'm especially curious about these (I believe quite common) use cases.
Note: an access is not considered "completely relaxed" even if it's done with a relaxed memory order, when the code makes sure observers have the same memory visibility, so this is considered valid for (1) and (2):
atomic_thread_fence(std::memory_order_release);
x.store(1,std::memory_order_relaxed);
as the memory visibility is at least as good as with just x.store(1,std::memory_order_release);
This is considered valid for (1) and (2):
int v = x.load(std::memory_order_relaxed);
atomic_thread_fence(std::memory_order_acquire);
for the same reason.
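As a rough sketch of how the two fence patterns above combine (the names `payload`, `x_flag` and `run_fence_publish_demo` are assumptions made for this example), a release fence before a relaxed store pairs with an acquire fence after a relaxed load that reads the stored value:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value observed by the reader thread.
int run_fence_publish_demo() {
    int payload = 0;               // plain, non-atomic data
    std::atomic<int> x_flag{0};
    int seen = -1;

    std::thread writer([&] {
        payload = 7;
        std::atomic_thread_fence(std::memory_order_release);
        x_flag.store(1, std::memory_order_relaxed);
    });
    std::thread reader([&] {
        // Relaxed loads, repeated until the store is observed.
        while (x_flag.load(std::memory_order_relaxed) == 0) {
            std::this_thread::yield();
        }
        std::atomic_thread_fence(std::memory_order_acquire);
        // The two fences synchronize because the relaxed load read the
        // value written by the relaxed store, so payload is visible here.
        seen = payload;
    });

    writer.join();
    reader.join();
    return seen;
}
```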
This is stupidly, trivially valid for (2) (i is just an int)
i=x.load(std::memory_order_relaxed),i=0; // useless
as no information from a relaxed operation was kept.
This is valid for (2):
(void)x.fetch_add(1, std::memory_order_relaxed);
This is not valid for (2):
if (x.load(std::memory_order_relaxed))
f();
else
g();
as a consequential decision was based on a relaxed load, neither is
i += x.fetch_add(1, std::memory_order_release);
Note: (2) covers one of the most common uses of an atomic, the thread safe reference counter. (CORRECTION: It isn't clear that all thread safe counters technically fit the description as acquire can be done only on 0 post decrement, and then a decision was taken based on counter>0 without an acquire; a decision to not do something but still...)
No, definitely not entirely. It's at least a memory barrier within the thread for stronger memory orders.
For mo_relaxed atomics, yes I think it could in theory be optimized away completely, as if it wasn't there in the source. It's equivalent for a thread to simply not be part of a release-sequence it might have been part of.
If you used the result of the fetch_add(0, mo_relaxed), then I think collapsing them together and just doing a load instead of an RMW of 0 might not be exactly equivalent. Barriers in this thread surrounding the relaxed RMW still have an effect on all operations, including ordering the relaxed operation wrt. non-atomic operations. With a load+store tied together as an atomic RMW, things that order stores could order an atomic RMW when they wouldn't have ordered a pure load.
But I don't think any C++ ordering is like that: mo_release stores order earlier loads and stores, and atomic_thread_fence(mo_release) is like an asm StoreStore + LoadStore barrier. (Preshing on fences). So yes, given that any C++-imposed ordering would also apply to a relaxed load equally to a relaxed RMW, I think int tmp = shared.fetch_add(0, mo_relaxed) could be optimized to just a load.
(In practice compilers don't optimize atomics at all, basically treating them all like volatile atomic, even for mo_relaxed. Why don't compilers merge redundant std::atomic writes? and http://wg21.link/n4455 + http://wg21.link/p0062. It's too hard / no mechanism exists to let compilers know when not to.)
But yes, the ISO C++ standard on paper makes no guarantee that other threads can actually observe any given intermediate state.
Thought experiment: Consider a C++ implementation on a single-core cooperative multi-tasking system. It implements std::thread by inserting yield calls where needed to avoid deadlocks, but not between every instruction. Nothing in the standard requires a yield between num++ and num-- to let other threads observe that state.
The as-if rule basically allows a compiler to pick a legal/possible ordering and decide at compile-time that it's what happens every time.
In practice this can create fairness problems if an unlock/re-lock never actually gives other threads the chance to take a lock if --/++ are combined together into just a memory barrier with no modification of the atomic object! This among other things is why compilers don't optimize.
Any stronger ordering for one or both of the operations can begin or be part of a release-sequence that synchronizes-with a reader. A reader that does an acquire load of a release store/RMW Synchronizes-With this thread, and must see all previous effects of this thread as having already happened.
IDK how the reader would know that it was seeing this thread's release-store instead of some previous value, so a real example is probably hard to cook up. At least we could create one without possible UB, e.g. by reading the value of another relaxed atomic variable so we avoid data-race UB if we didn't see this value.
Consider the sequence:
// broken code where optimization could fix it
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
done.fetch_add(-1, mo_relaxed);
done.fetch_add(+1, mo_release); // release-store publishes the result
This could optimize to just done.store(1, mo_release); which correctly publishes a 1 to the other thread without the risk of the 1 being visible too soon, before the updated buf values.
But it could also optimize just the cancelling pair of RMWs into a fence after the relaxed store, which would still be broken. (And not the optimization's fault.)
// still broken
memcpy(buf, stuff, sizeof(buf));
done.store(1, mo_relaxed); // relaxed: can reorder with memcpy
atomic_thread_fence(mo_release);
I haven't thought of an example where safe code becomes broken by a plausible optimization of this sort. Of course just removing the pair entirely even when they're seq_cst wouldn't always be safe.
A seq_cst increment and decrement does still create a sort of memory barrier. If they weren't optimized away, it would be impossible for earlier stores to interleave with later loads. To preserve this, compiling for x86 would probably still need to emit mfence.
Of course the obvious thing would be a lock add [x], 0 which does actually do a dummy RMW of the shared object that we did x++/x-- on. But I think the memory barrier alone, not coupled to an access to that actual object or cache line is sufficient.
And of course it has to act as a compile-time memory barrier, blocking compile-time reordering of non-atomic and atomic accesses across it.
For acq_rel or weaker fetch_add(0) or cancelling sequence, the run-time memory barrier might happen for free on x86, only needing to restrict compile-time ordering.
See also a section of my answer on Can num++ be atomic for 'int num'?, and in comments on Richard Hodges' answer there. (But note that some of that discussion is confused by arguments about when there are modifications to other objects between the ++ and --. Of course all ordering of this thread's operations implied by the atomics must be preserved.)
As I said, this is all hypothetical and real compilers aren't going to optimize atomics until the dust settles on N4455 / P0062.
The C++ memory model provides four coherence requirements for all atomic accesses to the same atomic object. These requirements apply regardless of the memory order. As stated in a non-normative note:
The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads.
Emphasis added.
Given that both operations are happening to the same atomic variable, and the first definitely happens before the second (due to being sequenced before it), there can be no reordering of these operations. Again, even if relaxed operations are used.
If this pair of operations were removed by a compiler, that would guarantee that no other threads would ever see the incremented value. So the question now becomes whether the standard would require some other thread to be able to see the incremented value.
It does not. Without some way for something to guarantee-ably "happen after" the increment and "happen before" the decrement, there is no guarantee that any operation on any other thread will certainly see the incremented value.
This leaves one question: does the second operation always undo the first? That is, does the decrement undo the increment? That depends on the scalar type in question. ++ and -- are only defined for the pointer and integer specializations of atomic. So we only need to consider those.
For pointers, the decrement undoes the increment. The reason being that the only way incrementing+decrementing a pointer would not result in the same pointer to the same object is if incrementing the pointer was itself UB. That is, if the pointer is invalid, NULL, or is the past-the-end pointer to an object/array. But compilers don't have to consider UB cases since... they're undefined behavior. In all of the cases where incrementing is valid, pointer decrementing must also be valid (or UB, perhaps due to someone freeing the memory or otherwise making the pointer invalid, but again, the compiler doesn't have to care).
For unsigned integers, the decrement always undoes the increment, since wraparound behavior is well-defined for unsigned integers.
That leaves signed integers. C++ usually makes signed integer over/underflow into UB. Fortunately, that's not the case for atomic math; the standard explicitly says:
For signed integer types, arithmetic is defined to use two's complement representation. There are no undefined results.
Wraparound behavior for two's complement atomics works. That means increment/decrement always results in recovering the same value.
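A small single-threaded sketch of that wraparound guarantee, using hypothetical helper names (`increment_then_decrement`, `increment_only`). It only shows that the arithmetic is well-defined at the limits of `int`; it says nothing about what a compiler actually emits:

```cpp
#include <atomic>
#include <cassert>
#include <limits>

// Applies an atomic increment followed by an atomic decrement and
// returns the resulting value.
int increment_then_decrement(int start) {
    std::atomic<int> x{start};
    x.fetch_add(1, std::memory_order_relaxed);
    x.fetch_sub(1, std::memory_order_relaxed);
    return x.load(std::memory_order_relaxed);
}

// Applies only the atomic increment, to expose the intermediate value.
int increment_only(int start) {
    std::atomic<int> x{start};
    x.fetch_add(1, std::memory_order_relaxed);
    return x.load(std::memory_order_relaxed);
}
```

Starting from the largest `int`, the intermediate value wraps to the smallest `int`, and the decrement then restores the original value.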
So there does not appear to be anything in the standard which would prevent compilers from removing such operations. Again, regardless of the memory ordering.
Now, if you use non-relaxed memory ordering, the implementation cannot completely remove all traces of the atomics. The actual memory barriers behind those orderings still have to be emitted. But the barriers can be emitted without emitting the actual atomic operations.
In the C/C++ memory model, can a compiler just combine and then remove
redundant/NOP atomic modification operations,
No, the removal part is not allowed, at least not in the specific way the question suggests it would be allowed: the intent here is to describe valid source to source transformations, abstract tree to abstract tree, or rather a higher level description of the source code that encodes all the relevant semantic elements that might be needed for later phases of compilation.
The hypothesis is that code generation can be done on the transformed program, without ever checking with the original one. So only safe transformations that cannot break any code are allowed.
(Note: For weaker memory orders that still do something; for relaxed,
there is no real question here.)
No. Even that is wrong: for even relaxed operations, unconditional removal isn't a valid transformation (although in most practical cases it's certainly valid, but mostly correct is still wrong, and "true in >99% practical cases" has nothing to do with the question):
Before the introduction of standard threads, a stuck program was one caught in an infinite loop that performs no externally visible side effects: no input, no output, no volatile operation and, in practice, no system call. A program that will never perform anything visible is stuck and its behavior is not defined, which allows the compiler to assume that pure algorithms terminate: loops containing only invisible computations must exit somehow (which includes exiting with an exception).
With threads, that definition is obviously not usable: a loop in one thread isn't the whole program, and a stuck program is really one in which no thread can do anything useful; forbidding that would be sound.
But the very problematic standard definition of stuck doesn't describe a program execution but a single thread: a thread is stuck if it will perform no side effect that could potentially have an effect on observable side effects, that is:
1. no observable, obviously (no I/O)
2. no action that might interact with another thread
The standard definition of 2. is extremely broad and simplistic: all operations on an inter-thread communication device count, including any atomic operation and any action on any mutex. Full text for the requirement (relevant part in boldface):
[intro.progress]
The implementation may assume that any thread will eventually do one
of the following:
terminate,
make a call to a library I/O function,
perform an access through a volatile glvalue, or
perform a synchronization operation or an atomic operation.
[ Note: This is intended to allow compiler transformations such as removal of
empty loops, even when termination cannot be proven. — end note ]
That definition does not even specify:
an inter thread communication (from one thread to another)
a shared state (visible by multiple threads)
a modification of some state
an object that is not thread private
That means that all these silly operations count:
for fences:
performing an acquire fence (even when followed by no atomic operation) in a thread that has at least once done an atomic store can synchronize with another fence or atomic operation
for mutexes:
locking a locally recently created, patently useless function private mutex;
locking a mutex to just unlock it doing nothing with the mutex locked;
for atomics:
reading an atomic variable declared as const qualified (not a const reference to a non const atomic);
reading an atomic, ignoring the value, even with relaxed memory ordering;
setting a non const qualified atomic to its own immutable value (setting a variable to zero when nothing in the whole program sets it to a non zero value), even with relaxed ordering;
doing operations on a local atomic variable not accessible by other threads;
for thread operations:
creating a thread (that might do nothing) and joining it seems to create a (NOP) synchronization operation.
It means that an early, local transformation of program code that removes even the most silly and useless inter-thread primitive, and leaves no trace of the removal for later compiler phases, is never unconditionally valid according to the standard, because it might remove the last potentially useful (but actually useless) operation in a loop (a loop doesn't have to be spelled for or while; it is any looping construct, e.g. a backward goto).
This however doesn't apply if other operations on inter-thread primitives are left in the loop, or obviously if I/O is done.
This looks like a defect.
A meaningful requirement should be based:
not only on using thread primitives,
not be about any thread in isolation (as you can't see if a thread is contributing to anything, but at least a requirement to meaningfully interact with another thread and not use a private atomic or mutex would be better than the current requirement),
based on doing something useful (program observables) and inter-thread interactions that contribute to something being done.
I'm not proposing a replacement right now, as the rest of the thread specification isn't even clear to me.

Would a volatile variable be enough in this case? [duplicate]

A global variable is shared between 2 concurrently running threads on 2 different cores. The threads write to and read from the variable. If the variable is atomic, can one thread read a stale value? Each core might have a value of the shared variable in its cache, and when one thread writes to its copy in a cache, the other thread on a different core might read a stale value from its own cache. Or does the compiler enforce strong memory ordering so that the latest value is read from the other cache? The C++11 standard library has std::atomic support. How is this different from the volatile keyword? How will volatile and atomic types behave differently in the above scenario?
Firstly, volatile does not imply atomic access. It is designed for things like memory mapped I/O and signal handling. volatile is completely unnecessary when used with std::atomic, and unless your platform documents otherwise, volatile has no bearing on atomic access or memory ordering between threads.
If you have a global variable which is shared between threads, such as:
std::atomic<int> ai;
then the visibility and ordering constraints depend on the memory ordering parameter you use for operations, and the synchronization effects of locks, threads and accesses to other atomic variables.
In the absence of any additional synchronization, if one thread writes a value to ai then there is nothing that guarantees that another thread will see the value in any given time period. The standard specifies that it should be visible "in a reasonable period of time", but any given access may return a stale value.
The default memory ordering of std::memory_order_seq_cst provides a single global total order for all std::memory_order_seq_cst operations across all variables. This doesn't mean that you can't get stale values, but it does mean that the value you do get determines and is determined by where in this total order your operation lies.
If you have 2 shared variables x and y, initially zero, and have one thread write 1 to x and another write 2 to y, then a third thread that reads both may see either (0,0), (1,0), (0,2) or (1,2) since there is no ordering constraint between the operations, and thus the operations may appear in any order in the global order.
If both writes are from the same thread, which does x=1 before y=2 and the reading thread reads y before x then (0,2) is no longer a valid option, since the read of y==2 implies that the earlier write to x is visible. The other 3 pairings (0,0), (1,0) and (1,2) are still possible, depending how the 2 reads interleave with the 2 writes.
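The single-writer case above can be sketched as follows. The helper name `saw_forbidden_outcome` is made up for this example, and the loads and stores use the default `std::memory_order_seq_cst`. A test like this can only fail to find the forbidden pairing; it does not prove the guarantee, which comes from the standard:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the two-variable experiment `rounds` times and reports whether the
// pairing (x == 0, y == 2) was ever observed by the reading thread.
bool saw_forbidden_outcome(int rounds) {
    bool forbidden = false;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int rx = -1;
        int ry = -1;

        std::thread writer([&] {
            x.store(1);   // x = 1 first ...
            y.store(2);   // ... then y = 2, both seq_cst by default
        });
        std::thread reader([&] {
            ry = y.load();   // read y first ...
            rx = x.load();   // ... then x
        });

        writer.join();
        reader.join();
        if (rx == 0 && ry == 2) forbidden = true;
    }
    return forbidden;
}
```

The other three pairings, (0,0), (1,0) and (1,2), may all show up depending on how the threads interleave.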
If you use other memory orderings such as std::memory_order_relaxed or std::memory_order_acquire then the constraints are relaxed even further, and the single global ordering no longer applies. Threads don't even necessarily have to agree on the ordering of two stores to separate variables if there is no additional synchronization.
The only way to guarantee you have the "latest" value is to use a read-modify-write operation such as exchange(), compare_exchange_strong() or fetch_add(). Read-modify-write operations have an additional constraint that they always operate on the "latest" value, so a sequence of ai.fetch_add(1) operations by a series of threads will return a sequence of values with no duplicates or gaps. In the absence of additional constraints, there's still no guarantee which threads will see which values though. In particular, it is important to note that the use of an RMW operation does not force changes from other threads to become visible any quicker, it just means that if the changes are not seen by the RMW then all threads must agree that they are later in the modification order of that atomic variable than the RMW operation. Stores from different threads can still be delayed by arbitrary amounts of time, depending on when the CPU actually issues the store to memory (rather than just its own store buffer), physically how far apart the CPUs executing the threads are (in the case of a multi-processor system), and the details of the cache coherency protocol.
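The "no duplicates or gaps" property of `fetch_add` can be sketched like this (the helper names `collect_tickets` and `is_gap_free` are assumptions made for this example). Which thread receives which value is unspecified; only the combined set of returned values is checked:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread calls fetch_add(1) `per_thread` times and records the values
// it got back. Returns all recorded values, sorted.
std::vector<int> collect_tickets(int num_threads, int per_thread) {
    std::atomic<int> ai{0};
    std::vector<std::vector<int>> local(num_threads);  // one bucket per thread
    std::vector<std::thread> threads;

    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&ai, &local, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Relaxed is enough here: only the RMW atomicity matters.
                local[t].push_back(ai.fetch_add(1, std::memory_order_relaxed));
            }
        });
    }
    for (auto& th : threads) th.join();

    std::vector<int> all;
    for (const auto& bucket : local) {
        all.insert(all.end(), bucket.begin(), bucket.end());
    }
    std::sort(all.begin(), all.end());
    return all;
}

// True if the sorted values are exactly 0, 1, 2, ..., size-1.
bool is_gap_free(const std::vector<int>& sorted_values) {
    for (std::size_t i = 0; i < sorted_values.size(); ++i) {
        if (sorted_values[i] != static_cast<int>(i)) return false;
    }
    return true;
}
```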
Working with atomic operations is a complex topic. I suggest you read a lot of background material, and examine published code before writing production code with atomics. In most cases it is easier to write code that uses locks, and not noticeably less efficient.
volatile and the atomic operations have a different background, and
were introduced with a different intent.
volatile dates from way back, and is principally designed to prevent
compiler optimizations when accessing memory mapped IO. Modern
compilers tend to do no more than suppress optimizations for volatile,
although on some machines, this isn't sufficient for even memory mapped
IO. Except for the special case of signal handlers, and setjmp,
longjmp and getjmp sequences (where the C standard, and in the case
of signals, the Posix standard, gives additional guarantees), it must be
considered useless on a modern machine, where without special additional
instructions (fences or memory barriers), the hardware may reorder or
even suppress certain accesses. Since you shouldn't be using setjmp
et al. in C++, this more or less leaves signal handlers, and in a
multithreaded environment, at least under Unix, there are better
solutions for those as well. And possibly memory mapped IO, if you're
working on kernel code and can ensure that the compiler generates
whatever is needed for the platform in question. (According to the
standard, volatile access is observable behavior, which the compiler
must respect. But the compiler gets to define what is meant by
“access”, and most seem to define it as “a load or
store machine instruction was executed”. Which, on a modern
processor, doesn't even mean that there is necessarily a read or write
cycle on the bus, much less that it's in the order you expect.)
Given this situation, the C++ standard added atomic access, which does
provide a certain number of guarantees across threads; in particular,
the code generated around an atomic access will contain the necessary
additional instructions to prevent the hardware from reordering the
accesses, and to ensure that the accesses propagate down to the global
memory shared between cores on a multicore machine. (At one point in
the standardization effort, Microsoft proposed adding these semantics to
volatile, and I think some of their C++ compilers do. After
discussion of the issues in the committee, however, the general
consensus—including the Microsoft representative—was that it
was better to leave volatile with its original meaning, and to define
the atomic types.) Or just use the system level primitives, like
mutexes, which execute whatever instructions are needed in their code.
(They have to. You can't implement a mutex without some guarantees
concerning the order of memory accesses.)
Here's a basic synopsis of what the 2 things are:
1) Volatile keyword:
Tells the compiler that this value could alter at any moment and therefore it should not EVER cache it in a register. Look up the old "register" keyword in C. "Volatile" is basically the "-" operator to "register"'s "+". Modern compilers now do the optimization that "register" used to explicitly request by default, so you only see 'volatile' anymore. Using the volatile qualifier will guarantee that your processing never uses a stale value, but nothing more.
2) Atomic:
Atomic operations modify data indivisibly, so that it is impossible for ANY other thread to observe the data in the middle of such an update. They're usually limited to whatever the hardware can do in a single indivisible instruction; things like ++, --, and swapping 2 pointers. Note that this says nothing about the ORDER in which the different threads will RUN the atomic instructions, only that the operations never overlap. That's why you have all those additional options for forcing an ordering.
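To make the "indivisible update" point concrete, here is a minimal sketch of several threads incrementing one counter. The helper name count_in_parallel and its parameters are made up for illustration; the only library calls used are std::atomic<int>::fetch_add and std::thread.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: each of `threads` workers increments a shared counter
// `per_thread` times; the final total is returned.
int count_in_parallel(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Indivisible read-modify-write. Relaxed ordering is enough
                // here because the counter is the only shared data.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();  // join() orders all increments before the read below
    return counter.load(std::memory_order_relaxed);
}
```

Calling count_in_parallel(4, 1000) should always give 4000, whichever order the threads happen to run in. With a plain int the same loop would be a data race, and lost updates would be the likely symptom.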
Volatile and Atomic serve different purposes.
Volatile :
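Informs the compiler to avoid optimizing accesses to the variable. This keyword is used for variables that may change unexpectedly. So it can be used for hardware status registers and variables touched by an ISR. It has also traditionally been used for variables shared in a multi-threaded application, although it gives no atomicity or ordering guarantees there.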
Atomic :
It is also used in multi-threaded applications. However, it ensures that there is no lock/stall when it is used from several threads. Atomic operations are free of data races and indivisible. A few key usage scenarios are checking whether a lock is free or taken, atomically adding to a value and returning the result, etc.

In C11/C++11, possible to mix atomic/non-atomic ops on the same memory?

Is it possible to perform atomic and non-atomic ops on the same memory location?
I ask not because I actually want to do this, but because I'm trying to understand the C11/C++11 memory model. They define a "data race" like so:
The execution of a program contains a data race if it contains two
conflicting actions in different threads, at least one of which is not
atomic, and neither happens before the other. Any such data race
results in undefined behavior.
-- C11 §5.1.2.4 p25, C++11 § 1.10 p21
It's the "at least one of which is not atomic" part that is troubling me. If it weren't possible to mix atomic and non-atomic ops, it would just say "on an object which is not atomic."
I can't see any straightforward way of performing non-atomic operations on atomic variables. std::atomic<T> in C++ doesn't define any operations with non-atomic semantics. In C, all direct reads/writes of an atomic variable appear to be translated into atomic operations.
I suppose memcpy() and other direct memory operations might be a way of performing a non-atomic read/write on an atomic variable? ie. memcpy(&atomicvar, othermem, sizeof(atomicvar))? But is this even defined behavior? In C++, std::atomic is not copyable, so would it be defined behavior to memcpy() it in C or C++?
Initialization of an atomic variable (whether through a constructor or atomic_init()) is defined to not be atomic. But this is a one-time operation: you're not allowed to initialize an atomic variable a second time. Placement new or an explicit destructor call would also not be atomic. But in all of these cases, it doesn't seem like it would be defined behavior anyway to have a concurrent atomic operation that might be operating on an uninitialized value.
Performing atomic operations on non-atomic variables seems totally impossible: neither C nor C++ define any atomic functions that can operate on non-atomic variables.
So what is the story here? Is it really about memcpy(), or initialization/destruction, or something else?
I think you're overlooking another case, the reverse order. Consider an initialized int whose storage is reused to create an std::atomic_int. All atomic operations happen after its ctor finishes, and therefore on initialized memory. But any concurrent, non-atomic access to the now-overwritten int has to be barred as well.
(I'm assuming here that the storage lifetime is sufficient and plays no role)
I'm not entirely sure because I think that the second access to int would be invalid anyway as the type of the accessing expression int doesn't match the object's type at the time (std::atomic<int>). However, "the object's type at the time" assumes a single linear time progression which doesn't hold in a multi-threaded environment. C++11 in general has that solved by making such assumptions about "the global state" Undefined Behavior per se, and the rule from the question appears to fit in that framework.
So perhaps rephrasing: if a single memory location contains an atomic object as well as a non-atomic object, and if the destruction of the earliest created (older) object is not sequenced-before the creation of the other (newer) object, then access to the older object conflicts with access to the newer object unless the former is scheduled-before the latter.
disclaimer: I am not a parallelism guru.
Is it possible to mix atomic/non-atomic ops on the same memory, and if
so, how?
You can write it in the code and compile it, but it will probably yield undefined behaviour.
When talking about atomics, it is important to understand what kinds of problems they solve.
As you might know, what we loosely call "memory" is a multi-layered set of entities capable of holding data.
First we have the RAM, then the cache lines, then the registers.
On single-core processors we don't have any synchronization problems; on multi-core processors we have all of them. Every core has its own set of registers and cache lines.
This causes a few problems.
The first of them is memory reordering: the CPU may decide at run time to scramble some read/write instructions to make the code run faster. This may yield some strange results that are completely invisible in the high-level code that produced this set of instructions. The most classic example of this phenomenon is the "two threads, two integers" example:
int i=0;
int j=0;
thread a -> i=1, then print j
thread b -> j=1 then print i;
Logically, the result "00" should be impossible: if a finishes first the result may be "01", if b finishes first the result may be "10", and if both finish at the same time the result may be "11". Yet if you build a small program that imitates this situation and run it in a loop, you will very quickly see the result "00".
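A rough sketch of that experiment follows. It is not the exact program described above: plain ints would make it a data race, so this version uses std::atomic<int> and lets the caller choose between relaxed and sequentially consistent ordering. The name run_once and its bool parameter are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical helper: runs the two-thread experiment once and returns what
// thread a read from j and what thread b read from i.
std::pair<int, int> run_once(bool sequentially_consistent) {
    const std::memory_order order = sequentially_consistent
                                        ? std::memory_order_seq_cst
                                        : std::memory_order_relaxed;
    std::atomic<int> i{0}, j{0};
    int seen_by_a = -1, seen_by_b = -1;
    std::thread a([&] { i.store(1, order); seen_by_a = j.load(order); });
    std::thread b([&] { j.store(1, order); seen_by_b = i.load(order); });
    a.join();
    b.join();
    // Each value is 0 or 1. With relaxed ordering (0, 0) is a permitted
    // outcome; with seq_cst the single total order rules it out.
    assert(seen_by_a == 0 || seen_by_a == 1);
    assert(seen_by_b == 0 || seen_by_b == 1);
    return {seen_by_a, seen_by_b};
}
```

Running run_once(false) in a loop is the variant that can produce "00"; how often it shows up depends on the hardware and the compiler. run_once(true) never should.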
Another problem is memory invisibility. As I mentioned before, a variable's value may be cached in one of the cache lines or stored in one of the registers. When the CPU updates a variable's value, it may delay writing the new value back to RAM. It may keep the value in the cache/register because it was told (by the compiler's optimizations) that the value will be updated again soon, so to make the program faster it updates the value again and only then writes it back to RAM. This may cause undefined behaviour if another CPU (and consequently a thread or a process) depends on the new value.
For example, look at this pseudocode:
bool b = true;
while (b) -> print 'a'
new thread -> sleep 4 seconds -> b=false;
The character 'a' may be printed forever, because b may be cached and never updated.
There are many more problems when dealing with parallelism.
Atomics solve these kinds of issues by (in a nutshell) telling the compiler/CPU how to read and write data to/from RAM correctly without unwanted scrambling (read about memory orders). A memory order may force the CPU to write its values back to RAM, or to read values from RAM even if they are cached.
So, although you can mix non-atomic actions with atomic ones, you are only doing part of the job.
For example, let's go back to the second example:
atomic bool b = true;
while (reload b) print 'a'
new thread - > b = (non atomicly) false.
So although one thread re-reads the value of b from RAM again and again, the other thread may never write false back to RAM.
So although you can mix these kinds of operations in the code, it will yield undefined behavior.
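For contrast, a minimal sketch of the same loop with both sides atomic. The helper name spin_until_cleared and the delay parameter are invented for the example, and the delay is far shorter than the 4 seconds in the pseudocode. Since the store and the load both go through std::atomic<bool>, the loop must eventually see the new value.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper: spins until another thread clears the flag and
// returns how many times the loop body ran.
long spin_until_cleared(std::chrono::milliseconds delay) {
    std::atomic<bool> keep_running{true};
    std::thread stopper([&] {
        std::this_thread::sleep_for(delay);
        keep_running.store(false, std::memory_order_release);  // atomic write
    });
    long iterations = 0;
    while (keep_running.load(std::memory_order_acquire)) {     // atomic re-read
        ++iterations;
    }
    stopper.join();
    assert(!keep_running.load());
    return iterations;
}
```

The returned count is timing dependent; the point is only that the call returns at all, which the plain-bool version does not guarantee.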
I'm interested in this topic since I have code in which sometimes I need to access a range of addresses serially, and at other times to access the same addresses in parallel with some way of managing contention.
So not exactly the situation posed by the original question which (I think) implies concurrent, or nearly so, atomic and non atomic operations in parallel code, but close.
I have managed by some devious casting to persuade my C11 compiler to allow me to access an integer and much more usefully a pointer both atomically and non-atomically ("directly"), having established that both types are officially lock-free on my x86_64 system. My, possibly simplistic, interpretation of that is that the sizes of the atomic and non atomic types are the same and that the hardware can update such types in a single operation.
I definitely would not attempt to mix both types of access to an address in a parallel context; I think that would be doomed to fail. However, I have been successful in using "direct" syntax operations in serial code and "atomic" syntax in parallel code, giving me the best of both worlds: the fastest possible access (and much simpler syntax) in serial code, and safely managed contention in parallel code.
So you can do it so long as you don't try to mix both methods in parallel code and you stick to using lock-free types, which probably means up to the size of a pointer.

C++ memory_order_consume, kill_dependency, dependency-ordered-before, synchronizes-with

I am reading C++ Concurrency in Action by Anthony Williams. Currently I am at the point where he describes memory_order_consume.
After that block there is:
Now that I’ve covered the basics of the memory orderings, it’s time to look at the
more complex parts
It scares me a little bit, because I don't fully understand several things:
How does dependency-ordered-before differ from synchronizes-with? They both create a happens-before relationship. What is the exact difference?
I am confused about the following example:
int global_data[]={ … };
std::atomic<int> index;
void f()
{
int i=index.load(std::memory_order_consume);
do_something_with(global_data[std::kill_dependency(i)]);
}
What exactly does kill_dependency do? Which dependency does it kill? Between which entities? And how can the compiler exploit that knowledge?
Can all occurrences of memory_order_consume be safely replaced with memory_order_acquire? I.e. is it stricter in all senses?
At Listing 5.9, can I safely replace
std::atomic<int> data[5]; // all accesses are relaxed
with
int data[5]
? I.e. can acquire and release be used to synchronize access to non-atomic data?
He describes relaxed, acquire and release through examples with men in cubicles. Are there similarly simple descriptions of seq_cst and consume?
As to the next to last question, the answer takes a little more explanation. There are three things that can go wrong when multiple threads access the same data:
the system might switch threads in the middle of a read or write, producing a result that's half one value and half another.
the compiler might move code around, on the assumption that there is no other thread looking at the data that's involved.
the processor may be keeping a value in its local cache, without updating main memory after changing the value or re-reading it after another thread changed the value in main memory.
Memory order addresses only number 3. The atomic functions address 1 and 2, and, depending on the memory order argument, maybe 3 as well. So memory_order_relaxed means "don't bother with number 3"; the code still handles 1 and 2. In that case, you'd use acquire and release to ensure proper memory ordering.
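As a rough illustration of acquire and release guarding plain data, here is a sketch. It is not the book's listing; the function name sum_published_data and the single ready flag are assumptions made for the example. The int array is non-atomic, and the release store paired with the acquire load is what makes reading it safe.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: one thread fills a plain array and publishes it,
// another waits for the publication and sums the contents.
int sum_published_data() {
    int data[5] = {};                  // plain, non-atomic payload
    std::atomic<bool> ready{false};
    int sum = 0;

    std::thread producer([&] {
        for (int k = 0; k < 5; ++k) data[k] = k + 1;      // ordinary writes
        ready.store(true, std::memory_order_release);     // publish them
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // wait for publication
            std::this_thread::yield();
        }
        for (int k = 0; k < 5; ++k) sum += data[k];       // ordinary reads, no race
    });
    producer.join();
    consumer.join();
    assert(ready.load());
    return sum;  // 1 + 2 + 3 + 4 + 5
}
```

If both operations on ready were memory_order_relaxed, the reads of data would no longer be ordered after the writes, and the program would contain a data race.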
How dependency-ordered-before differs from synchronizes-with?
From 1.10/10: "[ Note: The relation “is dependency-ordered before” is analogous to “synchronizes with”, but uses release/consume in place of release/acquire. — end note ]".
What does kill_dependency exactly do?
Some compilers do data-dependency analysis. That is, they trace changes to values in variables in order to better figure out what has to be synchronized. kill_dependency tells such compilers not to trace any further because there's something going on in the code that the compiler wouldn't understand.
Can all occurrences of memory_order_consume be safely replaced with
memory_order_acquire? I.e. is it stricter in all senses?
I think so, but I'm not certain.
memory_order_consume requires that the atomic operation happens-before all non-atomic operations that are data dependent on it. A data dependency is any dependency where you cannot evaluate an expression without using that data. For example, in x->y, there is no way to evaluate x->y without first evaluating x.
kill_dependency is a unique function. All other functions have a data dependency on their arguments; kill_dependency explicitly does not. It shows up when you know that the data itself is already synchronized, but the expression you need to get to the data may not be synchronized. In your example, do_something_with is allowed to assume any cached value of global_data[i] is safe to use, but i itself must actually be the correct atomic value.
memory_order_acquire is strictly stronger if all changes to the data are properly released with a matching memory_order_release.
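A small sketch of how the pieces might fit together, assuming a pointer published with release and read with consume. Node, consume_example and table are hypothetical names. Note that mainstream compilers currently treat memory_order_consume as memory_order_acquire, so this shows the intended semantics rather than any measurable difference.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Node { int value; };  // hypothetical payload type

int consume_example() {
    static const int table[3] = {10, 20, 30};  // never modified, needs no synchronization
    Node node{-1};
    std::atomic<Node*> head{nullptr};

    std::thread writer([&] {
        node.value = 2;                                // plain write
        head.store(&node, std::memory_order_release);  // publish the pointer
    });

    Node* p = nullptr;
    while (!(p = head.load(std::memory_order_consume))) {
        std::this_thread::yield();
    }
    int i = p->value;  // depends on the loaded pointer, so it is ordered after the write
    // The table is immutable, so the dependency chain can end here.
    int result = table[std::kill_dependency(i)];
    writer.join();
    return result;
}
```

Replacing memory_order_consume with memory_order_acquire in the load keeps the example correct; it just orders everything after the load, not only the expressions that depend on p.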