C++11 Atomic memory order with non-atomic variables

I am unsure about how the memory ordering guarantees of atomic variables in C++11 affect operations on other memory.
Let's say I have one thread which periodically calls the write function to update a value, and another thread which calls read to get the current value. Is it guaranteed that the effects of d = value; will not be seen before the effects of a = version;, and will be seen before the effects of b = version;?
#include <atomic>
using std::atomic;

atomic<int> a {0};
atomic<int> b {0};
double d;

void write(int version, double value) {
    a = version;
    d = value;
    b = version;
}

double read() {
    int x, y;
    double ret;
    do {
        x = b;
        ret = d;
        y = a;
    } while (x != y);
    return ret;
}

The rule is that, given a write thread that executes once, and nothing else that modifies a, b, or d:
you can read a and b from a different thread at any time, and
if you read b and see version stored in it, then
you can read d; and
what you read will be value.
Note that whether the second part is true depends on the memory ordering; it is true with the default one (memory_order_seq_cst).

Your object d is written and read by two threads and it's not atomic. This is unsafe, as suggested in the C++ standard on multithreading:
1.10/4: Two expression evaluations conflict if one of them modifies a memory location and the other one accesses or modifies the same memory location.
1.10/21: The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
Important edit:
In your non-atomic case, you have no guarantees about the ordering between the reading and the writing. You don't even have a guarantee that the reader will read a value that was written by the writer (this short article explains the risk for non-atomic variables).
Nevertheless, your reader's loop finishes based on a test of the surrounding atomic variables, for which there are strong guarantees. Assuming that version never repeats between different calls of the writer, and given the reverse order in which you acquire their values:
the order of the d read relative to the d write can't be unfortunate if the two atomics are equal.
similarly, the value read can't be inconsistent if the two atomics are equal.
This means that in case of an adverse race condition on your non-atomic variable, thanks to the loop, you'll end up reading the last value.
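If the formal data race on d is a concern, a minimal fix (my sketch, not part of either answer) is to make d itself an atomic accessed with relaxed ordering; the sequentially consistent operations on a and b still provide all the ordering the scheme needs:
#include <atomic>

std::atomic<int> a{0}, b{0};
std::atomic<double> d{0.0};  // now atomic: the racy access is no longer UB

void write(int version, double value) {
    a = version;                                // seq_cst store
    d.store(value, std::memory_order_relaxed);  // sequenced before the release part of the b store
    b = version;                                // seq_cst store
}

double read() {
    int x, y;
    double ret;
    do {
        x = b;                                  // seq_cst load; acquire pairs with the b store
        ret = d.load(std::memory_order_relaxed);
        y = a;                                  // seq_cst load
    } while (x != y);
    return ret;
}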

Is it guaranteed that the effects of d = value; will not be seen before the effects of a = version;, and will be seen before the effects of b = version;?
Yes, it is. This is because a sequential-consistency barrier is implied when reading or writing an atomic<> variable.
Instead of storing a version tag into two atomic variables before and after the value's modification, you can increment a single atomic variable before and after the modification:
#include <atomic>
using std::atomic;

atomic<int> a = {0};
double d;

void write(double value)
{
    a = a + 1; // 'a' becomes odd (note: a = a + 1 is a separate load and store, not one atomic RMW; safe only with a single writer)
    d = value; // or other modification of the protected value(s)
    a = a + 1; // 'a' becomes even again, and differs from its value before the modification
}

double read(void)
{
    int x;
    double ret;
    do
    {
        x = a;
        ret = d; // or other read of the protected value(s)
    } while ((x & 1) || (x != a)); // retry while a write is in progress or 'a' changed
    return ret;
}
This is known as a seqlock in the Linux kernel: http://en.wikipedia.org/wiki/Seqlock
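Since d in the snippet above is still a plain double, the reader's ret = d formally races with the writer. Here is a sketch (mine, following the fence-based formulation described in the literature on seqlocks; the names SeqLock, seq_, and data_ are invented for this example) that keeps the same idea but avoids the data race by making the payload a relaxed atomic:
#include <atomic>

class SeqLock {
    std::atomic<unsigned> seq_{0};    // odd while a write is in progress
    std::atomic<double>   data_{0.0}; // relaxed atomic: the racy read is well-defined

public:
    void write(double value) { // single writer assumed, as in the answer
        unsigned s = seq_.load(std::memory_order_relaxed);
        seq_.store(s + 1, std::memory_order_relaxed);        // counter becomes odd
        std::atomic_thread_fence(std::memory_order_release); // orders the counter store before the data store
        data_.store(value, std::memory_order_relaxed);
        seq_.store(s + 2, std::memory_order_release);        // even again, and different from before
    }

    double read() const {
        unsigned s1, s2;
        double v;
        do {
            s1 = seq_.load(std::memory_order_acquire);
            v  = data_.load(std::memory_order_relaxed);
            std::atomic_thread_fence(std::memory_order_acquire); // keeps the data load before the re-check
            s2 = seq_.load(std::memory_order_relaxed);
        } while ((s1 & 1) || s1 != s2); // retry during or across a write
        return v;
    }
};
If the reader's data load observed a value written after the writer's release fence, the fences synchronize and the re-check is guaranteed to see the odd counter, forcing a retry.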

Related

Does this C++ sample code contain a data race?

Suppose there are no compiler reorderings.
#include <cstdint>

void do_something(int32_t); // assumed declared elsewhere

int32_t value;
int32_t flag = 0;

// thread 1
void UpdateValue(int32_t x) {
    value = x;
    flag = 1;
}

// thread 2
void DoSomething() {
    while (flag == 0);
    do_something(value);
}
According to https://en.cppreference.com/w/cpp/language/memory_model, the evaluations flag = 1 and flag == 0 conflict.
And:
flag is not an atomic variable;
there is no signal handler;
flag = 1 doesn't happen before flag == 0.
So is there a data race?
But in this sample code, every read/write is atomic (4-byte aligned).
I don't find any undefined behavior and I'm confused...
The data race here is UB, and you can expect any behavior, including the one that you expect.
Considering the order in which different threads may read/write that location helps to understand why it is UB:
std::memory_order specifies how memory accesses, including regular,
non-atomic memory accesses, are to be ordered around an atomic
operation. Absent any constraints on a multi-core system, when
multiple threads simultaneously read and write to several variables,
one thread can observe the values change in an order different from
the order another thread wrote them. Indeed, the apparent order of
changes can even differ among multiple reader threads. Some similar
effects can occur even on uniprocessor systems due to compiler
transformations allowed by the memory model.
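For reference, a minimal well-defined version of the sample (my sketch, not from the question): publish value with a release store to an atomic flag and spin with acquire loads on the consumer side, which establishes the missing happens-before edge:
#include <atomic>
#include <cstdint>

void do_something(int32_t); // assumed defined elsewhere

std::atomic<int32_t> flag{0};
int32_t value;

// thread 1
void UpdateValue(int32_t x) {
    value = x;                                // plain write, sequenced before the store below
    flag.store(1, std::memory_order_release); // publishes value
}

// thread 2
void DoSomething() {
    while (flag.load(std::memory_order_acquire) == 0)
        ;                                     // acquire pairs with the release store
    do_something(value);                      // guaranteed to see x
}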

Difference between Interlocked, InterlockedAcquire, and InterlockedRelease if single thread reordering is impossible

In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is: if a function is structured such that the compiler can't reorder any of the code anyway, because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
#include <windows.h> // Interlocked*, OutputDebugString
#include <cstdlib>   // malloc, free
#include <cstring>   // memset

#define MAXQUEUESIZE 256 // assumed for this snippet

template <typename T>
class Queue {
private:
    LONG64 headIndex; // 64-bit: InterlockedCompareExchange64 requires a LONG64 operand (long is 32-bit on Windows)
    LONG64 tailIndex;
    T* array[MAXQUEUESIZE];
public:
    Queue() {
        headIndex = 0;
        tailIndex = 0;
        memset(array, 0, MAXQUEUESIZE * sizeof(void*));
    }
    ~Queue() {
    }
    bool produce(T value) {
        //1) prevents concurrent calls to produce() from causing corruption:
        LONG64 indexRetVal;
        LONG64 reservedIndex;
        do {
            reservedIndex = tailIndex;
            indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
        } while (indexRetVal != reservedIndex);
        //2) allocates the node.
        T* newValPtr = (T*) malloc(sizeof(T));
        if (newValPtr == nullptr) {
            OutputDebugString("Queue: malloc returned null");
            return false;
        }
        *newValPtr = value;
        //3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
        T* valPtrRetVal = (T*) InterlockedCompareExchangePointer((PVOID volatile*)(array + reservedIndex), newValPtr, nullptr);
        //if the previous value wasn't null, then our circular buffer overflowed:
        if (valPtrRetVal != nullptr) {
            OutputDebugString("Queue: circular buffer overflowed");
            free(newValPtr); //as pointed out by RbMm
            return false;
        }
        //otherwise, everything worked fine
        return true;
    }
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyway, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2), and I should let the compiler decide.
My question is: if a function is structured such that the compiler can't reorder any of the code anyway, because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
    b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
    b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why? Because the two bits of code perform the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So your belief that reordering any of the code would affect single-threaded behavior is most likely incorrect. There are lots of ways to reorder and optimize code that don't affect single-threaded behavior.
There is a document on MSDN that explains the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The InterlockedXxx routines have both acquire and release semantics by default.
More specifically, for 4 values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer by @antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.
All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place where reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the compiler to emit instructions with flags or prefixes that control reordering within the CPU.
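In C++ <atomic> terms (my approximate mapping, not from the MSDN page), the three flavors correspond to different memory_order arguments on the same atomic read-modify-write; the full-barrier Interlocked variant is at least as strong as the seq_cst version shown here:
#include <atomic>

std::atomic<long> n{0};

// Roughly InterlockedIncrementAcquire: later operations stay after the RMW.
long increment_acquire() { return n.fetch_add(1, std::memory_order_acquire) + 1; }

// Roughly InterlockedIncrementRelease: earlier operations stay before the RMW.
long increment_release() { return n.fetch_add(1, std::memory_order_release) + 1; }

// Roughly InterlockedIncrement: ordered in both directions.
long increment_full()    { return n.fetch_add(1, std::memory_order_seq_cst) + 1; }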

Independent Read-Modify-Write Ordering

I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:
#include <thread>
#include <atomic>
#include <iostream>
#include <cassert>

struct RMW_Ordering
{
    std::atomic<bool> flag {false};
    std::atomic<unsigned> done {0}, counter {0};
    unsigned race_cancel {0}, race_success {0}, sum {0};

    void thread1() // fail
    {
        race_cancel = 1; // data produced
        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            counter.store(0, std::memory_order_relaxed);
            done.store(1, std::memory_order_relaxed);
        }
    }

    void thread2() // success
    {
        race_success = 1; // data produced
        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            done.store(2, std::memory_order_relaxed);
        }
    }

    void thread3()
    {
        while (!done.load(std::memory_order_relaxed)); // livelock test
        counter.exchange(0, std::memory_order_acquire);
        sum = race_cancel + race_success;
    }
};

int main()
{
    for (unsigned i = 0; i < 1000; ++i)
    {
        RMW_Ordering test;
        std::thread t1([&]() { test.thread1(); });
        std::thread t2([&]() { test.thread2(); });
        std::thread t3([&]() { test.thread3(); });
        t1.join();
        t2.join();
        t3.join();
        assert(test.counter == 0);
    }
    std::cout << "Done!" << std::endl;
}
Two threads race to enter a protected region and the last one modifies done, releasing a third thread from an infinite loop. The example is a bit contrived but the original code needs to claim this region through the flag to signal "done".
Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, potentially causing one thread to claim the flag, attempt the fetch_add check first, and prevent the other thread (which gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I figured I'd see whether the livelock I expected would take place if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.
I tried to find any rules regarding this in the C++ standard but only managed to dig up these:
1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations,
which have special characteristics.
29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated
with the read-modify-write operation.
Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?
EDIT:
I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:
int main()
{
    atomic_int x = 0; atomic_int y = 0;
    {{{
        {
            if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
            {
                cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
            }
        }
        |||
        {
            if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
            {
                cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
            }
        }
        |||
        {
            // Is it possible for x and y to read 2 and 1, or 1 and 2?
            x.load(relaxed).readsvalue(2);
            y.load(relaxed).readsvalue(1);
        }
    }}}
    return 0;
}
I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:
#include "relacy/relacy_std.hpp"
struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
rl::atomic<unsigned> x, y;
void before()
{
x($) = y($) = 0;
}
void thread(unsigned tid)
{
if (tid == 0)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
}
}
else if (tid == 1)
{
unsigned exp1 = 0;
if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
{
unsigned exp2 = 0;
y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
}
}
else
{
while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
RL_ASSERT(x($) == y($));
}
}
};
int main()
{
rl::simulate<rmw_experiment>();
}
The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.
I haven't fully grokked your code yet, but the bolded question has a straightforward answer:
Can I always rely on RMW operations not being reordered - even if they affect different memory locations
No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)
But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.
Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.
Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.
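A small sketch (mine, not from the answer) of the two points above: the relaxed RMWs may be reordered with each other at compile time, but neither may cross the acquire load before them or the release store after them, because each RMW contains both a load and a store:
#include <atomic>

std::atomic<int> gate{0}, r1{0}, r2{0};

void example() {
    while (gate.load(std::memory_order_acquire) == 0) {} // acquire: the RMWs below cannot move above this
    r1.fetch_add(1, std::memory_order_relaxed);          // these two relaxed RMWs may be
    r2.fetch_add(1, std::memory_order_relaxed);          // reordered with each other
    gate.store(2, std::memory_order_release);            // release: the RMWs above cannot move below this
}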
In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because the && short circuits - i.e., if the first expression resolves to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless of which memory order they use).
As Peter Cordes already explained, the C++ standard says nothing about if or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if rule:
The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].
This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the
observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.
The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.
std::atomic<int> x, y;
x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B
A sequence-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since the two operations are relaxed, there is no guarantee about the "observable behavior", i.e., even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler would be free to reorder them, since the final program would not be able to tell the difference (i.e., the observable behavior is equivalent). If it would not be equivalent, then you would have a race condition! ;-)
To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) would use memory_order_acquire, then the compiler would have to assume that this operation synchronizes-with some release operation, thus establishing a (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:
y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D
If the acquire-load A sees the value stored by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before y.load, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore also change the observable behavior. Thus, the compiler cannot perform such reorderings.
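Putting the four operations together (my arrangement of the answer's A/B/C/D into a runnable program): if A reads the value stored by D, the assertion in the reader can never fire:
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    y.store(42, std::memory_order_relaxed); // C
    x.store(1, std::memory_order_release);  // D
}

void reader() {
    if (x.load(std::memory_order_acquire) == 1)          // A: synchronizes with D
        assert(y.load(std::memory_order_relaxed) == 42); // B: C happens-before B
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}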
In general, arguing about possible reorderings is the wrong approach. In a first step you should always identify your required happens-before relations (e.g., the y.store has to happen before the y.load). The next step is then to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.
Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions but chooses not to, you will not be able to identify this with Relacy.

Reordering and memory_order_relaxed

Cppreference gives the following example about memory_order_relaxed:
Atomic operations tagged memory_order_relaxed are not synchronization
operations, they do not order memory. They only guarantee atomicity
and modification order consistency.
It then explains that, with x and y initially zero, this example code
// Thread 1:
r1 = y.load(memory_order_relaxed); // A
x.store(r1, memory_order_relaxed); // B
// Thread 2:
r2 = x.load(memory_order_relaxed); // C
y.store(42, memory_order_relaxed); // D
is allowed to produce r1 == r2 == 42 because:
Although A is sequenced-before B within thread 1 and C is sequenced-before D in thread 2,
Nothing prevents D from appearing before A in the modification order of y, and B from appearing before C in the modification order of x.
Now my question is: if A and B can't be reordered within thread 1 and, similarly, C and D within thread 2 (since each of those is sequenced-before within its thread), aren't points 1 and 2 in contradiction? In other words, with no reordering (as point 1 seems to require), how is the scenario in point 2, visualized below, even possible?
T1 ........... T2
.............. D(y)
A(y)
B(x)
.............. C(x)
Because in this case C would not be sequenced-before D within thread 2, as point 1 demands.
with no reordering (as point 1 seems to require)
Point 1 does not mean "no reordering". It means sequencing of events within a thread of execution. The compiler will issue the CPU instructions for A before B, and for C before D (although even that may be subverted by the as-if rule), but the CPU has no obligation to execute them in that order, caches/write buffers/invalidation queues have no obligation to propagate them in that order, and memory has no obligation to be uniform.
(individual architectures may offer those guarantees though)
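For instance (my illustration of the as-if point above, not from the original answer), because both of thread 2's operations are relaxed and touch different objects, a compiler is permitted to emit thread 2 as if the store had been hoisted above the load:
// Thread 2, as the compiler may legally transform it:
y.store(42, std::memory_order_relaxed); // D, hoisted above C
r2 = x.load(std::memory_order_relaxed); // C
After this transformation, r1 == r2 == 42 needs no hardware reordering at all.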
Your interpretation of the text is wrong. Let's break this down:
Atomic operations tagged memory_order_relaxed are not synchronization operations, they do not order memory
This means that these operations make no guarantees regarding the order of events. As explained prior to that statement in the original text, multithreaded processors are allowed to reorder operations within a single thread. This can affect the write, the read, or both. Additionally, the compiler is allowed to do the same thing at compile time (mostly for optimization purposes). To see how this relates to the example, suppose we don't use atomic types at all, but we do use primitive types that are atomic by design (an 8-bit value...). Let's rewrite the example:
#include <cstdint>

// Somewhere...
uint8_t y, x;
// Thread 1:
uint8_t r1 = y; // A
x = r1; // B
// Thread 2:
uint8_t r2 = x; // C
y = 42; // D
Considering that both the compiler and the CPU are allowed to reorder operations in each thread, it's easy to see how r1 == r2 == 42 is possible.
The next part of the statement is:
They only guarantee atomicity and modification order consistency.
This means the only guarantee is that each operation is atomic, that is, it is impossible for an operation to be observed midway through. What this means is that if x is an atomic<SomeComplexType>, it's impossible for one thread to observe x as having a value in between states.
It should already be clear where that can be useful, but let's examine a specific example (for demonstration purposes only; this is not how you'd want to code):
class SomeComplexType {
public:
    int size;
    int *values;
};

atomic<SomeComplexType> x; // the shared variable the two threads operate on

// Thread 1:
SomeComplexType r = x.load(memory_order_relaxed);
if (r.size > 3)
    r.values[2] = 123;

// Thread 2:
SomeComplexType a, b;
a.size = 10; a.values = new int[10];
b.size = 0;  b.values = nullptr;
x.store(a, memory_order_relaxed);
x.store(b, memory_order_relaxed);
What the atomic type does for us is guarantee that r in thread 1 is not an object in between states; specifically, that its size and values members are in sync.
According to the STR analogy from this post: C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?, I've created a visualization of what can happen here (as I understand it) as follows:
Thread 1 first sees y=42, then it performs r1=y, and after it x=r1. Thread 2 first sees x=r1 being already 42, then it performs r2=x, and after it y=42.
Lines represent "views" of memory by individual threads. These lines/views cannot cross for a particular thread. But, with relaxed atomics, lines/views of one thread can cross these of other threads.
EDIT:
I guess this is the same as with the following program:
atomic<int> x{0}, y{0};
// thread 1:
x.store(1, memory_order_relaxed);
cout << x.load(memory_order_relaxed) << y.load(memory_order_relaxed);
// thread 2:
y.store(1, memory_order_relaxed);
cout << x.load(memory_order_relaxed) << y.load(memory_order_relaxed);
which can produce 01 and 10 on the output (such an output could not happen with SC atomic operations).
Looking exclusively at the C++ memory model (not talking about compiler or hardware reordering), the only execution that leads to r1 = r2 = 42 is shown in an execution graph (not reproduced here) in which r1 is replaced with a and r2 with b.
As usual, sb stands for sequenced-before and is simply the intra-thread ordering (the order in which the instructions appear in the source code). The rf edges are reads-from edges and mean that the read/load on one end reads the value written/stored on the other end.
The loop involving both sb and rf edges (highlighted in green in the graph) is necessary for the outcome: y is written in one thread, which is read in the other thread into a and from there written to x, which is read in the former thread again into b (which is sequenced-before the write to y).
There are two reasons why a constructed graph like this might be impossible: causality, or an rf edge that reads a hidden side effect. In this case the latter is impossible because we only write once to each variable, so clearly one write cannot be hidden (overwritten) by another write.
To answer the causality question we apply the following rule: a loop is disallowed (impossible) when it involves a single memory location and the direction of the sb edges is the same everywhere in the loop (the direction of the rf edges is not relevant in that case); or when the loop involves more than one variable, all edges (sb and rf) are in the same direction, and at most one of the variables has one or more rf edges between different threads that are not release/acquire.
In this case the loop exists, two variables are involved (one rf edge for x and one rf edge for y), all edges are in the same direction, but two variables have a relaxed/relaxed rf edge (namely x and y). Therefore there is no causality violation, and this is an execution that is consistent with the C++ memory model.

non-destructive atomic add?

If I have an atomic variable like so:
#include <atomic>
std::atomic<int> a{5}; // note: "std::atomic<int> a = 5;" only compiles since C++17
I'd like to atomically check whether (a + 4) is less than another variable, without overwriting the original value of a:
if(a.something(4) < another_variable){
//Do not want a to be incremented by 4 at this point
}
I did a quick test with fetch_add() and ++, and they all seem to increase the value of variable a afterwards. Is there a way I can atomically increment to test, without the result being permanent?
if(a + 4 < another_variable) // ...
This is the best you can get with a single atomic. You are data-race free, as the reading of the atomic is safe against concurrent writes, and all subsequent operations happen on a copy of the original atomic value. A more verbose but functionally equivalent version would be:
int const copy_of_a = a.load();
if(copy_of_a + 4 < another_variable) // ...
This is also the best you can get in terms of synchronization. You may be worried about the fact that a may be changed on another thread to a value that will change the outcome of the if.
Assume there was a function that did the whole operation atomically:
if (a.plus4IsLessThan(another_variable)) // ...
Then whether a concurrent change of a arrives in time to change the outcome of the test is still not known. You did not gain any additional guarantees in terms of synchronization.
If this is a problem for your program, it indicates that you are in need of a more powerful synchronization mechanism. Probably a std::mutex would be a good start.
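If what the program actually needs is "add 4 only when the result stays below the limit" as one atomic step, the standard lock-free alternative to a mutex is a compare-exchange loop. A sketch (the function name add4_if_below is invented for this example):
#include <atomic>

// Atomically performs: if (a + 4 < limit) { a += 4; return true; } return false;
bool add4_if_below(std::atomic<int>& a, int limit) {
    int cur = a.load();
    while (cur + 4 < limit) {
        if (a.compare_exchange_weak(cur, cur + 4))
            return true; // cur + 4 was installed atomically
        // on failure, cur has been reloaded with the current value; re-test the condition
    }
    return false;
}
Note that, unlike the hypothetical plus4IsLessThan, this does modify a when the test passes; a pure read can always be done on a copy, as shown above.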
You can just do:
if (a + 4 < another_variable) { ...
Which should be identical to:
if (a.load() + 4 < another_variable) { ...
By definition (§29.6.5/16-17, here A "refers to one of the atomic types" and "C refers to its corresponding non-atomic type"):
A::operator C() const volatile noexcept;
A::operator C() const noexcept;
Effects: load()
Returns: The result of load()
Neither of these modifies a.