Do dependent reads require a load-acquire? - c++

Does the following program expose a data race, or any other concurrency concern?
#include <cstdio>
#include <cstdlib>
#include <atomic>
#include <thread>

class C {
public:
    int i;
};

std::atomic<C *> c_ptr{};

int main(int, char **) {
    std::thread t([] {
        auto read_ptr = c_ptr.load(std::memory_order_relaxed);
        if (read_ptr) {
            // does the following dependent read race?
            printf("%d\n", read_ptr->i);
        }
    });
    c_ptr.store(new C{rand()}, std::memory_order_release);
    t.join(); // without this, destroying a joinable thread calls std::terminate
    return 0;
}
Godbolt
I am interested in whether reads through pointers need load-acquire semantics when loading the pointer, or whether the dependent-read nature of reads through pointers makes that ordering unnecessary. If it matters, assume arm64, and please describe why it matters, if possible.
I have tried searching for discussions of dependent reads and haven't found any explicit recognition of their implicit load-reordering-barriers. It looks safe to me, but I don't trust my understanding enough to know it's safe.

Your code is not safe, and can break in practice with real compilers for DEC Alpha AXP (which can violate causality via tricky cache bank shenanigans IIRC).
As far as the ISO C++ standard is concerned, no: nothing creates a happens-before relationship between the initialization of the int and the read in the other thread, so in the abstract machine this is a data race.
But in practice, C++ compilers implement a release store the same way regardless of context; they don't analyze the whole program for the existence of a possible reader with at least consume ordering.
So in some, but not all, concrete compilations to asm for real machines, this will work, because compilers don't go looking for this kind of UB in order to break the code on purpose with fancy inter-thread analysis of all possible reads and writes of that atomic variable.
DEC Alpha famously could break this code: it doesn't guarantee dependency ordering in asm, so it needs real barriers even for memory_order_consume, unlike (as far as I know) every other ISA.
Given the current deprecated state of consume, the only way to get efficient asm on ISAs that do have dependency ordering (i.e. not Alpha) but don't give you acquire for free (i.e. not x86) is to write code like this. The Linux kernel does exactly that in practice, e.g. for RCU.
That requires keeping it simple enough that compilers can't break the dependency ordering by e.g. proving that any non-NULL read_ptr would have a specific value, like the address of a static variable.
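As an illustration (a hypothetical sketch, not from the question): if the compiler can prove that the only possible non-null value of the pointer is the address of some known object, it can replace the dependent load with a direct access to that object, turning the data dependency into a control dependency that a weakly-ordered CPU is free to speculate past:

#include <atomic>

class C { public: int i; };
C global_c;                      // hypothetical: the only C that ever gets published
std::atomic<C *> c_ptr{};

int reader() {
    C *p = c_ptr.load(std::memory_order_relaxed);
    if (p == &global_c) {        // comparing against a known address...
        // ...lets the compiler load global_c.i directly, with no data
        // dependency on p. The CPU may then execute that load before the
        // load of c_ptr, and you can see a stale / uninitialized i.
        return p->i;
    }
    return -1;
}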
See also
What does memory_order_consume really do?
C++11: the difference between memory_order_relaxed and memory_order_consume
Memory order consume usage in C11 - more about the hardware mechanism / guarantee that consume is intended to expose to software. Out-of-order exec can only reorder independent work anyway, not start a load before the load address is known, so on most CPUs enforcing dependency ordering happens for free anyway: only a few models of DEC Alpha could violate causality and effectively load data from before it had the pointer that gave it the address.
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0371r1.html - and other C++ wg21 documents linked from that about why consume is discouraged.
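For reference, if you don't want to gamble on any of the above, the portable fix is a one-line change in the question's reader thread (a sketch); on arm64 an acquire load is just ldar instead of ldr, which is cheap:

auto read_ptr = c_ptr.load(std::memory_order_acquire); // pairs with the release store
if (read_ptr) {
    // acquire/release creates a happens-before edge, so this read is
    // guaranteed to see the fully-constructed C on every ISA, Alpha included.
    printf("%d\n", read_ptr->i);
}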

Related

Can two threads write to the same place without synchronization? [duplicate]

Can multiple threads write the same value to the same variable at the same time safely?
For a specific example — is the below code guaranteed by the C++ standard to compile, run without undefined behavior and print "true", on every conforming system?
#include <cstdio>
#include <thread>

int main()
{
    bool x = false;
    std::thread one{[&]{ x = true; }};
    std::thread two{[&]{ x = true; }};
    one.join();
    two.join();
    std::printf(x ? "true" : "false");
}
This is a theoretical question; I want to know whether it definitely always works rather than whether it works in practice (or whether writing code like this is a good idea :)). I'd appreciate if someone could point to the relevant part of the standard. In my experience it always works in practice, but not knowing whether or not it's guaranteed to work I always use std::atomic instead - I'd like to know whether that's strictly necessary for this specific case.
No.
You need to synchronize access to those variables, either by using mutexes or by making them atomic.
There is no exemption for when the same value is being written. You don't know what steps are involved in writing that value (which is the underlying practical concern), and neither does the standard, which is why the code has undefined behaviour … which means your compiler can wreak absolute mayhem on your program (and that's the real issue you need to avoid).
Someone's going to come along and tell you that such-and-such an architecture guarantees atomic writes to these sized variables. But that doesn't change the UB aspect.
The passages you're looking for are:
[intro.races/2]: Two expression evaluations conflict if one of them modifies a memory location ([intro.memory]) and the other one reads or modifies the same memory location.
[intro.races/21]: […] The execution of a program contains a data race if it contains two potentially concurrent conflicting actions, […]. Any such data race results in undefined behavior.
… and the surrounding wording. That section is actually quite esoteric, but you don't really need to parse it as this is a classic, textbook data race that you can read about in any book on programming.
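For completeness, here's a minimal race-free variant of the example (a sketch): making x atomic removes the UB, and relaxed ordering suffices here because join() already synchronizes the threads' writes with the final read:

#include <atomic>
#include <cstdio>
#include <thread>

int main()
{
    std::atomic<bool> x{false};
    std::thread one{[&]{ x.store(true, std::memory_order_relaxed); }};
    std::thread two{[&]{ x.store(true, std::memory_order_relaxed); }};
    one.join();
    two.join();
    // Each join() makes that thread's writes happen-before this read,
    // so printing "true" is now guaranteed.
    std::printf(x.load(std::memory_order_relaxed) ? "true" : "false");
}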
Lightness is correct and spot-on from a standards perspective.
But I'll offer another reason why this is not a good idea, from a hardware architecture perspective.
Without synchronization (an atomic, a mutex, etc.), you can run into what's known as the cache coherency problem. On a multi-core or multi-processor machine, your two threads could both set x to true, yet your main thread could potentially print false even if your compiler didn't stash x in a register: the store may not yet have become visible to the core running the main thread. The atomic types and lock guards provided by C++ (along with countless OS primitives) are implemented to solve this issue.
In any case, search for Cache Coherence Problem and Cache Coherence Multicore. For a particular architecture's implementation of atomic transactions, look up the Intel LOCK prefix.

c++11 register cache thread safety

In volatile: The Multithreaded Programmer's Best Friend, Andrei Alexandrescu gives this example:
class Gadget
{
public:
    void Wait()
    {
        while (!flag_)
        {
            Sleep(1000); // sleeps for 1000 milliseconds
        }
    }
    void Wakeup()
    {
        flag_ = true;
    }
    ...
private:
    bool flag_;
};
he states,
... the compiler concludes that it can cache flag_ in a register ... it harms correctness: after you call Wait for some Gadget object, although another thread calls Wakeup, Wait will loop forever. This is because the change of flag_ will not be reflected in the register that caches flag_.
then he offers a solution:
If you use the volatile modifier on a variable, the compiler won't cache that variable in registers — each access will hit the actual memory location of that variable.
Now, other people have mentioned on Stack Overflow and elsewhere that the volatile keyword doesn't really offer any thread-safety guarantees, and that I should use std::atomic or mutex synchronization instead, which I do agree with.
However, going the std::atomic route, which internally uses read-acquire and write-release semantics (Acquire and Release Semantics), I don't see how it actually fixes the register-caching problem in particular.
On x86/64, for example, every load already implies acquire semantics and every store implies release semantics, so compiled code on x86 doesn't emit any actual memory barrier instructions at all. (The Purpose of memory_order_consume in C++11)
g = Guard.load(memory_order_acquire);
if (g != 0)
    p = Payload;
On Intel x86-64, the Clang compiler generates compact machine code for this example – one machine instruction per line of C++ source code. This family of processors features a strong memory model, so the compiler doesn’t need to emit special memory barrier instructions to implement the read-acquire.
So, assuming x86 for now, how does std::atomic solve the register-caching problem? With no memory barrier instructions emitted for a read-acquire, the compiled code seems to be the same as for a plain read.
Did you notice that there was no load from just a register in your code? There was an explicit memory load from Guard. So it did in fact prevent caching in a register.
Now how it does this is up to the specific platform's implementation of std::atomic, but it must do this.
And, by the way, Alexandrescu's reasoning is completely wrong for modern platforms. While it's true that volatile prevents the compiler from caching in a register, it doesn't prevent similar caching being done by the CPU or by hardware. On some platforms, it might happen to be adequate, but there is absolutely no reason to write gratuitously non-portable code that might break on a future CPU, compiler, library, or platform when a fully-portable alternative is readily available.
volatile is not necessary for any "sane" implementation when the Gadget example is changed to use std::atomic<bool>. The reason for this is not that the standard forbids the use of registers, instead (§29.3/13 in n3690):
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Of course, what constitutes "reasonable" is open to interpretation, and it's "should", not "shall", so an implementation might ignore the requirement without violating the letter of the standard. Typical implementations do not cache the results of atomic loads, nor (much) delay issuing an atomic store to the CPU, and thus leave the decision largely to the hardware. If you would like to enforce this behavior, you should use volatile std::atomic<bool> instead. In both cases, however, if another thread sets the flag, the wait in Wait() will be finite, though if your compiler and/or CPU are so inclined, it can still take much longer than you would like.
Also note that a memory fence does not guarantee that a store becomes visible to another thread immediately nor any sooner than it otherwise would. So even if the compiler added fence instructions to Gadget's methods, they wouldn't help at all. Fences are used to guarantee consistency, not to increase performance.
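To make that concrete, here's a sketch of the Gadget example reworked with std::atomic<bool> (substituting std::this_thread::sleep_for for the Windows-specific Sleep):

#include <atomic>
#include <chrono>
#include <thread>

class Gadget
{
public:
    void Wait()
    {
        // An atomic load can't be hoisted out of the loop and cached in a
        // register: every iteration re-reads flag_ from memory.
        while (!flag_.load(std::memory_order_acquire))
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(1000));
        }
    }
    void Wakeup()
    {
        flag_.store(true, std::memory_order_release);
    }
private:
    std::atomic<bool> flag_{false};
};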

Do I need to use memory barriers to protect a shared resource?

In a multi-producer, multi-consumer situation. If producers are writing into int a, and consumers are reading from int a, do I need memory barriers around int a?
We all learned that: Shared resources should always be protected and the standard does not guarantee a proper behavior otherwise.
However, on cache-coherent architectures, visibility is ensured automatically, and atomicity of MOV operations on aligned 8-, 16-, 32- and 64-bit variables is guaranteed.
Therefore, why protect int a at all?
At least in C++11 (or later), you don't need to (explicitly) protect your variable with a mutex or memory barriers.
You can use std::atomic to create an atomic variable. Changes to that variable are guaranteed to propagate across threads.
std::atomic<int> a{0};

// thread 1:
a = 1;

// thread 2 (later):
std::cout << a; // shows `a` has the value 1.
Of course, there's a little more to it than that--for example, there's no guarantee that std::cout works atomically, so you probably will have to protect that (if you try to write from more than one thread, anyway).
It's then up to the compiler/standard library to figure out the best way to handle the atomicity requirements. On a typical architecture that ensures cache coherence, it may mean nothing more than "don't allocate this variable in a register". It could impose memory barriers, but is only likely to do so on a system that really requires them.
On real world C++ implementations where volatile worked as a pre-C++11 way to roll your own atomics (i.e. all of them), no barriers are needed for inter-thread visibility, only for ordering wrt. operations on other variables. Most ISAs do need special instructions or barriers for the default memory_order_seq_cst.
On the other hand, explicitly specifying memory ordering (especially acquire and release) for an atomic variable may allow you to optimize the code a bit. By default, an atomic uses sequential ordering, which basically acts like there are barriers before and after access--but in a lot of cases you only really need one or the other, not both. In those cases, explicitly specifying the memory ordering can let you relax the ordering to the minimum you actually need, allowing the compiler to improve optimization.
(Not all ISAs actually need separate barrier instructions even for seq_cst; notably AArch64 just has a special interaction between stlr and ldar to stop seq_cst stores from reordering with later seq_cst loads, on top of acquire and release ordering. So it's as weak as the C++ memory model allows, while still complying with it. But weaker orders, like memory_order_acquire or relaxed, can avoid even that blocking of reordering when it's not needed.)
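As a sketch of that kind of relaxation: publishing data through a flag only needs a release store paired with an acquire load, rather than the seq_cst default (which on many ISAs costs a full barrier on the store side):

#include <atomic>

int payload;                        // ordinary non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                   // written before the flag
    ready.store(true, std::memory_order_release);   // publish
}

int consumer()
{
    while (!ready.load(std::memory_order_acquire))  // wait for publication
        ;
    return payload;  // guaranteed 42: the store-release synchronizes-with the load-acquire
}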
However, on cache-coherent architectures, visibility is ensured automatically, and atomicity of MOV operations on aligned 8-, 16-, 32- and 64-bit variables is guaranteed.
Unless you strictly adhere to the requirements of the C++ spec to avoid data races, the compiler is not obligated to make your code function the way it appears to. For example:
int a = 0, b = 0; // shared variables, initialized to zero

a = 1;
b = 1;
Say you do this on your fully cache-coherent architecture. On such hardware it would seem that since a is written before b no thread will ever be able to see b with a value of 1 without a also having that value.
But this is not the case. If you have failed to strictly adhere to the requirements of the C++ memory model for avoiding data races, e.g. you read these variables without the correct synchronization primitives being inserted anywhere, then your program may in fact observe b being written before a. The reason is that you have introduced "undefined behavior", and the C++ implementation has no obligation to do anything that makes sense to you.
What may be going on in practice is that the compiler may reorder the writes, even though the hardware works very hard to make it seem as if all writes occur in the order of the machine instructions performing them. You need the entire toolchain to cooperate; cooperation from just the hardware, such as strong cache coherency, is not sufficient.
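To make that concrete, a sketch of the kind of outcome this allows (this program deliberately contains the data race being described):

#include <thread>

int a = 0, b = 0;   // shared, non-atomic: racy on purpose

int main()
{
    std::thread writer([]{
        a = 1;   // the compiler (or CPU) may make this store visible
        b = 1;   // after the store to b, or merge the two stores
    });
    std::thread reader([]{
        int bb = b, aa = a;   // UB: unsynchronized reads
        (void)bb; (void)aa;   // may observe bb == 1 && aa == 0
    });
    writer.join();
    reader.join();
}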
The book C++ Concurrency in Action is a good source if you care to learn about the details of the C++ memory model and writing portable, concurrent code in C++.

volatile variable and atomic operations on Visual C++ x86

Plain loads have acquire semantics on x86 and plain stores have release semantics; however, the compiler can still reorder instructions. While fences and locked instructions (lock xchg, lock cmpxchg) prevent both the hardware and the compiler from reordering, plain loads and stores still need to be protected with compiler barriers. Visual C++ provides the _ReadWriteBarrier() intrinsic, which prevents the compiler from reordering, and C++ provides the volatile keyword for a similar reason. I write all this just to make sure that I have everything right. So, given that all of the above is true, is there any reason to mark variables volatile if they are only used in functions protected with _ReadWriteBarrier()?
For example:
int load(int& var)
{
    _ReadWriteBarrier();
    int value = var;
    _ReadWriteBarrier();
    return value;
}
Is it safe to make that variable non-volatile? As far as I understand it is, because the function is protected and no reordering can be done by the compiler inside it. On the other hand, Visual C++ gives volatile variables special behavior (beyond what the standard requires): it makes volatile reads and writes into atomic loads and stores. But my target is x86, and plain loads and stores are supposed to be atomic on x86 anyway, right?
Thanks in advance.
The volatile keyword is available in C too. volatile is often used in embedded systems, especially when the value of a variable may change at any time without any action being taken by the code. Three common scenarios: reading from a memory-mapped peripheral register, global variables modified by an interrupt service routine, and global variables shared within a multi-threaded program.
So it is the last scenario where volatile could be considered to be similar to _ReadWriteBarrier.
_ReadWriteBarrier is not a real function: it does not insert any additional instructions, and it does not prevent the CPU from rearranging reads and writes; it only prevents the compiler from rearranging them. _ReadWriteBarrier is there to prevent compiler reordering.
MemoryBarrier is what prevents CPU reordering!
A compiler typically rearranges instructions, and (pre-C++11) C++ contains no built-in support for multithreaded programs, so the compiler assumes the code is single-threaded when reordering it. With MSVC, use _ReadWriteBarrier in the code so that the compiler will not move reads and writes across it.
Check this link for more detailed discussion on those topics
http://msdn.microsoft.com/en-us/library/ee418650(v=vs.85).aspx
Regarding your code snippet: you do not have to use _ReadWriteBarrier as a SYNC primitive; the first call to _ReadWriteBarrier is not necessary.
When using _ReadWriteBarrier you do not have to use volatile.
You wrote "it makes volatile reads and writes atomic loads and stores" - I don't think it is OK to say that; atomicity and volatility are different. Atomic operations are considered to be indivisible - ... http://www.yoda.arachsys.com/csharp/threads/volatility.shtml
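A quick sketch of that difference (hypothetical counter): volatile forces each access to hit memory, but a read-modify-write through a volatile is still divisible, unlike an atomic RMW:

#include <atomic>

volatile int vcount = 0;
std::atomic<int> acount{0};

void bump()
{
    vcount = vcount + 1;    // separate load, add, store: another thread can
                            // slip in between, so increments can be lost
    acount.fetch_add(1);    // one indivisible read-modify-write
}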
Note: I am not an expert on this topic, and some of my statements are "what I heard on the internet", but I think I can still clear up some misconceptions.
[edit] In general, I would rely on platform-specifics such as x86 atomic reads and lack of OOOX only in isolated, local optimizations that are guarded by an #ifdef checking the target platform, ideally accompanied by a portable solution in the #else path.
Things to look out for
atomicity of read / write operations
reordering due to compiler optimizations (this includes a different order seen by another thread due to simple register caching)
out-of-order execution in the CPU
Possible misconceptions
1. As far as I understand it is, because the function is protected and no reordering can be done by the compiler inside it.
[edit] To clarify: the _ReadWriteBarrier provides protection against instruction reordering, however, you have to look beyond the scope of the function. _ReadWriteBarrier has been fixed in VS 2010 to do that, earlier versions may be broken (depending on the optimizations they actually do).
Optimization isn't limited to functions. There are multiple mechanisms (automatic inlining, link time code generation) that span functions and even compilation units (and can provide much more significant optimizations than small-scoped register caching).
2. Visual C++ [...] makes volatile reads and writes atomic loads and stores,
Where did you find that? MSDN says that, beyond the standard, it will put memory barriers around reads and writes, with no guarantee of atomic reads.
[edit] Note that C#, Java, Delphi etc. have different memory models and may make different guarantees.
3. plain loads and stores are supposed to be atomic on x86 anyway, right?
No, they are not. Unaligned reads are not atomic. They happen to be atomic if they are well-aligned - a fact I'd not rely on unless it's isolated and easily exchanged. Otherwise your "simplification for x86" becomes a lockdown to that target.
[edit] Unaligned reads happen:
char * c = new char[sizeof(int)+1];
load(*(int *)c);     // allowed by the standard to be unaligned
load(*(int *)(c+1)); // unaligned with most allocators

#pragma pack(push,1)
struct
{
    char c;
    int i;
} foo;
load(foo.i);         // caller said so
#pragma pack(pop)
This is of course all academic if you remember that the parameter must be aligned and you control all the code. I wouldn't write such code anymore, because I've been bitten too often by the laziness of the past.
4. Plain load has acquire semantics on x86, plain store has release semantics
At the hardware level, essentially yes; but note that x86 processors do use out-of-order execution - they just preserve the appearance of program order for loads and stores (apart from StoreLoad reordering). More importantly, none of this stops the optimizer from reordering instructions, so you can't rely on it at the source level without a compiler barrier.
5. _ReadBarrier / _WriteBarrier / _ReadWriteBarrier do all the magic
They don't - they just prevent reordering by the optimizer. MSDN finally made it a big bad warning for VS2010, but the information apparently applies to previous versions as well.
Now, to your question.
I assume the purpose of the snippet is to take an arbitrary int variable and load it (atomically?). The straightforward choice would be an interlocked read (sketched below) or (on Visual C++ 2005 and later) a volatile read.
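For the interlocked option, a sketch (assuming var is a properly aligned 32-bit int; long is 32 bits in MSVC):

#include <intrin.h>

int load_interlocked(int& var)
{
    // OR with 0 leaves the value unchanged; the intrinsic performs an
    // atomic read-modify-write with a full barrier and returns the old
    // value, giving a fully-ordered atomic load.
    return _InterlockedOr(reinterpret_cast<volatile long *>(&var), 0);
}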
Otherwise you'd need a barrier for both the compiler and the CPU before the read; in VC++ parlance this would be:
int load(int& var)
{
    // force the optimizer to complete all memory writes:
    // (note that this had issues before VC++ 2010)
    _WriteBarrier();
    // force the CPU to settle all pending reads/writes, and not to start new ones:
    MemoryBarrier();
    // now, read.
    int value = var;
    return value;
}
Note that _WriteBarrier has a second warning in MSDN:
In past versions of the Visual C++ compiler, the _ReadWriteBarrier and _WriteBarrier functions were enforced only locally and did not affect functions up the call tree. These functions are now enforced all the way up the call tree.
I hope that is correct. stackoverflowers, please correct me if I'm wrong.