Why isn't [[carries_dependency]] the default in C++?

I know that memory_order_consume has been deprecated, but I'm trying to understand the logic that went into the original design and how [[carries_dependency]] and kill_dependency were supposed to work. For that, I would like a specific example of code that would break on an IBM PowerPC or DEC Alpha, or even a hypothetical architecture with a hypothetical compiler that fully implemented consume semantics in C++11 or C++14.
The best I can come up with is an example like this:
#include <atomic>
#include <cassert>

int v;
std::atomic<int*> ap;

void
thread_1()
{
    v = 1;
    ap.store(&v, std::memory_order_release);
}

int
f(int *p [[carries_dependency]])
{
    return v;
}

void
thread_2()
{
    int *p;
    while (!(p = ap.load(std::memory_order_consume)))
        ;
    int v2 = f(p);
    assert(*p == v2);
}
I understand that the assertion could fail in this code. However, is it the case that the assertion is not supposed to fail if you remove [[carries_dependency]] from f? If so, why is that the case? After all, you requested a memory_order_consume, so why would you expect other accesses to v to reflect acquire semantics? If removing [[carries_dependency]] does not make the code correct, then what's an example where [[carries_dependency]] (or making [[carries_dependency]] the default for all variables) breaks otherwise correct code?
The only thing I can think is that maybe this has to do with register spills? If a function spills a register onto the stack and later re-loads it, this could break the dependency chain. So maybe [[carries_dependency]] makes things efficient in some cases (says no need to issue memory barrier in the caller before calling this function) but also requires the callee to issue a memory barrier before any register spills or calling another function, which could be less efficient in other cases? I'm grasping at straws here, though, so would still love to hear from someone who understands this stuff...

return v doesn't have a data dependency on int *p, so you'd need acquire not consume for ap.load(consume) / f(p) to synchronize with the release store.
If you'd used return *p then this would be sufficient thanks to dependency ordering, because that load would have a data dependency on the earlier load, no way for the CPU to generate the address earlier and thus load from v before the load from ap that saw the value you were waiting for.
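To make that concrete, here's a sketch of the fixed variant (my illustration, repeating the question's globals; it assumes a hypothetical implementation that fully honours consume semantics):
#include <atomic>
#include <cassert>

int v;
std::atomic<int*> ap;

int
f(int *p [[carries_dependency]])
{
    // The returned value now has a data dependency on the pointer that came
    // from the consume load, so dependency ordering covers this load.
    return *p;
}

void
thread_2()
{
    int *p;
    while (!(p = ap.load(std::memory_order_consume)))
        ;
    assert(f(p) == 1);   // holds: v = 1 is dependency-ordered before the *p load
}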
Dropping the dependency-ordering stuff (by passing the value to a function without [[carries_dependency]]) effectively requires promoting consume to acquire, by using a memory barrier before the function call.
DEC Alpha would always need a barrier even for consume to work, i.e. it had to promote consume to acquire because the ISA didn't guarantee dependency ordering by the hardware.
Some ISAs (mostly just x86) are so strongly ordered they don't need a barrier because every load is an acquire load, not reordered with other loads. Or at least giving the illusion of not being reordered; actual implementations speculatively load early but nuke the pipeline if mis-speculation is detected, i.e. where a cache line isn't still valid by the time the load was architecturally allowed to happen.
So x86 and Alpha would likely still work even with the [[carries_dependency]] version, because they're either too strong or too weak for mo_consume to be something the hardware can actually do (more cheaply than mo_acquire).
For Alpha it would depend where the compiler put the barrier; it could put it after f(p)'s return v, only before the *p that actually depends on the consume load. Or it could just promote consume to acquire on the spot at the load, like compilers do now (since consume is deprecated after proving too hard to support in its current design.)
As for why ISO C++11 decided to promote consume results to effectively acquire when passing across function boundaries, that might have been a usability consideration. But also performance. Without that, compilers would lose the ability to do some optimizations on incoming function args.
e.g. int ready = foo.load(consume); / if(ready == 1) return non_atomic[ready-ready]; is required to make asm that has a data dependency on the consume-load result, unlike normal when the compiler would just optimize it to return *non_atomic.
(You might be familiar with x86 xor eax,eax being a good way to zero a register. In weakly-ordered ISAs that guarantee dependency ordering, like ARM, eor r0, r0,r0 is guaranteed not to break the dependency on the old value of r0.)
Also, constant-propagation into branches like if(ready == 1) would normally be possible: code inside that if can only run when ready has the constant value 1. But under dependency-carrying rules, even if we weren't cancelling ready out, non_atomic[ready] couldn't be optimized to non_atomic[1];.
If every incoming function arg potentially carried a dependency, compilers would not be able to do those usual optimizations on any function args or values derived from them.
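As a rough sketch of that point (my own illustration; foo and non_atomic are just the placeholder names from the example above):
#include <atomic>

extern int non_atomic[16];
std::atomic<int> foo;

int get()
{
    int ready = foo.load(std::memory_order_consume);
    if (ready == 1)
        // With dependency-carrying rules the asm must keep a data dependency
        // on 'ready' (e.g. an indexed addressing mode); the compiler may not
        // fold ready - ready to 0 or constant-propagate ready == 1 here.
        return non_atomic[ready - ready];
    return 0;
}

// If every incoming argument carried a dependency by default, the same
// restriction would apply inside an ordinary function like this one, which
// today would simply be optimized to return *non_atomic when ready == 1.
int lookup(int ready)
{
    if (ready == 1)
        return non_atomic[ready - ready];
    return 0;
}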
Related re: what consume is about and/or its deprecation, and that it's still used in a hand-rolled way with volatile in Linux kernel code (e.g. RCU):
[[carries_dependency]] what it means and how to implement
When should you not use [[carries_dependency]]?

c++11 register cache thread safety

In volatile: The Multithreaded Programmer's Best Friend, Andrei Alexandrescu gives this example:
class Gadget
{
public:
    void Wait()
    {
        while (!flag_)
        {
            Sleep(1000); // sleeps for 1000 milliseconds
        }
    }
    void Wakeup()
    {
        flag_ = true;
    }
    ...
private:
    bool flag_;
};
He states,
... the compiler concludes that it can cache flag_ in a register ... it harms correctness: after you call Wait for some Gadget object, although another thread calls Wakeup, Wait will loop forever. This is because the change of flag_ will not be reflected in the register that caches flag_.
Then he offers a solution:
If you use the volatile modifier on a variable, the compiler won't cache that variable in registers — each access will hit the actual memory location of that variable.
Now, other people have mentioned on Stack Overflow and elsewhere that the volatile keyword doesn't really offer any thread-safety guarantees, and that I should use std::atomic or mutex synchronization instead, which I do agree with.
However, going the std::atomic route for example, which internally uses memory fences read_acquire and write_release (Acquire and Release Semantics), I don't see how it actually fixes the register-cache problem in particular.
In the case of x86 for example, every load on x86/64 already implies acquire semantics and every store implies release semantics, such that compiled code under x86 doesn't emit any actual memory barriers at all. (The Purpose of memory_order_consume in C++11)
g = Guard.load(memory_order_acquire);
if (g != 0)
    p = Payload;
On Intel x86-64, the Clang compiler generates compact machine code for this example – one machine instruction per line of C++ source code. This family of processors features a strong memory model, so the compiler doesn’t need to emit special memory barrier instructions to implement the read-acquire.
So... just assuming the x86 arch for now, how does std::atomic solve the cache-in-register problem? With no memory barrier instructions for the read-acquire in the compiled code, it seems to be the same as the compiled code for just a regular read.
Did you notice that there was no load from just a register in your code? There was an explicit memory load from Guard. So it did in fact prevent caching in a register.
Now how it does this is up to the specific platform's implementation of std::atomic, but it must do this.
And, by the way, Alexandrescu's reasoning is completely wrong for modern platforms. While it's true that volatile prevents the compiler from caching in a register, it doesn't prevent similar caching being done by the CPU or by hardware. On some platforms, it might happen to be adequate, but there is absolutely no reason to write gratuitously non-portable code that might break on a future CPU, compiler, library, or platform when a fully-portable alternative is readily available.
volatile is not necessary for any "sane" implementation when the Gadget example is changed to use std::atomic<bool>. The reason for this is not that the standard forbids the use of registers; instead (§29.3/13 in n3690):
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
Of course, what constitutes "reasonable" is open to interpretation, and it's "should", not "shall", so an implementation might ignore the requirement without violating the letter of the standard. Typical implementations do not cache the results of atomic loads, nor (much) delay issuing an atomic store to the CPU, and thus leave the decision largely to the hardware. If you would like to enforce this behavior, you should use volatile std::atomic<bool> instead. In both cases, however, if another thread sets the flag, Wait() should terminate in finite time, but, if your compiler and/or CPU are so inclined, it can still take much longer than you would like.
Also note that a memory fence does not guarantee that a store becomes visible to another thread immediately nor any sooner than it otherwise would. So even if the compiler added fence instructions to Gadget's methods, they wouldn't help at all. Fences are used to guarantee consistency, not to increase performance.
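For concreteness, here is Alexandrescu's Gadget example rewritten with std::atomic<bool> as suggested above (a minimal sketch; Sleep is the same platform sleep call as in the original article):
#include <atomic>

void Sleep(unsigned long milliseconds);   // platform sleep, as in the original example

class Gadget
{
public:
    void Wait()
    {
        // Each iteration performs a real atomic load; the compiler may not
        // hoist it out of the loop and spin on a stale register copy.
        while (!flag_.load(std::memory_order_acquire))
        {
            Sleep(1000); // sleeps for 1000 milliseconds
        }
    }
    void Wakeup()
    {
        flag_.store(true, std::memory_order_release);
    }
private:
    std::atomic<bool> flag_{false};
};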

Volatile vs. memory fences

The code below is used to assign work to multiple threads, wake them up, and wait until they are done. The "work" in this case consists of "cleaning a volume". What exactly this operation does is irrelevant for this question -- it just helps with the context. The code is part of a huge transaction processing system.
void bf_tree_cleaner::force_all()
{
    for (int i = 0; i < vol_m::MAX_VOLS; i++) {
        _requested_volumes[i] = true;
    }
    // fence here (seq_cst)
    wakeup_cleaners();
    while (true) {
        usleep(10000); // 10 ms
        bool remains = false;
        for (int vol = 0; vol < vol_m::MAX_VOLS; ++vol) {
            // fence here (seq_cst)
            if (_requested_volumes[vol]) {
                remains = true;
                break;
            }
        }
        if (!remains) {
            break;
        }
    }
}
A value in a boolean array _requested_volumes[i] tells whether thread i has work to do. When it is done, the worker thread sets it to false and goes back to sleep.
The problem I am having is that the compiler generates an infinite loop, where the variable remains is always true, even though all values in the array have been set to false. This only happens with -O3.
I have tried two solutions to fix that:
Declare _requested_volumes volatile
(EDIT: this solution does work actually. See edit below)
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses. But there's a lot of dispute over this on the Internet. The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access. In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
Introduce memory fences
The method wakeup_cleaners() acquires a pthread_mutex_t internally in order to set a wake-up flag in the worker threads, so it should implicitly produce proper memory fences. But I'm not sure if those fences affect memory accesses in the caller method (force_all()). Therefore, I manually introduced fences in the locations specified by the comments above. This should make sure that writes performed by the worker thread in _requested_volumes are visible in the main thread.
What puzzles me is that none of these solutions works, and I have absolutely no idea why. The semantics and proper use of memory fences and volatile is confusing me right now. The problem is that the compiler is applying an undesired optimization -- hence the volatile attempt. But it could also be a problem of thread synchronization -- hence the memory fence attempt.
I could try a third solution in which a mutex protects every access to _requested_volumes, but even if that works, I would like to understand why, because as far as I understand, it's all about memory fences. Thus, it should make no difference whether it's done explicitly or implicitly via a mutex.
EDIT: My assumptions were wrong and Solution 1 actually does work. However, my question remains in order to clarify the use of volatile vs. memory fences. If volatile is such a bad thing that should never be used in multithreaded programming, what else should I use here? Do memory fences also affect compiler optimizations? Because I see these as two orthogonal issues, and therefore orthogonal solutions: fences for visibility in multiple threads and volatile for preventing optimizations.
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses.
Yes.
But there's a lot of dispute over this on the Internet.
Not, generally, between "the experts".
The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access.
Nope.
Non-pure, non-constexpr non-inlined function calls (getters/accessors) also necessarily have this effect. Admittedly link-time optimization confuses the issue of which functions may really get inlined.
In C, and by extension C++, volatile affects memory access optimization. Java took this keyword, and since it can't (or couldn't) do the tasks C uses volatile for in the first place, altered it to provide a memory fence.
The correct way to get the same effect in C++ is using std::atomic.
In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
No, it may have the desired effect, depending on how it interacts with your platform's cache hardware. This is brittle - it could change any time you upgrade a CPU, or add another one, or change your scheduler behaviour - and it certainly isn't portable.
If you're really just tracking how many workers are still working, sane methods might be a semaphore (synchronized counter), or mutex+condvar+integer count. Either are likely more efficient than busy-looping with a sleep.
If you're wedded to the busy loop, you could still reasonably have a single counter, such as std::atomic<size_t>, which is set by wakeup_cleaners and decremented as each cleaner completes. Then you can just wait for it to reach zero.
If you really want a busy loop and really prefer to scan the array each time, it should be an array of std::atomic<bool>. That way you can decide what consistency you need from each load, and it will control both the compiler optimizations and the memory hardware appropriately.
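A minimal sketch of the single-counter option mentioned above (my illustration; cleaner_done() is a hypothetical hook each worker would call when it finishes, not part of the original bf_tree_cleaner code):
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>

void wakeup_cleaners();                        // from the original code

std::atomic<std::size_t> pending_cleaners{0};

// Main thread: request work, then wait until every cleaner has reported back.
void force_all_with_counter(std::size_t n_cleaners)
{
    pending_cleaners.store(n_cleaners, std::memory_order_relaxed);
    wakeup_cleaners();
    while (pending_cleaners.load(std::memory_order_acquire) != 0)
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

// Each worker thread calls this once its volume has been cleaned.
void cleaner_done()
{
    pending_cleaners.fetch_sub(1, std::memory_order_release);
}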
Apparently, volatile does what is necessary for your example. The topic of the volatile qualifier itself is too broad: you can start by searching "C++ volatile vs atomic" etc. There are a lot of articles and questions & answers on the internet, e.g. Concurrency: Atomic and volatile in C++11 memory model.
Briefly, volatile tells the compiler to disable some aggressive optimizations, particularly, to read the variable each time it is accessed (rather than storing it in a register or cache). There are compilers which do more, making volatile act more like std::atomic: see the Microsoft Specific section here. In your case, disabling an aggressive optimization is exactly what was necessary.
However, volatile doesn't define the order of execution for the statements around it. That is why you need memory ordering if you need to do something else with the data after the flags you check have been set.
For inter-thread communication it is appropriate to use std::atomic; in particular, you need to refactor _requested_volumes[vol] to be of type std::atomic<bool> or even std::atomic_flag: http://en.cppreference.com/w/cpp/atomic/atomic .
An article that discourages usage of volatile and explains that volatile can be used only in rare special cases (connected with hardware I/O): https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt

how to declare and use "one writer, many readers, one process, simple type" variable?

I have a really simple question. I have a simple-type variable (like an int). I have one process, one writer thread, several "readonly" threads. How should I declare the variable?
volatile int
std::atomic<int>
int
I expect that when "writer" thread modifies value all "reader" threads should see fresh value ASAP.
It's ok to read and write variable at the same time, but I expect reader to obtain either old value or new value, not some "intermediate" value.
I'm using a single-CPU Xeon E5 v3 machine. I do not need to be portable, I run the code only on this server, and I compile with -march=native -mtune=native. Performance is very important so I do not want to add "synchronization overhead" unless absolutely required.
If I just use int and one thread writes value is it possible that in another thread I do not see "fresh" value for a while?
Just use std::atomic.
Don't use volatile, and don't use a plain int as it is; neither gives the necessary synchronisation. Modifying it in one thread and accessing it from another without synchronisation will give undefined behaviour.
If you have unsynchronized access to a variable where you have one or more writers then your program has undefined behavior. Somehow you have to guarantee that while a write is happening no other write or read can happen. This is called synchronization. How you achieve this synchronization depends on the application.
For something like this, where we have one writer and several readers and are using a TriviallyCopyable datatype, a std::atomic<> will work. The atomic variable will make sure under the hood that only one thread can access the variable at the same time.
If you do not have a TriviallyCopyable type, or you do not want to use a std::atomic, you could also use a conventional std::mutex and a std::lock_guard to control access:
{ // enter locking scope
    std::lock_guard lock(mutx); // create lock guard which locks the mutex
    some_variable = some_value; // do work
} // end scope; lock is destroyed and mutx is released
An important thing to keep in mind with this approach is that you want to keep the // do work section as short as possible as while the mutex is locked no other thread can enter that section.
Another option would be to use a std::shared_timed_mutex (C++14) or std::shared_mutex (C++17), which will allow multiple readers to share the mutex, but when you need to write you can still lock the mutex exclusively and write the data.
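A short sketch of that reader/writer pattern with std::shared_mutex (C++17; the names here are just placeholders):
#include <mutex>
#include <shared_mutex>

std::shared_mutex rw_mutex;
int shared_value = 0;

int read_value()
{
    std::shared_lock lock(rw_mutex);   // many readers may hold this at once
    return shared_value;
}

void write_value(int v)
{
    std::unique_lock lock(rw_mutex);   // exclusive: blocks readers and writers
    shared_value = v;
}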
You do not want to use volatile to control synchronization as jalf states in this answer:
For thread-safe accesses to shared data, we need a guarantee that:
the read/write actually happens (that the compiler won't just store the value in a register instead and defer updating main memory until much later)
that no reordering takes place. Assume that we use a volatile variable as a flag to indicate whether or not some data is ready to be read. In our code, we simply set the flag after preparing the data, so all looks fine. But what if the instructions are reordered so the flag is set first?
volatile does guarantee the first point. It also guarantees that no reordering occurs between different volatile reads/writes. All volatile memory accesses will occur in the order in which they're specified. That is all we need for what volatile is intended for: manipulating I/O registers or memory-mapped hardware, but it doesn't help us in multithreaded code where the volatile object is often only used to synchronize access to non-volatile data. Those accesses can still be reordered relative to the volatile ones.
As always if you measure the performance and the performance is lacking then you can try a different solution but make sure to remeasure and compare after changing.
Lastly Herb Sutter has an excellent presentation he did at C++ and Beyond 2012 called Atomic Weapons that:
This is a two-part talk that covers the C++ memory model, how locks and atomics and fences interact and map to hardware, and more. Even though we’re talking about C++, much of this is also applicable to Java and .NET which have similar memory models, but not all the features of C++ (such as relaxed atomics).
I'll complete the previous answers a little bit.
As explained previously, just using int or even volatile int is not enough, for various reasons (even with the memory-ordering constraints of Intel processors.)
So, yes, you should use atomic types for that, but you need extra considerations: atomic types guarantee coherent access, but if you have visibility concerns you need to specify a memory barrier (memory order.)
Barriers will enforce visibility and coherency between threads. On Intel and most modern architectures, they will enforce cache synchronization so updates are visible to every core. The problem is that this may be expensive if you're not careful enough.
Possible memory orders are:
relaxed: no special barrier, only coherent reads/writes are enforced;
sequential consistency: the strongest possible constraint (the default);
acquire: enforces that no loads after the current one are reordered before it, and adds the required barrier to ensure that released stores are visible;
consume: a simplified version of acquire that mostly only constrains reordering;
release: enforces that all stores before it are complete before the current one, and that memory writes are done and visible to loads performing an acquire barrier.
So, if you want to be sure that updates to the variable are visible to readers, you need to flag your store with (at least) a release memory order and, on the reader side, you need an acquire memory order (again, at least). Otherwise, readers may not see the actual version of the integer (they'll see a coherent version at least, that is, the old or the new one, but not an ugly mix of the two).
Of course, the default behavior (full consistency) will also give you the correct behavior, but at the expense of a lot of synchronization. In short, each time you add a barrier it forces cache synchronization which is almost as expensive as several cache misses (and thus reads/writes in main memory.)
So, in short you should declare your int as atomic and use the following code for store and load:
// Your variable
std::atomic<int> v;
// Read
x = v.load(std::memory_order_acquire);
// Write
v.store(x, std::memory_order_release);
And just to complete: sometimes (and more often than you think) you don't really need sequential consistency (or even the partial release/acquire consistency), since the visibility of updates is pretty relative. When dealing with concurrent operations, updates take place not when the write is performed but when others see the change, so reading the old value is probably not a problem!
I strongly recommend reading articles related to relativistic programming and RCU, here are some interesting links:
Relativistic Programming wiki: http://wiki.cs.pdx.edu/rp/
Structured Deferral: Synchronization via Procrastination: https://queue.acm.org/detail.cfm?id=2488549
Introduction to RCU Concepts: http://www.rdrop.com/~paulmck/RCU/RCU.LinuxCon.2013.10.22a.pdf
Let's start with int. In general, when used on a single-processor, single-core machine this should be sufficient, assuming the int size is the same as or smaller than the CPU word (like a 32-bit int on a 32-bit CPU). In this case, assuming correctly aligned word addresses (a high-level language should assure this by default), the write/read operations should be atomic. This is guaranteed by Intel as stated in [1]. However, in the C++ specification simultaneous reading and writing from different threads is undefined behaviour.
§1.10/6: Two expression evaluations conflict if one of them modifies a memory location (1.7) and the other one accesses or modifies the same memory location.
Now volatile. This keyword disables almost every optimization on accesses to the variable. This is the reason why it was used. For example, sometimes when optimizing, the compiler can get the idea that a variable you only read in one thread is constant there and simply replace it with its initial value. This solves such problems. However, it does not make access to the variable atomic. Also, in most cases, it is simply unnecessary, because use of proper multithreading tools, like a mutex or memory barrier, will achieve the same effect as volatile on its own, as described for instance in [2].
While this may be sufficient for most uses, there are other operations that are not guaranteed to be atomic. Incrementation is one such operation. This is where std::atomic comes in. It has those operations defined, like the mentioned increments in [3]. It is also well defined when reading and writing from different threads [4].
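For illustration, a tiny sketch of the atomic increment just mentioned (the names are mine):
#include <atomic>

std::atomic<int> counter{0};

void hit()
{
    // A plain ++ on a non-atomic int is a read-modify-write that can lose
    // updates under contention; on std::atomic<int> it becomes a single
    // atomic RMW instruction (e.g. lock add/xadd on x86) and cannot.
    counter.fetch_add(1, std::memory_order_relaxed);
}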
In addition, as stated in the answers in [5], there exist a lot of other factors that may (negatively) influence the atomicity of operations. Factors ranging from losing cache coherence between multiple cores down to hardware details may change how operations are performed.
To summarize, std::atomic is created to support accesses from different threads and it is highly recommended to use it when multithreading.
[1] http://www.intel.com/Assets/PDF/manual/253668.pdf see section 8.1.1.
[2] https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
[3] http://en.cppreference.com/w/cpp/atomic/atomic/operator_arith
[4] http://en.cppreference.com/w/cpp/atomic/atomic
[5] Are C++ Reads and Writes of an int Atomic?
The other answers, which say to use atomic and not volatile, are correct when portability matters. If you’re asking this question, and it’s a good question, that’s the practical answer for you, not, “But, if the standard library doesn’t provide one, you can implement a lock-free, wait-free data structure yourself!” Nevertheless, if the standard library doesn’t provide one, you can implement a lock-free data structure yourself that works on a particular compiler and a particular architecture, provided that there’s only one writer. (Also, somebody has to implement those atomic primitives in the standard library.) If I’m wrong about this, I’m sure someone will kindly inform me.
If you absolutely need an algorithm guaranteed to be lock-free on all platforms, you might be able to build one with atomic_flag. If even that doesn’t suffice, and you need to roll your own data structure, you can do that.
Since there’s only one writer thread, your CPU might guarantee that certain operations on your data will still work atomically even if you just use normal accesses instead of locks or even compare-and-swaps. This is not safe according to the language standard, because C++ has to work on architectures where it isn’t, but it can be safe, for example, on an x86 CPU if you guarantee that the variable you’re updating fits into a single cache line that it doesn’t share with anything else, and you might be able to ensure this with nonstandard extensions such as __attribute__ (( aligned (x) )).
Similarly, your compiler might provide some guarantees: g++ in particular guarantees that the compiler will not assume that the memory referenced through a volatile* is unchanged, even when the current thread couldn't have changed it. It will actually re-read the variable from memory each time you dereference it. That is in no way sufficient to ensure thread-safety, but it can be handy if another thread is updating the variable.
A real-world example might be: the writer thread maintains some kind of pointer (on its own cache line) which points to a consistent view of the data structure that will remain valid through all future updates. It updates its data with the RCU pattern, making sure to use a release operation (implemented in an architecture-specific way) after updating its copy of the data and before making the pointer to that data globally visible, so that any other thread that sees the updated pointer is guaranteed to see the updated data as well. The reader then makes a local copy (not volatile) of the current value of the pointer, getting a view of the data which will stay valid even after the writer thread updates again, and works with that. You want to use volatile on the single variable that notifies the readers of the updates, so they can see those updates even if the compiler “knows” your thread couldn’t have changed it. In this framework, the shared data just needs to be constant, and readers will use the RCU pattern. That’s one of the two ways I’ve seen volatile be useful in the real world (the other being when you don’t want to optimize out your timing loop).
There also needs to be some way, in this scheme, for the program to know when nobody’s using an old view of the data structure any longer. If that’s a count of readers, that count needs to be atomically modified in a single operation at the same time as the pointer is read (so getting the current view of the data structure involves an atomic CAS). Or this might be a periodic tick when all the threads are guaranteed to be done with the data they’re working with now. It might be a generational data structure where the writer rotates through pre-allocated buffers.
Also observe that a lot of things your program might do could implicitly serialize the threads: those atomic hardware instructions lock the processor bus and force other CPUs to wait, those memory fences could stall your threads, or your threads might be waiting in line to allocate memory from the heap.
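A minimal sketch of just the pointer-publication part of the scheme described above, using std::atomic rather than volatile (Data and the function names are placeholders; reclamation of old snapshots is omitted):
#include <atomic>

struct Data { int payload; /* ... */ };

std::atomic<Data*> current{nullptr};

// Writer thread: build a complete, consistent snapshot, then publish it with a
// release store so any reader that sees the new pointer also sees its contents.
void publish(Data* fresh)
{
    current.store(fresh, std::memory_order_release);
}

// Reader thread: take a local copy of the pointer with an acquire load and work
// with that snapshot; it stays valid even if the writer publishes again, as long
// as old snapshots are reclaimed only after readers are done with them.
int read_payload()
{
    Data* snapshot = current.load(std::memory_order_acquire);
    return snapshot ? snapshot->payload : 0;
}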
Unfortunately it depends.
When a variable is read and written in multiple threads, there may be two failures:
1) Tearing, where half the data is pre-change and half the data is post-change.
2) Stale data, where the data read has some older value.
int, volatile int and std::atomic all don't tear (for a properly aligned int on typical hardware).
Stale data is a different issue. However, every value read will have existed at some point, so it can be conceived of as correct.
volatile: this tells the compiler neither to cache the data in a register, nor to re-order operations around it. This improves the coherence between threads by ensuring all operations in a thread occur either before the access, at the access, or after it.
This means that
volatile int x;
int y;
y = 5;
x = 7;
the store for x = 7 will be emitted after the one for y = 5.
Unfortunately, the CPU is also capable of re-ordering operations. This can mean that another thread sees x == 7 before y == 5.
std::atomic<int> x; would allow a guarantee that after seeing x == 7, another thread would see y == 5. (Assuming other threads are not modifying y.)
So all reads of int, volatile int, and std::atomic<int> would show previous valid values of x. Using volatile and atomic strengthens the ordering of those values.
See kernel.org barriers
I have simple type variable (like int).
I have one process, one writer thread, several "readonly" threads. How
should I declare variable?
volatile int
std::atomic
int
Use std::atomic with memory_order_relaxed for the store and load
It's quick, and from your description of your problem, safe. E.g.
void func_fast()
{
    std::atomic<int> a;
    a.store(1, std::memory_order_relaxed);
}
Compiles to:
func_fast():
    movl $1, -24(%rsp)
    ret
This assumes you don't need to guarantee that any other data is seen to be written before the integer is updated, and therefore the slower and more complicated synchronisation is unnecessary.
If you use the atomic naively like this:
void func_slow()
{
    std::atomic<int> b;
    b = 1;
}
You get an MFENCE instruction when no memory_order* is specified (the default is sequentially consistent), which is massively slower (100 cycles or more, vs just 1 or 2 for the bare MOV).
func_slow():
    movl $1, -24(%rsp)
    mfence
    ret
See http://goo.gl/svPpUa
(Interestingly on Intel the use of memory_order_release and _acquire for this code results in the same assembly language. Intel guarantees that writes and reads happen in order when using the standard MOV instruction).
Here is my attempt at the bounty:
a. The general answer already given above says 'use atomics'. This is the correct answer. volatile is not enough.
b. If you dislike that answer, and you are on Intel, and you have a properly aligned int, and you love unportable solutions, you can get away with simple volatile, using Intel's strong memory-ordering guarantees.
TL;DR: Use std::atomic<int> with a mutex around it if you read multiple times.
It depends on how strong a guarantee you want.
First, volatile is a compiler hint and you shouldn't count on it doing something helpful.
If you use a plain int you can suffer from memory aliasing. Say you have something like:
struct Example {
    int x;
    bool q;
};
Depending on how this is aligned in memory and the exact implementation of the CPU and memory bus, it's possible that writing to q will actually overwrite x when the cache line is copied from the CPU cache back to RAM. So unless you know how much to allocate around your int, it's not guaranteed that your writer will be able to write without being overwritten by some other thread.
Also even if you write you depend on the processor for reloading the data to the cache of other cores so there's no guarantee that your other thread will see a new value.
std::atomic<int> basically guarantees that you will always allocate sufficient memory, properly aligned so that you don't suffer from aliasing. Depending on the memory order requested you will also disable a bunch of optimizations, like caching, so everything will run slightly slower.
This still doesn't guarantee that if you read the variable multiple times you'll get the same value. The only way to do that is to put a mutex around it to block the writer from changing it.
Better still, find a library that already solves the problem you have and has already been tested by others to make sure it works well.

What are the threading guarantees of nowadays C and C++ compilers?

I'm wondering what are the guarantees that compilers make to ensure that threaded writes to memory have visible effects in other threads.
I know countless cases in which this is problematic, and I'm sure that if you're interested in answering you know it too, but please focus on the cases I'll be presenting.
More precisely, I am concerned about the circumstances that can lead to threads missing memory updates done by other threads. I don't care (at this point) if the updates are non-atomic or badly synchronized: as long as the concerned threads notice the changes, I'll be happy.
I hope that compilers make the distinction between two kinds of variable accesses:
Accesses to variables that necessarily have an address;
Accesses to variables that don't necessarily have an address.
For instance, if you take this snippet:
void sleepingbeauty()
{
    int i = 1;
    while (i) sleep(1);
}
Since i is a local, I assume that my compiler can optimize it away, and just let the sleeping beauty fall to eternal slumber.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
int i = 1;

void sleepingbeauty()
{
    while (i) sleep(1);
}
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it instead of caching the value.
void sleepingbeauty(int* ptr)
{
    *ptr = 1;
    while (*ptr) sleep(1);
}
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
I'm fairly sure that this is the memory access model used by every C and C++ compiler in production out there, but I don't think there are any guarantees. In fact, C++03 is even blind to the existence of threads, so this question wouldn't even make sense with that standard in mind. I'm not sure about C, though.
Is there some documentation out there that specifies whether I'm right or wrong? I know these are muddy waters since they may not be on standards grounds, but it seems like an important issue to me.
Besides the compiler generating reads, I'm also worried that the CPU cache could technically retain an outdated value, and that even though my compiler did its best to bring the reads and writes about, the values never synchronise between threads. Can this happen?
Accesses to variables that don't necessarily have an address.
All variables must have addresses (from the language's perspective -- compilers are allowed to avoid giving things addresses if they can, but that's not visible from inside the language). It's a side effect of the requirement that everything be "pointerable" that everything has an address -- even the empty class typically has a size of at least a char so that a pointer can be created to it.
Since i is a local, but its address is taken and passed to another function, I assume that my compiler will now know that it's an "addressable" variable, and generate memory reads to it to ensure that maybe some day the prince will come.
That depends on the content of onedaymyprincewillcome. The compiler may inline that function if it wishes and still make no memory reads.
Since i is a global, I assume that my compiler knows the variable has an address and will generate reads to it.
Yes, but it really doesn't matter if there are reads to it. These reads might simply be going to cache on your current local CPU core, not actually going all the way back to main memory. You would need something like a memory barrier for this, and no C++ compiler is going to do that for you.
I hope that the dereference operator is explicit enough to have my compiler generate a memory read on each loop iteration.
Nope -- not required. The function may be inlined, which would allow the compiler to completely remove these things if it so desires.
The only language feature in the standard that lets you control things like this w.r.t. threading is volatile, which simply requires that the compiler generate reads. That does not mean the value will be consistent though because of the CPU cache issue -- you need memory barriers for that.
If you need true multithreading correctness, you're going to be using some platform specific library to generate memory barriers and things like that, or you're going to need a C++0x compiler which supports std::atomic, which does make these kinds of requirements on variables explicit.
You assume wrong.
void onedaymyprincewillcome(int* i);

void sleepingbeauty()
{
    int i = 1;
    onedaymyprincewillcome(&i);
    while (i) sleep(1);
}
In this code, your compiler will load i from memory each time through the loop. Why? NOT because it thinks another thread could alter its value, but because it thinks that sleep could modify its value. It has nothing to do with whether or not i has an address or must have an address, and everything to do with the operations this thread performs which could modify the variable.
In particular, it is not guaranteed that assigning to an int is even atomic, although this happens to be true on all platforms we use these days.
Too many things go wrong if you don't use the proper synchronization primitives for your threaded programs. For example,
char *str = 0;
asynch_get_string(&str);
while (!str)
    sleep(1);
puts(str);
This could (and even will, on some platforms) sometimes print out utter garbage and crash the program. It looks safe, but because you are not using the proper synchronization primitives, the change to str could be seen by your thread before the change to the memory location it refers to, even though the other thread initializes the string before setting the pointer.
So just don't, don't, don't do this kind of stuff. And no, volatile is not a fix.
Summary: The basic problem is that the compiler only changes what order the instructions go in, and where the load and store operations go. This is not enough to guarantee thread safety in general, because the processor is free to change the order of loads and stores, and the order of loads and stores is not preserved between processors. In order to ensure things happen in the right order, you need memory barriers. You can either write the assembly yourself or you can use a mutex / semaphore / critical section / etc, which does the right thing for you.
While the C++98 and C++03 standards do not dictate a standard memory model that must be used by compilers, C++0x does, and you can read about it here: http://www.hpl.hp.com/personal/Hans_Boehm/misc_slides/c++mm.pdf
In the end, for C++98 and C++03, it's really up to the compiler and the hardware platform. Typically there will not be any memory barrier or fence-operation issued by the compiler for normally written code unless you use a compiler intrinsic or something from your OS's standard library for synchronization. Most mutex/semaphore implementations also include a built-in memory barrier operation to prevent speculative reads and writes across the locking and unlocking operations on the mutex by the CPU, as well as prevent any re-ordering of operations across the same read or write calls by the compiler.
Finally, as Billy points out in the comments, on Intel x86 and x86_64 platforms, any read or write of a single byte is atomic, as is a read or write of a register value to any 4-byte aligned memory location on x86, or any 4- or 8-byte aligned memory location on x86_64. On other platforms, that may not be the case and you would have to consult the platform's documentation.
The only control you have over optimisation is volatile.
Compilers make NO guarantee about concurrent threads accessing the same location at the same time. You will need some type of locking mechanism.
I can only speak for C, and since synchronization is a CPU-implemented functionality, a C programmer would need to call a library function for the OS containing an access to the lock (CriticalSection functions in the Windows NT engine) or implement something simpler (such as a spinlock) and access the functionality himself.
volatile is a good property to use at the module level. Sometimes a non-static (public) variable will work too.
local (stack) variables will not be accessible from other threads and should not be.
variables at the module level are good candidates for access by multiple threads but will require synchronization functions to work predictably.
Locks are unavoidable but they can be used more or less wisely resulting in a negligible or considerable performance penalty.
I answered a similar question here concerning unsynchronized threads but I think you'll be better off browsing on similar topics to get high-quality answers.
I'm not sure you understand the basics of the topic you claim to be discussing. Two threads, each starting at exactly the same time and looping one million times each performing an inc on the same variable will NOT result in a final value of two million (two * one million increments). The value will end up somewhere in-between one and two million.
The first increment will cause the value to be read from RAM into the L1 cache (via first the L3 then the L2) of the accessing thread/core. The increment is performed and the new value written initially to L1 for propagation to lower caches. When it reaches L3 (the highest cache common to both cores) the memory location will be invalidated in the other core's caches. This may seem safe, but in the meantime the other core has simultaneously performed an increment based on the same initial value of the variable. The invalidation caused by the write from the first core will be superseded by the write from the second core invalidating the data in the caches of the first core.
Sounds like a mess? It is! The cores are so fast that what happens in the caches falls way behind: the cores are where the action is. This is why you need explicit locks: to make sure that the new value winds up low enough in the memory hierarchy such that other cores will read the new value and nothing else. Or put another way: slow things down so the caches can catch up with the cores.
A compiler does not "feel." A compiler is rule-based and, if constructed correctly, will optimize to the extent that the rules allow and the compiler writers are able to construct the optimizer. If a variable is volatile and the code is multi-threaded the rules won't allow the compiler to skip a read. Simple as that even though on the face of it it may appear devilishly tricky.
I'll have to repeat myself and say that locks cannot be implemented in a compiler because they are specific to the OS. The generated code will call all functions without knowing if they are empty, contain lock code or will trigger a nuclear explosion. In the same way the code will not be aware of a lock being in progress since the core will insert wait states until the lock request has resulted in the lock being in place. The lock is something that exists in the core and in the mind of the programmer. The code shouldn't (and doesn't!) care.
I'm writing this answer because most of the help came from comments to questions, and not always from the authors of the answers. I already upvoted the answers that helped me most, and I'm making this a community wiki to not abuse the knowledge of others. (If you want to upvote this answer, consider also upvoting Billy's and Dietrich's answers too: they were the most helpful authors to me.)
There are two problems to address when values written from a thread need to be visible from another thread:
Caching (a value written from a CPU could never make it to another CPU);
Optimizations (a compiler could optimize away the reads to a variable if it feels it can't be changed).
The first one is rather easy. On modern Intel processors, there is a concept of cache coherence, which means changes to a cache propagate to other CPU caches.
Turns out the optimization part isn't too hard either. As long as the compiler cannot guarantee that a function call doesn't change the content of a variable, even in a single-threaded model, it won't optimize the reads away. In my examples, the compiler doesn't know that sleep cannot change i, and this is why reads are issued on every iteration. It doesn't need to be sleep though; any function for which the compiler doesn't have the implementation details would do. I suppose that a particularly well-suited function to use would be one that emits a memory barrier.
In the future, it's possible that compilers will have better knowledge of currently impenetrable functions. However, when that time will come, I expect that there will be standard ways to ensure that changes are propagated correctly. (This is coming with C++11 and the std::atomic<T> class. I don't know for C1x.)

C++ Thread, shared data

I have an application where 2 threads are running... Is there any certainty that when I change a global variable from one thread, the other will notice this change?
I don't have any synchronization or mutual exclusion system in place... but should this code work all the time? (Imagine a global bool named dataUpdated.)
Thread 1:
while (1) {
    if (dataUpdated)
        updateScreen();
    doSomethingElse();
}
Thread 2:
while (1) {
    if (doSomething())
        dataUpdated = TRUE;
}
Does a compiler like gcc optimize this code in a way that it doesn't check for the global value, only considering its value at compile time (because it never gets changed in the same thread)?
PS: This being for a game-like application, it really doesn't matter if there is a read while the value is being written... all that matters is that the change gets noticed by the other thread.
Yes. No. Maybe.
First, as others have mentioned you need to make dataUpdated volatile; otherwise the compiler may be free to lift reading it out of the loop (depending on whether or not it can see that doSomethingElse doesn't touch it).
Secondly, depending on your processor and ordering needs, you may need memory barriers. volatile is enough to guarantee that the other processor will see the change eventually, but not enough to guarantee that the changes will be seen in the order they were performed. Your example only has one flag, so it doesn't really show this phenomenon. If you need and use memory barriers, you should no longer need volatile.
Volatile Considered Harmful and Linux Kernel Memory Barriers are good background on the underlying issues; I don't really know of anything similar written specifically for threading. Thankfully threads don't raise these concerns nearly as often as hardware peripherals do, though the sort of case you describe (a flag indicating completion, with other data presumed to be valid if the flag is set) is exactly the sort of thing where ordering matters...
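For reference, here is the question's dataUpdated pattern expressed with std::atomic<bool> (C++11; a sketch of the flag-plus-data ordering point above, with the question's doSomething/updateScreen/doSomethingElse assumed to exist):
#include <atomic>

bool doSomething();        // as in the question
void updateScreen();
void doSomethingElse();

std::atomic<bool> dataUpdated{false};

// Thread 2: prepare the data, then publish the flag with release semantics so
// everything written by doSomething() is visible to whoever sees the flag.
void thread2_iteration()
{
    if (doSomething())
        dataUpdated.store(true, std::memory_order_release);
}

// Thread 1: the acquire load pairs with the release store above, so seeing the
// flag guarantees the data behind it is visible too. (Like the original code,
// this never clears the flag; use exchange(false) if that is needed.)
void thread1_iteration()
{
    if (dataUpdated.load(std::memory_order_acquire))
        updateScreen();
    doSomethingElse();
}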
Here is an example that uses boost condition variables:
bool _updated = false;
boost::mutex _access;
boost::condition _condition;

bool updated()
{
    return _updated;
}

void thread1()
{
    boost::mutex::scoped_lock lock(_access);
    while (true)
    {
        boost::xtime xt;
        boost::xtime_get(&xt, boost::TIME_UTC);
        xt.sec += 1; // wait up to one second for an update before doing other work
        // note that timed_wait takes a predicate function that is called - not the address of a variable to check
        if (_condition.timed_wait(lock, xt, &updated))
            updateScreen();
        doSomethingElse();
    }
}

void thread2()
{
    while (true)
    {
        if (doSomething())
        {
            // lock before touching the shared flag and notify the waiter so it
            // wakes up promptly instead of waiting out its timeout
            boost::mutex::scoped_lock lock(_access);
            _updated = true;
            _condition.notify_one();
        }
    }
}
Use a lock. Always always use a lock to access shared data. Marking the variable as volatile will prevent the compiler from optimizing away the memory read, but will not prevent other problems such as memory re-ordering. Without a lock there is no guarantee that the memory writes in doSomething() will be visible in the updateScreen() function.
The only other safe way is to use a memory fence, either explicitly or implicitly, using an Interlocked* function for example.
Use the volatile keyword to hint to the compiler that the value can change at any time.
volatile int myInteger;
The above will guarantee that any access to the variable will be to and from memory without any specific optimizations and as a result all threads running on the same processor will "see" changes to the variable with the same semantics as the code reads.
Chris Jester-Young pointed out that coherency concerns for such a variable value change may arise in multi-processor systems. This is a consideration and it depends on the platform.
Actually, there are really two considerations to think about relative to platform. They are coherency and atomicity of the memory transactions.
Atomicity is actually a consideration for both single- and multi-processor platforms. The issue arises because the variable is likely multi-byte in nature, and the question is whether one thread could see a partial update to the value or not. I.e.: some bytes changed, context switch, invalid value read by the interrupting thread. For a single variable that is at the natural machine word size or smaller and naturally aligned, this should not be a concern. Specifically, an int type should always be OK in this regard as long as it is aligned - which should be the default case for the compiler.
Relative to coherency, this is a potential concern in a multi-processor system. The question is if the system implements full cache coherency or not between processors. If implemented, this is typically done with the MESI protocol in hardware. The question didn't state platforms, but both Intel x86 platforms and PowerPC platforms are cache coherent across processors for normally mapped program data regions. Therefore this type of issue should not be a concern for ordinary data memory accesses between threads even if there are multiple processors.
The final issue relative to atomicity is specific to read-modify-write atomicity. That is, how do you guarantee that if a value is read, updated, and then written, this happens atomically, even across processors if there is more than one. So, for this to work without specific synchronization objects would require that all potential threads accessing the variable are readers ONLY, except that only one thread can ever be the writer at one time. If this is not the case, then you do need a sync object available to be able to ensure atomic actions on read-modify-write actions to the variable.
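To illustrate that last point, here is a small sketch of an atomic read-modify-write with std::atomic (C++11; not part of the original answer):
#include <atomic>

std::atomic<int> value{0};

// Double the current value atomically, even with several concurrent writers.
void double_value()
{
    int expected = value.load(std::memory_order_relaxed);
    // compare_exchange_weak retries if another thread changed 'value' between
    // our read and our attempted write, so no update is ever lost.
    while (!value.compare_exchange_weak(expected, expected * 2,
                                        std::memory_order_acq_rel,
                                        std::memory_order_relaxed))
    {
        // 'expected' now holds the latest value; loop and try again.
    }
}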
Your solution will use 100% CPU, among other problems. Google for "condition variable".
Chris Jester-Young pointed out that:
This only works under Java 1.5+'s memory model. The C++ standard does not address threading, and volatile does not guarantee memory coherency between processors. You do need a memory barrier for this.
Being so, the only true answer is implementing a synchronization system, right?
Use the volatile keyword to hint to the compiler that the value can change at any time.
volatile int myInteger;
No, it's not certain. If you declare the variable volatile, then the compiler is supposed to generate code that always loads the variable from memory on a read.
If the scope is right ("extern", global, etc.) then the change will be noticed. The question is when? And in what order?
The problem is that the compiler can and frequently will re-order your logic to fill all its concurrent pipelines, as a performance optimization.
It doesn't really show in your specific example because there aren't any other instructions around your assignment, but imagine functions called after your bool assignment executing before the assignment.
Check out Pipeline Hazard on Wikipedia or search Google for "compiler instruction reordering".
As others have said the volatile keyword is your friend. :-)
You'll most likely find that your code would work when you had all of the optimisation options disabled in gcc. In this case (I believe) it treats everything as volatile and as a result the variable is accessed in memory for every operation.
With any sort of optimisation turned on the compiler will attempt to use a local copy held in a register. Depending on your functions this may mean that you only see the change in variable intermittently or, at worst, never.
Using the keyword volatile indicates to the compiler that the contents of this variable can change at any time and that it should not use a locally cached copy.
With all of that said you may find better results (as alluded to by Jeff) through the use of a semaphore or condition variable.
This is a reasonable introduction to the subject.