I've been reading a lot about the 'volatile' keyword but I still don't have a definitive answer.
Consider this code:
class A
{
public:
    void work()
    {
        working = true;
        while (working)
        {
            processSomeJob();
        }
    }
    void stopWorking() // Can be called from another thread
    {
        working = false;
    }
private:
    bool working;
};
As work() enters its loop the value of 'working' is true.
Now I'm guessing the compiler is allowed to optimize the while(working) to while(true) as the value of 'working' is true when starting the loop.
If this is not the case, that would mean something like this would be quite inefficient:
for (int i = 0; i < someOtherClassMember; i++)
{
    doSomething();
}
...as the value of someOtherClassMember would have to be loaded each iteration.
If this is the case, I would think 'working' has to be volatile in order to prevent the compiler from optimising it.
Which of these two is the case? When googling the use of volatile I find people claiming it's only useful when working with I/O devices writing to memory directly, but I also find claims that it should be used in a scenario like mine.
Your program will get optimized into an infinite loop†.
void foo() { A{}.work(); }
gets compiled to (g++ with O2)
foo():
sub rsp, 8
.L2:
call processSomeJob()
jmp .L2
The standard defines what a hypothetical abstract machine would do with a program. Standard-compliant compilers have to compile your program to behave the same way as that machine in all observable behaviour. This is known as the as-if rule: the compiler has freedom as long as what your program observably does stays the same, regardless of how it gets there.
Normally, reading and writing a variable doesn't constitute observable behaviour, which is why a compiler can elide as many reads and writes as it likes. The compiler can see that working is never assigned to and optimizes the read away. The (often misunderstood) effect of volatile is exactly to make reads and writes observable, which forces the compiler to leave them alone‡.
But wait, you say, another thread may assign to working. This is where the leeway of undefined behaviour comes in. The compiler may do anything when there is undefined behaviour, including formatting your hard drive, and still be standard-compliant. Since there is no synchronization and working isn't atomic, any other thread writing to working is a data race, which is unconditionally undefined behaviour. Therefore, the only time the infinite loop would be wrong is when there is undefined behaviour, in which case the compiler decided your program might as well keep on looping.
TL;DR Don't use plain bool and volatile for multi-threading. Use std::atomic<bool>.
†Not in all situations. void bar(A& a) { a.work(); } isn't compiled to an infinite loop by some versions.
‡Actually, there is some debate around this.
Now I'm guessing the compiler is allowed to optimize the while(working) to while(true)
Potentially, yes. But only if it can prove that processSomeJob() does not modify the working variable i.e. if it can prove that the loop is infinite.
If this is not the case, that would mean something like this would be quite inefficient ... as the value of someOtherClassMember would have to be loaded each iteration
Your reasoning is sound. However, the memory location might remain in cache, and reading from CPU cache isn't necessarily significantly slow. If doSomething is complex enough to cause someOtherClassMember to be evicted from the cache, then sure we'd have to load from memory, but on the other hand doSomething might be so complex that a single memory load is insignificant in comparison.
Which of these two is the case?
Either. The optimiser will not be able to analyse all possible code paths; we cannot assume that the loop could be optimised in all cases. But if someOtherClassMember is provably not modified in any code paths, then proving it would be possible in theory, and therefore the loop can be optimised in theory.
but I also find claims that [volatile] should be used in a scenario like mine.
volatile doesn't help you here. If working is modified in another thread, then there is a data race. And data race means that the behaviour of the program is undefined.
To avoid a data race, you need synchronisation: Either use a mutex, or atomic operations to share access across threads.
Volatile will make the while loop reload the working variable on every check. Practically that will often allow you to stop the working function with a call to stopWorking made from an asynchronous signal handler or another thread, but as per the standard it's not enough. The standard requires lock-free atomics or variables of type volatile sig_atomic_t for sighandler <-> regular context communication and atomics for inter-thread communication.
Related
In a lock-free queue.pop(), I read a trivially copyable variable (of integral type) after synchronizing with an atomic acquire load, inside a loop.
Minimized pseudo code:
// somewhere else: writePosition.store(..., release)
bool pop(size_t& returnValue) {
    writePosition = writePosition.load(acquire);
    oldReadPosition = readPosition.load(relaxed);
    size_t value{};
    do {
        value = data[oldReadPosition];
        newReadPosition = oldReadPosition + 1;
    } while (!readPosition.compare_exchange(oldReadPosition, newReadPosition, relaxed));
    // here we are the owner of the value
    returnValue = value;
    return true;
}
The memory of data[oldReadPosition] can only be changed if that value was read by another thread before. The read and write positions are ABA safe.
With a simple copy, value = data[oldReadPosition], the memory of data[oldReadPosition] will not be changed.
But a writer thread calling queue.push(...) can change data[oldReadPosition] while we are reading it, if another thread has already read oldReadPosition and advanced readPosition.
It would be a race condition if you used the value, but is it also a race condition, and thus undefined behavior, when we leave the value untouched? The standard is not specific enough, or I don't understand it.
Imo this should be allowed, because it has no effect.
I would be very happy to get a qualified answer with deeper insights.
Thanks a lot
Yes, it's UB in ISO C++; value = data[oldReadPosition] in the C++ abstract machine involves reading the value of that object. (Usually that means lvalue to rvalue conversion, IIRC.)
But it's mostly harmless, probably only going to be a problem on machines with hardware race detection (not normal mainstream CPUs, but possibly on C implementations like clang with threadsanitizer).
Another use-case for a non-atomic read followed by a check for possible tearing is the SeqLock, where readers can prove no tearing occurred by reading the same value from an atomic counter before and after the non-atomic read. It's UB in C++, even with volatile for the non-atomic data, although volatile may be helpful in making sure the compiler-generated asm is safe. (With memory barriers and the way existing compilers currently handle atomics, even non-volatile data produces working asm.) See Optimal way to pass a few variables between 2 threads pinning different CPUs
atomic_thread_fence is still necessary for a SeqLock to be safe, and some of the necessary ordering of atomic loads wrt. non-atomic may be an implementation detail if it can't sync with something and create a happens-before.
People do use SeqLocks in real life, depending on the fact that real-life compilers de facto define a bit more behaviour than ISO C++. Another way to put it: they happen to work for now; if you're careful about what code you put around the non-atomic read, it's unlikely a compiler will be able to do anything problematic.
But you're definitely venturing out past the safe area of guaranteed behaviour, and probably need to understand how C++ compiles to asm, and how asm works on the target platforms you care about; see also Who's afraid of a big bad optimizing compiler? on LWN; it's aimed at Linux kernel code, which is the main user of hand-rolled atomics and stuff like that.
Say I have
#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

bool unsafeBool = false;

int main()
{
    std::thread reader = std::thread([]() {
        std::this_thread::sleep_for(1ns);
        if (unsafeBool)
            std::cout << "unsafe bool is true" << std::endl;
    });

    std::thread writer = std::thread([]() {
        unsafeBool = true;
    });

    reader.join();
    writer.join();
}
Is it guaranteed that unsafeBool becomes true after writer finishes? I know that what reader outputs is undefined behavior, but the write itself should be fine as far as I understand.
UB is and stays UB. There can be reasoning about why things are happening the way they happen, though you aren't allowed to rely on that.
You have a race condition; fix it by either adding a lock or changing the type to an atomic.
Since you have UB in your code, your compiler is allowed to assume it doesn't happen. If it can detect this, it can change your complete function into a noop, as the function could never be called in a valid program.
If it doesn't do so, the behaviour will depend on your processor and its caching. There, if the code after the joins runs on the same core as the thread that read the boolean (before the join), you might even still see false without any cache invalidation.
In practice, on Intel x86 processors, you will not see many side effects from such race conditions, as the cache coherency protocol invalidates caches on write.
After writer.join() it is guaranteed that unsafeBool == true. But in the reader thread the access to it is a data race.
Some implementations guarantee that any attempt to read the value of a word-size-or-smaller object that isn't qualified volatile around the time that it changes will either yield an old or new value, chosen arbitrarily. In cases where this guarantee would be useful, the cost for a compiler to consistently uphold it would generally be less than the cost of working around its absence (among other things, because any ways by which programmers could work around its absence would restrict a compiler's freedom to choose between an old or new value).
In some other implementations, however, even operations that would seem like they should involve a single read of a value might yield code that combines the results from multiple reads. When ARM gcc 9.2.1 is invoked with command-line arguments -xc -O2 -mcpu=cortex-m0 and given:
#include <stdint.h>
#include <string.h>
uint16_t test(uint16_t *p)
{
    uint16_t q = *p;
    return q - (q >> 15);
}
it generates code which reads from *p and then from *(int16_t*)p, shifts the latter right by 15, and adds that to the former. If the value of *p were to change between the two reads, this could cause the function to return 0xFFFF, a value which should be impossible.
Unfortunately, many people who design compilers so that they will always refrain from "splitting" reads in such fashion think such behavior is sufficiently natural and obvious that there's no particular reason to expressly document the fact that they never do anything else. Meanwhile, some other compiler writers figure that because the Standard allows compilers to split reads even when there's no reason to (splitting the read in the above code makes it bigger and slower than it would be if it simply read the value once) any code that would rely upon compilers refraining from such "optimizations" is "broken".
Given a single core CPU embedded environment where reading and writing of variables is guaranteed to be atomic, and the following example:
struct Example
{
    bool TheFlag;

    void SetTheFlag(bool f) {
        TheFlag = f;
    }

    void UseTheFlag() {
        if (TheFlag) {
            // Do some stuff that has no effect on TheFlag
        }
        // Do some more stuff that has no effect on TheFlag
        if (TheFlag) {
            ...
        }
    }
};
It is clear that if SetTheFlag was called by chance on another thread (or interrupt) between the two uses of TheFlag in UseTheFlag, there could be unexpected behavior (or some could argue it is expected behavior in this case!).
Can the following workaround be used to guarantee behavior?
void UseTheFlag() {
    auto f = TheFlag;
    if (f) {
        // Do some stuff that has no effect on TheFlag
    }
    // Do some more stuff that has no effect on TheFlag
    if (f) {
        ...
    }
}
My practical testing showed that the variable f is never optimised out and is copied once from TheFlag (GCC 10, ARM Cortex M4). But I would like to know for sure: is it guaranteed by the compiler that f will not be optimised out?
I know there are better design practices, critical sections, disabling interrupts etc, but this question is about the behavior of compiler optimisation in this use case.
You might consider this from the point of view of the "as-if" rule, which, loosely stated, says that any optimisations applied by the compiler must not change the original meaning of the code.
So, unless the compiler can prove that TheFlag doesn't change during the lifetime of f, it is obliged to make a local copy.
That said, I'm not sure if 'proof' extends to modifications made to TheFlag in another thread or ISR. Marking TheFlag as atomic (or volatile, for an ISR) might help there.
The C++ standard does not say anything about what will happen in this case. It's just UB, since an object can be modified in one thread while another thread is accessing it.
You only say the platform specifies that these operations are atomic. Obviously, that isn't enough to ensure this code operates correctly. Atomicity only guarantees that two concurrent writes will leave the value as one of the two written values and that a read during one or more writes will never see a value not written. It says nothing about what happens in cases like this.
There is nothing wrong with any optimization that breaks this code. In particularly, atomicity does not prevent a read operation in another thread from seeing a value written before that read operation unless something known to synchronize was used.
If the compiler sees register pressure, nothing prevents it from simply reading TheFlag twice rather than creating a local copy. If the compiler can deduce that the intervening code in this thread cannot modify TheFlag, the optimization is legal. Optimizers don't have to take into account what other threads might do unless you follow the rules and use constructs defined to synchronize, or rely only on the explicit guarantees atomicity implies.
You go beyond that, so all bets are off. You need more than atomicity for TheFlag, so don't use a type that is merely atomic -- it isn't enough.
I just read Do not use volatile as a synchronization primitive article on CERT site and noticed that a compiler can theoretically optimize the following code in the way that it'll store a flag variable in the registers instead of modifying actual memory shared between different threads:
bool flag = false; // not declared volatile -- but even declaring it volatile would not make this code correct

void test() {
    while (!flag) {
        Sleep(1000); // sleeps for 1000 milliseconds
    }
}

void Wakeup() {
    flag = true;
}

void debit(int amount) {
    test();
    account_balance -= amount; // we think it is safe to enter the critical section
}
Am I right?
Is it true that I need to use volatile keyword for every object in my program that shares its memory between different threads? Not because it does some kind of synchronization for me (I need to use mutexes or any other synchronization primitives to accomplish such task anyway) but just because of the fact that a compiler can possibly optimize my code and store all shared variables in the registers so other threads will never get updated values?
It's not just about storing them in the registers, there are all sorts of levels of caching between the shared main memory and the CPU. Much of that caching is per CPU-core so any change made there will not be seen by other cores for a long time (or potentially if other cores are modifying the same memory then those changes may be lost completely).
There are no guarantees about how that caching will behave, and even if something is true for current processors it may well not be true for older processors or for the next generation. In order to write safe multithreaded code you need to do it properly. The easiest way is to use the libraries and tools provided for the purpose. Trying to do it yourself using low-level primitives like volatile is very hard and requires a lot of in-depth knowledge.
It is actually very simple, but confusing at the same time. On a high level, there are two optimization entities at play when you write C++ code: the compiler and the CPU. And within the compiler, there are two major optimization techniques with regard to variable access: omitting accesses to a variable even if they are written in the code, and moving other instructions around a particular variable access.
In particular, following example demonstrates those two techniques:
int k;
bool flag;

void foo() {
    flag = true;
    int i = k;
    k++;
    k = i;
    flag = false;
}
In the code provided, the compiler is free to skip the first modification of flag, leaving only the final assignment to false, and to remove the modifications of k completely. If you make k volatile, you require the compiler to preserve all accesses to k: it will be incremented, and then the original value put back. If you make flag volatile as well, both assignments, first to true, then to false, will remain in the code. However, reordering would still be possible, and the effective code might look like
void foo() {
    flag = true;
    flag = false;
    int i = k;
    k++;
    k = i;
}
This will have an unpleasant effect if another thread expects flag to indicate whether k is being modified right now.
One way to achieve the desired effect would be to define both variables as atomic. This would prevent the compiler from applying either optimization, ensuring that the code executed is the same as the code written. Note that atomic is, in effect, a volatile+: it does everything volatile does, and more.
Another thing to notice is that compiler optimizations are indeed a very powerful and desirable tool. One should not impede them just for the fun of it, so atomics should be used only where required.
On your particular
bool flag = false;
example, declaring it as volatile will universally work and is 100% correct.
But it will not buy you that all the time.
Volatile IMPOSES on the compiler that each and every evaluation of an object (or mere C variable) is either done directly on the memory/register or preceded by retrieval from the external-memory medium into internal memory/registers. In some cases the code and memory footprint can be quite a bit larger, but the real issue is that it's not enough.
When some time-based context switching is going on (e.g. threads), and your volatile object/variable is aligned and fits in a CPU register, you get what you intended. Under these strict conditions, a change or evaluation is done atomically, so in a context-switching scenario the other thread will be immediately "aware" of any changes.
However, if your object or big variable does not fit in a CPU register (because of its size or alignment), a thread context switch on a volatile may still be a no-no: an evaluation in the concurrent thread may catch a half-finished change, e.g. while a 5-member struct is being copied, the concurrent thread is invoked just as the 3rd member is changing. Kaboom!
The conclusion is (back to "Operating Systems 101"): you need to identify your shared objects, choose a preemptive+blocking or non-preemptive or other concurrent-resource access strategy, and make your evaluators/mutators atomic. The access methods (change/eval) usually incorporate the make-atomic strategy, or (if the object is aligned and small) you simply declare it volatile.
The code below is used to assign work to multiple threads, wake them up, and wait until they are done. The "work" in this case consists of "cleaning a volume". What exactly this operation does is irrelevant for this question -- it just helps with the context. The code is part of a huge transaction processing system.
void bf_tree_cleaner::force_all()
{
    for (int i = 0; i < vol_m::MAX_VOLS; i++) {
        _requested_volumes[i] = true;
    }
    // fence here (seq_cst)
    wakeup_cleaners();
    while (true) {
        usleep(10000); // 10 ms
        bool remains = false;
        for (int vol = 0; vol < vol_m::MAX_VOLS; ++vol) {
            // fence here (seq_cst)
            if (_requested_volumes[vol]) {
                remains = true;
                break;
            }
        }
        if (!remains) {
            break;
        }
    }
}
A value in a boolean array _requested_volumes[i] tells whether thread i has work to do. When it is done, the worker thread sets it to false and goes back to sleep.
The problem I am having is that the compiler generates an infinite loop, where the variable remains is always true, even though all values in the array have been set to false. This only happens with -O3.
I have tried two solutions to fix that:
Declare _requested_volumes volatile
(EDIT: this solution does work actually. See edit below)
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses. But there's a lot of dispute over this on the Internet. The way I understand it, volatile is the only way to prevent the compiler from optimizing away accesses to memory that is changed outside of the current scope, regardless of concurrent access. In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
Introduce memory fences
The method wakeup_cleaners() acquires a pthread_mutex_t internally in order to set a wake-up flag in the worker threads, so it should implicitly produce proper memory fences. But I'm not sure if those fences affect memory accesses in the caller method (force_all()). Therefore, I manually introduced fences in the locations specified by the comments above. This should make sure that writes performed by the worker thread in _requested_volumes are visible in the main thread.
What puzzles me is that none of these solutions works, and I have absolutely no idea why. The semantics and proper use of memory fences and volatile is confusing me right now. The problem is that the compiler is applying an undesired optimization -- hence the volatile attempt. But it could also be a problem of thread synchronization -- hence the memory fence attempt.
I could try a third solution in which a mutex protects every access to _requested_volumes, but even if that works, I would like to understand why, because as far as I understand, it's all about memory fences. Thus, it should make no difference whether it's done explicitly or implicitly via a mutex.
EDIT: My assumptions were wrong and Solution 1 actually does work. However, my question remains, in order to clarify the use of volatile vs. memory fences. If volatile is such a bad thing that it should never be used in multithreaded programming, what else should I use here? Do memory fences also affect compiler optimizations? Because I see these as two orthogonal issues, and therefore orthogonal solutions: fences for visibility across threads, and volatile for preventing optimizations.
Many experts say that volatile has nothing to do with thread synchronization, and it should only be used in low-level hardware accesses.
Yes.
But there's a lot of dispute over this on the Internet.
Not, generally, between "the experts".
The way I understand it, volatile is the only way to refrain the compiler from optimizing away accesses to memory which is changed outside of the current scope, regardless of concurrent access.
Nope.
Non-pure, non-constexpr non-inlined function calls (getters/accessors) also necessarily have this effect. Admittedly link-time optimization confuses the issue of which functions may really get inlined.
In C, and by extension C++, volatile affects memory access optimization. Java took this keyword, and since it can't (or couldn't) do the tasks C uses volatile for in the first place, altered it to provide a memory fence.
The correct way to get the same effect in C++ is using std::atomic.
In that sense, volatile should do the trick, even if we disagree on best practices for concurrent programming.
No, it may have the desired effect, depending on how it interacts with your platform's cache hardware. This is brittle - it could change any time you upgrade a CPU, or add another one, or change your scheduler behaviour - and it certainly isn't portable.
If you're really just tracking how many workers are still working, sane methods might be a semaphore (synchronized counter), or a mutex+condvar+integer count. Either is likely more efficient than busy-looping with a sleep.
If you're wedded to the busy loop, you could still reasonably have a single counter, such as std::atomic<size_t>, which is set by wakeup_cleaners and decremented as each cleaner completes. Then you can just wait for it to reach zero.
If you really want a busy loop and really prefer to scan the array each time, it should be an array of std::atomic<bool>. That way you can decide what consistency you need from each load, and it will control both the compiler optimizations and the memory hardware appropriately.
Apparently, volatile does what is necessary for your example. The topic of the volatile qualifier itself is too broad: you can start by searching "C++ volatile vs atomic" etc. There are a lot of articles and questions & answers on the internet, e.g. Concurrency: Atomic and volatile in C++11 memory model.
Briefly, volatile tells the compiler to disable some aggressive optimizations; in particular, to read the variable each time it is accessed (rather than keeping it in a register or cache). There are compilers which do more, making volatile act more like std::atomic: see the Microsoft Specific section here. In your case, disabling an aggressive optimization is exactly what was necessary.
However, volatile doesn't define the order of execution of the statements around it. That is why you need memory ordering in case you need to do something else with the data after the flags you check have been set.
For inter-thread communication it is appropriate to use std::atomic, particularly, you need to refactor _requested_volumes[vol] to be of type std::atomic<bool> or even std::atomic_flag: http://en.cppreference.com/w/cpp/atomic/atomic .
An article that discourages usage of volatile and explains that volatile can be used only in rare special cases (connected with hardware I/O): https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt