Are copies of variables optimised out? - c++

Given a single core CPU embedded environment where reading and writing of variables is guaranteed to be atomic, and the following example:
struct Example
{
    bool TheFlag;

    void SetTheFlag(bool f) {
        TheFlag = f;
    }

    void UseTheFlag() {
        if (TheFlag) {
            // Do some stuff that has no effect on TheFlag
        }
        // Do some more stuff that has no effect on TheFlag
        if (TheFlag) {
            ...
        }
    }
};
It is clear that if SetTheFlag was called by chance on another thread (or interrupt) between the two uses of TheFlag in UseTheFlag, there could be unexpected behavior (or some could argue it is expected behavior in this case!).
Can the following workaround be used to guarantee behavior?
void UseTheFlag() {
    auto f = TheFlag;
    if (f) {
        // Do some stuff that has no effect on TheFlag
    }
    // Do some more stuff that has no effect on TheFlag
    if (f) {
        ...
    }
}
My practical testing showed that f is never optimised out and is copied from TheFlag exactly once (GCC 10, ARM Cortex-M4). But I would like to know for sure: is it guaranteed by the compiler that f will not be optimised out?
I know there are better design practices (critical sections, disabling interrupts, etc.), but this question is about the behavior of compiler optimisation in this use case.

You might consider this from the point of view of the "as-if" rule, which, loosely stated, says that any optimisations applied by the compiler must not change the observable behaviour of the code.
So, unless the compiler can prove that TheFlag doesn't change during the lifetime of f, it is obliged to make a local copy.
That said, I'm not sure if 'proof' extends to modifications made to TheFlag in another thread or ISR. Marking TheFlag as atomic (or volatile, for an ISR) might help there.
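For illustration, here is a sketch of the atomic variant of the question's struct (my example, not from the original answer; memory_order_relaxed is an assumption that the flag alone carries the information and publishes no other data):

#include <atomic>

struct Example
{
    std::atomic<bool> TheFlag{false};

    void SetTheFlag(bool f) {
        // Atomic store: well-defined even if called from another thread.
        TheFlag.store(f, std::memory_order_relaxed);
    }

    void UseTheFlag() {
        // Exactly one load; f cannot change for the rest of the call.
        const bool f = TheFlag.load(std::memory_order_relaxed);
        if (f) {
            // Do some stuff that has no effect on TheFlag
        }
        // Do some more stuff that has no effect on TheFlag
        if (f) {
            // ...
        }
    }
};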

The C++ standard does not say anything about what will happen in this case. It's simply UB: an object is modified in one thread while another thread is accessing it.
You only say the platform specifies that these operations are atomic. Obviously, that isn't enough to ensure this code operates correctly. Atomicity only guarantees that two concurrent writes will leave the value as one of the two written values and that a read during one or more writes will never see a value not written. It says nothing about what happens in cases like this.
There is nothing wrong with any optimization that breaks this code. In particular, atomicity does not prevent a read operation in another thread from seeing a value written before that read operation unless something known to synchronize was used.
If the compiler sees register pressure, nothing prevents it from simply reading TheFlag twice rather than keeping a local copy. If the compiler can deduce that the intervening code in this thread cannot modify TheFlag, the optimization is legal. Optimizers don't have to take into account what other threads might do unless you follow the rules and use things defined to synchronize, or rely only on the explicit guarantees atomicity supplies.
Your code goes beyond that, so all bets are off. You need more than atomicity for TheFlag, so don't use a type that is merely atomic -- it isn't enough.
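As a sketch of what "more than atomicity" can look like (my illustration, not part of the answer), a mutex makes the copy into f a single well-defined read; on a single-core embedded target, briefly disabling interrupts could play the same role:

#include <mutex>

struct Example
{
    bool TheFlag = false;
    std::mutex Mutex;

    void SetTheFlag(bool f) {
        std::lock_guard<std::mutex> lock(Mutex);
        TheFlag = f;
    }

    void UseTheFlag() {
        bool f;
        {
            std::lock_guard<std::mutex> lock(Mutex);
            f = TheFlag; // one well-defined read under the lock
        }
        if (f) {
            // ...
        }
        if (f) {
            // ...
        }
    }
};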

Related

Is write guaranteed with one thread writing and another reading non-atomic

Say I have
#include <chrono>
#include <iostream>
#include <thread>

using namespace std::chrono_literals;

bool unsafeBool = false;

int main()
{
    std::thread reader = std::thread([](){
        std::this_thread::sleep_for(1ns);
        if(unsafeBool)
            std::cout << "unsafe bool is true" << std::endl;
    });

    std::thread writer = std::thread([](){
        unsafeBool = true;
    });

    reader.join();
    writer.join();
}
Is it guaranteed that unsafeBool becomes true after writer finishes? I know that what reader outputs is undefined behavior, but the write should be fine as far as I understand.
UB is and stays UB. There can be reasoning about why things happen the way they happen, but you aren't allowed to rely on that.
You have a race condition; fix it by either adding a lock or changing the type to an atomic.
Since you have UB in your code, your compiler is allowed to assume it doesn't happen. If it can detect this, it can change your complete function to a no-op, as the function can never be called in a valid program.
If it doesn't do so, the behaviour will depend on your processor and its caching. If the code after the joins runs on the same core as the thread that read the boolean (before the join), you might even still see false there, with no cache invalidation ever forcing an update.
In practice, on Intel x86 processors you will not see many side effects from race conditions like this, as the architecture keeps caches coherent: a write invalidates other cores' cached copies.
After writer.join() it is guaranteed that unsafeBool == true. But in the reader thread the access to it is a data race.
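For comparison, a race-free variant of the example is sketched below (my code, not from the answers): with std::atomic<bool> both the store and the load are well-defined, and the join still guarantees the final value.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> safeBool{false};

int main()
{
    using namespace std::chrono_literals;

    std::thread reader([]{
        std::this_thread::sleep_for(1ns);
        if (safeBool.load())              // well-defined read
            std::cout << "safe bool is true\n";
    });

    std::thread writer([]{
        safeBool.store(true);             // well-defined write
    });

    reader.join();
    writer.join();   // afterwards, safeBool.load() == true is guaranteed
}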
Some implementations guarantee that any attempt to read the value of a word-size-or-smaller object that isn't qualified volatile around the time that it changes will either yield an old or new value, chosen arbitrarily. In cases where this guarantee would be useful, the cost for a compiler to consistently uphold it would generally be less than the cost of working around its absence (among other things, because any ways by which programmers could work around its absence would restrict a compiler's freedom to choose between an old or new value).
In some other implementations, however, even operations that would seem like they should involve a single read of a value might yield code that combines the results from multiple reads. When ARM gcc 9.2.1 is invoked with command-line arguments -xc -O2 -mcpu=cortex-m0 and given:
#include <stdint.h>
#include <string.h>

uint16_t test(uint16_t *p)
{
    uint16_t q = *p;
    return q - (q >> 15);
}
it generates code which reads from *p and then from *(int16_t*)p, shifts the latter right by 15, and adds that to the former. If the value of *p were to change between the two reads, this could cause the function to return 0xFFFF, a value which should be impossible.
Unfortunately, many people who design compilers to always refrain from "splitting" reads in this fashion think such behavior is so natural and obvious that there's no particular reason to expressly document that they never do anything else. Meanwhile, some other compiler writers figure that because the Standard allows compilers to split reads even when there's no reason to (splitting the read in the above code makes it bigger and slower than simply reading the value once), any code that relies upon compilers refraining from such "optimizations" is "broken".
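If one cannot rely on a particular compiler documenting that it never splits reads, the requirement can be stated explicitly with a relaxed atomic load (a C++ sketch of the same function; the answer's example is C):

#include <atomic>
#include <cstdint>

uint16_t test(const std::atomic<uint16_t>* p)
{
    // A relaxed load is still one indivisible read, so q is a single
    // snapshot of *p and the "impossible" 0xFFFF result cannot occur.
    uint16_t q = p->load(std::memory_order_relaxed);
    return q - (q >> 15);
}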

Once more volatile: necessary to prevent optimization?

I've been reading a lot about the 'volatile' keyword but I still don't have a definitive answer.
Consider this code:
class A
{
public:
    void work()
    {
        working = true;
        while(working)
        {
            processSomeJob();
        }
    }

    void stopWorking() // Can be called from another thread
    {
        working = false;
    }

private:
    bool working;
};
As work() enters its loop the value of 'working' is true.
Now I'm guessing the compiler is allowed to optimize the while(working) to while(true) as the value of 'working' is true when starting the loop.
If this is not the case, that would mean something like this would be quite inefficient:
for(int i = 0; i < someOtherClassMember; i++)
{
doSomething();
}
...as the value of someOtherClassMember would have to be loaded each iteration.
If this is the case, I would think 'working' has to be volatile in order to prevent the compiler from optimising it.
Which of these two is the case? When googling the use of volatile I find people claiming it's only useful when working with I/O devices writing to memory directly, but I also find claims that it should be used in a scenario like mine.
Your program will get optimized into an infinite loop†.
void foo() { A{}.work(); }
gets compiled to (g++ with -O2)
foo():
    sub rsp, 8
.L2:
    call processSomeJob()
    jmp .L2
The standard defines what a hypothetical abstract machine would do with a program. Standard-compliant compilers have to compile your program to behave the same way as that machine in all observable behaviour. This is known as the as-if rule: the compiler has freedom as long as what your program does is the same, regardless of how.
Normally, reading and writing a variable doesn't count as observable behaviour, which is why a compiler can elide as many reads and writes as it likes. The compiler can see that working is never assigned to within the loop and optimizes the read away. The (often misunderstood) effect of volatile is exactly to make reads and writes observable, which forces the compiler to leave them alone‡.
But wait, you say, another thread may assign to working. This is where the leeway of undefined behaviour comes in. The compiler may do anything when there is undefined behaviour, including formatting your hard drive, and still be standard-compliant. Since there is no synchronization and working isn't atomic, any other thread writing to working is a data race, which is unconditionally undefined behaviour. Therefore, the only time the infinite loop would be wrong is when there is undefined behaviour, at which point the compiler has decided your program might as well keep on looping.
TL;DR Don't use plain bool and volatile for multi-threading. Use std::atomic<bool>.
†Not in all situations. void bar(A& a) { a.work(); } doesn't for some versions.
‡Actually, there is some debate around this.
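A sketch of the atomic version the TL;DR recommends (my code; processSomeJob is assumed to be defined elsewhere, and relaxed ordering assumes no other data is published through the flag):

#include <atomic>

void processSomeJob(); // assumed to exist elsewhere

class A
{
public:
    void work()
    {
        working.store(true, std::memory_order_relaxed);
        while (working.load(std::memory_order_relaxed)) // re-read every iteration
        {
            processSomeJob();
        }
    }

    void stopWorking() // safe to call from another thread
    {
        working.store(false, std::memory_order_relaxed);
    }

private:
    std::atomic<bool> working{false};
};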
Now I'm guessing the compiler is allowed to optimize the while(working) to while(true)
Potentially, yes, but only if it can prove that processSomeJob() does not modify the working variable, i.e. if it can prove that the loop is infinite.
If this is not the case, that would mean something like this would be quite inefficient ... as the value of someOtherClassMember would have to be loaded each iteration
Your reasoning is sound. However, the memory location might remain in cache, and reading from CPU cache isn't necessarily significantly slow. If doSomething is complex enough to cause someOtherClassMember to be evicted from the cache, then sure we'd have to load from memory, but on the other hand doSomething might be so complex that a single memory load is insignificant in comparison.
Which of these two is the case?
Either. The optimiser will not be able to analyse all possible code paths; we cannot assume that the loop could be optimised in all cases. But if someOtherClassMember is provably not modified in any code path, then the proof would be possible in theory, and therefore the loop could be optimised in theory.
but I also find claims that [volatile] should be used in a scenario like mine.
volatile doesn't help you here. If working is modified in another thread, then there is a data race. And data race means that the behaviour of the program is undefined.
To avoid a data race, you need synchronisation: Either use a mutex, or atomic operations to share access across threads.
Volatile will make the while loop reload working on every check. Practically that will often allow you to stop work() with a call to stopWorking() made from an asynchronous signal handler or another thread, but as per the standard it's not enough. The standard requires lock-free atomics or variables of type volatile sig_atomic_t for signal-handler-to-program communication, and atomics for inter-thread communication.
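For the signal-handler case mentioned above, the standard-sanctioned pattern looks roughly like this (a sketch; the handler touches nothing but a volatile sig_atomic_t):

#include <csignal>

volatile std::sig_atomic_t stopRequested = 0;

extern "C" void onSigint(int)
{
    stopRequested = 1; // the only kind of object a handler may safely write
}

int main()
{
    std::signal(SIGINT, onSigint);
    while (!stopRequested)
    {
        // ... do work ...
    }
}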

Can a compiler read twice from a global variable, instead of storing a local one?

I've been trying to get re-familiarized with multi-threading recently and found this paper. One of the examples says to be careful when using code like this:
int my_counter = counter;      // Read global
int (* my_func) (int);
if (my_counter > my_old_counter) {
    ... // Consume data
    my_func = ...;
    ... // Do some more consumer work
}
... // Do some other work
if (my_counter > my_old_counter) {
    ... my_func(...) ...
}
Stating that:
If the compiler decides that it needs to spill the register containing my_counter between the two tests, it may well decide to avoid storing the value (it's just a copy of counter, after all), and to instead simply re-read the value of counter for the second comparison involving my_counter [...]
Doing this would turn the code into:
int my_counter = counter;      // Read global
int (* my_func) (int);
if (my_counter > my_old_counter) {
    ... // Consume data
    my_func = ...;
    ... // Do some more consumer work
}
... // Do some other work
my_counter = counter;          // Reread global!
if (my_counter > my_old_counter) {
    ... my_func(...) ...
}
I, however, am skeptical about this. I don't understand why the compiler is allowed to do this, since to my understanding a data race only occurs when the same memory location is accessed by any number of reads and at least one write at the same time. The author goes on to motivate it thus:
the core problem arises from the compiler taking advantage of the assumption that variable values cannot asynchronously change without an explicit assignment
It seems to me that the condition is respected in this case: the global is read only once into the local my_counter, which cannot be accessed by other threads. How would the compiler know that the global variable cannot be set elsewhere, in another translation unit, by another thread? It cannot, and in fact I would assume that the second if would simply be optimized away.
Is the author wrong, or am I missing something?
Unless counter is explicitly volatile, the compiler may assume that it never changes if nothing in the current scope of execution could change it. That means that if the variable cannot be aliased, or there are no intervening function calls whose effects the compiler can't know, any external modification is undefined behavior. With volatile you would be declaring that external changes are possible, even if the compiler can't know how.
So that optimization is perfectly valid. In fact, even if it did actually keep the copy, the code still wouldn't be thread-safe: the value might change partially mid-read, or might even be completely stale, since cache coherency is not guaranteed without synchronisation primitives or atomics.
Well, actually on x86 you won't get a torn value for an integer, at least as long as it is aligned; that's one of the guarantees the architecture makes. Stale values still apply, though: the value may already have been modified by another thread.
Use either a mutex or an atomic if you need this behavior.
Compilers [are allowed to] optimize presuming that anything which is "undefined behavior" simply cannot happen: that the programmer will prevent the code from being executed in such a way that would invoke the undefined behavior.
This can lead to rather silly executions where, for example, the following loop never terminates!
int vals[10];
for(int i = 0; i < 11; i++) {
    vals[i] = i;
}
This is because the compiler knows that writing vals[10] would be undefined behavior, therefore it assumes it cannot happen, and since it cannot happen, i will never reach 11, therefore this loop never terminates. Not all compilers will aggressively optimize a loop like this in this way, though I do know that GCC does.
In the specific case you're working with, reading a global variable in this way can be undefined behavior if it is possible for another thread to modify it in the interim. As a result, the compiler assumes that cross-thread modifications never happen (because they would be undefined behavior, and compilers can optimize presuming UB doesn't happen), and thus it's perfectly safe to re-read the value (which it knows doesn't get modified by its own code).
The solution is to make counter atomic (std::atomic<int>), which forces the compiler to acknowledge that there might be some kind of cross-thread manipulation of the variable.
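A sketch of that fix, reusing the names from the quoted example (the elided parts stay elided, as in the paper):

#include <atomic>

std::atomic<int> counter{0}; // was: plain int global

void consumer(int my_old_counter)
{
    // One explicit load; the compiler may not invent a second read,
    // so both tests below see the same snapshot.
    int my_counter = counter.load(std::memory_order_acquire);
    if (my_counter > my_old_counter) {
        // ... consume data, set my_func ...
    }
    // ... do some other work ...
    if (my_counter > my_old_counter) {
        // ... my_func(...) ...
    }
}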

Reading shared variables with relaxed ordering: is it possible in theory? Is it possible in C++?

Consider the following pseudocode:
expected = null;
if (variable == expected)
{
    atomic_compare_exchange_strong(
        &variable, expected, desired(), memory_order_acq_rel, memory_order_acq);
}
return variable;
Observe there are no "acquire" semantics when the variable == expected check is performed.
It seems to me that desired will be called at least once in total, and at most once per thread.
Furthermore, if desired never returns null, then this code will never return null.
Now, I have three questions:
Is the above necessarily true? i.e., can we really have well-ordered reads of shared variables even in the absence of fences on every read?
Is it possible to implement this in C++? If so, how? If not, why?
(Hopefully with a rationale, not just "because the standard says so".)
If the answer to (2) is yes, then is it also possible to implement this in C++ without requiring variable == expected to perform an atomic read of variable?
Basically, my goal is to understand whether it is possible to perform lazy initialization of a shared variable in a manner that has performance identical to that of a non-shared variable once the code has been executed at least once by each thread.
(This is somewhat of a "language-lawyer" question. So that implies the question isn't about whether this is a good or useful idea, but rather about whether it's technically possible to do this correctly.)
Regarding the question whether it is possible to perform lazy initialisation of a shared variable in C++ with performance (almost) identical to that of a non-shared variable:
The answer is that it depends on the hardware architecture and on the implementation of the compiler and run-time environment. At least, it is possible in some environments; in particular, on x86 with GCC and Clang.
On x86, atomic reads can be implemented without memory fences. Basically, an atomic read is identical to a non-atomic read. Take a look at the following compilation unit:
#include <atomic>

std::atomic<int> global_value;

int load_global_value() { return global_value.load(std::memory_order_seq_cst); }
Although I used an atomic operation with sequential consistency (the default), there is nothing special in the generated code. The assembler code generated by GCC and Clang looks as follows:
load_global_value():
    movl global_value(%rip), %eax
    retq
I said almost identical, because there are other reasons that might impact the performance. For example:
although there is no fence, the atomic operations still prevent some compiler optimisations, e.g. reordering instructions and elimination of stores and loads
if there is at least one thread, that writes to a different memory location on the same cache line, it will have a huge impact on the performance (known as false sharing)
Having said that, the recommended way to implement lazy initialisation is to use std::call_once. That should give you the best result for all compilers, environments and target architectures.
#include <memory>
#include <mutex>

std::once_flag _init;
std::unique_ptr<gadget> _gadget;

auto get_gadget() -> gadget&
{
    std::call_once(_init, [] { _gadget.reset(new gadget{...}); });
    return *_gadget;
}
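For the question's exact pattern, the usual atomic formulation is double-checked initialization; roughly like this (my sketch, assuming desired() never returns null):

#include <atomic>

void* desired(); // assumed: expensive, never returns null

std::atomic<void*> variable{nullptr};

void* lazy_get()
{
    // Fast path: one acquire load once initialization has happened.
    void* v = variable.load(std::memory_order_acquire);
    if (v == nullptr) {
        void* expected = nullptr;
        void* d = desired();
        if (variable.compare_exchange_strong(
                expected, d,
                std::memory_order_acq_rel,
                std::memory_order_acquire)) {
            v = d;          // we won the race and installed our value
        } else {
            v = expected;   // another thread installed its value first
        }
    }
    return v;
}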
This is undefined behavior. You're modifying variable, at least in some thread, which means that all accesses to variable must be protected. In particular, when you're executing the atomic_compare_exchange_strong in one thread, there is nothing to guarantee that another thread won't see the new value of variable before it sees the writes that might have occurred in desired(). (atomic_compare_exchange_strong only guarantees any ordering in the thread that executes it.)

Do sequence points prevent code reordering across critical section boundaries?

Suppose that one has some lock based code like the following where mutexes are used to guard against inappropriate concurrent read and write
mutex.get();                    // get a lock.
T localVar = pSharedMem->v;     // read something
pSharedMem->w = blah;           // write something.
pSharedMem->z++;                // read and write something.
mutex.release();                // release the lock.
If one assumed that the generated code was created in program order, there is still a requirement for appropriate hardware memory barriers like isync, lwsync, .acq, .rel. I'll assume for this question that the mutex implementation takes care of this part, providing a guarantee that the pSharedMem reads and writes all occur "after" the get(), and "before" the release() [but that surrounding reads and writes can get into the critical section, as I expect is the norm for mutex implementations]. I'll also assume that volatile accesses are used in the mutex implementation where appropriate, but that volatile is NOT used for the data protected by the mutex (understanding why volatile does not appear to be a requirement for the mutex-protected data is really part of this question).
I'd like to understand what prevents the compiler from moving the pSharedMem accesses outside of the critical region. In the C and C++ standards I see that there is a concept of sequence points. Much of the sequence point text in the standards docs I found incomprehensible, but if I had to guess what it is about, it is a statement that code should not be reordered across a point where there is a call with unknown side effects. Is that the gist of it? If so, what sort of optimization freedom does the compiler have here?
With compilers doing tricky optimizations like profile driven interprocedural inlining (even across file boundaries), even the concept of unknown side effect gets kind of blurry.
It is perhaps beyond the scope of a simple question to explain this in a self-contained way here, so I am open to being pointed at references (preferably online and targeted at mortal programmers, not compiler writers and language designers).
EDIT: (in response to Jalf's reply)
I'd mentioned the memory barrier instructions like lwsync and isync because of the CPU reordering issues you also mentioned. I happen to work in the same lab as the compiler guys (for one of our platforms, at least), and having talked to the implementers of the intrinsics, I know that at least for the xlC compiler __isync() and __lwsync() (and the rest of the atomic intrinsics) are also code-reordering barriers. In our spinlock implementation this is visible to the compiler, since this part of our critical section is inlined.
However, suppose you weren't using a custom-built lock implementation (as we happen to be, which is likely uncommon), and just called a generic interface such as pthread_mutex_lock(). There the compiler isn't informed of anything more than the prototype. I've never seen it suggested that code like
pthread_mutex_lock( &m );
pSharedMem->someNonVolatileVar++;
pthread_mutex_unlock( &m );

pthread_mutex_lock( &m );
pSharedMem->someNonVolatileVar++;
pthread_mutex_unlock( &m );
would be non-functional unless the variable was changed to volatile. Each of those back-to-back blocks performs a load/increment/store sequence, and the code would not operate correctly if the value from the first increment were retained in a register for the second.
It seems likely that the unknown side effects of the pthread_mutex_lock() is what protects this back to back increment example from behaving incorrectly.
I'm talking myself into a conclusion that the semantics of a code sequence like this in a threaded environment is not really strictly covered by the C or C++ language specs.
In short, the compiler is allowed to reorder or transform the program as it likes, as long as the observable behavior on a C++ virtual machine does not change. The C++ standard has no concept of threads, and so this fictive VM only runs a single thread. And on such an imaginary machine, we don't have to worry about what other threads see. As long as the changes don't alter the outcome of the current thread, all code transformations are valid, including reordering memory accesses across sequence points.
understanding why volatile does not appear to be a requirement for the mutex protected Data is really part of this question
Volatile ensures one thing, and one thing only: reads from a volatile variable will be read from memory every time -- the compiler won't assume that the value can be cached in a register. And likewise, writes will be written through to memory. The compiler won't keep it around in a register "for a while, before writing it out to memory".
But that's all. When the write occurs, a write will be performed, and when the read occurs, a read will be performed. But it doesn't guarantee anything about when this read/write will take place. The compiler may, as it usually does, reorder operations as it sees fit (as long as it doesn't change the observable behavior in the current thread, the one that the imaginary C++ CPU knows about). So volatile doesn't really solve the problem. On the other hand, it offers a guarantee that we don't really need. We don't need every write to the variable to be written out immediately; we just want to ensure that they get written out before crossing this boundary. It's fine if they're cached until then, and likewise, once we've crossed the critical section boundary, subsequent writes can be cached again for all we care, until we cross the boundary the next time. So volatile offers too strong a guarantee which we don't need, but doesn't offer the one we do need (that reads/writes won't get reordered).
So to implement critical sections, we need to rely on compiler magic. We have to tell it that "ok, forget about the C++ standard for a moment, I don't care what optimizations it would have allowed if you'd followed that strictly. You must NOT reorder any memory accesses across this boundary".
Critical sections are typically implemented via special compiler intrinsics (essentially special functions that are understood by the compiler), which 1) force the compiler to avoid reordering across that intrinsic, and 2) makes it emit the necessary instructions to get the CPU to respect the same boundary (because the CPU reorders instructions too, and without issuing a memory barrier instruction, we'd risk the CPU doing the same reordering that we just prevented the compiler from doing)
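Concretely, such an intrinsic can be as small as an empty asm statement with a memory clobber (a sketch, assuming GCC/Clang extended-asm syntax; the C++11 fences are the portable equivalents):

#include <atomic>

// GCC/Clang: an empty asm with a "memory" clobber is a pure
// compiler-reordering barrier; it emits no CPU instruction.
inline void compiler_barrier()
{
    __asm__ __volatile__("" ::: "memory");
}

// C++11: atomic_signal_fence constrains only the compiler, while
// atomic_thread_fence additionally emits a hardware barrier if the
// target CPU needs one.
inline void full_barrier()
{
    std::atomic_thread_fence(std::memory_order_seq_cst);
}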
No, sequence points do not prevent rearranging of operations. The main, most broad rule that governs optimizations is the requirement imposed on so called observable behavior. Observable behavior, by definition, is read/write accesses to volatile variables and calls to library I/O functions. These events must occur in the same order and produce the same results as they would in the "canonically" executed program. Everything else can be rearranged and optimized absolutely freely by the compiler, in any way it sees fit, completely ignoring any orderings imposed by sequence points.
Of course, most compilers try not to do any excessively wild rearrangements. However, the issue you mention has become a real practical problem with modern compilers in recent years. Many implementations offer additional implementation-specific mechanisms that allow the user to ask the compiler not to move accesses across certain boundaries when doing optimizing rearrangements.
Since, as you are saying, the protected data is not declared as volatile, formally speaking the access can be moved outside of the protected region. If you declare the data as volatile, it should prevent this from happening (assuming that mutex access is also volatile).
Let's look at the following example:
my_pthread_mutex_lock( &m );
someNonVolatileGlobalVar++;
my_pthread_mutex_unlock( &m );
The function my_pthread_mutex_lock() just calls pthread_mutex_lock(). By using my_pthread_mutex_lock(), I'm sure the compiler doesn't know that it's a synchronization function: for the compiler it's just a function, while for me it's a synchronization function that I can easily reimplement.
Because someNonVolatileGlobalVar is global, I expect the compiler not to move someNonVolatileGlobalVar++ outside the critical section. In fact, because of observable behavior, even in a single-threaded situation the compiler doesn't know whether the function called before and the one called after this instruction modify the global var, so to keep the observable behavior correct it has to keep the execution order as written.
I hope pthread_mutex_lock() and pthread_mutex_unlock() also perform hardware memory barriers, in order to prevent the hardware from moving this instruction outside the critical section.
Am I right?
If I write:
my_pthread_mutex_lock( &m );
someNonVolatileGlobalVar1++;
someNonVolatileGlobalVar2++;
my_pthread_mutex_unlock( &m );
I cannot know which one of the two variables is incremented first, but this is normally not an issue.
Now, if I write:
someGlobalPointer = &someNonVolatileLocalVar;
my_pthread_mutex_lock( &m );
someNonVolatileLocalVar++;
my_pthread_mutex_unlock( &m );

or

someLocalPointer = &someNonVolatileGlobalVar;
my_pthread_mutex_lock( &m );
(*someLocalPointer)++;
my_pthread_mutex_unlock( &m );
Is the compiler doing what an ingenuous developer expects?
C/C++ sequence points occur, for example, when ';' is encountered, at which point all side effects of all operations that preceded it must have occurred. However, I'm fairly certain that by "side effect" what's meant is operations that are part of the language itself (like z being incremented in 'z++'), not effects at lower/higher levels (like what the OS actually does with regard to memory management, thread management, etc. after an operation is completed).
Does that answer your question kinda? My point is really just that AFAIK the concept of sequence points doesn't really have anything to do with the side effects you're referring to.
hth
See also [linux-kernel]/Documentation/memory-barriers.txt