I created this program. It does nothing of interest but use processing power.
Looking at the output with objdump -d, I can see the three rand calls and the corresponding mov instructions near the end, even when compiling with -O3.
Why doesn't the compiler realize that memory isn't going to be used and just replace the bottom half with while(1){}? I'm using gcc, but I'm mostly interested in what is required by the standard.
/*
 * Create a program that does nothing except slow down the computer.
 */
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++) {
        fork();
    }
    int len = 1000;
    int *garbage = (int*)malloc(sizeof(int)*len);
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }
    while (true) {
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}
Because GCC isn't smart enough to perform this optimization on dynamically allocated memory. However, if you change garbage to be a local array instead, GCC compiles the loop to this:
.L4:
call rand
call rand
call rand
jmp .L4
This just calls rand repeatedly (which is needed because the call has side effects), but optimizes out the reads and writes.
If GCC were even smarter, it could also optimize out the rand calls, because their side effects only affect later rand calls, and in this case there aren't any. However, this sort of optimization would probably be a waste of compiler writers' time.
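For reference, here is a sketch of the modified program this answer describes, with the heap allocation replaced by a local array (the exact source given to GCC isn't shown above, so treat this as an assumed reconstruction):
#include <cstdlib>
#include <unistd.h>

int getRand(int max) {
    return rand() % max;
}

int main() {
    for (int thread = 0; thread < 5; thread++) {
        fork();
    }
    const int len = 1000;
    int garbage[len];                 // local array instead of malloc
    for (int x = 0; x < len; x++) {
        garbage[x] = x;
    }
    // Because the array is local and its address never escapes the function,
    // GCC can prove the loads and stores are dead, leaving only the three
    // rand() calls per iteration, as in the assembly above.
    while (true) {
        garbage[getRand(len)] = garbage[getRand(len)] - garbage[getRand(len)];
    }
}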
It can't, in general, tell that rand() doesn't have observable side-effects here, and it isn't required to remove those calls.
It could remove the writes, but it may be that the use of arrays is enough to suppress that.
The standard neither requires nor prohibits what it is doing. As long as the program has the correct observable behaviour any optimisation is purely a quality of implementation matter.
This code causes undefined behaviour because it has an infinite loop with no observable behaviour. Therefore any result is permissible.
In C++14 the text is 1.10/27:
The implementation may assume that any thread will eventually do one of the following:
terminate,
make a call to a library I/O function,
access or modify a volatile object, or
perform a synchronization operation or an atomic operation.
[Note: This is intended to allow compiler transformations such as removal of empty loops, even when termination cannot be proven. —end note ]
I wouldn't say that rand() counts as an I/O function.
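To illustrate the quoted wording, here is a minimal sketch of my own (not code from the question): a non-terminating loop that does none of the listed things may be assumed not to happen, while adding a volatile access is enough to force the compiler to keep it.
volatile long ticks;   // accesses to a volatile object are observable behaviour

void spin_removable() {
    // No termination, no I/O, no volatile access, no synchronization or
    // atomic operation: the implementation may assume this never executes,
    // so at -O2/-O3 the whole body can legally disappear.
    long counter = 0;
    while (true) { counter++; }
}

void spin_kept() {
    // The volatile store each iteration is one of the listed exceptions,
    // so this loop has to stay.
    while (true) { ticks = ticks + 1; }
}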
Give it a chance to crash via an array overflow! The compiler won't speculate on the range of getRand's outputs.
In all likelihood, a lockless implementation is already overkill for the purposes of my application, but I wanted to look into memory barriers and lockless-ness anyways in case I ever actually need to use these concepts in the future.
From what I can tell:
an "InterlockedAcquire" function performs an atomic operation while preventing the compiler from moving code statements after the InterlockedAcquire to before the InterlockedAcquire.
an "InterlockedRelease" function performs an atomic operation while preventing the compiler from moving code statements before the InterlockedRelease to after the InterlockedRelease.
a vanilla "Interlocked" function performs an atomic operation while preventing the compiler from moving code statements in either direction across the Interlocked call.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
For a more concrete example, here's my current application - the produce() function as part of what will eventually be a multiple producer, single consumer queue built using a circular buffer:
template <typename T>
class Queue {
private:
    LONG64 headIndex;   // 64-bit indices to match InterlockedCompareExchange64
    LONG64 tailIndex;
    T* array[MAXQUEUESIZE];
public:
    Queue() {
        headIndex = 0;
        tailIndex = 0;
        memset(array, 0, MAXQUEUESIZE * sizeof(void*));
    }
    ~Queue() {
    }
    bool produce(T value) {
        //1) prevents concurrent calls to produce() from causing corruption:
        LONG64 indexRetVal;
        LONG64 reservedIndex;
        do {
            reservedIndex = tailIndex;
            indexRetVal = InterlockedCompareExchange64(&tailIndex, (reservedIndex + 1) % MAXQUEUESIZE, reservedIndex);
        } while (indexRetVal != reservedIndex);
        //2) allocates the node.
        T* newValPtr = (T*) malloc(sizeof(T));
        if (newValPtr == nullptr) {
            OutputDebugString("Queue: malloc returned null");
            return false;
        }
        *newValPtr = value;
        //3) prevents a concurrent call to consume from causing corruption by atomically replacing the old pointer:
        T* valPtrRetVal = (T*) InterlockedCompareExchangePointer((PVOID volatile*)(array + reservedIndex), newValPtr, nullptr);
        //if the previous value wasn't null, then our circular buffer overflowed:
        if (valPtrRetVal != nullptr) {
            OutputDebugString("Queue: circular buffer overflowed");
            free(newValPtr); //as pointed out by RbMm
            return false;
        }
        //otherwise, everything worked fine
        return true;
    }
};
As I understand it, 3) will occur after 1) and 2) regardless of what I do anyways, but I should change 1) to an InterlockedRelease because I don't care whether it occurs before or after 2) and I should let the compiler decide.
My question is, if a function is structured such that the compiler can't reorder any of the code anyway because doing so would affect single-threaded behavior, is there a difference between any of the variants of an Interlocked function, or are they all effectively the same? Is the only difference between them how they interact with code reordering?
You may be confusing C++ statements with instructions. Your question isn't CPU specific, so you have to pretend you have no idea what the CPU instructions look like.
Consider this code:
if (a == 2)
{
    b = 5;
}
Now, here's an example of a re-ordering of this code that doesn't affect a single thread:
int c = b;
b = 5;
if (a != 2)
    b = c;
This performs the same operations but in a different order. It has no effect on single-threaded code. But, of course, if another thread was accessing b, it could see a value of 5 from this code even if a was never 2.
Thus it could also see a value of 5 from the original code even if a is never 2!
Why? Because the two bits of code behave the same from the point of view of a single thread. And unless you use operations with guaranteed threading semantics, that's all the compiler, CPU, caches, and other platform components need to preserve.
So your belief that reordering any of the code would affect single-threaded behavior is most likely incorrect. There are lots of ways to reorder and optimize code that don't affect single-threaded behavior.
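If you do need the other thread to see b == 5 only when a == 2, you have to say so with operations that carry threading guarantees. A minimal sketch using portable std::atomic rather than the Interlocked API from the question (same idea, different spelling):
#include <atomic>

int a = 0;
std::atomic<int> b{0};

void writer() {
    if (a == 2) {
        // Release store: the compiler and CPU may not move earlier writes in
        // this thread past it, and they may not invent the store when the
        // branch isn't taken.
        b.store(5, std::memory_order_release);
    }
}

int reader() {
    // Acquire load: pairs with the release store, so a reader that sees 5
    // also sees everything the writer did before the store, and can never
    // see 5 unless a really was 2.
    return b.load(std::memory_order_acquire);
}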
There is a document on MSDN that explains the difference: Acquire and Release Semantics.
For the sample:
a++;
b++;
c++;
If we use acquire semantics to increment a, other processors would always see the increment of a before the increments of b and c;
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The plain InterlockedXxx routines have both acquire and release semantics by default.
More specifically, with four values:
a++;
b++;
c++;
d++;
If we use acquire semantics to increment b, other processors would always see the increment of b before the increments of c and d;
The order may be a->b->c,d or b->a,c,d.
If we use release semantics to increment c, other processors would always see the increments of a and b before the increment of c;
The order may be a,b->c->d or a,b,d->c.
To quote from this answer by @antiduh:
Acquire says "only worry about stuff after me". Release says "only
worry about stuff before me". Combining those both is a full memory
barrier.
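A sketch of that increment example in portable C++ (using std::atomic orderings rather than the Acquire/Release variants of the Interlocked functions, purely to show which reorderings each one-way fence forbids):
#include <atomic>

std::atomic<int> a{0}, b{0}, c{0}, d{0};

void increments() {
    a.fetch_add(1, std::memory_order_relaxed);
    // Acquire on b: the increments of c and d below cannot appear to move
    // above this point, but the increment of a may still sink below it.
    b.fetch_add(1, std::memory_order_acquire);
    // Release on c: the increments of a and b above cannot appear to move
    // below this point, but the increment of d may still rise above it.
    c.fetch_add(1, std::memory_order_release);
    d.fetch_add(1, std::memory_order_relaxed);
}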
All three versions prevent the compiler from moving code across the function call, but the compiler is not the only place that reordering takes place.
Modern CPUs have "out-of-order execution" and even "speculative execution". Acquire and release semantics cause the compiler to emit instructions (or flags and prefixes on instructions) that control reordering within the CPU.
I'm looking for a compiler flag that will allow me to prevent the compiler optimising away the loop in code like this:
#include <cstdio>
#include <memory>
#include <utility>

void func() {
    std::unique_ptr<int> up1(new int(0)), up2;
    up2 = std::move(up1);
    for (int i = 0; i < 1000000000; i++) {
        if (up2) {
            *up2 += 1;
        }
    }
    if (up2)
        printf("%d", *up2);
}
in both C++ and Rust code. I'm trying to compare similar sections of code in terms of speed, and running this loop rather than just evaluating the overall result is important. Since Rust statically guarantees that ownership of the pointer hasn't been moved, it doesn't need the null-pointer check on each iteration of the loop, so I would expect it to produce faster code if the loop can't be optimised out for whatever reason.
Rust compiles using an LLVM backend, so I would preferably be using that for C++ as well.
In Rust you can use test::black_box.
In C++ (using gcc) asm volatile("" : "+r" (datum));. See this.
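Applied to the function from the question, a minimal sketch (GCC/Clang only, since it relies on GNU inline assembly): the empty asm statement claims to read and modify the value, so the compiler has to keep every iteration that feeds it.
#include <cstdio>
#include <memory>

static void black_box(int &datum) {
    // "+r": the value is both an input and an output of the (empty) asm,
    // so the compiler must materialize it here and assume it may change.
    asm volatile("" : "+r"(datum));
}

void func() {
    std::unique_ptr<int> up1(new int(0)), up2;
    up2 = std::move(up1);
    for (int i = 0; i < 1000000000; i++) {
        if (up2) {
            *up2 += 1;
            black_box(*up2);   // the loop body is now "observed" each iteration
        }
    }
    if (up2)
        printf("%d", *up2);
}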
One typical way to avoid having the compiler optimize away loops is to make their bounds indeterminate at compile time. In this example, rather than looping up to 10000000, loop up to a count which is read from stdin or argv.
C & C++ compilers are allowed to reorder operations as long as the as-if rule holds. What is an example of such a reordering performed by a compiler, and what is the potential performance gain to be had by doing it?
Examples involving any (C/C++) compiler on any platform are welcome.
Suppose you have the following operations being performed:
int i=0,j=0;
i++;
i++;
i++;
j++;
j++;
j++;
Ignoring for the moment that the three increments would likely be optimized away by the compiler into one +=3, you will end up having a higher processor-pipeline throughput if you reordered the operations as
i++;
j++;
i++;
j++;
i++;
j++;
since j++ doesn't have to wait for the result of i++, whereas in the previous case most of the instructions had a data dependency on the previous instruction. In more complicated computations, where there isn't an easy way to reduce the number of instructions to be performed, the compiler can still look at data dependencies and reorder instructions so that an instruction depending on the result of an earlier instruction is as far away from it as possible.
Another example of such an optimization is when you are dealing with pure functions. Looking at a simple example again, assume you have a pure function f(int x) which you are summing over a loop.
int tot = 0;
int x; // something known only at runtime
for (int i = 0; i < 100; i++)
    tot += f(x);
Since f is a pure function, the compiler can reorder calls to it as it pleases. In particular, it can transform this loop to
int tot = 0;
int x; // something known only at runtime
int fval = f(x);
for (int i = 0; i < 100; i++)
    tot += fval;
I'm sure there are quite a few examples where reordering operations will yield faster performance. An obvious example would be to reorder loads as early as possible, since these are typically much slower than other CPU operations. By doing other, unrelated work whilst the memory is being fetched, the CPU can save time overall.
That is, given something like this:
expensive_calculation();
x = load();
do_something(x);
We can reorder it like this:
x = load();
expensive_calculation();
do_something(x);
So while we're waiting for the load to complete, we can essentially do expensive_calculation() for free.
Suppose you have a loop like:
for (i=0; i<n; i++) dest[i] = src[i];
Think memcpy. You might want the compiler to be able to vectorize this, i.e. load 8 or 16 bytes at a time and then store 8 or 16 at a time. Making that transformation is a reordering, since it would cause src[1] to be read before dest[0] is stored. Moreover, unless the compiler knows that src and dest don't overlap, it's an invalid transformation, i.e. one the compiler is not allowed to make. Use of the restrict keyword (C99 and later) allows you to tell the compiler that they don't overlap so that this kind of (extremely valuable) optimization is possible.
The same sort of thing arises all the time in operations on arrays that aren't just copying - things like vector/matrix operations, transformations of sound/image sample data, etc.
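For example, stating the no-overlap promise explicitly (a sketch; __restrict is a common compiler spelling of C99's restrict in C++, accepted by GCC, Clang and MSVC):
#include <cstddef>

// Without __restrict the compiler must assume dest and src might overlap, so
// it cannot read src[i+1], src[i+2], ... before storing dest[i]. With the
// promise of no aliasing it is free to vectorize into wide loads and stores.
void copy_ints(int *__restrict dest, const int *__restrict src, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        dest[i] = src[i];
    }
}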
for (int i = 0; i < my_function(MY_CONSTANT); ++i) {
    // code using i
}
In this example, will my_function(MY_CONSTANT) be evaluated at each iteration, or will it be stored automatically? Would this depend on the optimization flags used?
It has to work as if the function is called each time.
However, if the compiler can prove that the function result will be the same each time, it can optimize under the “as if” rule.
E.g. this usually happens with calls to .end() for standard containers.
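A typical instance (a sketch): as written, v.end() is evaluated on every iteration, but since nothing in the body can modify the vector, the compiler is allowed, after inlining, to compute it once under the as-if rule.
#include <vector>

int sum(const std::vector<int>& v) {
    int total = 0;
    for (auto it = v.begin(); it != v.end(); ++it) {
        total += *it;   // reads only; the container's end() cannot change
    }
    return total;
}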
General advice: when in doubt about whether to micro-optimize a piece of code,
Don't do it.
If you're still thinking of doing it, measure.
Well, there was a third point, but I've forgotten it; maybe it was: still, wait.
In other words, decide whether to use a variable based on how clear the code then is, not on imagined performance.
It will be evaluated each iteration. You can save the extra computation time by doing something like
const int stop = my_function(MY_CONSTANT);
for (int i = 0; i < stop; ++i) {
    // code using i
}
A modern optimizing compiler, under the as-if rule, may be able to optimize away the function call in the case that you outlined in your comment here. The as-if rule says that a conforming compiler only has to emulate the observable behavior; we can see this by going to the draft C++ standard section 1.9 Program execution, which says:
[...] Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine as explained below.
So if you are using a constant expression and my_function does not have observable side effects it could be optimized out. We can put together a simple test (see it live on godbolt):
#include <stdio.h>
#define blah 10
int func( int x )
{
    return x + 20 ;
}

void withConstant( int y )
{
    for(int i = 0; i < func(blah); i++)
    {
        printf("%d ", i ) ;
    }
}

void withoutConstant(int y)
{
    for(int i = 0; i < func(i+y); i++)
    {
        printf("%d ", i ) ;
    }
}
In the case of withConstant we can see it optimizes the computation:
cmpl $30, %ebx #, i
and even in the case of withoutConstant it inlines the calculation instead of performing a function call:
leal 0(%rbp,%rbx), %eax #, D.2605
If my_function is declared constexpr and the argument really is a constant, the value is calculated at compile time, thereby fulfilling the "as-if" and "sequential consistency with no data race" rules.
constexpr int my_function(const int c);
If your function has side effects it would prevent the compiler from moving it out of the for-loop as it would not fulfil the "as-if" rule, unless the compiler can reason its way out of it.
The compiler might also inline my_function, optimize it as if it were part of the loop, and with constant folding find out that it's really only a constant, de facto removing the call and replacing it with a constant.
int my_function(const int c) {
    return 17 + c; // inline and constant reduced to the value.
}
So the answer to your question is ... maybe!
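Putting the pieces of this answer together, a sketch of what that looks like (hypothetical code, combining the constexpr declaration with the loop from the question; the value of MY_CONSTANT is made up for illustration):
constexpr int MY_CONSTANT = 25;   // hypothetical value

constexpr int my_function(const int c) {
    return 17 + c;   // a constant expression when c is a constant
}

void run() {
    // my_function(MY_CONSTANT) is a constant expression, so the generated
    // loop compares i against the value 42 and contains no call at all.
    for (int i = 0; i < my_function(MY_CONSTANT); ++i) {
        // code using i
    }
}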
Please refer to section 41.2.2 Instruction Reordering of "TCPL" 4th edition by B.Stroustrup, which I transcribe below:
To gain performance, compilers, optimizers, and hardware reorder
instructions. Consider:
// thread 1:
int x;
bool x_init;

void init()
{
    x = initialize(); // no use of x_init in initialize()
    x_init = true;
    // ...
}
For this piece of code there is no stated reason to assign to x before
assigning to x_init. The optimizer (or the hardware instruction
scheduler) may decide to speed up the program by executing x_init =
true first. We probably meant for x_init to indicate whether x had
been initialized by initializer() or not. However, we did not say
that, so the hardware, the compiler, and the optimizer do not know
that.
Add another thread to the program:
// thread 2:
extern int x;
extern bool x_init;

void f2()
{
    int y;
    while (!x_init) // if necessary, wait for initialization to complete
        this_thread::sleep_for(milliseconds{10});
    y = x;
    // ...
}
Now we have a problem: thread 2 may never wait and thus will assign an
uninitialized x to y. Even if thread 1 did not set x_init and x in
‘‘the wrong order,’’ we still may have a problem. In thread 2, there
are no assignments to x_init, so an optimizer may decide to lift the
evaluation of !x_init out of the loop, so that thread 2 either never
sleeps or sleeps forever.
Does the Standard allow the reordering in thread 1? (A quote from the Standard would be welcome.) Why would that speed up the program?
Both answers in this discussion on SO seem to indicate that no such optimization occurs when there are global variables in the code, as x_init above.
What does the author mean by "to lift the evaluation of !x_init out of the loop"? Is it something like this?
if (!x_init)
    while (true)
        this_thread::sleep_for(milliseconds{10});
y = x;
This is not so much an issue of the C++ compiler/standard as of modern CPUs. Have a look here. The compiler isn't going to emit memory-barrier instructions between the assignments of x and x_init unless you tell it to.
For what it is worth, prior to C++11 the standard had no notion of multithreading in its abstract machine model. Things are a bit nicer these days.
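For completeness, a sketch of how the same example is usually written with the C++11 facilities this answer alludes to (assuming initialize() is defined elsewhere, as in the book's example); the atomic flag both orders the store of x before the store of x_init and stops the compiler hoisting the check out of the waiting loop:
#include <atomic>
#include <chrono>
#include <thread>

int initialize();                    // assumed to be defined elsewhere

int x;
std::atomic<bool> x_init{false};

// thread 1:
void init() {
    x = initialize();
    x_init.store(true, std::memory_order_release);   // publish x
}

// thread 2:
void f2() {
    while (!x_init.load(std::memory_order_acquire))  // re-read every pass
        std::this_thread::sleep_for(std::chrono::milliseconds{10});
    int y = x;   // guaranteed to see the value written by initialize()
    // ...
}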
The C++11 standard does not "allow" or "prevent" reordering. It specifies ways to impose a "barrier" which, in turn, prevents the compiler from moving instructions before/after it. A compiler, in this example, might reorder the assignments because that might be more efficient on a CPU with multiple execution units (ALUs/hyper-threading/etc.), even with a single core. Typically, if your CPU has two ALUs that can work in parallel, there is no reason the compiler would not try to feed them with as much work as it can.
I'm not speaking of the out-of-order reordering of CPU instructions that's done internally in Intel CPUs (for example), but of compile-time ordering to ensure all the computing resources are kept busy doing some work.
I think it depends on the compilation flags. Typically, unless you tell it otherwise, the compiler must assume that another compilation unit (say B.cpp, which is not visible at compile time) can declare an "extern bool x_init" and change it at any time. Then the reordering optimization would break the expected behavior (B could define the initialize() function). This example is trivial and unlikely to break. The linked SO answers are not related to this "optimization"; it is simply that, in their case, the compiler cannot assume that the global array is not modified externally, and as such cannot make the optimization. That is not like your example.
Yes. It's a very common optimization trick. Instead of:
// test is a bool
for (int i = 0; i < 345; i++) {
    if (test) do_something();
}
The compiler might do:
if (test) for(int i = 0; i < 345; i++) { do_something(); }
And save 344 useless tests.