Can I guarantee the C++ compiler will not reorder my calculations? - c++

I'm currently reading through the excellent Library for Double-Double and Quad-Double Arithmetic paper, and in the first few lines I notice they perform a sum in the following way:
std::pair<double, double> TwoSum(double a, double b)
{
double s = a + b;
double v = s - a;
double e = (a - (s - v)) + (b - v);
return std::make_pair(s, e);
}
The calculation of the error, e, relies on the fact that the calculation follows that order of operations exactly because of the non-associative properties of IEEE-754 floating point math.
If I compile this with a modern optimizing C++ compiler (e.g. MSVC or gcc), can I be assured that the compiler won't optimize away the order in which this calculation is done?
Secondly, is this guaranteed anywhere within the C++ standard?
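To make concrete what the error term is supposed to capture, here is a minimal sketch. It assumes IEEE-754 doubles with round-to-nearest, no excess precision (FLT_EVAL_METHOD == 0) and no fast-math style flags; the helper two_sum_recovers_error is a hypothetical name added only for illustration.

```cpp
#include <utility>

// Same TwoSum as above; every parenthesis matters.
std::pair<double, double> TwoSum(double a, double b)
{
    double s = a + b;                     // rounded sum
    double v = s - a;                     // the part of b that made it into s
    double e = (a - (s - v)) + (b - v);   // what the rounding of s threw away
    return std::make_pair(s, e);
}

// Hypothetical helper: true if the error term recovered the lost low-order part.
bool two_sum_recovers_error()
{
    // 1e-16 is below half an ulp of 1.0, so it vanishes from the rounded sum.
    std::pair<double, double> r = TwoSum(1.0, 1e-16);
    return r.first == 1.0 && r.second == 1e-16;
}
```

If a compiler simplified the expression algebraically, e would collapse to zero and the check would fail, which is exactly the outcome I am worried about.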

You might like to look at the g++ manual page: http://gcc.gnu.org/onlinedocs/gcc-4.6.1/gcc/Optimize-Options.html#Optimize-Options
Particularly -fassociative-math, -ffast-math and -ffloat-store
According to the g++ manual it will not reorder your expression unless you specifically request it.

Yes, that is safe (at least in this case). You only use two "operators" there, the primary expression (something) and the binary something +/- something (additive).
Section 1.9 Program execution (of C++0x N3092) states:
Operators can be regrouped according to the usual mathematical rules only where the operators really are associative or commutative.
In terms of the grouping, 5.1 Primary expressions states:
A parenthesized expression is a primary expression whose type and value are identical to those of the enclosed expression. ... The parenthesized expression can be used in exactly the same contexts as those where the enclosed expression can be used, and with the same meaning, except as otherwise indicated.
I believe the use of the word "identical" in that quote requires a conforming implementation to guarantee that it will be executed in the specified order unless another order can give the exact same results.
And for adding and subtracting, section 5.7 Additive operators has:
The additive operators + and - group left-to-right.
So the standard dictates the results. If the compiler can ascertain that the same results can be obtained with different ordering of the operations then it may re-arrange them. But whether this happens or not, you will not be able to discern a difference.

This is a very valid concern, because Intel's C++ compiler, which is very widely used, defaults to performing optimizations that can change the result.
See http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/copts/common_options/option_fp_model.htm#option_fp_model

I would be quite surprised if any compiler wrongly assumed associativity of arithmetic operators with default optimising options.
But be wary of extended precision of FP registers.
Consult compiler documentation on how to ensure that FP values do not have extended precision.

If you really need to, I think you can make a noinline function double no_reorder(double x) { return x; }, and then use it instead of parentheses. Obviously, it's not a particularly efficient solution though.
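A rough sketch of that idea, assuming GCC/Clang's `__attribute__((noinline))` or MSVC's `__declspec(noinline)`. The NO_INLINE macro and the error_term function are names made up for this example, and a compiler doing interprocedural analysis may still see through the call, so treat it as a hint rather than a guarantee.

```cpp
#if defined(_MSC_VER)
#define NO_INLINE __declspec(noinline)
#else
#define NO_INLINE __attribute__((noinline))
#endif

// Hypothetical barrier: returns its argument unchanged, but as an opaque call.
NO_INLINE double no_reorder(double x) { return x; }

// Error term of TwoSum with each parenthesized group routed through the barrier.
double error_term(double a, double b)
{
    double s = a + b;
    double v = s - a;
    return no_reorder(a - no_reorder(s - v)) + no_reorder(b - v);
}
```

On a platform with plain IEEE-754 double arithmetic, error_term(1.0, 1e-16) should come out as 1e-16.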

In general, you should be able to -- the optimizer should be aware of the properties of the real operations.
That said, I'd test the hell out of the compiler I was using.

Yes. The compiler will not change the order of your calculations within a block like that.

Between compiler optimizations and out-of-order execution on the processor, it is almost a guarantee that things will not happen exactly as you ordered them.
HOWEVER, it is also guaranteed that this will NEVER change the result. C++ follows standard order of operations and all optimizations preserve this behavior.
Bottom line: Don't worry about it. Write your C++ code to be mathematically correct and trust the compiler. If anything goes wrong, the problem was almost certainly not the compiler.

As per the other answers you should be able to rely on the compiler doing the right thing -- most compilers allow you to compile and inspect the assembler (use -S for gcc) -- you may want to do that to make sure you get the order of operation you expect.
Different optimization levels (in gcc, -O, -O2, etc.) allow code to be rearranged (however, sequential code like this is unlikely to be affected) -- but I would suggest you then isolate that particular part of the code in a separate file, so that you can control the optimization level for just the calculation.

The short answer is: the compiler will probably change the order of your calculations, but it will never change the behavior of your program (unless your code makes use of expressions with undefined behavior: http://blog.regehr.org/archives/213)
However, you can still influence this behavior by deactivating all compiler optimizations (option "-O0" with gcc). If you still need the compiler to optimize the rest of your code, you may put this function in a separate ".c" file which you can compile with "-O0".
Additionally, you can use some hacks. For instance, if you interleave your code with extern function calls, the compiler may consider it unsafe to reorder your code, as the function may have unknown side effects. Calling "printf" to print the value of your intermediate results will lead to similar behavior.
Anyway, unless you have any very good reason (e.g. debugging) you typically don't want to care about that, and you should trust the compiler.

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

In my experience (this is not the question but a statement), avoiding non-constant local variables in favor of const variables, or avoiding local variables altogether, enables the C++ compiler to generate faster code.
I assume that this gives the compiler more freedom to interleave the calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? e.g. Compiler giving up on certain optimization levels, as soon as the code gets too complex in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that that value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs, you do have to be careful playing games like this.
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no other assignments to other pointers to the same type couldn't have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
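A small sketch of the local-copy variant next to the repeated-access one. The function names are hypothetical, and whether the first version really reloads arr[i] depends on the compiler and on what it can prove about aliasing.

```cpp
#include <cstddef>

// out and arr have the same element type, so the compiler must assume that
// a store to out[k] might modify arr[i], and may reload it on every iteration.
void scale_reload(float* out, const float* arr, std::size_t i, std::size_t n)
{
    for (std::size_t k = 0; k < n; ++k)
        out[k] = arr[i] * 2.0f;
}

// Reading arr[i] into a local first says "use this value from now on",
// regardless of any later stores through out.
void scale_local(float* out, const float* arr, std::size_t i, std::size_t n)
{
    const float ai = arr[i];
    for (std::size_t k = 0; k < n; ++k)
        out[k] = ai * 2.0f;
}
```

The two only behave identically when out and arr don't overlap; if they do overlap, the local copy changes the result, which is why the compiler can't make that transformation for you.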
Of course, if optimization is disabled, more statements are typically slower than fewer large expressions, and store/reload latency is usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
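To see why contraction is a semantic change and not just a speedup, here is a sketch comparing a separately rounded product with std::fma. The function names are made up for the example, and the volatile temporary is only there as an assumption-free way to stop the compiler from fusing the two operations itself.

```cpp
#include <cmath>

// Product rounded to double first, then added: two roundings.
double mul_then_add(double a, double b, double c)
{
    volatile double p = a * b;   // the volatile store/load keeps the product from being fused
    return p + c;
}

// Explicit fused multiply-add: the exact product is added, with one rounding at the end.
double fused(double a, double b, double c)
{
    return std::fma(a, b, c);
}
```

With a = b = 1 + 2^-27 and c = -(1 + 2^-26), the exact product is 1 + 2^-26 + 2^-54. Rounding it to double drops the 2^-54 term, so mul_then_add returns 0, while fused returns 2^-54.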
(Legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD!=0 - FP math happens at higher precision, and rounding to float or double costs extra). Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that, -fno-float-store. But -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

Enforcing order of execution

I would like to ensure that the calculations requested are executed exactly in the order I specify, without any alterations from either the compiler or CPU (including the linker, assembler, and anything else you can think of).
Operator left-to-right associativity is assumed in the C language
I am working in C (possibly also interested in C++ solutions), which states that for operations of equal precedence there is an assumed left-to-right operator associativity, and hence
a = b + c - d + e + f - g ...;
is equivalent to
a = (...(((((b + c) - d) + e) + f) - g) ...);
A small example
However, consider the following example:
double a, b = -2, c = -3;
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
So many opportunities for optimisation
Many compilers and preprocessors may be clever enough to recognise that the "+ 2 - 2" is redundant and optimise it away. Similarly, they could recognise that the "+= 2*b" followed by the "+= c" can be written using a single FMA. Even if they don't contract these into an FMA, they may switch the order of these operations, etc. Furthermore, even if the compiler doesn't do any of these optimisations, the CPU may well decide to do some out-of-order execution, and decide it can do the "+= c" before the "+= 2*b", etc.
As floating-point arithmetic is non-associative, each type of optimisation may result in a different end result, which may be noticeable if the following is inlined somewhere.
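A minimal sketch of the non-associativity I mean, assuming IEEE-754 doubles evaluated without excess precision and without fast-math flags (the two function names are just for illustration):

```cpp
// Same three operands, grouped two different ways.
double left_grouped(double a, double b, double c)  { return (a + b) + c; }
double right_grouped(double a, double b, double c) { return a + (b + c); }
```

For a = 1e16, b = -1e16, c = 1.0 the left grouping gives 1.0, because the two large values cancel first. The right grouping gives 0.0, because -1e16 + 1.0 rounds back to -1e16 (neighbouring doubles are 2 apart at that magnitude).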
Why worry about floating point associativity?
For most of my code I would like as much optimisation as I can have and don't care about floating-point associativity or bit-wise reproducibility, but occasionally there is a small snippet (similar to the above example) which I would like to be untampered with and totally respected. This is because I am working with a mathematical method which requires an exactly reproducible result.
What can I do to resolve this?
A few ideas which have come to mind:
Disable compiler optimisations and out of order execution
I don't want this, as I want the other 99% of my code to be heavily optimised. (This seems to be cutting off my nose to spite my face). I also most likely won't have permission to change my hardware settings.
Use a pragma
Write some assembly
The code snippets are small enough that this might be reasonable, although I'm not very confident in this, especially if (when) it comes to debugging.
Put this in a separate file, compile separately as un-optimised as possible, and then link using a function call
Volatile variables
To my mind these are just for ensuring that memory access is respected and un-optimised, but perhaps they might prove useful.
Access everything through judicious use of pointers
Perhaps, but this seems like a disaster in readability, performance, and bugs waiting to happen.
If anyone can think of any feasible solutions (either from any of the ideas I've suggested or otherwise) that would be ideal. The "pragma" option or "function call" to my mind seem like the best approaches.
The ultimate goal
To have something that marks off a small chunk of simple and largely vanilla C code as protected and untouchable by any (realistically, most) optimisations, while allowing the rest of the code to be heavily optimised, covering optimisations from both the CPU and the compiler.
This is not a complete answer, but it is informative, partially answers, and is too long for a comment.
Clarifying the Goal
The question actually seeks reproducibility of floating-point results, not order of execution. Also, order of execution is irrelevant; we do not care if, in (a+b)+(c+d), a+b or c+d is executed first. We care that the result of a+b is added to the result of c+d, without any reassociation or other rewriting of arithmetic unless the result is known to be the same.
Reproducibility of floating-point arithmetic is in general an unsolved technological problem. (There is no theoretical barrier; we have reproducible elementary operations. Reproducibility is a matter of what hardware and software vendors have provided and how hard it is to express the computations we want performed.)
Do you want reproducibility on one platform (e.g., always using the same version of the same math library)? Does your code use any math library routines like sin or log? Do you want reproducibility across different platforms? With multithreading? Across changes of compiler version?
Addressing Some Specific Issues
The samples shown in the question can largely be handled by writing each individual floating-point operation in its own statement, as by replacing:
a = 1 + 2 - 2 + 3 + 4;
a += 2*b;
a += c;
with:
t0 = 1 + 2;
t0 = t0 - 2;
t0 = t0 + 3;
t0 = t0 + 4;
t1 = 2*b;
t0 += t1;
a = t0 + c;
The basis for this is that both C and C++ permit an implementation to use “excess precision” when evaluating an expression but require that precision to be “discarded” when an assignment or cast is performed. Limiting each assignment expression to one operation or executing a cast after each operation effectively isolates the operations.
In many cases, a compiler will then generate code using instructions of the nominal type, instead of instructions using a type with excess precision. In particular, this should avoid a fused multiply-add (FMA) being substituted for a multiplication followed by an addition. (An FMA has effectively infinite precision in the product before it is added to the addend, thus falling under the “excess precision is permitted” rule.)
There are caveats, however. An implementation might first evaluate an operation with excess precision and then round it to the nominal precision. In general, this can cause a different result than doing a single operation in the nominal precision. For the elementary operations of addition, subtraction, multiplication, division, and even square root, this does not happen if the excess precision is sufficiently greater than the nominal precision. (There are proofs that a result with sufficient excess precision is always close enough to the infinitely precise result that the rounding to nominal precision gets the same result.) This is true for the case where the nominal precision is the IEEE-754 basic 32-bit binary floating-point format, and the excess precision is the 64-bit format. However, it is not true where the nominal precision is the 64-bit format and the excess precision is Intel’s 80-bit format.
So, whether this workaround works depends on the platform.
Other Issues
Aside from the use of excess precision and features like FMA or the optimizer rewriting expressions, there are other things that affect reproducibility, such as non-standard treatment of subnormals (notably replacing them with zeroes) and variations between math library routines. (sin, log, and similar functions return different results on different platforms. Nobody has fully implemented correctly rounded math library routines with known bounded performance.)
These are discussed in other Stack Overflow questions about floating-point reproducibility, as well as papers, specifications, and standards documents.
Irrelevant Issues
The order in which a processor executes floating-point operations is irrelevant. Processor reordering of calculations obeys rigid semantics; the results are identical regardless of the chronological order of execution. (Processor timing can affect results if, for example, a task is partitioned into subtasks, such as assigning multiple threads or processes to process different parts of the arrays. Among other issues, their results could arrive in different orders, and the process receiving their results might then add or otherwise combine their results in different orders.)
Using pointers will not fix anything. As far as C or C++ is concerned, *p where p is a pointer to double is the same as a where a is a double. One of the objects has a name (a) and one of them does not, but they are like roses: They smell the same. (There are issues where, if you have some other pointer q, the compiler might not know whether *q and *p refer to the same thing. But that also holds true for *q and a.)
Using volatile qualifiers will not aid in reproducibility regarding the excess precision or expression rewriting issue. That is because only an object (not a value) is volatile, which means it has no effect until you write it or read it. But, if you write it, you are using an assignment expression (see footnote 1), so the rule about discarding excess precision already applies. When reading the object, you would force the compiler to retrieve the actual value from memory, but this value will not be any different from the one a non-volatile object has after assignment, so nothing is accomplished.
Footnote
1 I would have to check on other things that modify an object, such as ++, but those are likely not significant for this discussion.
Write this critical chunk of code in assembly language.
The situation you're in is unusual. Most of the time people want the compiler to do optimizations, so compiler developers don't spend much development effort on means to avoid them. Even with the knobs you do get (pragmas, separate compilation, indirections, ...) you can never be sure something won't be optimized. Some of the undesirable optimizations you mention (constant folding, for instance) cannot be turned off by any means in modern compilers.
If you use assembly language you can be sure you're getting exactly what you wrote. If you do it any other way you won't have that level of confidence.
"clever enough to recognise the + 2 - 2 is redundant and optimise this away"
No! All decent compilers will apply constant propagation, figure out that a is constant, and optimize all your statements away into something equivalent to a = 1;. Here is the example with assembly.
Now if you make a volatile, the compiler has to assume that any change of a could have an impact outside the C++ programme. Constant propagation will still be performed to optimise each of these calculations, but the intermediary assignments are guaranteed to happen. Here is the example with assembly.
If you don't want constant propagation to happen, you need to deactivate optimizations. In this case, the best option would be to keep this code separate, so that you can compile the rest with all optimizations on.
However, this is not ideal. The optimizer could outperform you, and with this approach you'll lose global optimisation across function boundaries.
Recommendation/quote of the day:
Don't diddle code; Find better algorithms
- B.W.Kernighan & P.J.Plauger

Do these macros evaluate to the same code using gcc at compile-time?

Of course this is going to be a function of the compiler you are using, but I figured this would be a simple question to answer.
#define UBRRVAL(baud) (F_CPU/(16*baud)-1)
As compared with
#define UBRRVAL(baud) (F_CPU/16/baud-1)
I know that the latter is going to evaluate to (assuming F_CPU = 20000000):
#define UBRRVAL(baud) (1250000/baud-1)
Considering the precedence forced by the parentheses, I was curious to know if most compilers (gcc in particular) would evaluate the former expression equivalently to the latter at compile-time.
This is code that is going into an embedded system, so if these expressions are not evaluated equivalently at compile-time, then the latter is more efficient; a single division at run-time is more efficient than a division and a multiplication, of course.
Simple answer, no.
Because neither macro is fully parenthesized, there are cases where the two are very different.
Consider UBRRVAL(2+1). The first would expand to (F_CPU/(16*2+1)-1), which is equivalent to F_CPU/33 - 1. The second would expand to (F_CPU/16/2+1-1), which is equivalent to F_CPU/32. Not the same at all.
Of course, it probably isn't meant to be called with an expression, just with a single constant value, but there's nothing to prevent it, and as such, someone will do it sometime in the future. One of the many evils of macros. I would recommend using a short (static) inline function (or constexpr as suggested in comments, if this is using a recent enough C++ compiler) instead...
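A minimal sketch of the constexpr alternative, assuming a C++11 compiler. The names F_CPU_HZ and ubrr_value are made up for this example, and the clock value is the 20 MHz from the question.

```cpp
// Assumed clock frequency; in a real project this usually comes from the build system.
constexpr unsigned long F_CPU_HZ = 20000000UL;

// The argument is evaluated before it is used, so ubrr_value(2 + 1) behaves
// like ubrr_value(3), unlike the unparenthesized macros.
constexpr unsigned long ubrr_value(unsigned long baud)
{
    return F_CPU_HZ / (16UL * baud) - 1UL;
}

static_assert(ubrr_value(9600) == 129, "20000000 / 153600 - 1, in integer arithmetic");
static_assert(ubrr_value(2 + 1) == ubrr_value(3), "no precedence surprises");
```

The static_assert lines also show that the value is available at compile time whenever the argument is a constant.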
Simple answer, yes. Within the specific constraints given both will be fully evaluated at compile time.
Parentheses force precedence but they do not force order of evaluation, except to the extent defined by the "as if" rule. You cannot be sure what code will be emitted if the expression is slightly more complicated so it is not evaluated at compile time. This may well depend on the specific processor.
As a side point, on most processors a 4 bit shift left or shift right are the same cost, and if the baud rate is a power of two the compiler is likely to generate shift operations.
[And be careful about parenthesising macro arguments. You got away with it this time, but only just.]

Efficiency of repeated arithmetic between two macros

In an ANSI C project I am working on, I have two macros defined: PERIOD_IN_MS and CYCLES_PER_MS. In the actual period handling logic, I do many comparisons between a counter that is incremented every ''cycle'' and PERIOD_IN_MS * CYCLES_PER_MS. I'm concerned that this arithmetic operation is repeatedly evaluated during each comparison.
Does anyone know if this is true, or if the compiler will evaluate the product of the two integer literals at compile time and use that instead?
I realize that this particular example would probably only remove one instruction out of the generated assembly code, but now I'm curious about this.
The standard doesn't impose any requirement to do this, but any sensible compiler will fold these constants down into one at compile-time. See e.g. http://en.wikipedia.org/wiki/Constant_propagation.
If you're curious to know whether this has actually happened, you can always take a look at the assembler generated by the compiler.
The compiler should (but I believe in C is not required to) evaluate the constant expression at compile-time. A good compiler will almost certainly do it, though, when optimization is turned on.
If you want to avoid multiple evaluation, maybe just to speed up compilation, and your constants fit into int, you could enforce single evaluation by using an enumeration constant instead.
enum { cycles_per_period = PERIOD_IN_MS * CYCLES_PER_MS};
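A short sketch of how that enumeration constant would be used, written as C++ so the compile-time value can be checked with static_assert. The two macro values and the function period_elapsed are placeholders invented for the example, not values from the question.

```cpp
#define PERIOD_IN_MS 20      // hypothetical value
#define CYCLES_PER_MS 1000   // hypothetical value

// The product is evaluated once, when the enumerator is defined.
enum { cycles_per_period = PERIOD_IN_MS * CYCLES_PER_MS };

// An enumerator is an integral constant expression, so this is checked at compile time.
static_assert(cycles_per_period == 20000, "product folded at compile time");

// Hypothetical comparison as it might appear in the period handling logic.
bool period_elapsed(unsigned long counter)
{
    return counter >= static_cast<unsigned long>(cycles_per_period);
}
```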

Sequence points, conditionals and optimizations

I had an argument today with one of my colleagues about whether a compiler could change the semantics of a program when aggressive optimizations are enabled.
My colleague states that when optimizations are enabled, a compiler might change the order of some instructions. So that:
void foo(int a, int b)
{
if (a > 5)
{
if (b < 6)
{
// Do something
}
}
}
Might be changed to:
void foo(int a, int b)
{
if (b < 6)
{
if (a > 5)
{
// Do something
}
}
}
Of course, in this case, it doesn't change the program's general behavior and isn't really important.
From my understanding, I believe that the two if (condition) belong to two different sequence points and that the compiler can't change their order, even if changing it would keep the same general behavior.
So, dear SO users, what is the truth regarding this?
If the compiler can verify that there is no observable difference between those two, then it is free to make such optimizations.
Sequence points are a conceptual thing: the compiler has to generate code such that it behaves as if all the semantic rules like sequence points were followed. The generated code doesn't actually have to follow those rules if not following them produces no observable difference in the behavior of the program.
Even if you had:
if (a > 5 && b < 6)
the compiler could freely rearrange this to be
if (b < 6 && a > 5)
because there is no observable difference between the two (in this specific case where a and b are both int values). [This assumes that it is safe to read both a and b; if reading one of them could cause some error (e.g., one has a trap value), then the compiler would be more restricted in what optimizations it could make.]
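As a sketch of why the swap is unobservable for plain int comparisons, here are the two nestings written as functions that return whether "Do something" would run (the function names are made up for the example):

```cpp
// Outer test on a, inner test on b.
bool nested_a_first(int a, int b)
{
    if (a > 5) {
        if (b < 6) {
            return true;   // stands in for "Do something"
        }
    }
    return false;
}

// Same tests in the opposite order.
bool nested_b_first(int a, int b)
{
    if (b < 6) {
        if (a > 5) {
            return true;
        }
    }
    return false;
}
```

Neither comparison has a side effect, so for every pair of inputs both functions return the same value, and no conforming program can tell which one the compiler actually emitted.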
As there is no observable difference between the two program snippets - provided the implementation is one that doesn't use trap values or anything else that might cause the inner comparison to do something other than just evaluate to true or false - the compiler could optimize one to the other under the "as if" rule. If there was some observable difference or some way that a conforming program might behave differently then the compiler would be non-conforming if it changed one form to the other.
For C++, see 1.9 [intro.execution] / 5.
A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible execution sequences of the corresponding instance of the abstract machine with the same program and the same input. However, if any such execution sequence contains an undefined
operation, this International Standard places no requirement on the implementation executing that program with that input (not even with regard to operations preceding the first undefined operation).
[This provision is sometimes called the "as-if" rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.]
Yes, the if statement is a sequence point.
However, a smart and aggressive compiler can still reorder the different expressions and statements and move work across sequence points, provided no observable side effects change.
Sequence points only apply to the abstract machine.
If the target specific optimizer can prove that reversing the order of two instructions has no side effects, it can change them at will.
The end of a full expression (including those that control logical constructs like if, while, et cetera) is a sequence point. However, the sequence point really only provides a guarantee that side-effects of previously-evaluated statements have completed.
If a statement has no observable side-effects the compiler can do what it feels is best.
The truth is that if a>5 is false more often than b<6 is false, or vice versa, then the order makes a very minor difference, because the less favourable order has to evaluate both conditions on more occasions.
In reality though it is so trivial it is not worth bothering about in this particular case.
There are cases where it actually does make a difference, i.e. when you are filtering a large collection of data on several criteria and have to decide which filter to apply first, particularly if only one of them is O(log N) or constant and the subsequent checks are linear through what is left.
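A sketch of that filtering case, with made-up predicates: cheap_check is a stand-in for a constant-time test and expensive_check for a slow one, instrumented with a call counter so the effect of the ordering is visible.

```cpp
#include <cstddef>
#include <vector>

struct FilterStats {
    std::size_t matches;
    std::size_t expensive_calls;
};

// Hypothetical cheap predicate that rejects most elements.
inline bool cheap_check(int x) { return x % 100 == 0; }

// Hypothetical expensive predicate; the counter records how often it runs.
inline bool expensive_check(int x, std::size_t& calls)
{
    ++calls;
    return x % 3 == 0;
}

FilterStats cheap_first(const std::vector<int>& data)
{
    FilterStats s = {0, 0};
    for (int x : data)
        if (cheap_check(x) && expensive_check(x, s.expensive_calls))
            ++s.matches;
    return s;
}

FilterStats expensive_first(const std::vector<int>& data)
{
    FilterStats s = {0, 0};
    for (int x : data)
        if (expensive_check(x, s.expensive_calls) && cheap_check(x))
            ++s.matches;
    return s;
}
```

On the values 0 to 999 both orders find the same 4 matches, but cheap_first runs the expensive predicate 10 times and expensive_first runs it 1000 times. Because the counter is an observable side effect, the compiler is not allowed to swap the two conditions here; choosing the order is up to the programmer.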
Lots of PC programmer replies =)
The compiler may, and likely would, reorder the two tests for speed if "b" is passed to the function in a quickly-accessed register while "a" is passed on the stack. That's quite a common case for many compilers for 8-bit and 16-bit MCUs.
Through the optimization it doesn't need to first stack "b", then load "a" into a register, then evaluate "a", then load "b" back into a register, then evaluate "b". Quite a mess I'd rather hope the compiler handled by rearranging the sequence points.
Though of course as already mentioned, to be standard compliant the compiler needs to ensure that it doesn't change the program behavior by the optimization.