C++ compiler optimization for complex equations - c++

I have some equations that involve multiple operations that I would like to run as fast as possible. Since the c++ compiler breaks it down in to machine code anyway does it matter if I break it up to multiple lines like
A=4*B+4*C;
D=3*E/F;
G=A*D;
vs
G=12*E*(B+C)/F;
My need is more complex than this but the i think it conveys the idea. Also if this is in a function that gets called is in a loop, does defining double A, D cost CPU time vs putting it in as a class variable?

Using a modern compiler, Clang/Gcc/VC++/Intel, it won't really matter, the best thing you should do is worry about how readable your code will be and turn on optimizations, compiler designers are well aware of issues like these and design their compilers to (for the most part) optimize according.
If I were to say which would be slower I would assume the first way since there would be 3 mov instructions, I could be wrong. but this isn't something you should worry about too much.

If these variables are integers, that second code fragment is not a valid optimization of the first. For B=1, C=1, E=1, F=6, you have:
A=4*B+4*C; // 8
D=3*E/F; // 0
G=A*D; // 0
and
G=12*E*(B+C)/F; // 4
If floating point, then it really depends on what compiler, what compiler options, and what cpu you have.

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

I made the experience (this is not the question but a statement), that avoiding non-constant local variables in favor of const variables or avoiding local variables at all, enables the c++ compiler to generate faster code.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? e.g. Compiler giving up on certain optimization levels, as soon as the code gets too complex in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that that value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs, you do have to be careful playing games like this.
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no other assignments to other pointers to the same type couldn't have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
Of course, if optimization is disabled, more statements are typically slower than using fewer large expressions, and store/reload latency bottlenecks are usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume, that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
(Legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD!=0 - FP math happens at higher precision, and rounding to float or double costs extra). Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that, -fno-float-store. But -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

How much do C/C++ compilers optimize conditional statements?

I recently ran into a situation where I wrote the following code:
for(int i = 0; i < (size - 1); i++)
{
// do whatever
}
// Assume 'size' will be constant during the duration of the for loop
When looking at this code, it made me wonder how exactly the for loop condition is evaluated for each loop. Specifically, I'm curious as to whether or not the compiler would 'optimize away' any additional arithmetic that has to be done for each loop. In my case, would this code get compiled such that (size - 1) would have to be evaluated for every loop iteration? Or is the compiler smart enough to realize that the 'size' variable won't change, thus it could precalculate it for each loop iteration.
This then got me thinking about the general case where you have a conditional statement that may specify more operations than necessary.
As an example, how would the following two pieces of code compile:
if(6)
if(1+1+1+1+1+1)
int foo = 1;
if(foo + foo + foo + foo + foo + foo)
How smart is the compiler? Will the 3 cases listed above be converted into the same machine code?
And while I'm at, why not list another example. What does the compiler do if you are doing an operation within a conditional that won't have any effect on the end result? Example:
if(2*(val))
// Assume val is an int that can take on any value
In this example, the multiplication is completely unnecessary. While this case seems a lot stupider than my original case, the question still stands: will the compiler be able to remove this unnecessary multiplication?
Question:
How much optimization is involved with conditional statements?
Does it vary based on compiler?
Short answer: the compiler is exceptionally clever, and will generally optimise those cases that you have presented (including utterly ignoring irrelevant conditions).
One of the biggest hurdles language newcomers face in terms of truly understanding C++, is that there is not a one-to-one relationship between their code and what the computer executes. The entire purpose of the language is to create an abstraction. You are defining the program's semantics, but the computer has no responsibility to actually follow your C++ code line by line; indeed, if it did so, it would be abhorrently slow as compared to the speed we can expect from modern computers.
Generally speaking, unless you have a reason to micro-optimise (game developers come to mind), it is best to almost completely ignore this facet of programming, and trust your compiler. Write a program that takes the inputs you want, and gives the outputs you want, after performing the calculations you want… and let your compiler do the hard work of figuring out how the physical machine is going to make all that happen.
Are there exceptions? Certainly. Sometimes your requirements are so specific that you do know better than the compiler, and you end up optimising. You generally do this after profiling and determining what your bottlenecks are. And there's also no excuse to write deliberately silly code. After all, if you go out of your way to ask your program to copy a 50MB vector, then it's going to copy a 50MB vector.
But, assuming sensible code that means what it looks like, you really shouldn't spend too much time worrying about this. Because modern compilers are so good at optimising, that you'd be a fool to try to keep up.
The C++ language specification permits the compiler to make any optimization that results in no observable changes to the expected results.
If the compiler can determine that size is constant and will not change during execution, it can certainly make that particular optimization.
Alternatively, if the compiler can also determine that i is not used in the loop (and its value is not used afterwards), that it is used only as a counter, it might very well rewrite the loop to:
for(int i = 1; i < size; i++)
because that might produce smaller code. Even if this i is used in some fashion, the compiler can still make this change and then adjust all other usage of i so that the observable results are still the same.
To summarize: anything goes. The compiler may or may not make any optimization change as long as the observable results are the same.
Yes, there is a lot of optimization, and it is very complex.
It varies based on the compiler, and it also varies based on the compiler options
Check
https://meta.stackexchange.com/questions/25840/can-we-stop-recommending-the-dragon-book-please
for some book recomendations if you really want to understand what a compiler may do. It is a very complex subject.
You can also compile to assembly with the -S option (gcc / g++) to see what the compiler is really doing. Use -O3 / ... / -O0 / -O to experiment with different optimization levels.

Profiling a simple, one cycle length operation

We have an assignment where we need to profile a 'simple instruction' (addition or bit-wise and for example). This means performing the same operation a large number of times (100K+) and measuring the average time in microseconds. The result should be presented in cycle-lengths: (totalTime/iterations)*cphMHz.
So, results may vary but all in all we were told that we should get a result close to 1 cycle-length. Actual result doesn't matter as long as programming is correct.
My question is: what is a good operation to profile?
There are two points I need to concider:
I use loop unrolling to be a bit more accurate, so in each iteration I perform 10 simple instruction. This means I have to choose an operation to wouldn't be performed only once due to compiler optimization (we can't use -o0 flag as school staff does not).
Bad example: var = i; - the compiler would only perform the last command.
What is a real 'simple instruction'? How do I know the number of operations that are actually performed? I tried reading the assembly output, but I couldn't understand it.
Hope I was clear enough, any idea would be great.
Thanks anyway
P.S don't know if it matters but I write in CPP
1) This sounds (to me) like an impossible task, if optimizations are (or might be) enabled. You can never be sure on what the compiler will do during optimizations. I'd definitely do something like reusing the previous result. If allowed to/possible, I'd try to include a raw assembler snippet to be profiled (so you can be sure there's no additional overhead; although it still could be optimized).
2) As for instructions: One assembler command is one instruction. E.g. a += i will - depending on available instruction set and stuff - most likely result in 4 instructions: read a, read i, add, write a. Reading assembly is pretty much straightforward. Depending on the instruction set/processor, there might be different "directions" for reading (i.e. "from -> to"). x86 assemblers (and those for most other common processors) will prefer instruction target, source, while DSPs prefer to use instruction source, target. Just important to know: moving data has to happen through registers. So even a single assignment like a = b will result in two instructions (b to register and register to a).
In general, if this answer goes into the wrong direction, try to elaborate a bit more on your specific task and its requirements (e.g. which compiler is to be used) and drop me a short comment.

How to store doubles in memory

Recently I changed some code
double d0, d1;
// ... assign things to d0/d1 ...
double result = f(d0, d1)
to
double d[2];
// ... assign things to d[0]/d[1]
double result = f(d[0], d[1]);
I did not change any of the assignments to d, nor the calculations in f, nor anything else apart from the fact that the doubles are now stored in a fixed-length array.
However when compiling in release mode, with optimizations on, result changed.
My question is, why, and what should I know about how I should store doubles? Is one way more efficient, or better, than the other? Are there memory alignment issues? I'm looking for any information that would help me understand what's going on.
EDIT: I will try to get some code demonstrating the problem, however this is quite hard as the process that these numbers go through is huge (a lot of maths, numerical solvers, etc.).
However there is no change when compiled in Debug. I will double check this again to make sure but this is almost certain, i.e. the double values are identical in Debug between version 1 and version 2.
Comparing Debug to Release, results have never ever been the same between the two compilation modes, for various optimization reasons.
You probably have a 'fast math' compiler switch turned on, or are doing something in the "assign things" (which we can't see) which allows the compiler to legally reorder calculations. Even though the sequences are equivalent, it's likely the optimizer is treating them differently, so you end up with slightly different code generation. If it's reordered, you end up with slight differences in the least significant bits. Such is life with floating point.
You can prevent this by not using 'fast math' (if that's turned on), or forcing ordering thru the way you construct the formulas and intermediate values. Even that's hard (impossible?) to guarantee. The question is really "Why is the compiler generating different code for arrays vs numbered variables?", but that's basically an analysis of the code generator.
no these are equivalent - you have something else wrong.
Check the /fp:precise flags (or equivalent) the processor floating point hardware can run in more accuracy or more speed mode - it may have a different default in an optimized build
With regard to floating-point semantics, these are equivalent. However, it is conceivable that the compiler might decide to generate slightly different code sequences for the two, and that could result in differences in the result.
Can you post a complete code example that illustrates the difference? Without that to go on, anything anyone posts as an answer is just speculation.
To your concerns: memory alignment cannot effect the value of a double, and a compiler should be able to generate equivalent code for either example, so you don't need to worry that you're doing something wrong (at least, not in the limited example you posted).
The first way is more efficient, in a very theoretical way. It gives the compiler slightly more leeway in assigning stack slots and registers. In the second example, the compiler has to pick 2 consecutive slots - except of course if the compiler is smart enough to realize that you'd never notice.
It's quite possible that the double[2] causes the array to be allocated as two adjacent stack slots where it wasn't before, and that in turn can cause code reordering to improve memory access efficiency. IEEE754 floating point math doesn't obey the regular math rules, i.e. a+b+c != c+b+a

rate ++a,a++,a=a+1 and a+=1 in terms of execution efficiency in C.Assume gcc to be the compiler [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Is there a performance difference between i++ and ++i in C++?
In terms of usage of the following, please rate in terms of execution time in C.
In some interviews i was asked which shall i use among these variations and why.
a++
++a
a=a+1
a+=1
Here is what g++ -S produces:
void irrelevant_low_level_worries()
{
int a = 0;
// movl $0, -4(%ebp)
a++;
// incl -4(%ebp)
++a;
// incl -4(%ebp)
a = a + 1;
// incl -4(%ebp)
a += 1;
// incl -4(%ebp)
}
So even without any optimizer switches, all four statements compile to the exact same machine code.
You can't rate the execution time in C, because it's not the C code that is executed. You have to profile the executable code compiled with a specific compiler running on a specific computer to get a rating.
Also, rating a single operation doesn't give you something that you can really use. Todays processors execute several instructions in parallel, so the efficiency of an operation relies very much on how well it can be paired with the instructions in the surrounding code.
So, if you really need to use the one that has the best performance, you have to profile the code. Otherwise (which is about 98% of the time) you should use the one that is most readable and best conveys what the code is doing.
The circumstances where these kinds of things actually matter is very rare and few in between. Most of the time, it doesn't matter at all. In fact I'm willing to bet that this is the case for you.
What is true for one language/compiler/architecture may not be true for others. And really, the fact is irrelevant in the bigger picture anyway. Knowing these things do not make you a better programmer.
You should study algorithms, data structures, asymptotic analysis, clean and readable coding style, programming paradigms, etc. Those skills are a lot more important in producing performant and manageable code than knowing these kinds of low-level details.
Do not optimize prematurely, but also, do not micro-optimize. Look for the big picture optimizations.
This depends on the type of a as well as on the context of execution. If a is of a primitive type and if all four statements have the same identical effect then these should all be equivalent and identical in terms of efficiency. That is, the compiler should be smart enough to translate them into the same optimized machine code. Granted, that is not a requirement, but if it's not the case with your compiler then that is a good sign to start looking for a better compiler.
For most compilers it should compile to the same ASM code.
Same.
For more details see http://www.linux-kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf
I can't see why there should be any difference in execution time, but let's prove me wrong.
a++
and
++a
are not the same however, but this is not related to efficiency.
When it comes to performance of individual lines, context is always important, and guessing is not a good idea. Test and measure is better
In an interview, I would go with two answers:
At first glance, the generated code should be very similar, especially if a is an integer.
If execution time was definitely a known problem - you have to measure it using some kind of profiler.
Well, you could argue that a++ is short and to the point. It can only increment a by one, but the notation is very well understood. a=a+1 is a little more verbose (not a big deal, unless you have variablesWithGratuitouslyLongNames), but some might argue it's more "flexible" because you can replace the 1 or either of the a's to change the expression. a+=1 is maybe not as flexible as the other two but is a little more clear, in the sense that you can change the increment amount. ++a is different from a++ and some would argue against it because it's not always clear to people who don't use it often.
In terms of efficiency, I think most modern compilers will produce the same code for all of these but I could be mistaken. Really, you'd have to run your code with all variations and measure which performs best.
(assuming that a is an integer)
It depends on the context, and if we are in C or C++. In C the code you posted (except for a-- :-) will cause a modern C compiler to produce exactly the same code. But by a very high chance the expected answer is that a++ is the fastest one and a=a+1 the slowest, since ancient compilers relied on the user to perform such optimizations.
In C++ it depends of the type of a. When a is a numeric type, it acts the same way as in C, which means a++, a+=1 and a=a+1 generate the same code. When a is a object, it depends if any operator (++, + and =) is overloaded, since then the overloaded operator of the a object is called.
Also when you work in a field with very special compilers (like microcontrollers or embedded systems) these compilers can behave very differently on each of these input variations.