In the following example of a templated function, is the central if inside the for loop guaranteed to be optimized out, leaving only the instructions that are actually used?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and llvm 8.0), what are the alternatives, using C++11 at most?
Note that this function does nothing useful, and I know that this specific function can be optimized in several ways. All I want to focus on is how the bool template argument affects the generated code.
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        // Is this code optimized by the compiler to skip the unnecessary if?
        if (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}
In theory, no. The C++ standard permits compilers to be not just dumb, but downright hostile. A compiler could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same. [1]
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
[1] As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away. There is a very good chance that it will be, though, since the condition is a compile-time constant.
That said, C++17 gives us if constexpr, which discards the branch that fails the check; inside a template, the discarded branch is not even instantiated. If you want a guarantee, I would suggest using this feature instead.
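A minimal sketch of what that could look like for the function in the question, assuming a C++17 compiler (so outside the question's C++11 limit). The const qualifier on vals and the check_iterate helper are additions for illustration only:

```cpp
#include <cassert>
#include <limits>

// C++17 only: the branch not selected by IsMin is discarded at compile time
// and is never instantiated for that specialization.
template <bool IsMin>
float IterateOverArray(const float* vals, int arraySize) {
    float ret = IsMin ? std::numeric_limits<float>::max()
                      : -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if constexpr (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}

// Hypothetical usage check with made-up data.
inline void check_iterate() {
    const float data[] = {3.0f, -1.0f, 2.0f};
    assert(IterateOverArray<true>(data, 3) == -1.0f);   // minimum
    assert(IterateOverArray<false>(data, 3) == 3.0f);   // maximum
}
```

The body stays in one place, and each instantiation contains only the comparison it needs, whatever the optimization level.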
Before C++17 if you only want one part of the code to be compiled you would need to specialize the function and write only the code that pertains to that specialization.
Since you ask for an alternative in C++11, here is one:
float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
float ret = -std::numeric_limits<float>::max();
for (int x = 0; x < arraySize; x++) {
if (ret < vals[x])
ret = vals[x];
}
return ret;
}
float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
float ret = std::numeric_limits<float>::max();
for (int x = 0; x < arraySize; x++) {
if (ret > vals[x])
ret = vals[x];
}
return ret;
}
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
The idea is to use function overloading to handle the test.
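The explicit specialization route mentioned earlier can be sketched in C++11 as well. This is an illustrative variant, not the answer's own code; the const qualifier on vals and the check_specializations helper are assumptions made for the example:

```cpp
#include <cassert>
#include <limits>

// Primary template: declared only, so each value of IsMin gets its own body.
template <bool IsMin>
float IterateOverArray(const float* vals, int arraySize);

// Specialization that computes the minimum.
template <>
float IterateOverArray<true>(const float* vals, int arraySize) {
    float ret = std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret > vals[x]) ret = vals[x];
    }
    return ret;
}

// Specialization that computes the maximum.
template <>
float IterateOverArray<false>(const float* vals, int arraySize) {
    float ret = -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret < vals[x]) ret = vals[x];
    }
    return ret;
}

// Hypothetical usage check with made-up data.
inline void check_specializations() {
    const float data[] = {4.0f, 9.0f, -2.0f};
    assert(IterateOverArray<true>(data, 3) == -2.0f);
    assert(IterateOverArray<false>(data, 3) == 9.0f);
}
```

Full specializations of a function template are ordinary functions, so they need inline if they are defined in a header that is included from several translation units.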
Consider the following code:
void func(int a, size_t n)
{
const bool cond = (a==2);
if (cond){
for (size_t i=0; i<n; i++){
// do something small 1
// continue by doing something else.
}
} else {
for (size_t i=0; i<n; i++){
// do something small 2
// continue by doing something else.
}
}
}
In this code the // continue by doing something else. (which might be a large part and for some reason cannot be separated into a function) is repeated exactly the same. To avoid this repetition one can write:
void func(int a, size_t n)
{
const bool cond = (a==2);
for (size_t i=0; i<n; i++){
if (cond){
// do something small 1
} else {
// do something small 2
}
// continue by doing something else.
}
}
Now we have an if-statement inside a (let's say very large) for-loop. But the condition of the if-statement (cond) is const and will not change. Would the compiler somehow optimize the code (like change it to the initial implementation)? Any tips? Thanks.
Details matter, and you included too few. Since you are asking about compiler optimizations, you need to know that the compiler optimizes according to the as-if rule. Loosely speaking, the compiler can perform any optimization as long as it does not change the observable behavior (there are a few exceptions). Both of your functions have no observable behavior, so with optimizations turned on (gcc -O3), this is what the compiler does to them:
func(int, unsigned long):
ret
func2(int, unsigned long):
ret
It is futile to speculate what the compiler does to your code. Don't speculate, but look at the output. You can do that here: https://godbolt.org/z/oznWz6.
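To get assembly that is worth inspecting, the functions need some observable behavior. The sketch below is one way to set that up; the arithmetic is a made-up stand-in for the elided work, and all function names are hypothetical. The second pair of functions shows one way to test the condition once without duplicating the loop body, which is an illustration rather than something this answer proposes:

```cpp
#include <cassert>
#include <cstddef>

// Branch tested inside the loop, as in the second version of the question.
unsigned long func_branch_inside(int a, std::size_t n) {
    const bool cond = (a == 2);
    unsigned long acc = 0;
    for (std::size_t i = 0; i < n; i++) {
        if (cond) {
            acc += 1;   // stand-in for "do something small 1"
        } else {
            acc += 2;   // stand-in for "do something small 2"
        }
        acc += i;       // stand-in for "continue by doing something else"
    }
    return acc;         // returning the result makes the work observable
}

// Same body written once, with the condition as a template parameter.
template <bool Cond>
unsigned long func_loop(std::size_t n) {
    unsigned long acc = 0;
    for (std::size_t i = 0; i < n; i++) {
        acc += Cond ? 1 : 2;
        acc += i;
    }
    return acc;
}

// The runtime condition is tested once, outside the loop.
unsigned long func_dispatch(int a, std::size_t n) {
    return (a == 2) ? func_loop<true>(n) : func_loop<false>(n);
}

// Hypothetical usage check: both variants must agree.
inline void check_func() {
    assert(func_branch_inside(2, 4) == func_dispatch(2, 4));
    assert(func_branch_inside(3, 4) == func_dispatch(3, 4));
}
```

Pasting both variants into Compiler Explorer and comparing the output at different optimization levels shows what a particular compiler actually does with each.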
PS: Some mantras that I should not forget to include:
Don't do premature optimization. Code should be written primarily to be read by humans. Only when you have profiled and have evidence that you can gain something by improving that function should you consider trading readability for performance.
Also do not forget that code you write is not instructions for your CPU. Your code is an abstract description of what the final program should do. The compiler knows very well how to rearrange the code to get most out of your CPU. Typically it is much better at this than a human could possibly be.
What assurances do I have that a core constant expression (as in [expr.const].2) possibly containing constexpr function calls will actually be evaluated at compile time and on which conditions does this depend?
The introduction of constexpr implicitly promises runtime performance improvements by moving computations into the translation stage (compile time).
However, the standard does not (and presumably cannot) mandate what code a compiler produces. (See [expr.const] and [dcl.constexpr]).
These two points appear to be at odds with each other.
Under which circumstances can one rely on the compiler resolving a core constant expression (which might contain an arbitrarily complicated computation) at compile time rather than deferring it to runtime?
At least under -O0, gcc appears to actually emit code for, and a call to, a constexpr function. Under -O1 and up it doesn't.
Do we have to resort to trickery such as this, that forces the constexpr through the template system:
template <auto V>
struct compile_time_h { static constexpr auto value = V; };
template <auto V>
inline constexpr auto compile_time = compile_time_h<V>::value;
constexpr int f(int x) { return x; }
int main() {
for (int x = 0; x < compile_time<f(42)>; ++x) {}
}
When a constexpr function is called and the result is assigned to a constexpr variable, the call is always evaluated at compile time.
Here's a minimal example:
// Compile with -std=c++14 or later
constexpr int fib(int n) {
int f0 = 0;
int f1 = 1;
for(int i = 0; i < n; i++) {
int hold = f0 + f1;
f0 = f1;
f1 = hold;
}
return f0;
}
int main() {
constexpr int blarg = fib(10);
return blarg;
}
When compiled at -O0, gcc outputs the following assembly for main:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 55
mov eax, 55
pop rbp
ret
Despite all optimization being turned off, there's never any call to fib in the main function itself.
This applies going all the way back to C++11; however, in C++11 the fib function would have to be rewritten to use recursion to avoid the use of mutable variables.
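One way the C++11 rewrite could look, using recursion in place of the loop and the mutable locals. fib_impl is a hypothetical helper name introduced for this sketch:

```cpp
// C++11-compatible: each constexpr function is a single return statement.
// fib_impl carries the two running values as parameters instead of locals.
constexpr int fib_impl(int n, int f0, int f1) {
    return n == 0 ? f0 : fib_impl(n - 1, f1, f0 + f1);
}

constexpr int fib(int n) {
    return fib_impl(n, 0, 1);
}

// Fails to compile unless fib(10) is a constant expression equal to 55.
static_assert(fib(10) == 55, "fib(10) must be a compile-time constant");
```

The static_assert doubles as a check that the call really is usable in a constant expression.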
Why does the compiler include the assembly for fib in the executable sometimes? A constexpr function can be used at runtime, and when invoked at runtime it will behave like a regular function.
Used properly, constexpr can provide some performance benefits in specific cases, but the push to make everything constexpr is more about writing code that the compiler can check for Undefined Behavior.
What's an example of constexpr providing performance benefits? When implementing a function like std::visit, you need to create a lookup table of function pointers. Creating the lookup table every time std::visit is called would be costly, and assigning the lookup table to a static local variable would still result in measurable overhead because the program has to check if that variable's been initialized every time the function is run.
Thankfully, you can make the lookup table constexpr, and the compiler will actually inline the lookup table into the assembly code for the function so that the contents of the lookup table is significantly more likely to be inside the instruction cache when std::visit is run.
Does C++20 provide any mechanisms for guaranteeing that something runs at compile time?
If a function is consteval, then the standard specifies that every call to the function must produce a compile-time constant.
This can be trivially used to force the compile-time evaluation of any constexpr function:
template<class T>
consteval T run_at_compiletime(T value) {
return value;
}
Anything given as a parameter to run_at_compiletime must be evaluated at compile-time:
constexpr int fib(int n) {
int f0 = 0;
int f1 = 1;
for(int i = 0; i < n; i++) {
int hold = f0 + f1;
f0 = f1;
f1 = hold;
}
return f0;
}
int main() {
// fib(10) will definitely run at compile time
return run_at_compiletime(fib(10));
}
Never; the C++ standard permits almost the entire compilation to occur at "runtime". Some diagnostics have to be done at compile time, but nothing prevents insanity on the part of the compiler.
Your binary could be a copy of the compiler with your source code appended, and C++ wouldn't say the compiler did anything wrong.
What you are looking at is a QoI (Quality of Implementation) issue.
In practice, constexpr variables tend to be compile time computed, and template parameters are always compile time computed.
consteval can also be used to mark functions that must be evaluated at compile time.
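Because template arguments must be constant expressions, routing a value through one is a way to require a constant expression without C++17 or C++20 features. A minimal C++11 sketch, where f, forced and runtime_use are hypothetical names chosen for the example:

```cpp
#include <type_traits>

// Hypothetical constexpr function used only for this illustration.
constexpr int f(int x) { return x * x; }

// A non-type template argument has to be a constant expression, so this
// line is ill-formed unless f(7) can be evaluated during translation.
constexpr int forced = std::integral_constant<int, f(7)>::value;

// The same trick inside an ordinary runtime function.
int runtime_use() {
    return std::integral_constant<int, f(7)>::value;
}

static_assert(forced == 49, "f(7) must be a compile-time constant");
```

As the answer says, what machine code the compiler emits for runtime_use is still a quality-of-implementation matter; the standard only guarantees that the value is a constant expression.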
If I have a loop that I know needs to be executed n times is there a way to write a while (or for) loop without a comparison each time? If not is there a way to make a macro turn:
int i = 0;
for(i = 0; i < 5; i++) {
operation();
}
into:
operation();
operation();
operation();
operation();
operation();
P.S. This is the fastest loop I've come up with so far.
int i = 5;
while(i-- > 0) {
operation();
}
A Sufficiently Smart Compiler will do this for you. More specifically, optimizing compilers understand loop unrolling. It's a fairly basic optimization, especially in cases like your example where the number of iterations is known at compile time.
So in short: turn on compiler optimizations and don't worry about it.
The number of instructions you write in the source code is not strictly related to the number of machine instructions the compiler will generate.
Most compilers are smarter and in your second example can generate code like:
operation();
operation();
operation();
operation();
operation();
automatically because they detect that the loop will always iterate 5 times.
Also, if you use profile-guided optimization and the compiler sees that a loop has a tiny body and a very high repeat count, it may unroll it even for a generic number of iterations, with code like:
while (count >= 5) {
operation();
operation();
operation();
operation();
operation();
count -= 5;
}
while (count > 0) {
operation();
count--;
}
For large counts, this performs about one fifth as many tests as the naive version.
If this is worth doing or not is something that only profiling can tell.
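The unrolled form above should call operation() exactly as often as the plain loop. A small sketch to confirm that, assuming a counter-based stand-in for operation(); the names g_operations, run_naive, run_unrolled and calls_made are hypothetical:

```cpp
#include <cassert>

static int g_operations = 0;            // counts calls so the loops can be compared
void operation() { ++g_operations; }    // stand-in for the real work

// Plain loop: one test per call.
void run_naive(int count) {
    while (count > 0) {
        operation();
        count--;
    }
}

// Unrolled by five, with a second loop for the remainder.
void run_unrolled(int count) {
    while (count >= 5) {
        operation();
        operation();
        operation();
        operation();
        operation();
        count -= 5;
    }
    while (count > 0) {
        operation();
        count--;
    }
}

// Runs one of the loops and reports how many calls it made.
int calls_made(void (*runner)(int), int count) {
    g_operations = 0;
    runner(count);
    return g_operations;
}

// Hypothetical usage check: 23 = 4 full groups of five plus 3 leftovers.
inline void check_unroll() {
    assert(calls_made(run_naive, 23) == 23);
    assert(calls_made(run_unrolled, 23) == 23);
}
```

Whether the hand-unrolled version is any faster than what the optimizer produces from the plain loop is exactly the kind of question that needs measuring.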
One thing you can do if you know for sure that the code needs to be executed at least once is to write
do {
operation();
} while (--count);
instead of
while (count--) {
operation();
}
The possibility that count==0 is somewhat annoying for CPUs, because in the code generated by most compilers it requires an extra JMP forward:
jmp test
loop:
...operation...
test:
...do the test...
jne loop
the machine code for the do { ... } while version instead is simply
loop:
...operation...
... do the test...
jne loop
Both loops will do comparisons.
Anyhow, the compiler should identify the constant iteration count and unroll the loop.
You could check that with gcc and the optimization flags (-O) and look at the generated code afterwards.
More important:
Don't optimize unless there is a significant reason to do so!
Once the C code is compiled, the while and for loops are converted to comparison statements in machine language, so there is no way to avoid some type of comparison with the for/while loops. You could make a series of goto and arithmetic statements that avoid using a comparison, but the result would probably be less efficient. You should look into how these loops are compiled into machine language using radare2 or gdb to see how they might be improved there.
With templates, you may unroll the loop (if the count is known at compile time) with something like:
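#include <cstddef>
#include <initializer_list>
#include <utility>  // std::index_sequence, std::make_index_sequence (C++14)

// operation() must be declared before this point.
namespace detail
{
    template <std::size_t ... Is>
    void do_operation(std::index_sequence<Is...>)
    {
        // Expands to one call to operation() per index; the list itself is discarded.
        (void)std::initializer_list<std::size_t>{(static_cast<void>(operation()), Is)...};
    }
}
template <std::size_t N>
void do_operation()
{
    detail::do_operation(std::make_index_sequence<N>());
}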
but the compiler may already perform that sort of optimization on a normal loop.
for(int i = 0; i < my_function(MY_CONSTANT); ++i){
//code using i
}
In this example, will my_function(MY_CONSTANT) be evaluated at each iteration, or will it be stored automatically? Would this depend on the optimization flags used?
It has to work as if the function is called each time.
However, if the compiler can prove that the function result will be the same each time, it can optimize under the “as if” rule.
E.g. this usually happens with calls to .end() for standard containers.
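The "as if called each time" requirement becomes visible as soon as the call has a side effect. A minimal sketch, assuming a global counter (g_calls, a hypothetical name) that makes every call observable:

```cpp
#include <cassert>

static int g_calls = 0;   // hypothetical counter: makes each call observable

// Returns the bound and records that it was called.
int my_function(int limit) {
    ++g_calls;
    return limit;
}

// The condition is re-evaluated before every iteration, including the
// final test that ends the loop.
int count_iterations(int limit) {
    int iterations = 0;
    for (int i = 0; i < my_function(limit); ++i) {
        ++iterations;
    }
    return iterations;
}

// Hypothetical usage check: 3 iterations need 4 evaluations of the condition.
inline void check_calls() {
    g_calls = 0;
    assert(count_iterations(3) == 3);
    assert(g_calls == 4);
}
```

Here the compiler may not drop any of the calls, because the counter would end up with a different value. Without the counter, the function has no side effects and the call can be hoisted or folded away.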
General advice: when in doubt about whether to micro-optimize a piece of code,
Don't do it.
If you're still thinking of doing it, measure.
Well, there was a third point but I've forgotten it; maybe it was: still wait.
In other words, decide whether to use a variable based on how clear the code then is, not on imagined performance.
It will be evaluated on each iteration unless the compiler can prove that hoisting the call is safe. You can save the extra computation time by doing something like
const int stop = my_function(MY_CONSTANT);
for(int i = 0; i < stop; ++i){
//code using i
}
A modern optimizing compiler may, under the as-if rule, be able to optimize away the function call in the case that you outlined in your comment here. The as-if rule says that a conforming compiler only has to emulate the observable behavior; we can see this by going to the draft C++ standard, section 1.9 Program execution, which says:
[...]Rather, conforming implementations are required to emulate (only)
the observable behavior of the abstract machine as explained below.
So if you are using a constant expression and my_function does not have observable side effects it could be optimized out. We can put together a simple test (see it live on godbolt):
#include <stdio.h>
#define blah 10
int func( int x )
{
return x + 20 ;
}
void withConstant( int y )
{
for(int i = 0; i < func(blah); i++)
{
printf("%d ", i ) ;
}
}
void withoutConstant(int y)
{
for(int i = 0; i < func(i+y); i++)
{
printf("%d ", i ) ;
}
}
In the case of withConstant we can see it optimizes the computation:
cmpl $30, %ebx #, i
and even in the case of withoutConstant it inlines the calculation instead of performing a function call:
leal 0(%rbp,%rbx), %eax #, D.2605
If my_function is declared constexpr and the argument really is a constant, the value can be calculated at compile time, which fulfils the "as-if" and "sequential-consistency with no data-race" rules.
constexpr int my_function(const int c);
If your function has side effects it would prevent the compiler from moving it out of the for-loop as it would not fulfil the "as-if" rule, unless the compiler can reason its way out of it.
The compiler might inline my_function, treat its body as if it were part of the loop, and discover through constant folding that it is really only a constant, in effect removing the call and replacing it with that constant.
int my_function(const int c) {
return 17+c; // inline and constant reduced to the value.
}
So the answer to your question is ... maybe!
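For the constexpr case, the "maybe" can be removed by binding the result to a constexpr variable, which must be initialized by a constant expression. A minimal sketch, assuming a made-up value for MY_CONSTANT and the 17+c body shown above; count_iterations and check_stop are hypothetical helpers:

```cpp
#include <cassert>

constexpr int MY_CONSTANT = 10;   // hypothetical value for illustration

constexpr int my_function(const int c) {
    return 17 + c;
}

int count_iterations() {
    // A constexpr variable must be initialized by a constant expression,
    // so the loop bound is fixed during translation.
    constexpr int stop = my_function(MY_CONSTANT);
    int iterations = 0;
    for (int i = 0; i < stop; ++i) {
        ++iterations;   // stand-in for "code using i"
    }
    return iterations;
}

static_assert(my_function(MY_CONSTANT) == 27, "bound is a compile-time constant");

// Hypothetical usage check.
inline void check_stop() {
    assert(count_iterations() == 27);
}
```

If my_function were not usable in a constant expression, the declaration of stop would fail to compile instead of silently falling back to a runtime call.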
I have code that looks like the following in VC++:
for (int i = (a - 1) * b; i < a * b && i < someObject->someFunction(); i++)
{
// ...
}
As far as I know compilers optimize all these arithmetic operations and they won't be executed on each loop, but I'm not sure if they can tell that the function above also returns the same value each time and it doesn't need to be called each time.
Is it a better practice to save all calculations into variables, or just rely on compiler optimizations to have a more readable code?
int start = (a - 1) * b;
int expra = a * b;
int exprb = someObject->someFunction();
for (int i = start; i < expra && i < exprb; i++)
{
// ...
}
Short answer: it depends. If the compiler can deduce that running someObject->someFunction() every time and caching the result once both produce the same effects, it is allowed (but not guaranteed) to do so. Whether this static analysis is possible depends on your program: specifically, what the static type of someObject is and what its dynamic type is expected to be, as well as what someFunction() actually does, whether it's virtual, and so on.
In general, if it only needs to be done once, write your code in such a way that it can only be done once, bypassing the need to worry about what the compiler might be doing:
int start = (a - 1) * b;
int expra = a * b;
int exprb = someObject->someFunction();
for (int i = start; i < expra && i < exprb; i++)
// ...
Or, if you're into being concise:
for (int i = (a - 1) * b, expra = a * b, exprb = someObject->someFunction();
i < expra && i < exprb; i++)
// ...
In my experience, the VC++ compiler won't optimize the function call out unless it can see the function's implementation at the point where it compiles the calling code. So moving the call outside the loop is a good idea.
If a function resides within the same compilation unit as its caller, the compiler can often deduce some facts about it - e.g. that its output might not change for subsequent calls. In general, however, that is not the case.
In your example, assigning variables for these simple arithmetic expressions does not really change anything with regards to the produced object code and, in my opinion, makes the code less readable. Unless you have a bunch of long expressions that cannot reasonably be put within a line or two, you should avoid using temporary variables - if for no other reason, then just to reduce namespace pollution.
Using temporary variables implies a significant management overhead for the programmer, in order to keep them separate and avoid unintended side-effects. It also makes reusing code snippets harder.
On the other hand, assigning the result of the function to a variable can help the compiler optimise your code better by explicitly avoiding multiple function calls.
Personally, I would go with this:
int expr = someObject->someFunction();
for (int i = (a - 1) * b; i < a * b && i < expr; i++)
{
// ...
}
The compiler cannot make any assumption on whether your function will return the same value at each time. Let's imagine that your object is a socket, how could the compiler possibly know what will be its output?
Also, the optimization that a compiler can make in such loops strongly depends on whether a and b are declared as const or not, and whether or not they are local. With advanced optimization schemes, it may be able to infer that a and b are neither modified in the loop nor in your function (again, you might imagine that your object holds some reference to them).
Well, in short: go for the second version of your code!
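A small sketch of why the compiler cannot assume the result is stable. Source is a hypothetical type standing in for something like the socket mentioned above: each call to someFunction() changes what the next call returns, so caching the bound changes the program's behavior:

```cpp
#include <cassert>

// Hypothetical stateful object: every call returns a smaller value.
struct Source {
    int remaining;
    int someFunction() { return remaining--; }  // returns the old value, then decrements
};

// Calls someFunction() before every iteration, as in the original loop.
int loop_calling_each_time(Source s) {
    int iterations = 0;
    for (int i = 0; i < s.someFunction(); i++) {
        ++iterations;
    }
    return iterations;
}

// Calls someFunction() once and reuses the result.
int loop_with_cached_bound(Source s) {
    const int bound = s.someFunction();
    int iterations = 0;
    for (int i = 0; i < bound; i++) {
        ++iterations;
    }
    return iterations;
}

// Hypothetical usage check: the two loops do not run the same number of times.
inline void check_source() {
    assert(loop_calling_each_time(Source{10}) == 5);
    assert(loop_with_cached_bound(Source{10}) == 10);
}
```

Since the two versions are not equivalent for such a type, hoisting the call is a decision only the programmer can make when the function body is not visible to the compiler.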
It is very likely that the compiler will call the function each time.
If you are concerned with the readability of code, what about using:
int maxindex = min (expra, exprb);
for (i=start; i<maxindex; i++)
IMHO, long lines do not improve readability.
Writing short lines and taking multiple steps to get a result does not hurt performance; this is exactly why we use compilers.
Effectively what you might be asking is whether the compiler will inline the function someFunction() and whether it will see that someObject is the same instance in each loop, and if it does both it will potentially "cache" the return value and not keep re-evaluating it.
Much of this may depend on what optimisation settings you use, with VC++ as well as any other compiler, although I am not sure VC++ gives you quite as many flags as GCC.
I often find it incredible that programmers rely on compilers to optimise things they can easily optimise themselves. If you know the expression will evaluate to the same value each time, just move it to the first section of the for-loop and don't rely on the compiler:
for (int i = (a - 1) * b, iMax = someObject->someFunction();
i < a * b && i < iMax; ++i)
{
// body
}