Can I hint the optimizer by giving the range of an integer? - c++

I am using an int type to store a value. By the semantics of the program, the value always stays within a very small range (0 - 36), and an int (rather than a char) is used only for CPU efficiency.
It seems like many special arithmetic optimizations could be performed on such a small range of integers. Many function calls on those integers might be optimized into a small set of "magical" operations, and some functions may even be optimized into table look-ups.
So, is it possible to tell the compiler that this int is always in that small range, and is it possible for the compiler to do those optimizations?

Yes, it is possible. For example, for gcc you can use __builtin_unreachable to tell the compiler about impossible conditions, like so:
if (value < 0 || value > 36) __builtin_unreachable();
We can wrap the condition above in a macro:
#define assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
And use it like so:
assume(x >= 0 && x <= 10);
As you can see, gcc performs optimizations based on this information:
#define assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
int func(int x) {
    assume(x >= 0 && x <= 10);
    if (x > 11) {
        return 2;
    }
    else {
        return 17;
    }
}
Produces:
func(int):
mov eax, 17
ret
One downside, however, is that if your code ever breaks such assumptions, you get undefined behavior.
The compiler doesn't notify you when this happens, even in debug builds. To debug/test/catch bugs with assumptions more easily, you can use a hybrid assume/assert macro (credit to @David Z), like this one:
#if defined(NDEBUG)
#define assume(cond) do { if (!(cond)) __builtin_unreachable(); } while (0)
#else
#include <cassert>
#define assume(cond) assert(cond)
#endif
In debug builds (with NDEBUG not defined) it works like an ordinary assert, printing an error message and aborting the program; in release builds it makes use of the assumption, producing optimized code.
Note, however, that it is not a substitute for a regular assert - cond remains in release builds, so you should not do something like assume(VeryExpensiveComputation()).
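For example (a sketch of the pitfall; VeryExpensiveComputation is just the placeholder name from above, assumed here to be a slow, side-effect-free function):
// The condition below may still be evaluated in release builds, because the
// compiler cannot always prove that the call can be removed.
assume(VeryExpensiveComputation() == 42);
// A plain assert, in contrast, compiles to nothing at all once NDEBUG is defined.
assert(VeryExpensiveComputation() == 42);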

There is standard support for this. What you should do is include stdint.h (cstdint) and then use the type uint_fast8_t.
This tells the compiler that you are only using numbers between 0 and 255, but that it is free to use a larger type if that gives faster code. Similarly, the compiler can assume that the variable will never have a value above 255 and can then optimize accordingly.
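A minimal sketch of the suggested approach (the helper function and the modulo are my own illustration, not from the question):
#include <cstdint>

// uint_fast8_t is the fastest unsigned integer type that is at least 8 bits wide;
// on many platforms it is simply an alias for a wider, register-friendly type.
std::uint_fast8_t wrap_to_range(std::uint_fast8_t value)   // hypothetical helper
{
    return static_cast<std::uint_fast8_t>(value % 37);     // result stays in 0 - 36
}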

The current answer is good for the case when you know for sure what the range is, but if you still want correct behavior when the value is out of the expected range, then it won't work.
For that case, I found this technique can work:
if (x == c)   // assume c is a constant
{
    foo(x);
}
else
{
    foo(x);
}
The idea is a code-data tradeoff: you're moving 1 bit of data (whether x == c) into control logic.
This hints to the optimizer that x is in fact a known constant c, encouraging it to inline and optimize the first invocation of foo separately from the rest, possibly quite heavily.
Make sure to actually factor the code into a single subroutine foo, though -- don't duplicate the code.
Example:
For this technique to work you need to be a little lucky -- there are cases where the compiler decides not to evaluate things statically, and they're kind of arbitrary. But when it works, it works well:
#include <math.h>
#include <stdio.h>
unsigned foo(unsigned x)
{
    return x * (x + 1);
}
unsigned bar(unsigned x) { return foo(x + 1) + foo(2 * x); }
int main()
{
    unsigned x;
    scanf("%u", &x);
    unsigned r;
    if (x == 1)
    {
        r = bar(bar(x));
    }
    else if (x == 0)
    {
        r = bar(bar(x));
    }
    else
    {
        r = bar(x + 1);
    }
    printf("%#x\n", r);
}
Just use -O3 and notice the pre-evaluated constants 0x20 and 0x30e in the assembler output.
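If you want to check where those two constants come from, the arithmetic can be verified at compile time (the constexpr re-declarations below are only for this check, not part of the answer's code):
// foo(x) = x*(x+1), bar(x) = foo(x+1) + foo(2*x)
constexpr unsigned foo_c(unsigned x) { return x * (x + 1); }
constexpr unsigned bar_c(unsigned x) { return foo_c(x + 1) + foo_c(2 * x); }
static_assert(bar_c(bar_c(0)) == 0x20, "x == 0: bar(0) = 2, bar(2) = 12 + 20 = 0x20");
static_assert(bar_c(bar_c(1)) == 0x30e, "x == 1: bar(1) = 12, bar(12) = 182 + 600 = 0x30e");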

I am just pitching in to say that if you want a solution that is more standard C++, you can use the [[noreturn]] attribute to write your own unreachable.
So I'll re-purpose deniss' excellent example to demonstrate:
namespace detail {
    [[noreturn]] void unreachable() {}
}
#define assume(cond) do { if (!(cond)) detail::unreachable(); } while (0)
int func(int x) {
    assume(x >= 0 && x <= 10);
    if (x > 11) {
        return 2;
    }
    else {
        return 17;
    }
}
Which, as you can see, results in nearly identical code:
detail::unreachable():
rep ret
func(int):
movl $17, %eax
ret
The downside, of course, is that you get a warning that a [[noreturn]] function does, indeed, return.

Related

Is there any speed difference between the following two cases?

Will judge_function_2 be faster than judge_function_1? I think the condition in judge_function_1 needs to evaluate two comparisons, while each if in judge_function_2 only needs to evaluate one.
#include <iostream>
using namespace std;
bool judge_function_1()
{
    if(100 == 100 || 100 == 200)
    {
        return true;
    }
    return false;
}
bool judge_function_2()
{
    if(100 == 100)
    {
        return true;
    }
    if(100 == 200)
    {
        return true;
    }
    return false;
}
int main()
{
    cout << judge_function_1() << endl;
    cout << judge_function_2() << endl;
    return 0;
}
Using godbolt and compiling with gcc with optimizations enabled results in the following assembly code (https://godbolt.org/z/YEfYGv5vh):
judge_function_1():
mov eax, 1
ret
judge_function_2():
mov eax, 1
ret
The functions' assembly code is identical and both return true; they will be exactly as fast.
Compilers know how to read and understand code. They cannot guess your intentions, but that's only an issue when your code does not express your intentions.
A compiler "knows" that 100 == 100 is always true. It cannot be something else.
A compiler "knows" that true || whatever is always true. It cannot be something else.
A compiler "knows" that after a return no other statements are executed in a function.
It is straightforward to prove that both functions are equivalent to
bool same_thing_just_with_less_fluff() { return true; }
The code in both functions is just a very complicated way to say return true; / "this function returns true".
Compilers optimize code when you ask them to. If you turn on optimizations, there is no reason to expect any difference between the two functions: calling them has exactly the same effect, and there is no observable difference between the two. As shown above, the way to prove this is rather simple. It is somewhat safe to assume that any compiler can optimize the two functions to simply return true;.
In my experience, a misunderstanding that is common among beginners is that your code is a set of instructions for your CPU: that when you write 100 == 100, then at runtime there must be 100 in one register and 100 in another register, and the CPU needs to carry out an operation to check whether the values are the same. This picture is totally wrong.
Your code is an abstract description of the observable behavior of your program. It is a compiler's job to translate this abstract description into something your CPU understands and can execute to exhibit the observable behavior you described, in accordance with the definitions provided by the C++ standard.
What your code describes is, in plain English: write 1 followed by a newline to stdout and flush it, twice.
From your guess:
I think the condition in judge_function_1 needs to evaluate two comparisons, while each if in judge_function_2 only needs to evaluate one.
I deduce that you expect some comparisons to actually be performed at runtime, as would be the case if you passed two parameters to your function and compared them with constants:
bool judge_function_1(int x, int y)
{
    if(x == 100 || y == 200)
    {
        return true;
    }
    return false;
}
Even in that case, if the first condition is true, the function will return immediately, without comparing y to 200.
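A minimal sketch demonstrating the short-circuit (side_effect is a hypothetical helper, added only to make the skipped evaluation observable):
#include <iostream>

bool side_effect(int y)
{
    std::cout << "comparing y\n";        // only printed if the right-hand side is evaluated
    return y == 200;
}

bool judge(int x, int y)
{
    return x == 100 || side_effect(y);   // || short-circuits when the left operand is true
}

int main()
{
    judge(100, 200);   // prints nothing: the left operand is already true
    judge(1, 200);     // prints "comparing y"
}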

Why is fetestexcept in C++ compiled to a function call rather than inlined

I am evaluating the usage (clearing and querying) of floating-point exceptions in performance-critical/"hot" code. Looking at the binary produced, I noticed that neither GCC nor Clang expands the call to the inline sequence of instructions that I would expect; instead they seem to generate a call to the runtime library. This is prohibitively expensive for my application.
Consider the following minimal example:
#include <fenv.h>
#pragma STDC FENV_ACCESS on
inline int fetestexcept_inline(int e)
{
    unsigned int mxcsr;
    asm volatile ("vstmxcsr" " %0" : "=m" (*&mxcsr));
    return mxcsr & e & FE_ALL_EXCEPT;
}
double f1(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept_inline(FE_OVERFLOW)) return -1;
    else return r;
}
double f2(double a)
{
    double r = a * a;
    if(r == 0 || fetestexcept(FE_OVERFLOW)) return -1;
    else return r;
}
And the output as produced by GCC: https://godbolt.org/z/jxjzYY
The compiler seems to know that it can use the CPU-family-dependent AVX instructions for the target (it uses "vmulsd" for the multiplication). However, no matter which optimization flags I try, it will always produce the much more expensive function call to glibc rather than the assembly that (as far as I understand) should do what the corresponding glibc function does.
This is not intended as a complaint, I am OK with adding the inline assembly. I just wonder whether there might be a subtle difference that I am overlooking that could be a bug in the inline-assembly-version.
It's required to support long double arithmetic. fetestexcept needs to merge the SSE and x87 FPU states, because long double operations only update the x87 status word, not the MXCSR register. Therefore, the benefit from inlining is somewhat reduced.
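For illustration, a sketch of what a merged query could look like on x86-64 (an assumption-laden example, not glibc's actual source; it assumes SSE plus the legacy x87 FPU, whose status-word exception bits share the layout of the low MXCSR bits):
#include <fenv.h>

inline int fetestexcept_merged(int e)
{
    unsigned int mxcsr;
    unsigned short fsw;
    __asm__ __volatile__ ("stmxcsr %0" : "=m" (mxcsr));   // SSE exception flags
    __asm__ __volatile__ ("fnstsw %0" : "=m" (fsw));      // x87 status word, updated by long double math
    return (mxcsr | fsw) & e & FE_ALL_EXCEPT;
}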

Function with template bool argument: guaranteed to be optimized?

In the following example of a templated function, is the central if inside the for loop guaranteed to be optimized out, leaving only the used instructions?
If this is not guaranteed to be optimized (in GCC 4, MSVC 2013 and llvm 8.0), what are the alternatives, using C++11 at most?
NOTE that this function does nothing useful, and I know that this specific function can be optimized in several other ways and so on. But all I want to focus on is how the bool template argument works in generating code.
#include <limits>

template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        // Is this code optimized by the compiler to skip the unnecessary if?
        if (IsMin) {
            if (ret > vals[x]) ret = vals[x];
        } else {
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}
In theory no. The C++ standard permits compilers to be not just dumb, but downright hostile. It could inject code doing useless stuff for no reason, so long as the abstract machine behaviour remains the same.¹
In practice, yes. Dead code elimination and constant branch detection are easy, and every single compiler I have ever checked eliminates that if branch.
Note that both branches are compiled before one is eliminated, so they both must be fully valid code. The output assembly behaves "as if" both branches exist, but the branch instruction (and unreachable code) is not an observable feature of the abstract machine behaviour.
Naturally if you do not optimize, the branch and dead code may be left in, so you can move the instruction pointer into the "dead code" with your debugger.
¹ As an example, nothing prevents a compiler from implementing a+b as a loop calling inc in assembly, or a*b as a loop adding a repeatedly. This is a hostile act by the compiler on almost all platforms, but not banned by the standard.
There is no guarantee that it will be optimized away. There is a pretty good chance that it will be, though, since it is a compile-time constant.
That said, C++17 gives us if constexpr, which will compile only the branch that passes the check. If you want a guarantee then I would suggest you use this feature instead.
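For instance, the function from the question could be written like this in C++17 (a sketch, assuming <limits> is included):
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    float ret = (IsMin ? std::numeric_limits<float>::max() : -std::numeric_limits<float>::max());
    for (int x = 0; x < arraySize; x++) {
        if constexpr (IsMin) {            // only this branch is instantiated when IsMin is true
            if (ret > vals[x]) ret = vals[x];
        } else {                          // and only this one when it is false
            if (ret < vals[x]) ret = vals[x];
        }
    }
    return ret;
}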
Before C++17, if you only want one part of the code to be compiled, you would need to specialize the function and write only the code that pertains to that specialization.
Since you asked for an alternative in C++11, here is one:
#include <limits>
#include <type_traits>

float IterateOverArrayImpl(float* vals, int arraySize, std::false_type)
{
    float ret = -std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret < vals[x])
            ret = vals[x];
    }
    return ret;
}
float IterateOverArrayImpl(float* vals, int arraySize, std::true_type)
{
    float ret = std::numeric_limits<float>::max();
    for (int x = 0; x < arraySize; x++) {
        if (ret > vals[x])
            ret = vals[x];
    }
    return ret;
}
template <bool IsMin>
float IterateOverArray(float* vals, int arraySize) {
    return IterateOverArrayImpl(vals, arraySize, std::integral_constant<bool, IsMin>());
}
You can see it live here.
The idea is to use function overloading to handle the test.
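A hypothetical call site, just to show how the bool parameter is forwarded into the tag type (values and n are assumed to exist):
float smallest = IterateOverArray<true>(values, n);    // dispatches to the std::true_type overload (min)
float largest = IterateOverArray<false>(values, n);    // dispatches to the std::false_type overload (max)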

Is a standards conforming C++ compiler allowed to optimize away branching on <= 0 for unsigned integers?

Consider this code:
void foo(size_t value)
{
    if (value > 0) { ... } // A
    if (value <= 0) { ... } // B
}
Since an unsigned cannot be negative, could a standards conforming C++ compiler optimize away the B statement? Or would it just choose to compare to 0?
Well, it clearly cannot optimise away the B statement altogether—the condition body does execute when value is 0.
Since value cannot, by any means, be < 0, the compiler can of course transform B into if (value == 0) { ... }. Furthermore, if it can prove (remember that the standard mandates strict aliasing rules!) that value is not changed by statement A, it can legally transform the entire function like this:
void foo(size_t value)
{
    if (value > 0) { ... } // A
    else { ... } // B
}
Or, if it happens to know that the target architecture likes == better, into this:
void foo(size_t value)
{
    if (value == 0) { ... } // B
    else { ... } // A
}
If the code is correctly written, B cannot be optimized away, because value can be zero, though the particular comparison used can be replaced with an equivalent one as shown in Angew's answer. But if the statements in B invoke undefined behavior, all bets are off. For ease of reference, let's rewrite foo as
void foo(size_t value)
{
    if (value > 0) bar(); // A
    if (value <= 0) baz(); // B
}
If the compiler can determine that baz() invokes undefined behavior, then it can treat it as unreachable. From that, it can then deduce that value > 0, and optimize foo into
void foo(size_t value)
{
    bar();
}
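For the sake of illustration, here is one possible baz() whose body unconditionally invokes undefined behavior (a hypothetical example; any always-UB statement would do):
void baz()
{
    int* p = nullptr;
    *p = 42;   // unconditionally dereferences a null pointer: undefined behavior
}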
Since the compound statement must be executed if the unsigned value == 0, a conforming compiler cannot optimize away if (value <= 0) { /* ... */ }.
An optimizing compiler will probably consider several things here:
Both statements are mutually exclusive
There is no code in between both of them.
value cannot be smaller than zero
There are several possible "outcomes" in this scenario, each consisting of one comparison and one conditional jump instruction.
I suspect test R,R to be "more optimal" than cmp R, 0, but in general there is not much of a difference.
The resulting code can be (where Code A and Code B contain a ret):
Using cmp
cmp <value>, 0
A)
je equal
// Code A
equal:
// Code B
B)
jne nequal
// Code B
nequal:
// Code A
C)
jg great
// Code B
great:
// Code A
D)
jbe smoe
// Code A
smoe:
// Code B
Using test
test <value>, <value>
A)
je equal
// Code A
equal:
// Code B
B)
jne nequal
// Code B
nequal:
// Code A

native isnan check in C++

I stumbled upon this code to check for NaN:
/**
* isnan(val) returns true if val is nan.
* We cannot rely on std::isnan or x!=x, because GCC may wrongly optimize it
* away when compiling with -ffast-math (default in RASR).
* This function basically does 3 things:
* - ignore the sign (first bit is dropped with <<1)
* - interpret val as an unsigned integer (union)
* - compares val to the nan-bitmask (ones in the exponent, non-zero significand)
**/
template<typename T>
inline bool isnan(T val) {
    // f32/u32/f64/u64 are project typedefs for float/uint32_t/double/uint64_t.
    if (sizeof(val) == 4) {
        union { f32 f; u32 x; } u = { (f32)val };
        return (u.x << 1) > 0xff000000u;
    } else if (sizeof(val) == 8) {
        union { f64 f; u64 x; } u = { (f64)val };
        return (u.x << 1) > 0xffe0000000000000u; // infinity's bit pattern with the sign dropped
    } else {
        std::cerr << "isnan is not implemented for sizeof(datatype)=="
                  << sizeof(val) << std::endl;
        return false;
    }
}
This looks architecture dependent, right? However, I'm not sure about endianness, because regardless of whether the machine is little or big endian, the float and the int are probably stored in the same byte order.
Also, I wonder whether something like
volatile T x = val;
return std::isnan(x);
would have worked.
This was used with GCC 4.6 in the past.
Also, I wonder whether something like std::isnan((volatile)x) would have worked.
isnan takes its argument by value so the volatile qualifier would have been discarded. In other words, no, this doesn’t work.
The code you’ve posted relies on a specific floating point representation (IEEE). It also exhibits undefined behaviour since it relies on the union hack to retrieve the underlying float representation.
On a note about code review, the function is badly written even if we ignore the potential problems of the previous paragraph (which are justifiable): why does the function use runtime checks rather than compile-time checks and compile-time error handling? It would have been better and easier just to offer two overloads.
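To illustrate that last point, here is a sketch of the two-overload alternative (my own illustration, assuming IEEE-754 floats and using memcpy instead of a union to sidestep the aliasing problem):
#include <cstdint>
#include <cstring>

inline bool isnan_bits(float val) {
    std::uint32_t x;
    std::memcpy(&x, &val, sizeof x);        // well-defined way to read the bit pattern
    return (x << 1) > 0xff000000u;          // drop the sign, compare against shifted infinity
}

inline bool isnan_bits(double val) {
    std::uint64_t x;
    std::memcpy(&x, &val, sizeof x);
    return (x << 1) > 0xffe0000000000000u;
}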