Curious missed optimization of recursive constexpr function by Clang

Curious missed optimization of recursive constexpr function by Clang - c++

Today I wanted to test, how Clang would transform a recursive power of two function and noticed that even with known exponent, the recursion is not optimized away even when using constexpr.
#include <array>
constexpr unsigned int pow2_recursive(unsigned int exp) {
if(exp == 0) return 1;
return 2 * pow2_recursive(exp-1);
}
unsigned int pow2_5() {
return pow2_recursive(5);
}
pow2_5 is compiled as a call to pow2_recursive.
pow2_5(): # #pow2_5()
mov edi, 5
jmp pow2_recursive(unsigned int) # TAILCALL
However, when I use the result in a context that requires it to be known at compile time, it will correctly compute the result at compile time.
unsigned int pow2_5_arr() {
std::array<int, pow2_recursive(5)> a;
return a.size();
}
is compiled to
pow2_5_arr(): # #pow2_5_arr()
mov eax, 32
ret
Here is the link to the full example in Godbolt: https://godbolt.org/z/fcKef1
So, am I missing something here? Is there something that can change the result at runtime and a reason, that pow2_5 cannot be optimized in the same way as pow2_5_arr?

Related

Is it possible to template the asm opcode in c++?

I want something like this:
template <const char *op, int lane_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
int result;
__asm__("movups %1,%%xmm6\n"
"\tmovups %2,%%xmm7\n"
// order swapped for AT&T style which has destination second.
"\t " op " %%xmm7,%%xmm6\n"
"\tpextrb %3, %%xmm6, %0"
: "=r" (result) : "m" (x), "m" (y), "i" (lane_select*8) : "xmm6");
return result;
}
Clearly it has to be a template because it must be known at compile time. The lane_select works fine already and it is a template but it's an operand. I want the op that is in the asm to be different, like either pcmpgtd or pcmpgtq, etc. If it helps: I always want some form of x86's pcmpgt, with just the final letter changing.
Edit:
This is a test case for valgrind where it's very important that the exact instruction is run so that we can examine if the definedness of the output.

This is possible with some asm hackery, but normally you'd be better off using intrinsics, like this:
https://gcc.gnu.org/wiki/DontUseInlineAsm
#include <immintrin.h>
template<int size, int element>
int foo(__m128i x, __m128i y) {
if (size==32) // if constexpr if you want to be C++17 fancy
x = _mm_cmpgt_epi32(x, y);
else
x = _mm_cmpgt_epi64(x, y); // needs -msse4.2 or -march=native or whatever
return x[element]; // GNU C extension to index vectors with [].
// GCC defines __m128i as a vector of two long long
// cast to typedef int v4si __attribute__((vector_size(16)))
// or use _mm_extract_epi8 or whatever with element*size/8
// if you want to access one of 4 dword elements.
}
int test(__m128i x, __m128i y) {
return foo<32, 0>(x, y);
}
// compiles to pcmpgtd %xmm1, %xmm0 ; movq %xmm0, %rax ; ret
You can even go full-on GNU C native vector style and do x = x>y after casting them to v4si or not, depending on whether you want 32 or 64-bit element compares. GCC will implement the operator > however it can, with a multi-instruction emulation if SSE4.2 isn't available for pcmpgtq. There's no other fundamental difference between that and intrinsics though; the compiler isn't required to emit pcmpgtq just because the source contained _mm_cmpgt_epi64, e.g. it can do constant-propagation through it if x and y are both compile-time constants, or if y is known to be LONG_MAX so nothing can be greater than it.
Using inline asm
Only the C preprocessor could work the way you're hoping; the asm template has to be a string literal at compile time, and AFAIK C++ template constexpr stuff can't stringify and past a variable's value into an actual string literal. Template evaluation happens after parsing.
I came up with an amusing hack that gets GCC to print d or q as the asm symbol name of a global (or static) variable, using %p4 (See operand modifiers in the GCC manual.) An empty array like constexpr char d[] = {}; is probably a good choice here. You can't pass string literals to template parameters anyway.
(I also fixed the inefficiencies and bugs in your inline asm statement: e.g. let the compiler pick registers, and ask for the inputs in XMM regs, not memory. You were missing an "xmm7" clobber, but this version doesn't need any clobbers. This is still worse than intrinsics for cases where the inputs might be compile-time constants, or where one was in aligned memory so could use a memory operand, or various other possible optimizations. I could have used "xm" as a source but clang would always pick "m". **https://gcc.gnu.org/wiki/DontUseInlineAsm**.)
If you need it to not be optimized away for valgrind testing, maybe make it asm volatile to force it to run even if the output isn't needed. That's about the only reason you'd want to use inline asm instead of intrinsics or GNU C native vector syntax (x > y)
typedef long long V64x2 __attribute__((vector_size(16), may_alias));
// or #include <immintrin.h> and use __m128i which is defined the same way
static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};
template <const char *op, int element_select>
static int cmpGT64Sx2(V64x2 x, V64x2 y)
{
int result;
__asm__(
// AT&T style has destination second.
"pcmpgt%p[op] %[src],%[dst]\n\t" // %p4 - print the bare name, not $d or $q
"pextrb %3, %[dst], %0"
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), "i" (element_select*8),
[op]"i"(op) // address as an immediate = symbol name
: /* no clobbers */);
return result;
}
int gt64(V64x2 x, V64x2 y) {
return cmpGT64Sx2<q, 1>(x,y);
}
int gt32(V64x2 x, V64x2 y) {
return cmpGT64Sx2<d, 1>(x,y);
}
So at the cost of having d and q as global-scope names in this file(!??), we can use <d, 2> or <q, 0> template params that look like the instruction we want.
Note that in x86 SIMD terminology, a "lane" is a 128-bit chunk of an AVX or AVX-512 vector. As in vpermilps (In-Lane Permute of 32-bit float elements).
This compiles to the following asm with GCC10 -O3 (https://godbolt.org/z/ovxWd8)
gt64(long long __vector(2), long long __vector(2)):
pcmpgtq %xmm1,%xmm0
pextrb $8, %xmm0, %eax
ret
gt32(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pextrb $8, %xmm0, %eax // This is actually element 2 of 4, not 1, because your scale doesn't account for the size.
ret
You can hide the global-scope vars from the template users and have them pass an integer size. I also fixed the element indexing to account for the variable element size.
static constexpr char q[0] asm("q") = {}; // override asm symbol name which gets mangled for const or constexpr
static constexpr char d[0] asm("d") = {};
template <int size, int element_select>
static int cmpGT64Sx2_new(V64x2 x, V64x2 y)
{
//static constexpr char dd[0] asm("d") = {}; // nope, asm symbol name overrides don't work on local-scope static vars
constexpr int bytepos = size/8 * element_select;
constexpr const char *op = (size==32) ? d : q;
// maybe static_assert( size == 32 || size == 64 )
int result;
__asm__(
// AT&T style has destination second.
"pcmpgt%p[op] %[src],%[dst]\n\t" // SSE2 or SSE4.2
"pextrb %[byte], %[dst], %0" // SSE4.1
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), [byte]"i" (bytepos),
[op]"i"(op) // address as an immediate = symbol name
: /* no clobbers */);
return result;
}
// Note *not* referencing d or q static vars, but the template is
int gt64_new(V64x2 x, V64x2 y) {
return cmpGT64Sx2_new<64, 1>(x,y);
}
int gt32_new(V64x2 x, V64x2 y) {
return cmpGT64Sx2_new<32, 1>(x,y);
}
This also compiles like we want, e.g.
gt32_new(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pextrb $4, %xmm0, %eax # note the correct element 1 position
ret
BTW, you could use typedef int v4si __attribute__((vector_size(16))) and then v[element] to let GCC do it for you, if your asm statement just produces an "=x" output of that type in the same register as a "0"(x) input for example.
Without global-scope var names, using GAS .if / .else
We can easily get GCC to print a bare number into the asm template, e.g. for use as an operand to a .if %[size] == 32 directive. The GNU assembler has some conditional-assembly features, so we just get GCC to feed it the right text input to use that. Much less of a hack on the C++ side, but less compact source. Your template param could be a 'd' or 'q' size-code character if you wanted to compare on that instead of a size number.
template <int size, int element_select>
static int cmpGT64Sx2_mask(V64x2 x, V64x2 y)
{
constexpr int bytepos = size/8 * element_select;
unsigned int result;
__asm__(
// AT&T style has destination second.
".if %c[opsize] == 32\n\t" // note Godbolt hides directives; use binary mode to verify the assemble-time condition worked
"pcmpgtd %[src],%[dst]\n\t" // SSE2
".else \n\t"
"pcmpgtq %[src],%[dst]\n\t" // SSE4.2
".endif \n\t"
"pmovmskb %[dst], %0"
: "=r" (result), [dst]"+x"(x)
: [src]"x"(y), [opsize]"i"(size) // address as an immediate = symbol name
: /* no clobbers */);
return (result >> bytepos) & 1; // can just be TEST when branching on it
}
I also changed to using SSE2 pmovmskb to extract both / all element compare results, and using scalar stuff to select which bit to look at. This is orthogonal and can be used with any others. After inlining, it's generally going to be more efficient, allowing test $imm32, %eax. (pmovmskb is cheaper than pextrb, and it lets the whole thing require only SSE2 for the pcmpgtd version).
The asm output from the compiler looks like
.if 64 == 32
pcmpgtd %xmm1,%xmm0
.else
pcmpgtq %xmm1,%xmm0
.endif
pmovmskb %xmm0, %eax
To make sure that did what we want, we can assemble to binary and look at disassembly (https://godbolt.org/z/5zGdfv):
gt32_mask(long long __vector(2), long long __vector(2)):
pcmpgtd %xmm1,%xmm0
pmovmskb %xmm0,%eax
shr $0x4,%eax
and $0x1,%eax
(and gt64_mask uses pcmpgtq and shr by 8.)

What happens when there's an error in constexpr function?

I learnt that constexpr functions are evaluated at compile time. But look at this example:
constexpr int fac(int n)
{
return (n>1) ? n*fac(n-1) : 1;
}
int main()
{
const int a = 500000;
cout << fac(a);
return 0;
}
Apparently this code would throw an error, but since constexpr functions are evaluated at compiling time, why I see no error when compile and link?
Further on, I disassembled this code, and it turned out this function isn't evaluated but rather called as a normal function:
(gdb) x/10i $pc
=> 0x80007ca <main()>: sub $0x8,%rsp
0x80007ce <main()+4>: mov $0x7a11f,%edi
0x80007d3 <main()+9>: callq 0x8000823 <fac(int)>
0x80007d8 <main()+14>: imul $0x7a120,%eax,%esi
0x80007de <main()+20>: lea 0x20083b(%rip),%rdi # 0x8201020 <_ZSt4cout##GLIBCXX_3.4>
0x80007e5 <main()+27>: callq 0x80006a0 <_ZNSolsEi#plt>
0x80007ea <main()+32>: mov $0x0,%eax
0x80007ef <main()+37>: add $0x8,%rsp
0x80007f3 <main()+41>: retq
However, if I call like fac(5):
constexpr int fac(int n)
{
return (n>1) ? n*fac(n-1) : 1;
}
int main()
{
const int a = 5;
cout << fac(a);
return 0;
}
The assemble code turned into:
(gdb) x/10i $pc
=> 0x80007ca <main()>: sub $0x8,%rsp
0x80007ce <main()+4>: mov $0x78,%esi
0x80007d3 <main()+9>: lea 0x200846(%rip),%rdi # 0x8201020 <_ZSt4cout##GLIBCXX_3.4>
0x80007da <main()+16>: callq 0x80006a0 <_ZNSolsEi#plt>
0x80007df <main()+21>: mov $0x0,%eax
0x80007e4 <main()+26>: add $0x8,%rsp
0x80007e8 <main()+30>: retq
The fac function is evaluated at compile time.
Can Anyone explain this?
Compiling command:
g++ -Wall test.cpp -g -O1 -o test
And with g++ version 7.4.0, gdb version 8.1.0

I learnt that constexpr functions are evaluated at compile time
No, constexpr can be evaluated at compile time, but also at runtime.
further reading:
Difference between `constexpr` and `const`
https://en.cppreference.com/w/cpp/language/constexpr
Purpose of constexpr
Apparently this code would throw an error
No, no errors thrown. For large input, the result will overflow which is undefined behavior. This doesn't mean an error will be thrown or displayed. It means anything can happen. And when I say anything, I do mean anything. The program can crash, hang, appear to work with a strange results, display weird characters, or literally anything.
further reading:
https://en.cppreference.com/w/cpp/language/ub
https://en.wikipedia.org/wiki/Undefined_behavior
http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html
And, as pointed out by nathanoliver
when invoked in a constant expression, a constexpr function must check and error out on UB http://coliru.stacked-crooked.com/a/43ccf2039dc511d5
In other words there can't be any UB at compile time. What at runtime would be UB, at compile time it's a hard error.

constexpr it means that it can be evaluated at compile time, not that it will be evaluated at compile time. The compiler will be forced to do the evaluation compile-time if you use it where a compile time constant is expected (e.g. the size of an array).
On the other hand for small values g++ is for example smart enough to compute the result compile time (even without constexpr).
For example with:
int fact(int n) {
return n < 2 ? 1 : n*fact(n-1);
}
int bar() {
return fact(5);
}
the code generated by g++ -O3 for bar is:
bar():
mov eax, 120
ret
Note that overflowing the call stack (e.g. infinite or excessive recursion) or even overflowing signed integer arithmetic is in C++ undefined behavior and anything can happen. It doesn't mean you'll get a nice "error" or even a segfault... but that ANYTHING can happen (including, unfortunately, nothing evident). Basically it means that the authors of compilers can just ignore to handle those cases because you're not supposed to do this kind of mistakes.

why is the optimizer not allowed to fold in "constant context"?

__builtin_is_constant_evaluated is the builtin used to for implementing std::is_constant_evaluated in the standard library on clang and gcc.
code that is not valid in constant context is also often harder for the optimizer to constant fold.
for example:
int f(int i) {
if (__builtin_is_constant_evaluated())
return 1;
else {
int* ptr = new int(1);
int i = *ptr;
delete ptr;
return i;
}
}
is emitted by gcc -O3 as:
f(int):
sub rsp, 8
mov edi, 4
call operator new(unsigned long)
mov esi, 4
mov rdi, rax
call operator delete(void*, unsigned long)
mov eax, 1
add rsp, 8
ret
so the optimizer used __builtin_is_constant_evaluated() == 0
clang fold this to a constant but this is because clang's optimizer can remove unneeded dynamic allocation, not because it used __builtin_is_constant_evaluated() == 1.
i am aware that this would make the return value of __builtin_is_constant_evaluated() implementation defined because optimization vary from one compiler to an other. but is_constant_evaluated should already be used only when both path have the same observable behaviors.
why does the optimizer not use __builtin_is_constant_evaluated() == 1 and fallback to __builtin_is_constant_evaluated() == 0 if it wasn't able to fold ?

Per [meta.const.eval]:
constexpr bool is_constant_evaluated() noexcept;
Returns: true if and only if evaluation of the call occurs within the evaluation of an expression or conversion that is manifestly
constant-evaluated ([expr.const]).
f can never be invoked in a constant-evaluated expression or conversion, so std::is_constant_evaluated() returns false. This is decided by the compiler and has nothing to do with the optimizer.
Of course, if the optimizer can prove that the branches are equivalent, it can do constant fold. But that is optimization after all — beyond the scope of the C++ language itself.
But why is it this way? The proposal that introduced std::is_constant_evaluated is P0595. It explains the idea well:
constexpr double power(double b, int x) {
if (std::is_constant_evaluated() && x >= 0) {
// A constant-evaluation context: Use a
// constexpr-friendly algorithm.
double r = 1.0, p = b;
unsigned u = (unsigned)x;
while (u != 0) {
if (u & 1) r *= p;
u /= 2;
p *= p;
}
return r;
} else {
// Let the code generator figure it out.
return std::pow(b, (double)x);
}
}
// ...
double thousand() {
return power(10.0, 3); // (3)
}
[...]
Call (3) is a core constant expression, but an implementation is not
required to evaluate it at compile time. We therefore specify that it
causes std::is_constant_evaluated() to produce false. It's
tempting to leave it unspecified whether true or false is produced
in that case, but that raises significant semantic concerns: The
answer could then become inconsistent across various stages of the
compilation. For example:
int *p, *invalid;
constexpr bool is_valid() {
return std::is_constant_evaluated() ? true : p != invalid;
}
constexpr int get() { return is_valid() ? *p : abort(); }
This example tries to count on the fact that constexpr evaluation
detects undefined behavior to avoid the non-constexpr-friendly call to
abort() at compile time. However, if std::is_constant_evaluated()
can return true, we now end up with a situation where an important
run-time check is by-passed.

Do compilers reduce simple functions given constant arguments into unique instructions?

This is something I've always thought to be true but have never had any validation. Consider a very simple function:
int subtractFive(int num) {
return num -5;
}
If a call to this function uses a compile time constant such as
getElement(5);
A compiler with optimizations turned on will very likely inline this. What is unclear to me however, is if the num - 5 will be evaluated at runtime or compile time. Will expression simplification extend recursively through inlined functions in this manner? Or does it not transcend functions?

We can simply look at the generated assembly to find out. This code:
int subtractFive(int num) {
return num -5;
}
int main(int argc, char *argv[]) {
return subtractFive(argc);
}
compiled with g++ -O2 yields
leal -5(%rdi), %eax
ret
So the function call was indeed reduced to a single instruction. This optimization technique is known as inlining.
One can of course use the same technique to see how far a compiler will go with that, e.g. the slightly more complicated
int subtractFive(int num) {
return num -5;
}
int foo(int i) {
return subtractFive(i) * 5;
}
int main(int argc, char *argv[]) {
return foo(argc);
}
still gets compiled to
leal -25(%rdi,%rdi,4), %eax
ret
so here both functions where just eliminated at compile time. If the input to foo is known at compile time, the function call will (in this case) simply be replaced by the resulting constant at compile time (Live).
The compiler can also combine this inlining with constant folding, to replace the function call with its fully evaluated result if all arguments are compile time constants. For example,
int subtractFive(int num) {
return num -5;
}
int foo(int i) {
return subtractFive(i) * 5;
}
int main() {
return foo(7);
}
compiles to
mov eax, 10
ret
which is equivalent to
int main () {
return 10;
}
A compiler will always do this where it thinks it is a good idea, and it is (usually) way better in optimizing code on this low level than you are.

It's easy to do a little test; consider the following
int foo(int);
int bar(int x) { return x-5; }
int baz() { return foo(bar(5)); }
Compiling with g++ -O3 the asm output for function baz is
xorl %edi, %edi
jmp _Z3fooi
This code loads a 0 in the first parameter and then jumps into the code of foo. So the code from bar is completely disappeared and the computation of the value to pass to foo has been done at compile time.
In addition returning the value of calling the function became just a jump to the function code (this is called "tail call optimization").

A smart compiler will evaluate this at compile time and will replace the getElement(5) because it will never have a different result. None of the variables are considered volatile.

Confused about the function return value

#include<iostream>
using namespace std;
int Fun(int x)
{
int sum=1;
if(x>1)
sum=x*Fun(x-1);
else
return sum;
}
int main()
{
cout<<Fun(1)<<endl;
cout<<Fun(2)<<endl;
cout<<Fun(3)<<endl;
cout<<Fun(4)<<endl;
cout<<Fun(5)<<endl;
}
This function is to compute the factorial of an integer number. In the branch of x>1,there is no return value for function Fun. So this function should not return correct answer.
But when fun(4) or some other examples are tested, the right answers are got unexpectedly. Why?
The assembly code of this function is(call Fun(4)):
0x004017E5 push %ebp
0x004017E6 mov %esp,%ebp
0x004017E8 sub $0x28,%esp
0x004017EB movl $0x1,-0xc(%ebp)
0x004017F2 cmpl $0x1,0x8(%ebp)
0x004017F6 jle 0x40180d <Fun(int)+40>
0x004017F8 mov 0x8(%ebp),%eax
0x004017FB dec %eax
0x004017FC mov %eax,(%esp)
0x004017FF call 0x4017e5 <Fun(int)>
0x00401804 imul 0x8(%ebp),%eax
0x00401808 mov %eax,-0xc(%ebp)
0x0040180B jmp 0x401810 <Fun(int)+43>
0x0040180D mov -0xc(%ebp),%eax
0x00401810 leave
0x00401811 ret
May be this is the reason: The value of sum is saved in register eax, and the return value is saved in eax too, so Funreturn the correct result.

Usually, EAX register is used to store return value, ad it is also used to do other stuff as well.
So whatever has been loaded to that register just before the function returns will be the return value, even if you don't intend to do so.
You can use the -S option to generate assembly code and see what happened to EAX right before the "ret" instruction.

When your program pass in the if condition, no return statement finish the function. The number you got is the result of an undefined behavior.
int Fun(int x)
{
int sum=1.0;
if(x>1)
sum=x*Fun(x-1);
else
return sum;
return x; // return something here
}

Just remove else from your code:
int Fun(int x)
{
int sum=1;
if(x>1)
sum=x*Fun(x-1);
return sum;
}

The code you have has a couple of errors:
you have an int being assigned the value 1.0 (which will be implicitly cast/converted), not an error as such but inelegant.
you have a return statement inside a conditionality, so you will only ever get a return when that if is true
If you fix issue with the return by removing the else, then all will be fine:)
As to why it works with 4 as an input, that is down to random chance/ some property of your environment as the code you have posted should be unable to function, as there will always be an instance, when calculating factorials for a positive int, where x = 1 and no return will be generated.
As an aside, here is a more concise/terse function: for so straightforward a function you might consider the ternary operator and use a function like:
int factorial(int x){ return (x>1) ? (x * factorial(x-1)) : 1;}
this is the function I use for my factorials and have had on library for the last 30 or so years (since my C days) :)

From the C standards:
Flowing off the end of a function is equivalent to a return with no
value; this results in undefined behavior in a value-returning
function.
Your situation is the same as this one:
int fun1(int x)
{
int sum = 1;
if(x > 1)
sum++;
else
return sum;
}
int main()
{
int b = fun1(3);
printf("%d\n", b);
return 0;
}
It prints 2 on my machine.
This is calling convention and architecture dependent. The return value is the result of last expression evaluation, stored in the eax register.

As stated in the comment this is undefined behaviour. With g++ I get the following warning.
warning: control reaches end of non-void function [-Wreturn-type]
On Visual C++, the warning is promoted to an error by default
error C4716: 'Fun' : must return a value
When I disabled the warning and ran the resulting executable, Fun(4) gave me 1861810763.
So why might it work under g++? During compilation conditional statements are turned into tests and jumps (or gotos). The function has to return something, and the simplest possible code for the compiler to produce is along the following lines.
int Fun(int x)
{
int sum=1.0;
if(!(x>1))
goto RETURN;
sum=x*Fun(x-1);
RETURN:
return sum;
}
This is consistent with your disassembly.
Of course you can't rely on undefined behaviour, as illustrated by the behaviour in Visual C++. Many shops have a policy to treat warnings as errors for this reason (also as suggested in a comment).

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Curious missed optimization of recursive constexpr function by Clang - c++

Related

Is it possible to template the asm opcode in c++?

What happens when there's an error in constexpr function?

why is the optimizer not allowed to fold in "constant context"?

Do compilers reduce simple functions given constant arguments into unique instructions?

Confused about the function return value

Categories

Resources