How to help GCC to not insert xor before lzcnt? - c++

For this fragment of code (https://godbolt.org/z/s4PY44dha)
#include <immintrin.h>

int foo(unsigned long long x)
{
    return _lzcnt_u64(x);
}
GCC generates 3 asm instructions
xorl %eax, %eax
lzcntq %rdi, %rax
ret
while clang generates only 2
lzcntq %rdi, %rax
retq
Is it possible to change the implementation/signature of foo to help GCC understand that this xor instruction is useless? Why can't GCC perform such a simple optimization itself?
The answer to the question Why does breaking the "output dependency" of LZCNT matter? explains that this xor may be useful on some older architectures to break the so-called "false dependency" on the destination register. It even mentions that the issue it works around is not present on modern Intel architectures, starting with "Skylake-S (client)". I tried passing newer architectures to GCC (for example -march=rocketlake, -march=icelake-client), but it still inserts the "useless" xor.
In contrast, clang doesn't insert the xor even for old architectures like Haswell. This means that if one wants to squeeze every bit of performance out of a particular architecture, the insertion of the xor has to be controlled manually.
For example, with this inline assembly I managed to get code without the xor:
int xorless_lzcntq(unsigned long long x) {
    unsigned long long res;
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
    return res;
}
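For completeness, here is a slightly more defensive variant of the same workaround (a sketch of my own, not taken from the compiler output above): it declares the flags clobber explicitly, since lzcnt writes CF and ZF.
int xorless_lzcntq_cc(unsigned long long x) {
    unsigned long long res;
    // Same asm template, plus an explicit "cc" clobber; declaring it is
    // harmless even where the compiler already assumes asm clobbers flags.
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x) : "cc");
    return res;
}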

Related

Wrong result on modular arithmetic on ARM (Apple M1) with clang -O3 optimization

I have been pulling my hair out for the last couple of days over this "innocuous" piece of code (a minimal reproducible example, part of a larger modular multiplication routine):
#include <iostream>
#include <limits>
using ubigint = unsigned long long int;
using bigint = long long int;
void modmul(bigint a, bigint b, ubigint p) {
    ubigint ua = a < 0 ? -a : a;
    ubigint ub = b < 0 ? -b : b;
    ua %= p;
    ub %= p;
    std::cout << "ua: " << ua << '\n';
}

int main() {
    bigint minbigint = std::numeric_limits<bigint>::min();
    bigint maxbigint = std::numeric_limits<bigint>::max();
    std::cout << "minbigint: " << minbigint << '\n';
    std::cout << "maxbigint: " << maxbigint << '\n';
    modmul(minbigint, maxbigint, 2314); // expect ua: 2036, got ua: 0
}
I am compiling on macOS 11.4 with clang 12.0 installed from Homebrew
clang version 12.0.0
Target: arm64-apple-darwin20.5.0
Thread model: posix
InstalledDir: /opt/homebrew/opt/llvm/bin
When compiling with clang -O1, the program spits out the expected result (in this case 2036; I've checked with Wolfram Mathematica, Mod[9223372036854775808, 2314], and this is correct). However, when I compile with clang -O2 or clang -O3 (full optimization), the variable ua somehow gets zeroed out (its value becomes 0). I am at a complete loss here and have no idea why this happens. IMO there is no UB, no overflow, nor anything dubious in this piece of code. I'd greatly appreciate any advice, or a report if you can reproduce the issue on your side.
PS: the code behaves as expected on all other platforms (Windows/Linux/FreeBSD/Solaris), with any combination of compilers. I'm only getting this error on Apple M1 with clang 12 (I didn't test with other compilers on M1).
UPDATE: As @harold pointed out in the comment section, negq and a subq from 0 are exactly the same, so my discussion of negq vs. subq below is incorrect. Please disregard that part; sorry for not double-checking before posting the answer.
About the original question: I recompiled a slightly simpler version of the code (godbolt) and found that the problematic compiler optimization happens in main, not in modmul. In main, clang sees that all of the operands it passes to modmul are constants, so it decides to evaluate modmul at compile time. While evaluating ubigint ua = a < 0 ? -a : a;, clang runs into signed integer overflow, which is UB, so it decides to produce 0 and print that. That may seem like a radical thing to do, but it is legal precisely because of the UB. Moreover, since there is no mathematically correct answer within the limits of the two's complement representation, returning 0 is arguably as good (or as bad) as any other result.
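One quick way to check this explanation (a sketch of my own, reusing the types from the question; not a fix, since the overflow is still UB) is to hide the constants from the optimizer so the call can no longer be folded at compile time:
int main() {
    // volatile reads cannot be constant-folded, so clang has to evaluate
    // modmul with runtime values instead of computing the result itself
    volatile bigint vmin = std::numeric_limits<bigint>::min();
    volatile bigint vmax = std::numeric_limits<bigint>::max();
    modmul(vmin, vmax, 2314);
}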
OLD ANSWER BELOW
As someone pointed out in the comment section, the two lines below in your code are undefined behavior (signed integer overflow):
ubigint ua = a < 0 ? -a : a;
ubigint ub = b < 0 ? -b : b;
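For reference, one way to compute the magnitude without signed overflow is to convert to unsigned first and only then negate; unsigned arithmetic wraps, which is well defined even for the most negative value (a sketch only, not checked against the rest of the OP's routine):
// Converting before negating avoids signed overflow: for a == LLONG_MIN the
// unsigned negation wraps to 2^63, which is the correct magnitude.
ubigint ua = a < 0 ? ubigint(0) - static_cast<ubigint>(a) : static_cast<ubigint>(a);
ubigint ub = b < 0 ? ubigint(0) - static_cast<ubigint>(b) : static_cast<ubigint>(b);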
If you are wondering what exactly clang does under the hood to produce two different results at two different optimization levels, consider the following simple example.
using ubigint = unsigned long long int;
using bigint = long long int;

ubigint negate(bigint a)
{
    ubigint ua = -a;
    return ua;
}
When compiled with -O0:
negate(long long): # #negate(long long)
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
xorl %eax, %eax
subq -8(%rbp), %rax # Negation is performed here
movq %rax, -16(%rbp)
movq -16(%rbp), %rax
popq %rbp
retq
When compiled with -O3:
negate(long long): # #negate(long long)
movq %rdi, %rax
negq %rax # Negation is performed here
retq
At -O0, clang uses a plain subq instruction, which performs a binary subtraction of the value from 0 in %rax and produces the result with wrap-around behavior.
At -O3, clang can do better: it uses the negq instruction, which simply replaces the operand with its two's complement (i.e. flips all the bits and adds 1). However, you can see that this optimization is only legal if signed integer overflow is undefined behavior (so the compiler can simply ignore the overflow case). If the standard required wrap-around behavior, clang would have to fall back to the unoptimized version.

Correct way to implement inline assembler in c++ for xor operations on variables

I've recently seen an article on how the swap operation can be performed with xor instead of a temporary variable. But when I compile code like a ^= b;, the result won't simply be (in AT&T syntax)
xor b, a
etc.
instead, it will load the raw values into registers, xor them, and write them back.
To optimize this I want to write it in inline assembly, so it only takes three cycles to do the whole thing instead of the 15 it normally takes.
I've tried multiple keywords like:
asm(...);
asm("...");
asm{...};
asm{"..."};
asm ...
__asm ...
None of these worked; they either gave me a syntax error (gcc doesn't seem to accept all of those spellings), or else the message
main.cpp: Assembler messages:
main.cpp:12: Error: too many memory references for `xor'
Basically, I want to use the variables defined in my C++ code inside the assembler block, xor them in three lines, and end up with the swapped variables, basically like this:
int main() {
    volatile int a = 5;
    volatile int b = 6;
    asm {
        xor a,b
        xor b,a
        xor a,b
    };
    // a should now be 6, b should be 5
}
To clarify:
I want to avoid the compiler-generated mov operations, since they take more CPU cycles than the three xor operations alone, which should only take three cycles. How could I accomplish this?
To use inline assembly, you should use __asm__ volatile. However, this type of optimization may be premature. Just because there are more instructions does not mean the code is slower - some instructions can be really slow. For example, a floating point BCD store instruction (fbstp), while admittedly rare, takes over 200 cycles - compared to one cycle for a simple mov (Agner Fog's Optimization Guide is a good resource for these timings).
So, I implemented a bunch of "swap" functions, some in C++ and some in assembly, and did a bit of measuring, running each function 100 million times in a row.
Test cases
std::swap
std::swap is probably the preferred solution here. It does what you want (swap the values of two variables), works for most standard library types and not just for integers, clearly communicates what you are trying to achieve, and is portable across architectures.
void std_swap(int *a, int *b) {
    std::swap(*a, *b);
}
Here is the generated assembly: It loads both values into registers, and then writes them back to the opposite memory locations.
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
XOR swap
This is what you were trying to do, in C++:
void xor_swap(int *a, int *b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
This doesn't directly translate to only xor instructions, because there is no instruction on x86 that allows you to directly xor two locations in memory - you always need to load at least one of the two into a register:
movl (%rdi), %eax
xorl (%rsi), %eax
movl %eax, (%rdi)
xorl (%rsi), %eax
movl %eax, (%rsi)
xorl %eax, (%rdi)
You also get a bunch of extra instructions because the two pointers may alias, i.e. point to overlapping memory areas. Then changing one variable would also change the other, so the compiler needs to constantly store and re-load the values. An implementation using the compiler-specific __restrict keyword compiles to the same code as std_swap (thanks to @Ped7g for pointing out this flaw in the comments); see the sketch below.
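For illustration (a sketch of my own, using the GCC/Clang __restrict extension rather than anything from the original benchmark): once the compiler is promised that the pointers never alias, the extra loads and stores disappear.
void xor_swap_restrict(int * __restrict a, int * __restrict b) {
    // With the no-alias guarantee the compiler can keep both values in
    // registers; this compiles to the same code as std_swap and tmp_swap.
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}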
Swap with temporary variables
This is the "standard" swap with a temporary variable (that the compiler promptly optimizes out to the same code as std::swap):
void tmp_swap(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
The xchg instruction
xchg can swap a memory value with a register value - at first it seems perfect for your use case. However, it is really slow when one of its operands is in memory (a memory-operand xchg carries an implicit lock prefix), as you will see later.
void xchg_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "xchgl (%1), %%eax\n\t"
        "movl %%eax, (%0)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax"
    );
}
We need to load one of the two values into a register, because there is no xchg for two memory locations.
XOR swap in Assembly
I made two versions of the XOR-based swap in Assembly. The first one only loads one of the values in a register, the second loads both before swapping them and writing them back.
void xor_asm_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "xorl (%1), %%eax\n\t"
        "xorl %%eax, (%1)\n\t"
        "xorl (%1), %%eax\n\t"
        "movl %%eax, (%0)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax"
    );
}

void xor_asm_register_swap(int *a, int *b) {
    __asm__ volatile (
        "movl (%0), %%eax\n\t"
        "movl (%1), %%ecx\n\t"
        "xorl %%ecx, %%eax\n\t"
        "xorl %%eax, %%ecx\n\t"
        "xorl %%ecx, %%eax\n\t"
        "movl %%eax, (%0)\n\t"
        "movl %%ecx, (%1)"
        : "+r" (a), "+r" (b)
        : /* No separate inputs */
        : "%eax", "%ecx"
    );
}
The results
You can view the full compilation results along with the generated assembly code on Godbolt.
On my machine, the timings (in microseconds) vary a bit, but are generally comparable:
std_swap: 127371
xor_swap: 150152
tmp_swap: 125896
xchg_asm_swap: 699355
xor_asm_swap: 130586
xor_asm_register_swap: 124718
You can see that std_swap, tmp_swap, xor_asm_swap, and xor_asm_register_swap are generally very similar in speed - in fact, if I move xor_asm_register_swap to the front, it turns out slightly slower than std_swap. Also note that tmp_swap is exactly the same assembly code as std_swap (although it regularly measures in as a bit faster, probably because of the ordering).
xor_swap implemented in C++ is slightly slower because, due to possible aliasing, the compiler generates an additional memory load/store for each of the instructions - as mentioned above, if we modify xor_swap to take int * __restrict a, int * __restrict b instead (meaning that a and b never alias), the compiler generates the same code as for std_swap and tmp_swap.
xchg_asm_swap, despite using the lowest number of instructions, is terribly slow (over four times slower than any of the other options), simply because xchg is not a fast operation when it involves a memory access.
Ultimately, you have the choice between using some custom assembly-based version (that is hard to understand and maintain) or just using std::swap (which is pretty much the opposite, and also benefits from any optimizations that the standard library designers can come up with, e.g. using vectorization on larger types). Since this is over one hundred million iterations, it should be clear that the potential improvement by using assembly code here is very small - if you improve at all (which is not clear) you'd shave off a couple of microseconds at most.
TL;DR: You shouldn't do that, just use std::swap(a, b)
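Applied to the original example with two local variables (a small usage sketch, not part of the benchmark above):
#include <utility>

int main() {
    int a = 5;
    int b = 6;
    std::swap(a, b);  // a is now 6, b is now 5
}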
Appendix: __asm__ volatile
I figured that it may make sense at this point to explain the inline assembly code a bit. __asm__ (in GNU mode, asm is enough) introduces a block of assembly code. The volatile is there to make sure the compiler doesn't optimize it away - it likes to just remove the block otherwise.
There are two forms of __asm__ volatile. One of them also deals with goto labels; I will not address it here. The other form takes up to four arguments, separated with colons (:):
The first argument is the assembly code itself. In its simplest form (__asm__ volatile ("rdtsc")) the statement just emits that assembly code, but does not really interact with the C++ code around it. In particular, you would have to guess how variables are assigned to registers, which is not exactly good.
Note that the assembly code instructions are separated with "\n", because this assembly code is passed verbatim to the GNU assembler (gas).
The second argument is a list of output operands. You can specify what "type" they have (in particular, =r means "any register operand", and +r means "any register operand, but it is also used as an input"). For example, : "+r" (a), "+r" (b) tells the compiler to replace %0 (references the first of the operands) with the register containing a, and %1 with the register containing b.
Because %0, %1, ... now refer to operands, you need to write %%eax instead of %eax (which is how you would normally refer to eax in AT&T notation) to escape the percent sign.
You can also use ".intel_syntax\n" to switch to Intel's assembly syntax if you prefer.
The third argument is the same, but deals with input-only operands.
The fourth argument, the clobber list, tells the compiler which registers and memory locations lose their values inside the assembly block, so that it can keep its optimizations around the block correct. For example, "clobbering" "memory" acts as a compiler-level barrier: the compiler will not cache memory values in registers across the asm statement. You can see that I added all the registers I used for temporary storage to this list.
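To tie the pieces together, here is a minimal sketch of my own (not from the swap benchmark above) with all four parts in one statement:
int add_one(int x) {
    int result;
    __asm__ volatile (
        "movl %1, %0\n\t"   // assembly template; %0 and %1 refer to the operands below
        "addl $1, %0"
        : "=r" (result)     // output operands
        : "r" (x)           // input operands
        : "cc"              // clobbers: addl modifies the flags
    );
    return result;
}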

How to optimize function return values in C and C++ on x86-64?

The x86-64 ABI specifies two return registers: rax and rdx, both 64 bits (8 bytes) in size.
Assuming that x86-64 is the only targeted platform, which of these two functions:
uint64_t f(uint64_t * const secondReturnValue) {
    /* Calculate a and b. */
    *secondReturnValue = b;
    return a;
}

std::pair<uint64_t, uint64_t> g() {
    /* Calculate a and b, same as in f() above. */
    return { a, b };
}
would yield better performance, given the current state of C/C++ compilers targeting x86-64? Are there any pitfalls performance-wise using one or the other version? Are compilers (GCC, Clang) always able to optimize the std::pair to be returned in rax and rdx?
UPDATE: Generally, returning a pair is faster if the compiler optimizes out the std::pair methods (see the examples of binary output with GCC 5.3.0 and Clang 3.8.0). If f() is not inlined, the compiler must generate code that writes one value to memory, e.g.:
movq b, (%rdi)
movq a, %rax
retq
But in case of g() it suffices for the compiler to do:
movq a, %rax
movq b, %rdx
retq
Because instructions for writing values to memory are generally slower than instructions for writing values to registers, the second version should be faster.
Since the ABI specifies that in some particular cases two registers have to be used for a two-word result, any conforming compiler has to obey that rule.
However, for such tiny functions I guess that most of the performance will come from inlining.
You may want to compile and link with g++ -flto -O2 using link-time optimizations.
I guess that the second function (returning a pair through two registers) might be slightly faster, and that in some situations GCC could perhaps inline and optimize the first into the second.
But you really should benchmark if you care that much.
Note that the ABI specifies packing small structs into registers for passing/returning (if they contain only integer types). So returning a std::pair<uint32_t, uint32_t> means the two values have to be shifted and ORed together into rax.
This is probably still better than a round trip through memory, because setting up space for a pointer, and passing that pointer as an extra arg, has some overhead. (Other than that, though, a round-trip through L1 cache is pretty cheap, like ~5c latency. The store/load are almost certainly going to hit in L1 cache, because stack memory is used all the time. Even if it misses, store-forwarding can still happen, so execution doesn't stall until the ROB fills because the store can't retire. See Agner Fog's microarch guide and other stuff at the x86 tag wiki.)
Anyway, here's the kind of code you get from gcc 5.3 -O2, using functions that take args instead of returning compile-time constant values (which would lead to movabs rax, 0x...):
#include <cstdint>
#include <utility>
#define type_t uint32_t
type_t f(type_t * const secondReturnValue, type_t x) {
    *secondReturnValue = x+4;
    return x+2;
}
lea eax, [rsi+4] # LEA is an add-and-shift instruction that uses memory-operand syntax and encoding
mov DWORD PTR [rdi], eax
lea eax, [rsi+2]
ret
std::pair<type_t, type_t> g(type_t x) { return {x+2, x+4}; }
lea eax, [rdi+4]
lea edx, [rdi+2]
sal rax, 32
or rax, rdx
ret
type_t use_pair(std::pair<type_t, type_t> pair) {
    return pair.second + pair.first;
}
mov rax, rdi
shr rax, 32
add eax, edi
ret
So it's really not bad at all. Two or three insns in the caller and callee to pack and unpack a pair of uint32_t values. Nowhere near as good as returning a pair of uint64_t values, though.
If you're specifically optimizing for x86-64 and care what happens for non-inlined functions with multiple return values, then prefer returning std::pair<uint64_t, uint64_t> (or int64_t, obviously), even if you assign those pairs to narrower integers in the caller. Note that in the x32 ABI (-mx32), pointers are only 32 bits; don't assume pointers are 64-bit when optimizing for x86-64 if you care about that ABI.
If either member of the pair is 64bit, they use separate registers. It doesn't do anything stupid like splitting one value between the high half of one reg and the low half of another.
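For comparison, a 64-bit counterpart of g() (a sketch of my own, not part of the answer's Godbolt output) should come back in two registers with no packing step at all, per the ABI rule described above:
#include <cstdint>
#include <utility>

std::pair<uint64_t, uint64_t> g64(uint64_t x) {
    // Expected codegen: the two members go straight into rax and rdx
    // (e.g. two lea instructions), with no shift/or packing.
    return { x + 2, x + 4 };
}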

Getting a compiler to generate adc instruction

Is there any way to get either Clang, GCC or VS to generate adc (add with carry) instructions only using Standard-C++(98/11/14)? (Edit: I mean in x64 mode, sorry if that wasn't clear.)
If your code makes a comparison and adds the result of the comparison to something, then an adc is typically emitted by gcc 5 (incidentally, gcc 4.8 does not emit an adc here). For example,
unsigned foo(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return (a + b + (c < d));
}
assembles to
foo:
cmpl %ecx, %edx
movl %edi, %eax
adcl %esi, %eax
ret
However, it is a bit tricky to get gcc to really emit an adc.
There's an __int128_t type available on GCC for amd64 and other 64-bit targets, and a simple addition of two such values will use a pair of add/adc instructions (see the Godbolt link below).
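A minimal illustration of that (a sketch of my own, assuming GCC/Clang's __int128 extension on a 64-bit target):
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    // Typically lowered to an add of the low 64-bit halves followed by
    // an adc of the high halves.
    return a + b;
}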
Also, this pure ISO C code may compile to an adc:
#include <stdint.h>

uint64_t adc(uint64_t a, uint64_t b)
{
    a += b;
    if (a < b)  /* should simplify to nothing (setting carry is implicit in the add) */
        a++;    /* should simplify to adc r0, 0 */
    return a;
}
For me (ARM) it generated something kind of silly, but it compiles for x86-64 (on the Godbolt compiler explorer) to this:
mov rax, rdi # a, a
add rax, rsi # a, b
adc rax, 0 # a,
ret
If you compile a 64-bit signed addition for 32-bit x86 (int64_t in C++11), the compiled code will contain an adc instruction.
Edit: code sample:
int64_t add_numbers(int64_t x, int64_t y) {
    return x + y;
}
On 32-bit x86, the addition is implemented with an add instruction followed by an adc instruction. On x86-64, a single add instruction is enough.

Load 64-bit integer constant via GNU extended asm constraint?

I've written this code in Clang-compatible "GNU extended asm":
namespace foreign {
    extern char magic_pointer[];
}

extern "C" __attribute__((naked)) void get_address_of_x(void)
{
    asm volatile("movq %[magic_pointer], %%rax\n\t"
                 "ret"
                 : : [magic_pointer] "p"(&foreign::magic_pointer));
}
I expected it to compile into the following assembly:
_get_address_of_x:
## InlineAsm Start
movq $__ZN7foreign13magic_pointerE, %rax
ret
## InlineAsm End
ret /* useless but I don't think there's any way to get rid of it */
But instead I get this "nonsense":
_get_address_of_x:
movq __ZN7foreign13magic_pointerE@GOTPCREL(%rip), %rax
movq %rax, -8(%rbp)
## InlineAsm Start
movq -8(%rbp), %rax
ret
## InlineAsm End
ret
Apparently Clang is assigning the value of &foreign::magic_pointer into %rax (which is deadly to a naked function), and then further "spilling" it onto a stack frame that doesn't even exist, all so it can pull it off again in the inline asm block.
So, how can I make Clang generate exactly the code I want, without resorting to manual name-mangling? I mean I could just write
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile("movq __ZN7foreign13magic_pointerE#GOTPCREL(%rip), %rax\n\t"
"ret");
}
but I really don't want to do that if there's any way to help it.
Before hitting on "p", I had tried the "i" and "n" constraints, but they didn't seem to work properly with 64-bit pointer operands: Clang kept giving me error messages about not being able to allocate the operand to the %flags register, which suggests something crazy was going on.
For those interested in solving the "XY problem" here: I'm really trying to write a much longer assembly stub that calls off to another function foo(void *p, ...) where the argument p is set to this magic pointer value and the other arguments are set based on the original values of the CPU registers at the point this assembly stub was entered. (Hence, naked function.) Arbitrary company policy prevents just writing the damn thing in a .S file to begin with; and besides, I really would like to write foreign::magic_pointer instead of __ZN7foreign...etc.... Anyway, that should explain why spilling temporary results to stack or registers is strictly verboten in this context.
Perhaps there's some way to write
asm volatile(".long %[magic_pointer]" : : [magic_pointer] "???"(&foreign::magic_pointer));
to get Clang to insert exactly the relocation I want?
I think this is what you want:
namespace foreign {
    extern char magic_pointer[];
}

extern "C" __attribute__((naked)) void get_address_of_x(void)
{
    asm volatile ("ret" : : "a"(&foreign::magic_pointer));
}
In this context, "a" is a constraint that specifies that %rax must be used. Clang will then load the address of magic_pointer into %rax in preparation for executing your inline asm, which is all you need.
It's a little dodgy because it's defining constraints that are unreferenced in the asm text, and I'm not sure whether that's technically allowed/well-defined - but it does work on latest clang.
On clang 3.0-6ubuntu3 (because I'm being lazy and using gcc.godbolt.org), with -fPIC, this is the asm you get:
get_address_of_x: # #get_address_of_x
movq foreign::magic_pointer@GOTPCREL(%rip), %rax
ret
ret
And without -fPIC:
get_address_of_x: # #get_address_of_x
movl foreign::magic_pointer, %eax
ret
ret
OP here.
I ended up just writing a helper extern "C" function to return the magic value, and then calling that function from my assembly code. I still think Clang ought to support my original approach somehow, but the main problem with that approach in my real-life case was that it didn't scale to x86-32. On x86-64, loading an arbitrary address into %rdx can be done in a single instruction with a %rip-relative mov. But on x86-32, loading an arbitrary address with -fPIC turns into just a ton of code, .indirect_symbol directives, two memory accesses... I just didn't want to attempt writing all that by hand. So my final assembly code looks like
asm volatile(
    "...save original register values...;"
    "call _get_magic_pointer;"
    "movq %rax, %rdx;"
    "...set up other parameters to foo...;"
    "call _foo;"
    "...cleanup..."
);
Simpler and cleaner. :)
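For reference, the helper itself would be something along these lines (a sketch; the original post doesn't show it, and the name get_magic_pointer is only inferred from the call in the assembly above):
namespace foreign {
    extern char magic_pointer[];
}

// Letting the compiler generate the load means it also picks the right
// PIC/GOT sequence for each target, which was the whole point.
extern "C" char *get_magic_pointer(void)
{
    return foreign::magic_pointer;
}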