Is there any way to get Clang, GCC, or VS to generate adc (add with carry) instructions using only standard C++ (98/11/14)? (Edit: I mean in x64 mode, sorry if that wasn't clear.)
If your code makes a comparison and adds the result of the comparison to something, then an adc is typically emitted by gcc 5 (incidentally, gcc 4.8 does not emit an adc here). For example,
unsigned foo(unsigned a, unsigned b, unsigned c, unsigned d)
{
return (a + b + (c < d));
}
assembles to
foo:
cmpl %ecx, %edx
movl %edi, %eax
adcl %esi, %eax
ret
However, it is a bit tricky to get gcc to really emit an adc.
There's an __int128_t type available in GCC for amd64 and other 64-bit targets, and a simple addition of two such values compiles to a pair of add/adc instructions (see the Godbolt link below).
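For example, a minimal sketch (unsigned __int128 is a GCC/Clang extension, not standard C++; the function name is mine):
// On x86-64 the low 64-bit halves are added with add and the high halves with adc.
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;
}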
Also, this pure ISO C code may compile to an adc:
#include <stdint.h>

uint64_t adc(uint64_t a, uint64_t b)
{
a += b;
if (a < b) /* should simplify to nothing (setting carry is implicit in the add) */
a++; /* should simplify to adc r0, 0 */
return a;
}
For me (ARM) it generated something kind of silly, but it compiles for x86-64 (on the Godbolt compiler explorer) to this:
mov rax, rdi # a, a
add rax, rsi # a, b
adc rax, 0 # a,
ret
If you compile a 64-bit signed addition for 32-bit x86 (int64_t in C++11), the compiled code will contain an adc instruction.
Edit: code sample:
#include <cstdint>

int64_t add_numbers(int64_t x, int64_t y) {
return x + y;
}
On 32-bit x86, the addition is implemented with an add instruction followed by an adc. On x86-64, a single add instruction is enough.
Related
Consider this C++ code:
#include <cstdint>
// returns a if less than b or if b is INT32_MIN
int32_t special_min(int32_t a, int32_t b)
{
return a < b || b == INT32_MIN ? a : b;
}
GCC with -fwrapv correctly realizes that subtracting 1 from b can eliminate the special case, and it generates this code for x86-64:
lea edx, [rsi-1]
mov eax, edi
cmp edi, edx
cmovg eax, esi
ret
But without -fwrapv it generates worse code:
mov eax, esi
cmp edi, esi
jl .L4
cmp esi, -2147483648
je .L4
ret
.L4:
mov eax, edi
ret
I understand that -fwrapv is needed if I write C++ code which relies on signed overflow. But:
The above C++ code does not depend on signed overflow (it is valid standard C++).
We all know that signed overflow has a specific behavior on x86-64.
The compiler knows it is compiling for x86-64.
If I wrote "hand optimized" C++ code trying to implement that optimization, I understand -fwrapv would be required, otherwise the compiler could decide the signed overflow is UB and do whatever it wants in the case where b == INT32_MIN. But here the compiler is in control, and I don't see what stops it from using the optimization without -fwrapv. Is there some reason it isn't allowed to?
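For concreteness, here is a sketch (my own naming) of what such a hand-written version might look like; it relies on b - 1 wrapping from INT32_MIN to INT32_MAX, so it is only correct under -fwrapv:
// Hypothetical hand-optimized version: folds the special case into one
// compare by computing b - 1, which wraps for b == INT32_MIN. That wrap is
// signed overflow, so this is only valid when compiled with -fwrapv.
int32_t special_min_by_hand(int32_t a, int32_t b)
{
    return a <= b - 1 ? a : b;   // same shape as the cmp/cmovg asm above
}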
This kind of missed optimization has happened before in GCC, like not fully treating signed int add as associative even though it's compiling for a 2's complement target with wrapping addition. So it optimizes better for unsigned. IIRC, the reason was something like GCC losing track of some information it had about the operations, and thus being conservative? I forget if that ever got fixed.
I can't find where I've seen this before on SO with a reply from a GCC dev about the internals; maybe it was in a GCC bug report? I think it was with something like a+b+c+d+e (not) re-associating into a tree of dependencies to shorten the critical path. But unfortunately it's still present in current GCC:
int sum(int a, int b, int c, int d, int e, int f) {
return a+b+c+d+e+f;
// gcc and clang make one stupid dep chain
}
int sumv2(int a, int b, int c, int d, int e, int f) {
return (a+b)+(c+d)+(e+f);
// clang pessimizes this back to 1 chain, GCC doesn't
}
unsigned sumu(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
return a+b+c+d+e+f;
// gcc and clang make one stupid dep chain
}
unsigned sumuv2(unsigned a, unsigned b, unsigned c, unsigned d, unsigned e, unsigned f) {
return (a+b)+(c+d)+(e+f);
// GCC and clang pessimize back to 1 chain for unsigned
}
On Godbolt for x86-64 System V at -O3, clang and gcc -fwrapv make the same asm for all 4 functions, as you'd expect.
GCC (without -fwrapv) makes the same asm for sumu as for sumuv2 (summing into r8d, the reg that held e). But GCC makes different asm for sum and sumv2, because they use signed int:
# gcc -O3 *without* -fwrapv
# The same order of operations as the C source
sum(int, int, int, int, int, int):
add edi, esi # a += b
add edi, edx # ((a+b) + c) ...
add edi, ecx # sum everything into EDI
add edi, r8d
lea eax, [rdi+r9]
ret
# also as written, the source order of operations:
sumv2(int, int, int, int, int, int):
add edi, esi # a+=b
add edx, ecx # c+=d
add r8d, r9d # e+=f
add edi, edx # a += c
lea eax, [rdi+r8] # retval = a + e
ret
So ironically GCC makes better asm when it doesn't re-associate the source. That's assuming that all 6 inputs are ready at once. If out-of-order exec of earlier code only produced the input registers 1 per cycle, the final result here would be ready only 1 cycle after the final input was ready, assuming that final input was f.
But if the last input was a or b, the result wouldn't be ready until 5 cycles later with the single chain like GCC and clang use when they can. vs. 3 cycles worst case for the tree reduction, 2 cycle best case (if e or f were ready last).
(Update: -mtune=znver2 makes GCC re-associate into a tree, thanks @amonakov. So this is a tuning choice with a default that seems strange to me, at least for this specific problem-size. See GCC source, search for reassoc to see costs for other tuning settings; most of them are 1,1,1,1 which is insane, especially for floating-point. This might be why GCC fails to use multiple vector accumulators when unrolling FP loops, defeating the purpose.)
But anyway, this is a case of GCC only re-associating signed int with -fwrapv. So clearly it limits itself more than necessary without -fwrapv.
Related: Compiler optimizations may cause integer overflow. Is that okay? - it is of course legal, and failure to do it is a missed optimization.
GCC isn't totally hamstrung by signed int; it will auto-vectorize int sum += arr[i], and it does manage to optimize Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)? for signed int a.
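For reference, a sketch of the kind of signed reduction meant here (naming is mine); GCC does auto-vectorize this at -O3 even though vectorizing reorders the additions:
int sum_array(const int *arr, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += arr[i];   // signed int reduction, still auto-vectorized
    return sum;
}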
For this fragment of code (https://godbolt.org/z/s4PY44dha)
#include <immintrin.h>

int foo(unsigned long long x)
{
return _lzcnt_u64(x);
}
GCC generates 3 asm instructions
xorl %eax, %eax
lzcntq %rdi, %rax
ret
while clang generates only 2
lzcntq %rdi, %rax
retq
Is it possible to change the implementation/signature of foo to help GCC understand that this xor instruction is useless? Why can't gcc perform such a simple optimization itself?
The answer to the question Why does breaking the "output dependency" of LZCNT matter? explains that this xor can be useful on some older architectures to break a so-called "false dependency" on the destination register. It even mentions that the issue it is supposed to fix is not present in modern Intel architectures, starting from "Skylake-S (client)". I tried passing newer architectures to GCC (for example -march=rocketlake, -march=icelake-client), but it still inserts the "useless" xor.
In contrast, clang doesn't insert the xor even for old architectures like Haswell. This means that if one wants to squeeze every last bit of performance out of a particular architecture, the insertion of the xor has to be controlled manually.
For example, with this inline assembly I managed to get the code without the xor:
int xorless_lzcntq(unsigned long long x) {
unsigned long long res;
asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
return res;
}
#include <cstdint>
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
if (a) {
return x | (a << n);
}
return x;
}
uint64_t hr2(const uint64_t x, const bool a, const int n)
{
return x | ((a ? 1ull : 0) << n);
}
https://godbolt.org/z/gy_65H
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
jne .L4
ret
.L4:
mov ecx, edx
mov esi, 1
sal esi, cl
movsx rsi, esi
or rax, rsi
ret
hr2(unsigned long, bool, int):
mov ecx, edx
movzx esi, sil
sal rsi, cl
mov rax, rsi
or rax, rdi
ret
Why can't clang and gcc optimize the first function like the second?
The functions do not have identical behavior. In particular in the first one a will undergo integer promotion to int in a << n, so that the shift will have undefined behavior if n >= std::numeric_limits<int>::digits (typically 31).
This is not the case in the second function where a ? 1ull : 0 will result in the common type of unsigned long long, so that the shift will have well-defined behavior for all non-negative values n < std::numeric_limits<unsigned long long>::digits (typically 64) which is most likely more than std::numeric_limits<int>::digits (typically 31).
You should cast a and 1 to uint64_t in both shifts to make the code well behaved for all sensible inputs (i.e. 0 <= n < 64).
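A minimal sketch of that fix (function names are mine), doing the shift at uint64_t width in both variants so that 0 <= n < 64 is well-defined everywhere:
uint64_t hr1_fixed(const uint64_t x, const bool a, const int n) noexcept
{
    if (a) {
        return x | (uint64_t(a) << n);   // 64-bit shift, no int promotion
    }
    return x;
}

uint64_t hr2_fixed(const uint64_t x, const bool a, const int n)
{
    return x | (uint64_t(a) << n);
}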
Even after fixing that, the functions do not have equal behavior. The second function will have undefined behavior if n >= 64 or n < 0, no matter what the value of a is, while the first function has well-defined behavior for a == false: the compiler must guarantee that this case returns x unmodified, no matter how large (or negative) the value of n is.
The second function therefore in principle gives the compiler more freedom to optimize since the range of valid input values is much smaller.
Of course, if the function gets inlined (likely), the compiler may use what it knows about the possible range of values in the call arguments for a and n and optimize further based on that.
This isn't the issue here, though: GCC will compile the first function to similar assembly if, for example,
uint64_t hr1(const uint64_t x, const bool a, const int n) noexcept
{
return a ? x | (uint64_t{1} << n) : x | (uint64_t{0} << n);
}
is used (which has the same valid inputs as hr2). I don't know which of the two assemblies will perform better. I suppose you will have to benchmark that or wait for some expert on that to show up.
Both ways look over-complicated (and the first one is buggy for n>=32). To promote a bool to a uint64_t 0 or 1, just use uint64_t(a) or a C-style cast. You don't need a ? 1ull : 0.
The simple branchless way is probably good, unless you expect a to be highly predictable (e.g. usually one way, or correlated with earlier branching. Modern TAGE predictors use recent branch history to index the BHT / BTB.)
uint64_t hr2(uint64_t x, bool a, int n) {
return x | (uint64_t(a) << n);
}
If you want to make this more complicated to avoid UB when n is out of range, write your C++ to wrap the shift count the same way x86 shift instructions do, so the compiler doesn't need any extra instructions.
#include <limits>
uint64_t hr3(uint64_t x, bool a, int n) {
using shiftwidth = decltype(x);
const int mask = std::numeric_limits<shiftwidth>::digits - 1;
// wrap the count to the shift width to avoid UB
// x86 does this for free for 32 and 64-bit shifts.
return x | (shiftwidth(a) << (n & mask));
}
Both versions compile identically for x86: the generated code only has to be correct for inputs that avoid UB, and for those the mask changes nothing (x86 shift instructions mask the count this way anyway).
This compiles decently if you have BMI2 (for single-uop variable-count shifts on Intel), otherwise it's not great. (https://agner.org/optimize/ and https://uops.info/) But even then there are missed optimizations from GCC:
# GCC9.2 -O3 -march=skylake
hr3(unsigned long, bool, int):
movzx esi, sil # zero-extend the bool to 64-bit, 1 cycle latency because GCC failed to use a different register
shlx rsi, rsi, rdx # the shift
mov rax, rsi # stupid GCC didn't put the result in RAX
or rax, rdi # retval = shift | x
ret
This could have been
# hand optimized, and clang 9.0 -O3 -march=skylake
movzx eax, sil # mov-elimination works between different regs
shlx rax, rax, rdx # don't need to take advantage of copy-and-shift
or rax, rdi
ret
It turns out that clang 9.0 actually does emit this efficient version with -O3 -march=skylake or znver1 (Godbolt).
This is cheap enough (3 uops) it's not worth branching for, except to break the data dependency on n in case x and a are likely to be ready earlier than n.
But without BMI2, the shift would take a mov ecx, edx, and a 3-uop (on Intel SnB-family) shl rax, cl. AMD has single-uop variable-count shifts even for the legacy versions that do write flags (except when CL=0 and they have to leave FLAGS unmodified; that's why it costs more on Intel). GCC is still dumb and zero-extends in place instead of into RAX. Clang gets it right (and takes advantage of the unofficial calling convention feature where narrow function args are sign or zero-extended to 32-bit so it can use mov instead of movzx) https://godbolt.org/z/9wrYEN
Clang compiles an if() to branchless code using CMOV, so that's significantly worse than the simple version that uses uint64_t(a) << n. It's a missed optimization that it doesn't compile my hr1 the same as my hr3.
GCC actually branches and then uses mov reg, 1 / shl / or for the if version. Again it could compile it the same as hr3 if it chose to. (It can assume that a=1 implies n<=63, otherwise the if version would have shift UB.)
The missed optimization in both is the failure to use bts, which implements reg |= 1<<(n&63).
Especially for gcc after branching, so it knows it's shifting a constant 1, the tail of the function should be bts rax, rdx, which is 1 uop with 1c latency on Intel and 2 uops on AMD Zen1 / Zen2. GCC and clang do know how to use bts for the simple case of a compile-time-constant a=1, though: https://godbolt.org/z/rkhbzH
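For reference, a sketch (my naming) of that constant-a=1 pattern, the kind of thing compilers will turn into a single bts:
uint64_t set_bit(uint64_t x, int n)
{
    // a == 1 known at compile time: just set bit n of x
    return x | (uint64_t(1) << (n & 63));
}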
There's no way that I know of to hand-hold GCC or clang into using bts otherwise, and I wouldn't recommend inline-assembly for this unless it's in the most critical inner loop of something and you're prepared to check that it doesn't hurt other optimizations, and to maintain it. i.e. just don't.
But ideally GCC / clang would do something like this when BMI2 isn't available:
# hand optimized, compilers should do this but don't.
mov rax, rdi # x
bts rdi, rdx # x | 1<<(n&63)
test sil, sil
cmovnz rax, rdi # return a ? x_with_bit_set : x;
ret
Doesn't require BMI2, but still only 4 uops on Broadwell and later. (And 5 uops on AMD Bulldozer / Zen). Critical path latencies:
x -> retval: 2 cycles (through (MOV and BTS) -> CMOV) on Broadwell and later. 3 cycles on earlier Intel (2 uop cmov) and on any AMD (2 uop BTS).
n -> retval: same as x (through BTS -> CMOV).
a -> retval: 2 cycles (through TEST -> CMOV) on Broadwell and later, and all AMD. 3 cycles on earlier Intel (2 uop cmov).
This is pretty obviously better than what clang emits for any version without -march=skylake or other BMI2, and even more clearly better than what GCC emits (unless branchy turns out to be a good strategy).
One way that clang will use BTS:
If we mask the shift count for the branchy version, then clang will actually branch, and on the branch where the if body runs it implements it with bts as I described above. https://godbolt.org/z/BtT4w6
uint64_t hr1(uint64_t x, bool a, int n) noexcept
{
if (a) {
return x | (uint64_t(a) << (n&63));
}
return x;
}
clang 9.0 -O3 (without -march=)
hr1(unsigned long, bool, int):
mov rax, rdi
test sil, sil
je .LBB0_2 # if(a) {
bts rax, rdx # x |= 1<<(n&63)
.LBB0_2: # }
ret
So if branchy is good for your use-case, then this way of writing it compiles well with clang.
These stand-alone versions might end up different after inlining into a real caller.
For example, a caller might save a MOV instruction if it can have the shift count n already in CL. Or the decision on whether to do if-conversion from an if to a branchless sequence might be different.
Or if n is a compile-time constant, that means we don't need BMI2 to save uops on the shift anymore; immediate shifts are fully efficient on all modern CPUs (single uop).
And of course if a is a compile time constant then it's either nothing to do or optimizes to a bts.
Further reading: see the performance links in https://stackoverflow.com/tags/x86/info for more about how to decide if asm is efficient by looking at it.
I have Ubuntu 16.04, x86_64 arch, kernel version 4.15.0-39-generic.
GCC 8.1.0
I tried to rewrite these functions (from the first post at https://groups.google.com/forum/#!topic/comp.lang.c++.moderated/qHDCU73cEFc) from Intel dialect to AT&T, and I did not succeed.
namespace atomic {
__declspec(naked)
static void*
ldptr_acq(void* volatile*) {
_asm {
MOV EAX, [ESP + 4]
MOV EAX, [EAX]
RET
}
}
__declspec(naked)
static void*
stptr_rel(void* volatile*, void* const) {
_asm {
MOV ECX, [ESP + 4]
MOV EAX, [ESP + 8]
MOV [ECX], EAX
RET
}
}
}
Then I wrote a simple program to get back the same pointer that I pass in. I installed GCC 8.1, which supports the naked attribute for functions (https://gcc.gnu.org/gcc-8/changes.html: "The x86 port now supports the naked function attribute").
As far as I remember, this attribute tells the compiler not to create the prologue and epilogue of the function, so I can take the parameters from the stack myself and return them.
Code (doesn't work; segfaults):
#include <cstdio>
#include <cstdlib>
__attribute__ ((naked))
int *get_num(int*) {
__asm__ (
"movl 4(%esp), %eax\n\t"
"movl (%eax), %eax\n\t"
"ret"
);
}
int main() {
int *i =(int*) malloc(sizeof(int));
*i = 5;
int *j = get_num(i);
printf("%d\n", *j);
free(i);
return 0;
}
Then I tried using 64-bit registers (also doesn't work; segfaults):
__asm__ (
"movq 4(%rsp), %rax\n\t"
"movq (%rax), %rax\n\t"
"ret"
);
Only after I took the argument from the rdi register did it all work:
__asm__ (
"movq %rdi, %rax\n\t"
"ret"
);
Why did I fail to pass the argument through the stack? I probably made a mistake somewhere. Please tell me where I went wrong.
Because the x86-64 System V calling convention passes args in registers, not on the stack, unlike the old inefficient i386 System V calling convention.
You always have to write asm that matches the calling convention if you're writing the whole function in asm, as with a naked function or a stand-alone .S file.
GNU C extended asm allows you to use operands to specify the inputs to an asm statement, and the compiler will generate instructions to make that happen. (I wouldn't recommend using it until you understand asm and how compilers turn C into asm with optimization enabled, though.)
Also note that movq %rdi, %rax implements long *foo(long*p){return p;} not return *p. Perhaps you meant mov (%rdi), %rax to dereference the pointer arg?
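Putting those together, a sketch (assuming the x86-64 System V convention; the function name is mine) of a naked function that dereferences its pointer argument:
__attribute__ ((naked))
long deref(long *p) {
    __asm__ (
        "movq (%rdi), %rax\n\t"   // rax = *p (arg arrives in rdi, result goes in rax)
        "ret"
    );
}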
And BTW, you definitely don't need and shouldn't use inline asm for this. https://gcc.gnu.org/wiki/DontUseInlineAsm, and see https://stackoverflow.com/tags/inline-assembly/info
In GNU C, you can cast a pointer to volatile uint64_t*. Or you can use __atomic_load_n (ptr, __ATOMIC_ACQUIRE) to get basically everything you were getting from that asm, without the overhead of a function call or any of the cost for the optimizer at the call-site of having all the call-clobbered registers be clobbered.
You can use them on any object: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html Unlike C++11 where you can only do atomic ops on a std::atomic<T>.
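A sketch of the two operations from the linked post written with those builtins (keeping the original names; the memory orders match the acq/rel semantics the names imply):
static void *ldptr_acq(void *volatile *p) {
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);   // acquire load of the pointer
}

static void *stptr_rel(void *volatile *p, void *const v) {
    __atomic_store_n(p, v, __ATOMIC_RELEASE);      // release store
    return v;   // the original asm returned the stored value in EAX
}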
I'm woefully bad at understanding GNU inline assembly syntax, so I'm hoping a practical example will help. Given the following assembly (x86-64, output by Clang), how would I construct a function using inline assembly that is identical to it? GCC produces different code for the same function, and I would like to get it to produce a version identical to what Clang (-O3) outputs.
bittest(unsigned char, int):
btl %esi, %edi
setb %al
ret
Here is what GCC (-O3) is producing:
bittest(unsigned char, int):
movzx eax, dil
mov ecx, esi
sar eax, cl
and eax, 1
ret
Here is the C code for the function:
bool bittest(unsigned char byte, int index)
{
return (byte >> index) & 1;
}
Well, the last time I wrote a 32-bit bittest, it looked something like this (the 64-bit version looks slightly different):
unsigned char _bittest(const long *Base, long Offset)
{
unsigned char old;
__asm__ ("btl %[Offset],%[Base] ; setc %[old]" :
[old] "=rm" (old) :
[Offset] "Ir" (Offset), [Base] "rm" (*Base) :
"cc");
return old;
}
Although if you want to put it in a public header, I have a different version. When I use -O2, it ends up inlining the whole thing to make really efficient code.
I'm surprised gcc doesn't generate the btl here itself (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36473), but you are right it doesn't.
I think it's unlikely that you can nail down a byte-for-byte equivalent version in your compiler; there are minor differences that aren't worth worrying about. Following this question, make sure you're compiling with the correct flags. Trying to get two compilers to produce identical output is probably an exercise in futility.
If you want to generate the exact same code then you can do the following
const unsigned char bittestfunction[] = { 0x0f, 0xa3, 0xf7, 0x0f, 0x92, 0xc0, 0xc3 };
int (*bittest)( unsigned char, int ) = (int(*)(unsigned char, int))bittestfunction;
You can call this in the same way bittest( foo, bar ).
From objdump on the (gcc) compiled executable
00000000004006cc <bittestfunction>:
4006cc: 0f a3 f7 bt %esi,%edi
4006cf: 0f 92 c0 setb %al
4006d2: c3 retq