How to implement this in inline assembly? - c++

I'm woefully bad at understanding the GNU inline assembly syntax, so I'm hoping a practical example may help. Given the following assembly (x86-64, output by Clang) how would I construct a function using inline assembly that would be identical? GCC produces different code for the same function and I would like to get it to produce an identical version to what Clang (-O3) outputs.
bittest(unsigned char, int):
btl %esi, %edi
setb %al
ret
Here is what GCC (-O3) is producing:
bittest(unsigned char, int):
movzx eax, dil
mov ecx, esi
sar eax, cl
and eax, 1
ret
Here is the C code for the function:
bool bittest(unsigned char byte, int index)
{
return (byte >> index) & 1;
}

Well, the last time I wrote a 32-bit _bittest, it looked something like this (the 64-bit version looks slightly different):
unsigned char _bittest(const long *Base, long Offset)
{
unsigned char old;
__asm__ ("btl %[Offset],%[Base] ; setc %[old]"
         : [old] "=rm" (old)
         : [Offset] "Ir" (Offset), [Base] "rm" (*Base)
         : "cc");
return old;
}
Although if you want to put it in a public header, I have a different version. When I use -O2, it ends up inlining the whole thing to make really efficient code.
I'm surprised gcc doesn't generate the btl here itself (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=36473), but you are right it doesn't.

I think it's unlikely that you can get a byte-for-byte identical version out of your compiler; there are minor differences that aren't worth worrying about. Make sure you're compiling with the correct flags, but trying to get two compilers to produce identical output is probably an exercise in futility.

If you want to generate the exact same code then you can do the following
const char bittestfunction[] = { 0x0f, 0xa3, 0xf7, 0x0f, 0x92, 0xc0, 0xc3 };
int (*bittest)( unsigned char, int ) = (int(*)(unsigned char, int))bittestfunction;
You can call this in the same way bittest( foo, bar ).
From objdump on the (gcc) compiled executable
00000000004006cc <bittestfunction>:
4006cc: 0f a3 f7 bt %esi,%edi
4006cf: 0f 92 c0 setb %al
4006d2: c3 retq

Related

Error : Invalid Character '(' in mnemonic

Hi I am trying to compile the below assembly code on Linux using gcc 7.5 version but somehow getting the error
Error : Invalid Character '(' in mnemonic
bool InterlockedCompareAndStore128(int *dest,int *newVal,int *oldVal)
{
asm(
"push %rbx\n"
"push %rdi\n"
"mov %rcx, %rdi\n" // ptr to dest -> RDI
"mov 8(%rdx), %rcx\n" // newVal -> RCX:RBX
"mov (%rdx), %rbx\n"
"mov 8(%r8), %rdx\n" // oldVal -> RDX:RAX
"mov (%r8), %rax\n"
"lock (%rdi), cmpxchg16b\n"
"mov $0, %rax\n"
"jnz exit\n"
"inc1 %rax\n"
"exit:;\n"
"pop %rdi\n"
"pop %rbx\n"
);
}
Can anyone suggest how to resolve this? I have checked many online links and assembly tutorials but could not pinpoint the exact issue.
Thanks in advance for the help.
In Windows I could see the implementation of the above function as:
function InterlockedCompareExchange128;
asm
.PUSHNV RBX
MOV R10,RCX
MOV RBX,R8
MOV RCX,RDX
MOV RDX,[R9+8]
MOV RAX,[R9]
LOCK CMPXCHG16B [R10]
MOV [R9+8],RDX
MOV [R9],RAX
SETZ AL
MOVZX EAX, AL
end;
For PUSHNV, I could not find anything equivalent on Linux. So basically I am trying to implement the same functionality in C++ on Linux.
The question here was about Invalid Character '(' in mnemonic which the other answer addresses.
However, OP's code has a number of issues beyond that problem. Here's (what I think are) two better approaches to this problem. Note that I've changed the order of the parameters and turned them const.
This one continues to use inline asm, but uses Extended asm instead of Basic. While I'm of the "don't use inline asm" school of thought, this might be useful or at least educational.
bool InterlockedCompareAndStore128B(__int64 *dest, const __int64 *oldVal, const __int64 *newVal)
{
bool result;
__int64 ovl = oldVal[0];
__int64 ovh = oldVal[1];
asm volatile ("lock cmpxchg16b %[ptr]"
: "=@ccz" (result), [ptr] "+m" (*dest),
"+d" (ovh), "+a" (ovl)
: "c" (newVal[1]), "b" (newVal[0])
: "cc", "memory");
// cmpxchg16b changes rdx:rax to the current value in dest. Useful if you need
// to loop until you succeed, but OP's code doesn't save the values, so I'm
// just following that spec.
//oldVal[0] = ovl;
//oldVal[1] = ovh;
return result;
}
In addition to solving the problems with the original code, it's also inlineable and shorter. The constraints likely make it harder to read, but the fact that there's only 1 line of asm might help offset that. If you want to understand what the constraints mean, check out this page (scroll down to x86 family) and the description of flag output constraints (again, scroll down for x86 family).
As an alternative, this code uses a gcc builtin and allows the compiler to generate the appropriate asm instructions. Note that this must be built with -mcx16 for best results.
bool InterlockedCompareAndStore128C(__int128 *dest, const __int128 *oldVal, const __int128 *newVal)
{
// While a sensible person would use __atomic_compare_exchange_n and let gcc generate
// cmpxchg16b, gcc decided they needed to turn this into a big hairy function call:
// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
// In short, if someone wants to compare/exchange against readonly memory, you can't just
// use cmpxchg16b cuz it would crash. Why would anyone try to exchange memory that can't
// be written to? Apparently because it's expected to *not* crash if the compare fails
// and nothing gets written. So no one gets to use that 1 line instruction and everyone
// gets an entire routine (that uses MUTEX instead of lockfree) to support this absurd
// border case. Sounds dumb to me, but that's where things stand as of 2021-05-07.
// Use the legacy function instead.
bool b = __sync_bool_compare_and_swap(dest, *oldVal, *newVal);
return b;
}
For the kibitzers in the crowd, here's the code generated by -m64 -O3 -mcx16 for that last one:
InterlockedCompareAndStore128C(__int128*, __int128 const*, __int128 const*):
mov rcx, rdx
push rbx
mov rax, QWORD PTR [rsi]
mov rbx, QWORD PTR [rcx]
mov rdx, QWORD PTR [rsi+8]
mov rcx, QWORD PTR [rcx+8]
lock cmpxchg16b XMMWORD PTR [rdi]
pop rbx
sete al
ret
If someone wants to fiddle, here's the godbolt link.
There are a number of problems with this code, and I'm not convinced I'm doing you any favors by telling you how to fix the specific problem.
But the short answer is that
"lock (%rdi), cmpxchg16b\n"
should be
"lock cmpxchg16b (%rdi)\n"
Tada, now it compiles. Well, it would if inc1 were a real instruction.
But I can't help but notice that the pointers here are int *, which is 4 bytes, not 16. And that this function is not declared as naked. And using Extended asm would save you from having to push all these registers around by hand, making this code a lot slower than it needs to be.
But most of all, you should really use the builtins, like __atomic_compare_exchange because inline asm is error prone, not portable, and really hard to maintain.

Can I replace my if-statements to save running time?

I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies: if you are writing optimised code, avoid anything that can cause branches, since a failed branch prediction forces the CPU to flush its pipeline.
Bear in mind also that there are not bool and int types as such in assembler: just registers, so you will probably find that all conversions will be optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool isn't messed up, so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
if(a){b++;}
return b;
}
int with_if_char(unsigned char a, int b) {
if(a){b++;}
return b;
}
int without_if(bool a, int b) {
b += a;
return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it was an unsigned char, without exploiting its properties, generating similar code as clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ treats the bool and the unsigned char versions equally, just as gcc does, although with more naive codegen (it uses a conditional move instead of arithmetic with the flags register, which IIRC traditionally used to be less efficient; I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler, and the results point to this specific code (or to some tight loop that involve it); in general you should write sensible, semantically correct code and focus on using the correct algorithms/data structures. Micro-optimization comes later.
In my program, this wouldn't work, as a is actually an operation of the type: b+=(a==c)
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.

x86, C++, gcc and memory alignment

I have this simple C++ code:
int testFunction(int* input, long length) {
int sum = 0;
for (long i = 0; i < length; ++i) {
sum += input[i];
}
return sum;
}
#include <stdlib.h>
#include <iostream>
using namespace std;
int main()
{
union{
int* input;
char* cinput;
};
size_t length = 1024;
input = new int[length];
//cinput++;
cout<<testFunction(input, length-1);
}
If I compile it with g++ 4.9.2 with -O3, it runs fine. I expected that if I uncommented the `//cinput++;` line it would merely run slower; instead it outright crashes with SIGSEGV.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400754 in main ()
(gdb) disassemble
Dump of assembler code for function main:
0x00000000004006e0 <+0>: sub $0x8,%rsp
0x00000000004006e4 <+4>: movabs $0x100000000,%rdi
0x00000000004006ee <+14>: callq 0x400690 <_Znam#plt>
0x00000000004006f3 <+19>: lea 0x1(%rax),%rdx
0x00000000004006f7 <+23>: and $0xf,%edx
0x00000000004006fa <+26>: shr $0x2,%rdx
0x00000000004006fe <+30>: neg %rdx
0x0000000000400701 <+33>: and $0x3,%edx
0x0000000000400704 <+36>: je 0x4007cc <main+236>
0x000000000040070a <+42>: cmp $0x1,%rdx
0x000000000040070e <+46>: mov 0x1(%rax),%esi
0x0000000000400711 <+49>: je 0x4007f1 <main+273>
0x0000000000400717 <+55>: add 0x5(%rax),%esi
0x000000000040071a <+58>: cmp $0x3,%rdx
0x000000000040071e <+62>: jne 0x4007e1 <main+257>
0x0000000000400724 <+68>: add 0x9(%rax),%esi
0x0000000000400727 <+71>: mov $0x3ffffffc,%r9d
0x000000000040072d <+77>: mov $0x3,%edi
0x0000000000400732 <+82>: mov $0x3fffffff,%r8d
0x0000000000400738 <+88>: sub %rdx,%r8
0x000000000040073b <+91>: pxor %xmm0,%xmm0
0x000000000040073f <+95>: lea 0x1(%rax,%rdx,4),%rcx
0x0000000000400744 <+100>: xor %edx,%edx
0x0000000000400746 <+102>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400750 <+112>: add $0x1,%rdx
=> 0x0000000000400754 <+116>: paddd (%rcx),%xmm0
0x0000000000400758 <+120>: add $0x10,%rcx
0x000000000040075c <+124>: cmp $0xffffffe,%rdx
0x0000000000400763 <+131>: jbe 0x400750 <main+112>
0x0000000000400765 <+133>: movdqa %xmm0,%xmm1
0x0000000000400769 <+137>: lea -0x3ffffffc(%r9),%rcx
---Type <return> to continue, or q <return> to quit---
Why does it crash? Is it a compiler bug? Am I causing some undefined behavior? Does the compiler expect that ints are always 4-byte-aligned?
I also tested it on clang and there's no crash.
Here's g++'s assembly output: http://pastebin.com/CJdCDCs4
The code input = new int[length]; cinput++; causes undefined behaviour because the second statement is reading from a union member that is not active.
Even ignoring that, testFunction(input, length-1) would again have undefined behaviour for the same reason.
Even ignoring that, the sum loop accesses an object through a glvalue of the wrong type, which has undefined behaviour.
Even ignoring that, reading from an uninitialized object, as your sum loop does, would again have undefined behaviour.
gcc has vectorized the loop with SSE instructions. paddd (like most SSE instructions) requires 16 byte alignment. I haven't looked at the code previous to paddd in detail but I expect that it assumes 4 byte alignment initially, iterates with scalar code (where misalignment only incurs a performance penalty, not a crash) until it can assume 16 byte alignment, then enters the SIMD loop, processing 4 ints at a time. By adding an offset of 1 byte you are breaking the precondition of 4 byte alignment for the array of ints, and after that all bets are off. If you're going to be doing nasty stuff with misaligned data (and I highly recommend you don't) then you should disable automatic vectorization (gcc -fno-tree-vectorize).
The instruction that crashed is paddd (you highlighted it). The name is short for "packed add doubleword" (see e.g. here) - it is a part of the SSE instruction set. These instructions require aligned pointers; for example, the link above has a description of exceptions that paddd may cause:
GP(0)
...(128-bit operations only)
If a memory operand is not aligned on a 16-byte boundary, regardless of segment.
This is exactly your case. The compiler arranged the code in such a way that it could use these fast 128-bit operations like paddd, and you subverted it with your union trick.
My guess is that the code generated by clang doesn't use SSE, so it isn't sensitive to alignment. If so, it's also probably much slower (but you won't notice with just 1024 iterations).

Wrong sign extension c++/qt

I am running the following code on an Desktop x64 Intel architecture (gcc compiler, linux) and on an RaspberryPi Arm (gcc cross compiler).
quint32 id;
id=((quint32)ref[1]);
id|=((quint32)ref[2])<<8;
id|=((quint32)ref[3])<<16;
id|=((quint32)ref[4])<<24;
where ref is a QByteArray.
I noticed that, despite the cast to quint32 (which is unsigned int), my PC does a sign extension, which causes errors when the given byte is a negative number. My code runs fine on ARM. Why does this happen? I thought the cast should prevent it. Doesn't it?
id|=((quint32)ref[4])<<24;
disassembly:
mov -0x160(%rbp),%rax
mov $0x4,%esi
mov %rax,%rdi
callq 0x425af0 <QByteArray::operator[](int)>
mov %rax,%rcx
mov %edx,%eax
mov %rcx,-0x170(%rbp)
mov %eax,-0x168(%rbp)
mov -0x170(%rbp),%rax
mov %rax,-0x40(%rbp)
mov -0x168(%rbp),%rax
mov %rax,-0x38(%rbp)
lea -0x40(%rbp),%rax
mov %rax,%rdi
callq 0x425aa0 <QByteRef::operator char() const>
movsbl %al,%eax #sign extension.
shl $0x18,%eax
or %eax,-0x148(%rbp)
I have also noticed that the compiler uses the QByteRef return value instead of char, but I guess that should not cause any error.
from QByteArray help page :
QByteRef operator[](int i)
char operator[](int i) const
QByteRef operator[](uint i)
char operator[](uint i) const
Thanks in advance
When converting from a signed char to an unsigned int (or another larger unsigned type), the language specifies that sign extension happens first. To avoid that, cast the char to unsigned char as your first step. You shouldn't need any other cast; the increase in size happens automatically.

Getting a compiler to generate adc instruction

Is there any way to get either Clang, GCC or VS to generate adc (add with carry) instructions only using Standard-C++(98/11/14)? (Edit: I mean in x64 mode, sorry if that wasn't clear.)
If your code makes a comparison and adds the result of the comparison to something, then an adc is typically emitted by gcc 5 (incidentally, gcc 4.8 does not emit an adc here). For example,
unsigned foo(unsigned a, unsigned b, unsigned c, unsigned d)
{
return (a + b + (c < d));
}
assembles to
foo:
cmpl %ecx, %edx
movl %edi, %eax
adcl %esi, %eax
ret
However, it is a bit tricky to get gcc to really emit an adc.
There's an __int128_t type available on GCC for amd64 and other 64bit targets, which will use a pair of add/adc instructions for a simple addition. (See the Godbolt link below).
Also, this pure ISO C code may compile to an adc:
uint64_t adc(uint64_t a, uint64_t b)
{
a += b;
if (a < b) /* should simplify to nothing (setting carry is implicit in the add) */
a++; /* should simplify to adc r0, 0 */
return a;
}
For me (ARM) it generated something kind of silly, but it compiles for x86-64 (on the Godbolt compiler explorer) to this:
mov rax, rdi # a, a
add rax, rsi # a, b
adc rax, 0 # a,
ret
If you compile a 64-bit signed addition for 32-bit x86 (int64_t in C++11), the compiled code will contain an adc instruction.
Edit: code sample:
int64_t add_numbers(int64_t x, int64_t y) {
return x + y;
}
On 32-bit x86, the addition is implemented using an add instruction followed by an adc; on x64, a single add instruction suffices.