Referencing memory operands in .intel_syntax GNU C inline assembly - c++

I'm catching a link error when compiling and linking a source file with inline assembly.
Here are the test files:
via:$ cat test.cxx
extern int libtest();
int main(int argc, char* argv[])
{
return libtest();
}
$ cat lib.cxx
#include <stdint.h>
int libtest()
{
uint32_t rnds_00_15;
__asm__ __volatile__
(
".intel_syntax noprefix ;\n\t"
"mov DWORD PTR [rnds_00_15], 1 ;\n\t"
"cmp DWORD PTR [rnds_00_15], 1 ;\n\t"
"je done ;\n\t"
"done: ;\n\t"
".att_syntax noprefix ;\n\t"
:
: [rnds_00_15] "m" (rnds_00_15)
: "memory", "cc"
);
return 0;
}
Compiling and linking the program results in:
via:$ g++ -fPIC test.cxx lib.cxx -c
via:$ g++ -fPIC lib.o test.o -o test.exe
lib.o: In function `libtest()':
lib.cxx:(.text+0x1d): undefined reference to `rnds_00_15'
lib.cxx:(.text+0x27): undefined reference to `rnds_00_15'
collect2: error: ld returned 1 exit status
The real program is more complex. The routine is out of registers so the flag rnds_00_15 must be a memory operand. Use of rnds_00_15 is local to the asm block. It is declared in the C code to ensure the memory is allocated on the stack and nothing more. We don't read from it or write to it as far as the C code is concerned. We list it as a memory input so GCC knows we use it and wire up the "C variable name" in the extended ASM.
Why am I receiving a link error, and how do I fix it?

Compile with gcc -masm=intel and don't try to switch modes inside the asm template string. AFAIK there's no equivalent before clang14 (Note: MacOS installs clang as gcc / g++ by default.)
Also, of course you need to use valid GNU C inline asm, using operands to tell the compiler which C objects you want to read and write.
Can I use Intel syntax of x86 assembly with GCC? clang14 supports -masm=intel like GCC
How to set gcc to use intel syntax permanently? clang13 and earlier didn't.
I don't believe Intel syntax uses the percent sign. Perhaps I am missing something?
You're getting mixed up between %operand substitutions into the Extended-Asm template (which use a single %), vs. the final asm that the assembler sees.
You need %% to use a literal % in the final asm. You wouldn't use "mov %%eax, 1" in Intel-syntax inline asm, but you do still use "mov %0, 1" or %[named_operand].
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html. In Basic asm (no operands), there is no substitution and % isn't special in the template, so you'd write mov $1, %eax in Basic asm vs. mov $1, %%eax in Extended, if for some reason you weren't using an operand like mov $1, %[tmp] or mov $1, %0.
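As a minimal sketch of the single-% versus double-% rule described above, assuming GCC or Clang targeting x86 with the default AT&T dialect (no -masm=intel): the function name store_via_eax is made up for illustration, and the non-x86 branch exists only so the snippet still compiles elsewhere.

```cpp
#include <stdint.h>

// Hypothetical helper: copies `value` into a local by way of EAX.
uint32_t store_via_eax(uint32_t value)
{
    uint32_t out;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ (
        "movl %[in], %%eax \n\t"    // %[in] is substituted; %%eax becomes a literal %eax
        "movl %%eax, %[out] \n\t"   // %[out] expands to a memory operand chosen by the compiler
        : [out] "=m" (out)
        : [in] "r" (value)
        : "eax"                     // EAX is used behind the compiler's back, so declare it
    );
#else
    out = value;                    // fallback for non-x86 targets (assumption, not from the answer)
#endif
    return out;
}
```

In real code you would normally let the compiler pick the register through an operand instead of hard-coding EAX; the literal register is here only to show where %% is needed.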
uint32_t rnds_00_15; is a local with automatic storage. Of course there's no asm symbol with that name.
Use %[rnds_00_15] and compile with -masm=intel. (And remove the .att_syntax at the end; that would break the compiler-generated asm that comes after.)
You also need to remove the DWORD PTR, because the operand-expansion already includes that, e.g. DWORD PTR [rsp - 4], and clang errors on DWORD PTR DWORD PTR [rsp - 4]. (GAS accepts it just fine, but the 2nd one takes precedence so it's pointless and potentially misleading.)
And you'll want a "=m" output operand if you want the compiler to reserve you some scratch space on the stack. You must not modify input-only operands, even if it's unused in the C. Maybe the compiler decides it can overlap something else because it's not written and not initialized (i.e. UB). (I'm not sure if your "memory" clobber makes it safe, but there's no reason not to use an early-clobber output operand here.)
And you'll want to avoid label name conflicts by using %= to get a unique number.
Working example (GCC and ICC, but not clang unfortunately), on the Godbolt compiler explorer (which uses -masm=intel depending on options in the dropdown). You can use "binary mode" (the 11010 button) to prove that it actually assembles after compiling to asm without warnings.
int libtest_intel()
{
uint32_t rnds_00_15;
// Intel syntax operand-size can only be overridden with operand modifiers
// because the expansion includes an explicit DWORD PTR
__asm__ __volatile__
( // ".intel_syntax noprefix \n\t"
"mov %[rnds_00_15], 1 \n\t"
"cmp %[rnds_00_15], 1 \n\t"
"je .Ldone%= \n\t"
".Ldone%=: \n\t"
: [rnds_00_15] "=&m" (rnds_00_15)
:
: // no clobbers
);
return 0;
}
Compiles (with gcc -O3 -masm=intel) to this asm. Also works with gcc -m32 -masm=intel of course:
libtest_intel:
mov DWORD PTR [rsp-4], 1
cmp DWORD PTR [rsp-4], 1
je .Ldone8
.Ldone8:
xor eax, eax
ret
I couldn't get this to work with clang: It choked on .intel_syntax noprefix when I left that in explicitly.
Operand-size overrides:
You have to use %b[tmp] to get the compiler to substitute in BYTE PTR [rsp-4] to only access the low byte of a dword input operand. I'd recommend AT&T syntax if you want to do much of this.
Using %[rnds_00_15] results in Error: junk '(%ebp)' after expression.
That's because you switched to Intel syntax without telling the compiler. If you want it to use Intel addressing modes, compile with -masm=intel so the compiler can substitute into the template with the correct syntax.
This is why I avoid that crappy GCC inline assembly at nearly all costs. Man I despise this crappy tool.
You're just using it wrong. It's a bit cumbersome, but makes sense and mostly works well if you understand how it's designed.
Repeat after me: The compiler doesn't parse the asm string at all, except to do text substitutions of %operand. This is why it doesn't notice your .intel_syntax noprefix and keeps substituting AT&T syntax.
It does work better and more easily with AT&T syntax though, e.g. for overriding the operand-size of a memory operand, or adding an offset. (e.g. 4 + %[mem] works in AT&T syntax).
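To illustrate why operand-size overrides are simpler in AT&T syntax, here is a small sketch, assuming GCC or Clang on x86 in the default AT&T dialect. The helper name set_low_byte is hypothetical. The mnemonic suffix alone selects the access width, so no operand modifier is needed on the memory operand.

```cpp
#include <stdint.h>

// Hypothetical helper: overwrites only the low byte of a 32-bit value held in memory.
uint32_t set_low_byte(uint32_t value)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ (
        "movb $0xff, %[v]"          // the b suffix makes this a 1-byte store
        : [v] "+m" (value)          // read-write memory operand
    );
#else
    value = (value & ~0xffu) | 0xffu;   // fallback for non-x86 targets (assumption)
#endif
    return value;
}
```

Because x86 is little-endian, the byte at the operand's address is the least significant one, so set_low_byte(0x11223344) yields 0x112233ff.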
Dialect alternatives:
If you want to write inline asm that doesn't depend on -masm=intel or not, use Dialect alternatives (which makes your code super-ugly; not recommended for anything other than wrapping one or two instructions):
Also demonstrates operand-size overrides
#include <stdint.h>
int libtest_override_operand_size()
{
uint32_t rnds_00_15;
// Intel syntax operand-size can only be overridden with operand modifiers
// because the expansion includes an explicit DWORD PTR
__asm__ __volatile__
(
"{movl $1, %[rnds_00_15] | mov %[rnds_00_15], 1} \n\t"
"{cmpl $1, %[rnds_00_15] | cmp %k[rnds_00_15], 1} \n\t"
"{cmpw $1, %[rnds_00_15] | cmp %w[rnds_00_15], 1} \n\t"
"{cmpb $1, %[rnds_00_15] | cmp %b[rnds_00_15], 1} \n\t"
"je .Ldone%= \n\t"
".Ldone%=: \n\t"
: [rnds_00_15] "=&m" (rnds_00_15)
);
return 0;
}
With Intel syntax, gcc compiles it to:
mov DWORD PTR [rsp-4], 1
cmp DWORD PTR [rsp-4], 1
cmp WORD PTR [rsp-4], 1
cmp BYTE PTR [rsp-4], 1
je .Ldone38
.Ldone38:
xor eax, eax
ret
With AT&T syntax, compiles to:
movl $1, -4(%rsp)
cmpl $1, -4(%rsp)
cmpw $1, -4(%rsp)
cmpb $1, -4(%rsp)
je .Ldone38
.Ldone38:
xorl %eax, %eax
ret

Related

Work around lack of Yz machine constraint under Clang?

We use inline assembly to make SHA instructions available if __SHA__ is not defined. Under GCC we use:
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "Yz" (c));
return a;
}
Clang does not consume GCC's Yz constraint (see Clang 3.2 Issue 13199 and Clang 3.9 Issue 32727), which is required by the sha256rnds2 instruction:
Yz
First SSE register (%xmm0).
We added a mov for Clang:
asm ("mov %2, %%xmm0; sha256rnds2 %%xmm0, %1, %0" : "+x"(a) : "xm"(b), "x" (c) : "xmm0");
Performance is off by about 3 cycles per byte. On my 2.2 GHz Celeron J3455 test machine (Goldmont with SHA extensions), that's about 230 MiB/s. It's non-trivial.
Looking at the disassembly, Clang is not optimizing around SHA's k when two rounds are performed:
Breakpoint 2, SHA256_SSE_SHA_HashBlocks (state=0xaaa3a0,
data=0xaaa340, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cdd0 <+0>: sub $0x308,%rsp
0x000000000068cdd7 <+7>: movdqu (%rdi),%xmm0
0x000000000068cddb <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068ce49 <+121>: movq %xmm2,%xmm0
0x000000000068ce4d <+125>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm1
0x000000000068ce56 <+134>: pshufd $0xe,%xmm2,%xmm3
0x000000000068ce5b <+139>: movdqa %xmm13,%xmm2
0x000000000068ce60 <+144>: movaps %xmm1,0x2e0(%rsp)
0x000000000068ce68 <+152>: movq %xmm3,%xmm0
0x000000000068ce6c <+156>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm2
0x000000000068ce75 <+165>: movdqu 0x10(%rsi),%xmm3
0x000000000068ce7a <+170>: pshufb %xmm8,%xmm3
0x000000000068ce80 <+176>: movaps %xmm2,0x2d0(%rsp)
0x000000000068ce88 <+184>: movdqa %xmm3,%xmm4
0x000000000068ce8c <+188>: paddd 0x6729c(%rip),%xmm4 # 0x6f4130
0x000000000068ce94 <+196>: movq %xmm4,%xmm0
0x000000000068ce98 <+200>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
...
For example, 0068ce8c through 0068ce98 should have been:
paddd 0x6729c(%rip),%xmm0 # 0x6f4130
sha256rnds2 %xmm0,0x2d0(%rsp),%xmm1
I'm guessing our choice of inline asm instructions is a bit off.
How do we work around the lack of Yz machine constraint under Clang? What pattern avoids the intermediate move in optimized code?
Attempting to use Explicit Register Variable:
const __m128i k asm("xmm0") = c;
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
return a;
Results in:
In file included from sha.cpp:24:
./cpu.h:831:22: warning: ignored asm label 'xmm0' on automatic variable
const __m128i k asm("xmm0") = c;
^
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm1, 752(%rsp), %xmm0
^~~~~~~~~~
In file included from sha.cpp:24:
./cpu.h:833:7: error: invalid operand for instruction
asm ("sha256rnds2 %2, %1, %0" : "+x"(a) : "xm"(b), "x" (k));
^
<inline asm>:1:21: note: instantiated into assembly here
sha256rnds2 %xmm3, 736(%rsp), %xmm1
^~~~~~~~~~
...
I created this answer based on the tag inline assembly with no specific language mentioned. Extended assembly templates already assume use of extensions to the languages.
If the Yz constraint isn't available you can attempt to create a temporary variable to tell CLANG what register to use rather than a constraint. You can do this through what is called an Explicit Register Variable:
You can define a local register variable and associate it with a specified register like this:
register int *foo asm ("r12");
Here r12 is the name of the register that should be used. Note that this is the same syntax used for defining global register variables, but for a local variable the declaration appears within a function. The register keyword is required, and cannot be combined with static. The register name must be a valid register name for the target platform.
In your case you wish to force usage of xmm0 register. You could assign the c parameter to a temporary variable using an explicit register and use that temporary as a parameter to the Extended Inline Assembly. This is the primary purpose of explicit registers in GCC/CLANG.
GCC_INLINE __m128i GCC_INLINE_ATTRIB
MM_SHA256RNDS2_EPU32(__m128i a, const __m128i b, const __m128i c)
{
register const __m128i tmpc asm("xmm0") = c;
__asm__("sha256rnds2 %2, %1, %0" : "+x"(a) : "x"(b), "x" (tmpc));
return a;
}
The compiler should be able to provide some optimizations now since it has more knowledge as to how the xmm0 register is to be used.
When you placed mov %2, %%xmm0; into the template, CLANG (and GCC) could not optimize those instructions. Basic Assembly and Extended Assembly templates are a black box to the compiler: it only knows how to do basic substitution based on the constraints.
Here's a disassembly using the method above. It was compiled with clang++ and -std=c++03. The extra moves are no longer present:
Breakpoint 1, SHA256_SSE_SHA_HashBlocks (state=0x7fffffffae60,
data=0x7fffffffae00, length=0x40) at sha.cpp:1101
1101 STATE1 = _mm_loadu_si128((__m128i*) &state[4]);
(gdb) disass
Dump of assembler code for function SHA256_SSE_SHA_HashBlocks(unsigned int*, unsigned int const*, unsigned long):
0x000000000068cf60 <+0>: sub $0x308,%rsp
0x000000000068cf67 <+7>: movdqu (%rdi),%xmm0
0x000000000068cf6b <+11>: movdqu 0x10(%rdi),%xmm1
...
0x000000000068cfe6 <+134>: paddd 0x670e2(%rip),%xmm0 # 0x6f40d0
0x000000000068cfee <+142>: sha256rnds2 %xmm0,0x2f0(%rsp),%xmm2
0x000000000068cff7 <+151>: pshufd $0xe,%xmm0,%xmm1
0x000000000068cffc <+156>: movdqa %xmm1,%xmm0
0x000000000068d000 <+160>: movaps %xmm2,0x2e0(%rsp)
0x000000000068d008 <+168>: sha256rnds2 %xmm0,0x2e0(%rsp),%xmm3
0x000000000068d011 <+177>: movdqu 0x10(%rsi),%xmm5
0x000000000068d016 <+182>: pshufb %xmm9,%xmm5
0x000000000068d01c <+188>: movaps %xmm3,0x2d0(%rsp)
0x000000000068d024 <+196>: movdqa %xmm5,%xmm0
0x000000000068d028 <+200>: paddd 0x670b0(%rip),%xmm0 # 0x6f40e0
0x000000000068d030 <+208>: sha256rnds2 %xmm0,0x2d0(%rsp),%xmm2
...

LLVM-GCC ASM to LLVM in XCode

I have the following 2 definitions that compile (and work) just fine using the XCode LLVM-GCC compiler:
#define SAVE_STACK(v)__asm { mov v, ESP }
#define RESTORE_STACK __asm {sub ESP, s }
However, when I change the compiler to Apple LLVM I get the following error:
Expected '(' after 'asm'
I replaced the {} with () but that doesn't do the trick. I googled the error but couldn't find anything useful... anyone?
The __asm {...} style of inline assembly is non-standard and not supported by clang. Instead C++ specifies inline assembly syntax as asm("..."), note the quotes. Also clang uses AT&T assembly syntax so the macros would need to be rewritten to be safe.
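As a rough sketch of what a rewritten SAVE_STACK could look like in the quoted asm("...") form, assuming GCC or Clang with the default AT&T dialect: the function name read_stack_pointer is invented here, and the last branch is only a fallback so the snippet compiles on non-x86 targets. A direct equivalent of RESTORE_STACK is deliberately not shown, since the GCC documentation requires the stack pointer to have the same value at the end of an asm statement as on entry.

```cpp
#include <stdint.h>

// Hypothetical replacement for SAVE_STACK(v): returns the current stack pointer.
static inline uintptr_t read_stack_pointer()
{
    uintptr_t sp;
#if defined(__x86_64__)
    __asm__ __volatile__ ("movq %%rsp, %0" : "=r" (sp));   // AT&T order: source, destination
#elif defined(__i386__)
    __asm__ __volatile__ ("movl %%esp, %0" : "=r" (sp));
#else
    sp = (uintptr_t) __builtin_frame_address(0);           // approximation on other targets
#endif
    return sp;
}
```

The "=r" output lets the compiler choose the destination register and store it into the C variable itself, which is what the mov v, ESP form relied on the MS-style parser to do.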
However, some work has been going on to improve support for Microsoft's non-standard assembly syntax, and Intel style assembly along with it. There's an option -fenable-experimental-ms-inline-asm that enables what's been done so far, although I'm not sure when it was introduced or how good the support is in the version of clang you're using. A simple attempt with the code you show seems to work with a recent version of clang from the SVN trunk.
#define SAVE_STACK(v)__asm { mov v, ESP }
#define RESTORE_STACK __asm {sub ESP, s }
int main() {
int i;
int s;
SAVE_STACK(i);
RESTORE_STACK;
}
clang++ tmp.cpp -fms-extensions -fenable-experimental-ms-inline-asm -S -o -
.def main;
.scl 2;
.type 32;
.endef
.text
.globl main
.align 16, 0x90
main: # #main
# BB#0: # %entry
pushq %rax
#APP
.intel_syntax
mov dword ptr [rsp + 4], ESP
.att_syntax
#NO_APP
#APP
.intel_syntax
sub ESP, dword ptr [rsp]
.att_syntax
#NO_APP
xorl %eax, %eax
popq %rdx
ret
And the command clang++ tmp.cpp -fms-extensions -fenable-experimental-ms-inline-asm produces an executable that runs.
It does still produce warnings like the following though.
warning: MS-style inline assembly is not supported [-Wmicrosoft]
I had a problem: in the XCode development environment the following code compiled correctly, but after switching to my makefile I received the error message Expected '(' after 'asm'.
#define DebugBreak() { __asm { int 3 }; }
int main(int argc, const char *argv[])
{
DebugBreak();
}
Note that the definition for DebugBreak() came from my code that compiled under Visual Studio.
The way I fixed this in my makefile was to add the argument -fasm-blocks:
CFLAGS += -std=c++11 -stdlib=libc++ -O2 -fasm-blocks
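If -fasm-blocks is not available, one alternative is to write the breakpoint in the quoted GNU form. This is a sketch only, assuming GCC or Clang; the names debug_break and debug_break_if are hypothetical stand-ins for the DebugBreak() macro above, and __builtin_trap() is used as a fallback on non-x86 targets.

```cpp
// Hypothetical stand-in for the MSVC-style DebugBreak() macro.
static inline void debug_break()
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__ ("int3");   // x86 breakpoint instruction
#else
    __builtin_trap();                // fallback: abnormal termination on other targets
#endif
}

// Convenience wrapper: only traps when the condition holds.
static inline void debug_break_if(bool condition)
{
    if (condition)
        debug_break();
}
```

Calling debug_break_if(false) returns normally; debug_break() raises SIGTRAP on x86 Linux or macOS unless a debugger is attached.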

Why doesn't gcc support naked functions?

I use naked functions to patch parts of a program while it's running. I can easily do this in VC++ in Windows. I'm trying to do this in Linux and it seems gcc doesn't support naked functions. Compiling code with naked functions gives me this: warning: ‘naked’ attribute directive ignored. Compiled under CentOS 5.5 i386.
The naked attribute is only supported by GCC on certain platforms (ARM, AVR, MCORE, RX and SPU) according to the docs:
naked:
Use this attribute on the ARM, AVR, MCORE, RX and SPU ports to
indicate that the specified function does not need prologue/epilogue
sequences generated by the compiler. It is up to the programmer to
provide these sequences. The only statements that can be safely
included in naked functions are asm statements that do not have
operands. All other statements, including declarations of local
variables, if statements, and so forth, should be avoided. Naked
functions should be used to implement the body of an assembly
function, while allowing the compiler to construct the requisite
function declaration for the assembler.
I'm not sure why.
On x86 you can work around it by using asm at global scope instead:
int write(int fd, const void *buf, int count);
asm
(
".global write \n\t"
"write: \n\t"
" pusha \n\t"
" movl $4, %eax \n\t"
" movl 36(%esp), %ebx \n\t"
" movl 40(%esp), %ecx \n\t"
" movl 44(%esp), %edx \n\t"
" int $0x80 \n\t"
" popa \n\t"
" ret \n\t"
);
void _start()
{
#define w(x) write(1, x, sizeof(x));
w("hello\n");
w("bye\n");
}
Also naked is listed among x86 function attributes, so I suppose it works for newer gcc.
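The same global-scope technique can be sketched for x86-64 Linux. This assumes GCC or Clang with the default AT&T dialect and an ELF target (no leading underscore on symbol names); the function name asm_answer is made up for illustration, and the #else branch is a plain C++ fallback so the snippet builds elsewhere.

```cpp
#if defined(__x86_64__) && defined(__linux__)
// Declared for the compiler; the body lives in the top-level asm below.
extern "C" int asm_answer();

__asm__(
    ".pushsection .text \n\t"
    ".globl asm_answer \n\t"
    ".type asm_answer, @function \n"
    "asm_answer: \n\t"
    "movl $42, %eax \n\t"            // Basic asm: % is literal, no %% needed
    "ret \n\t"
    ".size asm_answer, .-asm_answer \n\t"
    ".popsection \n"
);
#else
// Fallback for other targets (assumption, not part of the original answer).
extern "C" int asm_answer() { return 42; }
#endif
```

Since the compiler emits no prologue or epilogue for code written this way, the asm body is responsible for following the calling convention itself, here by leaving the return value in EAX.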
That's an ugly solution. Link against a .asm file for your target architecture.
GCC only supports naked functions on ARM and other embedded platforms. http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html
Also, what you're doing is inherently unsafe, as you cannot guarantee that the code you're patching isn't executing if the program is running.

Assembler code in C++ code

How can I put Intel asm code into my C++ application?
I'm using Dev-C++.
I want to do something like this:
int temp = 0;
int usernb = 3;
pusha
mov eax, temp
inc eax
xor usernb, usernb
mov eax, usernb
popa
This is only an example.
How can I do something like that?
UPDATE:
How does it look in Visual Studio ?
You can find a complete howto here http://www.ibiblio.org/gferg/ldp/GCC-Inline-Assembly-HOWTO.html
#include <stdlib.h>
int main()
{
int temp = 0;
int usernb = 3;
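__asm__ volatile (
"pusha \n"
"mov eax, %[temp] \n"
"inc eax \n"
"mov ecx, %[usernb] \n"
"xor ecx, %[usernb] \n"
"mov %[usernb], ecx \n" // writes usernb, so it must not be an input-only operand
"mov eax, %[usernb] \n"
"popa \n"
: [usernb] "+m" (usernb) // read and written by the asm
: [temp] "m" (temp) ); // input only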
exit(0);
}
After that you need to compile with something like:
gcc -m32 -std=c99 -Wall -Wextra -masm=intel -o casm casmt.c && ./casm && echo $?
output:
0
You need to compile with the -masm=intel flag since you want intel assembly syntax :)
UPDATE: How does it look in Visual Studio ?
If you are building for 64 bit, you cannot use inline assembly in Visual Studio. If you are building for 32 bit, then you use __asm to do the embedding.
Generally, using inline ASM is a bad idea.
You're probably going to produce worse ASM than a compiler.
Using any ASM in a method generally defeats any optimizations which try to touch that method (i.e. inlining).
If you need to access specific features of the processor that are not obvious in C++ (e.g. SIMD instructions), you can use the intrinsics provided by almost every compiler vendor, which are much more consistent with the language. Intrinsics give you all the speed of that "special" instruction but in a way which is compatible with the language semantics and with optimizers.
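As a small illustration of the intrinsics approach, here is a sketch that adds four 32-bit integers with SSE2. The helper name add4 is hypothetical; _mm_loadu_si128, _mm_add_epi32 and _mm_storeu_si128 are the standard SSE2 intrinsics from <emmintrin.h>. A scalar loop is used when SSE2 is not enabled.

```cpp
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   // SSE2 intrinsics
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for i in 0..3.
void add4(const int32_t* a, const int32_t* b, int32_t* out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));   // unaligned load
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    __m128i sum = _mm_add_epi32(va, vb);                                 // four 32-bit adds at once
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);              // unaligned store
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Unlike an asm block, the compiler is free to inline this, keep the values in registers across calls, and schedule the instructions around neighbouring code.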
Here's a simple example to show the syntax for GCC/Dev-C++:
int main(void)
{
int x = 10, y;
asm ("movl %1, %%eax;"
"movl %%eax, %0;"
:"=r"(y) /* y is output operand */
:"r"(x) /* x is input operand */
:"%eax"); /* %eax is clobbered register */
}
It depends on your compiler. From your tags I guess you use gcc/g++, in which case you can use gcc inline assembler. The syntax is quite weird and a bit different from Intel syntax, although it achieves the same.
EDIT: With Visual Studio (or the Visual C++ compiler) it gets much easier, as it uses the usual Intel syntax.
If it's for some exercises I'd recommend a real assembler and avoiding inlined code, as it can get rather messy/confusing.
Some basics using GCC can be found here.
If you're open to trying MSVC (not sure if GCC is a requirement), I'd suggest you have a look at MSVC's interpretation which is (in my opinion) a lot easier to read/understand, especially for learning assembler. An example can be found here.

Inline assembly troubles

I tried to compile with GCC inline assembly code which compiled fine with MSVC, but got the following errors for basic operations:
// var is a template variable in a C++ function
__asm__
{
mov edx, var //error: Register name not specified for %edx
push ebx //error: Register name not specified for %ebx
sub esp, 8 //error: Register name not specified for %esp
}
After looking through documentation covering the topic, I found out that I should probably convert (even if I am only interested in x86) Intel style assembly code to AT&T style. However, after trying to use AT&T style I got even more weird errors:
mov var, %edx //error: Expected primary-expression before % token
mov $var, edx //error: label 'LASM$$s' used but not defined
I should also note that I tried to use LLVM-GCC, but it failed miserably with internal errors after encountering inline assembly.
What should I do?
For Apple's gcc you want -fasm-blocks which allows you to omit gcc's quoting requirement for inline asm and also lets you use Intel syntax.
// test_asm.c
int main(void)
{
int var;
__asm__
{
mov edx,var
push ebx
sub esp,8
}
return 0;
}
Compile this with:
$ gcc -Wall -m32 -fasm-blocks test_asm.c -o test_asm
Tested with gcc 4.2.1 on OS X 10.6.
g++ inline assembler is much more flexible than MSVC, and much more complicated. It treats an asm directive as a pseudo-instruction, which has to be described in the language of the code generator. Here is a working sample from my own code (for MinGW, not Mac):
// int BNASM_Add (DWORD* result, DWORD* a, int len)
//
// result += a
int BNASM_Add (DWORD* result, DWORD* a, int len)
{
int carry ;
asm volatile (
".intel_syntax\n"
" clc\n"
" cld\n"
"loop03:\n"
" lodsd\n"
" adc [edx],eax\n"
" lea edx,[edx+4]\n" // add edx,4 without disturbing the carry flag
" loop loop03\n"
" adc ecx,0\n" // Return the carry flag (ecx known to be zero)
".att_syntax\n"
: "=c"(carry) // Output: carry in ecx
: "d"(result), "S"(a), "c"(len) // Input: result in edx, a in esi, len in ecx
) ;
return carry ;
}
You can find documentation at http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm.
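For comparison, here is a much smaller sketch of the "describe the instruction to the code generator" idea, where the compiler picks the registers instead of the asm hard-coding them. It assumes GCC or Clang on x86 with the default AT&T dialect; the type AddResult and the function add_with_carry_out are hypothetical names, and the #else branch is a portable fallback.

```cpp
#include <stdint.h>

struct AddResult { uint32_t sum; bool carry; };   // hypothetical helper type

// Adds two 32-bit values and reports the carry out of bit 31.
AddResult add_with_carry_out(uint32_t a, uint32_t b)
{
    AddResult r;
#if defined(__i386__) || defined(__x86_64__)
    unsigned char c;
    __asm__ (
        "addl %[b], %[a] \n\t"      // a += b, sets CF on unsigned overflow
        "setc %[c] \n\t"            // c = CF
        : [a] "+r" (a),             // read-write, any general register
          [c] "=q" (c)              // write-only, a byte-addressable register
        : [b] "rm" (b)              // register or memory, compiler's choice
        : "cc"                      // flags are modified
    );
    r.sum = a;
    r.carry = (c != 0);
#else
    uint64_t wide = (uint64_t) a + b;   // fallback for non-x86 targets (assumption)
    r.sum = (uint32_t) wide;
    r.carry = (wide >> 32) != 0;
#endif
    return r;
}
```

Every register the asm touches is named through an operand or the clobber list, so the compiler knows exactly what is read, written and destroyed.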