If you want to call a C/C++ function from inline assembly, you can do something like this:
void callee() {}
void caller()
{
asm("call *%0" : : "r"(callee));
}
GCC will then emit code which looks like this:
movl $callee, %eax
call *%eax
This can be problematic since the indirect call will destroy the pipeline on older CPUs.
Since the address of callee is eventually a constant, one can imagine that it would be possible to use the i constraint. Quoting from the GCC online docs:
`i'
An immediate integer operand (one with constant value) is allowed. This
includes symbolic constants whose
values will be known only at assembly
time or later.
If I try to use it like this:
asm("call %0" : : "i"(callee));
I get the following error from the assembler:
Error: suffix or operands invalid for `call'
This is because GCC emits the code
call $callee
Instead of
call callee
So my question is whether it is possible to make GCC output the correct call.
I got the answer from GCC's mailing list:
asm("call %P0" : : "i"(callee)); // FIXME: missing clobbers
Now I just need to find out what %P0 actually means because it seems to be an undocumented feature...
Edit: After looking at the GCC source code, it's not exactly clear what the P operand modifier does. But, among other things, it prevents GCC from putting a $ in front of constant values, which is exactly what I need in this case.
For this to be safe, you need to tell the compiler about all registers that the function call might modify, e.g. : "eax", "ecx", "edx", "xmm0", "xmm1", ..., "st(0)", "st(1)", ....
See Calling printf in extended inline ASM for a full x86-64 example of correctly and safely making a function call from inline asm.
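Putting those pieces together, here is a minimal sketch of the complete statement (assuming a 32-bit, non-PIE build; the clobber list covers only the integer call-clobbered registers and ignores x87/SSE state):
void callee(void);
void caller(void)
{
    asm volatile("call %P0"
                 : /* no outputs */
                 : "i" (callee)
                 : "eax", "ecx", "edx", "memory", "cc");
}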
Maybe I am missing something here, but
extern "C" void callee(void)
{
}
void caller(void)
{
asm("call callee\n");
}
should work fine. You need extern "C" so that the name won't be decorated according to C++ name-mangling rules.
If you're generating 32-bit code (e.g. with the -m32 gcc option), the following inline asm emits a direct call:
asm ("call %0" :: "m" (callee));
The trick is string literal concatenation. Before GCC tries to extract any real meaning from your code, it concatenates adjacent string literals, so even though assembly strings aren't quite the same as the other strings in your program, they are concatenated the same way if you do:
#define ASM_CALL(X) asm("\t call " X "\n")
int main(void) {
ASM_CALL( "my_function" );
return 0;
}
Since you are using GCC you could also do
#define ASM_CALL(X) asm("\t call " #X "\n")
int main(void) {
ASM_CALL(my_function);
return 0;
}
If you don't already know, you should be aware that calling things from inline assembly is very tricky. When the compiler generates its own calls to other functions, it includes code to set things up before the call and restore them afterwards. It doesn't know that it should be doing any of this for your call, though. You will have to either include that yourself (very tricky to get right, and it may break with a compiler upgrade or a change of compilation flags) or ensure that your function is written in such a way that it does not appear to have changed any registers or the state of the stack (or the variables on it).
Edit: this will only work for C function names -- not C++, as those are mangled.
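Given the caveat above about registers and the stack, a hedged sketch of the macro with the usual 32-bit x86 integer call-clobbered registers declared (still incomplete: it ignores x87/SSE state); usage stays the same as above:
#define ASM_CALL(X)                            \
    asm volatile("\t call " X "\n"             \
                 : : : "eax", "ecx", "edx", "memory", "cc")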
TL;DR; I am looking for a standard way to basically tell the compiler to pass whatever happened to be in a given register to the next function.
Basically I have a function int bar(int a, int b, int c). In some cases c is unused, and I would like to be able to call bar in those cases without modifying rdx in any way.
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
I would like the assembly to just be:
For a tailcall
jmp bar
or for a normal call
call bar
Note: clang generally produces what I am looking for. But I am unsure if this will always be the case in more complex functions and I am hoping to not have to check the assembly each time I build.
GCC produces:
For a tailcall
xorl %edx, %edx
jmp bar
or for a normal call
xorl %edx, %edx
call bar
I can get the results I want using inline assembly, i.e. changing foo (for tail calls) to
int foo(int a, int b) {
asm volatile("jmp bar" : : :);
__builtin_unreachable();
}
which compiles to just
jmp bar
I understand that the performance impact of an xorl %edx, %edx is about as close to zero as possible, but
I am wondering if there is a standard way to achieve this.
I.e. I can probably find a hack for it for any given case, but that will require me verifying the assembly each time. I am looking for a method where you can basically tell the compiler "pass whatever happened to be in the register".
See for examples: https://godbolt.org/z/eh1vK8
Edit: This is happening with -O3 set.
I am wondering if there is a standard way to achieve this.
I.e. I can probably find a hack for it for any given case, but that
will require me verifying the assembly each time. I am looking for a
method where you can basically tell the compiler "pass whatever
happened to be in the register".
No, there is no standard way to achieve it in either C or C++. Neither of these languages speak to any lower-level function call semantics, nor even acknowledge the existence of CPU registers,* and both languages require every function call to provide arguments corresponding to all non-optional parameters (which is simply "all declared parameters" in C).
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
... then you reap undefined behavior as a result of using the value of no_init while it is indeterminate. Whatever any particular C or C++ implementation that accepts that at all does with it is non-standard by definition.
If you want to call bar(), but you don't care what value is passed as the third argument, then why not just choose a convenient value to pass? Zero, for example:
return bar(a, b, 0);
*Even the register keyword does not do this as far as either language standard is concerned.
Note that if the called function does read its 3rd arg, leaving it unwritten risks creating a false dependency on whatever last used EDX. For example it might be the result of a cache-miss load, or a long chain of calculations.
GCC is careful to xor-zero to break false dependencies in a lot of cases, e.g. before cvtsi2ss (bad ISA design) or popcnt (Sandybridge-family quirk).
Usually the xor edx,edx is basically a wasted 2-byte NOP, but it does prevent possible coupling of otherwise-independent dependency chains (critical paths).
If you're sure you want to defeat the compiler's attempt to protect you from that, then Nate's asm("" :"=r"(var)); is a good way to do an integer version of _mm_undefined_ps() that actually leaves a register uninitialized. (Note that _mm_undefined_ps doesn't guarantee leaving an XMM reg unwritten; some compilers will xor-zero one for you instead of fully implementing the false-dependency recklessness that intrinsic was designed to allow for Intel's compiler.)
One approach that should work for gcc/clang on most platforms is to do
int no_init;
asm("" : "=r" (no_init));
return bar(a, b, no_init);
This way you don't have to lie to the compiler about the prototype of bar (which could break some calling conventions), and you fool the compiler into thinking no_init is really initialized.
I would wonder about an architecture like Itanium with its "trap bit" that causes a fault when an uninitialized register is accessed. This code would probably not be safe there.
There is no portable way to get this behavior that I know of, but you could ifdef it:
#ifdef __GNUC__
#define UNUSED_INT ({ int x; asm("" : "=r" (x)); x; })
#else
#define UNUSED_INT 0
#endif
// ...
bar(a, b, UNUSED_INT);
Then you can fall back to the (infinitesimally) less efficient but correct code when necessary.
It results in a bare jmp on gcc/x86-64, see https://godbolt.org/z/d3ordK. On x86-32 it is not quite optimal as it pushes an uninitialized register, instead of just adjusting an existing subtraction from esp. Note that a bare jmp/call is not safe on x86-32 because that third stack slot may contain something important, and the callee is allowed to overwrite it (even if the variable is unused on the path you have in mind, the compiler could be using it as scratch space).
One portable alternative would be to rewrite bar to be variadic. However, then it would need to use va_arg to retrieve the third argument when it is present, and that tends to be less efficient.
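A rough sketch of that variadic alternative (the rule used to decide whether a third argument was passed is made up purely for illustration):
#include <stdarg.h>
int bar(int a, int b, ...)
{
    int c = 0;
    if (b != 0) {               /* pretend "b != 0" means a third argument exists */
        va_list ap;
        va_start(ap, b);
        c = va_arg(ap, int);    /* fetch the optional third argument */
        va_end(ap);
    }
    return a + b + c;
}
int foo(int a, int b) { return bar(a, b); }   /* no third argument is materialized */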
Cast the function to have the smaller signature (i.e. fewer parameters):
extern int bar(int, int, int);
int foo(int a, int b) {
return ((int (*)(int,int))bar)(a, b);
}
Maybe make a macro for 2 parameter bar, and even get rid of foo:
extern int bar3(int, int, int);
#define bar2(a,b) ((int (*)(int,int))bar3)(a,b)
int userOfBar(int a, int b) { return bar2 (a,b); }
https://godbolt.org/z/Gn4a69
Oddly, given the above gcc doesn't touch %edx, but clang does... oh, well.
(Still can't guarantee the compiler won't touch some registers, though, that's its domain. Otherwise, you can write these functions directly in assembly and avoid the middleperson.)
Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful to write micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
void clobber() {
asm volatile("" : : : "memory");
}
void escape(void* p) {
asm volatile("" : : "g"(p) : "memory");
}
These use inline assembly statements to change the assumptions of the optimizer.
The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won't look into it because it's asm volatile. It believes it when we tell it the code might read and write everywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes prior to the call to clobber, and forces memory reads after the call to clobber†.
The one in escape, additionally, makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code, that code can be empty, and the optimizer will still assume that the block uses the address pointed to by p. This effectively forces whatever p points to to be in memory and not in a register, because the assembly block might perform a read from that address.
(This is important because the clobber function won't force reads nor writes for anything that the compilers decides to put in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
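A typical benchmarking use, following the push_back example from the talk (a sketch; the iteration count is arbitrary and there is no timing harness here):
#include <vector>
void benchmark_push_back() {
    for (int i = 0; i < 1000000; ++i) {
        std::vector<int> v;
        v.reserve(1);
        escape(v.data());   // the allocation's address is now "used" by the asm block
        v.push_back(42);
        clobber();          // the store of 42 must be visible in memory at this point
    }
}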
These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?
† To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.
Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):
// always returns false, but in an undeducible way
bool isClobberingEnabled();
// The challenge is to implement this function in a way,
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// stack or the static memory.
volatile char* nextLocationToClobber();
const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;
inline void clobber() {
if ( clobberingIsEnabled ) {
// This will never be executed, but the compiler
// cannot know about it.
clobberingPtr = nextLocationToClobber();
*clobberingPtr = *clobberingPtr;
}
}
UPDATE
Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?
Answer: We can take advantage of a hard-to-prove property from the number theory, for example, Fermat's Last Theorem:
bool undeducible_false() {
// It took mathematicians more than 3 centuries to prove Fermat's
// last theorem in its most general form. Hardly that knowledge
// has been put into compilers (or the compiler will try hard
// enough to check all one million possible combinations below).
// Caveat: avoid integer overflow (Fermat's theorem
// doesn't hold for modulo arithmetic)
std::uint32_t a = std::clock() % 100 + 1;
std::uint32_t b = std::rand() % 100 + 1;
std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;
return a*a*a + b*b*b == c*c*c;
}
I have used the following in place of escape.
#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
*reinterpret_cast<char volatile*>(p) =
*reinterpret_cast<char const volatile*>(p); // thanks, #milleniumbug
}
#pragma optimize("", on)
#endif
It's not perfect but it's close enough, I think.
Sadly, I don't have a way to emulate clobber.
Given this code:
#include <stdio.h>
int main(int argc, char **argv)
{
int x = 1;
printf("Hello x = %d\n", x);
}
I'd like to access and manipulate the variable x in inline assembly. Ideally, I want to change its value using inline assembly. I'm using the GNU assembler with AT&T syntax.
In GNU C inline asm, with x86 AT&T syntax:
(But https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it).
// this example doesn't really need volatile: the result is the same every time
asm volatile("movl $0, %[some]"
: [some] "=r" (x)
);
after this, x contains 0.
Note that you should generally avoid mov as the first or last instruction of an asm statement. Don't copy from %[some] to a hard-coded register like %%eax, just use %[some] as a register, letting the compiler do register allocation.
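For example, a read-modify-write of x that lets the compiler pick the register (a sketch; the constant 5 is arbitrary):
int x = 1;
asm("addl $5, %[val]"       // operate directly on the compiler-chosen register
    : [val] "+r" (x)        // "+r": x is both read and written
    :
    : "cc");                // addl modifies the flags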
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html and https://stackoverflow.com/tags/inline-assembly/info for more docs and guides.
Not all compilers support GNU syntax.
For example, for MSVC you do this:
__asm mov x, 0
and x will have the value of 0 after this statement.
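The block form, for completeness (a sketch; MSVC's inline assembler is x86-only, and the x64 compiler does not support __asm blocks):
int x = 1;
__asm {
    mov x, 0    // MSVC resolves the C variable name to its address
}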
Please specify the compiler you would want to use.
Also note, doing this will restrict your program to compile with only a specific compiler-assembler combination, and will be targeted only towards a particular architecture.
In most cases, you'll get as good or better results from using pure C and intrinsics, not inline asm.
asm("mov $0, %1":"=r" (x):"r" (x):"cc"); -- this may get you on the right track. Specify register use as much as possible for performance and efficiency. However, as Aniket points out, highly architecture dependent and requires gcc.
I'm writing an RPC library for AVR and need to pass a function address to some inline assembler code and call the function from within the assembler code. However the assembler complains when I try to call the function directly.
This minimal example test.cpp illustrates the issue (in the actual case I'm passing args and the function is an instantiation of a static member of templated class):
void bar () {
return;
}
void foo() {
asm volatile (
"call %0" "\n"
:
: "p" (bar)
);
}
Compiling with avr-gcc -S test.cpp -o test.S -mmcu=atmega328p works fine but when I try to assemble with avr-gcc -c test.S -o test.o -mmcu=atmega328p avr-as complains:
test.c: Assembler messages:
test.c:38: Error: garbage at end of line
I have no idea why it writes "test.c", the file it is referring to is test.S, which contains this on line 38:
call gs(_Z3barv)
I have tried all even remotely sensible constraints on the parameter to the inline assembler that I could find here, but none of those I tried worked.
I imagine if the gs() part was removed, everything should work, but all constraints seem to add it. I have no idea what it does.
The odd thing is that doing an indirect call like this assembles just fine:
void bar () {
return;
}
void foo() {
asm volatile (
"ldi r30, lo8(%0)" "\n"
"ldi r31, hi8(%0)" "\n"
"icall" "\n"
:
: "p" (bar)
);
}
The assembler produced looks like this:
ldi r30, lo8(gs(_Z3barv))
ldi r31, hi8(gs(_Z3barv))
icall
And avr-as doesn't complain about any garbage.
There are several issues with the code:
Issue 1: Wrong Constraint
The correct constraint for a call target is "i", thus known at link-time.
Issue 2: Wrong % print-modifier
In order to print an address suitable for a call, use %x which will print a plain symbol without gs(). Generating a linker stub at this place by means of gs() is not valid syntax, hence "garbage at end of line". Apart from that, as you are calling bar directly, there is no need for linker stub (at least not for this kind of symbol usage).
Issue 3: call instruction might not be available
To factor out whether a device supports call or just rcall, there is %~ which prints a single r if just rcall is available, and nothing if call is available.
Issue 4: The Call might clobber Registers or have other Side-Effects
It's unlikely that the call has no effects on registers or on memory whatsoever. If your description of the inline asm does not match the side effects of the code, it's likely that you will get wrong code sooner or later.
Taking it all together
Let's assume you have a function bar written in assembly that takes two 16-bit operands in R22 and R26, and computes a result in R22. This function does not obey the avr-gcc C/C++ calling convention, so inline assembly is one way to interface with such a function. For bar we cannot write a correct prototype anyway, so we just provide a prototype so that we can use the symbol bar. Register X has constraint "x", but R22 has no register constraint of its own, and therefore we have to use a local register variable:
extern "C" void bar (...);
int call_bar (int x, int y)
{
register int r22 __asm ("r22") = x;
__asm ("%~call %x2"
: "+r" (r22)
: "x" (y), "i" (bar));
return r22;
}
Generated code for ATmega32 + optimization:
_Z8call_barii:
movw r26,r22
movw r22,r24
call bar
movw r24,r22
ret
So what's that "generate stub" gs() thing?
Suppose the C/C++ code is taking the address of a function. The only sensible thing to do with it is to call that function, which will be an indirect call in general. Now an indirect call can target 64KiW = 128KiB at most, so that on devices with > 128KiB of code memory, special means must be taken to indirectly call a function beyond the 128KiB boundary. The AVR hardware features an SFR named EIND for that purpose, but problems using it are obvious. You'd have to set it prior to a call and then reset it somehow somewhere; all evil things would be necessary.
avr-gcc takes a different approach: for each such address taken, the compiler generates gs(func). This will just resolve to func if the address is in the 128KiB range. If not, gs() resolves to an address in section .trampolines, which is located close to the beginning of flash, i.e. in the lower 128KiB. .trampolines contains a list of direct JMPs to targets beyond the 128KiB range.
Take for example the following C code:
extern int far_func (void);
int main (void)
{
int (*pfunc)(void) = far_func;
__asm ("" : "+r" (pfunc)); /* Forget content of pfunc. */
return pfunc();
}
The __asm is used to keep the compiler from optimizing the indirect call to a direct one. Then run
> avr-gcc main.c -o main.elf -mmcu=atmega2560 -save-temps -Os -Wl,--defsym,far_func=0x24680
> avr-objdump -d main.elf > main.lst
For the matter of brevity, we just define symbol far_func per command line.
The assembly dump in main.s shows that far_func might require a linker stub:
main:
ldi r30,lo8(gs(far_func))
ldi r31,hi8(gs(far_func))
eijmp
The final executable listing in main.lst then shows that the stub is actually generated and used:
main.elf: file format elf32-avr
Disassembly of section .text:
...
000000e4 <__trampolines_start>:
e4: 0d 94 40 23 jmp 0x24680 ; 0x24680 <far_func>
...
00000104 <main>:
104: e2 e7 ldi r30, 0x72 ; 114
106: f0 e0 ldi r31, 0x00 ; 0
108: 19 94 eijmp
main loads Z=0x0072 which is a word address for byte address 0x00e4, i.e. the code is indirectly jumping to 0x00e4, and from there it jumps directly to 0x24680.
Note that call requires a constant, known-at-link-time value. The "p" constraint does not include that semantics; it would also allow a pointer from a variable (e.g. char* x), which call cannot handle. (I seem to remember that sometimes gcc is clever enough to optimize in such a way that "p" does work here - but that's basically undocumented behavior and non-deterministic, so better not count on it.)
If the function you're calling actually is compile-time constant you can use "i" (bar). If it's not, then you have no other choice than using icall as you already figured out.
Btw, the AVR section of https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints documents some more, AVR-specific constraints.
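For instance, one of those machine-specific constraints is "z", which pins an operand into R31:R30. A hedged sketch of the icall route using it (assuming a device with no more than 128KiB of flash, so no EIND handling; the clobber list must be adjusted to whatever bar actually touches):
void bar (void);
void foo (void)
{
    void (*fp)(void) = bar;
    __asm volatile ("icall"
                    : "+z" (fp)   /* Z holds the target; "+" because the callee may clobber R30/R31 */
                    :
                    : "memory");
}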
I've tried various ways of passing a C function name to inline ASM code without success. However, I did find a workaround which seems to provide the desired result.
Answer to the question:
As explained on https://www.nongnu.org/avr-libc/user-manual/inline_asm.html you can assign a ASM name to a C function in a prototype declaration:
void bar (void) asm ("ASM_BAR"); // any name possible here
void bar (void)
{
return;
}
Then you can call the function easily from your ASM code:
asm volatile("call ASM_BAR");
Use with library functions:
This approach does not work with library functions, because they have their own prototype declarations. To call a function like system_tick() of the time.h library more efficiently from an ISR, you can declare a helper function. Unfortunately GCC does not apply the inline setting to calls from ASM code.
inline void asm_system_tick(void) asm ("ASM_SYSTEM_TICK") __attribute__((always_inline));
void asm_system_tick(void)
{
system_tick();
}
In the following example GCC only generates push/pop instructions for the surrounding code, not for the function call! Note that system_tick() is specifically designed for ISR_NAKED and does all required stack operations on its own.
volatile uint8_t tick = 0;
ISR(TIMER2_OVF_vect)
{
tick++;
if (tick > 127)
{
tick = 0;
asm volatile ("call ASM_SYSTEM_TICK");
}
}
Because the inline attribute does not work, each function call takes 8 additional CPU cycles. Compared to the 5632 CPU cycles required for the push/pop operations with a normal function call (44 CPU cycles for each run of the ISR), it is still a very impressive improvement.
I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline ASM. The issue I'm running into is calling the library, since the library does not have a consistent calling convention. While I found a solution, I've noticed that in some cases the compiler will make optimizations that clobber registers immediately before the call, so I think there is just something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop {pc}
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
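On the C side, a sketch of the matching declarations and a call through the stub (MASKROM_EchoByte is the 16-bit ROM address from the question; the cast from an integer to a function pointer is implementation-defined but routine on this kind of part):
typedef void MASKROM_Routine(int r1, ...);
void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
void EchoByte(int hex)
{
    GeneralPurposeStub((MASKROM_Routine *)MASKROM_EchoByte, hex);
}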
But if you really, really, really must do this via inline assembly in a C function, then (as #JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
:
: "r"(r1)
: "r1", "lr" // Compare the codegen with and without this "lr"!
);
}