I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline ASM. The issue I'm running into, is calling the library since the library does not have a consistent calling convention. While I found a solution, I've found that in some cases the compiler will make optimizations that clobber registers immediately before the call, I think there is just something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop pc
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
But if you really, really, really must do this via inline assembly in a C function, then (as #JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
:
: "r"(r1)
: "r1", "lr" // Compare the codegen with and without this "lr"!
);
}
Related
TL;DR; I am looking for a standard way to basically tell the compiler to pass whatever happened to be in a given register to the next function.
Basically I have a function int bar(int a, int b, int c). In some cases c is unused and I would like to be able to call bar in the cases where c is unused without modifying rdx in any way.
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
I would like the assembly to just be:
For a tailcall
jmp bar
or for a normal call
call bar
Note: clang generally produces what I am looking for. But I am unsure if this will always be the case in more complex functions and I am hoping to not have to check the assembly each time I build.
GCC produces:
For a tailcall
xorl %edx, %edx
jmp bar
or for a normal call
xorl %edx, %edx
call bar
I can get the results I want using inline assembly i.e changing foo (for tail calls) to
int foo(int a, int b) {
asm volatile("jmp bar" : : :);
__builtin_unreachable();
}
which compiles to just
jmp bar
I understand that the performance implications of an xorl %edx, %edx is about as close to 0 as possible but
I am wondering if there is a standard way to achieve this.
I.e I can probably find a hack for it for any given case. But that will require me verifying the assembly each time. I am looking for a method that you can basically tell the compiler "pass whatever happened to be in register".
See for examples: https://godbolt.org/z/eh1vK8
Edit: This is happening with -O3 set.
I am wondering if there is a standard way to achieve this.
I.e I can probably find a hack for it for any given case. But that
will require me verifying the assembly each time. I am looking for a
method that you can basically tell the compiler "pass whatever
happened to be in register".
No, there is no standard way to achieve it in either C or C++. Neither of these languages speak to any lower-level function call semantics, nor even acknowledge the existence of CPU registers,* and both languages require every function call to provide arguments corresponding to all non-optional parameters (which is simply "all declared parameters" in C).
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
... then you reap undefined behavior as a result of using the value of no_init while it is indeterminate. Whatever any particular C or C++ implementation that accepts that at all does with it is non-standard by definition.
If you want to call bar(), but you don't care what value is passed as the third argument, then why not just choose a convenient value to pass? Zero, for example:
return bar(a, b, 0);
*Even the register keyword does not do this as far as either language standard is concerned.
Note that if the called function does read its 3rd arg, leaving it unwritten risks creating a false dependency on whatever last used EDX. For example it might be the result of a cache-miss load, or a long chain of calculations.
GCC is careful to xor-zero to break false dependencies in a lot of cases, e.g. before cvtsi2ss (bad ISA design) or popcnt (Sandybridge-family quirk).
Usually the xor edx,edx is basically a wasted 2-byte NOP, but it does prevent possible coupling of otherwise-independent dependency chains (critical paths).
If you're sure you want to defeat the compiler's attempt to protect you from that, then Nate's asm("" :"=r"(var)); is a good way to do an integer version of _mm_undefined_ps() that actually leaves a register uninitialized. (Note that _mm_undefined_ps doesn't guarantee leaving an XMM reg unwritten; some compilers will xor-zero one for you instead of fully implementing the false-dependency recklessness that intrinsic was designed to allow for Intel's compiler.)
One approach that should work for gcc/clang on most platforms is to do
int no_init;
asm("" : "=r" (no_init));
return bar(a, b, no_init);
This way you don't have to lie to the compiler about the prototype of bar (whichc could break some calling conventions), and you fool the compiler into thinking no_init is really initialized.
I would wonder about an architecture like Itanium with its "trap bit" that causes a fault when an uninitialized register is accessed. This code would probably not be safe there.
There is no portable way to get this behavior that I know of, but you could ifdef it:
#ifdef __GNUC__
#define UNUSED_INT ({ int x; asm("" : "=r" (x)); x; })
#else
#define UNUSED_INT 0
#endif
// ...
bar(a, b, UNUSED_INT);
Then you can fall back to the (infinitesimally) less efficient but correct code when necessary.
It results in a bare jmp on gcc/x86-64, see https://godbolt.org/z/d3ordK. On x86-32 it is not quite optimal as it pushes an uninitialized register, instead of just adjusting an existing subtraction from esp. Note that a bare jmp/call is not safe on x86-32 because that third stack slot may contain something important, and the callee is allowed to overwrite it (even if the variable is unused on the path you have in mind, the compiler could be using it as scratch space).
One portable alternative would be to rewrite bar to be variadic. However, then it would need to use va_arg to retrieve the third argument when it is present, and that tends to be less efficient.
Cast the function to have the smaller signature (i.e. fewer parameters):
extern int bar(int, int, int);
int foo(int a, int int b) {
return ((int (*)(int,int))bar)(a, b);
}
Maybe make a macro for 2 parameter bar, and even get rid of foo:
extern int bar3(int, int, int);
#define bar2(a,b) ((int (*)(int,int))bar3)(a,b)
int userOfBar(int a, int b) { return bar2 (a,b); }
https://godbolt.org/z/Gn4a69
Oddly, given the above gcc doesn't touch %edx, but clang does... oh, well.
(Still can't guarantee the compiler won't touch some registers, though, that's its domain. Otherwise, you can write these functions directly in assembly and avoid the middleperson.)
I have a strange situation that seems to be working well for me, but I need to know how to either make this better or how to live this.
I am using C++ as a compiled scripting language for a game engine. The RISC-V system call ABI is the same as the C function calling convention, with the exception that instead of an 8th integer or pointer argument, A7 is used for the system call number. Yes, you know where this is going. Behold:
extern "C" long syscall_enter(...);
template <typename... Args>
inline long syscall(long syscall_n, Args&&... args)
{
asm volatile ("li a7, %0" : : "i"(syscall_n));
return syscall_enter(std::forward<Args>(args)...);
}
While syscall_enter is just a symbol in .text with the syscall instruction and a ret. The system call return value is also the same register as a normal function return.
000103f0 <syscall_enter>:
syscall_enter():
103f0: 00000073 ecall
103f4: 00008067 ret
Before this, I had to create 20+ functions to cover all the various ways to make system calls with integers and pointers with compiler barrier, and when I wanted to add a function that took floating-point values it would say the call was ambigous as integers and floats can be converted back and forth. So, I could either start to add unique names to the functions, or just solve this mess a better way. It was honestly irritating and putting a damper on an otherwise excellent experience. I really love being able to use C++ on "both sides".
The instructions generated by the compiler seems alright. It JAL and JALR syscall_enter, which is fine. The compiler seems a little bit confused, but I don't mind one extra instruction.
10204: 1f500793 li a5,501
10208: 00078893 mv a7,a5
1020c: 00000513 li a0,0
10210: 1e0000ef jal ra,103f0 <syscall_enter>
As well as center camera on position:
100d4: 19600793 li a5,406
100d8: 00078893 mv a7,a5
100dc: 000127b7 lui a5,0x12
100e0: 4207b587 fld fa1,1056(a5) # 12420 <_exit+0x2308>
100e4: 22b58553 fmv.d fa0,fa1
100e8: 010000ef jal ra,100f8 <syscall_enter>
Again one extra move instruction. Looks alright. The API is heavily in use already, and there is also a threading API which works with this.
Now, is there an even better way? I couldn't think of a better way to load a7 with a number and then force the compiler to set a function call up, without making an actual function call. I was thinking about using a template parameter for the system call number, but I'm not so sure about the rest. Maybe we can constrain the number of arguments to 7? It won't be correct when there are integer and floating-point arguments, but that's fine. Stack-stored structs are easy to pass.
After some testing, I have decided to use this:
extern "C" long syscall_enter(...);
template <typename... Args>
inline long syscall(long syscall_n, Args&&... args)
{
// This will prevent some cases of too many arguments,
// but not a mix of float and integral arguments.
static_assert(sizeof...(args) < 8, "There is a system call limit of 8 integer arguments");
// The memory clobbering prevents reordering of a7
asm volatile ("li a7, %0" : : "i"(syscall_n) : "a7", "memory");
return syscall_enter(std::forward<Args>(args)...);
asm volatile("" : : : "memory");
}
Should suffice. No need to for syscall function spam. The check to count arguments is not optimal, since it should only prevent the usage of the 8th integral register (which means counting integral, pointer and reference parameters). But it will prevent some cases.
There's two problems with this.
The first is that you aren't telling the compiler you are using a7, so it might try to put something else there, resulting in incorrect code. You need to add a7 to the clobbers list of the asm:
asm volatile ("mv a7, %0" : : "r"(syscall_n) : "a7");
The second is that the asm statement is not connected to the call, so the compiler may reorder things, and, in particular, move other code in between the asm mv instruction and the call. If that happens and the code in question modifies a7, you'll end up calling the wrong syscall.
This is the function I'm using now. Many thanks to #PeterCordes for all the help.
extern "C" long syscall_enter(...);
template <typename... Args>
inline long apicall(long syscall_n, Args&&... args)
{
// This will prevent some cases of too many arguments,
// but not a mix of float and integral arguments.
static_assert(sizeof...(args) < 8, "There is a system call limit of 8 integer arguments");
// The memory clobbering prevents reordering of a7
asm volatile ("li a7, %0" : : "i"(syscall_n) : "a7", "memory");
return syscall_enter(std::forward<Args>(args)...);
asm volatile("" : : : "memory");
}
It works well for me. Again, the primary reason to avoid the syscall-function-spam solution, is because if you have 2 functions where one takes an integral argument and another that takes a floating-point argument, then the function call will be ambigous, and now you need to start thinking about which function to call. I have tested this solution with a mix of float and integral arguments, and it's working as it should. One drawback is that it puts floating-point arguments into 64-bit registers, so it will be a tiny amount slower during the system call.
Again, there was a C++ solution!
Given this code:
#include <stdio.h>
int main(int argc, char **argv)
{
int x = 1;
printf("Hello x = %d\n", x);
}
I'd like to access and manipulate the variable x in inline assembly. Ideally, I want to change its value using inline assembly. GNU assembler, and using the AT&T syntax.
In GNU C inline asm, with x86 AT&T syntax:
(But https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it).
// this example doesn't really need volatile: the result is the same every time
asm volatile("movl $0, %[some]"
: [some] "=r" (x)
);
after this, x contains 0.
Note that you should generally avoid mov as the first or last instruction of an asm statement. Don't copy from %[some] to a hard-coded register like %%eax, just use %[some] as a register, letting the compiler do register allocation.
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html and https://stackoverflow.com/tags/inline-assembly/info for more docs and guides.
Not all compilers support GNU syntax.
For example, for MSVC you do this:
__asm mov x, 0 and x will have the value of 0 after this statement.
Please specify the compiler you would want to use.
Also note, doing this will restrict your program to compile with only a specific compiler-assembler combination, and will be targeted only towards a particular architecture.
In most cases, you'll get as good or better results from using pure C and intrinsics, not inline asm.
asm("mov $0, %1":"=r" (x):"r" (x):"cc"); -- this may get you on the right track. Specify register use as much as possible for performance and efficiency. However, as Aniket points out, highly architecture dependent and requires gcc.
I'm writing an RPC library for AVR and need to pass a function address to some inline assembler code and call the function from within the assembler code. However the assembler complains when I try to call the function directly.
This minimal example test.cpp illustrates the issue (in the actual case I'm passing args and the function is an instantiation of a static member of templated class):
void bar () {
return;
}
void foo() {
asm volatile (
"call %0" "\n"
:
: "p" (bar)
);
}
Compiling with avr-gcc -S test.cpp -o test.S -mmcu=atmega328p works fine but when I try to assemble with avr-gcc -c test.S -o test.o -mmcu=atmega328p avr-as complains:
test.c: Assembler messages:
test.c:38: Error: garbage at end of line
I have no idea why it writes "test.c", the file it is referring to is test.S, which contains this on line 38:
call gs(_Z3barv)
I have tried all even remotely sensible constraints on the paramter to the inline assembler that I could find here but none of those I tried worked.
I imagine if the gs() part was removed, everything should work, but all constraints seem to add it. I have no idea what it does.
The odd thing is that doing an indirect call like this assembles just fine:
void bar () {
return;
}
void foo() {
asm volatile (
"ldi r30, lo8(%0)" "\n"
"ldi r31, hi8(%0)" "\n"
"icall" "\n"
:
: "p" (bar)
);
}
The assembler produced looks like this:
ldi r30, lo8(gs(_Z3barv))
ldi r31, hi8(gs(_Z3barv))
icall
And avr-as doesn't complain about any garbage.
There are several issues with the code:
Issue 1: Wrong Constraint
The correct constraint for a call target is "i", thus known at link-time.
Issue 2: Wrong % print-modifier
In order to print an address suitable for a call, use %x which will print a plain symbol without gs(). Generating a linker stub at this place by means of gs() is not valid syntax, hence "garbage at end of line". Apart from that, as you are calling bar directly, there is no need for linker stub (at least not for this kind of symbol usage).
Issue 3: call instruction might not be available
To factor out whether a device supports call or just rcall, there is %~ which prints a single r if just rcall is available, and nothing if call is available.
Issue 4: The Call might clobber Registers or have other Side-Effects
It's unlikely that the call has no effects on registers or on memory whatsoever. If you description of the inline asm does not match some side-effects of the code, it's likely that you will get wrong code sooner or later.
Taking it all together
Let's assume you have a function bar written in assembly that takes two 16-bit operands in R22 and R26, and computes a result in R22. This function does not obey the avr-gcc C/C++ calling convention, so inline assembly is one way to interface to such a function. For bar we cannot write a correct prototype anyways, so we just provide a prototype so that we can use symbol bar. Register X has constraint "x", but R22 has no own register constraint, and therefore we have to use a local asm register:
extern "C" void bar (...);
int call_bar (int x, int y)
{
register int r22 __asm ("r22") = x;
__asm ("%~call %x2"
: "+r" (r22)
: "x" (y), "i" (bar));
return r22;
}
Generated code for ATmega32 + optimization:
_Z8call_barii:
movw r26,r22
movw r22,r24
call bar
movw r24,r22
ret
So what's that "generate stub" gs() thing?
Suppose the C/C++ code is taking the address of a function. The only sensible thing to do with it is to call that function, which will be an indirect call in general. Now an indirect call can target 64KiW = 128KiB at most, so that on devices with > 128KiB of code memory, special means must be taken to indirectly call a function beyond the 128KiB boundary. The AVR hardware features an SFR named EIND for that purpose, but problems using it are obvious. You'd have to set it prior to a call and then reset it somehow somewhere; all evil things would be necessary.
avr-gcc takes a different approach: For each such address taken, the compiler generates gs(func). This will just resolve to func if the address is in the 128KiB range. If not, gs() resolves to an address in section .trampolines which is located close to the beginning of flash, i.e. in the lower 128KiB. .trampolines containts a list of direct JMPs to targets beyond the 128KiB range.
Take for example the following C code:
extern int far_func (void);
int main (void)
{
int (*pfunc)(void) = far_func;
__asm ("" : "+r" (pfunc)); /* Forget content of pfunc. */
return pfunc();
}
The __asm is used to keep the compiler from optimizing the indirect call to a direct one. Then run
> avr-gcc main.c -o main.elf -mmcu=atmega2560 -save-temps -Os -Wl,--defsym,far_func=0x24680
> avr-objdump -d main.elf > main.lst
For the matter of brevity, we just define symbol far_func per command line.
The assembly dump in main.s shows that far_func might require a linker stub:
main:
ldi r30,lo8(gs(far_func))
ldi r31,hi8(gs(far_func))
eijmp
The final executable listing in main.lst then shows that the stub is actually generated and used:
main.elf: file format elf32-avr
Disassembly of section .text:
...
000000e4 <__trampolines_start>:
e4: 0d 94 40 23 jmp 0x24680 ; 0x24680 <far_func>
...
00000104 <main>:
104: e2 e7 ldi r30, 0x72 ; 114
106: f0 e0 ldi r31, 0x00 ; 0
108: 19 94 eijmp
main loads Z=0x0072 which is a word address for byte address 0x00e4, i.e. the code is indirectly jumping to 0x00e4, and from there it jumps directly to 0x24680.
Note that call requires a constant, known-at-link-time value. The "p" constraint does not include that semantics; it would also allow a pointer from a variable (e.g. char* x), which call cannot handle. (I seem to remember that sometimes gcc is clever enough to optimize in such a way that "p" does work here - but that's basically undocumented behavior and non-deterministic, so better not count on it.)
If the function you're calling actually is compile-time constant you can use "i" (bar). If it's not, then you have no other choice than using icall as you already figured out.
Btw, the AVR section of https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints documents some more, AVR-specific constraints.
I've tries various ways of passing a C function name to inline ASM code without success. However I did find a workaround, which seems to provide the desired result.
Answer to the question:
As explained on https://www.nongnu.org/avr-libc/user-manual/inline_asm.html you can assign a ASM name to a C function in a prototype declaration:
void bar (void) asm ("ASM_BAR"); // any name possible here
void bar (void)
{
return;
}
Then you can call the function easily from your ASM code:
asm volatile("call ASM_BAR");
Use with library functions:
This approach does not work with library functions, because they have their own prototype declarations. To call a function like system_tick() of the time.h library more efficiently from an ISR, you can declare a helper function. Unfortunately GCC does not apply the inline setting to calls from ASM code.
inline void asm_system_tick(void) asm ("ASM_SYSTEM_TICK") __attribute__((always_inline));
void asm_system_tick(void)
{
system_tick();
}
In the following example GCC does only generate push/ pop instructions for the surrounding code, not for the function call! Note that system_tick() is specifically designed for ISR_NAKED and does all required stack operations on its own.
volatile uint8_t tick = 0;
ISR(TIMER2_OVF_vect)
{
tick++;
if (tick > 127)
{
tick = 0;
asm volatile ("call ASM_SYSTEM_TICK");
}
}
Because the inline attribute does not work, each function call takes 8 additional cpu cycles. Compared to 5632 CPU cycles required for push/ pull operations with a normal function call (44 CPU cycles for each run of the ISR) it is still a very impressive improvement.
If you want to call a C/C++ function from inline assembly, you can do something like this:
void callee() {}
void caller()
{
asm("call *%0" : : "r"(callee));
}
GCC will then emit code which looks like this:
movl $callee, %eax
call *%eax
This can be problematic since the indirect call will destroy the pipeline on older CPUs.
Since the address of callee is eventually a constant, one can imagine that it would be possible to use the i constraint. Quoting from the GCC online docs:
`i'
An immediate integer operand (one with constant value) is allowed. This
includes symbolic constants whose
values will be known only at assembly
time or later.
If I try to use it like this:
asm("call %0" : : "i"(callee));
I get the following error from the assembler:
Error: suffix or operands invalid for `call'
This is because GCC emits the code
call $callee
Instead of
call callee
So my question is whether it is possible to make GCC output the correct call.
I got the answer from GCC's mailing list:
asm("call %P0" : : "i"(callee)); // FIXME: missing clobbers
Now I just need to find out what %P0 actually means because it seems to be an undocumented feature...
Edit: After looking at the GCC source code, it's not exactly clear what the code P in front of a constraint means. But, among other things, it prevents GCC from putting a $ in front of constant values. Which is exactly what I need in this case.
For this to be safe, you need to tell the compiler about all registers that the function call might modify, e.g. : "eax", "ecx", "edx", "xmm0", "xmm1", ..., "st(0)", "st(1)", ....
See Calling printf in extended inline ASM for a full x86-64 example of correctly and safely making a function call from inline asm.
Maybe I am missing something here, but
extern "C" void callee(void)
{
}
void caller(void)
{
asm("call callee\n");
}
should work fine. You need extern "C" so that the name won't be decorated based on C++ naming mangling rules.
If you're generating 32-bit code (e.g. -m32 gcc option), the following asm inline emits a direct call:
asm ("call %0" :: "m" (callee));
The trick is string literal concatenation. Before GCC starts trying to get any real meaning from your code it will concatenate adjacent string literals, so even though assembly strings aren't the same as other strings you use in your program they should be concatenated if you do:
#define ASM_CALL(X) asm("\t call " X "\n")
int main(void) {
ASM_CALL( "my_function" );
return 0;
}
Since you are using GCC you could also do
#define ASM_CALL(X) asm("\t call " #X "\n")
int main(void) {
ASM_CALL(my_function);
return 0;
}
If you don't already know you should be aware that calling things from inline assembly is very tricky. When the compiler generates its own calls to other functions it includes code to set up and restore things before and after the call. It doesn't know that it should be doing any of this for your call, though. You will have to either include that yourself (very tricky to get right and may break with a compiler upgrade or compilation flags) or ensure that your function is written in such a way that it does not appear to have changed any registers or condition of the stack (or variable on it).
edit this will only work for C function names -- not C++ as they are mangled.