I have a strange situation that seems to be working well for me, but I need to know how to either make this better or how to live with it.
I am using C++ as a compiled scripting language for a game engine. The RISC-V system call ABI is the same as the C function calling convention, except that instead of an 8th integer or pointer argument, A7 carries the system call number. Yes, you know where this is going. Behold:
extern "C" long syscall_enter(...);

template <typename... Args>
inline long syscall(long syscall_n, Args&&... args)
{
    asm volatile ("li a7, %0" : : "i"(syscall_n));
    return syscall_enter(std::forward<Args>(args)...);
}
syscall_enter is just a symbol in .text containing the ecall instruction and a ret. The system call's return value arrives in the same register as a normal function's return value.
000103f0 <syscall_enter>:
syscall_enter():
103f0: 00000073 ecall
103f4: 00008067 ret
Before this, I had to create 20+ functions to cover all the various ways to make system calls with integers and pointers with a compiler barrier, and when I wanted to add a function that took floating-point values, the call was reported as ambiguous, since integers and floats can be converted back and forth. So I could either start giving the functions unique names, or solve this mess a better way. It was honestly irritating and putting a damper on an otherwise excellent experience. I really love being able to use C++ on "both sides".
The instructions generated by the compiler seem alright. It reaches syscall_enter with JAL or JALR, which is fine. The compiler seems a little bit confused, but I don't mind one extra instruction.
10204: 1f500793 li a5,501
10208: 00078893 mv a7,a5
1020c: 00000513 li a0,0
10210: 1e0000ef jal ra,103f0 <syscall_enter>
And here is the call that centers the camera on a position:
100d4: 19600793 li a5,406
100d8: 00078893 mv a7,a5
100dc: 000127b7 lui a5,0x12
100e0: 4207b587 fld fa1,1056(a5) # 12420 <_exit+0x2308>
100e4: 22b58553 fmv.d fa0,fa1
100e8: 010000ef jal ra,100f8 <syscall_enter>
Again, one extra move instruction. Looks alright. The API is heavily in use already, and there is also a threading API that works with this.
Now, is there an even better way? I couldn't think of a better way to load A7 with a number and then force the compiler to set up a function call, without making an actual function call. I was thinking about using a template parameter for the system call number, something like the sketch below, but I'm not so sure about the rest. Maybe we can constrain the number of arguments to 7? It won't be correct when there is a mix of integer and floating-point arguments, but that's fine. Stack-stored structs are easy to pass.
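Something like this is what I had in mind for the template-parameter variant (just a sketch; with the system call number as a non-type template parameter, the "i" constraint always sees a compile-time constant):

template <long SYSNO, typename... Args>
inline long syscall(Args&&... args)
{
    asm volatile("li a7, %0" : : "i"(SYSNO) : "a7", "memory");
    return syscall_enter(std::forward<Args>(args)...);
}

// Hypothetical usage, with the number baked into the instantiation:
// long ret = syscall<501>(0);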
After some testing, I have decided to use this:
extern "C" long syscall_enter(...);

template <typename... Args>
inline long syscall(long syscall_n, Args&&... args)
{
    // This will prevent some cases of too many arguments,
    // but not a mix of float and integral arguments.
    static_assert(sizeof...(args) < 8, "There is a system call limit of 7 integer arguments");
    // The a7 clobber reserves the register, and the memory clobber
    // prevents reordering the asm statement across memory accesses.
    asm volatile ("li a7, %0" : : "i"(syscall_n) : "a7", "memory");
    return syscall_enter(std::forward<Args>(args)...);
}
Should suffice. No need for syscall-function spam. The argument-count check is not optimal, since it should really only prevent use of the 8th integer register, which means counting only the integral, pointer and reference parameters. But it will prevent some cases.
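If it ever matters, a more precise check could count only the parameter types that consume integer registers. A sketch (assuming C++17; the helper name is made up):

#include <type_traits>

// Counts toward a0-a6: integral and pointer types, plus references
// (which are passed as addresses).
template <typename T>
constexpr bool consumes_int_reg =
    std::is_integral_v<std::remove_reference_t<T>> ||
    std::is_pointer_v<std::remove_reference_t<T>> ||
    std::is_reference_v<T>;

// Inside syscall():
// static_assert((0 + ... + (consumes_int_reg<Args> ? 1 : 0)) <= 7,
//               "only a0-a6 are available for integer arguments");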
There are two problems with the original version (the asm statement without clobbers).
The first is that you aren't telling the compiler you are using a7, so it might try to put something else there, resulting in incorrect code. You need to add a7 to the asm statement's clobber list:
asm volatile ("mv a7, %0" : : "r"(syscall_n) : "a7");
The second is that the asm statement is not connected to the call, so the compiler may reorder things, and, in particular, move other code in between the asm mv instruction and the call. If that happens and the code in question modifies a7, you'll end up calling the wrong syscall.
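An alternative that sidesteps both problems is to perform the ecall itself inside a single asm statement, binding the arguments to specific registers with local register variables; then nothing can be scheduled between loading a7 and the ecall. A sketch for the one-integer-argument case:

inline long syscall1(long n, long arg0)
{
    register long a0 asm("a0") = arg0;  // argument in, return value out
    register long a7 asm("a7") = n;     // system call number
    asm volatile("ecall" : "+r"(a0) : "r"(a7) : "memory");
    return a0;
}

The downside is that you are back to one wrapper per argument count, which is exactly what the template version was trying to avoid.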
This is the function I'm using now. Many thanks to @PeterCordes for all the help.
extern "C" long syscall_enter(...);

template <typename... Args>
inline long apicall(long syscall_n, Args&&... args)
{
    // This will prevent some cases of too many arguments,
    // but not a mix of float and integral arguments.
    static_assert(sizeof...(args) < 8, "There is a system call limit of 7 integer arguments");
    // The a7 clobber reserves the register, and the memory clobber
    // prevents reordering the asm statement across memory accesses.
    asm volatile ("li a7, %0" : : "i"(syscall_n) : "a7", "memory");
    return syscall_enter(std::forward<Args>(args)...);
}
It works well for me. Again, the primary reason to avoid the syscall-function-spam solution is that if you have two overloads, one taking an integral argument and another taking a floating-point argument, the call becomes ambiguous, and now you have to start thinking about which function gets called. I have tested this solution with a mix of float and integral arguments, and it works as it should. One drawback is that, the call being variadic, float arguments are promoted to double (64-bit registers), so it will be a tiny amount slower during the system call.
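For illustration, calls now look like this (the constant names are made up; the numbers are the ones from the disassembly above):

static constexpr long SYS_SOMETHING     = 501; // hypothetical name
static constexpr long SYS_CAMERA_CENTER = 406; // "center camera on position"

void example(double x)
{
    apicall(SYS_SOMETHING, 0);       // integer argument goes in a0
    apicall(SYS_CAMERA_CENTER, x);   // floating-point argument, promoted to double
}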
Again, there was a C++ solution!
TL;DR: I am looking for a standard way to basically tell the compiler to pass whatever happened to be in a given register to the next function.
Basically I have a function int bar(int a, int b, int c). In some cases c is unused and I would like to be able to call bar in the cases where c is unused without modifying rdx in any way.
For example, if I have

int foo(int a, int b) {
    int no_init;
    return bar(a, b, no_init);
}
I would like the assembly to just be:
For a tailcall
jmp bar
or for a normal call
call bar
Note: clang generally produces what I am looking for. But I am unsure if this will always be the case in more complex functions and I am hoping to not have to check the assembly each time I build.
GCC produces:
For a tailcall
xorl %edx, %edx
jmp bar
or for a normal call
xorl %edx, %edx
call bar
I can get the results I want using inline assembly, i.e. changing foo (for tail calls) to
int foo(int a, int b) {
    asm volatile("jmp bar" : : :);
    __builtin_unreachable();
}
which compiles to just
jmp bar
I understand that the performance implications of an xorl %edx, %edx are about as close to zero as possible, but I am wondering if there is a standard way to achieve this. I.e., I can probably find a hack for it for any given case, but that will require me verifying the assembly each time. I am looking for a method by which you can basically tell the compiler "pass whatever happened to be in the register".
See for examples: https://godbolt.org/z/eh1vK8
Edit: This is happening with -O3 set.
I am wondering if there is a standard way to achieve this. I.e., I can probably find a hack for it for any given case, but that will require me verifying the assembly each time. I am looking for a method by which you can basically tell the compiler "pass whatever happened to be in the register".
No, there is no standard way to achieve this in either C or C++. Neither language speaks to any lower-level function-call semantics, or even acknowledges the existence of CPU registers,* and both require every function call to provide arguments corresponding to all non-optional parameters (which is simply "all declared parameters" in C).
For example, if I have

int foo(int a, int b) {
    int no_init;
    return bar(a, b, no_init);
}
... then you reap undefined behavior as a result of using the value of no_init while it is indeterminate. Whatever any particular C or C++ implementation that accepts that at all does with it is non-standard by definition.
If you want to call bar(), but you don't care what value is passed as the third argument, then why not just choose a convenient value to pass? Zero, for example:
return bar(a, b, 0);
*Even the register keyword does not do this as far as either language standard is concerned.
Note that if the called function does read its 3rd arg, leaving it unwritten risks creating a false dependency on whatever last used EDX. For example it might be the result of a cache-miss load, or a long chain of calculations.
GCC is careful to xor-zero to break false dependencies in a lot of cases, e.g. before cvtsi2ss (bad ISA design) or popcnt (Sandybridge-family quirk).
Usually the xor edx,edx is basically a wasted 2-byte NOP, but it does prevent possible coupling of otherwise-independent dependency chains (critical paths).
If you're sure you want to defeat the compiler's attempt to protect you from that, then Nate's asm("" :"=r"(var)); is a good way to do an integer version of _mm_undefined_ps() that actually leaves a register uninitialized. (Note that _mm_undefined_ps doesn't guarantee leaving an XMM reg unwritten; some compilers will xor-zero one for you instead of fully implementing the false-dependency recklessness that intrinsic was designed to allow for Intel's compiler.)
One approach that should work for gcc/clang on most platforms is to do
int no_init;
asm("" : "=r" (no_init));
return bar(a, b, no_init);
This way you don't have to lie to the compiler about the prototype of bar (which could break some calling conventions), and you fool the compiler into thinking no_init is really initialized.
I would wonder about an architecture like Itanium with its "trap bit" that causes a fault when an uninitialized register is accessed. This code would probably not be safe there.
There is no portable way to get this behavior that I know of, but you could ifdef it:
#ifdef __GNUC__
#define UNUSED_INT ({ int x; asm("" : "=r" (x)); x; })
#else
#define UNUSED_INT 0
#endif
// ...
bar(a, b, UNUSED_INT);
Then you can fall back to the (infinitesimally) less efficient but correct code when necessary.
It results in a bare jmp on gcc/x86-64, see https://godbolt.org/z/d3ordK. On x86-32 it is not quite optimal as it pushes an uninitialized register, instead of just adjusting an existing subtraction from esp. Note that a bare jmp/call is not safe on x86-32 because that third stack slot may contain something important, and the callee is allowed to overwrite it (even if the variable is unused on the path you have in mind, the compiler could be using it as scratch space).
One portable alternative would be to rewrite bar to be variadic. However, then it would need to use va_arg to retrieve the third argument when it is present, and that tends to be less efficient.
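For completeness, the variadic rewrite might look like the sketch below. The awkward part is that the callee needs some out-of-band signal for whether the third argument is present at all; the "b >= 0 means c was passed" convention here is entirely made up for illustration.

#include <cstdarg>

int bar(int a, int b, ...)
{
    int c = 0;
    if (b >= 0) {              // hypothetical "c was passed" signal
        va_list ap;
        va_start(ap, b);
        c = va_arg(ap, int);   // this retrieval is the extra cost
        va_end(ap);
    }
    return a + b + c;          // stand-in for the real work
}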
Cast the function to have the smaller signature (i.e. fewer parameters):
extern int bar(int, int, int);

int foo(int a, int b) {
    return ((int (*)(int,int))bar)(a, b);
}
Maybe make a macro for a 2-parameter bar, and even get rid of foo:
extern int bar3(int, int, int);
#define bar2(a,b) ((int (*)(int,int))bar3)(a,b)
int userOfBar(int a, int b) { return bar2 (a,b); }
https://godbolt.org/z/Gn4a69
Oddly, given the above, gcc doesn't touch %edx, but clang does... oh, well.
(Still can't guarantee the compiler won't touch some registers, though, that's its domain. Otherwise, you can write these functions directly in assembly and avoid the middleperson.)
Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful to write micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
void clobber() {
    asm volatile("" : : : "memory");
}

void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}
These use inline assembly statements to change the assumptions of the optimizer.
The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won't look into it because it's asm volatile. It believes it when we tell it the code might read and write everywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes prior to the call to clobber, and forces memory reads after the call to clobber†.
The one in escape additionally makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code, that code can be empty, and the optimizer will still assume that the block uses the address pointed to by p. This effectively forces whatever p points to to be in memory and not in a register, because the assembly block might perform a read from that address.
(This is important because the clobber function won't force reads or writes for anything that the compiler decides to put in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
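The typical use from the talk looks like this (a sketch): escape() makes the vector's buffer observable, so the allocation cannot be optimized away, and clobber() forces the store to actually land in memory.

#include <vector>

static void bench_push_back()
{
    std::vector<int> v;
    v.reserve(1);
    escape(v.data());  // the buffer is now "visible" to the outside world
    v.push_back(42);
    clobber();         // so the store of 42 cannot be discarded
}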
These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?
† To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.
Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):
// always returns false, but in an undeducible way
bool isClobberingEnabled();

// The challenge is to implement this function in a way
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// stack or the static memory.
volatile char* nextLocationToClobber();

const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;

inline void clobber() {
    if ( clobberingIsEnabled ) {
        // This will never be executed, but the compiler
        // cannot know about it.
        clobberingPtr = nextLocationToClobber();
        *clobberingPtr = *clobberingPtr;
    }
}
UPDATE
Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?
Answer: We can take advantage of a hard-to-prove property from the number theory, for example, Fermat's Last Theorem:
#include <cstdint>
#include <cstdlib>
#include <ctime>

bool undeducible_false() {
    // It took mathematicians more than 3 centuries to prove Fermat's
    // last theorem in its most general form. That knowledge has hardly
    // been put into compilers (nor will the compiler try hard enough
    // to check all one million possible combinations below).
    // Caveat: avoid integer overflow (Fermat's theorem
    // doesn't hold for modular arithmetic).
    std::uint32_t a = std::clock() % 100 + 1;
    std::uint32_t b = std::rand() % 100 + 1;
    std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;
    return a*a*a + b*b*b == c*c*c;
}
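The guard flag from the sketch above would then be backed by this predicate:

bool isClobberingEnabled() { return undeducible_false(); }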
I have used the following in place of escape.
#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
    *reinterpret_cast<char volatile*>(p) =
        *reinterpret_cast<char const volatile*>(p); // thanks, @milleniumbug
}
#pragma optimize("", on)
#endif
It's not perfect but it's close enough, I think.
Sadly, I don't have a way to emulate clobber.
So I stumbled upon something which I'd like to understand, as it's causing me headaches. I have the following code:
#include <stdio.h>
#include <smmintrin.h>
typedef union {
    struct { float x, y, z, w; } v;
    __m128 m;
} vec;

vec __attribute__((noinline)) square(vec a)
{
    vec x = { .m = _mm_mul_ps(a.m, a.m) };
    return x;
}

int main(int argc, char *argv[])
{
    float f = 4.9;
    vec a = (vec){f, f, f, f};
    vec res = square(a); // ?
    printf("%f %f %f %f\n", res.v.x, res.v.y, res.v.z, res.v.w);
    return 0;
}
Now, in my mind, the call to square in main should put the value of a in xmm0 so that the square function can do mulps xmm0, xmm0 and be done with it.
This is not what happens when I compile with clang or gcc. Instead, the first 8 bytes of a are put in xmm0 and the next 8 bytes in xmm1, making the square function a lot more complicated as it needs to patch things back up.
Any idea why?
NOTE: This is with -O3 optimization.
After further research, it seems it has to do with the union type. If the function takes a straight __m128, the generated code expects the value in a single register (xmm0). But given that both should fit in xmm0, I don't see why the value is being split into two half-used registers when the vec type is used.
The compiler is just trying to follow the calling convention as specified by the System V Application Binary Interface AMD64 Architecture Processor Supplement, section 3.2.3 Parameter Passing.
The relevant points are:
We first define a number of classes to classify arguments. The classes are corresponding to AMD64 register classes and defined as:

SSE: The class consists of types that fit into a vector register.

SSEUP: The class consists of types that fit into a vector register and can be passed and returned in the upper bytes of it.

The size of each argument gets rounded up to eightbytes.

The basic types are assigned their natural classes: Arguments of types float, double, _Decimal32, _Decimal64 and __m64 are in class SSE.

The classification of aggregate (structures and arrays) and union types works as follows: If the size of the aggregate exceeds a single eightbyte, each is classified separately.
Applying the above rules means that the x, y and z, w pairs of the embedded struct get classified separately as SSE class, which in turn means they must be passed in two separate registers. The presence of the m member doesn't have any effect in this case; you can even delete it.
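This also matches the observation in the question: a bare __m128 parameter is one SSE-class argument, so it arrives whole in a single register. A sketch:

#include <smmintrin.h>

// The whole value is classified SSE, arrives in xmm0, and the body
// compiles to a single mulps.
__m128 square_m128(__m128 a)
{
    return _mm_mul_ps(a, a);
}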
EDIT: on a second read through, I'm less certain why this is happening, but I'm more certain that this is where it is happening. I don't think this answer is right, but I'll leave it up as it may be helpful.
Speaking only for clang:
It seems like this is an issue that is just an unfortunate side effect of a compiler heuristic.
From a brief look at clang (file CGRecordLayoutBuilder.cpp, function CGRecordLowering::lowerUnion) it looks like llvm doesn't internally represent union types as such, and the types of a function don't get changed depending on the uses within the function.
clang looks at your function and sees that it needs 16 bytes worth of arguments for the type signature, then uses a heuristic to pick which type it thinks is best. It favors a { double, double } interpretation over a <4 x float> (which would give it the most efficiency in your case) because doubles are more lenient with respect to alignment.
I'm no expert on clang internals, so I could be very wrong, but it doesn't look like there's a particularly nice way around this one. If you want the optimized version you may have to use pointer casting instead of unions to get it.
The code I suspect is causing the problem:
void CGRecordLowering::lowerUnion() {
  ...
  // Conditionally update our storage type if we've got a new "better" one.
  if (!StorageType ||
      getAlignment(FieldType) > getAlignment(StorageType) ||
      (getAlignment(FieldType) == getAlignment(StorageType) &&
       getSize(FieldType) > getSize(StorageType)))
    StorageType = FieldType;
  ...
}
I'm working with a proprietary MCU that has a library built into metal (mask ROM). The compiler I'm using is clang, which uses GCC-like inline asm. The issue I'm running into is calling the library, since the library does not have a consistent calling convention. I found a solution, but in some cases the compiler makes optimizations that clobber registers immediately before the call, so I think there is just something wrong with how I'm doing things. Here is the code I'm using:
void EchoByte(int Hex)
{
    register int asmHex __asm__("R1") = Hex;
    asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
                 :
                 : "r"(asmHex)
                 : "%R1");
    // MASKROM_EchoByte is a 16-bit integer holding the address of the routine.
    ((volatile void (*)(void))(MASKROM_EchoByte))();
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
void EchoByte(int Hex)
{
    register int asmHex __asm__("R1") = Hex;
    asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
                 :
                 : "r"(asmHex)
                 : "%R1");
    ((volatile void (*)(void))(MASKROM_EchoByte))();
    asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
                 :
                 : "r"(asmHex)
                 : "%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. In ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
    push {lr}
    mov  r1, r0            // move our first parameter into r1
    bl   _MASKROM_EchoByte
    pop  {pc}
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
    bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
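On the C side, using the stub is then just an ordinary declaration plus a cast of the ROM address (a sketch; MASKROM_EchoByte is the 16-bit entry address from the question):

typedef void MASKROM_Routine(int r1, ...);
void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);

void EchoByte(int hex)
{
    GeneralPurposeStub((MASKROM_Routine *)MASKROM_EchoByte, hex);
}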
But if you really, really, really must do this via inline assembly in a C function, then (as @JasonD pointed out) all you need to do is add the link register lr to the clobber list.
void EchoByte(int hex)
{
    register int r1 asm("r1") = hex;
    asm volatile(
        "bl _MASKROM_EchoByte"
        :
        : "r"(r1)
        : "lr" // Compare the codegen with and without this "lr"!
    );
}
If you want to call a C/C++ function from inline assembly, you can do something like this:
void callee() {}

void caller()
{
    asm("call *%0" : : "r"(callee));
}
GCC will then emit code which looks like this:
movl $callee, %eax
call *%eax
This can be problematic since the indirect call will destroy the pipeline on older CPUs.
Since the address of callee is eventually a constant, one can imagine that it would be possible to use the i constraint. Quoting from the GCC online docs:
'i'
An immediate integer operand (one with constant value) is allowed. This includes symbolic constants whose values will be known only at assembly time or later.
If I try to use it like this:
asm("call %0" : : "i"(callee));
I get the following error from the assembler:
Error: suffix or operands invalid for `call'
This is because GCC emits the code

call $callee

instead of

call callee
So my question is whether it is possible to make GCC output the correct call.
I got the answer from GCC's mailing list:
asm("call %P0" : : "i"(callee)); // FIXME: missing clobbers
Now I just need to find out what %P0 actually means, because it seems to be an undocumented feature...
Edit: After looking at the GCC source code, it's not exactly clear what the code P in front of a constraint means. But, among other things, it prevents GCC from putting a $ in front of constant values, which is exactly what I need in this case.
For this to be safe, you need to tell the compiler about all registers that the function call might modify, e.g. : "eax", "ecx", "edx", "xmm0", "xmm1", ..., "st(0)", "st(1)", ....
See Calling printf in extended inline ASM for a full x86-64 example of correctly and safely making a function call from inline asm.
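For the 32-bit example here, a minimal version might look like this sketch: it assumes a cdecl callee, declares the call-clobbered integer registers, and ignores FP/SSE state entirely.

void caller(void)
{
    asm volatile("call %P0"
                 : /* no outputs */
                 : "i"(callee)
                 : "eax", "ecx", "edx", "memory", "cc");
}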
Maybe I am missing something here, but
extern "C" void callee(void)
{
}

void caller(void)
{
    asm("call callee\n");
}
should work fine. You need extern "C" so that the name won't be decorated according to C++ name-mangling rules.
If you're generating 32-bit code (e.g. -m32 gcc option), the following asm inline emits a direct call:
asm ("call %0" :: "m" (callee));
The trick is string-literal concatenation. Before GCC starts trying to get any real meaning from your code, it will concatenate adjacent string literals, so even though assembly strings aren't the same as the other strings in your program, they should be concatenated if you do:
#define ASM_CALL(X) asm("\t call " X "\n")

int main(void) {
    ASM_CALL("my_function");
    return 0;
}
Since you are using GCC you could also do
#define ASM_CALL(X) asm("\t call " #X "\n")

int main(void) {
    ASM_CALL(my_function);
    return 0;
}
If you don't already know, you should be aware that calling things from inline assembly is very tricky. When the compiler generates its own calls to other functions, it includes code to set things up before the call and restore them afterwards. It doesn't know that it should be doing any of this for your call, though. You will have to either include that yourself (which is very tricky to get right and may break with a compiler upgrade or different compilation flags) or ensure that your function is written in such a way that it does not appear to have changed any registers or the state of the stack (or the variables on it).
Edit: this will only work for C function names -- not C++, as they are mangled.