How can I access C/C++ variables in inline assembly?

Given this code:
#include <stdio.h>
int main(int argc, char **argv)
{
int x = 1;
printf("Hello x = %d\n", x);
}
I'd like to access and manipulate the variable x in inline assembly. Ideally, I want to change its value using inline assembly. I'm using the GNU assembler with AT&T syntax.

In GNU C inline asm, with x86 AT&T syntax:
(But see https://gcc.gnu.org/wiki/DontUseInlineAsm: avoid inline asm if you can.)
// this example doesn't really need volatile: the result is the same every time
asm volatile("movl $0, %[some]"
: [some] "=r" (x)
);
After this, x contains 0.
Note that you should generally avoid mov as the first or last instruction of an asm statement. Don't copy from %[some] to a hard-coded register like %%eax, just use %[some] as a register, letting the compiler do register allocation.
See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html and https://stackoverflow.com/tags/inline-assembly/info for more docs and guides.
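A slightly fuller sketch, assuming GCC or Clang on x86 (with a plain-C fallback elsewhere): the "+r" constraint tells the compiler the operand is both read and written, so the asm can modify x in place instead of only overwriting it, and the compiler still chooses the register. The helper name add_five is hypothetical.

```c
/* Hypothetical helper: adds 5 to its argument using inline asm. */
int add_five(int x)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* AT&T syntax: the destination is the last operand.
       "+r" = x is both input and output, in a register the compiler picks. */
    __asm__("addl $5, %[val]"
            : [val] "+r" (x)
            :
            : "cc");   /* addl writes the flags */
#else
    x += 5;            /* fallback for other compilers/architectures */
#endif
    return x;
}
```

In the question's program, calling x = add_five(x); before the printf would make it print Hello x = 6.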
Not all compilers support GNU syntax.
For example, with MSVC you do this (MSVC supports inline assembly only when targeting 32-bit x86):
__asm mov x, 0
and x will have the value 0 after this statement.
Please specify the compiler you would want to use.
Also note that doing this restricts your program to a specific compiler/assembler combination and to one particular architecture.
In most cases, you'll get as good or better results from using pure C and intrinsics, not inline asm.

asm("movl $0, %0" : "=r" (x)); This may get you on the right track. (Write to the output operand %0; an input operand such as %1 must not be modified by the asm.) Specify register use as much as possible for performance and efficiency. However, as Aniket points out, this is highly architecture-dependent and requires GCC.


Why don't GCC & MSVC optimize based on uninitialized variables? Can I force them to?

Consider some dead-simple code (or a more complicated one, see footnote 1 below) that uses an uninitialized stack variable, e.g.:
int main() { int x; return 17 / x; }
Here's what GCC emits (-O3):
mov eax, 17
xor ecx, ecx
cdq
idiv ecx
ret
Here's what MSVC emits (-O2):
mov eax, 17
cdq
idiv DWORD PTR [rsp]
ret 0
For reference, here's what Clang emits (-O3):
ret
The thing is, all three compilers detect that this is an uninitialized variable just fine (-Wall), but only one of them actually performs any optimizations based on it.
This is kind of stumping me... I thought all the decades of fighting over undefined behavior were meant to allow compiler optimizations, and yet I'm seeing only one compiler that cares to optimize even the most basic cases of UB.
Why is this? What do I do if I want compilers other than Clang to optimize such cases of UB? Is there any way for me to actually get the benefits of UB instead of just the downsides with either compiler?
Footnotes
1 Apparently this was too much of an SSCCE for some folks to appreciate the actual issue. If you want a more complicated example of this problem that isn't undefined on every execution of the program, just massage it a bit. e.g.:
int main(int argc, char *[])
{
int x;
if (argc) { x = 100 + (argc * argc + x); }
return x;
}
On GCC you get:
main:
xor eax, eax
test edi, edi
je .L1
imul edi, edi
lea eax, [rdi+100]
.L1:
ret
On Clang you get:
main:
ret
Same issue, just more complicated.
Optimizing for actually reading uninitialized data is not the point.
Optimizing for assuming the data you read must have been initialized is.
So if you have some variable that can only be written to as 3 or 1, the compiler can assume it is odd.
Or, if you add a positive constant to a signed value, the compiler can assume the result is larger than the original value, because signed overflow is undefined (this makes some loops faster).
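A small sketch of that second assumption (function names are hypothetical): because signed overflow is undefined, GCC and Clang at -O2 typically fold the signed version to a constant true, while the unsigned version has to keep a real comparison because unsigned arithmetic is defined to wrap.

```c
#include <limits.h>
#include <stdbool.h>

/* Signed overflow is undefined, so an optimizing compiler may assume
   x + 1 never wraps and fold this function to "return true". */
bool plus_one_is_greater(int x)
{
    return x + 1 > x;
}

/* Unsigned arithmetic wraps (UINT_MAX + 1u == 0u), so the comparison
   must actually be performed. */
bool uplus_one_is_greater(unsigned x)
{
    return x + 1u > x;
}
```

Calling plus_one_is_greater(INT_MAX) would itself be undefined behavior, which is exactly the case the compiler is allowed to ignore.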
What the optimizer does when it proves an uninitialized value is read isn't important; making UB or indeterminate-value calculations faster is not the point. Well-behaved programs don't do that on purpose, so spending effort making it faster (or slower, or caring at all) is a waste of compiler writers' time.
It may fall out of other efforts. Or it may not.
Consider this example:
int foo(bool x) {
int y;
if (x) y = 3;
return y;
}
GCC realizes that the only way the function can return something well defined is when x is true. Hence, when optimizations are turned on, there is no branch:
foo(bool):
mov eax, 3
ret
Calling foo(true) is not undefined behavior. Calling foo(false) is undefined behavior. Nothing in the standard specifies that foo(false) returns 3, and nothing forbids it either. Compilers do not optimize code that has undefined behavior as such; rather, they can optimize code without UB (e.g. remove the branch in foo) because the standard places no requirements on what happens when there is UB.
What do I do if I want compilers other than Clang to optimize such cases of UB?
Compilers do that by default. GCC is no different from Clang in that respect.
In your example of
int main() { int x; return 17 / x; }
there is no missed optimization, because it is not defined what the code will do in the first place.
Your second example can be considered a missed optimization opportunity. Though, again: UB grants opportunities to optimize code that does not have UB. The idea is not that you introduce UB in your code to gain optimizations. Your second example can (and should) be rewritten as
int main(int argc, char *[])
{
int x = 100 + argc * argc; // no branch, and no read of an uninitialized x
return x;
}
It isn't a big issue in practice that GCC doesn't bother to remove the branch in your version. If you don't need the branch, don't write it and then expect the compiler to remove it.
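If what you actually want is for GCC (or MSVC) to assume that a path never happens, a sketch that states the assumption explicitly, rather than relying on an uninitialized read, uses GCC/Clang's __builtin_unreachable() or MSVC's __assume(0). The function name foo_assume is hypothetical. With the promise in place, GCC should be able to drop the branch much as it did for foo above; calling foo_assume(false) is then undefined behavior, just as foo(false) was.

```c
#include <stdbool.h>

int foo_assume(bool x)
{
    int y = 0;          /* initialized: no uninitialized read involved */
    if (x) {
        y = 3;
    } else {
#if defined(__GNUC__)
        __builtin_unreachable();   /* GCC/Clang: promise this path is never taken */
#elif defined(_MSC_VER)
        __assume(0);               /* MSVC equivalent of the same promise */
#endif
    }
    return y;
}
```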
The Standard uses the term "Undefined Behavior" to refer to actions which in some contexts might be non-portable but correct, but in other contexts would be erroneous, without making any effort to distinguish when a particular action should be viewed one way or the other.
In C89 and C99, if it would be possible for a type's underlying storage to hold an invalid bit pattern, attempting to use an uninitialized automatic-duration or malloc-allocated object of that type would invoke Undefined Behavior, but if all possible bit patterns would be valid, accessing such an object would simply yield an Unspecified Value of that type. This meant, for example, that a program could do something like:
#include <stdint.h>
struct ushorts256 { uint16_t dat[256]; } x,y;
void test(void)
{
struct ushorts256 temp;
int i;
for (i=0; i<86; i++)
temp.dat[i*3]=i;
x=temp;
y=temp;
}
and if the callers only cared about what was in the multiple-of-3 elements of the structures, there would be no need for the code to worry about the other 170 values of temp.
C11 changed the rules so that compiler writers wouldn't have to follow the C89 and C99 behavior if they felt there was something more useful they could do. For example, depending upon what calling code would do with the arrays, it might be more efficient to simply have the code write every third item of x and every third item of y, leaving the remaining items alone. A consequence of this would be that non-multiple-of-3 items of x might not match the corresponding items of y, but people seeking to sell compilers were expected to be able to judge their particular customers' needs better than the Committee ever could.
Some compilers treat uninitialized objects in a manner consistent with C89 and C99. Some may exploit the freedom to have the values behave non-deterministically (as in the example above) but without otherwise disrupting program behavior. Some may opt to treat any program that accesses uninitialized variables in a gratuitously meaningless fashion. Portable programs may not rely upon any particular treatment, but the authors of the Standard have expressly stated they did not wish to "demean" useful programs that happened to be non-portable (see http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf page 13).

Have compiler ignoring setting an argument register before calling function

TL;DR: I am looking for a standard way to basically tell the compiler to pass whatever happened to be in a given register to the next function.
Basically I have a function int bar(int a, int b, int c). In some cases c is unused and I would like to be able to call bar in the cases where c is unused without modifying rdx in any way.
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
I would like the assembly to just be:
For a tailcall
jmp bar
or for a normal call
call bar
Note: Clang generally produces what I am looking for. But I am unsure whether this will always be the case in more complex functions, and I am hoping not to have to check the assembly each time I build.
GCC produces:
For a tailcall
xorl %edx, %edx
jmp bar
or for a normal call
xorl %edx, %edx
call bar
I can get the results I want using inline assembly, i.e. changing foo (for tail calls) to
int foo(int a, int b) {
asm volatile("jmp bar" : : :);
__builtin_unreachable();
}
which compiles to just
jmp bar
I understand that the performance impact of an xorl %edx, %edx is about as close to 0 as possible, but
I am wondering if there is a standard way to achieve this.
I.e. I can probably find a hack for any given case, but that would require verifying the assembly each time. I am looking for a way to basically tell the compiler "pass whatever happens to be in the register".
See examples at: https://godbolt.org/z/eh1vK8
Edit: This is happening with -O3 set.
I am wondering if there is a standard way to achieve this.
I.e. I can probably find a hack for it for any given case. But that will require me verifying the assembly each time. I am looking for a method that you can basically tell the compiler "pass whatever happened to be in register".
No, there is no standard way to achieve it in either C or C++. Neither of these languages speak to any lower-level function call semantics, nor even acknowledge the existence of CPU registers,* and both languages require every function call to provide arguments corresponding to all non-optional parameters (which is simply "all declared parameters" in C).
For example if I have
int foo(int a, int b) {
int no_init;
return bar(a, b, no_init);
}
... then you reap undefined behavior as a result of using the value of no_init while it is indeterminate. Whatever a particular C or C++ implementation does with that (if it accepts it at all) is non-standard by definition.
If you want to call bar(), but you don't care what value is passed as the third argument, then why not just choose a convenient value to pass? Zero, for example:
return bar(a, b, 0);
*Even the register keyword does not do this as far as either language standard is concerned.
Note that if the called function does read its 3rd arg, leaving it unwritten risks creating a false dependency on whatever last used EDX. For example it might be the result of a cache-miss load, or a long chain of calculations.
GCC is careful to xor-zero to break false dependencies in a lot of cases, e.g. before cvtsi2ss (bad ISA design) or popcnt (Sandybridge-family quirk).
Usually the xor edx,edx is basically a wasted 2-byte NOP, but it does prevent possible coupling of otherwise-independent dependency chains (critical paths).
If you're sure you want to defeat the compiler's attempt to protect you from that, then Nate's asm("" :"=r"(var)); is a good way to do an integer version of _mm_undefined_ps() that actually leaves a register uninitialized. (Note that _mm_undefined_ps doesn't guarantee leaving an XMM reg unwritten; some compilers will xor-zero one for you instead of fully implementing the false-dependency recklessness that intrinsic was designed to allow for Intel's compiler.)
One approach that should work for gcc/clang on most platforms is to do
int no_init;
asm("" : "=r" (no_init));
return bar(a, b, no_init);
This way you don't have to lie to the compiler about the prototype of bar (which could break some calling conventions), and you fool the compiler into thinking no_init is really initialized.
I would wonder about an architecture like Itanium with its "trap bit" that causes a fault when an uninitialized register is accessed. This code would probably not be safe there.
There is no portable way to get this behavior that I know of, but you could ifdef it:
#ifdef __GNUC__
#define UNUSED_INT ({ int x; asm("" : "=r" (x)); x; })
#else
#define UNUSED_INT 0
#endif
// ...
bar(a, b, UNUSED_INT);
Then you can fall back to the (infinitesimally) less efficient but correct code when necessary.
It results in a bare jmp on gcc/x86-64, see https://godbolt.org/z/d3ordK. On x86-32 it is not quite optimal as it pushes an uninitialized register, instead of just adjusting an existing subtraction from esp. Note that a bare jmp/call is not safe on x86-32 because that third stack slot may contain something important, and the callee is allowed to overwrite it (even if the variable is unused on the path you have in mind, the compiler could be using it as scratch space).
One portable alternative would be to rewrite bar to be variadic. However, then it would need to use va_arg to retrieve the third argument when it is present, and that tends to be less efficient.
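A minimal sketch of the variadic alternative, assuming a hypothetical convention in which bar only receives (and only reads) its third argument when a is nonzero; the real bar would need some equivalent way to know whether c was passed, and the arithmetic is a placeholder. Note that on x86-64 System V a call to a variadic function also sets %al (an upper bound on the number of vector registers used), so the caller still writes one register.

```c
#include <stdarg.h>

/* Hypothetical convention: the caller passes a third int only when a != 0,
   and bar reads it only in that case. */
int bar(int a, int b, ...)
{
    int c = 0;
    if (a != 0) {
        va_list ap;
        va_start(ap, b);       /* b is the last named parameter */
        c = va_arg(ap, int);   /* fetch the optional third argument */
        va_end(ap);
    }
    return a + b + c;          /* placeholder computation */
}
```

Callers then write bar(0, 5) when c is unused and bar(1, 5, 2) when it is needed.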
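Cast the function to have the smaller signature (i.e. fewer parameters). Strictly speaking, calling a function through a pointer of an incompatible type is undefined behavior in C, so this relies on your calling convention tolerating the missing argument: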
extern int bar(int, int, int);
int foo(int a, int b) {
return ((int (*)(int,int))bar)(a, b);
}
Maybe make a macro for a two-parameter bar, and even get rid of foo:
extern int bar3(int, int, int);
#define bar2(a,b) ((int (*)(int,int))bar3)(a,b)
int userOfBar(int a, int b) { return bar2 (a,b); }
https://godbolt.org/z/Gn4a69
Oddly, given the above, GCC doesn't touch %edx, but Clang does... oh, well.
(Still can't guarantee the compiler won't touch some registers, though; that's its domain. Otherwise, you can write these functions directly in assembly and avoid the middleperson.)

Porting to Mac OS X error

I have a cross-platform audio processing app. It is written using the Qt and PortAudio libraries. I also use Chaotic-Daw sources for some audio processing functions (Vibrato effect and Soft-Knee Dynamic Range Compression). The problem is that I cannot port my app from Windows to Mac OS X because I get compiler errors for the __asm parts (I use Mac OS X Yosemite and the Qt Creator 3.4.1 IDE):
/Users/admin/My projects/MySound/daw/basics/rosic_NumberManipulations.h:69: error: expected '(' after 'asm'
{
^
for lines such as these:
INLINE int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
#ifndef LINUX
__asm
{ // <========= error indicates that row
fld x;
fadd st, st (0);
fadd round_towards_m_i;
fistp i;
sar i, 1;
}
#else
i = (int) floor(x);
#endif
return (i);
}
How can I resolve this problem?
The code was clearly written for Microsoft's Visual C++ compiler, as that is the syntax it uses for inline assembly. It uses the Intel syntax and is rather simplistic, which makes it easy to write but hinders its optimization potential.
Clang and GCC both use a different format for inline assembly. In particular, they use the GNU AT&T syntax. It is more complicated to write, but much more expressive. The compiler error is basically Clang's way of telling you, "I can tell you're trying to write inline assembly, but you've formatted it all wrong!"
Therefore, to make this code compile, you will need to convert the MSVC-style inline assembly into GAS-format inline assembly. It might look like this:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
__asm__("fadd %[x], %[x] \n\t"
"fadds %[adj] \n\t"
"fistpl %[i] \n\t"
"sarl $1, %[i]"
: [i] "=m" (i) // store result in memory (as required by FISTP)
: [x] "t" (x), // load input onto top of x87 stack (equivalent to FLD)
[adj] "m" (round_towards_m_i)
: "st");
return (i);
}
But, because of the additional expressiveness of the GAS style, we can offload more of the work to the built-in optimizer, which may yield even better object code:
int floorInt(double x)
{
const float round_towards_m_i = -0.5f;
int i;
x += x; // equivalent to the first FADD
x += round_towards_m_i; // equivalent to the second FADD
__asm__("fistpl %[i]"
: [i] "=m" (i)
: [x] "t" (x)
: "st");
return (i >> 1); // equivalent to the final SAR
}
(Note that, technically, right-shifting a negative signed value, as the last line does, is implementation-defined in C and would normally be inadvisable. However, if you're using inline assembly, you have already made the decision to target a specific platform and can therefore rely on implementation-specific behavior. In this case, GCC, Clang and MSVC all implement it as an arithmetic shift and generate a SAR instruction for signed integer values on x86.)
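If you would rather not depend on implementation-defined behavior at all, the final >> 1 can be replaced by a portable floor division by 2. This sketch (the name floor_div2 is hypothetical) relies on the C99 rule that / truncates toward zero and % takes the sign of the dividend.

```c
/* floor(i / 2) for any int i, without relying on how >> treats negatives.
   i / 2 truncates toward zero, so subtract 1 when the remainder is negative. */
int floor_div2(int i)
{
    return i / 2 - (i % 2 < 0);
}
```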
That said, it appears that the authors of the code intended for the inline assembly to be used only when you are compiling for a platform other than LINUX (presumably, that would be Windows, on which they expected you to be using Microsoft's compiler). So you could get the code to compile simply by defining LINUX, either on the command line or in your makefile.
I'm not sure why that decision was made; Clang and GCC are both going to generate the same inefficient code that MSVC does (assuming that you are targeting the older generation of x86 processors and unable to use SSE2 instructions). It is up to you: the code will run either way, but it will be slower without the use of inline assembly to force the use of this clever optimization.
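For comparison, a pure-C floorInt that needs neither inline assembly nor a call to floor() might look like the sketch below. It is a different technique from the round-and-halve trick in the asm version, and it assumes x is finite and that floor(x) fits in an int (otherwise the conversion is undefined). The name floor_int is hypothetical.

```c
/* Assumes x is finite and floor(x) is representable as an int. */
int floor_int(double x)
{
    int t = (int)x;        /* conversion truncates toward zero */
    return t - (x < t);    /* step down by 1 for negative non-integers */
}
```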

Unconventional Calls with Inline ASM

I'm working with a proprietary MCU that has a built-in library in metal (mask ROM). The compiler I'm using is Clang, which uses GCC-like inline ASM. The issue I'm running into is calling the library, since the library does not have a consistent calling convention. While I found a solution, in some cases the compiler makes optimizations that clobber registers immediately before the call, so I think there is something wrong with how I'm doing things. Here is the code I'm using:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))(); //MASKROM_EchoByte is a 16-bit integer with the memory location of the function
}
Now this has the obvious problem that while the variable "asmHex" is asserted to register R1, the actual call does not use it and therefore the compiler "doesn't know" that R1 is reserved at the time of the call. I used the following code to eliminate this case:
int EchoByte()
{
register int asmHex __asm__ ("R1") = Hex;
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
((volatile void (*)(void))(MASKROM_EchoByte))();
asm volatile("//Assert Input to R1 for MASKROM_EchoByte"
:
:"r"(asmHex)
:"%R1");
}
This seems really ugly to me, and like there should be a better way. Also I'm worried that the compiler may do some nonsense in between, since the call itself has no indication that it needs the asmHex variable. Unfortunately, ((volatile void (*)(int))(MASKROM_EchoByte))(asmHex) does not work as it will follow the C-convention, which puts arguments into R2+ (R1 is reserved for scratching)
Note that changing the Mask ROM library is unfortunately impossible, and there are too many frequently used routines to recreate them all in C/C++.
Cheers, and thanks.
EDIT: I should note that while I could call the function in the ASM block, the compiler has an optimization for functions that are call-less, and by calling in assembly it looks like there's no call. I could go this route if there is some way of indicating that the inline ASM contains a function call, but otherwise the return address will likely get clobbered. I haven't been able to find a way to do this in any case.
Per the comments above:
The most conventional answer is that you should implement a stub function in assembly (in a .s file) that simply performs the wacky call for you. On ARM, this would look something like
// void EchoByte(int hex);
_EchoByte:
push {lr}
mov r1, r0 // move our first parameter into r1
bl _MASKROM_EchoByte
pop {pc}
Implement one of these stubs per mask-ROM routine, and you're done.
What's that? You have 500 mask-ROM routines and don't want to cut-and-paste so much code? Then add a level of indirection:
// typedef void MASKROM_Routine(int r1, ...);
// void GeneralPurposeStub(MASKROM_Routine *f, int arg, ...);
_GeneralPurposeStub:
bx r0
Call this stub by using the syntax GeneralPurposeStub(&MASKROM_EchoByte, hex). It'll work for any mask-ROM entry point that expects a parameter in r1. Any really wacky entry points will still need their own hand-coded assembly stubs.
But if you really, really, really must do this via inline assembly in a C function, then (as @JasonD pointed out) all you need to do is add the link register lr to the clobber list.
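void EchoByte(int hex)
{
register int r1 asm("r1") = hex;
asm volatile(
"bl _MASKROM_EchoByte"
: "+r"(r1) // r1 is an input the routine may modify: mark it read-write (an input may not also be listed as a clobber)
:
: "lr" // Compare the codegen with and without this "lr"! Add any other registers the ROM routine modifies.
);
}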

Direct C function call using GCC's inline assembly

If you want to call a C/C++ function from inline assembly, you can do something like this:
void callee() {}
void caller()
{
asm("call *%0" : : "r"(callee));
}
GCC will then emit code which looks like this:
movl $callee, %eax
call *%eax
This can be problematic since the indirect call will destroy the pipeline on older CPUs.
Since the address of callee is eventually a constant, one can imagine that it would be possible to use the i constraint. Quoting from the GCC online docs:
`i'
An immediate integer operand (one with constant value) is allowed. This includes symbolic constants whose values will be known only at assembly time or later.
If I try to use it like this:
asm("call %0" : : "i"(callee));
I get the following error from the assembler:
Error: suffix or operands invalid for `call'
This is because GCC emits the code
call $callee
Instead of
call callee
So my question is whether it is possible to make GCC output the correct call.
I got the answer from GCC's mailing list:
asm("call %P0" : : "i"(callee)); // FIXME: missing clobbers
Now I just need to find out what %P0 actually means because it seems to be an undocumented feature...
Edit: After looking at the GCC source code, it's not exactly clear what the P modifier in front of an operand number means. But, among other things, it prevents GCC from putting a $ in front of constant values, which is exactly what I need in this case.
For this to be safe, you need to tell the compiler about all registers that the function call might modify, e.g. : "eax", "ecx", "edx", "xmm0", "xmm1", ..., "st(0)", "st(1)", ....
See Calling printf in extended inline ASM for a full x86-64 example of correctly and safely making a function call from inline asm.
Maybe I am missing something here, but
extern "C" void callee(void)
{
}
void caller(void)
{
asm("call callee\n");
}
should work fine. You need extern "C" so that the name isn't decorated according to C++ name-mangling rules.
If you're generating 32-bit code (e.g. -m32 gcc option), the following asm inline emits a direct call:
asm ("call %0" :: "m" (callee));
The trick is string literal concatenation. Before GCC starts trying to get any real meaning from your code it will concatenate adjacent string literals, so even though assembly strings aren't the same as other strings you use in your program they should be concatenated if you do:
#define ASM_CALL(X) asm("\t call " X "\n")
int main(void) {
ASM_CALL( "my_function" );
return 0;
}
You could also let the preprocessor stringize the name with the # operator:
#define ASM_CALL(X) asm("\t call " #X "\n")
int main(void) {
ASM_CALL(my_function);
return 0;
}
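The preprocessing step both macros rely on can be checked in isolation, without any inline asm. In this sketch (ASM_CALL_TEXT, asm_call_text_ok and my_function are hypothetical names), # turns the macro argument into a string literal and the adjacent literals are merged into one, which is the string the asm statement would receive.

```c
#include <string.h>

/* Same text as ASM_CALL(X), but without wrapping it in asm(). */
#define ASM_CALL_TEXT(X) "\t call " #X "\n"

/* Returns nonzero if the concatenated literal is what asm() would see. */
int asm_call_text_ok(void)
{
    static const char text[] = ASM_CALL_TEXT(my_function);
    return strcmp(text, "\t call my_function\n") == 0;
}
```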
If you don't already know, be aware that calling things from inline assembly is very tricky. When the compiler generates its own calls to other functions, it includes code to set up and restore things before and after the call. It doesn't know that it should be doing any of this for your call, though. You will have to either include that yourself (very tricky to get right, and it may break with a compiler upgrade or different compilation flags) or ensure that your function is written in such a way that it does not appear to have changed any registers or the state of the stack (or variables on it).
Edit: this only works for C function names, not C++ ones, since those are mangled.