How to force GCC to use jmp instruction instead of ret? - c++

I am using stackful coroutines for network programming, but I am being punished by the invalidation of the return stack buffer (see
http://www.agner.org/optimize/microarchitecture.pdf p.36) during the context switch, because we manually change the SP register.
After testing in assembly language, I found that a jmp instruction performs better here than a ret. However, I have several more functions, written in C++ and compiled by GCC, that indirectly call the context-switch function. How can I force these functions to return using jmp instead of ret in GCC's assembly output?
Some common but imperfect methods:
Using inline assembly to manually set the SP register to __builtin_frame_address + 2*sizeof(void*) and jmp to the return address, before the ret?
This is an unsafe solution. In C++, local variables and temporaries are destroyed before the ret instruction, and those destructor calls are skipped if we jmp away instead. Worse, even in plain C, callee-saved registers must be restored before the ret instruction, and those restores would be skipped as well.
So what can we do to force GCC to use jmp instead of ret and avoid the problems listed above?

Use an assembler macro:
.macro ret
pop %ecx
jmp *%ecx
.endm
Put that in inline assembler at the top of the file or elsewhere.
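For example, a minimal sketch (assuming a 32-bit x86 target, since the macro pops into %ecx) of embedding it at the top of a C++ file via a file-scope asm statement, so that every later ret in this translation unit is assembled as the pop/jmp pair:

asm(
    ".macro ret\n"
    "    pop %ecx\n"
    "    jmp *%ecx\n"
    ".endm"
);

Note this clobbers %ecx on every return, which is acceptable only because %ecx is call-clobbered in the usual 32-bit calling conventions; a 64-bit variant would pop into a scratch register such as %rcx instead.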

Related

Why does `monotonic_buffer_resource` appear in the assembly when it doesn't seem to be used?

This is a follow-up from another question.
I think the following code should not use monotonic_buffer_resource, but in the generated assembly there are references to it.
#include <memory_resource>

void default_pmr_alloc(std::pmr::polymorphic_allocator<int>& alloc) {
    (void)alloc.allocate(1);
}
godbolt
I looked into the source code of the header files and libstdc++, but could not find how monotonic_buffer_resource was selected to be used by the default pmr allocator.
The assembly tells the story. In particular, this:
cmp rax, OFFSET FLAT:_ZNSt3pmr25monotonic_buffer_resource11do_allocateEmm
jne .L11
This appears to be a test to see if the memory resource is a monotonic_buffer_resource. This seems to be done by checking the do_allocate member of the vtable. If it is not such a resource (i.e., if do_allocate in the memory resource's vtable is not the monotonic one), then it jumps down to this:
.L11:
mov rdi, rbx
mov edx, 4
mov esi, 4
pop rbx
jmp rax
This appears to be a vtable call.
The rest of the assembly appears to be an inlined version of monotonic_buffer_resource::do_allocate, which is why it conditionally calls std::pmr::monotonic_buffer_resource::_M_new_buffer.
So overall, this implementation of polymorphic_resource::allocate seems to have some built-in inlining of monotonic_buffer_resource::do_allocate if the resource is appropriate for that. That is, it won't do a vtable call if it can determine that it should call monotonic_buffer_resource::do_allocate.
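For a rough feel of the pattern, here is a self-contained sketch (my own illustration, not libstdc++'s actual code): test whether the virtual call would resolve to the known monotonic implementation and, if so, take a direct and therefore inlinable fast path, otherwise fall back to ordinary virtual dispatch. The exact-type check below stands in for the vtable-slot comparison visible in the assembly above.

#include <cstddef>

struct Resource {
    virtual void* do_allocate(std::size_t bytes, std::size_t align) = 0;
    virtual ~Resource() = default;
};

struct Monotonic final : Resource {
    unsigned char buf[1024];
    std::size_t used = 0;
    void* do_allocate(std::size_t bytes, std::size_t) override {
        void* p = buf + used;   // bump-pointer fast path (no bounds checks; sketch only)
        used += bytes;
        return p;
    }
};

void* allocate(Resource* r, std::size_t bytes, std::size_t align) {
    if (auto* m = dynamic_cast<Monotonic*>(r))   // plays the role of the cmp against do_allocate
        return m->do_allocate(bytes, align);     // direct call; 'final' lets the compiler inline it
    return r->do_allocate(bytes, align);         // the "jmp rax" case: indirect vtable call
}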

The implementation difference between the interpreter and the JIT compiler for AtomicInteger.lazySet() in the HotSpot JVM

Refer to the existing discussion at AtomicInteger lazySet vs. set for the background of AtomicInteger.lazySet().
So, according to the semantics of AtomicInteger.lazySet(), on x86 CPUs AtomicInteger.lazySet() is equivalent to a normal write to the value of the AtomicInteger, because the x86 memory model guarantees the order among write operations.
However, the runtime behavior of AtomicInteger.lazySet() differs between the interpreter and the JIT compiler (the C2 compiler specifically) in the JDK 8 HotSpot JVM, which confuses me.
First, create a simple Java application for demo.
import java.util.concurrent.atomic.AtomicInteger;

public class App {
    public static void main(String[] args) throws Exception {
        AtomicInteger i = new AtomicInteger(0);
        i.lazySet(1);
        System.out.println(i.get());
    }
}
Then, dump the instructions for AtomicInteger.lazySet() that come from the intrinsic provided by the C2 compiler:
$ java -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:CompileCommand=print,*AtomicInteger.lazySet App
...
0x00007f1bd927214c: mov %edx,0xc(%rsi) ;*invokevirtual putOrderedInt
; - java.util.concurrent.atomic.AtomicInteger::lazySet#8 (line 110)
As you can see, the operation is, as expected, a normal write.
Then, use GDB to trace the runtime behavior of the interpreter for AtomicInteger.lazySet().
$ gdb --args java App
(gdb) b Unsafe_SetOrderedInt
0x7ffff69ae836 callq 0x7ffff69b6642 <OrderAccess::release_store_fence(int volatile*, int)>
0x7ffff69b6642:
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp)
mov %esi,-0xc(%rbp)
mov -0xc(%rbp),%eax
mov -0x8(%rbp),%rdx
xchg %eax,(%rdx)           // the write operation
mov %eax,-0xc(%rbp)
nop
pop %rbp
retq
As you can see, the operation is actually an XCHG instruction, which has implicit lock semantics and brings exactly the performance overhead that AtomicInteger.lazySet() is intended to eliminate.
Does anyone know why there is such a difference? Thanks.
There is not much sense in optimizing a rarely used operation in the interpreter. It would increase development and maintenance costs for no visible benefit.
It's common practice in HotSpot to implement optimizations only in a JIT compiler (either C2, or both C1 and C2). The interpreter implementation just needs to work; it does not need to be fast, because if the code is performance-sensitive, it will be JIT-compiled anyway.
So, in the interpreter (i.e. on the slow path), Unsafe_SetOrderedInt is exactly the same as a volatile write.
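To put the distinction in more familiar terms, here is a C++ analogy (my own illustration, not HotSpot code): lazySet corresponds to a release store, which on x86-64 compiles to a plain MOV, matching the C2 intrinsic's output above, while a full volatile set corresponds to a sequentially consistent store, which compilers typically emit as XCHG (or MOV plus MFENCE), matching the OrderAccess::release_store_fence disassembly above.

#include <atomic>

void lazy_set_like(std::atomic<int>& v) {
    v.store(1, std::memory_order_release);   // x86-64: plain mov
}

void volatile_set_like(std::atomic<int>& v) {
    v.store(1, std::memory_order_seq_cst);   // x86-64: xchg (or mov + mfence)
}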

Is the Link Register (LR) affected by inline or naked functions?

I'm using an ARM Cortex-M4 processor. As far as I understand, the LR (link register) stores the return address of the currently executing function. However, do inline and/or naked functions affect it?
I'm working on implementing simple multitasking. I'd like to write some code that saves the execution context (pushing R0-R12 and LR to the stack) so that it can be restored later. After the context save, I have an SVC so the kernel can schedule another task. When it decides to schedule the current task again, it'd restore the stack and execute BX LR. I'm asking this question because I'd like BX LR to jump to the correct place.
Let's say I use arm-none-eabi-g++ and I'm not concerned with portability.
For example, if I have the following code with the always_inline attribute: since the compiler will inline it, there will not be a function call in the resulting machine code, so the LR is unaffected, right?
__attribute__((always_inline))
inline void Task::saveContext() {
    asm volatile("PUSH {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12, LR}");
}
Then, there is also the naked attribute, whose documentation says that the function will not have prologue/epilogue sequences generated by the compiler. What exactly does that mean? Does a naked function still result in a function call, and does it affect the LR?
__attribute__((naked))
void saveContext() {
    asm volatile("PUSH {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12, LR}");
}
Also, out of curiosity, what happens if a function is marked with both always_inline and naked? Does that make a difference?
Which is the correct way to ensure that a function call does not affect the LR?
As far as I understand, the LR (link register) stores the return address of the currently executing function.
Nope, lr simply receives the address of the following instruction upon execution of a bl or blx instruction. In the M-class architecture, it also receives a special magic value upon exception entry, which will trigger an exception return when used like a return address, making exception handlers look exactly the same as regular functions.
Once the function has been entered, the compiler is free to save that value elsewhere and use r14 as just another general-purpose register. Indeed, it needs to save the value somewhere if it wants to make any nested calls. With most compilers any non-leaf function will push lr to the stack as part of the prologue (and often take advantage of being able to pop it straight back into pc in the epilogue to return).
Which is the correct way to ensure that a function call does not affect the LR?
A function call by definition affects lr - otherwise it would be a goto, not a call (tail-calls notwithstanding, of course).
re: update. Leaving my old answer below, since it answers the original question before the edit.
__attribute__((naked)) basically exists so you can write the whole function in asm, inside asm statements instead of in a separate .S file. The compiler doesn't even emit a return instruction, you have to do that yourself. It doesn't make sense to use this for inline functions (like I already answered below).
Calling a naked function will generate the usual call sequence, with a bl my_naked_function, which of course sets LR to point to the instruction after the bl. A naked function is essentially a never-inline function that you write in asm. "prologue" and "epilogue" are the instructions that save and restore callee-saved registers, and the return instruction itself (bx lr).
Try it and see. It's easy to look at gcc's asm output. I changed your function names to help explain what's going on, and fixed the syntax (The GNU C __attribute__ extension requires doubled parens).
extern void extfunc(void);
__attribute__((always_inline))
inline void break_the_stack() { asm volatile("PUSH LR"); }
__attribute__((naked))
void myFunc() {
    asm volatile("PUSH {r3, LR}\n\t" // keep the stack aligned for our callee by pushing a dummy register along with LR
                 "bl extfunc\n\t"
                 "pop {r3, PC}"
    );
}
int foo_simple(void) {
    extfunc();
    return 0;
}
int foo_using_inline(void) {
    break_the_stack();
    extfunc();
    return 0;
}
asm output with gcc 4.8.2 -O2 for ARM (default is a thumb target, I think).
myFunc(): # I followed the compiler's foo_simple example for this
PUSH {r3, LR}
bl extfunc
pop {r3, PC}
foo_simple():
push {r3, lr}
bl extfunc()
movs r0, #0
pop {r3, pc}
foo_using_inline():
push {r3, lr}
PUSH LR
bl extfunc()
movs r0, #0
pop {r3, pc}
The extra push LR means we're popping the wrong data into PC. Maybe another copy of LR, in this case, but we're returning with a modified stack pointer, so the caller will break. Don't mess with LR or the stack in an inline function, unless you're trying to do some kind of binary instrumentation thing.
re: comments: if you just want to set a C variable = LR:
As #Notlikethat points out, LR might not hold the return address. So you might want __builtin_return_address(0) to get the return address of the current function. However, if you're just trying to save register state, then you should save/restore whatever the function has in LR if you hope to correctly resume execution at this point:
#define get_lr(lr_val) asm ("mov %0, lr" : "=r" (lr_val))
This might need to be volatile to stop it from being hoisted up the call tree during whole-program optimization.
This leads to an extra mov instruction when perhaps the ideal sequence would be to store lr, rather than copy to another reg first. Since ARM uses different instructions for reg-reg move vs. store to memory, you can't just use a rm constraint for the output operand to give the compiler that option.
You could wrap this inside an inline function. A GNU C statement-expression in a macro would also work, but an inline function should be fine:
__attribute__((always_inline)) inline void* current_lr(void) { // This should work correctly when inlined, or just use the macro
    void* lr;
    get_lr(lr);
    return lr;
}
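For instance, a hypothetical use (the TaskState type and field name are my own, purely for illustration) would be to stash whatever LR holds at the capture point into a task structure:

struct TaskState { void* saved_lr; };

void record_state(TaskState* ts) {
    ts->saved_lr = current_lr();   // after inlining, reads the LR value live inside record_state
}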
For reference: What are SP (stack) and LR in ARM?
A naked always_inline function is not useful.
The docs say a naked function can only contain asm statements, and only "Basic" asm (without operands, so you have to get args from the right place for the ABI yourself). Inlining that makes zero sense, because you won't know where the compiler put your args.
If you want to inline some asm, don't use a naked function. Instead, use an inline function that uses correct constraints for input/output parameters.
The x86 wiki has some good inline asm links, and they're not all specific to x86. For example, see the collection of GNU inline asm links at the end of this answer for examples of how to make good use of the syntax to let the compiler make as efficient code as possible around your asm fragment.

Why would inline assembly crash in release only? [duplicate]

I have a scenario in GCC causing me problems. The behaviour I get is not the behaviour I expect. To summarise the situation, I am proposing several new instructions for x86-64, which are implemented in a hardware simulator. In order to test these instructions I am taking existing C source code and hand-coding the new instructions in hexadecimal. Because these instructions interact with the existing x86-64 registers, I use the input/output/clobber lists to declare dependencies for GCC.
What's happening is that if I call a function e.g. printf, the dependent registers aren't saved and restored.
For example
register unsigned long r9 asm ("r9") = 101;
printf("foo %s\n", "bar");
asm volatile (".byte 0x00, 0x00, 0x00, 0x00" : /* no output */ : "q" (r9) );
101 was assigned to r9, and the inline assembly (a fake instruction in this example) depends on r9. This runs correctly in the absence of the printf, but when it is present, GCC does not save and restore r9, and another value is in the register by the time my custom instruction is called.
I thought perhaps that GCC might have secretly changed the assignment to the variable r9, but when I do this
asm volatile (".byte %0" : /* no output */ : "q" (r9) );
and look at the assembly output, it is indeed using %r9.
I am using gcc 4.4.5. What do you think might be happening? I thought GCC would always save and restore registers across function calls. Is there some way I can enforce that?
Thanks!
EDIT: By the way, I'm compiling the program like this
gcc -static -m64 -mmmx -msse -msse2 -O0 test.c -o test
The ABI, section 3.2.1 says:
Registers %rbp, %rbx and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
so you shouldn't expect registers other than %rbp, %rbx and %r12 through %r15 to be preserved by a function call.
gcc will not make explicit-register variables like this callee-saved. Basically this register notation you're using makes the variable a direct alias for the register, with the assumption you want to be able to read back the value a callee leaves in the register. If you used a callee-saved register instead of a call-clobbered (caller-saved) register, the problem would go away.
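A minimal sketch of that suggestion (my own example, in the same style as the question's test.c, with a harmless NOP byte standing in for the custom instruction; it assumes the SysV x86-64 ABI, where r12 is callee-saved):

#include <stdio.h>

int main(void) {
    register unsigned long val asm("r12") = 101;   // r12 is callee-saved, so printf must preserve it
    printf("foo %s\n", "bar");
    asm volatile(".byte 0x90" : /* no output */ : "q"(val));   // placeholder for the custom instruction
    return 0;
}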

Can a C/C++ compiler inline builtin functions like malloc()?

While inspecting the disassembly of the function below,
#include <cstdlib>

void * malloc_float_align(size_t n, unsigned int a, float *& dizi)
{
    void * adres=NULL;
    void * adres2=NULL;
    adres=malloc(n*sizeof(float)+a);
    size_t adr=(size_t)adres;
    size_t adr2=adr+a-(adr&(a-1u));
    adres2=(void *)adr2;
    dizi=(float *)adres2;
    return adres;
}
I noticed that builtin functions are not inlined, even with the inline optimization flags set.
; Line 26
$LN4:
push rbx
sub rsp, 32 ; 00000020H
; Line 29
mov ecx, 160 ; 000000a0H
mov rbx, r8
call QWORD PTR __imp_malloc <------this is not inlined
; Line 31
mov rcx, rax
; Line 33
mov rdx, rax
and ecx, 31
sub rdx, rcx
add rdx, 32 ; 00000020H
mov QWORD PTR [rbx], rdx
; Line 35
add rsp, 32 ; 00000020H
pop rbx
ret 0
Question: is this a must-have property of functions like malloc? Can we inline it in some way to inspect it (or any other function like strcmp/new/free/delete)? Is this forbidden?
Typically the compiler will inline functions when it has the source code available during compilation (in other words, when the function is defined, rather than just declared as a prototype, in a header file).
However, in this case, the function (malloc) is in a DLL, so clearly the source code is not available to the compiler while it compiles your code. It has nothing to do with what malloc does, etc. However, it's also likely that malloc won't be inlined anyway, since it is a fairly large function [at least it often is], which prevents it from being inlined even if the source code is available.
If you are using Visual Studio, you can almost certainly find the source code for your runtime library, as it is supplied with the Visual Studio package.
(The C runtime functions are in a DLL because many different programs in the system use the same functions, so putting them in a DLL that is loaded once for all "users" of the functionality will give a good saving on the size of all the code in the system. Although malloc is perhaps only a few hundred bytes, a function like printf can easily add some 5-25KB to the size of an executable. Multiply that by the number of "users" of printf, and there is likely several hundred kilobytes just from that one function "saved" - and of course, all other functions such as fopen, fclose, malloc, calloc, free, and so on all add a little bit each to the overall size)
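To illustrate the point about definition visibility, here is a small sketch (hypothetical names, my own example): the helper whose body the compiler can see gets inlined, while malloc, whose definition lives in the runtime library, stays a call.

#include <cstdlib>
#include <cstdint>

// Definition visible here, so the compiler is free to inline it at the call site.
// (Assumes a is a power of two.)
static inline void* align_up(void* p, std::size_t a) {
    std::uintptr_t v = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<void*>((v + a - 1) & ~(a - 1));
}

void* example(std::size_t n) {
    void* raw = std::malloc(n + 32);   // only a declaration is visible: emitted as a call
    return align_up(raw, 32);          // typically inlined away at -O2
}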
A C compiler is allowed to inline malloc (or, as you see in your example, part of it), but it is not required to inline anything. The heuristics it uses need not be documented, and they're usually quite complex, but normally only short functions will be inlined, since otherwise code-bloat is likely.
malloc and friends are implemented in the runtime library, so they're not available for inlining. They would need to have their implementation in their header files for that to happen.
If you want to see their disassembly, you could step into them with a debugger. Or, depending on the compiler and runtime you're using, the source code might be available. It is available for both gcc and msvc, for example.
The main thing stopping the inlining of malloc() et al is their complexity — and the obvious fact that no inline definition of the function is provided. Besides, you may need different versions of the function at different times; it would be harder (messier) for tools like valgrind to work, and you could not arrange to use a debugging version of the functions if their code is expanded inline.