Does TSan instrument inline assembly? - c++

I have some inline assembly in my project. Does TSan instrument it?
Consider this example:
T0: x++
T1: (inline assembly code) MOV x, 2;
Will we get a data race report here (assuming no synchronization at all)? If so, does it instrument all memory accesses in assembly, such as XADD?
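Below is a minimal reproducer of the scenario above. It assumes GCC or Clang on x86/x86-64 (AT&T syntax), and the names t0, t1 and run_race are made up for illustration. The asm store uses an "=m" output operand, so the compiler at least knows that x is written.

```cpp
#include <cassert>
#include <thread>

// Shared variable from the question, deliberately accessed without synchronization.
int x = 0;

// T0: a plain C++ increment. The compiler emits this load and store itself,
// so -fsanitize=thread can instrument them.
void t0() { x++; }

// T1: writes the same location from inline assembly (GNU syntax, x86/x86-64).
// The store happens inside the asm statement, which is opaque to the compiler.
void t1() {
#if defined(__x86_64__) || defined(__i386__)
    asm volatile("movl $2, %0" : "=m"(x));
#else
    x = 2;  // non-x86 fallback: a plain C++ store, so not the case being asked about
#endif
}

// Runs T0 and T1 concurrently. This is a data race by design, meant only
// as a reproducer to build with -fsanitize=thread.
int run_race() {
    x = 0;
    std::thread a(t0);
    std::thread b(t1);
    a.join();
    b.join();
    return x;
}
```

To try it, call run_race() from a program built with -fsanitize=thread. As far as I know, TSan's checks are inserted by the compiler around the loads and stores it generates itself. The store inside the asm statement (or an XADD, or any other instruction written in asm) is therefore not recorded, and I would not expect a report for this race.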

Related

The implementation difference between the interpreter and the JIT compiler for AtomicInteger.lazySet() in the HotSpot JVM

Refer to the existing discussion at AtomicInteger lazySet vs. set for the background of AtomicInteger.lazySet().
So, according to the semantics of AtomicInteger.lazySet(), on an x86 CPU AtomicInteger.lazySet() is equivalent to a normal write to the AtomicInteger's value, because the x86 memory model guarantees the ordering of write operations among themselves.
However, the runtime behavior of AtomicInteger.lazySet() differs between the interpreter and the JIT compiler (specifically the C2 compiler) in the JDK 8 HotSpot JVM, which confuses me.
First, create a simple Java application as a demo.
import java.util.concurrent.atomic.AtomicInteger;
public class App {
    public static void main(String[] args) throws Exception {
        AtomicInteger i = new AtomicInteger(0);
        i.lazySet(1);
        System.out.println(i.get());
    }
}
Then, dump the instructions for AtomicInteger.lazySet(), which come from the intrinsic provided by the C2 compiler:
$ java -Xcomp -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation -XX:CompileCommand=print,*AtomicInteger.lazySet App
...
0x00007f1bd927214c: mov %edx,0xc(%rsi) ;*invokevirtual putOrderedInt
; - java.util.concurrent.atomic.AtomicInteger::lazySet#8 (line 110)
As you can see, the operation is as expected a normal write.
Then, use GDB to trace the interpreter's runtime behavior for AtomicInteger.lazySet().
$ gdb --args java App
(gdb) b Unsafe_SetOrderedInt
0x7ffff69ae836 callq 0x7ffff69b6642 <OrderAccess::release_store_fence(int volatile*, int)>
0x7ffff69b6642:
push %rbp
mov %rsp,%rbp
mov %rdi,-0x8(%rbp)
mov %esi,-0xc(%rbp)
mov -0xc(%rbp),%eax
mov -0x8(%rbp),%rdx
xchg %eax,(%rdx)           // the write operation
mov %eax,-0xc(%rbp)
nop
pop %rbp
retq
As you can see, the operation is actually an XCHG instruction, which has implicit lock semantics and therefore brings exactly the performance overhead that AtomicInteger.lazySet() is intended to eliminate.
Does anyone know why there is such a difference? Thanks.
There is not much sense in optimizing a rarely used operation in the interpreter. This would increase development and maintenance costs for no visible benefit.
It's a common practice in HotSpot to implement optimizations only in a JIT compiler (either C2 or both C1+C2). The interpreter implementation just works, it does not need to be fast, because if the code is performance sensitive, it will be JIT-compiled anyway.
So, in the interpreter (i.e. on the slow path), Unsafe_SetOrderedInt is exactly the same as a volatile write.
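The same distinction exists in C++11 atomics, and the analogy may make the two code paths easier to compare. This is only a sketch, assuming GCC or Clang targeting x86-64; it is not HotSpot code, and lazy_set and volatile_set are illustrative names.

```cpp
#include <atomic>
#include <cassert>

// Like the C2 intrinsic for lazySet()/putOrderedInt: a release store.
// On x86-64, GCC and Clang compile this to a plain mov, because x86
// already keeps stores in order.
void lazy_set(std::atomic<int>& v, int value) {
    v.store(value, std::memory_order_release);
}

// Like the interpreter path (a volatile write): a sequentially consistent
// store. On x86-64 this needs a full barrier, so compilers emit either
// xchg (as in the GDB listing) or mov followed by mfence.
void volatile_set(std::atomic<int>& v, int value) {
    v.store(value, std::memory_order_seq_cst);
}
```

Both functions store the same value. They differ only in the ordering guarantee and therefore in the instruction the compiler has to pick, which mirrors the C2 mov versus interpreter xchg difference.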

How to force GCC to use jmp instruction instead of ret?

I am using stackful coroutines for network programming, but I am being penalized by invalidation of the return stack buffer (see
http://www.agner.org/optimize/microarchitecture.pdf, p. 36) during the context switch, because we manually change the SP register.
After testing in assembly, I found that the jmp instruction performs better than ret. However, I have some other functions, written in C++ and compiled by GCC, that indirectly call the context switch function. How can I force these functions to return using jmp instead of ret in GCC's assembly output?
Some common but imperfect approaches:
Using inline assembly to manually set the SP register to __builtin_frame_address+2*sizeof(void*) and jmp to the return address before ret?
This is an unsafe solution. In C++, local variables and rvalues are destroyed before the ret instruction, and we would skip those instructions if we jmp. What's worse, even in C, callee-saved registers must be restored before the ret instruction, and we would skip those instructions too.
So what can we do to force GCC to use jmp instead of ret while avoiding the problems listed above?
Use an assembler macro:
.macro ret
pop %ecx
jmp *%ecx
.endm
Put that in inline assembler at the top of the file or elsewhere.
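To see what the macro expands to without redefining every ret in the file, here is a sketch of one hand-written function that returns through pop plus an indirect jmp. It assumes x86-64 Linux (ELF, System V ABI), so it uses %rcx instead of the answer's 32-bit %ecx. add_one_jmp_ret is a hypothetical name. rcx is call-clobbered and is not a return-value register, so it is free to use at the point of return.

```cpp
#include <cassert>

#if defined(__x86_64__) && defined(__ELF__)
// Assumption: System V x86-64 ABI (int argument in edi, result in eax).
// The function "returns" with pop + jmp, which is what the ret macro produces.
asm(".pushsection .text\n"
    ".globl add_one_jmp_ret\n"
    ".type add_one_jmp_ret, @function\n"
    "add_one_jmp_ret:\n"
    "    leal 1(%rdi), %eax\n"  // result = argument + 1
    "    popq %rcx\n"           // return address -> rcx
    "    jmp *%rcx\n"           // return through an indirect jump instead of ret
    ".size add_one_jmp_ret, . - add_one_jmp_ret\n"
    ".popsection\n");
extern "C" int add_one_jmp_ret(int);
#else
// Fallback for other targets so the sketch still compiles.
extern "C" int add_one_jmp_ret(int x) { return x + 1; }
#endif
```

Callers use an ordinary call, so the only change is how control gets back. Note that a return like this would be expected to fault on systems that enforce CET shadow stacks, because the shadow stack still expects a matching ret.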

Is the Link Register (LR) affected by inline or naked functions?

I'm using an ARM Cortex-M4 processor. As far as I understand, the LR (link register) stores the return address of the currently executing function. However, do inline and/or naked functions affect it?
I'm working on implementing simple multitasking. I'd like to write some code that saves the execution context (pushing R0-R12 and LR to the stack) so that it can be restored later. After the context save, I have an SVC so the kernel can schedule another task. When it decides to schedule the current task again, it restores the stack and executes BX LR. I'm asking this question because I'd like BX LR to jump to the correct place.
Let's say I use arm-none-eabi-g++ and I'm not concerned with portability.
For example, if I have the following code with the always_inline attribute, the compiler will inline it, so there won't be a function call in the resulting machine code and LR is unaffected, right?
__attribute__((always_inline))
inline void Task::saveContext() {
asm volatile("PUSH {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12, LR}");
}
Then there is also the naked attribute, whose documentation says that the compiler will not generate prologue/epilogue sequences for the function. What exactly does that mean? Does a naked function still result in a function call, and does it affect LR?
__attribute__((naked))
void saveContext() {
asm volatile("PUSH {R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12, LR}");
}
Also, out of curiosity, what happens if a function is marked with both always_inline and naked? Does that make a difference?
Which is the correct way to ensure that a function call does not affect the LR?
As far as I understand, the LR (link register) stores the return address of the currently executing function.
Nope, lr simply receives the address of the following instruction upon execution of a bl or blx instruction. In the M-class architecture, it also receives a special magic value upon exception entry, which will trigger an exception return when used like a return address, making exception handlers look exactly the same as regular functions.
Once the function has been entered, the compiler is free to save that value elsewhere and use r14 as just another general-purpose register. Indeed, it needs to save the value somewhere if it wants to make any nested calls. With most compilers any non-leaf function will push lr to the stack as part of the prologue (and often take advantage of being able to pop it straight back into pc in the epilogue to return).
Which is the correct way to ensure that a function call does not affect the LR?
A function call by definition affects lr - otherwise it would be a goto, not a call (tail-calls notwithstanding, of course).
re: update. Leaving my old answer below, since it answers the original question before the edit.
__attribute__((naked)) basically exists so you can write the whole function in asm, inside asm statements instead of in a separate .S file. The compiler doesn't even emit a return instruction, you have to do that yourself. It doesn't make sense to use this for inline functions (like I already answered below).
Calling a naked function will generate the usual call sequence, with a bl my_naked_function, which of course sets LR to point to the instruction after the bl. A naked function is essentially a never-inline function that you write in asm. "prologue" and "epilogue" are the instructions that save and restore callee-saved registers, and the return instruction itself (bx lr).
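GCC (8 and later) and Clang also accept naked on x86-64, which makes this easy to try on a desktop machine. Below is a sketch assuming the System V x86-64 ABI. naked_add_one is a hypothetical name.

```cpp
#include <cassert>

#if defined(__x86_64__) && !defined(_WIN32)
// No prologue, no epilogue and no return instruction are generated, so the
// body must be basic asm implementing the whole function, including ret.
// Assumption: first int argument arrives in edi, the result goes in eax.
extern "C" __attribute__((naked)) int naked_add_one(int) {
    asm("leal 1(%rdi), %eax\n\t"
        "ret");
}
#else
// Fallback for other targets so the sketch still compiles.
extern "C" int naked_add_one(int x) { return x + 1; }
#endif
```

Callers still reach it with an ordinary call instruction, which pushes the return address just as bl sets LR on ARM. Only the function's own prologue and epilogue are missing.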
Try it and see. It's easy to look at gcc's asm output. I changed your function names to help explain what's going on, and fixed the syntax (The GNU C __attribute__ extension requires doubled parens).
extern void extfunc(void);
__attribute__((always_inline))
inline void break_the_stack() { asm volatile("PUSH LR"); }
__attribute__((naked))
void myFunc() {
asm volatile("PUSH {r3, LR}\n\t" // keep the stack aligned for our callee by pushing a dummy register along with LR
"bl extfunc\n\t"
"pop {r3, PC}"
);
}
int foo_simple(void) {
extfunc();
return 0;
}
int foo_using_inline(void) {
break_the_stack();
extfunc();
return 0;
}
asm output with gcc 4.8.2 -O2 for ARM (default is a thumb target, I think).
myFunc(): # I followed the compiler's foo_simple example for this
PUSH {r3, LR}
bl extfunc
pop {r3, PC}
foo_simple():
push {r3, lr}
bl extfunc()
movs r0, #0
pop {r3, pc}
foo_using_inline():
push {r3, lr}
PUSH LR
bl extfunc()
movs r0, #0
pop {r3, pc}
The extra push LR means we're popping the wrong data into PC. Maybe another copy of LR, in this case, but we're returning with a modified stack pointer, so the caller will break. Don't mess with LR or the stack in an inline function, unless you're trying to do some kind of binary instrumentation thing.
re: comments: if you just want to set a C variable = LR:
As @Notlikethat points out, LR might not hold the return address. So you might want __builtin_return_address(0) to get the return address of the current function. However, if you're just trying to save register state, then you should save/restore whatever the function has in LR if you hope to correctly resume execution at this point:
#define get_lr(lr_val) asm ("mov %0, lr" : "=r" (lr_val))
This might need to be volatile to stop it from being hoisted up the call tree during whole-program optimization.
This leads to an extra mov instruction, when perhaps the ideal sequence would be to store lr directly rather than copy it to another register first. Since ARM uses different instructions for a reg-reg move and a store to memory, you can't just use an "rm" constraint on the output operand to give the compiler that option.
You could wrap this inside an inline function. A GNU C statement-expression in a macro would also work, but an inline function should be fine:
static inline __attribute__((always_inline)) void* current_lr(void) { // This should work correctly when inlined, or just use the macro
    void* lr;
    get_lr(lr);
    return lr;
}
For reference: What are SP (stack) and LR in ARM?
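Here is a sketch of the __builtin_return_address(0) route mentioned above, assuming GCC or Clang. current_return_address is a hypothetical name. Unlike reading LR with asm, the builtin stays correct after the compiler has saved LR to the stack and reused r14, because the compiler knows where it put the value.

```cpp
#include <cassert>

// Returns the address this function will return to: the value LR held on
// entry on ARM, or the address pushed by call on x86. noinline keeps this a
// real call, so there is a caller to return to.
__attribute__((noinline)) void* current_return_address() {
    return __builtin_return_address(0);
}
```

Levels other than 0 are not reliable on every target, so level 0 is the safe choice here.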
A naked always_inline function is not useful.
The docs say a naked function can only contain asm statements, and only "Basic" asm (without operands, so you have to get args from the right place for the ABI yourself). Inlining that makes zero sense, because you won't know where the compiler put your args.
If you want to inline some asm, don't use a naked function. Instead, use an inline function with correct constraints for its input/output operands.
The x86 wiki has some good inline asm links, and they're not all specific to x86. For example, see the collection of GNU inline asm links at the end of this answer for examples of how to make good use of the syntax to let the compiler make as efficient code as possible around your asm fragment.
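Here is a minimal sketch of that alternative, assuming GCC or Clang. add_asm is a hypothetical name, and the x86 and ARM branches only illustrate the constraint syntax.

```cpp
#include <cassert>

// Extended asm wrapped in an ordinary inline function. The compiler picks
// the registers and knows exactly what the asm reads and writes, so this can
// be inlined anywhere without corrupting the caller's registers or stack.
// "+r": operand is read and written, in a register; "r": input in a register.
static inline int add_asm(int a, int b) {
#if defined(__x86_64__) || defined(__i386__)
    asm("addl %1, %0" : "+r"(a) : "r"(b) : "cc");  // add also writes the flags
#elif defined(__arm__)
    asm("add %0, %0, %1" : "+r"(a) : "r"(b));      // ARM/Thumb-2, e.g. Cortex-M4
#else
    a += b;  // fallback for other targets
#endif
    return a;
}
```

Unlike a naked function, nothing here touches LR or SP, and the arguments arrive wherever the compiler already has them.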

How does one pass on parameters in assembly?

I'm working on a hook in C++ and ASM. Currently I have made a simple inline hook that places a jump at the first instruction of the target function, which in this case is OutputDebugString, just for testing purposes.
The thing is that my hook finally works after about 3 days of research and figuring out the bits and pieces of how things work, but there is one problem: I have no idea how to change the parameters that come in to my "dummy" function before jumping on to the rest of the original function.
As you can see in my code, I have tried to change the parameter simply in C++, but of course this does not work because I'm popping all the registers afterwards :/
Anyway, here is my dummy function, which is what the hooked function jumps to:
static void __declspec(naked) MyDebugString(LPCTSTR lpOutputString) {
    __asm {
        PUSHAD
    }
    // This is where I suppose I could run my code, but I can't interfere with the parameters :/
    lpOutputString = L"new message!";
    __asm {
        POPAD
        MOV EDI, EDI
        PUSH EBP
        MOV EBP, ESP
        JMP Addr
    }
    original_DebugString(lpOutputString);
}
As I said, I understand why the code is not working; I just can't see a proper solution. Any help is greatly appreciated.
Every compiler has a protocol for calling functions using assembly language. The protocol may be stated deep in their manuals.
A faster method to find the function protocols is to have the compiler generate an assembly language listing for your function.
The best method for writing inline assembly is to:
First write the function in C++ source code
Next print out the assembly listing of the function.
Review and understand how the compiler generated assembly works.
Lastly, modify the internal assembly to suit your needs.
My preference is to write the C++ code as efficiently as I can (or to help the compiler produce optimal assembly language). I then review the assembly listing. I only change the inline assembly to invoke special processor features (such as block move instructions).
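Following this advice, here is a sketch of the question's dummy function written first as ordinary (not naked) C++ with the same signature as the hooked function. The compiler then applies the calling convention for both the incoming and the outgoing argument. DebugStringFn, g_last_message and fake_original are hypothetical names, added only so the sketch runs. original_DebugString comes from the question, and building the trampoline it would point to is not shown.

```cpp
#include <cassert>
#include <string>

// Assumption: for the real 32-bit OutputDebugStringW, both this pointer type
// and the detour would also need WINAPI (__stdcall) to match the API.
using DebugStringFn = void (*)(const wchar_t*);

// In a real hook this would point at a trampoline that runs the original
// function's overwritten first instructions and jumps back into it.
DebugStringFn original_DebugString = nullptr;

// The detour: the incoming argument is an ordinary parameter, and a different
// argument is passed on by a normal call, so no PUSHAD/POPAD is needed.
void MyDebugString(const wchar_t* lpOutputString) {
    (void)lpOutputString;  // could be inspected or logged here
    original_DebugString(L"new message!");
}

// Test stand-in for the original function: remembers the last message.
std::wstring g_last_message;
void fake_original(const wchar_t* s) { g_last_message = s; }
```

Compiling this and reading the assembly listing, as the answer suggests, shows exactly where the compiler reads the parameter and how it pushes the new one.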

frame pointer register 'ebx' modified by inline assembly code in xatomic.h

Okay, I've run into the most obscure bug I have ever encountered. I would have commented on an almost identical question, but I do not have enough reputation. :(
The bug makes my program try to execute a memory area that is not executable when it returns from a function: "Access violation on executing address 0x00000000".
I tracked the bug down to Visual Studio 2012's xatomic.h header (included via the C++11 standard header <atomic>), where x86 inline assembly overwrites the ebx register. Once that happens, the thread's stack is permanently corrupted.
I know quite precisely when this happens. The bug is triggered by the boost::lockfree::queue::empty() function, and only in a release build with optimizations on. It only appears when the compiler inlines empty() into its caller. The program works perfectly fine in debug mode, where empty() is not inlined.
I get many compiler warnings about modifying the ebx register:
"include\boost-1_55\boost\atomic\detail\windows.hpp(1598): warning C4731: 'BuzyStack<JobPool>::push' : frame pointer register 'ebx' modified by inline assembly code"
"include\boost-1_55\boost\atomic\detail\windows.hpp(1598): warning C4731: 'BuzyStack<JobPool>::push' : frame pointer register 'ebx' modified by inline assembly code"
"y:\work\visualstudio\vc\include\xatomic.h(2133): warning C4731: 'ThreadSubSystem::join_pool' : frame pointer register 'ebx' modified by inline assembly code"
"y:\work\visualstudio\vc\include\xatomic.h(2137): warning C4731: 'ThreadSubSystem::join_pool' : frame pointer register 'ebx' modified by inline assembly code"
BuzyStack is my concurrent 'thread pool stack' that manages thread pools. Items can be concurrently pushed to and popped from the BuzyStack.
I really do need the boost::lockfree::queue::empty() function, so how do I fix this?
What I have already done is quite a radical step. I modified the __asm {} blocks in the Visual Studio 2012 (Update 4) xatomic.h header where the ebx register is overwritten. I force ebx to be preserved by saving it into a temporary variable at the beginning of each __asm block and restoring it at the end. This works: the bug is gone, but I can still see a point in my program where the call stack is temporarily invalid. Also, the number of compiler warnings doubled when I made this change.
(Update)
Sorry for being unclear with the question. In short: how do I fix this bug? It seems to be the MSVC compiler's fault.
I do not have any inline asm in my own code whatsoever. All the warnings are generated by code in the boost-1.55 atomic and lockfree libraries, plus the MSVC 2012 xatomic.h header.
The standard header modification was only a temporary workaround, and I no longer use either the modified header or the empty() function. The bug still exists and destroys my stack today if I try to call empty().