C++ multithreaded inline asm - c++

I'm looking to run inline asm in a thread but I get a seg fault on certain instructions, such as the below:
#include <thread>
void Foo(int &x)
{
int temp;
asm volatile ("movl $5, %%edx;"
"movl $3, %%eax;"
"addl %%edx, %%eax;"
"movl %%eax, -24(%%rbp);" // seg faults here
"movl -24(%%rbp), %0;"
: "=r" (temp) : : );
x=temp;
}
int main()
{
int x;
std::thread t1(Foo, std::ref(x));
t1.join();
return 0;
}
(I'm using std::ref to be able to pass a reference to an std::thread, but have to use the temp variable because the extended asm syntax doesn't work with references.)
I've tried clobbering all involved registers and it doesn't help. If I don't pass any arguments to the thread, it seems to work: and it strangely also works if I clobber the %ebx register (which isn't involved). I'm using gcc 4.8.4 on 64-bit Ubuntu 14.04.
What is the best/safest way I can execute inline asm in a thread?

I'm still not completely convinced I understand what your objectives are. I find the comment "but have to use the temp variable because the extended asm syntax doesn't work with references." a bit unusual. I'd expect you can pass a reference into an assembler template, since it really is just a pointer under the hood. Since you are using a contrived example, I'll keep it, but I'll drop the move of data to an arbitrary location on the stack and add the registers that get clobbered to the clobber list. You could probably go with something like:
#include <thread>
void Foo(int &x)
{
asm volatile ("movl $5, %%edx;" // We clobber EDX
"movl $3, %%eax;" // We clobber EAX
"addl %%edx, %%eax;" // Result in EAX=3+5=8
"movl %%eax, %0;" // Move to variable x
: "=r" (x) : : "eax", "edx" );
}
int main()
{
int x;
std::thread t1(Foo, std::ref(x));
t1.join();
return 0;
}
I've added "eax" and "edx" to the clobber list since we destroy them in our assembler template (and they don't appear as input or output constraints). You should also notice I don't use a temporary variable. The assembler code can be reduced to a single instruction since the contrived example is the equivalent of movl $8, %0;.
You could also use the memory address (reference) to x like this:
void Foo(int &x)
{
asm volatile ("movl $5, %%edx;" // We clobber EDX
"movl $3, %%eax;" // We clobber EAX
"addl %%edx, %%eax;" // Result in EAX=3+5=8
"movl %%eax, %0;" // Move to variable x
: "=mr" (x) : : "eax", "edx" );
}
In this case I use =mr (memory operand or register) as an output constraint. This allows us to move a value right to the memory operand without an intermediate register.
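As a further variation on the examples above, the arithmetic can be expressed through operands only, so that no register is named and no stack offset is hard-coded. This is a minimal sketch rather than part of the original answer: add_asm and run_foo_in_thread are hypothetical names, and the non-x86 branch is only a portable fallback.

```cpp
#include <cassert>
#include <functional>
#include <thread>

// Hypothetical helper: returns a + b using a single x86 `add`.
// The compiler chooses the registers, so the clobber list only needs the flags.
inline int add_asm(int a, int b) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("addl %[rhs], %[acc]"
             : [acc] "+r" (a)      // read and written by the instruction
             : [rhs] "r" (b)       // read only
             : "cc");              // add modifies the flags
    return a;
#else
    return a + b;                  // portable fallback for other targets
#endif
}

// Thread entry point in the style of Foo from the question.
void Foo(int &x) {
    x = add_asm(3, 5);
}

// Hypothetical wrapper: runs Foo on a std::thread and returns what it stored.
int run_foo_in_thread() {
    int x = 0;
    std::thread t1(Foo, std::ref(x));
    t1.join();
    return x;
}
```

Calling run_foo_in_thread() needs the usual thread support at link time (for example -pthread with GCC on older systems). Nothing in the asm statement itself is thread-specific.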

Related

How to use inline assembly to write data with MOVNTI instruction to variable memory address? [duplicate]

I am trying to understand some things about inline assembler in Linux. I am using following function:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(var) );
return;
}
It generates following assembler code:
.globl test_func
.type test_func, #function
test_func:
pushl %ebp
movl %esp, %ebp
#APP
# 336 "opers.c" 1
addl 8(%ebp), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
.size test_func, .-test_func
It sums var mem address to eax register value instead var value.
Is there any way to tell addl instruction to use var value instead var mem address without copying var mem address to a register?
Regards
It sums var mem address to eax register value instead var value.
Yes, the syntax of gcc inline assembly is pretty arcane. Paraphrasing from the relevant section in the GCC Inline Assembly HOWTO "m" roughly gives you the memory location of the C-variable.
It's what you'd use when you just want an address you can write to or read from. Notice I said the location of the C variable, so %0 is set to the address of Word32 *var - you have a pointer to a pointer. A C translation of the inline assembly block could look like EAX += *(&var) because you can say that the "m" constraint implicitly takes the address of the C variable and gives you an address expression, that you then add to %eax.
Is there any way to tell addl instruction to use var value instead var mem address without copying var mem address to a register?
That depends on what you mean. You need to get var from the stack, so someone has to dereference memory (see @Bo Persson's answer), but you don't have to do it in inline assembly.
The constraint needs to be "m"(*var) (as @fazo suggested). That will give you the memory location of the value that var is pointing to, rather than a memory location pointing to it.
The generated code is now:
test_func:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
#APP
# 2 "test.c" 1
addl (%eax), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
Which is a little suspect, but that's understandable as you forgot to tell GCC that you clobbered (modified without having in the input/output list) %eax. Fixing that asm("addl %0, %%eax" : : "m"(*var) : "%eax" ) generates:
movl 8(%ebp), %edx
addl (%edx), %eax
Which isn't any better or more correct in this case, but it is always a good practice to remember. See the section on the clobber list and pay special attention to the "memory" clobber for advanced usage of inline assembly.
Even though you don't want to (explicitly) load the memory address into a register I'll briefly cover it.
Changing the constraint from "m" to "r" almost seems to work; the relevant section changes to (if we include %eax in the clobber list):
movl 8(%ebp), %edx
addl %edx, %eax
Which is almost correct, we have loaded the pointer value var into a register, but now we have to specify ourselves that we're loading from memory. Changing the code to match the constraint (usually undesirable, I'm only showing it for completeness):
asm("addl (%0), %%eax" : : "r"(var) : "%eax" );
Gives:
movl 8(%ebp), %edx
addl (%edx), %eax
The same as with "m".
Yes, because you are passing var, which is an address. Pass *var instead,
like:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(*var) );
return;
}
I don't remember exactly, but you may need to replace "m" with "r".
A memory operand doesn't mean that it will take the value from that address; it's just a pointer.
No, there is no addressing mode for x86 processors that goes two levels indirect.
You have to first load the pointer from a memory address and then load indirectly from the pointer.
An "m" constraint doesn't implicitly dereference anything. It's just like an "r" constraint, except it expands to an addressing mode for a memory location holding the value of the expression, instead of a register. (In C, every object has an address, although often that can be optimized away.)
The C object that's an input (or output for "=m") for the asm is the lvalue or rvalue you specify, e.g. "m"(var) takes the value of var, not *var. So you'd be adding the pointer. (And telling the compiler that you want that input pointer value to be in memory, not a register.)
Perhaps it's confusing you that you have a pointer but you called it var, not ptr or something? A C pointer is an object whose value is an address, and can itself be stored in memory. If you were using C++, Word32 &var would make the dereference implicit whenever you write var.
In C terms, you're doing eax += ptr, but you want eax += *ptr, so you should write
void test_func(Word32 *ptr){
asm( "add %[input], %%eax"
: // no outputs. Probably you should use "+a"(add_to_this) if you want the add result, and remove the EAX clobber.
: [input] "m"(*ptr) // the pointed-to Word32 in memory
: "eax" // the instruction modifies EAX; tell the compiler about it
);
}
Compiling (Godbolt compiler explorer) results in:
# gcc -O3 -m32
test_func:
movl 4(%esp), %edx # compiler-generated load of the function arg
add (%edx), %eax # from asm template, (%edx) filled in as %[input] for *ptr
ret
Or if you'd compiled with -mregparm=3, or a 64-bit build, the arg would already be in a register. e.g. 64-bit GCC emits add (%rdi), %eax ; ret.
If you'd written return *ptr in C for a function returning Word32, with no inline asm, the asm would be similar, loading the pointer arg from the stack and then mov (%edx), %eax to load the return value. See the Godbolt link for that.
If inline asm isn't doing what you expect, look at the compiler generated asm to see how it filled in your template. That sometimes helps you figure out what the compiler thought you meant. (But only if you understand the basic design principles.)
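The "+a"(add_to_this) idea from the comment in the code above can be sketched as follows. This is an illustration under assumptions, not the answer's code: add_pointee is a hypothetical name, the accumulator uses the more general "+r" instead of "+a", and the non-x86 branch is only a portable fallback.

```cpp
#include <cassert>

// Hypothetical helper: returns acc + *ptr.
inline int add_pointee(int acc, const int *ptr) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("addl %[input], %[acc]"
             : [acc] "+r" (acc)       // accumulator: both read and written
             : [input] "m" (*ptr)     // the pointed-to int as a memory operand
             : "cc");                 // add modifies the flags
#else
    acc += *ptr;                      // portable fallback for other targets
#endif
    return acc;
}
```

Because the result is an output operand, no register has to be named in the clobber list, and the compiler is free to pick any addressing mode for *ptr.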
If you write "m"(ptr), it compiles as follows:
void add_pointer(Word32 *ptr)
{
asm( "add %[input], %%eax" : : [input] "m"(ptr) : "eax" );
}
add_pointer:
add 4(%esp), %eax # ptr
ret
Very similar to if you wrote Word32 *bar(Word32 *ptr){ return ptr; }
Note that if you wanted to increment the memory location, you'd use a "+m"(*ptr) constraint to tell the compiler that the pointed-to memory is both an input and output. Or if you write-only to the memory, "=m"(*ptr) so it can potentially optimize away earlier dead stores to this memory location.
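A read-modify-write memory operand could look like the following minimal sketch. add_to_memory is a hypothetical name, and the non-x86 branch is only a portable fallback.

```cpp
#include <cassert>

// Hypothetical helper: adds `amount` to *ptr with one x86 `add` to memory.
inline void add_to_memory(int *ptr, int amount) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("addl %[amt], %[dst]"
             : [dst] "+m" (*ptr)      // the pointed-to int is input and output
             : [amt] "ri" (amount)    // a register or an immediate
             : "cc");                 // add modifies the flags
#else
    *ptr += amount;                   // portable fallback for other targets
#endif
}
```

Since *ptr is declared as an operand, the compiler knows exactly which memory changes, and no "memory" clobber is needed.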
See also How can I indicate that the memory *pointed* to by an inline ASM argument may be used? to handle cases where you use an "r"(ptr) input and dereference the pointer manually inside the asm, accessing memory that you didn't tell the compiler about as being an input or output operand.
Generally avoid doing "r"(ptr) and then manually doing add (%0), %%eax. It needs extra constraints to make it safe, and it forces the compiler to materialize the exact address in a register, instead of using an addressing mode to reach it relative to some other register. e.g. 4(%ecx) if after inlining it sees that you're actually passing a pointer into an array or to a struct member.
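If you do end up with the pointer in a register, one way to describe the memory access to the compiler is a dummy "m" input that the template never references. The sketch below assumes a hypothetical helper name, load_via_register, and falls back to plain C++ on non-x86 targets.

```cpp
#include <cassert>

// Hypothetical helper: loads *ptr through a register operand.
inline int load_via_register(const int *ptr) {
    int result;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("movl (%[p]), %[out]"
             : [out] "=r" (result)
             : [p] "r" (ptr),         // the address, forced into a register
               "m" (*ptr));           // dummy input: tells the compiler *ptr is read
#else
    result = *ptr;                    // portable fallback for other targets
#endif
    return result;
}
```

The dummy operand costs nothing at run time. A "memory" clobber would also work, but it forces the compiler to assume that all memory may have been touched.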
Of course, generally avoid inline asm entirely unless you can't get the compiler to emit good enough asm without it. https://gcc.gnu.org/wiki/DontUseInlineAsm. If you do decide to use it, see https://stackoverflow.com/tags/inline-assembly/info for guides to avoid common mistakes.
Try
void test_func(Word32 *var){
asm( " mov %0, %%edx; \
addl (%%edx), %%eax" : : "m"(var) : "edx", "eax", "memory" );
return;
}

Correct way to implement inline assembler in c++ for xor operations on variables

I've recently seen an article on how the swap operation can be performed using XOR instead of a temporary variable. When I compile code using a ^= b; the result won't simply be (in AT&T syntax)
xor b, a
etc.
Instead, it loads the raw values into registers, XORs them, and writes the result back.
To optimize this, I want to write it in inline assembly so that the entire thing takes only three cycles and not 15 as it does normally.
I've tried multiple keywords like:
asm(...);
asm("...");
asm{...};
asm{"..."};
asm ...
__asm ...
None of that worked: gcc either rejects the syntax outright or reports
main.cpp: Assembler messages:
main.cpp:12: Error: too many memory references for `xor'
Basically, I want to use the variables defined in my c++ code used in the assembler block, using three lines to xor them and then have my swapped variables basically like this:
int main() {
volatile int a = 5;
volatile int b = 6;
asm {
xor a,b
xor b,a
xor a,b
};
//a should now be 6, b should be 5
}
To clarify:
I want to avoid using the compiler generated mov operations since they take more cpu cycles than just doing three xor operations which would take three cycles. How could I accomplish this?
To use inline assembly, you should use __asm__ volatile. However, this type of optimization may be premature. Just because there are more instructions does not mean the code is slower - some instructions can be really slow. For example, a floating point BCD store instruction (fbstp), while admittedly rare, takes over 200 cycles - compared to one cycle for a simple mov (Agner Fog's Optimization Guide is a good resource for these timings).
So, I implemented a bunch of "swap" functions, some in C++ and some in assembly, and did a bit of measuring, running each function 100 million times in a row.
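A measurement of this kind could be driven by a harness along the following lines. This is only a sketch under assumptions: the harness actually used for the numbers is not shown in this answer, time_swaps is a hypothetical name, and an optimizing compiler may simplify the loop unless the swapped values are used afterwards.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical harness: calls swap_fn(&a, &b) `iterations` times and
// returns the elapsed wall-clock time in microseconds.
template <typename SwapFn>
long long time_swaps(SwapFn swap_fn, long iterations, int &a, int &b) {
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        swap_fn(&a, &b);
    }
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// One candidate under test, with the same signature as the functions in this answer.
void std_swap(int *a, int *b) {
    std::swap(*a, *b);
}
```

Passing the values by reference lets the caller inspect them after the run, which also gives the compiler a reason to keep the swaps.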
Test cases
std::swap
std::swap is probably the preferred solution here. It does what you want (swap the values of two variables), works for most standard library types and not just for integers, clearly communicates what you are trying to achieve, and is portable across architectures.
void std_swap(int *a, int *b) {
std::swap(*a, *b);
}
Here is the generated assembly: It loads both values into registers, and then writes them back to the opposite memory locations.
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
XOR swap
This is what you were trying to do, in C++:
void xor_swap(int *a, int *b) {
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
This doesn't directly translate to only xor instructions, because there is no instruction on x86 that allows you to directly xor two locations in memory - you always need to load at least one of the two into a register:
movl (%rdi), %eax
xorl (%rsi), %eax
movl %eax, (%rdi)
xorl (%rsi), %eax
movl %eax, (%rsi)
xorl %eax, (%rdi)
You also generate a bunch of extra instructions because the two pointers may alias, i.e. point to overlapping memory areas. Then, changing one variable would also change the other, so the compiler needs to constantly store and re-load the values. An implementation using the compiler-specific __restrict keyword will compile to the same code as std_swap (thanks to @Ped7g for pointing out this flaw in the comments).
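The __restrict variant mentioned above would look roughly like this. xor_swap_restrict is a hypothetical name, and __restrict is a compiler extension rather than standard C++.

```cpp
#include <cassert>

// XOR swap where the caller promises that a and b do not alias.
// Calling it with a == b breaks that promise and is undefined behavior.
void xor_swap_restrict(int *__restrict a, int *__restrict b) {
    *a ^= *b;
    *b ^= *a;
    *a ^= *b;
}
```

Note that even the plain xor_swap zeroes the value when both pointers refer to the same object, so the no-alias promise matches a restriction the XOR trick already has.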
Swap with temporary variables
This is the "standard" swap with a temporary variable (that the compiler promptly optimizes out to the same code as std::swap):
void tmp_swap(int *a, int *b) {
int tmp = *a;
*a = *b;
*b = tmp;
}
The xchg instruction
xchg can swap a memory value with a register value - it seems perfect at first for your use case. However, it is really slow when you use it to access memory, as you will see later.
void xchg_asm_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"xchgl (%1), %%eax\n\t"
"movl %%eax, (%0)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax", "memory" /* the asm reads and writes *a and *b */
);
}
We need to load one of the two values into a register, because there is no xchg for two memory locations.
XOR swap in Assembly
I made two versions of the XOR-based swap in Assembly. The first one only loads one of the values in a register, the second loads both before swapping them and writing them back.
void xor_asm_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"xorl (%1), %%eax\n\t"
"xorl %%eax, (%1)\n\t"
"xorl (%1), %%eax\n\t"
"movl %%eax, (%0)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax", "memory" /* the asm reads and writes *a and *b */
);
}
void xor_asm_register_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"movl (%1), %%ecx\n\t"
"xorl %%ecx, %%eax\n\t"
"xorl %%eax, %%ecx\n\t"
"xorl %%ecx, %%eax\n\t"
"movl %%eax, (%0)\n\t"
"movl %%ecx, (%1)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax", "%ecx", "memory" /* the asm reads and writes *a and *b */
);
}
The results
You can view the full compilation results along with the generated assembly code on Godbolt.
On my machine, the timings (in microseconds) vary a bit, but are generally comparable:
std_swap: 127371
xor_swap: 150152
tmp_swap: 125896
xchg_asm_swap: 699355
xor_asm_swap: 130586
xor_asm_register_swap: 124718
You can see that std_swap, tmp_swap, xor_asm_swap, and xor_asm_register_swap are generally very similar in speed - in fact, if I move xor_asm_register_swap to the front, it turns out slightly slower than std_swap. Also note that tmp_swap is exactly the same assembly code as std_swap (although it regularly measures in as a bit faster, probably because of the ordering).
xor_swap implemented in C++ is slightly slower because the compiler generates an additional memory load/store for each of the instructions because of aliasing - as mentioned above, if we modify xor_swap to take int * __restrict a, int * __restrict b instead (meaning that a and b never alias), the compiler generates the same code as for std_swap and tmp_swap.
xchg_asm_swap, despite using the lowest number of instructions, is terribly slow (over four times slower than any of the other options), because xchg with a memory operand carries an implicit lock prefix and is therefore executed as an atomic read-modify-write.
Ultimately, you have the choice between using some custom assembly-based version (that is hard to understand and maintain) or just using std::swap (which is pretty much the opposite, and also benefits from any optimizations that the standard library designers can come up with, e.g. using vectorization on larger types). Since this is over one hundred million iterations, it should be clear that the potential improvement by using assembly code here is very small - if you improve at all (which is not clear) you'd shave off a couple of microseconds at most.
TL;DR: You shouldn't do that, just use std::swap(a, b)
Appendix: __asm__ volatile
I figured that it may make sense at this point to explain the inline assembly code a bit. __asm__ (in GNU mode, asm is enough) introduces a block of assembly code. The volatile is there to make sure the compiler doesn't optimize it away - it likes to just remove the block otherwise.
There are two forms of __asm__ volatile. One of them also deals with goto labels; I will not address it here. The other form takes up to four arguments, separated with colons (:):
The simplest form (__asm__ volatile ("rdtsc")) just dumps the assembly code, but does not really interact with the C++ code around it. In particular, you need to guess how variables are assigned to registers, which is not exactly good.
Note that the assembly code instructions are separated with "\n", because this assembly code is passed verbatim to the GNU assembler (gas).
The second argument is a list of output operands. You can specify what "type" they have (in particular, =r means "any register operand", and +r means "any register operand, but it is also used as an input"). For example, : "+r" (a), "+r" (b) tells the compiler to replace %0 (references the first of the operands) with the register containing a, and %1 with the register containing b.
This notation means you need to replace %eax (as you would normally reference eax in AT&T assembly notation) with %%eax to escape the percentage sign.
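You can also use ".intel_syntax noprefix\n" to switch to Intel's assembly syntax if you prefer, but then you have to switch back with ".att_syntax\n" at the end of the block, because the compiler's own output that follows is still in AT&T syntax. Compiling with -masm=intel is the more robust option.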
The third argument is the same, but deals with input-only operands.
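The fourth argument tells the compiler which registers and memory locations lose their values, so that it can still optimize safely around the assembly code. For example, a "memory" clobber tells the compiler that the assembly may read or write arbitrary memory, so it must not keep values cached in registers or reorder memory accesses across the block. It acts as a compiler-level barrier and does not by itself emit a hardware fence instruction. You can see that I added all the registers I used for temporary storage to this list.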
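To tie the four parts together, here is a small sketch that uses an output operand, an input operand pinned to a specific register, an escaped register name, and a clobber. rotate_left is a hypothetical example that does not appear in the benchmark, and the non-x86 branch is only a portable fallback.

```cpp
#include <cassert>

// Hypothetical helper: rotates `value` left by `count` bits.
inline unsigned rotate_left(unsigned value, unsigned char count) {
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("roll %%cl, %0"          // %%cl is a literal register, %0 is operand 0
             : "+r" (value)           // output that is also read as an input
             : "c" (count)            // input that must live in ecx, so %cl holds it
             : "cc");                 // rol modifies the flags
    return value;
#else
    count &= 31;                      // portable fallback for other targets
    return count ? (value << count) | (value >> (32 - count)) : value;
#endif
}
```

No volatile is used here because the block has no side effects beyond its output, so the compiler may drop it when the result is unused.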

Asm insertion in naked function

I have ubuntu 16.04, x86_64 arch, 4.15.0-39-generic kernel version.
GCC 8.1.0
I tried to rewrite these functions (from the first post at https://groups.google.com/forum/#!topic/comp.lang.c++.moderated/qHDCU73cEFc) from the Intel dialect to AT&T, and I did not succeed.
namespace atomic {
__declspec(naked)
static void*
ldptr_acq(void* volatile*) {
_asm {
MOV EAX, [ESP + 4]
MOV EAX, [EAX]
RET
}
}
__declspec(naked)
static void*
stptr_rel(void* volatile*, void* const) {
_asm {
MOV ECX, [ESP + 4]
MOV EAX, [ESP + 8]
MOV [ECX], EAX
RET
}
}
}
Then I wrote a simple program that should return the same pointer I pass in. I installed GCC version 8.1, which supports the naked attribute for functions (https://gcc.gnu.org/gcc-8/changes.html "The x86 port now supports the naked function attribute").
As far as I remember, this attribute tells the compiler not to create the prologue and epilogue of the function, so I can take the parameters from the stack myself and return them.
Code (doesn't work; it segfaults):
#include <cstdio>
#include <cstdlib>
__attribute__ ((naked))
int *get_num(int*) {
__asm__ (
"movl 4(%esp), %eax\n\t"
"movl (%eax), %eax\n\t"
"ret"
);
}
int main() {
int *i =(int*) malloc(sizeof(int));
*i = 5;
int *j = get_num(i);
printf("%d\n", *j);
free(i);
return 0;
}
Then I tried using 64-bit registers (this also segfaults):
__asm__ (
"movq 4(%rsp), %rax\n\t"
"movq (%rax), %rax\n\t"
"ret"
);
And only after I took the value out of rdi register - it all worked.
__asm__ (
"movq %rdi, %rax\n\t"
"ret"
);
Why did I fail to fetch the argument through the stack pointer? I probably made a mistake. Where is it?
Because the x86-64 System V calling convention passes args in registers, not on the stack, unlike the old inefficient i386 System V calling convention.
You always have to write asm that matches the calling convention, if you're writing the whole function in asm, like with a naked function or a stand-alone .S file.
GNU C extended asm allows you to use operands to specify the inputs to an asm statement, and the compiler will generate instructions to make that happen. (I wouldn't recommend using it until you understand asm and how compilers turn C into asm with optimization enabled, though.)
Also note that movq %rdi, %rax implements long *foo(long*p){return p;} not return *p. Perhaps you meant mov (%rdi), %rax to dereference the pointer arg?
And BTW, you definitely don't need and shouldn't use inline asm for this. https://gcc.gnu.org/wiki/DontUseInlineAsm, and see https://stackoverflow.com/tags/inline-assembly/info
In GNU C, you can cast a pointer to volatile uint64_t*. Or you can use __atomic_load_n (ptr, __ATOMIC_ACQUIRE) to get basically everything you were getting from that asm, without the overhead of a function call or any of the cost for the optimizer at the call-site of having all the call-clobbered registers be clobbered.
You can use them on any object: https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html Unlike C++11 where you can only do atomic ops on a std::atomic<T>.
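The two functions from the question could be expressed with these builtins roughly as follows. This is a sketch assuming GCC or Clang. The names mirror the question's ldptr_acq and stptr_rel, while the atomic_sketch namespace is made up for illustration.

```cpp
#include <cassert>

namespace atomic_sketch {

// Acquire-load of a pointer slot, standing in for ldptr_acq.
inline void *ldptr_acq(void *volatile *slot) {
    return __atomic_load_n(slot, __ATOMIC_ACQUIRE);
}

// Release-store into a pointer slot, standing in for stptr_rel.
inline void stptr_rel(void *volatile *slot, void *value) {
    __atomic_store_n(slot, value, __ATOMIC_RELEASE);
}

}  // namespace atomic_sketch
```

Because these are ordinary inline functions, the compiler can inline them at the call site and no calling convention has to be matched by hand.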

C++ jump to other method execution

In my C++ JNI-Agent project I am implementing a function that is given a variable number of parameters and passes execution on to another function:
// address of theOriginalFunction
public static void* originalfunc;
void* interceptor(JNIEnv *env, jclass clazz, ...){
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
The function above needs to just jump to the:
JNIEXPORT void JNICALL Java_main_Main_theOriginalFunction(JNIEnv *env, jclass clazz, jboolean p1, jbyte p2, jshort p3, jint p4, jlong p5, jfloat p6, jdouble p7, jintArray p8, jbyteArray p9){
// Do something
}
The code above works perfectly, the original function can read all the parameters correctly (tested with 9 parameters of different types including arrays).
Before jumping into the original function from the interceptor, though, I need to do some computations, and here I observe interesting behavior.
void* interceptor(JNIEnv *env, jclass clazz, ...){
int x = 10;
int y = 20;
int summ = x + y;
// NEED TO RESTORE ESP TO EBP SO THAT ORIGINAL FUNCTION READS PARAMETERS CORRECTLY
asm (
"movl %ebp, %esp;"
"mov %rbp, %rsp"
);
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
This still works fine: I am able to do some basic computations, then reset the stack pointer and jump to my original function, and the original function also reads the parameters from the var_args correctly. However, if I replace the basic int operations with malloc or printf("any string");, then after the jump into my original function the parameters get messed up and the original function ends up reading wrong values...
I have tried to debug this behavior and inspected the memory regions to see what is going wrong... Right before the jump, everything looks fine there: ebp is followed by the function parameters.
If I jump without complicated computations, everything works fine: the memory region behind ebp doesn't change, and the original function reads correct values...
Now if I jump after doing printf (for example), the parameters read by the original method get corrupted...
What is causing this strange behavior? printf doesn't even store any local variables in my method... OK, it does store some literals in registers, but why does my stack get corrupted only after the jump and not already before it?
For this project I use the g++ version 4.9.1 compiler running on a Windows machine.
And yes, I am aware of the std::forward and template options, but they just do not work in my case... And yes, I know that jumping into other methods is a bit hacky, but that's my only idea of how to get the JNI interceptor to work...
******************** EDIT ********************
As discussed i am adding the generated assembler code with the source functions.
Function without printf (which works fine):
void* interceptor(JNIEnv *env, jclass clazz, ...){
//just an example
int x=8;
// restoring stack pointers
asm (
"movl %ebp, %esp;"
"mov %rbp, %rsp"
);
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
void* interceptor(JNIEnv *env, jclass clazz, ...){
// first when interceptor is called, probably some parameter restoring...
push %rbp
mov %rsp, %rbp
sub $0x30, %rsp
mov %rcx, 0x10(%rbp)
mov %r8, 0x20(%rbp)
mov %r9, 0x28(%rbp)
mov %rdx, 0x18(%rbp)
// int x = 8;
movl $0x8, -0x4(%rbp)
// my inline asm restoring stack pointers
mov %ebp, %esp
mov %rbp, %rsp
// asm volatile("jmp *%0;"::"r" (originalfunc+4))
mov 0xa698b(%rip),%rax // store originalfunc in rax
add $0x4, %rax
jmpq *%rax
// return NULL;
mov $0x0, %eax
}
Now asm output for printf variant...
void* interceptor(JNIEnv *env, jclass clazz, ...){
//just an example
int x=8;
printf("hey");
// restoring stack pointers
asm (
"movl %ebp, %esp;"
"mov %rbp, %rsp"
);
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
void* interceptor(JNIEnv *env, jclass clazz, ...){
// first when interceptor is called, probably some parameter restoring...
push %rbp
mov %rsp, %rbp
sub $0x30, %rsp
mov %rcx, 0x10(%rbp)
mov %r8, 0x20(%rbp)
mov %r9, 0x28(%rbp)
mov %rdx, 0x18(%rbp)
// int x = 8;
movl $0x8, -0x4(%rbp)
// printf("hey");
lea 0x86970(%rip), %rcx // stores "hey" in rcx???
callq 0x6b701450 // calls the print function, i guess
// my inline asm restoring stack pointers
mov %ebp, %esp
mov %rbp, %rsp
// asm volatile("jmp *%0;"::"r" (originalfunc+4))
mov 0xa698b(%rip),%rax // store originalfunc in rax
add $0x4, %rax
jmpq *%rax
// return NULL;
mov $0x0, %eax
}
And here is the asm code for the printf function:
printf(char const*, ...)
push %rbp
push %rbx
sub $0x38, %rsp
lea 0x80(%rsp), %rbp
mov %rdx, -0x28(%rbp)
mov %r8, -0x20(%rbp)
mov %r9, -0x18(%rbp)
mov %rcx, -0x30(%rbp)
lea -0x28(%rbp), %rax
mov %rax, -0x58(%rbp)
mov -0x58(%rbp), %rax
mov %rax, %rdx
mov -0x30(%rbp), %rcx
callq 0x6b70ff60 // (__mingw_vprintf)
mov %eax, %ebx
mov %ebx, %eax
add $0x38, %rsp
pop %rbx
pop %rbp
retq
It looks like printf does many operations on rbp , but i cannot see anything wrong with it...
And here is the asm code of the intercepted function.
push %rbp // 1 byte
mov %rsp, %rbp // 3 bytes , need to skip them
sub $0x50, %rsp
mov %rcx, 0x10(%rbp)
mov %rdx, 0x18(%rbp)
mov %r8d, %ecx
mov %r9d, %edx
mov 0x30(%rbp), %eax
mov %cl, 0x20(%rbp)
mov %dl, 0x28(%rbp)
mov %ax, -0x24(%rbp)
************* EDIT 2 **************
I thought it would be useful to see how memory changes at the run-time:
The first picture shows the memory layout right after entering the interceptor function:
The second picture shows the same memory region after the problematic code (like printf and so on).
The third picture shows the memory layout right after jumping to original function.
As you can see, right after calling printf , stack looks fine, however when i jump into the original function, it messes up...
Looking at the screenshots, I am pretty sure that all the parameters lie on the stack in the memory, and parameter are not passed by registers.
Arguments are passed according to a fixed calling convention. In this case, the first arguments are passed in registers, beginning with %rcx. Any modification of the registers used by the calling convention changes the arguments seen by the function you subsequently jmp to.
Calling printf before your jmp changes the value of %rcx from env to a pointer to the constant "hey". After the value of %rcx has been changed, you need to restore it to its previous value before jumping. The following code should work:
void* interceptor(JNIEnv *env, jclass clazz, ...){
//just an example
int x=8;
printf("hey");
// restoring stack pointers
asm (
"movl %ebp, %esp;"
"mov %rbp, %rsp"
);
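// restore the argument registers from the slots where the prologue saved them
// (AT&T order is source, destination; printf may overwrite all four registers)
asm volatile("mov 0x10(%rbp), %rcx;"
"mov 0x18(%rbp), %rdx;"
"mov 0x20(%rbp), %r8;"
"mov 0x28(%rbp), %r9");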
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
What architecture is this? From the register names, it appears to be x64.
You say the parameters are wrong. I agree. You jump from there to believing the stack is wrong. Probably not. x64 passes some parameters in registers, but not varargs. So the function signature for your forwarder is simply incompatible with the function you are trying to call.
Post the assembly for a direct call to Java_main_Main_theOriginalFunction and then for a call to your forwarder using the exact same parameters; you'll see a terrible difference in how the arguments are passed.
Most likely any function you call before your forwarding destroys the structure that is needed to handle the variable argument list (in your assembly there is still the mingw_printf call of which you didn't show the disassembly).
To understand better what's going on you might want to have a look at this question.
To solve your problem you could consider to add another indirection, I think that the following might work (but I haven't tested it).
void *forward_interceptor(JNIEnv *env, jclass clazz, ... ) {
// add 4 to the function address to skip "push ebp / mov ebp esp"
asm volatile("jmp *%0;"::"r" (originalfunc+4));
// will not get here anyway
return NULL;
}
void* interceptor(JNIEnv *env, jclass clazz, ...){
//do your preparations
...
va_list args;
va_start(args, clazz);
forward_interceptor(env, clazz, args);
va_end(args);
}
IMHO the important thing is that you need the va_list/va_start/va_end setup to make sure that the parameters are properly passed on to the next function.
However, since you seem to know the signature of the function you are forwarding to and it doesn't seem to accept a variable number of arguments, why not extract the arguments, and call the function properly like:
void* interceptor(JNIEnv *env, jclass clazz, ...){
//do your preparations
...
va_list args;
va_start(args, clazz);
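// types narrower than int are promoted to int when passed through "...",
// so they have to be read back as int and then converted
jboolean p1 = (jboolean) va_arg(args, int);
jbyte p2 = (jbyte) va_arg(args, int);
jshort p3 = (jshort) va_arg(args, int);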
...
Java_main_Main_theOriginalFunction(env, clazz, p1, p2, ...
va_end(args);
return NULL;
}
Note, however, that va_arg can not check whether the parameter is of the correct type or available at all.
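The extraction approach can be tried out in isolation with a stand-in for the JNI function. In this sketch, target and interceptor are hypothetical, and the standard types signed char, short, int and long long take the place of the JNI typedefs. It also shows that arguments narrower than int arrive as int after default argument promotion and must be read back that way.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical fixed-signature function standing in for the real JNI target.
long long target(signed char p1, short p2, int p3, long long p4) {
    return p1 + p2 + p3 + p4;
}

// Hypothetical variadic wrapper: the caller must pass exactly the four
// arguments that target expects, in the same order.
long long interceptor(int marker, ...) {
    va_list args;
    va_start(args, marker);
    // char and short were promoted to int by the caller, so read them as int.
    signed char p1 = static_cast<signed char>(va_arg(args, int));
    short p2 = static_cast<short>(va_arg(args, int));
    int p3 = va_arg(args, int);
    long long p4 = va_arg(args, long long);
    va_end(args);
    return target(p1, p2, p3, p4);
}
```

The same promotion rule turns float into double, so a float parameter would be read with va_arg(args, double).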

x86 logical address syntax error

Compiler: gcc 4.7.1, 32bit, ubuntu
Here's an example:
int main(void)
{
unsigned int mem = 0;
__asm volatile
(
"mov ebx, esp\n\t"
"mov %0, [ds : ebx]\n\t"
: "=m"(mem)
);
printf("mem = 0x%08x\n", mem);
return 0;
}
gcc -masm=intel -o app main.c
Assembler messages: invalid use of register!
As far as I know, ds and ss point to the same segment. I don't know why I can't use [ds : ebx] logical address for addressing.
Your code has two problems, and possibly a third:
One: the indirect memory reference should be:
mov %0, ds : [ebx]
That is, with the ds out of the brackets.
Two: A single instruction cannot have both origin and destination in memory, you have to use a register. The easiest way would be to indicate =g that basically means whatever, but in your case it is not possible because esp cannot be moved directly to memory. You have to use =r.
Three: (?) You are clobbering the ebx register, so you should declare it as such, or else not use it that way. That will not prevent compilation, but it may make your code behave erratically.
In short:
unsigned int mem = 0;
__asm volatile
(
"mov ebx, esp\n\t"
"mov %0, ds : [ebx]\n\t"
: "=r"(mem) :: "ebx"
);
Or better, don't force the use of ebx; let the compiler pick a scratch register instead:
unsigned int mem = 0, temp;
__asm volatile
(
"mov %1, esp\n\t"
"mov %0, ds : [%1]\n\t"
: "=r"(mem), "=r"(temp) // temp is written by the asm, so it must be an output
);
BTW, you don't need the volatile keyword in this code. It is used to keep the asm block from being optimized away even if its output is not needed. If you write the code for a side effect, add volatile; if you write the code to get an output, do not add volatile. That way, if the optimizing compiler determines that the output is not needed, it will remove the whole block.
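For comparison, reading the stack pointer with a compiler-chosen register could look like the sketch below. It makes several assumptions: read_stack_pointer is a hypothetical name, the template is written in AT&T syntax so that it builds without -masm=intel, and the last branch is a rough portable fallback. volatile is kept in this sketch because the stack pointer is not a function of the operands, so without it the compiler would be allowed to merge two identical reads.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns the current value of the stack pointer.
inline std::uintptr_t read_stack_pointer() {
    std::uintptr_t sp;
#if defined(__x86_64__)
    __asm__ volatile ("mov %%rsp, %0" : "=r" (sp));
#elif defined(__i386__)
    __asm__ volatile ("mov %%esp, %0" : "=r" (sp));
#else
    int marker = 0;                   // fallback: address of a local variable
    sp = reinterpret_cast<std::uintptr_t>(&marker);
#endif
    return sp;
}
```

Because the destination is an output operand, nothing has to be listed as clobbered.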