Why using 'divl' when doing int / unsigned int division - c++

I tested this code in X86.
void func()
{
int a, b;
unsigned int c, d;
int ret;
ret = a / b; // this line uses idivl, expected
ret = c / d; // this line uses divl, expected
ret = a / c; // this line uses divl..., surprised
ret = c / a; // this line uses divl..., surprised
ret = a * c; // this line uses imull, expected
}
I've pasted the assembly code here:
func:
pushl %ebp
movl %esp, %ebp
subl $36, %esp
movl -4(%ebp), %eax
movl %eax, %edx
sarl $31, %edx
idivl -8(%ebp)
movl %eax, -20(%ebp)
movl -12(%ebp), %eax
movl $0, %edx
divl -16(%ebp)
movl %eax, -20(%ebp)
movl -4(%ebp), %eax
movl $0, %edx
divl -12(%ebp)
movl %eax, -20(%ebp)
movl -4(%ebp), %eax
movl %eax, -36(%ebp)
movl -12(%ebp), %eax
movl $0, %edx
divl -36(%ebp)
movl %eax, -20(%ebp)
movl -4(%ebp), %eax
imull -12(%ebp), %eax
movl %eax, -20(%ebp)
leave
ret
Could you please tell me why the division between int and unsigned int uses divl instead of idivl?

Since the types of a and c have the same conversion rank, but a is signed and c is unsigned, a is converted to unsigned int before the division, in both a / c and c / a.
The compiler thus emits the unsigned division instruction div for these cases (as well as c / d, where both operands are unsigned).
The multiplication a * c is also an unsigned multiplication. In this case the compiler can get away with using the signed multiplication instruction imull, because the low 32 bits of the result are identical regardless of whether mull or imull is used; only the flags differ, and the generated code doesn't test them.
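To see the effect of that conversion, here is a small self-contained sketch (the values are just illustrative): when a holds a negative value, a / c first converts a to unsigned int, so the quotient computed by divl is not what signed division would give.
#include <iostream>
int main()
{
    int a = -6;
    unsigned int c = 3;
    // a is converted to unsigned int (4294967290 as a 32-bit unsigned int),
    // so the division is performed by divl, not idivl.
    std::cout << a / c << '\n';   // prints 1431655763, not -2
    std::cout << a / 3 << '\n';   // signed division: prints -2
}
The same rule (unsigned wins when the ranks are equal) is why both a / c and c / a above ended up as divl.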

Related

C++ GCC Optimization Speed slows down when local variable is copied to global variable

I have a question regarding GCC's optimization flags and how they work.
I have a very long piece of code that uses only local arrays and variables. At the end of the code, I copy the contents of the local array to a global array. Here is an extremely stripped-down example of my code:
uint8_t globalArray[16]={0};
void func()
{
unsigned char localArray[16]={0};
for (int r=0; r<1000000; r++)
{
**manipulate localArray with a lot of calculations**
}
memcpy(&globalArray,localArray,16);
}
Here's the approximate speed of the code in three different scenarios:
Without "-O3" optimization: 3.203s
With "-O3" optimization: 1.457s
With "-O3" optimization and without the final memcpy(&globalArray,localArray,16); statement: 0.015s
Without copying the local array into the global array, the code runs almost 100 times faster. I know that the global array is stored in memory and the local array is kept in registers. My question is:
Why does just copying 16 elements of a local array to a global array make execution 100 times slower? I have searched this forum and online, and I cannot find a definite answer for this particular scenario.
Is there any way that I can extract the contents of the local variable without the speed loss?
Thank you in advance to anyone that can help me with this problem.
Without the memcpy, your compiler will likely see that localArray is never read from, so it doesn't need to do any of the calculations in the loop body.
Take this code as an example:
uint8_t globalArray[16]={0};
void func()
{
unsigned char localArray[16]={0};
for (int r=0; r<1000000; r++)
{
localArray[r%16] = r;
}
memcpy(&globalArray,localArray,16);
}
Clang 3.7.1 with -O3 outputs this assembly:
func(): # #func()
# BB#0:
xorps %xmm0, %xmm0
movaps %xmm0, -24(%rsp)
#DEBUG_VALUE: r <- 0
xorl %eax, %eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
#DEBUG_VALUE: r <- 0
movl %eax, %ecx
sarl $31, %ecx
shrl $28, %ecx
leal (%rcx,%rax), %ecx
andl $-16, %ecx
movl %eax, %edx
subl %ecx, %edx
movslq %edx, %rcx
movb %al, -24(%rsp,%rcx)
leal 1(%rax), %ecx
#DEBUG_VALUE: r <- ECX
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 1(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 1(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 2(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 2(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 2(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 3(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 3(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 3(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
leal 4(%rax), %ecx
movl %ecx, %edx
sarl $31, %edx
shrl $28, %edx
leal 4(%rax,%rdx), %edx
andl $-16, %edx
negl %edx
leal 4(%rax,%rdx), %edx
movslq %edx, %rdx
movb %cl, -24(%rsp,%rdx)
addl $5, %eax
cmpl $1000000, %eax # imm = 0xF4240
jne .LBB0_1
# BB#2:
movaps -24(%rsp), %xmm0
movaps %xmm0, globalArray(%rip)
retq
For the same code without the memcpy, it outputs this:
func(): # #func()
# BB#0:
#DEBUG_VALUE: r <- 0
retq
Even if you know nothing about assembly, it's easy to see that the latter does nothing at all.
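If the goal is to time the loop itself without paying for the copy, one option (a sketch that assumes GCC or Clang, since it relies on their inline-assembly extension) is to tell the optimizer the array is used without actually storing it anywhere:
// GCC/Clang-specific helper: makes the pointed-to data appear "used"
// to the optimizer without emitting any instructions.
static void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}
void func()
{
    unsigned char localArray[16] = {0};
    for (int r = 0; r < 1000000; r++)
    {
        localArray[r % 16] = r;
    }
    escape(localArray);  // keeps the loop from being eliminated as dead code
}
With the array "escaped" this way, -O3 has to keep the loop, so the timing reflects the real work instead of an empty function.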

Why is this recursion so much faster than equivalent iteration?

I've been told many times that recursion is slow due to function calls, but in this code, it seems much faster than the iterative solution. At best, I typically expect a compiler to optimize recursion into iteration (which looking at the assembly, did seem to happen).
#include <iostream>
bool isDivisable(int x, int y)
{
for (int i = y; i != 1; --i)
if (x % i != 0)
return false;
return true;
}
bool isDivisableRec(int x, int y)
{
if (y == 1)
return true;
return x % y == 0 && isDivisableRec(x, y-1);
}
int findSmallest()
{
int x = 20;
for (; !isDivisable(x,20); ++x);
return x;
}
int main()
{
std::cout << findSmallest() << std::endl;
}
Assembly here: https://gist.github.com/PatrickAupperle/2b56e16e9e5a6a9b251e
I'd love to know what is going on here. I'm sure it is some tricky compiler optimization that I can be amazed to learn about.
Edit: I just realized I forgot to mention that the recursive version runs in about 0.25 seconds and the iterative in about 0.6.
Edit 2: I am compiling with -O3 using
$ g++ --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Though I'm not really sure whether that matters.
Edit 3:
Better benchmarking:
Source: http://gist.github.com/PatrickAupperle/ee8241ac51417437d012
Output: http://gist.github.com/PatrickAupperle/5870136a5552b83fd0f1
Running with 100 iterations shows very similar results
Edit 4:
At Roman's suggestion, I added -fno-inline-functions -fno-inline-small-functions to the compilation flags. The effect is extremely bizarre to me. The code runs about 15x faster, but the ratio between the recursive version and the iterative version remains similar.
https://gist.github.com/PatrickAupperle/3a87eb53a9f11c1f0bec
Using this code I also see a large timing difference (in favor of the recursive version) with GCC 4.9.3 in Cygwin. I get
13.411 seconds for iterative
4.29101 seconds for recursive
Looking at the assembly code it generated with -O3, I see two things
The compiler replaced the tail recursion in isDivisableRec with a loop and then unrolled that loop: each iteration of the loop in the machine code covers two levels of the original recursion.
_Z14isDivisableRecii:
.LFB1467:
.seh_endprologue
movl %edx, %r8d
.L15:
cmpl $1, %r8d
je .L18
movl %ecx, %eax ; First unrolled divisibility check
cltd
idivl %r8d
testl %edx, %edx
je .L20
.L19:
xorl %eax, %eax
ret
.p2align 4,,10
.L20:
leal -1(%r8), %r9d
cmpl $1, %r9d
jne .L21
.p2align 4,,10
.L18:
movl $1, %eax
ret
.p2align 4,,10
.L21:
movl %ecx, %eax ; Second unrolled divisibility check
cltd
idivl %r9d
testl %edx, %edx
jne .L19
subl $2, %r8d
jmp .L15
.seh_endproc
The compiler inlined several levels of isDivisableRec by lifting them into findSmallestRec. Since the value of the y parameter of isDivisableRec is hardcoded as 20, the compiler managed to replace the recursion levels for y = 20, 19, ..., 15 with some "magical" code inlined directly into findSmallestRec. The actual call to isDivisableRec happens only for a y value of 14 (if it happens at all).
Here's the inlined code in findSmallestRec
movl $20, %ebx
movl $1717986919, %esi ; Magic constants
movl $1808407283, %edi ; for divisibility tests
movl $954437177, %ebp ;
movl $2021161081, %r12d ;
movl $-2004318071, %r13d ;
jmp .L28
.p2align 4,,10
.L29: ; The main cycle
addl $1, %ebx
.L28:
movl %ebx, %eax ; Divisibility by 20 test
movl %ebx, %ecx
imull %esi
sarl $31, %ecx
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,4), %eax
sall $2, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 19 test
imull %edi
sarl $3, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
leal (%rdx,%rax,2), %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 18 test
imull %ebp
sarl $2, %edx
subl %ecx, %edx
leal (%rdx,%rdx,8), %eax
addl %eax, %eax
cmpl %eax, %ebx
jne .L29
movl %ebx, %eax ; Divisibility by 17 test
imull %r12d
sarl $3, %edx
subl %ecx, %edx
movl %edx, %eax
sall $4, %eax
addl %eax, %edx
cmpl %edx, %ebx
jne .L29
testb $15, %bl ; Divisibility by 16 test
jne .L29
movl %ebx, %eax ; Divisibility by 15 test
imull %r13d
leal (%rdx,%rbx), %eax
sarl $3, %eax
subl %ecx, %eax
movl %eax, %edx
sall $4, %edx
subl %eax, %edx
cmpl %edx, %ebx
jne .L29
movl $14, %edx
movl %ebx, %ecx
call _Z14isDivisableRecii ; call isDivisableRecii(x, 14)
...
The above blocks of machine instructions before each jne .L29 jump are the divisibility tests for 20, 19, ..., 15 lifted directly into findSmallestRec. Apparently, they are more efficient than the tests used inside isDivisableRec for a run-time value of y. As you can see, the divisibility-by-16 test is implemented simply as testb $15, %bl. Because of this, non-divisibility of x by the high values of y is caught early by the above highly optimized code.
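To make the power-of-two case concrete (a small sketch; the function name is just for illustration), testing divisibility by 16 never needs a division at all, which is exactly what the single testb $15, %bl instruction exploits:
// In two's complement, x is divisible by 16 exactly when its low four
// bits are all zero, so the test compiles to a single bit test.
bool divisibleBy16(int x)
{
    return (x & 15) == 0;   // same observable result as x % 16 == 0
}
For the non-power-of-two divisors the compiler instead uses the multiply-by-magic-constant sequences shown above rather than idivl.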
None of this happens for isDivisable and findSmallest - they are basically translated literally. Even the loop is not unrolled.
I believe it is the second optimization that accounts for most of the difference: the compiler uses highly optimized divisibility checks for the higher values of y, which happen to be known at compile time.
If you replace the second argument of isDivisableRec with an "unpredictable" run-time value of 20 (instead of hard-coded compile-time constant 20), it should disable this optimization and bring the timings in line. I just tried this and ended up with
12.9 seconds for iterative
13.26 seconds for recursive
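A minimal way to make the 20 "unpredictable" (a sketch; a volatile variable is just one option, reading the value from argc or from input works as well):
#include <iostream>
bool isDivisableRec(int x, int y)
{
    if (y == 1)
        return true;
    return x % y == 0 && isDivisableRec(x, y - 1);
}
int main()
{
    volatile int limit = 20;   // the optimizer cannot assume this stays 20
    int y = limit;
    int x = 20;
    for (; !isDivisableRec(x, y); ++x);
    std::cout << x << std::endl;
}
With y no longer a compile-time constant, the magic-constant divisibility tests can't be generated, which is what brings the two timings back in line.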

C++ Tail recursion using 64-bit variables

I have written a simple Fibonacci function as an exercise in C++ (using Visual Studio) to test tail recursion and see how it works.
This is the code:
int fib_tail(int n, int res, int next) {
if (n == 0) {
return res;
}
return fib_tail(n - 1, next, res + next);
}
int main()
{
fib_tail(10,0,1); //Tail Recursion works
}
When I compiled in Release mode I saw that the optimized assembly used a JMP instruction instead of a call, so my conclusion was: tail recursion works. See the image below:
I wanted to do some performance tests by increasing the input variable n in my Fibonacci function. I then opted to change the variable type used in the function from int to unsigned long long, and passed a big number like 10e+9.
This is now the new function:
typedef unsigned long long ULONG64;
ULONG64 fib_tail(ULONG64 n, ULONG64 res, ULONG64 next) {
if (n == 0) {
return res;
}
return fib_tail(n - 1, next, res + next);
}
int main()
{
fib_tail(10e+9,0,1); //Tail recursion does not work
}
When I ran the code above I got a stack overflow exception, which made me think that tail recursion was not working. I looked at the assembly and in fact I found this:
As you can see, there is now a call instruction whereas I was expecting only a simple JMP. I don't understand why using an 8-byte variable disables tail recursion. Why doesn't the compiler perform the optimization in this case?
This is one of those questions you'd have to ask the people who do compiler optimisation for MS - there is really no technical reason why ANY return type should prevent tail recursion from being compiled as a jump. There may be OTHER reasons, such as "the code is too complex to understand", or some such.
clang 3.7 as of a couple of weeks back clearly figures it out:
_Z8fib_tailyyy: # #_Z8fib_tailyyy
pushl %ebp
pushl %ebx
pushl %edi
pushl %esi
pushl %eax
movl 36(%esp), %ecx
movl 32(%esp), %esi
movl 28(%esp), %edi
movl 24(%esp), %ebx
movl %ebx, %eax
orl %edi, %eax
je .LBB0_1
movl 44(%esp), %ebp
movl 40(%esp), %eax
movl %eax, (%esp) # 4-byte Spill
.LBB0_3: # %if.end
movl %ebp, %edx
movl (%esp), %eax # 4-byte Reload
addl $-1, %ebx
adcl $-1, %edi
addl %eax, %esi
adcl %edx, %ecx
movl %ebx, %ebp
orl %edi, %ebp
movl %esi, (%esp) # 4-byte Spill
movl %ecx, %ebp
movl %eax, %esi
movl %edx, %ecx
jne .LBB0_3
jmp .LBB0_4
.LBB0_1:
movl %esi, %eax
movl %ecx, %edx
.LBB0_4: # %return
addl $4, %esp
popl %esi
popl %edi
popl %ebx
popl %ebp
retl
main: # #main
subl $28, %esp
movl $0, 20(%esp)
movl $1, 16(%esp)
movl $0, 12(%esp)
movl $0, 8(%esp)
movl $2, 4(%esp)
movl $1410065408, (%esp) # imm = 0x540BE400
calll _Z8fib_tailyyy
movl %edx, f+4
movl %eax, f
xorl %eax, %eax
addl $28, %esp
retl
The same applies to gcc 4.9.2 if you give it -O2 (but not at -O1, which was all clang needed).
(And of course also in 64-bit mode.)
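If you don't want correctness to depend on the optimizer performing this transformation, a straightforward fallback (a sketch with the same semantics as the tail-recursive fib_tail) is to write the loop by hand:
typedef unsigned long long ULONG64;
// Iterative equivalent of the tail-recursive version: it never grows the
// stack, regardless of compiler, optimization level, or argument width.
ULONG64 fib_iter(ULONG64 n, ULONG64 res, ULONG64 next)
{
    while (n != 0)
    {
        ULONG64 sum = res + next;
        res = next;
        next = sum;
        --n;
    }
    return res;
}
Calling fib_iter(10e+9, 0, 1) then cannot overflow the stack even in a debug build.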

Binary Operators Return Xvalue Instead of PRvalue?

According to this blog -- which I realize is old, if it is no longer considered relevant please let me know -- the best method for implementing binary operators is the following...
// The "usual implementation"
Matrix operator+(Matrix const& x, Matrix const& y)
{ Matrix temp = x; temp += y; return temp; }
// --- Handle rvalues ---
Matrix operator+(Matrix&& temp, const Matrix& y)
{ temp += y; return std::move(temp); }
Matrix operator+(const Matrix& x, Matrix&& temp)
{ temp += x; return std::move(temp); }
Matrix operator+(Matrix&& temp, Matrix&& y)
{ temp += y; return std::move(temp); }
I tested this implementation, and in expressions like the following...
a + b + c + d
where all of them are matrices, I ended up with many move-constructor and destructor calls that I don't believe are necessary. If the return type of all the operator+ overloads taking an rvalue Matrix were changed to Matrix&&, you would eliminate all the move constructors and need only a single destructor call.
I made a simple program to show both implementations with code here.
Could anyone explain whether doing this is wrong or bad, and why? I can't think of a reason not to do it this way. It saves many constructor and destructor calls and doesn't seem to break anything.
You are pessimizing your code with move constructors here. The matrix addition can be done safely without move constructors at all, and compilers are clever enough to optimize the temporaries away.
Here is some test code to prove what I'm saying:
#include <stdint.h>
class Matrix3
{
public:
float Mtx[3][3];
inline Matrix3() {};
inline Matrix3 operator+( const Matrix3& Matrix ) const
{
Matrix3 Result;
for ( size_t i = 0; i != 3; ++i )
{
for ( size_t j = 0; j != 3; ++j )
{
Result.Mtx[i][j] = Mtx[i][j] + Matrix.Mtx[i][j];
}
}
return Result;
}
virtual int GetResult() const
{
int Result = 0;
for ( size_t i = 0; i != 3; ++i )
{
for ( size_t j = 0; j != 3; ++j )
{
Result += (int)Mtx[i][j];
}
}
return Result;
}
};
int main()
{
Matrix3 M;
Matrix3 M1;
Matrix3 M2;
Matrix3 M3;
Matrix3 M4;
M = M1 + M2 + M3 + M4;
return M.GetResult();
}
I use GCC (GNU) 4.9.0 20131110 (experimental) as follows: g++ -O3 main.cpp -S
The output assembly looks like this:
_main:
pushl %ebp
movl %esp, %ebp
andl $-16, %esp
subl $176, %esp
call ___main
fnstcw 14(%esp)
fldz
fadd %st(0), %st
fadds LC0
fadds LC0
fsts 140(%esp)
movl 140(%esp), %eax
fsts 144(%esp)
movl %eax, 20(%esp)
movl 144(%esp), %eax
fsts 148(%esp)
movl %eax, 24(%esp)
fsts 152(%esp)
movl 148(%esp), %eax
fsts 156(%esp)
movl %eax, 28(%esp)
fsts 160(%esp)
movl 152(%esp), %eax
fsts 164(%esp)
movl %eax, 32(%esp)
fsts 168(%esp)
movl 156(%esp), %eax
fstps 172(%esp)
movl %eax, 36(%esp)
movl 160(%esp), %eax
flds 24(%esp)
movl %eax, 40(%esp)
movl 164(%esp), %eax
movl %eax, 44(%esp)
movl 168(%esp), %eax
movl %eax, 48(%esp)
movl 172(%esp), %eax
movl %eax, 52(%esp)
movzwl 14(%esp), %eax
movb $12, %ah
movw %ax, 12(%esp)
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %edx
flds 20(%esp)
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 28(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 32(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 36(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 40(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 44(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 48(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
flds 52(%esp)
addl %eax, %edx
fldcw 12(%esp)
fistpl 8(%esp)
fldcw 14(%esp)
movl 8(%esp), %eax
leave
addl %edx, %eax
ret
There is not a single trace of any copy/move constructor or any function call at all. Everything is unrolled into a fast math-grinding stream of instructions.
Seriously, there is no need to write additional overloads for rvalues. The compiler generates perfect code without them.
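If you want to measure this on your own type rather than read assembly, a small counting harness (a sketch; this simplified Matrix and its counters are purely illustrative) makes the number of copy and move constructions visible:
#include <iostream>
static int copies = 0;
static int moves = 0;
struct Matrix
{
    float m[3][3] = {};
    Matrix() = default;
    Matrix(const Matrix& o)     { ++copies; *this = o; }   // count copies
    Matrix(Matrix&& o) noexcept { ++moves;  *this = o; }   // count moves
    Matrix& operator=(const Matrix&) = default;
    Matrix& operator+=(const Matrix& o)
    {
        for (int i = 0; i != 3; ++i)
            for (int j = 0; j != 3; ++j)
                m[i][j] += o.m[i][j];
        return *this;
    }
};
// The "usual implementation" from the question
Matrix operator+(const Matrix& x, const Matrix& y)
{
    Matrix temp = x;
    temp += y;
    return temp;
}
int main()
{
    Matrix a, b, c, d;
    Matrix r = a + b + c + d;
    (void)r;
    std::cout << "copies: " << copies << ", moves: " << moves << '\n';
}
Re-running the same main with the rvalue overloads from the question added lets you compare the counts directly at your chosen optimization level.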

Calling method using inline assembler in gcc

So, as I said, I'm trying to call a method using inline asm with gcc. I looked into how x86 works and what the calling conventions are, then tried some simple calls, which worked perfectly. Then I tried to embed v8, which was my original goal, but it didn't work so well...
Here's my code:
v8::Handle<v8::Value> V8Method::staticInternalMethodCaller(const v8::Arguments& args, int argsize, void* object, void* method)
{
int i = 0;
char* native_args;
// Reserve argsize bytes on the stack (argsize is the array size in bytes)
asm("subl %1, %%esp;"
"movl %%esp, %0;"
: "=r"(native_args)
: "r"(argsize));
// This for loop only converts V8 type to native type,
// and puts them in the array:
for (; i < args.Length(); ++i)
{
if (args[i]->IsInt32())
{
*(int*)(native_args) = args[i]->Int32Value();
native_args += sizeof(int);
}
else if (args[i]->IsNumber())
{
*(float*)(native_args) = (float)(args[i]->NumberValue());
native_args += sizeof(float);
}
}
// Then call the method:
asm("call *%1;" : : "c"(object), "r"(method));
return v8::Null();
}
And here is the generated assembly:
__ZN3srl8V8Method26staticInternalMethodCallerERKN2v89ArgumentsEiPvS5_:
LFB1178:
.cfi_startproc
.cfi_personality 0,___gxx_personality_v0
.cfi_lsda 0,LLSDA1178
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
pushl %ebx
subl $68, %esp
.cfi_offset 3, -12
movl $0, -12(%ebp)
movl 12(%ebp), %eax
/APP
# 64 "method.cpp" 1
subl %eax, %esp; movl %esp, %ebx; addl $4, %esp
# 0 "" 2
/NO_APP
movl %ebx, -16(%ebp)
jmp L74
L77:
movl -12(%ebp), %eax
movl %eax, (%esp)
movl 8(%ebp), %ecx
LEHB25:
call __ZNK2v89ArgumentsixEi
LEHE25:
subl $4, %esp
movl %eax, -36(%ebp)
leal -36(%ebp), %eax
movl %eax, %ecx
call __ZNK2v86HandleINS_5ValueEEptEv
movl %eax, %ecx
LEHB26:
call __ZNK2v85Value7IsInt32Ev
LEHE26:
testb %al, %al
je L75
movl -12(%ebp), %eax
movl %eax, (%esp)
movl 8(%ebp), %ecx
LEHB27:
call __ZNK2v89ArgumentsixEi
LEHE27:
subl $4, %esp
movl %eax, -32(%ebp)
leal -32(%ebp), %eax
movl %eax, %ecx
call __ZNK2v86HandleINS_5ValueEEptEv
movl %eax, %ecx
LEHB28:
call __ZNK2v85Value10Int32ValueEv
LEHE28:
movl %eax, %edx
movl -16(%ebp), %eax
movl %edx, (%eax)
movl -16(%ebp), %eax
movl (%eax), %ebx
movl $LC4, 4(%esp)
movl $__ZSt4cout, (%esp)
LEHB29:
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl -16(%ebp), %edx
movl %edx, (%esp)
movl %eax, %ecx
call __ZNSolsEPKv
subl $4, %esp
movl $LC5, 4(%esp)
movl %eax, (%esp)
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, (%esp)
movl %eax, %ecx
call __ZNSolsEi
subl $4, %esp
movl $__ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, (%esp)
movl %eax, %ecx
call __ZNSolsEPFRSoS_E
subl $4, %esp
addl $4, -16(%ebp)
jmp L76
L75:
movl -12(%ebp), %eax
movl %eax, (%esp)
movl 8(%ebp), %ecx
call __ZNK2v89ArgumentsixEi
LEHE29:
subl $4, %esp
movl %eax, -28(%ebp)
leal -28(%ebp), %eax
movl %eax, %ecx
call __ZNK2v86HandleINS_5ValueEEptEv
movl %eax, %ecx
LEHB30:
call __ZNK2v85Value8IsNumberEv
LEHE30:
testb %al, %al
je L76
movl -12(%ebp), %eax
movl %eax, (%esp)
movl 8(%ebp), %ecx
LEHB31:
call __ZNK2v89ArgumentsixEi
LEHE31:
subl $4, %esp
movl %eax, -24(%ebp)
leal -24(%ebp), %eax
movl %eax, %ecx
call __ZNK2v86HandleINS_5ValueEEptEv
movl %eax, %ecx
LEHB32:
call __ZNK2v85Value11NumberValueEv
LEHE32:
fstps -44(%ebp)
flds -44(%ebp)
movl -16(%ebp), %eax
fstps (%eax)
movl -16(%ebp), %eax
movl (%eax), %ebx
movl $LC4, 4(%esp)
movl $__ZSt4cout, (%esp)
LEHB33:
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl -16(%ebp), %edx
movl %edx, (%esp)
movl %eax, %ecx
call __ZNSolsEPKv
subl $4, %esp
movl $LC5, 4(%esp)
movl %eax, (%esp)
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
movl %ebx, (%esp)
movl %eax, %ecx
call __ZNSolsEf
subl $4, %esp
movl $__ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, (%esp)
movl %eax, %ecx
call __ZNSolsEPFRSoS_E
subl $4, %esp
addl $4, -16(%ebp)
L76:
incl -12(%ebp)
L74:
movl 8(%ebp), %ecx
call __ZNK2v89Arguments6LengthEv
cmpl -12(%ebp), %eax
setg %al
testb %al, %al
jne L77
movl 16(%ebp), %eax
movl 20(%ebp), %edx
movl %eax, %ecx
/APP
# 69 "method.cpp" 1
call *%edx;
# 0 "" 2
/NO_APP
call __ZN2v84NullEv
leal -20(%ebp), %edx
movl %eax, (%esp)
movl %edx, %ecx
call __ZN2v86HandleINS_5ValueEEC1INS_9PrimitiveEEENS0_IT_EE
subl $4, %esp
movl -20(%ebp), %eax
jmp L87
L83:
movl %eax, (%esp)
call __Unwind_Resume
L84:
movl %eax, (%esp)
call __Unwind_Resume
L85:
movl %eax, (%esp)
call __Unwind_Resume
L86:
movl %eax, (%esp)
call __Unwind_Resume
LEHE33:
L87:
movl -4(%ebp), %ebx
leave
.cfi_restore 5
.cfi_restore 3
.cfi_def_cfa 4, 4
ret
.cfi_endproc
So, this static method is a callback (I do some signature checking beforehand) which is supposed to call the specified method with valid native C++ arguments. To speed things up a little and avoid copying the arguments, I'm trying to load all the parameters into a local array and then adjust ESP so that this array becomes the callee's argument area.
The method call works, but the callee doesn't receive the correct arguments... I've done lots of research on function calls and calling conventions, and lots of tests (which were all successful), but I don't understand what is going on... Is there something I missed?
Basically, the callee is supposed to find its arguments at the top of the stack (at %esp), which in my case is the array... (to be clear, the array itself is valid).
I use GCC.
There are many problems with what you are attempting.
1. You cannot modify %esp using inline assembly, because the compiler is probably using %esp to reference its local variables and arguments. This may work if the compiler uses %ebp instead, but there is no guarantee.
2. You never undo the %esp modification before returning.
3. In your inline assembly, you need to declare that %esp is side-effected.
4. You probably need to pass object as a hidden first argument, since method is an instance method, not a static method.
5. All of this depends on what calling convention you're using: cdecl, stdcall, etc.
I'd recommend not trying to do this yourself; there are a lot of annoying little details that have to be gotten exactly right. Instead, I'd suggest using the FFCALL library, specifically the avcall set of functions, to do this.
I imagine that something like this would do what you want:
v8::Handle<v8::Value> V8Method::staticInternalMethodCaller(const v8::Arguments& args, int argsize, void* object, void* method)
{
// Set up the argument list with the function pointer, return type, and
// pointer to value storing the return value (assuming int, change if
// necessary)
int return_value;
av_alist alist;
av_start_int(alist, method, &return_value);
for(int i = args.Length() - 1; i >= 0; i--)
{
// Push the arguments onto the argument list
if (args[i]->IsInt32())
{
av_int(alist, args[i]->Int32Value());
}
else if (args[i]->IsNumber())
{
av_double(alist, (float)(args[i]->NumberValue()));
}
}
av_call(alist); // Call the function
return v8::Null();
}