Function call cost - C++

The application I am dealing with right now uses a brute-force numerical algorithm that calls many tiny functions billions of times. I was wondering how much performance can be improved by eliminating function calls through inlining and static polymorphism.
What is the cost, relative to a plain (non-inline, non-intrinsic) function call, of the following:
1) function call via function pointer
2) virtual function call
I know that it is hard to measure, but a very rough estimate would do.
Thank you!

To make an ordinary member function call, the compiler needs to:
Fetch the address of the function -> Call the function
To make a virtual function call, the compiler needs to:
Fetch the vptr -> Fetch the address of the function from the vtable -> Call the function
Note: the virtual-call mechanism is a compiler implementation detail, so the implementation might differ between compilers; there may not even be a vptr or vtable, for that matter. Having said that, compilers usually do implement it with a vptr and vtable, and then the above holds true.
So there is some overhead for sure (one additional fetch). To know precisely how much it impacts you, you will have to profile your source code; there is no simpler way.

It depends on your target architecture and your compiler, but one thing you can do is write a small test and check the assembly generated.
I wrote one to do the test:
// test.h
#ifndef FOO_H
#define FOO_H
void bar();
class A {
public:
virtual ~A();
virtual void foo();
};
#endif
// main.cpp
#include "test.h"
void doFunctionPointerCall(void (*func)()) {
func();
}
void doVirtualCall(A *a) {
a->foo();
}
int main() {
doFunctionPointerCall(bar);
A a;
doVirtualCall(&a);
return 0;
}
Note that you don't even need to write test.cpp, since you just need to check the assembly for main.cpp.
To see the compiler's assembly output with gcc, use the -S flag:
gcc main.cpp -S -O3
It will create a file main.s, with the assembly output.
Now we can see what gcc generated for the calls.
doFunctionPointerCall:
.globl _Z21doFunctionPointerCallPFvvE
.type _Z21doFunctionPointerCallPFvvE, #function
_Z21doFunctionPointerCallPFvvE:
.LFB0:
.cfi_startproc
jmp *%rdi
.cfi_endproc
.LFE0:
.size _Z21doFunctionPointerCallPFvvE, .-_Z21doFunctionPointerCallPFvvE
doVirtualCall:
.globl _Z13doVirtualCallP1A
.type _Z13doVirtualCallP1A, #function
_Z13doVirtualCallP1A:
.LFB1:
.cfi_startproc
movq (%rdi), %rax
movq 16(%rax), %rax
jmp *%rax
.cfi_endproc
.LFE1:
.size _Z13doVirtualCallP1A, .-_Z13doVirtualCallP1A
Note that I'm using x86_64 here; the assembly will change for other architectures.
Looking at the assembly, the virtual call uses two extra movq instructions, which is probably the offset lookup in the vtable. Note that in real code, either version would need to save some registers (be it function pointer or virtual call), but the virtual call would still need two extra movq over the function pointer.

Just use a profiler like AMD's CodeAnalyst (using IBS and TBS), or you can go the more 'hardcore' route and give Agner Fog's optimization manuals a read (they will help both with precise instruction timings and with optimizing your code): http://www.agner.org/optimize/

Function calls are a significant overhead if the functions are small. CALL and RETURN, while optimized on modern CPUs, will still be noticeable when many, many calls are made. Also, the small functions may be spread across memory, so the CALL/RETURN may induce cache misses and excessive paging.
//code
int Add(int a, int b) { return a + b; }
int main() {
Add(1, Add(2, 3));
...
}
// NON-inline x86 ASM
Add:
MOV eax, [esp+4] // 1st argument a
ADD eax, [esp+8] // 2nd argument b
RET 8 // return and fix stack 2 args * 4 bytes each
// eax is the returned value
Main:
PUSH 3
PUSH 2
CALL [Add]
PUSH eax
PUSH 1
CALL [Add]
...
// INLINE x86 ASM
Main:
MOV eax, 3
ADD eax, 2
ADD eax, 1
...
If optimization is your goal and you're calling many small functions, it's always best to inline them. Sorry, I don't care for the ugly ASM syntax used by C/C++ compilers.

Related

Do I make a copy when calling a non-void function without using its value?

This question doesn't completely help me to clarify this issue. Say I have the function:
int foo(){
/* some code */
return 4;
}
int main(){
foo();
}
Since I'm returning by value, will a copy of the integer returned by foo() be made? This answer mentions that the compiler will turn the function into a void function. Does that mean the actual function being called will be this one?
void foo(){
/* some code */
}
This question is related to using a class method as a helper function and an interface function at the same time. For instance, assume the classes Matrix and Vector are defined.
class Force
{
public:
Matrix calculate_matrix(Vector & input);
Vector calculate_vector(Vector & input);
private:
Matrix _matrix;
Vector _vector;
};
Implementation
Matrix Force::calculate_matrix(Vector & input){
/* Perform calculations using input and writing to _matrix */
return _matrix;
}
Vector Force::calculate_vector(Vector & input){
calculate_matrix(input);
/* Perform calculations using input and _matrix and writing to _vector */
return _vector;
}
Whenever I call Force::calculate_matrix() in Force::calculate_vector(), am I making a copy of _matrix? I need Force::calculate_matrix() as an interface whenever I just need to know the value of _matrix. Would a helper function such as:
void Force::_calculate_matrix(Vector & input){
/* Perform calculations using input and writing to _matrix */
}
and then use it in Force::calculate_matrix() and Force::calculate_vector() be a better design solution or I can get away with the compiler optimizations mentioned above to avoid the copies of the Force::calculate_matrix() returning value?
To answer this question, it is worth revisiting the whole 'returning' from the function thing. It is really interesting once you start digging into it.
First of all, the int example is really irrelevant to the latter question. Stay tuned to know why :)
First thing to remember is that there is no 'returning values' at the assembly level. Instead, the rules that govern how a value is returned from a function are defined in what is called an ABI (application binary interface). Not so long ago there were many different ABIs in use, but nowadays most of what we see is the so-called AMD64 ABI. In particular, according to it, returning an integer means placing the value in the RAX register.
So when you ignore an integer return value, your code will simply not read RAX.
It is different when you return an object. As a matter of fact, the object is returned in a place prepared by the caller (the location of this place is passed to the callee). The callee performs all initialization and populates the object with values. The calling code then handles the 'space' as appropriate.
Now, the callee has no idea whether the result will be used or not (unless it is inlined). So it always has to 'return' a value - put the int into RAX or initialize the object in the provided space. And even if the value is not used, the caller still needs to allocate the space - it knows the called function will be putting data into it. Since the calling code knows the space is not going to be used, it will be discarded and no copies will be made at the call site. There will still be a copy in the callee, though!
Now, even more interesting: enter compiler optimizations. Depending on the size of your calculate_matrix function, the compiler may decide to inline it. If that happens, all the 'argument passing' will simply go away - there will be nothing to pass, as the code will be executed at the call site as if no function were called at all. When this happens, there is no copy at all, and the whole return statement will likely be optimized away - there is nowhere to return it.
I hope this answers the question.
Regarding your simple example, I compiled it with g++ -o void void.cc on x86_64 Linux, and disassembled it with gdb. Here is what I got:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004f8 <+0>: push %rbp
0x00000000004004f9 <+1>: mov %rsp,%rbp
0x00000000004004fc <+4>: callq 0x4004ed <_Z3foov>
0x0000000000400501 <+9>: mov $0x0,%eax
0x0000000000400506 <+14>: pop %rbp
0x0000000000400507 <+15>: retq
End of assembler dump.
(gdb) disass foo
Dump of assembler code for function _Z3foov:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov $0x4,%eax
0x00000000004004f6 <+9>: pop %rbp
0x00000000004004f7 <+10>: retq
End of assembler dump.
As you see, the value 4 is being returned from foo() via the eax register, but it is being ignored in main().
Regarding your Matrix code. Returning a Matrix by value is expensive, as a copy would have to be made. As BeyelerStudios suggested in the comment, you can avoid a copy by returning const Matrix &. Another option is to make _calculate_matrix() take Matrix& as argument expecting an object in some initial usable state and fill it as the matrix is being computed.

Writing a custom pure virtual handler: What is the state of the stack and registers when it is called

So it is possible to make the system call a custom function for pure virtual functions[1]. This raises the question of what such a function can do. For GCC:
Vtable for Foo
Foo::_ZTV3Foo: 5u entries
0 (int (*)(...))0
8 (int (*)(...))(& _ZTI3Foo)
16 0u
24 0u
32 (int (*)(...))__cxa_pure_virtual
And it is placed directly in the slot for the pure virtual function. Since the function prototype void foo() does not match the true signature, is the stack still sane? In particular, can I throw an exception and catch it somewhere and continue execution?
[1] Is there an equivilant of _set_purecall_handler() in Linux?
Read the x86-64 ABI supplement to understand what is really happening; notably about calling conventions.
In your case, the stack is safe (because calling a void foo(void) is safer than calling any other signature), and you probably can throw an exception.
Details are compiler- and processor-specific. Your hack might perhaps work - but probably not - and is really unportable (since it is technically undefined behavior, IIUC).
I'm not sure it will work. Perhaps the compiler would emit an indirect jump, and you'll jump to the null address, and that is a SIGSEGV.
Notice that a virtual call is just an indirect jump; with
class Foo {
public:
virtual void bar(void) =0;
virtual ~Foo();
};
extern "C" void doit(Foo*f) {
f->bar();
}
The assembly code (produced with g++-4.9 -Wall -O -fverbose-asm -S foo.cc) is:
.type doit, #function
doit:
.LFB0:
.file 1 "foo.cc"
.loc 1 7 0
.cfi_startproc
.LVL0:
subq $8, %rsp #,
.cfi_def_cfa_offset 16
.loc 1 8 0
movq (%rdi), %rax # f_2(D)->_vptr.Foo, f_2(D)->_vptr.Foo
call *(%rax) # *_3
.LVL1:
.loc 1 9 0
addq $8, %rsp #,
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE0:
.size doit, .-doit
and I don't see any checks against unbound virtual methods above.
It is much better, and more portable, to define the virtual methods in the base class to throw the exception.
You might customize your GCC compiler using MELT to fit your bizarre needs!

Performance hit of vtable lookup in C++

I'm evaluating to rewrite a piece of real-time software from C/assembly language to C++/assembly language (for reasons not relevant to the question parts of the code are absolutely necessary to do in assembly).
An interrupt comes with a 3 kHz frequency, and for each interrupt around 200 different things are to be done in a sequence. The processor runs with 300 MHz, giving us 100,000 cycles to do the job. This has been solved in C with an array of function pointers:
// Each function does a different thing, all take one parameter being a pointer
// to a struct, each struct also being different.
void (*todolist[200])(void *parameters);
// Array of pointers to structs containing each function's parameters.
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
Speed is important. The above 200 iterations are done 3,000 times per second, so practically we do 600,000 iterations per second. The above for loop compiles to five cycles per iteration, yielding a total cost of 3,000,000 cycles per second, i.e. 1% CPU load. Assembler optimization might bring that down to four instructions, however I fear we might get some extra delay due to memory accesses close to each other, etc. In short, I believe those five cycles are pretty optimal.
Now to the C++ rewrite. Those 200 things we do are sort of related to each other. There is a subset of parameters that they all need and use, and have in their respective structs. In a C++ implementation they could thus neatly be regarded as inheriting from a common base class:
class Base
{
virtual void Execute();
int something_all_things_need;
};
class Derived1 : Base
{
void Execute() { /* Do something */ }
int own_parameter;
// Other own parameters
};
class Derived2 : Base { /* Etc. */ };
Base *todolist[200];
void realtime(void)
{
for (int i = 0; i < 200; i++)
todolist[i]->Execute(); // vtable look-up! 20+ cycles.
}
My problem is the vtable lookup. I cannot do 600,000 lookups per second; this would account for more than 4% of wasted CPU load. Moreover the todolist never changes during run-time, it is only set up once at start-up, so the effort of looking up what function to call is truly wasted. Upon asking myself the question "what is the most optimal end result possible", I look at the assembler code given by the C solution, and refind an array of function pointers...
What is the clean and proper way to do this in C++? Making a nice base class, derived classes and so on feels pretty pointless when in the end one again picks out function pointers for performance reasons.
Update (including correction of where the loop starts):
The processor is an ADSP-214xx, and the compiler is VisualDSP++ 5.0. When enabling #pragma optimize_for_speed, the C loop takes 9 cycles. Assembly-optimizing it in my mind yields 4 cycles; however, I didn't test it, so it's not guaranteed. The C++ loop takes 14 cycles. I'm aware the compiler could do a better job; however, I did not want to dismiss this as a compiler issue - getting by without polymorphism is still preferable in an embedded context, and the design choice still interests me. For reference, here is the resulting assembly:
C:
i3=0xb27ba;
i5=0xb28e6;
r15=0xc8;
Here's the actual loop:
r4=dm(i5,m6);
i12=dm(i3,m6);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279de;
r15=r15-1;
if ne jump (pc, 0xfffffff2);
C++ :
i5=0xb279a;
r15=0xc8;
Here's the actual loop:
i5=modify(i5,m6);
i4=dm(m7,i5);
r2=i4;
i4=dm(m6,i4);
r1=dm(0x3,i4);
r4=r2+r1;
i12=dm(0x5,i4);
r2=i6;
i6=i7;
jump (m13,i12) (db);
dm(i7,m7)=r2;
dm(i7,m7)=0x1279e2;
r15=r15-1;
if ne jump (pc, 0xffffffe7);
In the meanwhile, I think I have found sort of an answer. The lowest amount of cycles is achieved by doing the very least possible. I have to fetch a data pointer, fetch a function pointer, and call the function with the data pointer as parameter. When fetching a pointer the index register is automatically modified by a constant, and one can just as well let this constant equal 1. So once again one finds oneself with an array of function pointers, and an array of data pointers.
Naturally, the limit is what can be done in assembly, and that has now been explored. Having this in mind, I now understand that even though it comes natural to one to introduce a base class, it was not really what fit the bill. So I guess the answer is that if one wants an array of function pointers, one should make oneself an array of function pointers...
What makes you think vtable lookup overhead is 20 cycles? If that's really true, you need a better C++ compiler.
I tried this on an Intel box, not knowing anything about the processor you're using, and as expected the difference between the C despatch code and the C++ vtable despatch is one instruction, having to do with the fact that the vtable involves an extra indirect.
C code (based on OP):
void (*todolist[200])(void *parameters);
void *paramlist[200];
void realtime(void)
{
int i;
for (i = 0; i < 200; i++)
(*todolist[i])(paramlist[i]);
}
C++ code:
class Base {
public:
Base(void* unsafe_pointer) : unsafe_pointer_(unsafe_pointer) {}
virtual void operator()() = 0;
protected:
void* unsafe_pointer_;
};
Base* todolist[200];
void realtime() {
for (int i = 0; i < 200; ++i)
(*todolist[i])();
}
Both compiled with gcc 4.8, -O3:
realtime: |_Z8realtimev:
.LFB0: |.LFB3:
.cfi_startproc | .cfi_startproc
pushq %rbx | pushq %rbx
.cfi_def_cfa_offset 16 | .cfi_def_cfa_offset 16
.cfi_offset 3, -16 | .cfi_offset 3, -16
xorl %ebx, %ebx | movl $todolist, %ebx
.p2align 4,,10 | .p2align 4,,10
.p2align 3 | .p2align 3
.L3: |.L3:
movq paramlist(%rbx), %rdi | movq (%rbx), %rdi
call *todolist(%rbx) | addq $8, %rbx
addq $8, %rbx | movq (%rdi), %rax
| call *(%rax)
cmpq $1600, %rbx | cmpq $todolist+1600, %rbx
jne .L3 | jne .L3
popq %rbx | popq %rbx
.cfi_def_cfa_offset 8 | .cfi_def_cfa_offset 8
ret | ret
In the C++ code, the first movq gets the address of the vtable, and the call then indexes through that. So that's one instruction overhead.
According to OP, the DSP's C++ compiler produces the following code. I've inserted comments based on my understanding of what's going on (which might be wrong). Note that (IMO) the loop starts one location earlier than OP indicates; otherwise, it makes no sense (to me).
# Initialization.
# i3=todolist; i5=paramlist | # i5=todolist holds paramlist
i3=0xb27ba; | # No paramlist in C++
i5=0xb28e6; | i5=0xb279a;
# r15=count
r15=0xc8; | r15=0xc8;
# Loop. We need to set up r4 (first parameter) and figure out the branch address.
# In C++ by convention, the first parameter is 'this'
# Note 1:
r4=dm(i5,m6); # r4 = *paramlist++; | i5=modify(i5,m6); # i4 = *todolist++
| i4=dm(m7,i5); # ..
# Note 2:
| r2=i4; # r2 = obj
| i4=dm(m6,i4); # vtable = *(obj + 1)
| r1=dm(0x3,i4); # r1 = vtable[3]
| r4=r2+r1; # param = obj + r1
i12=dm(i3,m6); # i12 = *todolist++; | i12=dm(0x5,i4); # i12 = vtable[5]
# Boilerplate call. Set frame pointer, push return address and old frame pointer.
# The two (push) instructions after jump are actually executed before the jump.
r2=i6; | r2=i6;
i6=i7; | i6=i7;
jump (m13,i12) (db); | jump (m13,i12) (db);
dm(i7,m7)=r2; | dm(i7,m7)=r2;
dm(i7,m7)=0x1279de; | dm(i7,m7)=0x1279e2;
# if (count--) loop
r15=r15-1; | r15=r15-1;
if ne jump (pc, 0xfffffff2); | if ne jump (pc, 0xffffffe7);
Notes:
In the C++ version, it seems that the compiler has decided to do the post-increment in two steps, presumably because it wants the result in an i register rather than in r4. This is undoubtedly related to the issue below.
The compiler has decided to compute the base address of the object's real class, using the object's vtable. This occupies three instructions, and presumably also requires the use of i4 as a temporary in step 1. The vtable lookup itself occupies one instruction.
So: the issue is not vtable lookup, which could have been done in a single extra instruction (but actually requires two). The problem is that the compiler feels the need to "find" the object. But why doesn't gcc/i86 need to do that?
The answer is: it used to, but it doesn't any more. In many cases (where there is no multiple inheritance, for example), the cast of a pointer to a derived class to a pointer of a base class does not require modifying the pointer. Consequently, when we call a method of the derived class, we can just give it the base class pointer as its this parameter. But in other cases, that doesn't work, and we have to adjust the pointer when we do the cast, and consequently adjust it back when we do the call.
There are (at least) two ways to perform the second adjustment. One is the way shown by the generated DSP code, where the adjustment is stored in the vtable -- even if it is 0 -- and then applied during the call. The other way, (called vtable-thunks) is to create a thunk -- a little bit of executable code -- which adjusts the this pointer and then jumps to the method's entry point, and put a pointer to this thunk into the vtable. (This can all be done at compile time.) The advantage of the thunk solution is that in the common case where no adjustment needs to be done, we can optimize away the thunk and there is no adjustment code left. (The disadvantage is that if we do need an adjustment, we've generated an extra branch.)
As I understand it, VisualDSP++ is based on gcc, and it might have the -fvtable-thunks and -fno-vtable-thunks options. So you might be able to compile with -fvtable-thunks. But if you do that, you would need to compile all the C++ libraries you use with that option, because you cannot mix the two calling styles. Also, there were (15 years ago) various bugs in gcc's vtable-thunks implementation, so if the version of gcc used by VisualDSP++ is old enough, you might run into those problems too (IIRC, they all involved multiple inheritance, so they might not apply to your use case.)
(Original test, before update):
I tried the following simple case (no multiple inheritance, which can slow things down):
class Base {
public:
Base(int val) : val_(val) {}
virtual int binary(int a, int b) = 0;
virtual int unary(int a) = 0;
virtual int nullary() = 0;
protected:
int val_;
};
int binary(Base* begin, Base* end, int a, int b) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->binary(a, b); }
return accum;
}
int unary(Base* begin, Base* end, int a) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->unary(a); }
return accum;
}
int nullary(Base* begin, Base* end) {
int accum = 0;
for (; begin != end; ++begin) { accum += begin->nullary(); }
return accum;
}
And compiled it with gcc (4.8) using -O3. As I expected, it produced exactly the same assembly code as your C despatch would have done. Here's the for loop in the case of the unary function, for example:
.L9:
movq (%rbx), %rax
movq %rbx, %rdi
addq $16, %rbx
movl %r13d, %esi
call *8(%rax)
addl %eax, %ebp
cmpq %rbx, %r12
jne .L9
As has already been mentioned, you can use templates to do away with the dynamic dispatch. Here is an example that does this:
template <typename FirstCb, typename ... RestCb>
struct InterruptHandler {
void execute() {
// I construct temporary objects here since I could not figure out how you
// construct your objects. You can change these signatures to allow for
// passing arbitrary params to these handlers.
FirstCb().execute();
InterruptHandler<RestCb...>().execute();
}
};
InterruptHandler</* Base, Derived1, and so on */> handler;
void realtime(void) {
handler.execute();
}
This should completely eliminate the vtable lookups while providing more opportunities for compiler optimization since the code inside execute can be inlined.
Note however that you will need to change some parts depending on how you initialize your handlers. The basic framework should remain the same.
Also, this requires that you have a C++11 compliant compiler.
I suggest using static methods in your derived classes and placing these functions into your array. This would eliminate the overhead of the v-table search. This is closest to your C language implementation.
You may end up sacrificing the polymorphism for speed.
Is the inheritance necessary?
Just because you switch to C++ doesn't mean you have to switch to Object Oriented.
Also, have you tried unrolling your loop in the ISR?
For example, perform 2 or more execution calls before returning back to the top of the loop.
Also, can you move any of the functionality out of the ISR?
Can any part of the functionality be performed by the "background loop" instead of the ISR? This would reduce the time in your ISR.
You can hide the void* type erasure and type recovery inside templates. The result would (hopefully) be the same array of function pointers. This would help with casting and stay compatible with your code:
#include <iostream>
template<class ParamType,class F>
void fun(void* param) {
F f;
f(*static_cast<ParamType*>(param));
}
struct my_function {
void operator()(int& i) {
std::cout << "got it " << i << std::endl;
}
};
int main() {
void (*func)(void*) = fun<int, my_function>;
int j=4;
func(&j);
return 0;
}
In this case you can create new functions as function objects with more type safety. The "normal" OOP approach with virtual functions doesn't help here.
In a C++11 environment you could create the array with the help of variadic templates at compile time (but with a complicated syntax).
This is unrelated to your question, but if you are that keen on performance you could use templates to do a loop unroll for the todolist:
void (*todo[3])(void *);
void *param[3];
void f1(void*) {std::cout<<"1" << std::endl;}
void f2(void*) {std::cout<<"2" << std::endl;}
void f3(void*) {std::cout<<"3" << std::endl;}
template<int N>
struct Obj {
static void apply()
{
todo[N-1](param[N-1]);
Obj<N-1>::apply();
}
};
template<> struct Obj<0> { static void apply() {} };
int main() {
todo[0] = f1;
todo[1] = f2;
todo[2] = f3;
Obj<sizeof todo / sizeof *todo>::apply();
}
Find out where your compiler puts the vtable and access it directly to get the function pointers and store them for usage. That way you will have pretty much the same approach like in C with an array of function pointers.

Load 64-bit integer constant via GNU extended asm constraint?

I've written this code in Clang-compatible "GNU extended asm":
namespace foreign {
extern char magic_pointer[];
}
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile("movq %[magic_pointer], %%rax\n\t"
"ret"
: : [magic_pointer] "p"(&foreign::magic_pointer));
}
I expected it to compile into the following assembly:
_get_address_of_x:
## InlineAsm Start
movq $__ZN7foreign13magic_pointerE, %rax
ret
## InlineAsm End
ret /* useless but I don't think there's any way to get rid of it */
But instead I get this "nonsense":
_get_address_of_x:
movq __ZN7foreign13magic_pointerE#GOTPCREL(%rip), %rax
movq %rax, -8(%rbp)
## InlineAsm Start
movq -8(%rbp), %rax
ret
## InlineAsm End
ret
Apparently Clang is assigning the value of &foreign::magic_pointer into %rax (which is deadly to a naked function), and then further "spilling" it onto a stack frame that doesn't even exist, all so it can pull it off again in the inline asm block.
So, how can I make Clang generate exactly the code I want, without resorting to manual name-mangling? I mean I could just write
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile("movq __ZN7foreign13magic_pointerE#GOTPCREL(%rip), %rax\n\t"
"ret");
}
but I really don't want to do that if there's any way to help it.
Before hitting on "p", I'd tried the "i" and "n" constraints; but they didn't seem to work properly with 64-bit pointer operands. Clang kept giving me error messages about not being able to allocate the operand to the %flags register, which seems like something crazy was going wrong.
For those interested in solving the "XY problem" here: I'm really trying to write a much longer assembly stub that calls off to another function foo(void *p, ...) where the argument p is set to this magic pointer value and the other arguments are set based on the original values of the CPU registers at the point this assembly stub was entered. (Hence, naked function.) Arbitrary company policy prevents just writing the damn thing in a .S file to begin with; and besides, I really would like to write foreign::magic_pointer instead of __ZN7foreign...etc.... Anyway, that should explain why spilling temporary results to stack or registers is strictly verboten in this context.
Perhaps there's some way to write
asm volatile(".long %[magic_pointer]" : : [magic_pointer] "???"(&foreign::magic_pointer));
to get Clang to insert exactly the relocation I want?
I think this is what you want:
namespace foreign {
extern char magic_pointer[];
}
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile ("ret" : : "a"(&foreign::magic_pointer));
}
In this context, "a" is a constraint that specifies that %rax must be used. Clang will then load the address of magic_pointer into %rax in preparation for executing your inline asm, which is all you need.
It's a little dodgy because it's defining constraints that are unreferenced in the asm text, and I'm not sure whether that's technically allowed/well-defined - but it does work on latest clang.
On clang 3.0-6ubuntu3 (because I'm being lazy and using gcc.godbolt.org), with -fPIC, this is the asm you get:
get_address_of_x: # #get_address_of_x
movq foreign::magic_pointer#GOTPCREL(%rip), %rax
ret
ret
And without -fPIC:
get_address_of_x: # #get_address_of_x
movl foreign::magic_pointer, %eax
ret
ret
OP here.
I ended up just writing a helper extern "C" function to return the magic value, and then calling that function from my assembly code. I still think Clang ought to support my original approach somehow, but the main problem with that approach in my real-life case was that it didn't scale to x86-32. On x86-64, loading an arbitrary address into %rdx can be done in a single instruction with a %rip-relative mov. But on x86-32, loading an arbitrary address with -fPIC turns into just a ton of code, .indirect_symbol directives, two memory accesses... I just didn't want to attempt writing all that by hand. So my final assembly code looks like
asm volatile(
"...save original register values...;"
"call _get_magic_pointer;"
"movq %rax, %rdx;"
"...set up other parameters to foo...;"
"call _foo;"
"...cleanup..."
);
Simpler and cleaner. :)

asm subroutine handling int and char from c++ file

How are an int and a char handled in an asm subroutine after being linked with a C++ program? E.g. extern "C" void LCD_byte(char byte, int cmd_data); - how does LCD_byte handle "byte" and "cmd_data"? How do I access "byte" and "cmd_data" in the assembly code?
This very much depends on the microprocessor you use. If it is x86, the char will be widened to an int, and then both parameters are passed on the stack. You can find out yourself by compiling C code that performs a call into assembly code, and inspect the assembly code.
For example, given
void LCD_byte (char byte, int cmd_data);
void foo()
{
LCD_byte('a',100);
}
gcc generates on x86 Linux the code
foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $100, 4(%esp)
movl $97, (%esp)
call LCD_byte
leave
ret
As you can see, both values are pushed on the stack (so that 'a' is on the top), then a call instruction to the target routine is made. Therefore, the target routine can find the first incoming parameter at esp+4.
Well, a lot depends on the calling convention, which in turn, AFAIK, depends on the compiler.
But 99.9% of the time it is one of two things: either the arguments are passed in registers, or they are pushed onto the stack and popped back off inside the function.
Look up the documentation for your platform. It tells you which calling convention is used for C.
The calling convention specifies how parameters are passed, which registers are caller-saves and which are callee-saves, how the return address is stored and everything else you need to correctly implement a function that can be called from C. (as well as everything you need to correctly call a C function)