I've written this code in Clang-compatible "GNU extended asm":
namespace foreign {
extern char magic_pointer[];
}
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile("movq %[magic_pointer], %%rax\n\t"
"ret"
: : [magic_pointer] "p"(&foreign::magic_pointer));
}
I expected it to compile into the following assembly:
_get_address_of_x:
## InlineAsm Start
movq $__ZN7foreign13magic_pointerE, %rax
ret
## InlineAsm End
ret /* useless but I don't think there's any way to get rid of it */
But instead I get this "nonsense":
_get_address_of_x:
movq __ZN7foreign13magic_pointerE#GOTPCREL(%rip), %rax
movq %rax, -8(%rbp)
## InlineAsm Start
movq -8(%rbp), %rax
ret
## InlineAsm End
ret
Apparently Clang is assigning the value of &foreign::magic_pointer into %rax (which is deadly to a naked function), and then further "spilling" it onto a stack frame that doesn't even exist, all so it can pull it off again in the inline asm block.
So, how can I make Clang generate exactly the code I want, without resorting to manual name-mangling? I mean I could just write
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile("movq __ZN7foreign13magic_pointerE#GOTPCREL(%rip), %rax\n\t"
"ret");
}
but I really don't want to do that if there's any way to help it.
Before hitting on "p", I'd tried the "i" and "n" constraints; but they didn't seem to work properly with 64-bit pointer operands. Clang kept giving me error messages about not being able to allocate the operand to the %flags register, which seems like something crazy was going wrong.
For those interested in solving the "XY problem" here: I'm really trying to write a much longer assembly stub that calls off to another function foo(void *p, ...) where the argument p is set to this magic pointer value and the other arguments are set based on the original values of the CPU registers at the point this assembly stub was entered. (Hence, naked function.) Arbitrary company policy prevents just writing the damn thing in a .S file to begin with; and besides, I really would like to write foreign::magic_pointer instead of __ZN7foreign...etc.... Anyway, that should explain why spilling temporary results to stack or registers is strictly verboten in this context.
Perhaps there's some way to write
asm volatile(".long %[magic_pointer]" : : [magic_pointer] "???"(&foreign::magic_pointer));
to get Clang to insert exactly the relocation I want?
I think this is what you want:
namespace foreign {
extern char magic_pointer[];
}
extern "C" __attribute__((naked)) void get_address_of_x(void)
{
asm volatile ("ret" : : "a"(&foreign::magic_pointer));
}
In this context, "a" is a constraint that specifies that %rax must be used. Clang will then load the address of magic_pointer into %rax in preparation for executing your inline asm, which is all you need.
It's a little dodgy because it's defining constraints that are unreferenced in the asm text, and I'm not sure whether that's technically allowed/well-defined - but it does work on latest clang.
On clang 3.0-6ubuntu3 (because I'm being lazy and using gcc.godbolt.org), with -fPIC, this is the asm you get:
get_address_of_x: # #get_address_of_x
movq foreign::magic_pointer#GOTPCREL(%rip), %rax
ret
ret
And without -fPIC:
get_address_of_x: # #get_address_of_x
movl foreign::magic_pointer, %eax
ret
ret
OP here.
I ended up just writing a helper extern "C" function to return the magic value, and then calling that function from my assembly code. I still think Clang ought to support my original approach somehow, but the main problem with that approach in my real-life case was that it didn't scale to x86-32. On x86-64, loading an arbitrary address into %rdx can be done in a single instruction with a %rip-relative mov. But on x86-32, loading an arbitrary address with -fPIC turns into just a ton of code, .indirect_symbol directives, two memory accesses... I just didn't want to attempt writing all that by hand. So my final assembly code looks like
asm volatile(
"...save original register values...;"
"call _get_magic_pointer;"
"movq %rax, %rdx;"
"...set up other parameters to foo...;"
"call _foo;"
"...cleanup..."
);
Simpler and cleaner. :)
Related
I am trying to understand some things about inline assembler in Linux. I am using following function:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(var) );
return;
}
It generates following assembler code:
.globl test_func
.type test_func, #function
test_func:
pushl %ebp
movl %esp, %ebp
#APP
# 336 "opers.c" 1
addl 8(%ebp), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
.size test_func, .-test_func
It sums var mem address to eax register value instead var value.
Is there any way to tell addl instruction to use var value instead var mem address without copying var mem address to a register?
Regards
It sums var mem address to eax register value instead var value.
Yes, the syntax of gcc inline assembly is pretty arcane. Paraphrasing from the relevant section in the GCC Inline Assembly HOWTO "m" roughly gives you the memory location of the C-variable.
It's what you'd use when you just want an address you can write to or read from. Notice I said the location of the C variable, so %0 is set to the address of Word32 *var - you have a pointer to a pointer. A C translation of the inline assembly block could look like EAX += *(&var) because you can say that the "m" constraint implicitly takes the address of the C variable and gives you an address expression, that you then add to %eax.
Is there any way to tell addl instruction to use var value instead var mem address without copying var mem address to a register?
That depends on what you mean. You need to get var from the stack, so someone has to dereference memory (see #Bo Perssons answer), but you don't have to do it in inline assembly
The constraint needs to be "m"(*var) (as #fazo suggested). That will give you the memory location of the value that var is pointing to, rather than a memory location pointing to it.
The generated code is now:
test_func:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
#APP
# 2 "test.c" 1
addl (%eax), %eax
# 0 "" 2
#NO_APP
popl %ebp
ret
Which is a little suspect, but that's understandable as you forgot to tell GCC that you clobbered (modified without having in the input/output list) %eax. Fixing that asm("addl %0, %%eax" : : "m"(*var) : "%eax" ) generates:
movl 8(%ebp), %edx
addl (%edx), %eax
Which isn't any better or more correct in this case, but it is always a good practice to remember. See the section on the clobber list and pay special attention to the "memory" clobber for advanced usage of inline assembly.
Even though you don't want to (explicitly) load the memory address into a register I'll briefly cover it.
Changing the constraint from "m" to "r" almost seems to work, the relevant sections gets changed to (if we include %eax in the clobber list):
movl 8(%ebp), %edx
addl %edx, %eax
Which is almost correct, we have loaded the pointer value var into a register, but now we have to specify ourselves that we're loading from memory. Changing the code to match the constraint (usually undesirable, I'm only showing it for completeness):
asm("addl (%0), %%eax" : : "r"(var) : "%eax" );
Gives:
movl 8(%ebp), %edx
addl (%edx), %eax
The same as with "m".
yes, because you give him var which is address. give him *var.
like:
void test_func(Word32 *var){
asm( " addl %0, %%eax" : : "m"(*var) );
return;
}
i don't remember exactly, but you should replace "m" with "r" ?
memory operand doesn;t mean that it will take value from that address. it's just a pointer
No, there is no addressing mode for x86 processors that goes two levels indirect.
You have to first load the pointer from a memory address and then load indirectly from the pointer.
An "m" constraint doesn't implicitly dereference anything. It's just like an "r" constraint, except it expands to an addressing mode for a memory location holding the value of the expression, instead of a register. (In C, every object has an address, although often that can be optimized away.)
The C object that's an input (or output for "=m") for the asm is the lvalue or rvalue you specify, e.g. "m"(var) takes the value of var, not *var. So you'd adding the pointer. (And telling the compiler that you want that input pointer value to be in memory, not a register.)
Perhaps it's confusing you that you have a pointer but you called it var, not ptr or something? A C pointer is an object whose value is an address, and can itself be stored in memory. If you were using C++, Word32 &var would make the dereference implicit whenever you write var.
In C terms, you're doing eax += ptr, but you want eax += *ptr, so you should write
void test_func(Word32 *ptr){
asm( "add %[input], %%eax"
: // no inputs. Probably you should use "+a"(add_to_this) if you want the add result, and remove the EAX clobber.
: [input] "m"(*ptr) // the pointed-to Word32 in memory
: "eax" // the instruction modifies EAX; tell the compiler about it
);
}
Compiling (Godbolt compiler explorer) results in:
# gcc -O3 -m32
test_func:
movl 4(%esp), %edx # compiler-generated load of the function arg
add (%edx), %eax # from asm template, (%edx) filled in as %[input] for *ptr
ret
Or if you'd compiled with -mregparm=3, or a 64-bit build, the arg would already be in a register. e.g. 64-bit GCC emits add (%rdi), %eax ; ret.
If you'd written return *ptr in C for a function returning Word32, with no inline asm, the asm would be similar, loading the pointer arg from the stack and then mov (%edx), %eax to load the return value. See the Godbolt link for that.
If inline asm isn't doing what you expect, look at the compiler generated asm to see how it filled in your template. That sometimes helps you figure out what the compiler thought you meant. (But only if you understand the basic design principles.)
If you write "m"(ptr), it compiles as follows:
void add_pointer(Word32 *ptr)
{
asm( "add %[input], %%eax" : : [input] "m"(ptr) : "eax" );
}
add_pointer:
add 4(%esp), %eax # ptr
ret
Very similar to if you wrote Word32 *bar(Word32 *ptr){ return ptr; }
Note that if you wanted to increment the memory location, you'd use a "+m"(*ptr) constraint to tell the compiler that the pointed-to memory is both an input and output. Or if you write-only to the memory, "=m"(*ptr) so it can potentially optimize away earlier dead stores to this memory location.
See also How can I indicate that the memory *pointed* to by an inline ASM argument may be used? to handle cases where you use an "r"(ptr) input and dereference the pointer manually inside the asm, accessing memory that you didn't tell the compiler about as being an input or output operand.
Generally avoid doing "r"(ptr) and then manually doing add (%0), %%eax. It needs extra constraints to make it safe, and it forces the compiler to materialize the exact address in a register, instead of using an addressing mode to reach it relative to some other register. e.g. 4(%ecx) if after inlining it sees that you're actually passing a pointer into an array or to a struct member.
Of course, generally avoid inline asm entirely unless you can't get the compiler to emit good enough asm without it. https://gcc.gnu.org/wiki/DontUseInlineAsm. If you do decide to use it, see https://stackoverflow.com/tags/inline-assembly/info for guides to avoid common mistakes.
Try
void test_func(Word32 *var){
asm( " mov %0, %%edx; \
addl (%%edx), %%eax" : : "m"(var) );
return;
}
For this fragment of code (https://godbolt.org/z/s4PY44dha)
int foo(unsigned long long x)
{
return _lzcnt_u64(x);
}
GCC generates 3 asm instructions
xorl %eax, %eax
lzcntq %rdi, %rax
ret
while clang generates only 2
lzcntq %rdi, %rax
retq
Is it possible to change the implementation/signature of foo to help GCC understand that this xor instruction is useless? Why can't gcc perform such simple optimization itself?
The answer to this question Why does breaking the "output dependency" of LZCNT matter? explains that this xor may be useful for some old architectures to break so-called "false dependency" on the destination register. It even mentions that the issue it is supposed to fix is not present in the modern intel architectures starting from "Skylake-S (client)". I tried to
pass newer architectures to the GCC (for example -march=rocketlake, -march=icelake-client) but it still inserts "useless" xor.
In contrast, even for old architectures like haswell clang doesn't insert xor. This means that if one wants to get each bit of performance for certain architecture, then the insertion of xor should be controlled manually.
For example, with this inline assembly, I managed to get the code without xor.
int xorless_lzcntq(unsigned long long x) {
unsigned long long res;
asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
return res;
}
I've recently seen an article on how the swap operation can be performed using xor'ing instead of using a temporary variable. When I compile code using int a ^= b; the result won't simply be(for at&t syntax)
xor b, a
etc.
instead it will load the raw values into registers, xor it and write it back.
To optimize this i want to write this in inline assembly so it only uses three ticks to do the entire thing and not 15 like it does normally.
I've tried multiple keywords like:
asm(...);
asm("...");
asm{...};
asm{"..."};
asm ...
__asm ...
None of that worked, either giving me a syntax error, gcc doesn't seem to accept all of that syntax or else saying
main.cpp: Assembler messages:
main.cpp:12: Error: too many memory references for `xor'
Basically, I want to use the variables defined in my c++ code used in the assembler block, using three lines to xor them and then have my swapped variables basically like this:
int main() {
volatile int a = 5;
volatile int b = 6;
asm {
xor a,b
xor b,a
xor a,b
};
//a should now be 6, b should be 5
}
To clarify:
I want to avoid using the compiler generated mov operations since they take more cpu cycles than just doing three xor operations which would take three cycles. How could I accomplish this?
To use inline assembly, you should use __asm__ volatile. However, this type of optimization may be premature. Just because there are more instructions does not mean the code is slower - some instructions can be really slow. For example, a floating point BCD store instruction (fbstp), while admittedly rare, takes over 200 cycles - compared to one cycle for a simple mov (Agner Fog's Optimization Guide is a good resource for these timings).
So, I implemented a bunch of "swap" functions, some in C++ and some in assembly, and did a bit of measuring, running each function 100 million times in a row.
Test cases
std::swap
std::swap is probably the preferred solution here. It does what you want (swap the values of two variables), works for most standard library types and not just for integers, clearly communicates what you are trying to achieve, and is portable across architectures.
void std_swap(int *a, int *b) {
std::swap(*a, *b);
}
Here is the generated assembly: It loads both values into registers, and then writes them back to the opposite memory locations.
movl (%rdi), %eax
movl (%rsi), %edx
movl %edx, (%rdi)
movl %eax, (%rsi)
XOR swap
This is what you were trying to do, in C++:
void xor_swap(int *a, int *b) {
*a ^= *b;
*b ^= *a;
*a ^= *b;
}
This doesn't directly translate to only xor instructions, because there is no instruction on x86 that allows you to directly xor two locations in memory - you always need to load at least one of the two into a register:
movl (%rdi), %eax
xorl (%rsi), %eax
movl %eax, (%rdi)
xorl (%rsi), %eax
movl %eax, (%rsi)
xorl %eax, (%rdi)
You also generate a bunch of extra instructions because the two pointers may alias, i.e. point to overlapping memory areas. Then, changing one variable would also change the other, so the compiler needs to constantly store and re-load the values. An implementation using the compiler-specific __restrict keyword will compile to the same code as std_swap (thanks to #Ped7g for pointing out this flaw in the comments).
Swap with temporary variables
This is the "standard" swap with a temporary variable (that the compiler promptly optimizes out to the same code as std::swap):
void tmp_swap(int *a, int *b) {
int tmp = *a;
*a = *b;
*b = tmp;
}
The xchg instruction
xchg can swap a memory value with a register value - it seems perfect at first for your use case. However, it is really slow when you use it to access memory, as you will see later.
void xchg_asm_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"xchgl (%1), %%eax\n\t"
"movl %%eax, (%0)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax"
);
}
We need to load one of the two values into a register, because there is no xchg for two memory locations.
XOR swap in Assembly
I made two versions of the XOR-based swap in Assembly. The first one only loads one of the values in a register, the second loads both before swapping them and writing them back.
void xor_asm_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"xorl (%1), %%eax\n\t"
"xorl %%eax, (%1)\n\t"
"xorl (%1), %%eax\n\t"
"movl %%eax, (%0)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax"
);
}
void xor_asm_register_swap(int *a, int *b) {
__asm__ volatile (
"movl (%0), %%eax\n\t"
"movl (%1), %%ecx\n\t"
"xorl %%ecx, %%eax\n\t"
"xorl %%eax, %%ecx\n\t"
"xorl %%ecx, %%eax\n\t"
"movl %%eax, (%0)\n\t"
"movl %%ecx, (%1)"
: "+r" (a), "+r" (b)
: /* No separate inputs */
: "%eax", "%ecx"
);
}
The results
You can view the full compilation results along with the generated assembly code on Godbolt.
On my machine, the timings (in microseconds) vary a bit, but are generally comparable:
std_swap: 127371
xor_swap: 150152
tmp_swap: 125896
xchg_asm_swap: 699355
xor_asm_swap: 130586
xor_asm_register_swap: 124718
You can see that std_swap, tmp_swap, xor_asm_swap, and xor_asm_register_swap are generally very similar in speed - in fact, if I move xor_asm_register_swap to the front, it turns out slightly slower than std_swap. Also note that tmp_swap is exactly the same assembly code as std_swap (although it regularly measures in as a bit faster, probably because of the ordering).
xor_swap implemented in C++ is slightly slower because the compiler generates an additional memory load/store for each of the instructions because of aliasing - as mentioned above, if we modify xor_swap to take int * __restrict a, int * __restrict b instead (meaning that a and b never alias), the compiler generates the same code as for std_swap and tmp_swap.
xchg_swap, despite using the lowest number of instructions, is terribly slow (over four times slower than any of the other options), just because xchg is not a fast operation if it involves a memory access.
Ultimately, you have the choice between using some custom assembly-based version (that is hard to understand and maintain) or just using std::swap (which is pretty much the opposite, and also benefits from any optimizations that the standard library designers can come up with, e.g. using vectorization on larger types). Since this is over one hundred million iterations, it should be clear that the potential improvement by using assembly code here is very small - if you improve at all (which is not clear) you'd shave off a couple of microseconds at most.
TL;DR: You shouldn't do that, just use std::swap(a, b)
Appendix: __asm__ volatile
I figured that it may make sense at this point to explain the inline assembly code a bit. __asm__ (in GNU mode, asm is enough) introduces a block of assembly code. The volatile is there to make sure the compiler doesn't optimize it away - it likes to just remove the block otherwise.
There are two forms of __asm__ volatile. One of them also deals with goto labels; I will not address it here. The other form takes up to four arguments, separated with colons (:):
The simplest form (__asm__ volatile ("rdtsc")) just dumps the assembly code, but does not really interact with the C++ code around it. In particular, you need to guess how variables are assigned to registers, which is not exactly good.
Note that the assembly code instructions are separated with "\n", because this assembly code is passed verbatim to the GNU assembler (gas).
The second argument is a list of output operands. You can specify what "type" they have (in particular, =r means "any register operand", and +r means "any register operand, but it is also used as an input"). For example, : "+r" (a), "+r" (b) tells the compiler to replace %0 (references the first of the operands) with the register containing a, and %1 with the register containing b.
This notation means you need to replace %eax (as you would normally reference eax in AT&T assembly notation) with %%eax to escape the percentage sign.
You can also use ".intel_syntax\n" to switch to Intel's assembly syntax if you prefer.
The third argument is the same, but deals with input-only operands.
The fourth argument tells the compiler which registers and memory locations lose their values to enable optimizations around the assembly code. For example, "clobbering" "memory" will likely prompt the compiler to insert a full memory fence. You can see that I added all the registers I used for temporary storage to this list.
The following code does some copying from one array of zeroes interpreted as floats to another one, and prints timing of this operation. As I've seen many cases where no-op loops are just optimized away by compilers, including gcc, I was waiting that at some point of changing my copy-arrays program it will stop doing the copying.
#include <iostream>
#include <cstring>
#include <sys/time.h>
static inline long double currentTime()
{
timespec ts;
clock_gettime(CLOCK_MONOTONIC,&ts);
return ts.tv_sec+(long double)(ts.tv_nsec)*1e-9;
}
int main()
{
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
memset(data1,0,W*H*sizeof(float));
memset(data2,0,W*H*sizeof(float));
long double time1=currentTime();
for(int q=0;q<16;++q) // take more time
for(int k=0;k<W*H;++k)
data2[k]=data1[k];
long double time2=currentTime();
std::cout << (time2-time1)*1e+3 << " ms\n";
delete[] data1;
delete[] data2;
}
I compiled this with g++ 4.8.1 command g++ main.cpp -o test -std=c++0x -O3 -lrt. This program prints 6952.17 ms for me. (I had to set ulimit -s 2000000 for it to not crash.)
I also tried changing creation of arrays with new to automatic VLAs, removing memsets, but this doesn't change g++ behavior (apart from changing timings by several times).
It seems the compiler could prove that this code won't do anything sensible, so why didn't it optimize the loop away?
Anyway it isn't impossible (clang++ version 3.3):
clang++ main.cpp -o test -std=c++0x -O3 -lrt
The program prints 0.000367 ms for me... and looking at the assembly language:
...
callq clock_gettime
movq 56(%rsp), %r14
movq 64(%rsp), %rbx
leaq 56(%rsp), %rsi
movl $1, %edi
callq clock_gettime
...
while for g++:
...
call clock_gettime
fildq 32(%rsp)
movl $16, %eax
fildq 40(%rsp)
fmull .LC0(%rip)
faddp %st, %st(1)
.p2align 4,,10
.p2align 3
.L2:
movl $1, %ecx
xorl %edx, %edx
jmp .L5
.p2align 4,,10
.p2align 3
.L3:
movq %rcx, %rdx
movq %rsi, %rcx
.L5:
leaq 1(%rcx), %rsi
movss 0(%rbp,%rdx,4), %xmm0
movss %xmm0, (%rbx,%rdx,4)
cmpq $200000001, %rsi
jne .L3
subl $1, %eax
jne .L2
fstpt 16(%rsp)
leaq 32(%rsp), %rsi
movl $1, %edi
call clock_gettime
...
EDIT (g++ v4.8.2 / clang++ v3.3)
SOURCE CODE - ORIGINAL VERSION (1)
...
size_t W=20000,H=10000;
float* data1=new float[W*H];
float* data2=new float[W*H];
...
SOURCE CODE - MODIFIED VERSION (2)
...
const size_t W=20000;
const size_t H=10000;
float data1[W*H];
float data2[W*H];
...
Now the case that isn't optimized is (1) + g++
The code in this question has changed quite a bit, invalidating correct answers. This answer applies to the 5th version: as the code currently attempts to read uninitialized memory, an optimizer may reasonably assume that unexpected things are happening.
Many optimization steps have a similar pattern: there's a pattern of instructions that's matched to the current state of compilation. If the pattern matches at some point, the matched pattern is (parametrically) replaced by a more efficient version. A very simple example of such a pattern is the definition of a variable that's not subsequently used; the replacement in this case is simply a deletion.
These patterns are designed for correct code. On incorrect code, the patterns may simply fail to match, or they may match in entirely unintended ways. The first case leads to no optimization, the second case may lead to totally unpredictable results (certainly if the modified code if further optimized)
Why do you expect the compiler to optimise this? It’s generally really hard to prove that writes to arbitrary memory addresses are a “no-op”. In your case it would be possible, but it would require the compiler to trace the heap memory addresses through new (which is once again hard since these addresses are generated at runtime) and there really is no incentive for doing this.
After all, you tell the compiler explicitly that you want to allocate memory and write to it. How is the poor compiler to know that you’ve been lying to it?
In particular, the problem is that the heap memory could be aliased to lots of other stuff. It happens to be private to your process but like I said above, proving this is a lot of work for the compiler, unlike for function local memory.
The only way in which the compiler could know that this is a no-op is if it knew what memset does. In order for that to happen, the function must either be defined in a header (and it typically isn't), or it must be treated as a special intrinsic by the compiler. But barring those tricks, the compiler just sees a call to an unknown function which could have side effects and do different things for each of the two calls.
how are an int and char handled in an asm subroutine after being linked with a c++ program? e.g. extern "C" void LCD_ byte (char byte, int cmd_ data); how does LCD_ byte handle the "byte" and "cmd_ data"? how do I access "byte" and "cmd_ data" in the assembly code?
This very much depends on the microprocessor you use. If it is x86, the char will be widened to an int, and then both parameters are passed on the stack. You can find out yourself by compiling C code that performs a call into assembly code, and inspect the assembly code.
For example, given
void LCD_byte (char byte, int cmd_data);
void foo()
{
LCD_byte('a',100);
}
gcc generates on x86 Linux the code
foo:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl $100, 4(%esp)
movl $97, (%esp)
call LCD_byte
leave
ret
As you can see, both values are pushed on the stack (so that 'a' is on the top), then a call instruction to the target routine is made. Therefore, the target routine can find the first incoming parameter at esp+4.
Well a lot depends on the calling convention which in turn, AFAIK, depends on the compiler.
But 99.9%% of the time it is one of 2 things. Either they are passed in registers or they are pushed on to the stack and popped back off inside the function.
Look up the documentation for your platform. It tells you which calling convention is used for C.
The calling convention specifies how parameters are passed, which registers are caller-saves and which are callee-saves, how the return address is stored and everything else you need to correctly implement a function that can be called from C. (as well as everything you need to correctly call a C function)