Understanding the assembly language for if-else in following code [duplicate] - c++

What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax

The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps doing and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that the only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.

Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just an infinite loop, your function will eventually go beyond its memory constraints since a call command writes the current rip (instruction pointer) to the stack. Eventually, you'll overflow the stack.

Related

segmentation fault after linking c++ file with asm file [duplicate]

I am currently learning x86 assembly. Something is not clear to me still however when using the stack for function calls. I understand that the call instruction will involve pushing the return address on the stack and then load the program counter with the address of the function to call. The ret instruction will load this address back to the program counter.
My confusion is, does it matter when the ret instruction is called within the procedure/function? Will it always find the correct return address stored on the stack, or must the stack pointer be currently pointing to where the return address was stored? If that's the case, can't we just use push and pop instead of call and ret?
For example, the code below could be the first on entering the function , if we push different registers on the stack, must the ret instruction only be called after the registers are popped in the reverse order so that after the pop %ebp instruction , the stack pointer will point to the correct place on the stack where the return address is, or will it still find it regardless where it is called? Thanks in advance
push %ebp
mov %ebp, %esp
//push other registers
...
//pop other registers
mov %esp, %ebp
(could ret instruction go here for example and still pop the correct return address?)
pop %ebp
ret
You must leave the stack and non-volatile registers as you found them. The calling function has no clue what you might have done with them otherwise - the calling function will simply continue to its next instruction after ret. Only ret after you're done cleaning up.
ret will always look to the top of the stack for its return address and will pop it into EIP. If the ret is a "far" return then it will also pop the code segment into the CS register (which would also have been pushed by call for a "far" call). Since these are the first things pushed by call, they must be the last things popped by ret. Otherwise you'll end up reting somewhere undefined.
The CPU has no idea what is function/etc... The ret instruction will fetch value from memory pointed to by esp a jump there. For example you can do things like (to illustrate the CPU is not interested into how you structurally organize your source code):
; slow alternative to "jmp continue_there_address"
push continue_there_address
ret
continue_there_address:
...
Also you don't need to restore the registers from stack, (not even restore them to the original registers), as long as esp points to the return address when ret is executed, it will be used:
call SomeFunction
...
SomeFunction:
push eax
push ebx
push ecx
add esp,8 ; forget about last 2 push
pop ecx ; ecx = original eax
ret ; returns back after call
If your function should be interoperable from other parts of code, you may still want to store/restore the registers as required by the calling convention of the platform you are programming for, so from the caller point of view you will not modify some register value which should be preserved, etc... but none of that bothers CPU and executing instruction ret, the CPU just loads value from stack ([esp]), and jumps there.
Also when the return address is stored to stack, it does not differ from other values pushed to stack in any way, all of them are just values written in memory, so the ret has no chance to somehow find "return address" in stack and skip "values", for CPU the values in memory look the same, each 32 bit value is that, 32 bit value. Whether it was stored by call, push, mov, or something else, doesn't matter, that information (origin of value) is not stored, only value.
If that's the case, can't we just use push and pop instead of call and ret?
You can certainly push preferred return address into stack (my first example). But you can't do pop eip, there's no such instruction. Actually that's what ret does, so pop eip is effectively the same thing, but no x86 assembly programmer use such mnemonics, and the opcode differs from other pop instructions. You can of course pop the return address into different register, like eax, and then do jmp eax, to have slow ret alternative (modifying also eax).
That said, the complex modern x86 CPUs do keep some track of call/ret pairings (to predict where the next ret will return, so it can prefetch the code ahead quickly), so if you will use one of those alternative non-standard ways, at some point the CPU will realize it's prediction system for return address is off the real state, and it will have to drop all those caches/preloads and re-fetch everything from real eip value, so you may pay performance penalty for confusing it.
In the example code, if the return was done before pop %ebp, it would attempt to return to the "address" that was in ebp at the start of the function, which would be the wrong address to return to.

to how many subroutines can x86 processors call? [duplicate]

This question already has answers here:
C/C++ maximum stack size of program on mainstream OSes
(7 answers)
Closed 1 year ago.
i'm writing a small program to print a polygon with printf("\219") to just see if whatever i'm doing is right for my kernel. but it needs to call many functions and i don't know whether x86 processors can accept that many subroutines and i can't find results in google. so my question is will it accept so many function calls and what is the maximum. (i mean something like this:-)
function a() {b();}
function b() {c();}
function c() {d();}
...
i've used 5 such levels (you know what i mean, right?)
Your function depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return pointer. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.
To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.
call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.
All the CPU knows is whether pushing a return address causes a page fault or not, if you run out of stack space. (A "stack overflow"). Usually the result of recursion going too deep, or a huge array in automatic storage (a C++ local var) or alloca.
(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)
Linux typically uses 8 MiB stacks for user-space processes. (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks. e.g. Linux kernel thread-stacks are currently 16 kiB for x86-64. Previously as small as 4 kiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.
Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)
In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).
If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.
In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.
To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like
void recurse(int n) {
if (n == 1)
return;
recurse(n - 1);
}
and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.
If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).
For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)
__attribute__((noipa)) // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}
__attribute__((noipa))
void bar(){
foo();
}
void baz(){
bar();
}
compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:
## GCC11.2 -O1 -m32 -mpreferred-stack-boundary=2
foo():
ret
bar():
call foo()
ret
baz():
call bar()
ret
g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.
You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.
If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.
Recursive:
I got GCC to make asm that recurses with no gaps between return address:
void recurse(int n){
if (!--n)
return;
recurse(n);
}
## G++11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1 # --n
jne .L6 # skip over the ret if the result is non-zero
.L4:
ret
.L6:
call recurse(int) # push a 4-byte ret addr and jump to top
jmp .L4 # silly compiler, should put another ret here. But we told it not to optimize too much
Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)
With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)
Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.
Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.
void recurse(int n){
if (__builtin_expect(!!--n, 1))
recurse(n);
}
Godbolt
## GCC 11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1
je .L1
call recurse(int)
.L1:
ret

Assembler push rdi, pop rdi around function call

What's purpose of push rdi and pop rdi when calling function in C++?
VS2010, x64, debug, no optimizations
C++
int calc()
{
return 8 + 7;
}
Disassembly:
int calc()
{
000000013F0B1020 push rdi
return 8 + 7;
000000013F0B1022 mov eax,0Fh
}
000000013F0B1027 pop rdi
000000013F0B1028 ret
There is no purpose to it. This is a common artifact of unoptimized code. The code generator emits the push edi instruction in anticipation of having to perform an addition. The EDI register must be preserved across function calls. But then, later, figures out that the addition can be performed at compile time.
Getting rid of extraneous code like this requires "peephole optimization". But that optimization isn't enabled in the Debug build. To know what the real code look like, you have to turn on the optimizer, best done by building the Release build. It in fact will completely eliminate the function, you can prevent it from doing so with:
__declspec(noline) int calc()
{
return 8 + 7;
}
Which produces in the Release build:
return 8 + 7;
000007F7038E1000 mov eax,0Fh
000007F7038E1005 ret
Have you heard of "caller-save" and "callee-save" registers?
Since your CPU only has a small, finite number of registers, it's usually impossible for caller/called functions to always use different registers. If a caller function and called function both want to use the same register, it means the value in the caller will have to be saved/restored before/after the call.
Saving/restoring register values can be done either by the caller or by the callee -- which one does so is a matter of convention. The benefit of "caller-save" registers is that if the caller knows it won't need the value in register XYZ after the call, it can omit the save/restore operations. The benefit of "callee-save" registers is that if the callee knows it won't modify the value in register XYZ, it can omit the save/restore operations.
I'm guessing that your compiler treats RDI as a callee-save register, but doesn't omit the unnecessary save/restore operations unless you have compiler optimizations turned on. (If someone knows this is incorrect, please post another answer!)
UPDATE: I found an article on x86 calling conventions: http://en.wikipedia.org/wiki/X86_calling_conventions
It seems to confirm that with most calling conventions, RDI would be callee-save. This doesn't explain why it isn't pushing and popping all the other callee-save registers. Maybe there is something else going on here.

OCaml calling convention: is this an accurate summary?

I've been trying to find the OCaml calling convention so that I can manually interpret the stack traces that gdb can't parse. Unfortunately, it seems like nothing has ever been written down in English except for general observations. E.g., people will comment on blogs that OCaml passes many arguments in registers. (If there is English documentation somewhere, a link would be much appreciated.)
So I've been trying to puzzle it out from the ocamlopt source. Could anyone confirm the accuracy of these guesses?
And, if I'm right about the first ten arguments being passed in registers, is it just not generally possible to recover the arguments to a function call? In C, the arguments would still be pushed onto the stack somewhere, if only I walk back up to the correct frame. In OCaml, it would seem that callees are free to destroy their callers' arguments.
Register allocation (from /asmcomp/amd64/proc.ml)
For calling into OCaml functions,
The first 10 integer and pointer arguments are passed in the registers rax, rbx, rdi, rsi, rdx, rcx, r8, r9, r10 and r11
The first 10 floating-point arguments are passed in the registers xmm0 - xmm9
Additional arguments are pushed onto the stack (leftmost-first-in?), floats and ints and pointers intermixed
The trap pointer (see Exceptions below) is passed in r14
The allocation pointer (presumably for the minor heap as described in this blog post) is passed in r15
The return value is passed back in rax if it is an integer or pointer, and in xmm0 if it is a float
All registers are caller-save?
For calling into C functions, the standard amd64 C convention is used:
The first six integer and pointer arguments are passed in rdi, rsi, rdx, rcs, r8, and r9
The first eight float arguments are passed in xmm0 - xmm7
Additional arguments are pushed onto the stack
The return value is passed back in rax or xmm0
The registers rbx, rbp, and r12 - r15 are callee-save
Return address (from /asmcomp/amd64/emit.mlp)
The return address is the first pointer pushed into the call frame, in accordance with amd64 C convention. (I'm guessing the ret instruction assumes this layout.)
Exceptions (from /asmcomp/linearize.ml)
The code try (...body...) with (...handler...); (...rest...) gets linearized like this:
Lsetuptrap .body
(...handler...)
Lbranch .join
Llabel .body
Lpushtrap
(...body...)
Lpoptrap
Llabel .join
(...rest...)
and then emitted as assembly like this (destinations on the right):
call .body
(...handler...)
jmp .join
.body:
pushq %r14
movq %rsp, %r14
(...body...)
popq %r14
addq %rsp, 8
.join:
(...rest...)
Somewhere in the body, there's a linearized opcode Lraise which gets emitted as this exact assembly:
movq %r14, %rsp
popq %r14
ret
Which is really neat! Instead of this setjmp/longjmp business, we create a dummy frame whose return address is the exception handler and whose only local is the previous such dummy frame. The /asmcomp/amd64/proc.ml has a comment calling $r14 the "trap pointer" so I'll call this dummy frame the trap frame. When we want to raise an exception, we set the stack pointer to the most recent trap frame, set the trap pointer to the trap frame before that, and then "return" into the exception handler. And I bet if the exception handler can't handle this exception, it just reraises it.
The exception is in %eax.
This is more an answer than a question! The bit I know on this topic, I have learned by looking at the source, just like you, so don't expect further precisions to be much more authoritative than your post.
Yes, I think OCaml uses specialized calling conventions with caller-save registers only. A benefit of this choice is that it simplifies tail-calls: when you jump through a tail-call¹, you don't have to spill or reload any register.
¹: for non-self tail calls, this only works when there are not too much arguments, and therefore we don't need to spill. If stack allocation is needed, the call is turned into a non-tail call.
Note that calling conventions still depends strongly on the target architecture. On x86 for example, a small numbers of globals are used when the registers are exhausted and before spilling on the stack, to preserve tail-calls.
I also agree on "leftmost-first-in": arguments are traversed in order by calling_conventions in proc.ml, stored in offset order by slot_offset in emit.mlp; they where computed right-to-left, but returned in order, in selectgen.ml.
Yes, you cannot recover the arguments from a call, as OCaml tries to reuse registers as much as possible, and will thus destroy their content if it is not useful anymore in the remaining of a function. Debuggers have no way to print the arguments, they can only, at a given point in the function, print the variables that are still live, but for that, you would need to modify ocamlopt to dump DWARF code to recover the values.

what will be the addressing mode in assembly code generated by the compiler here?

Suppose we've got two integer and character variables:
int adad=12345;
char character;
Assuming we're discussing a platform in which, length of an integer variable is longer than or equal to three bytes, I want to access third byte of this integer and put it in the character variable, with that said I'd write it like this:
character=*((char *)(&adad)+2);
Considering that line of code and the fact that I'm not a compiler or assembly expert, I know a little about addressing modes in assembly and I'm wondering the address of the third byte (or I guess it's better to say offset of the third byte) here would be within the instructions generated by that line of code themselves, or it'd be in a separate variable whose address (or offset) is within those instructions ?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
int adad=12345;
volatile char character;
character=*((char *)(&adad)+2);
return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
pushq %rbp
movq %rsp,%rbp
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
xorl %eax,%eax
leave
ret
The only three lines that we care about are:
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
movl $0x00003039,0xf8(%rbp)
leaq 0xf8(%rbp),%rax
addq $0x02,%rax
movzbl (%rax),%eax
movb %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
Without knowing anything about the compiler and the underlying CPU architecture no definitive answer can be given. For example, not all CPU architectures allow the addressing of every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed, instead of byte-addressed, what the compiler will generate is inevitably going to be the loading into some register of the whole word adad (presumably by an offset from a base pointer register, if the variable in question is on stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers or memory addresses defined by other registers or constant, and some such architectures exist (though they may not be all that popular at this time;-).