What does the PIC register (%ebx) do? - c++

I have written a "dangerous" program in C++ that jumps back and forth from one stack frame to another. The goal is to be jump from the lowest level of a call stack to a caller, do something, and then jump back down again, each time skipping all the calls inbetween.
I do this by manually changing the stack base address (setting %ebp) and jumping to a label address. It totally works, with gcc and icc both, without any stack corruption at all. The day this worked was a cool day.
Now I'm taking the same program and re-writing it in C, and it doesn't work. Specifically, it doesn't work with gcc v4.0.1 (Mac OS). Once I jump to the new stack frame (with the stack base pointer set correctly), the following instructions execute, being just before a call to fprintf. The last instruction listed here crashes, dereferencing NULL:
lea 0x18b8(%ebx), %eax
mov (%eax), %eax
mov (%eax), %eax
I've done some debugging, and I've figured out that by setting the %ebx register manually when I switch stack frames (using a value I observed before leaving the function in the first place), I fix the bug. I've read that this register deals with "position independent code" in gcc.
What is position independent code? How does position independent code work? To what is this register pointing?

EBX points to the Global Offset Table. See this reference about PIC on i386. The link explains what PIC is an how EBX is used.

PIC is code that is relocated dynamically when it is loaded. Code that is non-PIC has jump and call addresses set at link time. PIC has a table that references all the places where such values exist, much like a .dll.
When the image is loaded, the loader will dynamically update those values. Other schemes reference a data value that defines a "base" and the target address is decided by performing calculations on the base. The base is usually set by the loader again.
Finally, other schemes use various trampolines that call to known relative offsets. The relative offsets contain code and/or data that are updated by a loader.
There are different reasons why different schemes are chosen. Some are fast when run, but slower to load. Some are fast to load, but have less runtime performance.

Related

Get the address of an intrinsic function-generated instruction

I have a function that uses the compiler intrinsic __movsq to copy some data from a global buffer into another global buffer upon every call of the function. I'm trying to nop out those instructions once a flag has been set globally and the same function is called again. Example code:
// compiler: MSVC++ VS 2022 in C++ mode; x64
void DispatchOptimizationLoop()
{
__movsq(g_table, g_localimport, 23);
// hopefully create a nop after movsq?
static unsigned char* ptr = (unsigned char*)(&__nop);
if (!InterlockedExchange8(g_Reduce, 1))
{
// point to movsq in memory
ptr -= 3;
// nop it out
...
}
// rest of function here
...
}
Basically the function places a nop after the movsq, and then tries to get the address of the placed nop then backtrack by the size of the movsq so that a pointer is pointing to the start of movsq, so then I can simply cover it with 3 0x90s. I am aware that the line (unsigned char*)(&__nop) is not actaully creating a nop because I'm not calling the intrinsic, I'm just trying to show what I want to do.
Is this possible, or is there a better way to store the address of the instructions that need to be nop'ed out in the future?
It's not useful to have the address of a 0x90 NOP somewhere else, all you need is the address of machine code inside your function. Nothing you've written comes remotely close to helping you find that. As you say, &__nop doesn't lead to there being a NOP in your function's machine code which you could offset relative to.
If you want to hard-code offsets that could break with different optimization settings, you could take the address of the start of the function and offset it.
Or you could write the whole function in asm so you can put a label on the address you want to modify. That would actually let you do this safely.
You might get something that happens to work with GNU C labels as values, where you can take the address of C goto labels like &&label. Like put a mylabel: before the intrinsic, and maybe after for good measure so you can check that the difference is the expected 3 bytes. If you're lucky, the compiler won't put any other instructions between your labels.
So you can memset((void*)&&mylabel, 0x90, 3) (after an assert on &&mylabel_end - &&mylabel == 3). But I don't think MSVC supports that GNU extension or anything equivalent.
But you can't actually use memset if another thread could be running this at the same time.
And for efficiency, you want a single 3-byte NOP anyway.
And of course you'd have to VirtualProtect the page of machine code containing that instruction to make it writeable. (Assuming the function is 16-byte aligned, it's hopefully impossible for that one instruction near the start to be split across two pages.)
And if other threads could be running this function at the same time, you'd better use an atomic RMW (on the containing dword or qword) to replace the 3-byte instruction with a single 3-byte NOP, otherwise you could have another thread fetch and decode the first NOP, but then fetch a byte of of the movsq machine code not replaced yet.
Actually a plain mov store would be atomic if it's 4 bytes not crossing an 8-byte boundary. Since there are no other writers of different data, it's fine to load / AND/OR / store to later store the same surrounding bytes you loaded earlier. Normally a non-atomic load+store is not thread-safe, but no other threads could have written a different value in the meantime.
Atomicity of cross-modifying code
I think cross-modifying code has atomicity rules similar to data. But if the instruction spans a 16-byte boundary, code-fetch in another core might have pulled in the first 1 or 2 bytes of it before you atomically replace all 3. So the 2nd and 3rd byte get treated as either the start of an instruction, or the 2nd + 3rd bytes of a long-NOP. Since long-NOPs generally start with 0F 1F with an escape byte, if that's not how __movsq starts then it could desync.
So if cross-modifying code doesn't trigger a pipeline nuke on the other core, it's not safe to do it while another thread might be running the code. Code fetch is usually done in 16-byte chunks but that's not guaranteed. And it's not guaranteed that they're aligned 16-byte chunks.
So you should probably make sure no other threads are running this function while you change the machine code. Unless you're very sure of the safety of what you're doing and check each build to make sure the instruction starts at a safe offset, where safe is defined according to any possibility or anything that could go wrong.

to how many subroutines can x86 processors call? [duplicate]

This question already has answers here:
C/C++ maximum stack size of program on mainstream OSes
(7 answers)
Closed 1 year ago.
i'm writing a small program to print a polygon with printf("\219") to just see if whatever i'm doing is right for my kernel. but it needs to call many functions and i don't know whether x86 processors can accept that many subroutines and i can't find results in google. so my question is will it accept so many function calls and what is the maximum. (i mean something like this:-)
function a() {b();}
function b() {c();}
function c() {d();}
...
i've used 5 such levels (you know what i mean, right?)
Your function depth is not limited by the processor, but by the size of your stack. Behind the scenes, calls to C++ functions usually translate to call instructions in x86, which push four (or eight, for x64 programs) bytes onto your program's stack for the return pointer. Sometimes calls are optimized and don't touch the stack at all. Functions might also push additional bytes (e.g. local function state) onto the stack.
To get the exact number of functions you can call, you need to disassemble your code to calculate the number of bytes each function pushes to the stack (minimum four/eight because of the return address, but likely many more), then find the maximum stack size and divide it by the function frame size.
call and ret instructions aren't special; the CPU doesn't know how deeply nested it is. (And doesn't know the difference between calling another function or being recursive.) As described in my answer here, functions are a high-level concept that asm gives you the tools to implement.
All the CPU knows is whether pushing a return address causes a page fault or not, if you run out of stack space. (A "stack overflow"). Usually the result of recursion going too deep, or a huge array in automatic storage (a C++ local var) or alloca.
(As mentioned in comments, not every C function call results in a call instruction; inlining can fully optimize away the fact that it's a separate function. You want small functions to inline, and the design of the template classes in the C++ standard library depends on this for efficiency. Also, a tailcall can be just a jmp, having the next function take over this stack space instead of getting new space.)
Linux typically uses 8 MiB stacks for user-space processes. (ulimit -s). Windows is typically 1 MiB. Inside a kernel, you often use smaller stacks. e.g. Linux kernel thread-stacks are currently 16 kiB for x86-64. Previously as small as 4 kiB (one page) for 32-bit x86, but some code has more local vars that take up stack space.
Related: How does the stack work in assembly language?. When ret executes, it just pops into the program counter. It's up to the programmer (or compiler) to create asm that runs ret when the stack pointer is pointing at somewhere you want to jump. (Typically a return address.)
In modern Linux systems, the smallest stack frame is 16 bytes, because the ABI specifies maintaining 16-byte stack alignment before a call. So best case you can have a call depth of 512k before you overflow the stack. (Unless you're in a thread that was started with an extra large thread-stack).
If you're in 32-bit mode with an old version of the i386 System V ABI that only required 4-byte stack alignment (like gcc -mpreferred-stack-boundary=2 instead of the default 4), with functions that just called without using any other stack space, 8 MiB of stack would give you 8 MiB / 4B = 2 Mi call depth.
In real life, some of that 8MiB space on the main thread's stack is used up by env vars and argv[] already on the stack at the process entry point (copied there by the kernel), and then a bit more as _start calls a function that calls main.
To make this able to actually return at the end, instead of just eventually faulting, you'd need either a huge chain, or some recursion with a termination condition, like
void recurse(int n) {
if (n == 1)
return;
recurse(n - 1);
}
and compile with some optimization but not enough to get the compiler to turn it into a do{}while(--n); loop or optimize away.
If you wanted all different functions, that's ok, the code-size limit is at least 2GiB, and call rel32 / ret takes a total of 6 bytes. (Un-optimized code would default to push ebp or push rbp as well, so you'd have to avoid that for 32-bit code if you wanted to meet that 4-byte stack-frame goal of having all your stack space full with just return addresses).
For example with GCC (see How to remove "noise" from GCC/clang assembly output? for the __attribute__((noipa)) option)
__attribute__((noipa)) // don't let other functions even notice that this is empty, let alone inline it
void foo(void){}
__attribute__((noipa))
void bar(){
foo();
}
void baz(){
bar();
}
compiles with GCC (Godbolt compiler explorer) to this 32-bit asm which just calls without using any stack space for anything else:
## GCC11.2 -O1 -m32 -mpreferred-stack-boundary=2
foo():
ret
bar():
call foo()
ret
baz():
call bar()
ret
g++ -O2 would optimize these call/ret functions into a tailcall like jmp foo, which doesn't push a return address. That's why I only used -O1 or -Og. Of course in real life you do want things to inline and optimize away; defeating that is just for this silly computer trick to achieve the longest finite call depth that actually would crash if you made it longer.
You can repeat this pattern indefinitely; GCC allows long symbol names, so it's not a problem to have many different unique function names. You can split across multiple files, with just a prototype for one of the functions in the other file.
If you reduce the -falign-functions tuning setting, you can probably get it down to either 6 bytes per function (no padding) or 8 (align by 8 thus 2 bytes of padding), down from the default of aligning each function label by 16, wasting 10 bytes per function.
Recursive:
I got GCC to make asm that recurses with no gaps between return address:
void recurse(int n){
if (!--n)
return;
recurse(n);
}
## G++11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1 # --n
jne .L6 # skip over the ret if the result is non-zero
.L4:
ret
.L6:
call recurse(int) # push a 4-byte ret addr and jump to top
jmp .L4 # silly compiler, should put another ret here. But we told it not to optimize too much
Note the -mregparm=3 option, so the first up-to-3 args are passed in registers, instead of on the stack in the inefficient i386 System V calling convention. (It was designed a long time ago; the x86-64 SysV calling convention is much better.)
With -O2, this function optimizes away to just a ret. (It turns the call into a tailcall which becomes a loop, and then it can see it's a non-infinite loop with no side effects so it just removes it.)
Of course in real life you want optimizations like this. Recursion in asm sucks compared to loops. If you're worried about robust code and call depth, don't write recursive functions in the first place, unless you know recursion depth will be shallow. If a debug build doesn't convert your recursion to iteration, you don't want your kernel to crash.
Just for fun, I got GCC to tighten up the asm even at -O1 by convincing it to do the conditional branch over the call, to only have one ret instruction in the function so tail-duplication wouldn't be relevant anyway. And it means the fast path (recursion) involves a not-taken macro-fused conditional branch plus a call.
void recurse(int n){
if (__builtin_expect(!!--n, 1))
recurse(n);
}
Godbolt
## GCC 11.2 -O1 -mregparm=3 -m32 -mpreferred-stack-boundary=2
recurse(int):
sub eax, 1
je .L1
call recurse(int)
.L1:
ret

When debugging with gdb, what is the meaning of the debugging information in the front of the assembly code?

When debugging c code with gdb, the displayed assembly code is
0x000000000040116c main+0 push %rbp
0x000000000040116d main+1 mov %rsp,%rbp
!0x0000000000401170 main+4 movl $0x0,-0x4(%rbp)
0x0000000000401177 main+11 jmp 0x40118d <main+33>
0x0000000000401179 main+13 mov -0x4(%rbp),%eax
0x000000000040117c main+16 mov %eax,%edx
0x000000000040117e main+18 mov -0x4(%rbp),%eax
Is the 0x000000000040116d in the front of the first assembly instruction the virtual address of this function? Is main+1 the offset of this assembly from the main function? The next assembly is main+4. Does it mean that the first mov %rsp,%rbp is three bytes? If so, why is movl $0x0,-0x4(%rbp) 7 bytes?
I am using a server. The version is:Linux version 4.15.0-122-generic (buildd#lcy01-amd64-010) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)) #124~16.04.1-Ubuntu SMP.
Pretty much yes. It's quite apparent that, for example, adding 4 to the first address gives you the address shown for main+4, and adding another 7 on top of that gives you the corresponding address for main+11.
As far as the two move instructions go: they are very different, and do completely different things. They are two very different kind of moves, and that's how many bytes each one requires in x86 machine language, so its not surprising that one takes many more bytes than the other. As far as the precise reason why, well, in general, that opens a very long, broad, and windy discussion about the underlying reasons, and the original design goals of the x86 machine language instruction set. Much of it actually no longer applies (and you would probably find quite boring, actually), since the modern x86 CPU is something quite radically different than its original generation. But it has to remain binary compatible. Hence, little oddities like that.
Just to give you a basic understanding: the first move is between two CPU registers. Doesn't take a long novel to specify from where, and to where. The second move has to specify a 32 bit value (0 to be precise), a CPU register, and a memory offset. That has to be specified somewhere. You have to find the bytes somewhere to specify all the little details for that.

Understanding the assembly language for if-else in following code [duplicate]

What happens if i say 'call ' instead of jump? Since there is no return statement written, does control just pass over to the next line below, or is it still returned to the line after the call?
start:
mov $0, %eax
jmp two
one:
mov $1, %eax
two:
cmp %eax, $1
call one
mov $10, %eax
The CPU always executes the next instruction in memory, unless a branch instruction sends execution somewhere else.
Labels don't have a width, or any effect on execution. They just allow you to make reference to this address from other places. Execution simply falls through labels, even off the end of your code if you don't avoid that.
If you're familiar with C or other languages that have goto (example), the labels you use to mark places you can goto to work exactly the same as asm labels, and jmp / jcc work exactly like goto or if(EFLAGS_condition) goto. But asm doesn't have special syntax for functions; you have to implement that high-level concept yourself.
If you leave out the ret at the end of a block of code, execution keeps doing and decodes whatever comes next as instructions. (Maybe What would happen if a system executes a part of the file that is zero-padded? if that was the last function in an asm source file, or maybe execution falls into some CRT startup function that eventually returns.)
(In which case you could say that the block you're talking about isn't a function, just part of one, unless it's a bug and a ret or jmp was intended.)
You can (and maybe should) try this yourself in a debugger. Single-step through that code and watch RSP and RIP change. The nice thing about asm is that the total state of the CPU (excluding memory contents) is not very big, so it's possible to watch the entire architectural state in a debugger window. (Well, at least the interesting part that's relevant for user-space integer code, so excluding model-specific registers that the only the OS can tweak, and excluding the FPU and vector registers.)
call and ret aren't "special" (i.e. the CPU doesn't "remember" that it's inside a "function").
They just do exactly what the manual says they do, and it's up to you to use them correctly to implement function calls and returns. (e.g. make sure the stack pointer is pointing at a return address when ret runs.) It's also up to you to get the calling convention correct, and all that stuff. (See the x86 tag wiki.)
There's also nothing special about a label that you jmp to vs. a label that you call. An assembler just assembles bytes into the output file, and remembers where you put label markers. It doesn't truly "know" about functions the way a C compiler does. You can put labels wherever you want, and it doesn't affect the machine code bytes.
Using the .globl one directive would tell the assembler to put an entry in the symbol table so the linker could see it. That would let you define a label that's usable from other files, or even callable from C. But that's just meta-data in the object file and still doesn't put anything between instructions.
Labels are just part of the machinery that you can use in asm to implement the high-level concept of a "function", aka procedure or subroutine: A label for callers to call to, and code that will eventually jump back to a return address the caller passed, one way or another. But not every label is the start of a function. Some are just the tops of loops, or other targets of conditional branches within a function.
Your code would run exactly the same way if you emulated call with an equivalent push of the return address and then a jmp.
one:
mov $1, %eax
# missing ret so we fall through
two:
cmp %eax, $1
# call one # emulate it instead with push+jmp
pushl $.Lreturn_address
jmp one
.Lreturn_address:
mov $10, %eax
# fall off into whatever comes next, if it ever reaches here.
Note that this sequence only works in non-PIC code, because the absolute return address is encoded into the push imm32 instruction. In 64-bit code with a spare register available, you can use a RIP-relative lea to get the return address into a register and push that before jumping.
Also note that while architecturally the CPU doesn't "remember" past CALL instructions, real implementations run faster by assuming that call/ret pairs will be matched, and use a return-address predictor to avoid mispredicts on the ret.
Why is RET hard to predict? Because it's an indirect jump to an address stored in memory! It's equivalent to pop %internal_tmp / jmp *%internal_tmp, so you can emulate it that way if you have a spare register to clobber (e.g. rcx is not call-preserved in most calling conventions, and not used for return values). Or if you have a red-zone so values below the stack-pointer are still safe from being asynchronously clobbered (by signal handlers or whatever), you could add $8, %rsp / jmp *-8(%rsp).
Obviously for real use you should just use ret, because it's the most efficient way to do that. I just wanted to point out what it does using multiple simpler instructions. Nothing more, nothing less.
Note that functions can end with a tail-call instead of a ret:
(see this on Godbolt)
int ext_func(int a); // something that the optimizer can't inline
int foo(int a) {
return ext_func(a+a);
}
# asm output from clang:
foo:
add edi, edi
jmp ext_func # TAILCALL
The ret at the end of ext_func will return to foo's caller. foo can use this optimization because it doesn't need to make any modifications to the return value or do any other cleanup.
In the SystemV x86-64 calling convention, the first integer arg is in edi. So this function replaces that with a+a, then jumps to the start of ext_func. On entry to ext_func, everything is in the correct state just like it would be if something had run call ext_func. The stack pointer is pointing to the return address, and the args are where they're supposed to be.
Tail-call optimizations can be done more often in a register-args calling convention than in a 32-bit calling convention that passes args on the stack. You often run into situations where you have a problem because the function you want to tail-call takes more args than the current function, so there isn't room to rewrite our own args into args for the function. (And compilers don't tend to create code that modifies its own args, even though the ABI is very clear that functions own the stack space holding their args and can clobber it if they want.)
In a calling convention where the callee cleans the stack (with ret 8 or something to pop another 8 bytes after the return address), you can only tail-call a function that takes exactly the same number of arg bytes.
Your intuition is correct: the control just passes to the next line below after the function returns.
In your case, after call one, your function will jump to mov $1, %eax and then continue down to cmp %eax, $1 and end up in an infinite loop as you will call one again.
Beyond just an infinite loop, your function will eventually go beyond its memory constraints since a call command writes the current rip (instruction pointer) to the stack. Eventually, you'll overflow the stack.

Corrupted offset in call instruction

The last few days have been spent debugging a very strange problem. An application built for i386 running on Windows crashed, with the top of the callstack completely corrupted and the instruction pointer in a nonsense location.
After some effort, I rebuilt the callstack and was able to determine how the IP ended up in the nonsense location. An instruction in boost shared pointer code attempts to call a function defined in my DLLs import address table using an incorrect offset. The instruction looks like:
call dword ptr [nonsense offset into import address table]
As a result, execution ended up in a bad location that was, unfortunately, executable. Execution then proceeded, gobbling up the top of the stack until eventually crashing.
By launching the identical application on my PC, and stepping into the problematic code, I can find the same call instruction and see it's supposed to by calling msvc100's 'new' operator.
Further comparing the minidump from the client's PC to my PC, I found that my PC was calls a function with an offset of 0x0254 into the address table. On the clients PC, the code is trying to invoke a function with an offset of 0x8254.
What's even more confusing is that this offset is not coming from a register or another memory location. The offset is a constant in the disassembly. So, the disassembly looks like:
call dword ptr [ 0x50018254 ]
not like:
call dword ptr [ edx ]
Does anyone know how this might happen?
That's a single bit flip:
0x0254 = 0b0000001001010100
0x8254 = 0b1000001001010100
Perhaps corrupt memory, corrupt disk, gamma ray from the sun...?
If this specific case is reproducible and their on-disk binary matches yours, I'd investigate further. If it's not specifically reproducible, I'd encourage the client to run some machine diagnostics.
Thats seems to me a hardware error for sure, mainly memory error. As #Hostile_Fork pointed out, is just a bit flip.
Does your memory have ECC feature? it it does it, make sure is enabled. I would pass a burn-in memory test with memtest86 to see what happens, I bet you have a faulty memory chip, doesn't look like a bug.