I have divided the whole question into smaller ones:
What different algorithms is GDB capable of using to reconstruct stack traces?
How does each stack-trace reconstruction algorithm work at a high level? What are its advantages and disadvantages?
What meta-information does the compiler need to provide in the program for each stack-trace reconstruction algorithm to work?
And what are the corresponding g++ compiler switches that enable/disable each particular algorithm?
Speaking in pseudocode, you could call the stack "an array of packed stack frames", where every stack frame is a data structure of variable size that you could express like:
template <size_t N> struct stackframe {
    uintptr_t contents[N];
#ifndef OMIT_FRAME_POINTER
    struct stackframe *nextfp;  // points to the caller's frame (a different N in practice)
#endif
    void *retaddr;
};
The problem is that every function has a different <N> - frame sizes vary.
The compiler knows the frame sizes, and if it is creating debugging information, it will usually emit them as part of it. All the debugger then needs to do is locate the last program counter, look up the function in the symbol table, then use that name to look up the frame size in the debugging information. Add that to the stack pointer and you get to the beginning of the next frame.
With this method you don't require frame linkage, and backtracing works just fine even if you use -fomit-frame-pointer. On the other hand, if you do have frame linkage, then iterating over the stack is just following a linked list - because every frame pointer in a new stack frame is initialized by the function prologue code to point to the previous one.
If you have neither frame size information nor framepointers, but still a symbol table, then you can also perform backtracing by a bit of reverse engineering to calculate the framesizes from the actual binary. Start with the program counter, look up the function it belongs to in the symbol table, and then disassemble the function from the start. Isolate all operations between the beginning of the function and the program counter that actually modify the stackpointer (write anything to the stack and/or allocate stackspace). That calculates the frame size for the current function, so subtract that from the stackpointer, and you should (on most architectures) find the last word written to the stack before the function was entered - which is usually the return address into the caller. Re-iterate as necessary.
Finally, you can perform a heuristic analysis of the contents of the stack - isolate all words on the stack that fall within executably-mapped segments of the process address space (and thereby could be function offsets, a.k.a. return addresses), and play a what-if game: look up the memory, disassemble the instruction there, and see if it actually is a call instruction of some sort; if so, whether that call really targeted the 'next' function, and whether you can construct an uninterrupted call sequence from that. This works to a degree even if the binary is completely stripped (although all you could get in that case is a list of return addresses). I don't think GDB employs this technique, but some embedded low-level debuggers do. On x86, due to the varying instruction lengths, this is terribly difficult because you can't easily "step back" through an instruction stream, but on RISC architectures, where instruction lengths are fixed - e.g. on ARM - this is much simpler.
There are some holes that make simple or even complex/exhaustive implementations of these algorithms fall over sometimes, like tail-recursive functions, inlined code, and so on. The gdb sourcecode might give you some more ideas:
https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/frame.c
GDB employs a variety of such techniques.
I was doing some trick questions about C++ and I ran into code similar to this, then I modified it to see what would happen.
I don't understand why this recursion works in the first place (it prints values from 2 to 4764) and then suddenly throws an exception.
I also don't understand why I can write return in a void function and actually return something other than a plain "return;".
Can anyone explain these two problems?
#include <iostream>
#include <cstdlib> // for system()
using namespace std;

void function(int& a) {
    a++;
    cout << a << endl;
    return function(a); // recursion never stops
}

int main() {
    int b = 2;
    function(b);
    system("pause>0");
}
The comments have correctly identified that your infinite recursion is causing a stack overflow - each new call to the same function takes up more of the space allocated for this purpose (the default C++ stack size varies greatly by environment, from tens of kB on old systems to 10+ MB on the upper end). While the function itself is doing very little in terms of memory, the stack frames (which keep track of which function called which other ongoing function with what parameters) can take up quite a lot.
While useful for certain data structures, recursive programs should not need to go several thousand layers deep, and they usually add a stop condition (in this case, even checking whether a > some_limit) to identify the point where they have gone too deep and need to stop adding more frames to the stack (a plain return;).
In this case, the exact same output can be achieved with a simple for loop, so I guess these trick questions are purely experimental.
On x86-64 platforms, like your laptop or desktop, functions get called one of two ways:
with a call assembly instruction
with a jmp assembly instruction
What's the difference? A call instruction pushes a return address onto the stack: when the called function finishes, execution resumes at the instruction after the call. To keep track of where to return, each call uses memory on the stack. If a recursive function calls itself using call, then as it recurses it'll use up more and more of the stack, eventually resulting in a stack overflow.
On the other hand, a jmp instruction just tells the CPU to jump to the section of code where the other function is stored. If a function is calling itself, then the CPU will just jmp back up to the top of the function and start it over with the updated parameters. This is called a tail-call optimization, and it prevents stack overflow entirely in a lot of common cases because the stack doesn't grow.
If you compile your code at a higher optimization level (say, -O2 on GCC), then the compiler will use tail-call optimization and your code won't have a stack overflow.
I know that there are implementations of memcpy that copy memory in reverse order to optimize for some processors. At one time, a bug ("Strange sound on mp3 flash website") was connected with that. Well, it was an interesting story, but my question is about another function.
I am wondering whether there is a memset function anywhere in the world that fills the buffer starting from the end. It is clear that in theory nothing prevents such an implementation. But I am interested in whether this function has actually been implemented in practice by someone somewhere. I would be especially grateful for a link to a library with such a function.
P.S. I understand that in terms of application programming it makes no difference whether the buffer is filled in ascending or descending order. However, it is important for me to find out whether there was ever any "reverse" implementation. I need it for an article I am writing.
The Linux kernel's memset for the SuperH architecture has this property:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/sh/lib/memset.S?id=v4.14
Presumably it's done this way because the mov instruction exists in predecrement form (mov.l Rm,@-Rn) but not postincrement form. See:
http://shared-ptr.com/sh_insns.html
If you want something that's not technically kernel internals on a freestanding implementation, but an actual hosted C implementation that application code could get linked to, musl libc also has an example:
https://git.musl-libc.org/cgit/musl/tree/src/string/memset.c?id=v1.1.18
Here, the C version of memset (used on many but not all target archs) does not actually fill the whole buffer backwards, but rather starts from both the beginning and end in a manner that reduces the number of conditional branches and makes them all predictable for very small memsets. See the commit message where it was added for details:
https://git.musl-libc.org/cgit/musl/commit/src/string/memset.c?id=a543369e3b06a51eacd392c738fc10c5267a195f
Some of the arch-specific asm versions of memset also have this property:
https://git.musl-libc.org/cgit/musl/tree/src/string/x86_64/memset.s?id=v1.1.18
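The both-ends idea can be sketched in portable C++. This is a simplified illustration of the branch-reduction trick, not musl's actual code (musl's version also handles alignment and word-sized stores):

```cpp
#include <cstddef>

// Simplified sketch of a memset variant that writes from both ends
// inward: small sizes are fully covered by the head/tail stores alone,
// so the loop (and its hard-to-predict branch) is skipped for them.
void *memset_bothends(void *dest, int c, std::size_t n) {
    unsigned char *p = static_cast<unsigned char *>(dest);
    unsigned char v = static_cast<unsigned char>(c);
    if (n == 0) return dest;
    p[0] = p[n - 1] = v;          // ends first: covers n <= 2 already
    if (n <= 2) return dest;
    p[1] = p[n - 2] = v;          // covers n <= 4 with no loop at all
    if (n <= 4) return dest;
    for (std::size_t i = 2; i < n - 2; i++)  // fill the middle
        p[i] = v;
    return dest;
}
```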
I am implementing a simple VM, and currently I am using runtime arithmetic to calculate individual program object addresses as offsets from base pointers.
I asked a couple of questions on the subject today, but I seem to be going slowly nowhere.
I learned a couple of things, though, from question one -
Object and struct member access and address offset calculation -
I learned that modern processors have virtual addressing capabilities, allowing memory offsets to be calculated without any additional cycles devoted to arithmetic.
And from question two - Are address offsets resolved during compile time in C/C++? - I learned that there is no guarantee for this happening when doing the offsets manually.
By now it should be clear that what I want to achieve is to take advantage of the virtual memory addressing features of the hardware and offload those from the runtime.
I am using GCC; as for the platform, I am developing on x86 in Windows, but since it is a VM I'd like to have it run efficiently on all platforms supported by GCC.
So ANY information on the subject is welcome and will be very appreciated.
Thanks in advance!
EDIT: Some overview of my program's code generation - during the design stage the program is built as a tree hierarchy, which is then recursively serialized into one continuous memory block, with objects being indexed and their offsets from the beginning of the program memory block calculated along the way.
EDIT 2: Here is some pseudo code of the VM:
switch (*instruction) {
case 1: call_fn1(*(instruction+1));
        instruction += 1 + sizeof(parameter1); break;
case 2: call_fn2(*(instruction+1), *(instruction+1+sizeof(parameter1)));
        instruction += 1 + sizeof(parameter1) + sizeof(parameter2); break;
case 3: instruction += *(instruction+1); break;
}
Case 1 is a function that takes one parameter, which is found immediately after the instruction, so it is passed as an offset of 1 byte from the instruction. The instruction pointer is incremented by 1 + the size of the first parameter to find the next instruction.
Case 2 is a function that takes two parameters, same as before, first parameter passed as 1 byte offset, second parameter passed as offset of 1 byte plus the size of the first parameter. The instruction pointer is then incremented by the size of the instruction plus sizes of both parameters.
Case 3 is a goto statement, the instruction pointer is incremented by an offset which immediately follows the goto instruction.
EDIT 3: To my understanding, the OS will provide each process with its own dedicated virtual memory addressing space. If so, does this mean the first address is always ... well zero, so the offset from the first byte of the memory block is actually the very address of this element? If memory address is dedicated to every process, and I know the offset of my program memory block AND the offset of every program object from the first byte of the memory block, then are the object addresses resolved during compile time?
Problem is those offsets are not available during the compilation of the C code, they become known during the "compilation" phase and translation to bytecode. Does this mean there is no way to do object memory address calculation for "free"?
How is this done in Java, for example, where only the virtual machine is compiled to machine code? Does this mean the calculation of object addresses takes a performance penalty because of runtime arithmetic?
Here's an attempt to shed some light on how the linked questions and answers apply to this situation.
The answer to the first question mixes two different things, the first is the addressing modes in X86 instruction and the second is virtual-to-physical address mapping. The first is something that is done by compilers and the second is something that is (typically) set up by the operating system. In your case you should only be worrying about the first.
Instructions in X86 assembly have great flexibility in how they access a memory address. Instructions that read or write memory have the address calculated according to the following formula:
segment + base + index * size + offset
The segment portion of the address is almost always the default DS segment and can usually be ignored. The base portion is given by one of the general purpose registers or the stack pointer. The index part is given by one of the general purpose registers and the size is either 1, 2, 4, or 8. Finally the offset is a constant value embedded in the instruction. Each of these components is optional, but obviously at least one must be given.
This addressing capability is what is generally meant when talking about computing addresses without explicit arithmetic instructions. There is a special instruction that one of the commenters mentioned: LEA which does the address calculation but instead of reading or writing memory, stores the computed address in a register.
For the code you included in the question, it is quite plausible that the compiler would use these addressing modes to avoid explicit arithmetic instructions.
As an example, the current value of the instruction variable could be held in the ESI register. Additionally, sizeof(parameter1) and sizeof(parameter2) are each compile-time constants. In the standard X86 calling conventions, function arguments are pushed in reverse order (so the first argument is at the top of the stack), so the assembly code might look something like
case1:
PUSH [ESI+1]
CALL fn1
ADD ESP,4 ; drop arguments from stack
ADD ESI,5
JMP end_switch
case2:
PUSH [ESI+5]
PUSH [ESI+1]
CALL fn2
ADD ESP,8 ; drop arguments from stack
ADD ESI,9
JMP end_switch
case3:
MOV ESI,[ESI+1]
JMP end_switch
end_switch:
This assumes that the size of both parameters is 4 bytes. Of course the actual code is up to the compiler, and it is reasonable to expect that the compiler will output fairly efficient code as long as you ask for some level of optimization.
You have a data item X in the VM, at relative address A, and an instruction that says (for instance) push X, is that right? And you want to be able to execute this instruction without having to add A to the base address of the VM's data area.
I have written a VM that solves this problem by mapping the VM's data area to a fixed Virtual Address. The compiler knows this Virtual Address, and so can adjust A at compile time. Would this solution work for you? Can you change the compiler yourself?
My VM runs on a smart card, and I have complete control over the OS, so it's a very different environment from yours. But Windows does have some facilities for allocating memory at a fixed address -- the VirtualAlloc function, for instance. You might like to try this out. If you do try it out, you might find that Windows is allocating regions that clash with your fixed-address data area, so you will probably have to load by hand any DLLs that you use, after you have allocated the VM's data area.
But there will probably be unforeseen problems to overcome, and it might not be worth the trouble.
Playing with virtual address translation, page tables or TLBs is something that can only be done at the OS kernel level, and is unportable between platforms and processor families. Furthermore hardware address translation on most CPU ISAs is usually supported only at the level of certain page sizes.
To answer my own question, based on the many responses I got.
It turns out that what I want to achieve is not really possible in my situation: getting memory address calculations for free is attainable only when specific requirements are met, and it requires compilation to machine-specific instructions.
I am developing a visual, Lego-style drag-and-drop programming environment for educational purposes, which relies on a simple VM to execute the program code. I was hoping to maximize performance, but it is just not possible in my scenario. It is not that big of a deal though, because program elements can also generate their C code equivalents, which can then be compiled conventionally to maximize performance.
Thanks to everyone who responded and clarified a matter that wasn't really clear to me!
I have a recursive function which calls itself a very large number of times given certain inputs - which is exactly what it should do. I know my function isn't infinitely looping - it just gets to a certain number of calls and overflows. I want to know if this is a problem with putting too much memory on the stack, or just a normal restriction in number of calls. Obviously it's very hard to say a specific number of calls which is the maximum, but can anybody give me a rough estimate of the order of magnitude? Is it in the thousands? Hundreds? Millions?
So, as you've guessed, the problem is the (eponymous) stack overflow. Each call requires setting up a new stack frame, pushing new information onto the stack; stack size is fixed, and eventually runs out.
What sets the stack size? That's a property of the built executable - that is, it's fixed for a given binary. In Microsoft's compiler (used in VS2010) it defaults to 1 megabyte, and you can override it with the "/F" compiler option (see here for an '03 example, but the syntax is the same).
It's very difficult to figure out how many calls that equates to in practice. A function's stack size is determined by its local variables, the size of the return address, and how parameters are passed (some may go on the stack), and much of that depends on the architecture, too. Generally you may assume that the latter two are less than a hundred bytes (a gross estimate). The former depends on what you're doing in the function. If you assume the function takes, say, 256 bytes on the stack, then with a 1 MB stack you'd get 4096 function calls before overflowing - but that doesn't take into account the overhead of the main function, etc.
You could try to reduce local variables and parameter overhead, but the real solution is Tail Call Optimization, in which the compiler releases the calling function as it invokes the recursing function. You can read more about doing that in MSVC here. If you can't do tail calls, and you can't reduce your stack size acceptably, then you can look at increasing stack size with the "/F" parameter, or (the preferred solution) look at a redesign.
It completely depends on how much information you use on the stack. However, the default stack on Windows is 1 MB and the default stack on Unix is 8 MB. Simply making a call can involve pushing a few 32-bit registers and a return address, say, so you could be looking at maybe 20 bytes per call, which would put the maximum at about 50k calls on Windows and 400k on Unix - for an empty function.
Of course, as far as I'm aware, you can change the stack size.
One option for you may be to change/increase the default stacksize. Here's one way http://msdn.microsoft.com/en-us/library/tdkhxaks(v=vs.80).aspx
There are tools to measure the stack usage. They fill the stack in advance with a certain byte pattern and look afterwards up to what address it got changed. With those you can find out how close to the limit you get.
Maybe one of the valgrind tools can do that.
The amount of stack space used by a recursive function depends on the depth of the recursion and the amount of memory space used by each call.
The depth of the recursion refers to the number of levels of calls active at any given moment. For example, a binary tree might have, say, a million nodes, but if it's well balanced you can traverse it with no more than 20 simultaneously active calls. If it's not well balanced, the maximum depth might be much greater.
The amount of memory used by each call depends on the total size of the variables declared in your recursive function.
There's no fixed limit on the maximum depth of recursion; you'll get a stack overflow if your total usage exceeds the stack limit imposed by the system.
You might be able to reduce memory usage by somehow reducing the depth of your recursion, perhaps by restructuring whatever it is you're traversing (you didn't tell us much about that), or by reducing the total size of any local objects declared inside your recursive function (note that heap-allocated objects don't contribute to stack size), or some combination of the two.
And as others have said, you might be able to increase your allowed stack size, but that will probably be of only limited use -- and it's an extra thing you'll have to do before running your program. It could also consume resources and interfere with other processes on the system (limits are imposed for a reason).
Changing the algorithm to avoid recursion might be a possibility, but again, we don't have enough information to say much about that.
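As an illustration of that last point, a recursive tree traversal can be rewritten with an explicit stack that lives on the heap, trading call-stack depth for heap usage (the tree type here is made up for the example):

```cpp
#include <stack>
#include <vector>

struct Node {
    int value;
    Node *left = nullptr, *right = nullptr;
};

// Iterative pre-order traversal: the std::stack allocates on the heap,
// so deep or unbalanced trees no longer risk a call-stack overflow.
std::vector<int> preorder(Node *root) {
    std::vector<int> out;
    std::stack<Node *> work;
    if (root) work.push(root);
    while (!work.empty()) {
        Node *n = work.top(); work.pop();
        out.push_back(n->value);
        if (n->right) work.push(n->right);  // push right first so left pops first
        if (n->left)  work.push(n->left);
    }
    return out;
}
```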
I'm currently making one of those game trainers as a small project. I've already ran into a problem; when you "go into a different level", the addresses for things such as fuel, cash, bullets, their addresses change. This would also happen say, if you were to restart the application.
How can I re-locate these addresses?
I feel like it's a fairly basic question, but it's one of those "it is or is not possible" questions to me. Should I just stop looking and forget the concept entirely? "Too hard?"
It's a bit hard to describe exactly how to do this since it heavily depends on the program you're studying and whether the author went out of his way to make your life difficult. Note that I've only done this once, but it worked reasonably well even though I only knew a little assembly.
What is probably happening is that the values are allocated on the heap using a call to malloc/new, and every time you change level they are cleaned up and re-allocated somewhere else. So the idea is to look at the assembly code of the program to find where the pointer returned by malloc is stored, and figure out a way to reliably read the contents of the pointer and find the value you're looking for.
First thing you'll want is a debugger like OllyDbg and a basic knowledge of assembly. After that, start by setting a read and write breakpoint on the variable you want to examine. Since you said that you can't tell exactly where the variable is, you'll have to pause the process while it's running and search the program's memory for the value. Hopefully you'll end up with only a few results to sift through but be suspicious of anything that is on the stack since it might just be a copy for a function call or for local use.
Once the breakpoint is set just run the program until a break occurs. Now all you have to do is look at the code and examine how the variable is being accessed. If it's being passed as a parameter, go examine the call site of the function. If it's being accessed through a pointer, make a note of it and start examining the pointer. If it's being accessed as an offset of a pointer, that means it's part of a data structure so make a note of it and start examining the other variable. And so on.
Stay focused on your variable and just keep examining the code until you eventually find the root which can be one of two things:
A global variable that has a static address. This is the easiest scenario since you have a static address hardcoded straight into the code that you can use to reliably walk through the data structures.
A stack allocated variable. This is trickier and I'm not entirely sure how to deal with this scenario reliably. It's possible that its address will have the same offset from the beginning of the stack most of the time but it might not. You could also walk the stack to find the corresponding function and its parameters but this a bit tricky to get right.
Once you have an address all that's left to do is use ReadProcessMemory to locate your variable using the information you found. For example, if the address you have represents a pointer to a data structure where at offset 0x40 your fuel value is stored, then you'll have to read the value at the address, add 0x40 to it and do another read on the result.
Note that the address is only valid as long as the executable doesn't change in any way. If it's recompiled or patched then you have to start over. I believe you'll also have to be careful about Windows' ASLR which might change the address around every time you start the program.
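The two-step read described above (read the pointer, add the offset, read again) can be sketched as follows. The `read_mem` helper is a stand-in for ReadProcessMemory (here it just copies from our own address space, so the sketch is self-contained), and the 0x40 offset is the example's:

```cpp
#include <cstdint>
#include <cstring>

// Stand-in for ReadProcessMemory: in a real trainer this would read from
// the target process given a process handle; here it copies from our own
// address space so the example can run on its own.
bool read_mem(std::uintptr_t addr, void *out, std::size_t len) {
    std::memcpy(out, reinterpret_cast<const void *>(addr), len);
    return true;
}

// Follow the pointer stored at `base`, then read the 4-byte value at
// offset 0x40 inside the structure it points to.
bool read_fuel(std::uintptr_t base, std::int32_t *fuel) {
    std::uintptr_t structAddr = 0;
    if (!read_mem(base, &structAddr, sizeof structAddr)) return false;
    return read_mem(structAddr + 0x40, fuel, sizeof *fuel);
}
```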
Comment box was too small to fit this so I'll put it here.
If it's esp plus a constant then I believe that this is a parameter and not a local variable (do confirm by checking the layout of the calling convention). If that's the case, then you should step the program until it returns to its caller, figure out how the parameter is being set (look for push instructions before the call instruction) and continue exploring from there. When I did this I had to unwind the stack once or twice before I found the global pointer to the data structure.
Also the esi register is not related to the stack (I had to look it up) so I'd check how it's being set. It could be that it contains the address of the data structure and the constant is the offset to the variable. If you figure out how the register is set you'll be that much closer to the pointer.