How does gcc implement stack unrolling for C++ exceptions on linux? - c++

How does gcc implement stack unrolling for C++ exceptions on linux? In particular, how does it know which destructors to call when unrolling a frame (i.e., what kind of information is stored and where is it stored)?

See section 6.2 of the x86_64 ABI. This details the interface but not a lot of the underlying data. This is also independent of C++ and could conceivably be used for other purposes as well.
There are primarily two sections of the ELF binary as emitted by gcc which are of interest for exception handling. They are .eh_frame and .gcc_except_table.
.eh_frame follows the DWARF format (the debugging format that primarily comes into play when you're using gdb). It has exactly the same format as the .debug_frame section emitted when compiling with -g. Essentially, it contains the information necessary to pop back to the state of the machine registers and the stack at any point higher up the call stack. See the Dwarf Standard at dwarfstd.org for more information on this.
.gcc_except_table contains information about the exception handling "landing pads" the locations of handlers. This is necessary so as to know when to stop unwinding. Unfortunately this section is not well documented. The only snippets of information I have been able to glean come from the gcc mailing list. See particularly this post
The remaining piece of information is then what actual code interprets the information found in these data sections. The relevant code lives in libstdc++ and libgcc. I cannot remember at the moment which pieces live in which. The interpreter for the DWARF call frame information can be found in the gcc source code in the file gcc/unwind-dw.c

There isn't much documentation currently available, however the basic system is that GCC translates try/catch blocks to function calls and then links in a library with the needed runtime support (documentation about the tree building code includes the statement "throwing an exception is not directly represented in GIMPLE, since it is implemented by calling a function").
Unfortunately I'm not familiar with these functions and can't tell you what to look at (other than the source for libgcc -- which includes the exception handling runtime).
There is an "Exception Handling for Newbies" document available.

Although this looks to be for Itanium, presumably the implementation is similar for x86: exception handling ABI

Related

What affects generated machine code at each step of the compilation process?

I am almost certain this question has been asked before, but I can not seem to find the right keywords to search for to get an answer. My apologies if this is a duplicate.
I am better trying to understand the compilation process of say a C++ file as it goes from the C++ syntax to the binary machine code. In addition I am trying to understand what influences the resulting machine code.
First, I am nearly certain that the following are the only factors (for most systems) that dictate the final machine code (please correct me if I am wrong here)
The tools used to compile, assemble, and link.
Things like gnu c compiler, clang, visual studio, nasm, ect.
The kernel of the system being used.
Whether its a specific version of the linux kernel, windows microkernel, or some other kernel like a mac os x one.
The operating system being used.
This one I am less clear about. I am unsure if machines running the same linux kernel, but different os, in this case let's say debian vs centos, will they produce different binaries.
Lastly the hardware architecture.
Different cpu architectures like arm 64, x86, power pc, ect. take different op codes so obviously the machine code should be different.
So with that being said here is my understanding of the compilation process and where each of these dependencies show up.
I write a C++ file and use code that my system can understand. A good example might be using <winsock.h> on windows and <sys/socket.h> on linux.
The preprocessor runs and executes any preprocessor macros.
Here I know that different preprocessors will define different macros but for now I will assume this is not too machine dependent. (This might be wrong to assume).
The compiler tools run to produce assembly file outputs.
Here the assembly produced depends on the compiler and what optimizations or choices it makes.
It also depends on the kernel because different kernels have different system calls and store files in different locations. This means the assembly might make changes such as different branching when calling functions specific to that kernel.
The operating system? Still unsure how the operating system fits in to this. If two machines have the same kernel, what does the operating system do to the binaries?
Finally the assembly code depends on the cpu architecture. I think that is a pretty obvious statement.
Once the compiler produces an assembly. We can then invoke the assembler to turn our assembly code into almost complete machine code. (I think machine code is identical to binary opcodes a cpu manual lists but this might be wrong).
The corresponding machine code files (often called object files I think) contain nearly all the instructions needed to run or reference other machine code files which will be linked in the next step.
This machine code usually has some format (I think ELF is a popular format for linux) and this format is dependent on the linker for sure.
I don't think the kernel, operating system, or hardware affect the layout/format of the object file but this is probably wrong. If they do please correct this.
The hardware will affect the actual machine code produced because again I think it is a 1 to 1 mapping of machine code instructions to opcodes for a cpu.
I am unsure if the kernel or operating system affect the linking process because I thought their changes were already incorporated in the compiling step.
Finally the linking step occurs.
I think this is as simple as the linker looking for all the referenced machine code and injecting it into one complete machine code file which can be executed.
I have no clue what affects this besides the linker tool itself.
So with all that, I need help identifying inaccuracies with the procedure I described above, and any dependencies I might have missed whether it be cpu, os, kernel, or tool ones.
Thank you and sorry for the long winded question. This probably should have been broken up into multiple questions but I am too far in. If this does not go well I may ask each part in individual questions.
EDIT:
Questions with more focus.
What components of a machine affect the machine code produced given a C++ file input?
Actually that is a lot of questions and usually you're question would be much too broad for SO (as you managed to recognize by yourself). But on the other hand you showed a deep interest (just by writing such a long and profound question) and also a lot of correct understanding of the process of compiling a program. The things you are missing or not understanding correctly (and you are probably the most interested in) are those things, that I myself found hard to learn. Thus I will provide you with some important points, that I think you are missing in the big picture.
Note that I am very much used to Linux, so I will mostly describe how things work on Linux. But I believe that most things also happen in a similar way on other operating systems.
Let's begin with the hardware. A modern computer has a CPU of some architecture. There are lots of different of CPU architectures. You mentioned some of them like arm, x86, etc. which are families of similar CPUs and can be divided into smaller groups by bit width and/or supported extensions. Ultimately your processor has a specified instruction set that defines which opcodes it supports and what those opcodes do. If a native (compiled) program runs, there are raw opcodes in the memory and the CPU directly executes them following its architecture specification.
Aside from the CPU there is a lot more hardware connected to your computer. Usually communicating with this hardware is complicated and not standardized. If a user program for example gets input keystrokes from the keyboard, in does not have to directly communicate with the keyboard, but rather does this via the operating system kernel. This works by a mechanism called syscall interrupt. The kernel installs an handler routine, that is called if a user program triggers such an interrupt with a special CPU instruction. You can think of it like a language agnostic function call from the program into the kernel. For example for Linux you can find a list of all syscalls at the syscall(2) man page. The syscalls form the kernel's Application Binary Interface (kernel ABI). Reading and writing from a terminal or using a filesystem are examples for syscall functionality.
As you can see, there are already very high level functions, that are implemented in the kernel. However the functionality is still quite limited for most typical applications. To encapsulate the syscalls and provide functions for memory management, utility functions, mathematical functions and many other things you probably use in your daily programs, there is usually another layer between the program and the kernel. This thing is called the C standard library, and it is a shared library (we will cover what exactly this is in a moment). On GNU/Linux it is the glibc which is the single most important library on a GNU/Linux system (and notably not part of the kernel 1). While it implements all the features that are required by the C standard (for example functions like malloc() or strcpy()), it also ships a lot of additional functions which are a superset of the ISO C standard library, the POSIX standard and some extensions. This interface is usually called the Application Programming Interface (API) of the operating system. While it is in principle possible to bypass the API and directly use the syscalls, almost all programs (even when written in other languages than C or C++) use the C library.
Now get yourself a coffee and a few minutes of rest. We now have enough background information to look at how a C++ program is transformed into a binary, and how exactly this binary is executed.
A C++ program consists of different compilation units (usually each different source file is a compilation unit). Each compilation unit undergoes the following steps
The preprocessor is run on the file. It includes header, expands macros and does some other stuff. As you wrote in your question this is rather platform independent. The preprocessor actions are standardized in the C++ standard.
The resulting code is compiled. That means C++ code is translated into assembly code. Because assembly code directly reflects the CPU instructions, this step is dependent on the target CPU architecture, that the compiler was configured for (usually the host CPU). The compiler is allowed to optimize and translate the program in any way it wants, as long as it follows the as-if rule. Thus this step is also higly dependent on the compiler you are using.
Note: Symbols (especially functions) that are not defined, are left undefined. If you say call the malloc() function, this will not be compiled, but left unevaluated until later. Thus this step is also not much dependent on the operating system.
Assembling takes place. This is very straightforward. The assembly code usually can be converted directly into binary CPU instructions. Local symbols (such as goto labels etc.) are resolved and replaced by their corresponding addresses. Unknown external symbols such as the mentioned malloc() call still are left unevaluated and are stored in the object file's symbol table. Because most of the syscalls are wrapped in library functions, the assembly code will usually not directly contain syscall code. Thus this step is depended on the CPU architecture. It is however dependent on the ABI2, which in term is dependent on the compiler and the OS.
Linking takes place. The different compilation units are combined into a single executable binary in an OS-dependent format (e.g. GNU/Linux uses ELF). Here yet more symbols are resolved. For example if one compilation calls a function in another compilation unit, this call is resolved and the symbol is replaced by the function address. If you link to a library statically, this is just treated like another compilation unit and included into the executable with its symbols resolved.
Shared libraries are checked for the needed symbols, but not linked yet. For example in case of the malloc() call, the linker checks, that there is a malloc symbol in the glibc, but the symbol in the executable still remains unresolved.
At this point you have a executable binary. As you might noticed, there might still be unresolved symbols in that binary. Thus you cannot just load that binary into RAM and let the CPU execute it. A final step called dynamic linking is needed. On Linux the program that performs this step is called the dynamic linker/loader. Its task is to load the executable ELF file into memory, look up all the needed dynamic libraries, load them into memory as well (a list is stored in the ELF file) and resolve the remaining symbols. This last step happens each time the program is executed. Now finally the malloc() symbol is resolved with the address in the glibc shared library.
You have pure CPU instructions in memory, the CPU's program counter register (the one that tracks the next instruction) is set to the entry point, and the program can begin to run. Every now and then it is interrupted either because it makes a syscall, or because it is interrupted by the kernel scheduler to let another program run on that CPU core.
I hope I could answer some of your questions and satisfy your curiosity. I think the most important part you were missing, was how dynamic linking happens. This is a very interesting topic which is related to concepts like position independent code. I wish you could luck learning.
1 this is also one reason why some people insist on calling Linux based systems GNU/Linux. The glibc library (together with many other GNU programs) defines much of the operating system structure, interacts with supplementary programs and configuration files etc. There are however Linux based systems without glibc. One of them is Android, using Googles bionic libc.
2 The ABI is related to the calling convention. This is a mixture of operating system, programming language and compiler specification. It is one of the reasons (besides name mangling, see the comment of PeterCordes below) you need those extern "C" {...} scopes in C++ header files, that declare C functions in shared libraries. It basically is a convention on how to pass parameters and return values between functions.
Neither operating system nor kernel are directly involved in any of this.
Their limited involvement is in that if you want to build Linux 64 bit binaries for x86 using gnu tools then you need to in some way (download and install or build yourself) build the gnu tools themselves for that target processor and that operating system. As system calls are specific to the operating system and target, and also the binaries supported by that operating system. Not strictly just the elf file format, that is just a container, but the linking and possibly bootstrap is also specific to the operating systems loader. (or if building something for the kernel that would have other rules). For example, does the application loader initialize .bss and .data for you from specific information in the .elf file, or like on an mcu does the bootstrap code itself have to do this?
The builder for gnu tools for a target like linux and ideally a pre-built binary for your os and target, would have paths setup in some way. The c library would have a default linker script and its intimate partner the bootstrap.
After that point, it is just a dumb toolchain. Include files be they at the system level, compiler level, or programmer level are just includes in the C language. The default paths and gcc knows where it was executed from so it knows where in a normal build the gcc and other libraries live.
gcc itself is not a compiler actually it calls other programs like the preprocessor, the compiler itself, the assembler and linker.
The preprocessor is going to do the search and replace for includes and defines and end up with one great big cpp file, then pass that to the compiler.
The compiler front end (C++ language for gcc for example) turns that into an internal language, allocate an int with this name, and another add the two and blah. A pseudo code if you will. This gets a lot of the optimization work done on it then eventually the back end (which for gnu could be x86, mips, arm, etc independent to some extent of the front and middle). The LLVM tools, are at least capable of exposing that middle, internal, language to external files (external to the memory used by the compiler to do the compilation) and you can combine and optimize those bytecode files and then convert them to assembly or direct to object in the llvm world. I think this is an exception not a rule, others just use internal tables.
While I think it is wise and sane to use an assembly language step. Not all compilers do and do not assume that all compilers do. Some output objects.
Yes that assembly is naturally partial, external functions (labels) and variables (labels) cannot be resolved at the object level. The linker has to do that.
So the target (x86, arm, etc) does affect the construction of the elf file as
there are certain items, magic numbers specific to the target. As mentioned the operating system and or kernel do affect the elf in that there are rules for construction of the binary for that kernel or operating system. Remember that elf is just a container like tar or zip or mkv etc. Do not assume that the operating system can handle every possible choice you want to make with the contents that the linker will allow (the tools are dumb, do what they are told).
So your source.
All the relevant sources that go with it including system includes, compiler includes and your includes.
gcc/g++ is a wrapper program that manages the steps.
calls the pre-processor expands includes and defines into one file (no magic here)
call the compiler to parse that one file into internal tables, think pseudo code and data
many, many possible optimizers that operate on these structures
backend, including peephole optimizer, turns the tables into assembly language (for gnu at least)
assembler is called to turn the asm into an object
If all the objects are specified and gcc is told to link, then...
Linker combines all the objects for the binary, including the bootstrap, including already built libraries, stubs, etc, and command line or more likely a linker script (linker script and bootstrap have an intimate relationship they are not assumed to be separable and not part of the compiler they are part of a C library, etc).
Kernel module loader or operating system application loader fed the file and per the rules of that loader loads and runs the program.

How to use -fsplit-stack on a large project

I recently posted a question about stack segmentation and boost coroutines but it seems like the -fsplit-stack approach only works with source files that are compiled with that flag, the runtime breaks down when you branch to another function that has not been compiled with -fsplit-stack. For example
This implies that the runtime uses a function local technique to detect when the current stack has been surpassed. And not a "guard page signal" trick, where the end of the stack always has a guard page which will raise a signal on write or read, telling the runtime to allocate a new stack frame and branch to that.
Then what is the use of this flag? If I link to any other library that has not been built with this, code will break (even libstdc++ and libc), then how is this something people use practically with big projects?
From reading the gcc wiki about split stacks it seems like calling a non split stack function from a split stack function results in an allocation of a 64KB stack frame. Good.
But it seems like calling a non split stack function from a function pointer has not yet been implemented to follow the above scheme.
What use is this flag then? If I proceed to call any virtual function will my program break?
Further from the answer below it seems like clang has not implemented split stacks?
You have to compile boost (at least boost.context and boost.coroutine) with segmeented-stacks support AND your application.
compile boost (boost.context and boost.coroutine) with b2 property segmented-stacks=on (enables special code inside boost.coroutine and boost.context).
your app has to be compiled with -DBOOST_USE_SEGMENTED_STACKS and -fsplit-stack (required by boost.coroutines headers).
see boost.coroutine documentation
boost.coroutine contains an example that demonstrates segmented stacks (in directory coroutine/example/asymmetric/ call b2 toolset=gcc segmented-stacks=on).
regarding your last question GCC Wiki states:
For calls from split-stack code to non-split-stack code, the linker
will change the initial instructions in the split-stack (caller)
function. This means that the linker will have to have special
knowledge of the instructions that the compiler emits. The effect of
the changes will be to increase the required framesize by a number
large enough to reasonably work for a non-split-stack. This will be a
target dependent number; the default will be something like 64K. Note
that this large stack will be released when the split-stack function
returns. Note that I'm disregarding the case of split-stack code in a
shared library calling non-split-stack code in the main executable;
that seems like an unlikely problem.
please note: while llvm supports segmented stacks, clang seams not to provide the __splitstack_<xyz> functions.
First I'd say split stack support is somewhat experimental in nature to begin with. It is not a widely supported thing nor has a single implementation become accepted as the way to go. As such, part of the purpose of it existing in the compiler is to enable research in real use.
That said, one generally wants to use such a feature to enable lots of threads with small stacks, but which can get bigger if they need to. In some applications, the code that runs in these threads can be tightly controlled. E.g. fairly specialized request handlers that do not call general purpose libraries such as Boost. High performance systems work often involves tightening down the constraints on what code is used in a given path and this would be an example thereof. It certainly limits the applicability of the feature, but I wouldn't be surprised if someone is using it in production this way.
Note that similar issues exist with flags such as -fno-exceptions and -fno-rtti . Generally C++ requires compiling everything that goes into an executable with a compatible set of flags. Sometimes one can mix and match, but it is often fragile. This is part of the motivation of building everything from source and hermetic build tools like bazel. Other languages have different approaches to non-source components, especially virtual machine based languages such as Java and the .NET family. In those worlds things like split stacks are decided at a lower-level of compilation, but typically one would not have any control over or awareness of them at the source code level.

Getting address of caller in c++

At the moment I'm working on a anticheat. I added a way to detect any hooking to the directx functions, since those are what most cheats do.
The problem comes in when a lot of programs, such as OBS, Fraps and many other programs that hook directx get their hook detected too.
So to be able to hook directx, you will most probabbly have to call VirtualProtect. If I could determine what address this is being called from, then I could loop through all dll's in memory, and then find what module it has been called from, and then sending the information to the server, maybe perhaps even taking a md5 hash and sending it to the server for validation.
I could also hook the DirectX functions that the cheats hook and check where those get called from (since most of them use ms detours).
I looked it up, and apparently you can check the call stack, but every example I found did not seem to help me.
This -getting the caller's address- is not possible in standard C++. And many C++ compilers might optimize some calls (e.g. by inlining them, even when you don't specify inline, or because there is no more any framepointer, e.g. compiler option -fomit-frame-pointerfor x86 32 bits with GCC, or by optimizing a tail-call ....) to the point that the question might not make any sense.
With some implementations and some C or C++ standard libraries and some (but not all) compiler options (in particular, don't ask the compiler to optimize too much*) you might get it, e.g. (on Linux) use backtrace from GNU glibc or I.Taylor's libbacktrace (from inside GCC implementation) or GCC return address builtins.
I don't know how difficult would it be to port these to Windows (Perhaps Cygwin did it). The GCC builtins might somehow work, if you don't optimize too much.
Read also about continuations. See also this answer to a related question.
Note *: on Linux, better compile all the code (including external libraries!) with at most g++ -Wall -g -O1 : you don't want too much optimization, and you want the debug information (in particular for libbacktrace)
Ray Chen's blog 'The old new thing' covers using return address' to make security decisions and why its a pretty pointless thing
https://devblogs.microsoft.com/oldnewthing/20060203-00/?p=32403
https://devblogs.microsoft.com/oldnewthing/20040101-00/?p=41223
Basically its pretty easy to fake (by injecting code or using a manually constructed fake stack to trick you). Its Windows centric but the basic concepts are generally applicable.

LLVM exception handling implementation

Just to note, I've read the questions and read the blog posts and I've also referenced the ABI.
What I completely don't understand is how that interacts with LLVM's EH intrinsics. The LLVM EH page gives a very vague overview- not exactly a checklist of "Implement X, Y, Z".
The LLVM EH page references the Itanium ABI directly. This would imply to me that LLVM only supports Itanium ABI exceptions. But I already know that Clang supports ARM and is developing support for Microsoft ABIs. So exactly how specific is LLVM's implementation of EH to the Itanium ABI?
When referencing the _Unwind stuff defined by Itanium ABI, is that obliged to be provided by a backend, or would I have to implement it for myself?
I also noticed that the LLVM IR generated by Clang does not reveal any language-specific tables, any exception frames, exception tables, or anything like that. In that case, how does LLVM know how to generate the language-specific data?
In short, how exactly do you go from LSDAs, EH contexts, and _Unwind_RaiseException to landingpad and resume?
Edit: Just for reference, I'm going to be JITting the resulting code on Windows.
Nowadays Itanium C++ ABI is de facto standard C++ ABI used on many other
platforms. Itanium C++ ABI supports zero cost exception handling technique,
which is the most widespread technique for today.
To support exception handling one must change the semantics of a function
call. Now the calls fork the execution flow. One branch is taken when
everything is fine and the second branch is taken in case of an exception.
In LLVM IR there is the invoke instruction to call functions that may throw.
When the second branch is taken several kinds of actions may be performed:
call destructors (cleanup)
continue stack unwinding (resume)
enforce throw specifications (filter)
restore normal control flow (catch).
It's clear that some additional code must be generated to perform these actions.
That's why we've got the landingpad instruction as well. It is the first
instruction to be executed after invoke ends up with an exception.
But the main magic is performed at run-time. After an exception is thrown, the
language agnostic runtime unwinds the stack, for every frame it finds language
specific data area (LSDA) and calls language specific personality routine. The
personality routine inspects program counter, LSDA and the current exception. It
determines if any cleanup is necessary, if any throw specification is violated
or if the exception can be caught by this frame.
As you probably know, all these data (personality routine, catched types, throw specification, cleanup actions) are already specified in landingpad
instruction, so no additional data should be passed to the backend to generate
exception-related sections in object file.
The short answer is, LLVM effectively hardcodes support for every EH ABI it wants to support- ARM, Itanium, SEH, etc. So whilst the landingpad stuff might in theory be somewhat abstract, it's really not abstract at all and very tightly coupled, because the other half of stuff you need to do must be accomplished by the Itanium ABI EH support library that you need to do explicitly.
LLVM generates virtually all the EH implementation details, but you must also link to an Itanium EH support library at runtime. Other than that, the IR is really just what it shows- no additional effort required on behalf of the programmer.
I imagine things get much more sticky if you want to use non-Itanium.

Getting pointer to bottom of the call stack and resolving symbol by address (like dladdr) under Windows?

I want to implement an analog of backtrace utility under windows in order to add this information to exception for example.
I need to capture return addresses and then translate it into symbols names.
I'm aware of StackWalk64 and of StackWalker project but unfortunately it has several important drawbacks:
It is known to be very slow (the StackWalk64) and I don't want to waste much time for collecting the trace the basically can be done as fast as walking on linked list.
The function StackWalk64 is known to be not thread safe.
I want to support only x86 and possible x86_64 architectures
Basic idea I have is following:
Run on stack using esp/ebp registers similarly to what GCC's __builtin_return_address(x)/__builtin_frame_address(x) doe till I reach the bottom of the stack (this is what glibc does).
Translate addresses to symbols
Demangle them.
Problems/Questions:
How do I know that I reach the to of the stack? For example glibc implementation has __libc_stack_end so it is easy to find where to stop. Is there any analog of such thing under Windows? How can I get stack bottom address?
What are the analogs of dladdr functionality. Now I know that unlike ELF platform that keeps most of symbol names, PE format does not. So it should read somehow the debug information. Any ideas?
Capturing Stack Trace: RtlCaptureStackBackTrace
Getting Symbols: Using DBG Help library (MSVC only). Key functions:
// Initialization
hProcess = GetCurrentProcess()
SymSetOptions(SYMOPT_DEFERRED_LOADS)
SymInitialize(hProcess, NULL, TRUE)
// Fetching symbol
SymFromAddr(...)
Implementation can be found there
You use StackWalk but resolve symbols later.