Capture the call stack quickly

Capture the call stack quickly - c++

I'm trying to capture the call stack as quickly as possible. Right now this is what I've got:
void* addrs[10] = {};
DWORD hash;
RtlCaptureStackBackTrace(0, 10, addrs, &hash);
for( size_t i = 0; i <10; ++i )
{
std::cout << addrs[i] << '\n';
}
This is for use in a memory tracking system, so it's perfectly fine that I end up with an array of addresses if they can later (upon some user driven event) be turned into something human readable.
How can I turn addrs into something human readable? (see edit below)
Is there anything faster than RtlCaptureStackBackTrace?
Is there a cross platform way to capture the call stack?
Edit:
I turned the addresses into human readable information by using SymFromAddr and SymGetLineFromAddr64. However, my version of new that uses CaptureStackBackTrace takes ~30x longer than the original and almost all that time is because of the stack trace. I'm still looking for a faster solution!

Is there a cross platform way to capture the call stack?
The quick answer is that unfortunately it's not possible. Different compilers and platforms will organize the stack differently, and without "help" from the compiler, you're not going to be able to accurately trace back through the stack ... the most you could do on platforms that are using a stack-frame base-pointer would be to move up the stack using the values in either EBP or RBP, and even that is dependent on if you're running a 32-bit x86 executable, or a x86_64 executable. Additionally the rules governing the UNIX application-binary-interface (ABI) and the Windows ABI are completely different. So in general, you're going to end up having to support at least four different possibilities:
Unix 32-bit ABI
Unix 64-bit ABI
Windows 32-bit ABI
Windows 64-bit ABI
Keep in mind that the UNIX 64-bit ABI allows for compilers to choose whether they will or will not support a stack-frame base-pointer ... so the use of a base-pointer is not guaranteed; a compiler may simply choose to reference all stack-variables against the stack-pointer itself without using a separate base-pointer. That would still be considered compliant with the UNIX 64-bit ABI.
So as you can see, with all these possible variations, you're going to have a very tough time making a reliable cross-platform stack-tracer. Your best-bet it to use the tools available to you for the platform you're targeting. If you need to support multiple platforms, then unfortunately there's going to be some level of code duplication as you utlize platform-specific tools to help you reliably perform the stack-trace.

There isn't a cross-platform way, but there are cross-platform libraries that do the trick. It's a common requirement for garbage collectors.
Unfortunately, I don't know enough about these libraries to recommend one. MMgc has something embedded (explained in the link, above). I'm sure Hans Boehm's garbage collector has another.
These libraries, in general, will rely on platform-specific calls, such as RtlCaptureStackBacktrace on Windows. I would be amazed if any such library went faster than the platform-specific calls; but I would also be amazed if any such library went significantly slower. The only question -- to me -- would be whether you prefer to call the Windows API call directly or to route that call through another library.

Related

Getting address of caller in c++

At the moment I'm working on a anticheat. I added a way to detect any hooking to the directx functions, since those are what most cheats do.
The problem comes in when a lot of programs, such as OBS, Fraps and many other programs that hook directx get their hook detected too.
So to be able to hook directx, you will most probabbly have to call VirtualProtect. If I could determine what address this is being called from, then I could loop through all dll's in memory, and then find what module it has been called from, and then sending the information to the server, maybe perhaps even taking a md5 hash and sending it to the server for validation.
I could also hook the DirectX functions that the cheats hook and check where those get called from (since most of them use ms detours).
I looked it up, and apparently you can check the call stack, but every example I found did not seem to help me.

This -getting the caller's address- is not possible in standard C++. And many C++ compilers might optimize some calls (e.g. by inlining them, even when you don't specify inline, or because there is no more any framepointer, e.g. compiler option -fomit-frame-pointerfor x86 32 bits with GCC, or by optimizing a tail-call ....) to the point that the question might not make any sense.
With some implementations and some C or C++ standard libraries and some (but not all) compiler options (in particular, don't ask the compiler to optimize too much*) you might get it, e.g. (on Linux) use backtrace from GNU glibc or I.Taylor's libbacktrace (from inside GCC implementation) or GCC return address builtins.
I don't know how difficult would it be to port these to Windows (Perhaps Cygwin did it). The GCC builtins might somehow work, if you don't optimize too much.
Read also about continuations. See also this answer to a related question.
Note *: on Linux, better compile all the code (including external libraries!) with at most g++ -Wall -g -O1 : you don't want too much optimization, and you want the debug information (in particular for libbacktrace)

Ray Chen's blog 'The old new thing' covers using return address' to make security decisions and why its a pretty pointless thing
https://devblogs.microsoft.com/oldnewthing/20060203-00/?p=32403
https://devblogs.microsoft.com/oldnewthing/20040101-00/?p=41223
Basically its pretty easy to fake (by injecting code or using a manually constructed fake stack to trick you). Its Windows centric but the basic concepts are generally applicable.

Will statically linked c++ binary work on every system with same architecture?

I'm making a very simple program with c++ for linux usage, and I'd like to know if it is possible to make just one big binary containing all the dependencies that would work on any linux system.
If my understanding is correct, any compiler turns source code into machine instructions, but since there are often common parts of code that can be reused with different programs, most programs depend on another libraries.
However if I have the source code for all my dependencies, I should be able to compile a binary in a way that would not require anything from the system? Will I be able to run stuff compiled on 64bit system on a 32bit system?

In short: Maybe.
The longer answer is:
It depends. You can't, for example, run a 64-bit binary on a 32-bit system, that's just not even nearly possible. Yes, it's the same processor family, but there are twice as many registers in the 64-bit system, which also has twice as long registers. What's the 32-bit processor going to "give back" for the value of those bits and registers that doesn't exist in the hardware in the processor? It just plain won't work. Some of the instructions also completely change meaning, so the system really needs to be "right" for the compiled code, or it won't work - fortunately, Linux will check this and plain refuse if it's not right.
You can BUILD a 32-bit binary on a 64-bit system (assuming you have all the right libraries, etc, installed for both 64- and 32-bit, etc).
Similarly, if you try to run ARM code on an x86 processor, or MIPS code on an ARM processor, it simply has no chance of working, because the actual instructions are completely different (or they would be in breach of some patent/copyright or similar, because processor instruction sets contain portions that are "protected intellectual property" in nearly all cases - so designers have to make sure they do NOT do "the same as someone else's design"). Like for 32-bit and 64-bit, you simply won't get a chance to run the wrong binary here, it just won't work.
Sometimes, there are subtle differences, for example ARM code can be compiled with "hard" or "soft" floating point. If the code is compiled for hard float, and there isn't the right support in the OS, then it won't run the binary. Worse yet, if you compile on x86 for SSE instructions, and try to run on a non-SSE processor, the code will simply crash [unless you specifically build code to "check for SSE, and display error if not present"].
So, if you have a binary that passes the above criteria, the Linux system tends to change a tiny bit between releases, and different distributions have subtle "fixes" that change things. Most of the time, these are completely benign (they fix some obscure corner-case that someone found during testing, but the general, non-corner case behaviour is "normal"). However, if you go from Linux version 2.2 to Linux version 3.15, there will be some substantial differences between the two versions, and the binary from the old one may very well be incompatible with the newer (and almost certainly the other way around) - it's hard to know exactly which versions are and aren't compatible. Within releases that are close, then it should work OK as long as you are not specifically relying on some feature that is present in only one (after all, new things ARE added to the Linux kernel from time to time). Here the answer is "maybe".
Note that in the above is also your implementation of the C and C++ runtime, so if you have a "new" C or C++ runtime library that uses Linux kernel feature X, and try to run it on an older kernel, before feature X was implemented (or working correctly for the case the C or C++ runtime is trying to use it).
Static linking is indeed a good way to REDUCE the dependency of different releases. And a good way to make your binary huge, which may be preventing people from downloading it.
Making the code open source is a much better way to solve this problem, then you just distribute your source code and a list of "minimum requirements", and let other people deal with it needing to be recompiled.

In practice, it depends on "sufficiently simple". If you're using C++11, you'll quickly find that the C++11 libraries have dependencies on modern libc releases. In turn, those only ship with modern Linux distributions. I'm not aware of any "Long Term Support" Linux distribution which today (June 2014) ships with libc support for GCC 4.8

The short answer is no, at least without serious hack.
Different linux distribution may have different glue code between user-space and kernel. For instant, an hello world seemingly without dependency built from ubuntu cannot be executed under CentOS.
EDIT: Thanks for the comment. I re-verify this and the cause is im using 32-bit VM. Sorry for causing confusion. However, as noted above, the rule of thumb is that even same linux distribution may sometime breaks compatibility in order to deploy bugfix, so the conclusion stands.

Is it possible to implement a small Disk OS in C or C++?

I am not trying to do any such thing, but I was wondering out of curiosity whether one could implement an "entire OS" (not necessarily something big like Linux or Microsoft Windows, but more like a small DOS-like operating system) in C and/or C++ using no or little assembly.
By implementing an OS , I mean making an OS from scratch starting the boot-loader and the kernel to the graphics drivers (and optionally GUI) in C or C++. I have seen a few low-level things done in C++ by accessing low-level features through the compiler. Can this be done for an entire OS?
I am not asking whether it is a good idea, I am just asking whether it is even remotely possible?

Obligatory link to the OSDev wiki, which describes most of the steps needed to create an OS as described on x86/x64.
To answer your question, it is gonna be extremely difficult/unpleasant to create the boot loader and start protected mode without resorting to at least some assembly, though it can be kept to a minimum (especially if you're not really counting stuff like using __asm__ ( "lidt %0\n" : : "m" (*idt) ); as 'assembly').
A big hurdle (again on x86) is that the processor starts in 16-bit real mode, so you need some 16-bit code. According to this discussion you can have GCC generate 16-bit code, but you would still need some way to setup memory, load code from some storage media and so on, all of which requires interfacing with the hardware in ways that standard C just has no concept of (interrupts, IO ports etc.).
For architectures which communicate with hardware solely through memory mapped IO you could probably get away with writing everything except the C start-up code (that sets up the stack, initializes variables and so on) in pure C, though specific requirements of interrupt routines / exception or syscall gates etc. may be difficult to impossible to implement (as you have to access special CPU registers).

I assume that you have an OS for x86 in mind. In that case you need at least a few pages of assembler to set up protected mode and stuff like that, and besides that a lot of knowledge of all the stuff like paging, call gates, rings, exceptions, etc. If you are going to use a form of system calls you'll also need some lines of assembly code to switch between kernel and userspace mode.
Besides those things the rest of an OS can easily be programmed in C. For C++ you'll need a runtime environment to support things like virtual members and exceptions, but as far as I know that can all be programmed in C.
Just take a look at Linux kernel source, the most important assembler code (for x86) can be found in arch/x86/boot, but you'll notice that even in that directory most files are written in C. Furthermore you'll find a few assembly lines in the arch/x86/kernel directory for handling system calls and stuff like that.
Outside the arch directory there is hardly any assembler used (because assembler is machine specific, that machine specific code belongs in the arch directory). Even graphic drivers don't use assembler (e.g. nvidia driver in drivers/gpu/drm/nouveau).

A boot loader? You might want to skip that bit. For instance, Linux is quite often started by non-Linux boot loaders such as UBoot. After all, once the system is running, the OS will be present but not the boot loader, that's just there to get the OS proper into memory.
And once you've selected a decent existing bootloader, the remainder is pretty much all straightforward. Well, you have to deal with memory and files yourself; you can't rely on fopen obviously. But even a C++ compiler has little problem generating code that can run without OS support.

Compiler optimization of functions parameters

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers. It would make sense that this optimization will kick in if there are only 1-2 parameters, and not when there are 256 (not that one would want to have the max number of parameters).
How can one find out the parameter limit (number of parameters) for a certain compiler (such as gcc) where one can be sure that this optimization will be used?

Function parameters are placed on the stack, but compilers can optimize this task by the use of optional registers.
As FrankH says in his comments and as I'm going to say in my answer, the application binary interface for the system in question determines how arguments are passed to functions - this is called the calling convention for that platform.
To complicate matters, x86 32-bit actually has several. This is historical and comes from the fact that when Win32 bit arrived, everyone went crazy doing different things.
So, yes, you can "optimise" by writing function calls in such a way, but no, you shouldn't. You should follow the standards for your platform. Because the honest truth is, the speed of stack access probably isn't slowing your code down to that great an extent that you need to be binary-incompatible from everyone else on your system.
Why the need for ABIs/standard calling conventions? Well, in terms of using the processor registers, stack etc, applications must agree on what means what and where it shoudl go. If one function decided all its arguments were in registers and another that some were on the stack, how would they be interoperable? Moreover, you might come across the term scratch registers to mean those registers you don't have to restore. What happens if you call a function expecting it to leave some registers alone?
Anyway, as for what you asked for, here's some ABI documentation:
The difference between x86 and x64 on windows.
x86_64 ABI used for Unix-like platforms.
Wikipedia's x86 calling conventions.
A document on compiler calling conventions.
The last one is my favourite. To quote it:
In the days of the old DOS operating system, it was often possible to combine development
tools from different vendors with few compatibility problems. With 32-bit Windows, the
situation has gone completely out of hand. Different compilers use different data
representations, different function calling conventions, and different object file formats.
While static link libraries have traditionally been considered compiler-specific, the
widespread use of dynamic link libraries (DLL's) has made the distribution of function
libraries in binary form more common.
So whatever you're trying to do with optimising via modifying the function calling method, don't. Find another way to optimise. Profile your code. Study the compiler optimisations you've got for your compiler (-OX) if you think it helps and dump the assembly to check, if the speed is really that crucial

For publically visible functions, this is documented in the ABI standard. For functions that are not referencible from the outside, all bets are off anyway.

You would have to read the fine manual for the compiler. If you were lucky, you would find it there in a description of function calling conventions. Otherwise, for an OSS compiler such as gcc you would probably have to read its source-code.

Compile and optimize for different target architectures

Summary: I want to take advantage of compiler optimizations and processor instruction sets, but still have a portable application (running on different processors). Normally I could indeed compile 5 times and let the user choose the right one to run.
My question is: how can I can automate this, so that the processor is detected at runtime and the right executable is executed without the user having to chose it?
I have an application with a lot of low level math calculations. These calculations will typically run for a long time.
I would like to take advantage of as much optimization as possible, preferably also of (not always supported) instruction sets. On the other hand I would like my application to be portable and easy to use (so I would not like to compile 5 different versions and let the user choose).
Is there a possibility to compile 5 different versions of my code and run dynamically the most optimized version that's possible at execution time? With 5 different versions I mean with different instruction sets and different optimizations for processors.
I don't care about the size of the application.
At this moment I'm using gcc on Linux (my code is in C++), but I'm also interested in this for the Intel compiler and for the MinGW compiler for compilation to Windows.
The executable doesn't have to be able to run on different OS'es, but ideally there would be something possible with automatically selecting 32 bit and 64 bit as well.
Edit: Please give clear pointers how to do it, preferably with small code examples or links to explanations. From my point of view I need a super generic solution, which is applicable on any random C++ project I have later.
Edit I assigned the bounty to ShuggyCoUk, he had a great number of pointers to look out for. I would have liked to split it between multiple answers but that is not possible. I'm not having this implemented yet, so the question is still 'open'! Please, still add and/or improve answers, even though there is no bounty to be given anymore.
Thanks everybody!

Yes it's possible. Compile all your differently optimised versions as different dynamic libraries with a common entry point, and provide an executable stub that that loads and runs
the correct library at run-time, via the entry point, depending on config file or other information.

Can you use script?
You could detect the CPU using script, and dynamically load the executable that is most optimized for architecture. It can choose 32/64 bit versions too.
If you are using a Linux you can query the cpu with
cat /proc/cpuinfo
You could probably do this with a bash/perl/python script or windows scripting host on windows. You probably don't want to force the user to install a script engine. One that works on the OS out of the box IMHO would be best.
In fact, on windows you probably would want to write a small C# app so you can more easily query the architecture. The C# app could just spawn whatever executable is fastest.
Alternatively you could put your different versions of code in a dll's or shared object's, then dynamically load them based on the detected architecture. As long as they have the same call signature it should work.

If you wish this to cleanly work on Windows and take full advantage in 64bit capable platforms of the additional 1. Addressing space and 2. registers (likely of more use to you) you must have at a minimum a separate process for the 64bit ones.
You can achieve this by having a separate executable with the relevant PE64 header. Simply using CreateProcess will launch this as the relevant bitness (unless the executable launched is in some redirected location there is no need to worry about WoW64 folder redirection
Given this limitation on windows it is likely that simply 'chaining along' to the relevant executable will be the simplest option for all different options, as well as making testing an individual one simpler.
It also means you 'main' executable is free to be totally separate depending on the target operating system (as detecting the cpu/OS capabilities is, by it's nature, very OS specific) and then do most of the rest of your code as shared objects/dlls.
Also you can 'share' the same files for two different architectures if you currently do not feel that there is any point using the differing capabilities.
I would suggest that the main executable is capable of being forced into making a specific choice so you can see what happens with 'lesser' versions on a more capable machine (or what errors come up if you try something different).
Other possibilities given this model are:
Statically linking to different versions of the standard runtimes (for ones with/without thread safety) and using them appropriately if you are running without any SMP/SMT capabilities.
Detect if multiple cores are present and whether they are real or hyper threading (also whether the OS knows how the schedule effectively in those cases)
checking the performance of things like the system timer/high performance timers and using code optimized to this behaviour, say if you do anything where you look for a certain amount of time to expire and thus can know your best possible granularity.
If you wish to optimize you choice of code based on cache sizing/other load on the box. If you are using unrolled loops then more aggressive unrolling options may depend on having a certain amount level 1/2 cache.
Compiling conditionally to use doubles/floats depending on the architecture. Less important on intel hardware but if you are targetting certain ARM cpu's some have actual floating point hardware support and others require emulation. The optimal code would change heavily, even to the extent you just use conditional compilation rather than using the optimizing compiler(1).
Making use of co-processor hardware like CUDA capable graphics cards.
detect virtualization and alter behaviour (perhaps trying to avoid file system writes)
As to doing this check you have a few options, the most useful one on Intel being the the cpuid instruction.
Windows
Use someone else's implementation but you'll have to pay
Use a free open source one
Linux
Use the built in one
You could also look at open source software doing the same thing
Pixman does a fair amount of this and is a permissive licence.
Alternatively re-implement/update an existing one using available documentation on the features you need.
Quite a lot of separate documents to work out how to detect things:
Intel:
SSE 4.1/4.2
SSE3
MMX
A large part of what you would be paying for in the CPU-Z library is someone doing all this (and the nasty little issues involved) for you.
be careful with this - it is hard to beat decent optimizing compilers on this

Have a look at liboil: http://liboil.freedesktop.org/wiki/ . It can dynamically select implementations of multimedia-related computations at run-time. You may find you can liboil itself and not just its techniques.

Since you mention you are using GCC, I'll assume your code is in C (or C++).
Neil Butterworth already suggested making separate dynamic libraries, but that requires some non-trivial cross-platform considerations (manually loading dynamic libraries is different on Linux, Windows, OSX, etc., and getting it right will likely take some time).
A cheap solution is to simply write all of your variants using unique names, and use a function pointer to select the proper one at runtime.
I suspect the extra dereference caused by the function pointer will be amortized by the actual work you are doing (but you'll want to confirm that).
Also, getting different compiler optimizations will likely require different .c/.cpp files, as well as some twiddling of your build tool. But it's probably less overall work than separate libraries (which needed this already in one form or another).

Since you didn't specify whether you have limits on the number of files, I propose another solution: compile 5 executables, and then create a sixth executable that launches the appropriate binary. Here is some pseudocode, for Linux
int main(int argc, char* argv[])
{
char* target_path[MAXPATH];
char* new_argv[];
char* specific_version = determine_name_of_specific_version();
strcpy(target_path, "/usr/lib/myapp/versions");
strcat(target_path, specific_version);
/* append NULL to argv */
new_argv = malloc(sizeof(char*)*(argc+1));
memcpy(new_argv, argv, argc*sizeof(char*));
new_argv[argc] = 0;
/* optionally set new_argv[0] to target_path */
execv(target_path, new_argv);
}
On the plus side, this approach allows to provide the user transparently with both 32-bit and 64-bit binaries, unlike any library methods that have been proposed. On the minus side, there is no execv in Win32 (but a good emulation in cygwin); on Windows, you have to create a new process, rather than re-execing the current one.

Lets break the problem down to its two constituent parts. 1) Creating platform dependent optimized code and 2) building on multiple platforms.
The first problem is pretty straightforward. Encapsulate the platform dependent code in a set of functions. Create a different implementation of each function for each platform. Put each implementation in its own file or set of files. It's easiest for the build system if you put each platform's code in a separate directory.
For part two I suggest you look at Gnu Atuotools (Automake, AutoConf, and Libtool). If you've ever downloaded and built a GNU program from source code you know you have to run ./configure before running make. The purpose of the configure script is to 1) verify that your system has all of the required libraries and utilities need to build and run the program and 2) customize the Makefiles for the target platform. Autotools is the set of utilities for generating the configure script.
Using autoconf, you can create little macros to check that the machine supports all of the CPU instructions your platform dependent code needs. In most cases, the macros already exists, you just have to copy them into your autoconf script. Then, automake and autoconf can set up the Makefiles to pull in the appropriate implementation.
All this is a bit much for creating an example here. It takes a little time to learn. But the documentation is all out there. There is even a free book available online. And the process is applicable to your future projects. For multi-platform support, this is really the most robust and easiest way to go, I think. A lot of the suggestions posted in other answers are things that Autotools deals with (CPU detection, static & shared library support) without you have to think about it too much. The only wrinkle you might have to deal with is finding out if Autotools are available for MinGW. I know they are part of Cygwin if you can go that route instead.

You mentioned the Intel compiler. That is funny, because it can do something like this by default. However, there is a catch. The Intel compiler didn't insert checks for the approopriate SSE functionality. Instead, they checked if you had a particular Intel chip. There would still be a slow default case. As a result, AMD CPUs would not get suitable SSE-optimized versions. There are hacks floating around that will replace the Intel check with a proper SSE check.
The 32/64 bits difference will require two executables. Both the ELF and PE format store this information in the exectuables header. It's not too hard to start the 32 bits version by default, check if you are on a 64 bit system, and then restart the 64 bit version. But it may be easier to create an appropriate symlink at installation time.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js