How does a debug build make reverse engineering easy? - c++

Some answer here stated that debug info would make it easier to reverse engineer the software. When I use Visual C++ and distribute an executable with debugging information but without other files (.pdb), will it contain any interesting things?
I looked at the executable with a hex editor and found nothing like symbol names; for now I assume the .exe file just links to information in the .pdb files, right?
Do you know whether it contains
variable names?
function/member names?
line numbers?
anything interesting?

Debug builds tend to generate output that can easily be correlated with high-level language constructs. You can identify variables, tests, loops, etc., just by looking at the machine code. You won't get names of variables, but that's usually among the least important considerations when reverse-engineering.
Optimised code, OTOH, rearranges instructions, unfolds loops, reuses slots for multiple variables, shares blocks of code between functions, inlines small functions and so on, making it quite a bit more difficult to discern the original intent. It also makes it more difficult to debug, even if you own the code, since the current line marker is often very misleading, and variables tend to disappear or show random crap.
None of this makes reverse-engineering impossible, though. It's just more work to tease out the meaning.
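For illustration, here is a minimal sketch (with made-up names) of the kind of code whose structure survives almost unchanged in a debug build but is typically mangled by the optimizer:
static int square(int x) { return x * x; }

int sumOfSquares(const int* values, int count) {
    int sum = 0;                   // in a debug build, 'sum' usually lives in a real stack slot
    for (int i = 0; i < count; ++i)
        sum += square(values[i]);  // in a debug build this is an actual call you can spot
    return sum;
}
In an unoptimized build the loop, the call and the locals map almost one-to-one onto the machine code; with optimizations enabled, compilers will usually inline square(), keep sum and i purely in registers, and may unroll or vectorize the loop, so the original shape is much harder to recognize.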

A build with debugging information isn't the same thing as a "debug build".
A "debug build" is a build in which the _DEBUG symbol is defined. If it is, the binary contains lots of strings that are useful to a reverse engineer (asserts, etc.).
So you can make a Release build with debugging information in the .pdb, and decompiling the program will be just as hard as it would be without debugging information.
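As a rough sketch (illustrative names only), this is the kind of debug-only code whose string literals end up in a debug build but vanish from a release build:
#include <cstdio>

void processOrder(int quantity) {
#ifdef _DEBUG   // defined by the Debug configuration in Visual C++
    std::printf("processOrder: quantity=%d (%s:%d)\n", quantity, __FILE__, __LINE__);
#endif
    // ... actual work ...
}
The trace message, the function name inside it, and the __FILE__ path all become plain strings in the debug binary; in the Release configuration the whole block is compiled out.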

The executable should not contain variable names or line numbers. It may contain function/member names, for any such names that are exported (more likely for a lib/dll than an exe).
The structure of the code will "more closely" resemble the original source code - it's unlikely that code will have been inlined, had statements re-ordered, had loops unrolled, etc.

Optimizations make code harder to understand (and also make it harder to correlate between the source and the assembly when debugging your own code with symbols and sources).
A debug build does not include variable names, function names, or line numbers; those belong to the PDB. However, every time you use assert() the code will include a string that contains the asserted expression, the file name and the line number.
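For example (hypothetical function, just to show the mechanism), a single assert() is enough to leak the expression text, file name and line number into the binary:
#include <cassert>

void setRatio(double denominator) {
    // Unless NDEBUG is defined, this line embeds the literal string
    // "denominator != 0.0" plus the source file name and line number
    // in the executable, even when no .pdb is shipped.
    assert(denominator != 0.0);
    // ...
}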

A long time ago, debug information was attached to the executable itself (in the so-called CodeView format). These days it mostly comes separately, in PDB files. The exe itself indeed only includes a link to the PDB.
PDBs usually come in two flavors: private and public (aka stripped). Public ones (e.g. those provided by Microsoft) usually have only the names of functions and global variables. Private ones (e.g. the ones produced when you build your app with debug info) can additionally include type information (structures, enums, classes, types of variables), function prototypes, local variable names and types, and line-number info.
If you want to examine your PDBs, check DIA2Dump in the "DIA SDK" folder in your Visual Studio installation.

Related

What are these hex strings that I see on the call stack in Visual Studio?

I see these hex strings, which do not seem to belong to any DLL, in the call stack in Visual Studio:
000000001665b7e0()
0000000000000935()
0000094500000001()
000000001665b9a4()
Normally I would see something like:
libabc.dll!myclass::myfunction() Line 76
What do they imply and how do I make meaning of them ?
Those are indeed functions, but no one has left "breadcrumbs" your debugger can use to translate those addresses into a function name.
In this case, the mapping between 000000001665b7e0 and a function name is either in a symbol file which you do not have, a symbol file your debugger is unaware of, a symbol file your debugger is unable to read, or such a mapping does not exist.
What can you do about it?
Either find the symbol information for this function and point your development system at it, or accept that you do not have this information.
The former is tricky because you have no clue what the function is. You may have to use a shotgun approach and add symbols for all the libraries, but you can reduce the scope of the search if you know which libraries your program uses.
The latter is a viable option, because if you don't have access to the debugging information, odds are pretty good you can't do anything about any bugs made by whoever wrote the code. Maybe you can write them a nasty e-mail. For an established library it's more likely that there is no bug in the library and your program is using the library incorrectly. Check the library documentation and debug your code first. When you have eliminated the possibility of errors at your end, then start digging into the third-party code. With an established library there is often a core of developers who will be able to help confirm and resolve a library bug.
Why this happens:
The computer doesn't care what people call things. The computer only cares where things are in memory, so to make the smallest output file possible, the development system's (AKA "IDE", "compiler", and "tool chain" with varying degrees of accuracy) build tools typically strip out all of the stuff that's unnecessary to run the program. The nice, human-readable names sane programmers give functions, variables, classes, and what have you are among the first things to go.
The development system will usually allow you to preserve this address-to-symbol mapping to make debugging easier. As you've seen, raw hex numbers aren't much use without some way to map them to recognizable terms you can use to look up documentation. Depending on the build system, this information may be left in the executable or library (resulting in a much larger output file) or it may come on the side as an optional symbol information file. These mapping files are often specific to the development system and are not readable by other development systems.

Obfuscation of variable and function names in C++ to prevent basic reverse engineering

In my spare time I am doing some reverse engineering games with some friends of mine, and I would like to know how to make the generated assembly as hard to read as possible. I do not want to "prevent" reverse engineering (after all, it will always be possible); I just want to prevent easy understanding of functions/variables by obfuscating them in the assembly code.
For example, if I have declared a function like that in C++:
void thisFunctionReverseAString(std::string& mystring);
I would like to be sure that it will not be possible to get the names thisFunctionReverseAString and mystring from the assembly. Is there any compilation option to do that in g++ or clang++?
Obfuscation will only help for the source code. The executable, with no debugging information, does not contain variable names or function names.
The process of reverse engineering would involve:
Converting the executable to assembly language code.
Converting the assembly code to a high level language code.
Making sense of the sequentially named functions and variables.
For example, take an executable compiled from FORTRAN (or compiled BASIC) and reverse engineer it into C++ source code.
As others have said, there are functions to remove symbols from the Debugging version of an executable. You could start at the beginning and build an executable without symbols, often called a Release version.
Use strip to remove symbols from your executables on Linux. On Windows, simply remove the .pdb files.
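A small sketch (illustrative names, not a guaranteed recipe) of how linkage affects which names can survive in the binary:
#include <algorithm>
#include <string>

namespace {  // internal linkage: once the binary is stripped / built without
             // debug info, there is no reason for this name to survive
    void thisFunctionReverseAString(std::string& mystring) {
        std::reverse(mystring.begin(), mystring.end());
    }
}

// External linkage: the name may still sit in a strippable symbol table,
// and it must remain if the function is exported from a shared library/DLL.
void reverseInPlace(std::string& text) {
    thisFunctionReverseAString(text);
}
In other words, there is no g++/clang++ switch that renames identifiers for you; you rely on stripping symbols (and on not exporting more than you need) so that the names simply are not there to read.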

How much source information is stored in C++ executables?

Some days ago I accidentally opened a C++ executable of a commercial application in Notepad++ and found out that there's quite a lot of information about the original source code stored in the executable.
Inside the executable I could find file names (app.c, dlgstat.c, ...), function names (GetTickCount, DispatchMessageA, ...) and small pieces of source code, mostly conditions (szChar != TEXT('\0'), iRow < XTGetRows( hwndList )). After that I checked another QT executable and: yes again source file names and method signatures.
Because of that I am wondering how much source code information is really stored in a C/C++ executable (e.g., compiled using QT or MinGW). Is this probably some kind of debug build still containing the original source? Is this information used for some reflection stuff? Is there any reason why publishers don't remove this stuff?
How much source code information is really stored in a C/C++ executable?
In practice, not much. The source code is not required at runtime. The strings you name come from two things:
The function names (e.g. GetTickCount) are the names of functions imported from other modules. The names are required at runtime because the functions are resolved by name, either by the loader through the import table or explicitly via GetProcAddress.
The conditions are likely assertions: the assert macro stringizes its argument so that when it fires you know what condition was not met.
If you build a DLL, it will also contain the names of all of the functions it exports, so they can be resolved at runtime (the same is likely true for other shared-object formats).
Debug symbols may also contain some of the original source code, though it depends on the format used by the debug symbols. These symbols may be contained either in the binary itself or in an auxiliary file (for example, .pdb files used on Windows).
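As a sketch of the first point, this is roughly why a name like GetTickCount has to exist as a plain string in the binary (Windows-specific; the wrapper function name is made up):
#include <windows.h>

typedef DWORD (WINAPI *GetTickCountFn)(void);

DWORD tickCountViaExplicitLookup() {
    // Both "kernel32.dll" and "GetTickCount" must be stored as strings in the
    // executable so the lookup can happen at runtime; the same is true, via the
    // import table, for functions the linker imports for you.
    HMODULE kernel32 = GetModuleHandleA("kernel32.dll");
    if (kernel32 == NULL)
        return 0;
    GetTickCountFn fn = (GetTickCountFn)GetProcAddress(kernel32, "GetTickCount");
    return fn ? fn() : 0;
}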
Windows function names: they probably are there just because they are being accessed dynamically - somewhere in your program there's a GetProcAddress to get their address. Still, no reason to worry, every application uses WinAPIs, so there's not much to discover about your executable from that information.
Conditions: probably from some assert-like macro; they are included so that assert can print which condition triggered the failed assertion. Anyhow, in release mode assertions should be removed automatically.
Source file names and method signatures: probably from some usage of __FILE__ and __func__ macros; probably, again, from assert.
Another source of information about the inner structure of your program is RTTI, which has to provide some representation for every type that typeid could be invoked on. If you don't need its functionality, you can disable it (but I don't know if that is possible in Qt projects).
Mixed into the binary of a C++ app you will find the names of most global symbols (and debugging symbols, if enabled in the compiler), but with extra 'decoration text' that encodes the calling signature of the symbol if it is a function or method. Likewise, the literals of character strings are embedded in clear text. But nowhere will you find anything like the actual source code that the compiler used to create the binary executable. That information is lost during the compilation process, and it is especially hard to reverse engineer if C++ templates are employed in the build.
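As an illustration of the RTTI and name-decoration points above (class name invented for the example), typeid forces the compiler to keep a string describing the type:
#include <cstdio>
#include <typeinfo>

class CustomerAccount {          // polymorphic type: RTTI data is generated for it
public:
    virtual ~CustomerAccount() {}
};

void printTypeName(const CustomerAccount& obj) {
    // The (usually mangled/decorated) type name returned here has to be stored
    // somewhere in the binary; compiling with -fno-rtti (g++/clang++) or /GR-
    // (MSVC) removes it, at the cost of losing typeid and dynamic_cast.
    std::printf("%s\n", typeid(obj).name());
}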

Where do I learn "what I need to know" about C++ compilers?

I'm just starting to explore C++, so forgive the newbiness of this question. I also beg your indulgence on how open ended this question is. I think it could be broken down, but I think that this information belongs in the same place.
(FYI -- I am working predominantly with the QT SDK and mingw32-make right now and I seem to have configured them correctly for my machine.)
I know that there is a lot in the language which is compiler-driven -- I've heard about pre-processor directives, but it seems like someone could write books about the different C++ compilers and their respective parameters. In addition, there are commands which apparently precede make (like qmake, for example -- is this something only in Qt?).
I would like to know if there is any place which gives me an overview of what compilers are out there, and what their different options are. I'd also like to know how each of them views Makefiles (it seems that there is a difference in syntax between them?).
If there is no website covering "Everything you need to know about C++ compilers but were afraid to ask," what would be the best way to go about learning the answers to these questions?
Concerning the "numerous options of the various compilers"
A piece of good news: you needn't worry about the detail of most of these options. You will, in due time, delve into this, only for the very compiler you use, and maybe only for the options that pertain to a particular set of features. But as a novice, generally trust the default options or the ones supplied with the make files.
The broad categories of these features (and I may be missing a few) are:
pre-processor defines (now, you may need a few of these)
code generation (target CPU, FPU usage...)
optimization (hints for the compiler to favor speed over size and such)
inclusion of debug info (extra data left in the object/binary which enables the debugger to know where each line of code starts, what the variable names are, etc.)
directives for the linker
output type (exe, library, memory maps...)
C/C++ language compliance and warnings (compatibility with previous versions of the compiler, compliance with current and past C/C++ standards, warnings about common bug-indicative patterns...)
compile-time verbosity and help
Concerning an inventory of compilers with their options and features
I know of no such list, but I'm sure one probably exists on the web. However, I suggest that, as a novice, you worry little about these "details"; use whatever free compiler you can find (gcc is certainly a great choice) and build experience with the language and the build process. C professionals may argue, with good reason and at length, about the merits of various compilers and their associated runtimes etc., but for generic purposes -- and then some -- the free stuff is all that is needed.
Concerning the build process
The most trivial applications, such as those made of a single unit of compilation (read: a single C/C++ source file), can be built with a simple batch file where the various compiler and linker options are hardcoded and the name of the file is specified on the command line.
For all other cases, it is very important to codify the build process so that it can be done
a) automatically and
b) reliably, i.e. with repeatability.
The "recipe" associated with this build process is often encapsulated in a make file or, as the complexity grows, possibly several make files, possibly bundled together in a script/bat file.
This (make file syntax) you need to get familiar with, even if you use alternatives to make/nmake, such as Apache Ant; the reason is that many (most?) source code packages include a make file.
In a nutshell, make files are text files that allow defining targets and the associated commands to build each target. Each target is associated with its dependencies, which allows the make logic to decide which targets are out of date and should be rebuilt, and, before rebuilding them, which dependencies should themselves be rebuilt first. That way, when you modify, say, an include file (and if the make file is properly configured), any C file that used this header will be recompiled, and any binary which links with the corresponding obj file will be rebuilt as well. make also includes options to force all targets to be rebuilt, which is sometimes handy to be sure that you truly have a current build (for example, when some dependencies of a given object are not declared in the make file).
On the Pre-processor:
The pre-processor is the first step toward compiling, although it is technically not part of the compilation. The purposes of this step are:
to remove comments and extraneous whitespace
to substitute any macro reference with the relevant C/C++ syntax. Some macros, for example, are used to define constant values, such as, say, an email address used in the program; during pre-processing, any reference to this constant value (by convention such constants are named with ALL_CAPS_AND_UNDERSCORES) is replaced by the actual C string literal containing the email address.
to exclude all conditional-compilation branches that are not relevant (#ifdef and the like)
What's important to know about the pre-processor is that its directives are NOT part of the C language proper, and they serve several important functions such as the conditional compilation mentioned earlier (used, for example, to have multiple versions of the program, say for different operating systems, or indeed for different compilers).
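A tiny sketch of what the pre-processor does with defines and conditional branches (names invented for the example):
#define SUPPORT_EMAIL "support@example.com"   // constant substituted textually
#define ENABLE_TRACING                        // conditional-compilation switch

const char* contactAddress() {
    return SUPPORT_EMAIL;   // after pre-processing: return "support@example.com";
}

void traceStartup() {
#ifdef ENABLE_TRACING       // this branch is kept or dropped before compilation proper
    // tracing code is compiled in only when ENABLE_TRACING is defined
#endif
}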
Taking it from there...
After this manifesto of mine... I encourage you to read just a little more, and then to dive into programming and building binaries. It is a very good idea to try and get a broad picture of the framework etc., but this can be overdone, a bit akin to the exchange student who stays in his/her room reading the Webster dictionary to be "prepared" for meeting native speakers, rather than just "doing it!".
Ideally you shouldn't need to care which C++ compiler you are using. Compatibility with the standard has gotten much better in recent years (even from Microsoft).
Compiler flags obviously differ, but the same features are generally available; it's just a differently named option to, e.g., set the warning level on GCC versus MS cl.
The build system is independent of the compiler; you can use any make with any compiler.
That is a lot of questions in one.
C++ compilers are a lot like hammers: They come in all sizes and shapes, with different abilities and features, intended for different types of users, and at different price points; ultimately they all are for doing the same basic task as the others.
Some are intended for highly specialized applications, like high-performance graphics, and have numerous extensions and libraries to assist the engineer with those types of problems. Others are meant for general purpose use, and aren't necessarily always the greatest for extreme work.
The technique for using each type of hammer varies from model to model—and version to version—but they all have a lot in common. The macro preprocessor is a standard part of C and C++ compilers.
A brief comparison of many C++ compilers is here. Also check out the list of C compilers, since many programs don't use any C++ features and can be compiled by ordinary C.
C++ compilers don't "view" makefiles. The rules of a makefile may invoke a C++ compiler, but also may "compile" assembly language modules (assembling), process other languages, build libraries, link modules, and/or post-process object modules. Makefiles often contain rules for cleaning up intermediate files, establishing debug environments, obtaining source code, etc., etc. Compilation is one link in a long chain of steps to develop software.
Also, many development environments abstract the makefile into a "project file" which is used by an integrated development environment (IDE) in an attempt to simplify or automate many programming tasks. See a comparison here.
As for learning: choose a specific problem to solve and dive in. The target platform (Linux/Windows/etc.) and problem space will narrow the choices pretty well. Which you choose is often linked to other considerations, such as working for a particular company, or being part of a team. C++ has something like 95% commonality among all its flavors. Learn any one of them well, and learning the next is a piece of cake.

MAP file analysis - where does my code size come from?

I am looking for a tool to simplify analysing a linker map file for a large C++ project (VC6).
During maintenance, the binaries grow steadily and I want to figure out where the growth comes from. I suspect some overzealous template expansion in a library shared between different DLLs, but just browsing the map file doesn't give good clues.
Any suggestions?
This is a wonderful analysis/explorer/viewer tool for compiler-generated map files. Check whether you can use it to explore gcc-generated map files as well.
amap : A tool to analyze .MAP files produced by 32-bit Visual Studio compiler and report the amount of memory being used by data and code.
This app can also read and analyze MAP files produced by the Xbox360, Wii, and PS3 compilers.
The map file should have the size of each section; you can write a quick tool to sort symbols by this size. There's also a command-line tool that comes with MSVC (undname.exe) which you can use to demangle the symbols.
Once you have the symbols sorted by size, you can generate this weekly or daily as you like and compare how the size of each symbol has changed over time.
The map file alone from any single build may not tell much, but a historical report of compiled map files can tell you quite a bit.
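In the spirit of the "quick tool" suggested above, here is a rough sketch that pulls out the section-size lines of an MSVC map file and sorts them largest-first; the exact line layout varies between toolchains, so treat the parsing as an assumption to adapt, not a definitive implementation:
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Assumes lines roughly like:  0001:00000000 00012345H .text  CODE
// i.e. <address> <size-in-hex ending in 'H'> <name> ...
int main(int argc, char** argv) {
    if (argc < 2) {
        std::printf("usage: mapsize <file.map>\n");
        return 1;
    }
    std::ifstream in(argv[1]);
    std::vector<std::pair<unsigned long, std::string> > items;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string addr, sizeField, name;
        if (!(fields >> addr >> sizeField >> name))
            continue;                                   // skip headers/blank lines
        if (sizeField.empty() || sizeField[sizeField.size() - 1] != 'H')
            continue;                                   // not a size field
        unsigned long size = std::strtoul(sizeField.c_str(), 0, 16);
        items.push_back(std::make_pair(size, name));
    }
    std::sort(items.rbegin(), items.rend());            // largest contributors first
    for (std::size_t i = 0; i < items.size(); ++i)
        std::printf("%10lu  %s\n", items[i].first, items[i].second.c_str());
    return 0;
}
Comparing this output from build to build, as suggested above, is where it becomes genuinely useful.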
Have you tried using dumpbin.exe on your .obj files?
Stuff to look for:
Using a lot of STL?
A lot of c++ classes with inline methods?
A lot of constants?
If anything of the above applies to you. Check if they have a wide visibility, i.e. if they are used/seen in large parts of your application.
No suggestion for a tool, but a guess as to a possible cause: do you have incremental linking enabled? This can cause expansion during subsequent builds...
The linker will strip unused symbols if you're compiling with /opt:ref, so if you're using that and not using incremental linking, I would expect expansion of the binaries to be only a result of actual new code being added. That's as far as I know... hope it helps a little.
Templates, macros, and the STL in general all use a tremendous amount of space. Heralded as a great universal library, Boost adds much space to projects. BOOST_FOREACH is an example of this: it expands to hundreds of lines of templated code, which could simply be avoided by writing a proper loop by hand, which is in general only a few more keystrokes.
Get Visual Assist X to save typing instead of using templates. Also consider owning the code you use. Macros and inline function expansion are not necessarily going to show up in the map file.
Also, if you can, move away from DLL architecture to statically linking everything into one executable which runs in different "modes". There is absolutely nothing wrong with using the same executable image as many times as you want just passing in a different command line parameter depending on what you want it to do.
DLLs are the worst culprits for wasting space and slowing down the running time of a project. People think they are space savers, when in fact they tend to have the opposite effect, sometimes increasing project size by ten times! Plus, they increase swapping. Use fixed code sections (no relocation section) for performance.