Strip symbols and RTTI text from GCC executable - c++

My project uses template metaprogramming heavily. Most of the action happens inside recursive templates which produce objects and functions with very long (mangled) symbol names.
Despite the build time being only ~30 sec, the resulting executable is about a megabyte, and it's mostly symbol names.
On Linux, adding a -s argument to GCC brings the size down to ~300 KiB, but a quick look with a text editor shows there are still a lot of cumbersome names in there. I can't find out how to strip anything properly on OS X; I'll just write that off for now.
I suspect that the vtable entries for providing typeid(x).name() are taking up a big chunk. Removing all use of the typeid operator did not cause anything more to be stripped on Linux. I think that the default exception handler uses the facility to report the type of an uncaught exception.
How might I maximize strippage and minimize these kilobyte-sized symbols in my executable?

Just run the strip program on the final executable. If you want to be fancier, you can use other tools to store the debug info separately, but for your stated purpose, just strip a.out is fine. Maybe use the --strip-all option; I haven't tried that myself to see whether it differs from the default behavior.
If you really want to try disabling RTTI, well, that's gcc -fno-rtti. But it may break your program badly; only one way to find out, I guess.
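For concreteness, a minimal sketch of both steps on Linux (main.cpp and a.out are stand-ins for your own files):
g++ -O2 -o a.out main.cpp          # build as usual
strip a.out                        # remove the symbol table from the final executable
strip --strip-all a.out            # the explicit form of the same request
g++ -O2 -fno-rtti -o a.out main.cpp   # optional: rebuild without RTTI
Note that with -fno-rtti, typeid and any dynamic_cast that needs a runtime check will no longer compile, so libraries relying on them can break.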

Related

Is stripping certain information from a C++ binary safe?

I am brand new to C++, trying to create a program that reads pixels on the screen on Linux.
I currently compile the project without any optimization flag, as I am unsure what that does to the program, but that would be another question. Here's mine:
Is stripping certain information from a C++ binary safe?
I found the manual page of the strip program, which looks helpful.
As I don't really know what stripping means in this context, I am unsure whether it is as simple as stripping all of it with:
-s --strip-all Remove all symbol and relocation information
But, of course, I'd want the program to still work flawlessly, so does it interfere in any way with the program's execution?
As for my motivation for stripping: I want to know whether it's safe, and, as I said, I don't really know what stripping means in this context, so I'd like an answer to cover that as well, so I can decide for myself.
Symbols are used for debugging.
Your application will continue to work without issues if you strip them, but you may find it harder to debug if there's a problem.
Relocation information is used for dynamic library loading and for address space layout randomisation (thank you @interjay); and from the strip documentation:
--remove-relocations=sectionpattern
... Note that using this option inappropriately may make the output file unusable. ...
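If you want the debugging experience back later, a common pattern is to split the debug info into a separate file before stripping; a sketch using binutils, with myprog as a placeholder name:
objcopy --only-keep-debug myprog myprog.debug   # save the debug info
strip --strip-all myprog                        # strip the shipped binary
objcopy --add-gnu-debuglink=myprog.debug myprog # record where the debug file lives
GDB will then locate myprog.debug via the .gnu_debuglink section when you debug the stripped binary.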

How to export function names and variable names using GCC or clang?

I am making a commercial piece of software and I don't want it to be easily crackable. It is targeted at Linux and I am compiling it using GCC (8.2.1). The problem is that when I compile it, technically anyone can use a disassembler like IDA or Binary Ninja to see all the function names (in a disassembler screenshot, they are visible in the left panel).
Is there any way to protect my program from this kind of reverse engineering? Is there any way of exporting all of these function names and variables from the code automatically (with GCC or clang?), so I can make a simple script to change them to something completely random before compilation?
So you want to hide/mask the names of symbols in your binary. You've decided that, to do this, you need a list of them so that you can write a script to modify them. You could get that list with nm, but you don't need any of that (rewriting names inside a compiled binary? That's a recipe for disaster).
Instead, just do what everybody does in a release build and strip the symbols! You'll get a much smaller binary, too. Of course this doesn't prevent reverse engineering (nothing does), though it arguably makes the task more difficult.
Honestly, you should be stripping your release binaries anyway, and not to prevent cracking. Common wisdom is not to try too hard to prevent cracking, because you'll inevitably fail, at the cost of wasted development time (and possibly a codebase that is harder to maintain, or an executable that is slower or less useful for the honest customer).
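A rough sketch of the usual release recipe (app and main.cpp are placeholders; flag choices are illustrative, not gospel):
g++ -O2 -s -o app main.cpp   # -s asks the driver to strip at link time
nm app                       # should now report "no symbols"
nm -D -C app                 # the dynamic (exported) symbols still remain, demangled
The nm check is worth keeping in mind: symbols a shared library must export survive stripping, which is why the visibility machinery discussed elsewhere exists.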

Protection against accidental object incompatibility?

TL;DR
How can one protect against binary incompatibility resulting from compiler-argument typos in shared (possibly templated) headers' preprocessor directives, which control conditional compilation, across different compilation units?
Ex.
g++ ... -DYOUR_NORMAl_FLAG ... -o libA.so
/**Another compilation unit, or even project. **/
g++ ... -DYOUR_NORMA1_FLAG ... -o libB.so
/**Another compilation unit, or even project. **/
g++ ... -DYOUR_NORMAI_FLAG ... main.cpp libA.so //The possibilities!
The Basic Story
Recently, I ran into a strange bug: the symptom was a single SIGSEGV, which always seemed to occur at the same location after recompiling. This led me to believe there was some kind of memory corruption going on, and that the underlying pointer was not a pointer at all, but some data section.
I'll spare you the long and strenuous journey, which took almost two otherwise perfectly good work days, of tracking down the problem. Suffice it to say, Valgrind, GDB, nm, readelf, Electric Fence, GCC's stack smashing protection, and several more measures/methods/approaches all failed.
In utter devastation, my attention turned to the finest details in the build process, which was analogous to:
Build one small library.
Build one large library, which uses the small one.
Build the test suite of the large library.
There was a problem only when the large library was used as a static or dynamic library dependency (i.e. the dynamic linker loaded it automatically; no dlopen).
In the test case where all the library's code was simply compiled into the tests, everything worked: this was the most important clue.
The "Solution"
In the end, it turned out to be the simplest thing: a single (!) typo.
It turned out that the compilation flags for the test suite and the large library differed by a single character: a define which controlled the behavior of the small library was misspelled.
Critical information morsel: the small library had some templates. These were used directly in every case, without explicit instantiation in advance. The contents of one of the templated classes changed when the flag was toggled: some data fields were simply not present when the flag was defined!
The linker noticed nothing of this. (Since the class was templated, the resultant symbols were weak.)
The code used dynamic casts, and the class affected by this problem inherited from the mangled class, so things went south.
My question is as follows: how would you protect against this kind of problem? Are there any tools or solutions which address this specific issue?
Future Proofing
I've thought of two things, and believe no protection can be built on the object file level:
1: Save options implemented as preprocessor symbols in some well-defined place, preferably extracted by a separate build step. Provide a check script which uses this to check all compiler defines, plus the defines in user code, and integrate this check into the build process (a rough extraction sketch follows this list). Possibly use Levenshtein distance or similar to check for misspellings. Expensive, and the script/solution can get complicated. Possible problems with similar flags (but why have them?), and additional files must accompany the compiled library code. (Well, maybe with DWARF 2 this is untrue, but let's assume we don't want that.)
2: Centralize build options: cheap, and leaves a customization option open (think makefile.local), but creates monolithic monstrosities and strong coupling between projects.
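A rough sketch of the extraction step for idea 1, using GCC's -dM -E preprocessor dump ($LIBA_FLAGS and $TEST_FLAGS are placeholders for each unit's actual flag set):
g++ $LIBA_FLAGS -dM -E -x c++ /dev/null | sort > liba.defines
g++ $TEST_FLAGS -dM -E -x c++ /dev/null | sort > test.defines
diff liba.defines test.defines && echo "macro sets match"
A fuzzy (Levenshtein) comparison would have to be layered on top, but even this exact diff would have caught the single-character typo described above.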
I'd like to go ahead and quench a few flame-inducing embers that may be flaring up in some readers: "do not use preprocessor symbols" is not an option here.
Conditional compilation does have its place in high performance code, and doing everything with templates and enable_if-s would needlessly overcomplicate things. While the above setup is usually not desirable, it can arise from the development process.
Please assume you have no control over the situation, assume you have legacy code, assume everything you can to force yourself to avoid side-stepping.
If those won't do, generalize into ABI incompatibility detection, though this might escalate the scope of the question too much for SO.
I'm aware of:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/abi.html
DT_SONAME is not applicable.
Other version schemes therein are not applicable either - they were designed to protect a package which is in itself not faulty.
Mixing C++ ABIs to build against legacy libraries
Static analysis tool to detect ABI breaks in C++
If it matters, don't have a default case.
#ifdef YOUR_NORMAL_FLAG
// some code
#elif defined(YOUR_SPECIAL_FLAG)
// some other code
#else
// in case of a typo, this is a compilation error
# error "No flag specified"
#endif
This may lead to a long list of compiler options if conditional compilation is overused, but there are ways around that, like defining config files
flag=normal
flag2=special
which are parsed by build scripts to generate the options (and can be checked for typos there), or which can be parsed directly from the Makefile.
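Complementary to the #error guard, flag mismatches can also be turned into link-time errors by embedding the configuration into a symbol name. This is only a sketch of the idea; names like LIB_ABI_TAG and lib_built_with_* are invented for illustration:
// Shared header, included by the library and all clients.
#if defined(YOUR_NORMAL_FLAG)
#  define LIB_ABI_TAG lib_built_with_normal_flag
#elif defined(YOUR_SPECIAL_FLAG)
#  define LIB_ABI_TAG lib_built_with_special_flag
#else
#  error "No flag specified"
#endif
extern const int LIB_ABI_TAG;   // defined in exactly one library source file
namespace {
// Reference the tag from every translation unit that includes this header.
// GCC's "used" attribute keeps the reference from being optimised away.
__attribute__((used)) const int* const lib_abi_check = &LIB_ABI_TAG;
}
// In exactly one source file of the library:
//   const int LIB_ABI_TAG = 0;
The #error branch catches an unrecognised spelling; the tag symbol catches the subtler case where two units were each built with a valid but different configuration, producing an undefined-reference error at link time instead of memory corruption at run time.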

What are some techniques or tools for profiling excessive code size in C/C++ applications?

I have a C++ library that generates much larger code than I would really expect for what it is doing. From less than 50K lines of source I get shared objects that are almost 4 MB and static archives pushing 9 MB. This is problematic both because the library binaries are quite large and, much worse, because even simple applications linking against it typically gain 500 to 1000 KB in code size. Compiling the library with flags like -Os helps somewhat, but not very much.
I have also experimented with GCC's -frepo flag (even though all the documentation I've seen suggests that on Linux collect2 will merge duplicate templates anyway) and with explicit instantiation of templates that seemed "likely" to be duplicated a lot, but with no real effect in either case. Of course I say "likely" because, as with any kind of profiling, blind guessing like this is almost always wrong.
Is there some tool that makes it easy to profile code size, or some other way I can figure out what is taking up so much room, or, more generally, any other things I should try? Something that works under Linux would be ideal but I'll take what I can get.
If you want to find out what is being put into your executable, then ask your tools. Turn on the ld linker's --print-map (or -M) option to produce a map file showing what it has put in memory and where. Doing this for the statically linked example is probably more informative.
If you're not invoking ld directly, but only via the gcc command line, you can pass ld-specific options to ld from the gcc command line by preceding them with -Wl,.
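A sketch of both forms through the gcc driver (app and main.cpp are placeholders):
g++ -o app main.cpp -Wl,-Map=app.map          # write the link map to app.map
g++ -o app main.cpp -Wl,--print-map > app.map # same information, via stdout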
On Linux the linker certainly does merge multiple template instantiations.
Make sure you aren't measuring debug binaries (debug info could take up more than 75% of the final binary size).
One technique to reduce final binary size is to compile with -ffunction-sections and -fdata-sections, then link with -Wl,--gc-sections.
An even bigger reduction (we've seen 25%) may be possible with a development version of gold (the new ELF-only linker, part of binutils), linking with -Wl,--icf=all.
Another useful technique is reducing the set of symbols which are "exported" by your shared libraries (everything is exported by default), either via __attribute__((visibility(...))) or by using a linker script. Details in Ulrich Drepper's "How To Write Shared Libraries" (see "Export control").
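Sketching those two techniques together (foo.cpp and libfoo.so are placeholder names):
# Compile so every function/datum gets its own section, then let the
# linker discard the unreferenced ones:
g++ -fPIC -ffunction-sections -fdata-sections -c foo.cpp
g++ -shared -o libfoo.so foo.o -Wl,--gc-sections
# Hide all symbols by default and mark the intended API in the source,
# e.g.  __attribute__((visibility("default"))) void public_entry();
g++ -shared -fPIC -fvisibility=hidden -o libfoo.so foo.cpp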
One method that is very crude but very quick is to look at the size of your object files. Not all the code in the object files will be compiled into the final binary, so there may be a few false positives, but it can give a good impression of where the hotspots will be. Once you've found the largest object files you can then delve into them with tools like objdump and nm.
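For that crude-but-quick approach, the standard binutils tools already get you a long way (big.o is a placeholder for whichever object stands out):
size *.o                           # text/data/bss size of every object file
nm --size-sort -C big.o | tail -20 # the 20 largest symbols, demangled
objdump -h big.o                   # section-by-section breakdown of one object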

MAP file analysis - where does my code size come from?

I am looking for a tool to simplify analysing a linker map file for a large C++ project (VC6).
During maintenance, the binaries grow steadily and I want to figure out where the growth comes from. I suspect some overzealous template expansion in a library shared between different DLLs, but just browsing the map file doesn't give good clues.
Any suggestions?
This is a wonderful analysis/explorer/viewer tool for compiler-generated map files. Check whether it can explore GCC-generated map files as well.
amap : A tool to analyze .MAP files produced by 32-bit Visual Studio compiler and report the amount of memory being used by data and code.
This app can also read and analyze MAP files produced by the Xbox360, Wii, and PS3 compilers.
The map file should have the size of each section, you can write a quick tool to sort symbols by this size. There's also a command line tool that comes with MSVC (undname.exe) which you can use to demangle the symbols.
Once you have the symbols sorted by size, you can generate this weekly or daily as you like and compare how the size of each symbol has changed over time.
The map file alone from any single build may not tell much, but a historical report of compiled map files can tell you quite a bit.
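For the demangling step, undname takes a decorated name on its command line; a sketch with a made-up symbol:
undname ?Calculate@@YAHH@Z
which should print something like: int __cdecl Calculate(int)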
Have you tried using dumpbin.exe on your .obj files?
Stuff to look for:
Using a lot of STL?
A lot of C++ classes with inline methods?
A lot of constants?
If any of the above applies to you, check whether they have wide visibility, i.e. whether they are used/seen in large parts of your application.
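To follow up on the dumpbin suggestion, a few of its switches that are useful here (big.obj is a placeholder for a suspiciously large object):
dumpbin /SYMBOLS big.obj > big_symbols.txt
dumpbin /HEADERS big.obj
dumpbin /DISASM  big.obj > big_disasm.txt
/SYMBOLS lists every COFF symbol in the object, /HEADERS shows per-section sizes, and /DISASM shows what the template expansions actually compiled down to.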
No suggestion for a tool, but a guess as to a possible cause: do you have incremental linking enabled? This can cause expansion during subsequent builds...
The linker will strip unused symbols if you're compiling with /opt:ref, so if you're using that and not using incremental linking, I would expect expansion of the binaries to be only a result of actual new code being added. That's as far as I know... hope it helps a little.
Templates, macros, and the STL in general all use a tremendous amount of space. Heralded as a great universal library, Boost adds much space to projects. BOOST_FOREACH is an example of this: it expands to hundreds of lines of templated code, which could simply be avoided by writing an ordinary loop by hand, which is in general only a few more keystrokes.
Get Visual AssistX to save typing instead of using templates. Also consider owning the code you use. Macros and inline function expansion are not necessarily going to show up.
Also, if you can, move away from a DLL architecture to statically linking everything into one executable which runs in different "modes". There is absolutely nothing wrong with using the same executable image as many times as you want, just passing in a different command line parameter depending on what you want it to do.
DLLs are the worst culprit for wasting space and slowing down the running time of a project. People think they are space savers, when in fact they tend to have the opposite effect, sometimes increasing project size by ten times! Plus they increase swapping. Use fixed code sections (no relocation section) for performance.