g++ generated Assembly looks ugly - c++

I'm quite familiar with gcc's assembly output. Recently I had to use g++ for some code cleanup, and since I know assembly well, I often look at how good the compiler-generated asm is.
But the naming conventions in g++ output are just bizarre. Are there any guidelines on how to read its asm output?
Thanks a lot.

I don't find g++'s asm 'ugly' or hard to understand, though I've been working with GCC for over 8 years now.
On Linux, function labels usually start with _ZN. "_Z" is the prefix that marks a C++ mangled name (as opposed to a plain C symbol), and "N" opens a nested name, i.e. one qualified by a namespace or class. Then come the enclosing namespaces or classes, the function name, template arguments if any, a closing "E", and finally the parameter types.
Example:
// tests::vec4::testEquality()
_ZN5tests4vec412testEqualityEv
_Z - C++ mangling prefix
N - start of a nested (qualified) name (_ZNK marks a const member function, _ZZ an entity local to a function)
5tests - length (5 chars) + name
4vec4 - length (4 chars) + nested namespace or class
12testEquality - length (12 chars) + function name
E - end of the nested name
v - void parameter list (no arguments)
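To make the length-prefixed layout concrete, here is a minimal sketch that splits such a name into its components. The function name split_nested_name is made up for illustration, and it only understands the simple _ZN<len><name>...E form from the example; templates, substitutions and qualifiers are deliberately not handled.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Splits a mangled name of the simple form _ZN<len><name>...E<params> into
// its name components. Returns an empty vector for anything it does not
// understand (this is an illustration, not a full demangler).
std::vector<std::string> split_nested_name(const std::string& mangled) {
    std::vector<std::string> parts;
    if (mangled.compare(0, 3, "_ZN") != 0) return parts;
    std::size_t pos = 3;
    while (pos < mangled.size() && mangled[pos] != 'E') {
        std::size_t len = 0;
        // Read the decimal length that precedes each component.
        while (pos < mangled.size() &&
               std::isdigit(static_cast<unsigned char>(mangled[pos]))) {
            len = len * 10 + static_cast<std::size_t>(mangled[pos] - '0');
            if (len > mangled.size()) return {};  // length cannot fit
            ++pos;
        }
        if (len == 0 || len > mangled.size() - pos) return {};
        parts.push_back(mangled.substr(pos, len));
        pos += len;
    }
    if (pos >= mangled.size()) return {};  // closing 'E' is missing
    return parts;
}

// Small self-check that doubles as a usage example.
void self_check() {
    const auto parts = split_nested_name("_ZN5tests4vec412testEqualityEv");
    assert(parts.size() == 3 && parts[2] == "testEquality");
}
```

Note how "4vec4" is followed directly by "12testEquality": because each component carries its own length, a name ending in a digit causes no ambiguity.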

From man g++:
-fverbose-asm
Put extra commentary information in the generated assembly code to make it more
readable. This option is generally only of use to those who actually need to read the
generated assembly code (perhaps while debugging the compiler itself).

If you're looking at the naming convention for external symbols then this will follow the name mangling convention of the platform that you are using. It can be reversed with the c++filt program which will give you the human readable version of C++ function names, although they will (in all probability) no longer be valid linker symbols.
If you're just looking at local function labels, then you're out of luck. g++'s assembler output is for talking to the assembler and not really designed for ease of human comprehension. It's going to generate a set of relatively meaningless labels.
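As a small illustration of the external-symbol case, the sketch below defines the same function twice, once with C++ linkage and once with C linkage. The function names are hypothetical. Under the Itanium ABI that g++ uses on Linux, the first is emitted as _Z7add_cppii and the second simply as add_c, which you can check in the output of g++ -S or with nm.

```cpp
#include <cassert>

// C++ linkage: the symbol encodes the name length (7), the name and the
// two int parameters, giving _Z7add_cppii under the Itanium ABI.
int add_cpp(int a, int b) { return a + b; }

// C linkage: extern "C" switches mangling off, so the symbol is just add_c.
extern "C" int add_c(int a, int b) { return a + b; }

// Both behave identically at run time; only the linker-visible names differ.
void self_check() {
    assert(add_cpp(2, 3) == 5);
    assert(add_c(2, 3) == 5);
}
```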

If the code has debugging information, objdump can provide a more helpful disassembly:
-S, --source Intermix source code with disassembly
-l, --line-numbers Include line numbers and filenames in output

For people who are working on demangling those names inside the program (like me), hopefully this thread helps.
import subprocess

def demangle(name):
    # Pass one mangled name to c++filt and return the first line it prints.
    result = subprocess.run(['c++filt', name], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()[0]

print(demangle('_ZNSt15basic_stringbufIcSt11char_traitsIcESaIcEE17_M_stringbuf_initESt13_Ios_Openmode'))
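If the program doing the demangling is itself built with g++ or clang, the same result can probably be had in-process, without spawning c++filt, through abi::__cxa_demangle from <cxxabi.h>. The following is a minimal sketch; the wrapper name demangle is my own, and it falls back to returning the input when the name cannot be demangled.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Returns the demangled form of an Itanium-ABI symbol name, or the input
// unchanged if __cxa_demangle reports a failure.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // __cxa_demangle allocates its result with malloc, so release it with free.
    std::unique_ptr<char, void (*)(void*)> text(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    return (status == 0 && text) ? std::string(text.get())
                                 : std::string(mangled);
}
```

For example, demangle("_ZN5tests4vec412testEqualityEv") yields tests::vec4::testEquality().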

Related

Can I tell my compiler to "inline" a function also w.r.t. debug/source-line info?

In my code (either C or C++; let's say it's C++) I have a one-liner inline function foo() which gets called from many places in the code. I'm using a profiling tool which gathers statistics by line in the object code, which it translates into statistics by using the source-code-line information (which we get with -g in clang or GCC). Thus the profiler can't distinguish between calls to foo() from different places.
I would like the stats to be counted separately for the different places foo() gets called. For this to happen, I need the compiler to "fully" inline foo(), including forgetting about it when it comes to the source location information.
Now, I know I can achieve this by using a macro: that way, there is no function, and the code is just pasted where I use it. But that won't work for operators, for example, and it may be a problem with templates. So, can I tell the compiler to do what I described?
Notes:
Compiler-specific answers are relevant; I'm mainly interested in GCC and clang.
I'm not compiling a debug build, i.e. optimizations are turned on.
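One thing that may be worth trying, offered as a sketch rather than a confirmed fix: GCC and recent clang document an artificial function attribute, which for an inlined function asks the compiler to associate the inlined instructions with the call site instead of the callee's own source line. The function body and the two call sites below are hypothetical, and whether a particular profiler then reports separate per-call-site statistics is an assumption you would have to verify.

```cpp
#include <cassert>

// always_inline requests inlining even where the optimizer might decline;
// artificial asks that the inlined body be attributed to the caller
// (how that is expressed depends on the debug info format).
__attribute__((always_inline, artificial))
inline int foo(int x) { return x * 2 + 1; }

// Two hypothetical call sites that should show up separately in a profile.
int call_site_a(int v) { return foo(v) + 10; }
int call_site_b(int v) { return foo(v) - 10; }

// Small self-check that doubles as a usage example.
void self_check() {
    assert(call_site_a(1) == 13);
    assert(call_site_b(1) == -7);
}
```

Since it is an ordinary function attribute, it should also be usable on operators and templates, where a macro is not an option.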

How to disable inline assembly in GCC?

I'm developing an online judge system for programming contests. Since C/C++ inline assembly is not allowed in certain programming contests, I would like to add the same restriction to my system.
I would like to let GCC produce an error when compiling a C/C++ program containing inline assembly, so that any program containing inline assembly will be rejected. Is there a way to achieve that?
Note: disabling inline assembly is just for obeying the rules, not for security concerns.
Is there a way to disable inline assembler in GCC?
Yes there are a couple of methods; none useful for security, only guard-rails that could be worked around intentionally, but will stop people from accidentally using asm in places they didn't realize they shouldn't.
Turn off the asm keyword in the compiler (C only)
To do it at compile time, use the option -fno-asm. However, keep in mind that this only affects the asm keyword in C, not in C++, and that it does not affect __asm__ or __asm in either language.
Documentation:
-fno-asm
Do not recognize "asm", "inline" or "typeof" as a keyword, so that code can use these words as identifiers. You can use the keywords "__asm__", "__inline__" and "__typeof__" instead. -ansi implies -fno-asm.
In C++ , this switch only affects the "typeof" keyword, since "asm" and "inline" are standard keywords. You may want to use the -fno-gnu-keywords flag instead, which has the same effect. In C99 mode (-std=c99 or -std=gnu99), this switch only affects the "asm" and "typeof" keywords, since "inline" is a standard keyword in ISO C99.
Define the keyword as a macro
You can use the parameters -Dasm=error -D__asm__=error -D__asm=error
Note that this construction is generic. What it does is to create macros. It works pretty much like a #define. The documentation says:
-D name=definition
The contents of definition are tokenized and processed as if they appeared during translation phase three in a #define directive. In particular, the definition will be truncated by embedded newline characters.
...
So what it does is simply to change occurrences of asm, __asm, or __asm__ to error. This is done in the preprocessor phase. You don't have to use error. Just pick anything that will not compile.
Use a macro that fires during compilation
To make the failure happen during compilation with a readable message, you can use a function-like macro, as suggested in the comments by zwol: -D'asm(...)=_Static_assert(0,"inline assembly not allowed")'. This also avoids the problem of the program already containing an identifier called error.
Note: This method requires -std=c11 or higher.
Using grep before using gcc
Yet another way that may be the solution to your problem is to just do a grep in the root of the source tree before compiling:
grep -nr "asm"
This will also catch __asm__ but it may give false positives, for instance if you have a string literal, identifier or comment containing the substring "asm". But in your case you could solve this problem by also forbidding any occurrence of that string anywhere in the source code. Just change the rules.
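If the identifier false positives matter, a token-based check is one way to reduce them. The sketch below is an assumption-laden illustration, not part of any existing tool: the function name is made up, and it only reports asm, __asm and __asm__ when they appear as whole identifiers. Comments and string literals are still scanned, so those false positives remain.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// True if `source` contains asm, __asm or __asm__ as a whole identifier.
// Identifiers such as "plasma" or "asm_count" do not count as matches.
bool mentions_asm_keyword(const std::string& source) {
    auto is_ident = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    };
    std::size_t i = 0;
    while (i < source.size()) {
        if (!is_ident(source[i])) { ++i; continue; }
        // Collect one maximal run of identifier characters.
        const std::size_t start = i;
        while (i < source.size() && is_ident(source[i])) ++i;
        const std::string token = source.substr(start, i - start);
        if (token == "asm" || token == "__asm" || token == "__asm__") return true;
    }
    return false;
}

// Small self-check that doubles as a usage example.
void self_check() {
    assert(mentions_asm_keyword("__asm__(\"nop\");"));
    assert(!mentions_asm_keyword("int plasma = 1;"));
}
```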
Possible unexpected problems
Note that disabling assembly can cause other problems. For instance, I could not use stdio.h with this option. It is common for system headers to contain inline assembly code.
A way to cheat above methods
Aside from the trivial #undef __asm__, it is possible to execute strings as machine code. See this answer for an example: https://stackoverflow.com/a/18477070/6699433
A piece of the code from the link above:
/* our machine code */
char code[] = {0x55,0x48,0x89,0xe5,0x89,0x7d,0xfc,0x48,
0x89,0x75,0xf0,0xb8,0x2a,0x00,0x00,0x00,0xc9,0xc3,0x00};
/* copy code to executable buffer */
void *buf = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (buf, code, sizeof(code));
/* run code */
int i = ((int (*) (void))buf)();
The code above is only intended to give a quick idea of how to get around the rules the OP has stated. It is not intended to be a good example of how to actually do it in practice. Furthermore, the code is not mine. It is just a short quote from the link I supplied. If you have ideas about how to improve it, then please comment on 4pie0's original post instead.

Why doesn't g++ generate "raw" symbols?

From C we know what legal variable names are. The general regex for the legal names looks similar to [A-Za-z_][A-Za-z0-9_]*.
Using dlsym we can load arbitrary strings, and C++ mangles names that include # in the ABI.
My question is: can arbitrary strings be used? The documentation on dlsym does not seem to mention anything.
Another question that came up appears to imply that it is fully possible to have arbitrary null-terminated symbols. This leads me to ask the following question:
Why doesn't g++ emit raw function signatures, with name and parameter list, including namespace and class membership?
Here's what I mean:
namespace test {
class A
{
int myFunction(const int a);
};
}
namespace test {
int A::myFunction(const int a){return a * 2;}
}
Does not get compiled to
int ::test::A::myFunction(const int a)\0
Instead, it gets compiled to - on my 64 bit machine, using g++ 4.9.2 -
0000000000000000 T _ZN4test1A10myFunctionEi
This output is read by nm. The code was compiled using g++ -c test.cpp -o out
I'm sure this decision was made pragmatically, to avoid having to change pre-existing C linkers (it quite possibly even originated with cfront). By emitting symbols that use only the character set the C linker is already used to, you avoid any number of linker updates and can use the linker off the shelf.
Additionally C and C++ are widely portable languages and they wouldn't want to risk breaking a more obscure binary format (perhaps on an embedded system) by including unexpected symbols.
Finally since you can always demangle (with something like gc++filt for example) it probably didn't seem worth using a full text representation.
P.S. You would absolutely not want to include the parameter name in the function name: People will not be happy if renaming a parameter breaks ABI. It's hard enough to keep ABI compatibility already.
GCC is compliant with the Itanium C++ ABI. If your question is “Why does the Itanium C++ ABI require names to be mangled that way?” then the answer is probably
because its designers thought this would be a good idea and
shorter symbols make for smaller object files and faster dynamic linking.
For the second point, there is a pretty good explanation in Ulrich Drepper's article How To Write Shared Libraries.
Because of the limitations that a linker (and that includes the OS's dynamic linker) imposes on exported names: character set and length. Mangling arose precisely because of these limitations.
Corollary: in environments where these limitations don't exist (various VMs that use their own linkers, e.g. .NET, Java), mangling doesn't exist either.
Each compiler that produces exports incompatible with those of other compilers must use a different scheme, because a linker (static or dynamic) doesn't care about ABIs; all it cares about is identifiers.
You basically answered your own question:
The general regex for the legal names looks similar to [A-Za-z_][A-Za-z0-9_]*.
From the beginning, C++ used preexisting (C) linker / loader technology. There is nothing "C++" about either ld, ld-linux.so etc.
So linking is limited to what was legal in C already. That does not include colons, parentheses, ampersands, asterisks, and whatever else you would need to encode C++ identifiers in plain text.
(In this answer I ignore that you made several typos in your example of ::test::A::void myFunction(const int a)).
This format is:
not programmer-specific; consider that all these are the same, so why confuse people:
int ::test::A::myFunction(const int)
int ::test::A::myFunction(int const)
int test::A::myFunction(int const)
int test :: A :: myFunction (int const)
and so on…
unambiguous
terse; no parameter names or other unnecessary decorations
easier to parse (notice that the length of each component is present as a number)
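The point about spellings that all mean the same thing can be checked at compile time. The sketch below reuses the class from the question (with the member made public, which is my change so that it can be named from outside) and asserts that the top-level const on the parameter is not part of the function type. That is consistent with the single "i" after the "E" in _ZN4test1A10myFunctionEi.

```cpp
#include <type_traits>

namespace test {
class A {
public:  // made public here only so the member can be named from outside
    int myFunction(const int a);
};
int A::myFunction(const int a) { return a * 2; }
}  // namespace test

// Top-level const on a parameter is dropped from the function type.
constexpr bool param_const_ignored =
    std::is_same<int(const int), int(int)>::value;

// The member's type is therefore int (test::A::*)(int): no const, no
// parameter name, nothing that a renamed parameter could change.
constexpr bool member_type_plain =
    std::is_same<decltype(&test::A::myFunction), int (test::A::*)(int)>::value;

static_assert(param_const_ignored && member_type_plain,
              "parameter names and top-level const are not part of the type");
```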
Meanwhile, I see no benefit at all in choosing a human-readable looks-like-C++ format for a C++ ABI. This stuff is supposed to be optimised for machines. Why would you make it less optimal for machines, in order to make it more optimal for humans? And probably failing at the latter whilst doing so.
You say that your compiler does not emit "raw symbols". I posit that it does precisely that.

gdb: size of a struct that isn't in context?

Sometimes I need to know size of a struct which is not in the scope (not even on the stack, i.e. frame-related commands won't help). E.g. it happens for debugging client + server communication, when restarting the apps to just break somewhere in context of the struct with the purpose of finding the size is uncomfortable and time consuming.
How do I find size of a struct defined in a header with disregard to my current context?
For C, gdb's "expression language" is just ordinary C expressions, with a few handy extensions for debugging. This is less true for C++, primarily because C++ is just much more difficult to parse, so the expression language there tends to be a subset of C++ plus some gdb extensions.
So, the short answer is you can just type:
(gdb) print sizeof(mystruct)
However, there are caveats.
First, gdb's current language matters. You can find this with show language. In the case of a struct type, in C++ there is an automatic typedef, but in C there is not. So if you are using the auto language (and you usually should), and are stopped in a C frame, you will need to use the keyword:
(gdb) print sizeof(struct mystruct)
Now, this still may not work. The usual reason at this point is that the structure isn't used in your program, and so doesn't show up in the debug info. The debug info can be optimized out even if you think it ought to have been available, because it is up to the compiler. For example, I think (though it is hard to remember for sure) that if a struct is only used in sizeof expressions, and no variable of that type is ever defined, GCC won't emit DWARF for it.
You can check to see if the type is available using readelf or dwgrep, like:
$ readelf -wi myexecutableorlibrary | grep mystruct
(Though in real life I usually use less and then examine the DWARF DIEs carefully. You will need to know a little DWARF to make sense of this.)
Sometimes in gdb it's handy to use the "filename" extension to specify exactly which entity you mean. Like:
(gdb) print 'myfile.c'::variable
Not sure if that works for types, and anyway it shouldn't usually be necessary for them.
In C/C++, you have the sizeof operator, which will give you the size of any type (including a struct) or variable.
I'm not sure if you can apply this while debugging but you could simply have a test program with the same headers (type definitions) tell you what the size of your types is.
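A minimal sketch of such a test program's core might look like the following. The struct here is a hypothetical stand-in (in practice you would include your real header instead), and size_report is a helper name I made up. The number is only meaningful if the test program is built with the same compiler, target and packing options as the real application.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical stand-in for the struct from your real header.
struct mystruct {
    std::uint32_t id;
    std::uint8_t flags;
    double payload;
};

// Formats "name: N bytes" for a type, using the compiler's own sizeof.
template <typename T>
std::string size_report(const char* name) {
    assert(name != nullptr);
    std::ostringstream out;
    out << name << ": " << sizeof(T) << " bytes";
    return out.str();
}
```

Printing size_report<mystruct>("mystruct") from a small main gives the size, including any padding the compiler inserted.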

Can I ungarble GCC's RTTI names?

Using gcc, when I ask for an object/variable's type using typeid, I get a different result from the type_info::name method from what I'd expect to get on Windows. I Googled around a bit, and found out that RTTI names are implementation-specific.
Problem is, I want to get a type's name as it would be returned on Windows. Is there an easy way to do this?
If it's what you're asking, there is no compiler switch that would make gcc behave like msvc regarding the name returned by type_info::name().
However, in your code you can rely on the gcc specific __cxa_demangle function.
There is in fact an answer on SO that addresses your problem.
Reference: libstdc++ manual, Chapter 40. Demangling.
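As a rough sketch of that approach, the helper below (the name readable_type_name and the demo::Widget type are made up for illustration) passes type_info::name() through abi::__cxa_demangle. The result is readable, but it is an assumption that it will match what MSVC prints; as far as I know MSVC also prefixes class or struct, so an exact match would need extra handling.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>
#include <typeinfo>

namespace demo { struct Widget {}; }  // hypothetical type for illustration

// Returns a human-readable name for T, falling back to the raw
// type_info::name() string if demangling fails.
template <typename T>
std::string readable_type_name() {
    const char* raw = typeid(T).name();
    assert(raw != nullptr);
    int status = 0;
    // __cxa_demangle allocates its result with malloc, so release it with free.
    std::unique_ptr<char, void (*)(void*)> text(
        abi::__cxa_demangle(raw, nullptr, nullptr, &status), std::free);
    return (status == 0 && text) ? std::string(text.get()) : std::string(raw);
}
```

Keep in mind that typeid ignores top-level cv-qualifiers and references, so readable_type_name<const int&>() reports the same name as readable_type_name<int>().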
C++ function names, as the linker sees them, include the argument type information (and, depending on the ABI, the return type) as well as the class and method name. When compiled, they are 'mangled' into a standard form (standard for each compiler) that can act as an assembler symbol and includes all the type information.
You need to run a function or program to reverse this mangling, called a demangler.
Try running
c++filt -t < myoutput.txt
on the saved output of type_info::name(). c++filt reads names from standard input when none are given as arguments, and -t makes it demangle type names as well as function names. This turns the mangled name back into a human-readable form.
Based on this other question, "Is there an online name demangler for C++?", I've written an online tool for this: c++filtjs