Is it possible to strip type names from executable while keeping RTTI enabled? - c++

I recently disabled RTTI on my compiler (MSVC10) and the executable size decreased significantly. By comparing the produced executables using a text editor, I found that the RTTI-less version contains much less symbol names, explaining the saved space.
AFAIK, those symbol names are only used to fill the type_info structure associated with each the polymorphic type, and one can programmatically access them calling type_info::name().
According to the standard, the format of the string returned by type_info::name() is unspecified. That is, no one can rely one it to do serious things portably. So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
But... is it possible ? I'm using MSVC10 and I've not found any option to do that. I can either disable completely RTTI (/GR-), or enable it with full type names (/GR). Does any compiler provide such an option?

So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
You are misreading the standard. The intent of making the return value from type_info::name() unspecified (other than a null-terminated binary string) was to give the implementers of the compiler/library/run-time environment free reign to implement the RTTI requirements as they see best. You, the programmer, have no say in how the Application Binary Interface (if there is one) is designed or implemented.

You're asking three different questions here.
The initial question asks whether there's any way to get MSVC to not generate names, or whether it's possible with other compilers, or, failing that, whether there's any way to strip the names out of the generated type_info without breaking things.
Then you want to know whether it would be possible to modify the MS ABI (presumably not too radically) so that it would be possible to strip the names.
Finally, you want to know whether it would be possible to design an ABI that didn't have names.
Question #1 is itself a complex question. As far as I know, there's no way to get MSVC to not generate names. And most other compilers are aimed at ABIs that specifically define what typeid(foo).name() must return, so they also can't be made to not generate names.
The more interesting question is, what happens if you strip out the names. For MSVC, I don't know the answer. The best thing to do here is probably to try it—go into your DLLs and change the first character of each name to \0 and see if it breaks dynamic_cast, etc. (I know that you can do this with Mac and linux x86_64 executables generated by g++ 4.2 and it works, but let's put that aside for now.)
On to question #2, assuming blanking the names doesn't work, it wouldn't be that hard to modify a name-based system to no longer require names. One trivial solution is to use hashes of the names, or even ROT13-encoded names (remember that the original goal here is "I don't want casual users to see the embarrassing names of my classes"). But I'm not sure that would count for what you're looking for. A slightly more complex solution is as follows:
For "dllexport"ed classes, generate a UUID, put that in the typeinfo, and also put it in the .LIB import library that gets generated along with the DLL.
For "dllimport"ed classes, read the UUID out of the .LIB and use that instead.
So, if you manage to get the dllexport/dllimport right, it will work, because your exe will be using the same UUID as the dll. But what if you don't? What if you "accidentally" specify identical classes (e.g., an instantiation of the same template with the same parameters) in your DLL and your EXE, without marking one as dllexport and one as dllimport? RTTI won't see them as the same type.
Is this a problem? Well, the C++ standard doesn't say it is. And neither does any MS documentation. In fact, the documentation explicitly says that you're not allowed to do this. You cannot use the same class or function in two different modules unless you explicitly export it from one module and import it into another. The fact that this is very hard to do with class templates is a problem, and it's a problem they don't try to solve.
Let's take a realistic example: Create a node-based linkedlist class template with a global static sentinel, where every list's last node points to that sentinel, and the end() function just returns a pointer to it. (Microsoft's own implementation of std::map used to do exactly this; I'm not sure if that's still true.) New up a linkedlist<int> in your exe, and pass it by reference to a function in your dll that tries to iterate from l.begin() to l.end(). It will never finish, because none of the nodes created by the exe will point to the copy of the sentinel in the dll. Of course if you pass l.begin() and l.end() into the DLL, instead of passing l itself, you won't have this problem. You can usually get away with passing a std::string or various other types by reference, just because they don't depend on anything that breaks. But you're not actually allowed to do so, you're just getting lucky. So, while replacing the names with UUIDs that have to be looked up at link time means types can't be matched up at link-loader time, the fact that types already can't be matched up at link-loader time means this is irrelevant.
It would be possible to build a name-based system that didn't have these problems. The ARM C++ ABI (and the iOS and Android ABIs based on it) restricts what programmers can get away with much less than MS, and has very specific requirements on how the link-loader has to make it work (3.2.5). This one couldn't be modified to not be name-based because it was an explicit choice in the design that:
• type_info::operator== and type_info::operator!= compare the strings returned by type_info::name(), not just the pointers to the RTTI objects and their names.
• No reliance is placed on the address returned by type_info::name(). (That is, t1.name() != t2.name() does not imply that t1 != t2).
The first condition effectively requires that these operators (and type_info::before()) must be called out of line, and that the execution environment must provide appropriate implementations of them.
But it's also possible to build an ABI that doesn't have this problem and that doesn't use names. Which segues nicely to #3.
The Itanium ABI (used by, among other things, both OS X and recent linux on x86_64 and i386) does guarantee that a linkedlist<int> generated in one object and a linkedlist<int> generated from the same header in another object can be linked together at runtime and will be the same type, which means they must have equal type_info objects. From 2.9.1:
It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms.
The compiler, linker, and link-loader must work together to make sure that a linkedlist<int> created in your executable points to the exact same type_info object that a linkedlist<int> created in your shared object would.
So, if you just took out all the names, it wouldn't make any difference at all. (And this is pretty easily tested and verified.)
But how could you possibly implement this ABI spec? j_kubik effectively argues that it's impossible because you'd have to preserve some link-time information in the .so files. Which points to the obvious answer: preserve some link-time information in the .so files. In fact, you already have to do that to handle, e.g., load-time relocations; this just extends what you need to preserve. And in fact, both Apple and GNU/linux/g++/ELF do exactly that. (This is part of the reason everyone building complex linux systems had to learn about symbol visibility and vague linkage a few years ago.)
There's an even more obvious way to solve the problem: Write a C++-based link loader, instead of trying to make the C++ compiler and linker work together to trick a C-based link loader. But as far as I know, nobody's tried that since Be.

Requirements for type-descriptor:
Works correctly in multi compilation-unit and shared-library environment;
Works correctly for different versions of shared libraries;
Works correctly although different compilation units don't share any information about type, except it's name: usually one header is used for all compilation units to define same type, but it's not required; even if, it doesn't affect resulting object file.
Work correctly despite fact that template instantiations must be fully-defined (so including type_info data) in every library that uses them, and yet behave like one type if several such libs are used together.
The fourth rule essentially bans all non-name based type-descriptors like UUIDs (unless specifically mentioned in type definition, but that is just name-replacement at best, and probably requires standard-alterations).
Stroing thuse UUIDs in separate files like suggeste .LIB files also causes trouble: different library versions implementing new types would cause trouble.
Compilation units should be able to share the same type (and its type_info) without the need to involve linker - because it should stay free of any language-specifics.
So type-name can be only unique type descriptor without completely re-modeling compilation and linking (also dynamic). I could imagine it working, but not under current scheme.

Related

How to you add platform specific function definitions in LLVM pass?

I was trying to add a function declaration for something provided by the system.
However, the function prototype returns size_t, which is int32 on 32bit platform and int64 on 64bit platform.
I'd like to know if there is a method to detect target platform and add the declaration accordingly?
After a bit of research, LLVM IR as a target neutral language cannot possibly know target-specific type sizes. Have a look at this relevant discussion where Chris Lattner comments on the subject. Also, at this relevant SO question.
So, this is the job of the front-end and this causes extra bookkeeping information that front-ends need to "know" for a target and its ABI. So, for example, you might have needs for projects like this in the case of the Loci programming language.
Now, specifically for size_t according to this:
[...] std::size_t can safely store the value of any non-member pointer, in
which case it is synonymous with std::uintptr_t.
So, you could use the getIntPtrType method of DataLayout class.
For any other data types, I'm not sure how far "guessing" can get you (probably not very far judging from the previous references).
Lastly, another alternative could be extending LLVM with a custom intrinsic (see memcpy for example), which inevitably goes through specific definition per target.
For actually adapting your integer type creation you could use the sizeof operator along with the use of CHAR_BIT, in order to provide the correct number of bits in the getIntNType call.
This will get you as far as using the right size for integer type on the platform where your module pass is built on.
For detecting a type's size 'dynamically' on the platform where your pass is being run, I know of none other way than providing that info in some sort of configuration file.
However, this can be automated and using the example of various build systems (e.g. cmake which is also used by LLVM), you can craft a simple program that can be compiled and automate that generation.
To that end, and to make this as portable as possible and avoid reinventing the wheel, you can use cmake's CheckTypeSize module.

Are variable identifiers totally needless, at the end of the day?

I've taken a good time studying TOC and Compiler design, not done yet but I feel comfortable with the conceptions. On the other hand I have a very shallow knowledge of assembly and machine code, and I have always the desire/need to connect the two sides( HLL and LLL representation of the code ), as I'm learning C++ with paying great attention to performance and optimization discussions.
C++ is a statically typed language:
My question is: Our variables when written as expressions in the statements of the code, do all these variables ( and other entities with identifiers ) become at runtime, mere instructions of addressing to positions of the virtual memory ( for static and for globals ) and addressing relevant to stack address for local variables?
I mean, after a successful compilation including semantic and syntactic verification, isn't wise to deal with data at runtime as guaranteed entities of target memory bytes without any thinking of any identifier or any checking, with the symbol table no more needed?
If my question appeared to be the type of questions that are due to lacking of learning effort ( which I hope it doesn't ), please just inform me about that, and tell me where to read. If that was the case, then it's honestly because I'm concentrating on C++ nowadays and haven't got the chance yet to have a sound knowledge of low level languages, I apologize for that in advance.
You're spot on. Once compiled to machine code, there is no longer any notion of a variable identifier (or variable type, for that matter). It's just bytes at a certain location. Which location was determined by the compiler (when compiling) based on the variable name, or by the linker (when linking) in the case of global variables.
Of course, it can be useful to retain information such as identifiers, for debugging purposes. This is precisely what "compilation with debug information" means: when you do that, the compiler will somehow embed the (redundant) identifiers into the generated code such that a debugger can access them. Or put them in a separate file alongside; the details of that depend on the format of the debugging information.
Yes, mostly. There are a few details that will make identifiers remain more than just addresses or stack offsets.
First we have in RTTI in C++ which means that during runtime the name of at least types may still be available. For example:
const std::type_info &info = typeid(*ptr_interface);
std::cout << info.name() << std::endl;
would print the name of whatever type *ptr_interface is of.
Second, due to the way a program is linked the symbols from the object files may still be present in the executing image. You have for example the linux kernel making use of this as it can produce a backtrace of the stack including the function names. Also it uses knowledge of function names in order to be able to load and link modules. Similar functionality exists in Gnu C library, than when linked for it is able to retrieve function names in stack traces.
In normal cases though the code will not be affected by the original names of the variables (but the compiler will of course emit code suitable for the type the variable have).

Is there a portable wrapper for C++ type_info that standardizes type name string format?

The format of the output of type_info::name() is implementation specific.
namespace N { struct A; }
const N::A *a;
typeid(a).name(); // returns e.g. "const struct N::A" but compiler-specific
Has anyone written a wrapper that returns dependable, predictable type information that is the same across compilers. Multiple templated functions would allow user to get specific information about a type. So I might be able to use:
MyTypeInfo::name(a); // returns "const struct N::A *"
MyTypeInfo::base(a); // returns "A"
MyTypeInfo::pointer(a); // returns "*"
MyTypeInfo::nameSpace(a); // returns "N"
MyTypeInfo::cv(a); // returns "const"
These functions are just examples, someone with better knowledge of the C++ type system could probably design a better API. The one I'm interested in in base(). All functions would raise an exception if RTTI was disabled or an unsupported compiler was detected.
This seems like the sort of thing that Boost might implement, but I can't find it in there anywhere. Is there a portable library that does this?
There are some limitations to do such things in C++, so you probably won't find exactly what you want in the near future. The meta-information about the types that the compiler inserts in the compiled code is also implementation-specific to the RTL used by the compiler, so it'd be difficult for a third-party library to do a good job without relying to undocumented features of each specific compiler that might break in later versions.
The Qt framework has, to my knowledge, the nearest thing to what you intended. But they do that completely independent from RTTI. Instead, they have their own "compiler" that parses the source code and generates additional source modules with the meta-information. Then, you compile+link these modules along with your program and use their API to get the information. Take a look at http://doc.qt.nokia.com/latest/metaobjects.html
Jeremy Pack (from Boost Extension plugin framework) appears to have written such a thing:
http://blog.redshoelace.com/2009/06/resource-management-across-dll.html
3. RTTI does not always function as expected across DLL boundaries. Check out the type_info classes to see how I deal with that.
So you could have a look there.
PS. I remembered because I once fixed a bug in that area; this might still add information so here's the link: https://stackoverflow.com/a/5838527/85371
GCC has __cxa_demangle https://gcc.gnu.org/onlinedocs/libstdc++/manual/ext_demangling.html
If there are such extensions for all compilers you target, you could use them to write a portable function with macros to detect the compiler.

C++ class function in assembly

Hello Community
I am look at C++ assembly, I have compiled a benchmark from the PARSEC suite and I am having difficulty knowing how do they name the class attribute functions in assembly language. for example if I have a class with some functions to manipulate it, in cpp we call them like test.increment();
After some investigation I found out that this function is
atomic_load_acq_ptr
represented as:
_ZL19atomic_load_acq_intPVj
in assembly, or at least this is what I have found out.
Let me know if I am wrong!
Is there some fixed rule for the mapping? or are they random?
Thanks
It's called name mangling, is necessary because of overloads and templates and such (i.e. the plain chars-and-numbers name isn't enough to identify a chunk of code unambiguously; embedding spaces or <> or :: in names usually isn't legal; copying the additional information in uncondensed, human-readable form would be wasteful), and it therefore depends on types, arity, etc.
The exact scheme can vary, but usually each compiler is self-consistent for a relatively long time (sometimes even several compilers can settle for one way).
That's called name mangling.. It is compiler dependant. No standard way, sorry :)
C++ allows function overloading, this means that one can have two functions with the same name but different parameters. Since your binary formats do not understand type this is a proble. The way that this is worked around is to use a scheme called name mangling. This adds a whole function of type information to the name used in the source file and ensures one calls the correct overload.
The extra letters etc that are added are governed by the particular Application Binary Interface (ABI) being used. Different compilers (and sometimes even different versions) may use different ABIs.
Yes there's a standard method for creating these symbols known as name mangling.

C++ static global non-POD: theory and practice

I was reading the Qt coding conventions docs and came upon the following paragraph:
Anything that has a constructor or needs to run code to be initialized cannot be used as global object in library code, since it is undefined when that constructor/code will be run (on first usage, on library load, before main() or not at all). Even if the execution time of the initializer is defined for shared libraries, you’ll get into trouble when moving that code in a plugin or if the library is compiled statically.
I know what the theory says, but I don't understand the "not at all" part. Sometimes I use non-POD global const statics (e.g: QString) and it never occured to me that they might not be initialized... Is this specific to shared objects / DLLs? Does this happen for broken compilers only?
What do you think about this rule?
The "not at all" part simply says that the C++ standard is silent about this issue. It doesn't know about shared libraries and thus doesn't says anything about the interaction of certain C++ features with these.
In practice, I have seen global non-POD static globals used on Windows, OSX, and many versions of Linux and other Unices, both in GUI and command line programs, as plugins and as standalone applications. At least one project (which used non-POD static globals) had versions for the full set of all combinations of these. The only problem I have ever seen was that some very old GCC version generated code that called the dtors of such objects in dynamic libraries when the executable stopped, not when the library was unloaded. Of course, that was fatal (the library code was called when the library was already gone), but that has been almost a decade ago.
But of course, this still doesn't guarantee anything.
If the static object is defined in an object that does not get referenced, the linker can prune the object completely, including the static initializer code. It will do so regularly for libs (that's how the libc does not get completely linked in when using parts of it under gnu, e.g.).
Interestingly, I don't think this is specific to libraries. It can probably happen for objects even in the main build.
I see no problem with having global objects with constructors.
They should just not have any dependency on other global objects in their constructor (or destructor).
But if they do have dependencies then the dependent object must either by in the same compilation unit or be lazily evaluated so you can force it to be evaluated before you use it.
The code in the constructor should also not be dependent on when (this is related to dependencies but not quite the same) it is executed, But you are safe to assume that it will get constructed at the very least (just before a method is called) and C++ guarantees the destruction order is the reverse of instantiation.
Its not so difficult to stick to these rules.
I don't think static objects constructors can be elided. There is probably a confusion with the fact that a static library is often just a bunch of objects which are token in the executable if they are referenced. Some static objects are designed so that they aren't referenced outside their containing object, and so the object file is put in the executable only if there is another dependency on them. This is not the case in some patterns (using a static object which register itself for instance).
C++ doesn't define the order that static initializers execute for objects in different compilations units (the ordering is well defined within a compilation unit).
Consider the situation where you have 2 static objects A and B defined in different compilation units. Let's say that object B actually uses object A in it's initialization.
In this scenario it's possible that B will be initialized first and make a call against an uninitialized A object. This might be one thing that is meant by "not at all" - an object is being used when it hasn't had an opportunity to initialize it self first (even if it might be initialized later).
I suppose that dynamic linking might add complexities that I haven't thought of that cuase an object to never be initialized. Either way, the bottom line is that static initializatino introduces enough potential issues that it should be avoided where possible and very carefully handled where you have to use it.