How symbols are resolved when linking in dynamic libraries - c++

Something I still haven't quite absorbed. Nearly all of my development has involved generating and compiling my code together statically, without linking in dynamic libraries (.dll or .so).
I understand in this case how the compiler and linker can resolve the symbols as it is stepping through the code.
However, when linking in a dynamic library for example, and the dynamic library has been compiled with a different compiler than the main code, will the symbol names not be different?
For example, I define a structure called RequiredData in my main code as shown below. Imagine that the FuncFromDynamicLib() function is part of a separate dynamic library and that it takes this same structure RequiredData as an argument. Does that mean that somewhere in the dynamic library this structure must be defined again? How can it be established at dynamic-link time that these structures are the same? What if the data members are different? Thanks in advance.
#include <iostream>
#include <string>
using namespace std;

struct RequiredData
{
    string name;
    int value1;
    int value2;
};

int FuncFromDynamicLib(RequiredData);

int main()
{
    RequiredData data;
    data.name = "test";
    data.value1 = 120;
    data.value2 = 200;
    FuncFromDynamicLib(data);
    return 0;
}

//------------Imagine that this func is part of a dynamic library in another file--------------------
int FuncFromDynamicLib(RequiredData rd)
{
    cout << rd.name << endl;
    cout << rd.value1 << endl;
    cout << rd.value2 << endl;
    return 0;
}
//------------------------------------------------------

You are completely right that you cannot in general mix your tools. The process of "linking" is not standardized or specified, so in general you have to use the same toolchain for the entire process, or one part won't know what names the other part gave the symbols in the object code.
The problem is in fact worse: it's not just symbol names that must match, but also calling conventions: how to propagate function arguments, return values, exceptions, and how to handle RTTI and dynamic casts.
All these factors are summarized under the term "application binary interface", ABI for short. And the problem you describe can be summarized as "there is no standard ABI".
Several factors make the situation less dire:
For C code, every major platform has a de facto standard ABI which is followed by most tools. Therefore, C code is in practice highly portable and interoperable, and a C-compatible boundary is a common way to sidestep the problem (see the sketch after this list).
For C++ code, the ABI is much more complex than for C. However, on x86 and x86-64, two major architectures, the Itanium C++ ABI is followed by a wide variety of platforms, which creates at least a somewhat interoperable environment. In particular, that ABI's rules for name mangling (i.e. how to represent namespace-qualified names and function overload sets as flat strings) are well known and have widespread tooling support.
For C++, there are further peculiar concerns. Shipping compiled code to customers is one thing, but another issue is that much of C++ library code lives in headers and gets compiled each time. So the problem is not just that you may be using different compilers, but also different library implementations. Internal details of class layout become relevant, because you can't just access vendor A's vector as if it were vendor B's vector. GCC has dubbed this "library ABI".
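One practical consequence of that C interoperability for the question's example: if the boundary is reduced to plain C types and an unmangled name, the two sides no longer need to agree on a C++ ABI. A minimal sketch (RequiredDataC is a made-up name; only RequiredData's members and FuncFromDynamicLib come from the question):

// Hypothetical C-compatible version of the question's boundary.
// The struct contains only plain C types, so its layout follows the
// platform's de facto C ABI instead of any particular C++ library,
// and extern "C" suppresses C++ name mangling for the function.
extern "C" {

    struct RequiredDataC
    {
        const char* name;   // plain pointer instead of std::string
        int value1;
        int value2;
    };

    int FuncFromDynamicLib(struct RequiredDataC data);

} // extern "C"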


Different data type lengths in third party libraries

The C and C++ standards do not specify the exact sizes of some data types, only their minimums.
I have a third party library: someLib.lib (compiled for my platform) and its corresponding someLib.h. Let's say it contains the following functions:
int getNumber();
void setNumber(int number);
When I compile a program consuming this library, the compiler checks the types against the signatures declared in someLib.h, so as long as I use ints, everything should compile fine.
But what happens when my compiler's int is longer or shorter than the int of the compiler that was used to compile someLib.lib? Will it be detected during linking? Will it cause runtime errors? Can I safely use someLib.lib without knowing how it was compiled?
You should not get compiler or linker errors, only undefined behavior at run-time. Possibly crashes, or if you're lucky just weird results.
Using a library that has narrow assumptions about the underlying system or the compiler can cause problems.
So if the library you're using assumes that int is 16 bits but you're using it on a 32-bit system, you'll have problems at run time.
Well-implemented libraries use #if preprocessor checks to minimize these issues, or they provide separate .lib files for different systems. They may even explicitly use fixed-width intX_t types (e.g. int32_t) to be more portable.
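For illustration, a sketch of what a more portable someLib.h could look like along those lines (this is not the actual library header, just a guess at its shape):

/* Hypothetical, more portable version of someLib.h: fixed-width types
   make the interface independent of what "int" happens to mean for any
   particular compiler. */
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

int32_t getNumber(void);
void setNumber(int32_t number);

#ifdef __cplusplus
}
#endif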

Function with bool return value, only set 1 byte of the entire register

I have the following piece of code, which is part of an API (cdecl). In MSVC++ sizeof(bool) is 1 byte, but since bool is implementation defined, programs built with another compiler (or whose author incorrectly defines the function signature) may treat bool as larger than 1 byte, and calling the check below may then return true on their side.
virtual bool isValid()
{
    return false;
    // ^ code above in asm: xor al, al
}
To avoid this, I put an inline asm, xor eax, eax, before the return - but it feels a bit hacky and it of course will not work on x64, due to the lack of inline-assembler support there.
Using #define bool int would work, but it is not what I want, as I have structs that contain bool members and redefining it would corrupt their layout.
Is there anything like an intrinsic that can zero the eax/rax register, or anything else that can solve this problem?
There's nothing that will do what you're asking for. Your problem needs a much different solution.
First, any code that "incorrectly defines the function signature" is broken and needs to be fixed. Working around it in other code is never the solution.
Next, your problem is likely about more than just bool being implementation defined; the C++ standard leaves a whole host of things implementation defined. So much so that two different C++ compilers rarely have compatible ABIs. If your code provides C++ interfaces for use by code compiled by other people, you'll probably need to produce separately compiled binaries, whether in the form of object files, static libraries, DLLs or executables, for each compiler you want to support. In fact you may need to provide separate binaries for each version of each compiler.
There are two C++ compilers that try to be compatible with the Microsoft C++ ABI: Intel's C++ compiler and the Windows port of clang. The clang implementation is notably still a work in progress. You may also need to create separate versions for each version of the Microsoft C/C++ runtime libraries your code is compiled with.
You can potentially reduce the number of different binaries you need to distribute by providing a pure C interface to your code. A pure C interface means using only C data types and only functions declared extern "C". While things like classes, member functions, templates, RTTI and exceptions can be used in your implementation, they can't be part of your public interface. One exception is COM-like interfaces: classes with nothing but public pure virtual functions. Since C compilers for Windows all use essentially the same C ABI and support COM interfaces, compatibility issues are less likely to arise. However, the bool type (actually the _Bool type in C) is probably not safe to use, since it's a relatively recent addition to the C language. Use int in your C interfaces instead.
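As a concrete illustration of that advice, a minimal sketch (Widget and IsValidObject are made-up names, not part of the original API): the bool-returning virtual is exposed through a plain C function that returns int, so both sides agree on the size of the result.

class Widget
{
public:
    virtual ~Widget() {}
    virtual bool isValid() { return false; }
};

// extern "C" keeps the symbol name and calling convention C-compatible,
// and int has a well-defined width in the platform's C ABI, so callers
// built with other compilers read the whole value correctly.
extern "C" int IsValidObject(Widget* w)
{
    return w->isValid() ? 1 : 0;
}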
Note that because of C/C++ runtime differences, even if all you want to do is distribute compiled binaries for use with Microsoft's Visual C++ compiler, you may still need to distribute versions for each version of the compiler. That's because each compiler version comes with a different runtime implementation, and those runtimes have data structures with incompatible internal layouts. You can't pass an STL container created in a function compiled by one version of Visual C++ to a function compiled with a different version. You can't allocate memory with malloc in an executable and free it in a DLL if the executable and the DLL use different versions of the C runtime.
Unfortunately, unless you're willing to restrict your users to one particular compiler, the easy solution you're looking for may not exist. Note that restricting users to a single compiler is a common approach in programs that provide plugin support: plugins need to be compiled with the same version of the same compiler that compiled the executable.

Why doesn't g++ generate "raw" symbols?

From C we know what legal variable names are. The general regex for the legal names looks similar to [\w_](\w\d_)*.
Using dlsym we can load arbitrary strings, and C++ mangles names that include # in the ABI.
My question is: can arbitrary strings be used? The documentation on dlsym does not seem to mention anything.
Another question that came up appears to imply that it is entirely possible to have arbitrary null-terminated symbols. This prompts me to ask the following question:
Why doesn't g++ emit raw function signatures, with name and parameter list, including namespace and class membership?
Here's what I mean:
namespace test {
    class A
    {
        int myFunction(const int a);
    };
}

namespace test {
    int A::myFunction(const int a) { return a * 2; }
}
Does not get compiled to
int ::test::A::myFunction(const int a)\0
Instead, it gets compiled to - on my 64 bit machine, using g++ 4.9.2 -
0000000000000000 T _ZN4test1A10myFunctionEi
This output is read by nm. The code was compiled using g++ -c test.cpp -o out
I'm sure this decision was made pragmatically, to avoid having to make any changes to pre-existing C linkers (the scheme quite possibly even originated with cfront). By emitting symbols using the same character set the C linker already handles, you avoid any number of updates and can use the linker off the shelf.
Additionally C and C++ are widely portable languages and they wouldn't want to risk breaking a more obscure binary format (perhaps on an embedded system) by including unexpected symbols.
Finally, since you can always demangle (with something like c++filt, for example), it probably didn't seem worth using a full text representation.
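For what it's worth, demangling is also available programmatically. A small sketch using the GCC-specific helper abi::__cxa_demangle from <cxxabi.h>, applied to the mangled name from the nm output above:

#include <cxxabi.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so we must free it.
    char* demangled = abi::__cxa_demangle("_ZN4test1A10myFunctionEi",
                                          nullptr, nullptr, &status);
    if (status == 0 && demangled)
        std::printf("%s\n", demangled);   // prints: test::A::myFunction(int)
    std::free(demangled);
    return 0;
}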
P.S. You would absolutely not want to include the parameter name in the function name: People will not be happy if renaming a parameter breaks ABI. It's hard enough to keep ABI compatibility already.
GCC is compliant with the Itanium C++ ABI. If your question is “Why does the Itanium C++ ABI require names to be mangled that way?” then the answer is probably
because its designers thought this would be a good idea, and
shorter symbols make for smaller object files and faster dynamic linking.
For the second point, there is a pretty good explanation in Ulrich Drepper's article How To Write Shared Libraries.
Because of the limitations that linkers (including the OS's dynamic linker) impose on exported names: character set and length. The very phenomenon of mangling arose because of these limitations.
Corollary: in media where these limitations don't exist (various VMs that use their own linkers: e.g. .NET, Java), mangling doesn't exist, either.
Each compiler that produces exports incompatible with others must use a different scheme, because the linker (static or dynamic) doesn't care about ABIs; all it cares about is identifiers.
You basically answered your own question:
The general regex for the legal names looks similar to [\w_](\w\d_)*.
From the beginning, C++ used preexisting (C) linker / loader technology. There is nothing "C++" about either ld, ld-linux.so etc.
So linking is limited to what was already legal in C. That does not include colons, parentheses, ampersands, asterisks, and whatever else you would need to encode C++ identifiers in plain text.
(In this answer I ignore that you made several typos in your example of ::test::A::void myFunction(const int a)).
This format is:
not programmer-specific; consider that all these are the same, so why confuse people:
int ::test::A::myFunction(const int)
int ::test::A::myFunction(int const)
int test::A::myFunction(int const)
int test :: A :: myFunction (int const)
and so on…
unambiguous
terse; no parameter names or other unnecessary decorations
easier to parse (notice that the length of each component is present as a number)
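To make that length-prefix point concrete, here is how the symbol from the nm output above decomposes under the Itanium mangling scheme (annotated by hand):

// _ZN4test1A10myFunctionEi
// _Z             -- prefix marking a mangled C++ name
// N ... E        -- a nested (namespace/class-qualified) name follows
//   4test        -- component of length 4: "test" (the namespace)
//   1A           -- component of length 1: "A" (the class)
//   10myFunction -- component of length 10: "myFunction"
// i              -- the parameter list: a single int
// (the return type of an ordinary function is not encoded)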
Meanwhile, I see no benefit at all in choosing a human-readable looks-like-C++ format for a C++ ABI. This stuff is supposed to be optimised for machines. Why would you make it less optimal for machines, in order to make it more optimal for humans? And probably failing at the latter whilst doing so.
You say that your compiler does not emit "raw symbols". I posit that it does precisely that.

Mixing C++ flavours in the same project

Is it safe to mix C++98 and C++11 in the same project? By "mixing" I mean not only linking object files but also common header files included in the source code compiled with C++98 and C++11.
The background for the question is the desire to transition at least a part of a large code base to C++11. A part of the code is in C++ CUDA, compiled to be executed on either GPU or CPU, and the corresponding compiler doesn't support C++11 at this time. However, much of the code is intended for CPU only and can be compiled with either C++ flavour. Some header files are included in both CPU+GPU and CPU-only source files.
If we now compile CPU-only source files with C++11 compiler, can we be confident against undesirable side effects?
In practice, maybe.
It is relatively common for the standard library of C++11 and C++03 to disagree about what the layout of std namespace objects is. As an example, sizeof(std::vector<int>) changed noticeably over various compiler versions in MSVC land. (it got smaller as they optimized it)
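One cheap tripwire for that kind of layout drift is to have each module report the sizes its standard library sees for shared types and compare them at start-up. A rough sketch (all names are made up, and equal sizes of course do not prove the layouts actually agree):

// Shared header, included on both sides (all names invented for this sketch).
#include <cstddef>
#include <string>
#include <vector>

// A macro is expanded separately at each use site, so each module reports
// what *its* standard library thinks the layouts are.
#define LAYOUT_FINGERPRINT (sizeof(std::string) * 1000u + sizeof(std::vector<int>))

// In the DLL (built as C++03):
extern "C" std::size_t dll_layout_fingerprint() { return LAYOUT_FINGERPRINT; }

// In the EXE (built as C++11), at start-up:
//   if (dll_layout_fingerprint() != LAYOUT_FINGERPRINT)
//       refuse to run: the two standard libraries disagree about layout.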
Other examples could be a different heap on each side of the compiler fence.
So you have to carefully "firewall" between the two source trees.
Now, some compilers seek to minimize such binary compatibility changes, even at the cost of violating the standard. I believe std::list without a size counter might be an example of that (which violates C++11, but I recall that at least one vendor provided a standards-non-compliant std::list to maintain binary compatibility -- I don't remember which one).
For the two compilers (and a compiler in C++03 mode and the same compiler in C++11 mode are effectively two different compilers) you are going to have some ABI guarantees. There is probably a large chunk of the language on which the ABIs will agree, and on that set you are relatively safe.
To be reasonably safe, you'll want to treat the other compiler version's files as if they were third-party DLLs (dynamic-link libraries) that do not link against the same C++ standard library. That means any resources passed from one to the other have to be packaged with destruction code (i.e., returned to the DLL from whence they came to be destroyed). You'll either have to investigate the ABI of the two standard libraries, or avoid using it in the common header files, so you can pass things like smart pointers between the DLLs.
An even safer approach is to strip yourself down to a C style interface with the other code base, and only pass handles (opaque types) between the two code bases. To make this sane, whip up some header-file only mojo that wraps the C style interface in pretty C++ code, just don't pass those C++ objects between the code bases.
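A rough sketch of that opaque-handle approach (all names here are hypothetical): the C boundary only ever sees a forward-declared handle plus create/destroy and accessor functions, while a header-only C++ wrapper restores a pleasant interface on the caller's side.

// ---- C-style boundary, exported from the DLL (hypothetical names) ----
extern "C" {
    typedef struct Engine Engine;       // opaque handle: layout never exposed

    Engine* Engine_Create(void);
    void    Engine_Destroy(Engine* e);  // destruction happens inside the DLL
    int     Engine_Run(Engine* e, int input);
}

// ---- Header-only C++ wrapper, lives entirely on the caller's side ----
class EngineHandle
{
public:
    EngineHandle() : e_(Engine_Create()) {}
    ~EngineHandle() { Engine_Destroy(e_); }
    int run(int input) { return Engine_Run(e_, input); }
private:
    EngineHandle(const EngineHandle&);            // non-copyable (pre-C++11 style)
    EngineHandle& operator=(const EngineHandle&);
    Engine* e_;
};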
All of this is a pain.
For example, suppose you have a std::string get_some_string(HANDLE) function, and you don't trust ABI stability.
So you have 3 layers.
namespace internal {
    // NOT exported from DLL
    std::string get_some_string(HANDLE) { /* implementation in DLL */ }
}

namespace marshal {
    // exported from DLL
    // visible in external headers, not intended to be called directly
    void get_some_string(HANDLE h, void* pdata,
                         void(*callback)(void*, char const* data, std::size_t length))
    {
        // implementation in DLL
        auto r = ::internal::get_some_string(h);
        callback(pdata, r.data(), r.size());
    }
}

namespace interface {
    // exists in only public header file, not within DLL
    inline std::string get_some_string(HANDLE h) {
        std::string r;
        ::marshal::get_some_string(h, &r,
            [](void* pr, const char* str, std::size_t length) {
                std::string& r = *static_cast<std::string*>(pr);
                r.append(str, length);
            }
        );
        return r;
    }
}
So the code outside the DLL does an auto s = ::interface::get_some_string(handle);, and it looks like a C++ interface.
The code inside the DLL implements std::string ::internal::get_some_string(HANDLE);.
The marshal get_some_string provides a C-style interface between the two, which gives better binary compatibility than relying on the layout and implementation of std::string remaining stable between the DLL and the code using the DLL.
The interface's std::string exists completely within the non-DLL code. The internal std::string exists completely within the DLL-code. The marshal code moves the data from one side to the other.

Is it possible to strip type names from executable while keeping RTTI enabled?

I recently disabled RTTI on my compiler (MSVC10) and the executable size decreased significantly. Comparing the produced executables in a text editor, I found that the RTTI-less version contains far fewer symbol names, which explains the saved space.
AFAIK, those symbol names are only used to fill the type_info structure associated with each polymorphic type, and one can access them programmatically by calling type_info::name().
According to the standard, the format of the string returned by type_info::name() is unspecified. That is, no one can rely on it to do serious things portably. So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
But... is it possible? I'm using MSVC10 and I've not found any option to do that. I can either disable RTTI completely (/GR-), or enable it with full type names (/GR). Does any compiler provide such an option?
So, it should be possible for an implementation to always return an empty string without breaking anything, thus reducing the executable size without disabling RTTI support (so we can still use the typeid operator & compare type_info's objects safely).
You are misreading the standard. The intent of making the return value of type_info::name() unspecified (other than being a null-terminated byte string) was to give the implementers of the compiler/library/run-time environment free rein to implement the RTTI requirements as they see best. You, the programmer, have no say in how the Application Binary Interface (if there is one) is designed or implemented.
You're asking three different questions here.
The initial question asks whether there's any way to get MSVC to not generate names, or whether it's possible with other compilers, or, failing that, whether there's any way to strip the names out of the generated type_info without breaking things.
Then you want to know whether it would be possible to modify the MS ABI (presumably not too radically) so that it would be possible to strip the names.
Finally, you want to know whether it would be possible to design an ABI that didn't have names.
Question #1 is itself a complex question. As far as I know, there's no way to get MSVC to not generate names. And most other compilers are aimed at ABIs that specifically define what typeid(foo).name() must return, so they also can't be made to not generate names.
The more interesting question is, what happens if you strip out the names. For MSVC, I don't know the answer. The best thing to do here is probably to try it—go into your DLLs and change the first character of each name to \0 and see if it breaks dynamic_cast, etc. (I know that you can do this with Mac and linux x86_64 executables generated by g++ 4.2 and it works, but let's put that aside for now.)
On to question #2, assuming blanking the names doesn't work, it wouldn't be that hard to modify a name-based system to no longer require names. One trivial solution is to use hashes of the names, or even ROT13-encoded names (remember that the original goal here is "I don't want casual users to see the embarrassing names of my classes"). But I'm not sure that would count for what you're looking for. A slightly more complex solution is as follows:
For "dllexport"ed classes, generate a UUID, put that in the typeinfo, and also put it in the .LIB import library that gets generated along with the DLL.
For "dllimport"ed classes, read the UUID out of the .LIB and use that instead.
So, if you manage to get the dllexport/dllimport right, it will work, because your exe will be using the same UUID as the dll. But what if you don't? What if you "accidentally" specify identical classes (e.g., an instantiation of the same template with the same parameters) in your DLL and your EXE, without marking one as dllexport and one as dllimport? RTTI won't see them as the same type.
Is this a problem? Well, the C++ standard doesn't say it is. And neither does any MS documentation. In fact, the documentation explicitly says that you're not allowed to do this. You cannot use the same class or function in two different modules unless you explicitly export it from one module and import it into another. The fact that this is very hard to do with class templates is a problem, and it's a problem they don't try to solve.
Let's take a realistic example: Create a node-based linkedlist class template with a global static sentinel, where every list's last node points to that sentinel, and the end() function just returns a pointer to it. (Microsoft's own implementation of std::map used to do exactly this; I'm not sure if that's still true.) New up a linkedlist<int> in your exe, and pass it by reference to a function in your dll that tries to iterate from l.begin() to l.end(). It will never finish, because none of the nodes created by the exe will point to the copy of the sentinel in the dll. Of course if you pass l.begin() and l.end() into the DLL, instead of passing l itself, you won't have this problem. You can usually get away with passing a std::string or various other types by reference, just because they don't depend on anything that breaks. But you're not actually allowed to do so, you're just getting lucky. So, while replacing the names with UUIDs that have to be looked up at link time means types can't be matched up at link-loader time, the fact that types already can't be matched up at link-loader time means this is irrelevant.
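A minimal sketch of the kind of list described above (all names made up for this example), to make the failure mode concrete:

// Hypothetical node-based list whose end() is a global static sentinel.
// Each module (EXE, DLL) that instantiates linkedlist<int> gets its own
// copy of 'sentinel' unless the link-loader unifies them.
template <typename T>
struct node {
    T value;
    node* next;
};

template <typename T>
class linkedlist {
public:
    linkedlist() : head_(&sentinel) {}
    void push_front(const T& v) { head_ = new node<T>{ v, head_ }; }
    node<T>* begin() const { return head_; }
    node<T>* end() const { return &sentinel; }   // this module's sentinel
private:
    static node<T> sentinel;
    node<T>* head_;
};

template <typename T>
node<T> linkedlist<T>::sentinel = {};

// Iterating in the DLL over a list built in the EXE:
//   for (node<int>* p = l.begin(); p != l.end(); p = p->next) { ... }
// walks right past the EXE's sentinel (and eventually crashes or loops),
// because the EXE's nodes point at the EXE's copy of 'sentinel' while the
// DLL's l.end() returns the address of the DLL's copy.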
It would be possible to build a name-based system that didn't have these problems. The ARM C++ ABI (and the iOS and Android ABIs based on it) restricts what programmers can get away with much less than MS, and has very specific requirements on how the link-loader has to make it work (3.2.5). This one couldn't be modified to not be name-based because it was an explicit choice in the design that:
• type_info::operator== and type_info::operator!= compare the strings returned by type_info::name(), not just the pointers to the RTTI objects and their names.
• No reliance is placed on the address returned by type_info::name(). (That is, t1.name() != t2.name() does not imply that t1 != t2).
The first condition effectively requires that these operators (and type_info::before()) must be called out of line, and that the execution environment must provide appropriate implementations of them.
But it's also possible to build an ABI that doesn't have this problem and that doesn't use names. Which segues nicely to #3.
The Itanium ABI (used by, among other things, both OS X and recent linux on x86_64 and i386) does guarantee that a linkedlist<int> generated in one object and a linkedlist<int> generated from the same header in another object can be linked together at runtime and will be the same type, which means they must have equal type_info objects. From 2.9.1:
It is intended that two type_info pointers point to equivalent type descriptions if and only if the pointers are equal. An implementation must satisfy this constraint, e.g. by using symbol preemption, COMDAT sections, or other mechanisms.
The compiler, linker, and link-loader must work together to make sure that a linkedlist<int> created in your executable points to the exact same type_info object that a linkedlist<int> created in your shared object would.
So, if you just took out all the names, it wouldn't make any difference at all. (And this is pretty easily tested and verified.)
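That test can be done in a few lines. A sketch assuming a g++/Linux-style toolchain (the file names and build commands are just examples):

// lib.cpp  -- built as: g++ -fPIC -shared lib.cpp -o libprobe.so
#include <typeinfo>
#include <vector>
extern "C" const std::type_info* lib_typeinfo()
{
    return &typeid(std::vector<int>);
}

// main.cpp -- built as: g++ main.cpp ./libprobe.so -o probe
#include <cstdio>
#include <typeinfo>
#include <vector>
extern "C" const std::type_info* lib_typeinfo();
int main()
{
    const std::type_info* mine = &typeid(std::vector<int>);
    // Under the Itanium ABI the pointers themselves are intended to compare
    // equal, because the link-loader unifies the type_info objects.
    std::printf("same object: %d, equal: %d\n",
                mine == lib_typeinfo(), *mine == *lib_typeinfo());
    return 0;
}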
But how could you possibly implement this ABI spec? j_kubik effectively argues that it's impossible because you'd have to preserve some link-time information in the .so files. Which points to the obvious answer: preserve some link-time information in the .so files. In fact, you already have to do that to handle, e.g., load-time relocations; this just extends what you need to preserve. And in fact, both Apple and GNU/linux/g++/ELF do exactly that. (This is part of the reason everyone building complex linux systems had to learn about symbol visibility and vague linkage a few years ago.)
There's an even more obvious way to solve the problem: Write a C++-based link loader, instead of trying to make the C++ compiler and linker work together to trick a C-based link loader. But as far as I know, nobody's tried that since Be.
Requirements for a type descriptor:
Works correctly in a multi-compilation-unit and shared-library environment;
Works correctly for different versions of shared libraries;
Works correctly although different compilation units don't share any information about the type except its name: usually one header is used by all compilation units to define the same type, but that is not required; and even when it is, it doesn't affect the resulting object file.
Works correctly despite the fact that template instantiations must be fully defined (including their type_info data) in every library that uses them, and yet behave like one type if several such libraries are used together.
The fourth rule essentially bans all non-name-based type descriptors like UUIDs (unless a UUID is specifically mentioned in the type definition, but that is just name replacement at best, and would probably require altering the standard).
Storing those UUIDs in separate files, like the suggested .LIB files, also causes trouble: different library versions introducing new types would create problems.
Compilation units should be able to share the same type (and its type_info) without involving the linker, because the linker should stay free of any language specifics.
So the type name is the only workable unique type descriptor without completely re-modeling compilation and linking (including dynamic linking). I could imagine that working, but not under the current scheme.