loading multiple similar shared libraries on linux - c++

I am working on code that creates 'models'. A model is created from an XML file, and part of its representation is C code generated on the fly. This C code is compiled, on the fly, into a shared library that is dynamically loaded (using the POCO SharedLibrary class). The shared library mainly contains small functions, and part of a model's creation is to populate function pointers to these functions. All this works fine. However, creating several models at the same time causes problems.
I believe it has to do with how dynamic loading works on Linux, and the fact that each shared library contains functions with identical names. Does PIC cause this? The problem manifests itself as nonsense data being retrieved from the shared libraries' functions.
So the question is: how do I load multiple (thousands of) shared libraries, containing identical function names, on Linux?
The above works fine on Windows, where it seems that dynamically loaded libraries' data and functions are kept perfectly isolated from each other.

Poco's SharedLibrary load function accepts flags. The default is Poco::SharedLibrary::SHLIB_GLOBAL, corresponding to dlopen's RTLD_GLOBAL; Poco::SharedLibrary::SHLIB_LOCAL corresponds to RTLD_LOCAL. See http://linux.die.net/man/3/dlopen for more information.
Passing the Poco::SharedLibrary::SHLIB_LOCAL flag fixed the problem.
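For illustration, a minimal sketch of loading one generated model library with local symbol binding; the library path and the symbol name eval are made-up placeholders:

    #include <Poco/SharedLibrary.h>

    int main()
    {
        // SHLIB_LOCAL keeps this library's symbols out of the global
        // scope, so identically named functions in other model
        // libraries cannot interfere with each other.
        Poco::SharedLibrary lib;
        lib.load("/tmp/model_42.so", Poco::SharedLibrary::SHLIB_LOCAL);

        // Resolve one of the generated functions and call it.
        typedef double (*EvalFn)(double);
        EvalFn eval = reinterpret_cast<EvalFn>(lib.getSymbol("eval"));
        double y = eval(1.0);

        lib.unload();
        return y > 0.0 ? 0 : 1;
    }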

First, you can dlopen many hundreds of thousands of shared objects. My manydl.c demonstrates that.
Then, you can also generate C code, compile it, and dlopen the resulting shared object, all from the same process. My (obsolete since 2017) MELT plugin for GCC does that (MELT provides a high-level language to extend GCC), and so does my manydl.c example.
However, I don't think you should keep identically named (defined) functions in them; I suggest avoiding that. You could:
- generate unique names (since the C code is generated, this is the best, most portable, and simplest solution);
- compile with some -D flags to #define those names to unique ones, so the source code can apparently contain duplicate names; that is, if your generated code defines a function foo, pass -Dfoo=foo_123 (with foo_123 being globally unique) to the gcc command compiling it (of course, you then dlsym for "foo_123");
- add visibility("hidden") function attributes in your generated code; you could also pass the -fvisibility=hidden option to gcc;
- have only static functions (then the names don't matter much, so they can be duplicates), and have a constructor function which somehow binds them, e.g. stores their pointers somewhere, such as in some global table (see the sketch after this list);
- consider passing RTLD_LOCAL to dlopen(3); I'm not sure it is a good idea (and POCO probably doesn't know how to do that).
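To make the static-functions-plus-constructor idea concrete, here is a minimal sketch of what a generated module could look like; register_model() is a hypothetical hook assumed to be exported by the host program:

    // Generated module: every function is static, so duplicate names
    // across libraries are harmless. A GCC constructor function runs
    // automatically when the library is dlopen'ed and registers the
    // function pointers with the host.
    extern "C" void register_model(const char* name, double (*fn)(double));

    static double eval(double x)
    {
        return 2.0 * x + 1.0;   // generated body
    }

    __attribute__((constructor))
    static void init_module()
    {
        register_model("eval", &eval);
    }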
PS. I don't think it is related to position-independent code (which is preferable, but not absolutely required, in shared objects; without -fPIC, dlopen would have to do a lot of inefficient relocations). It is related to how Linux links and loads shared objects. Read Levine's Linkers & Loaders book for the details.
See also this question, and read Drepper's paper on How to Write Shared Libraries.

Related

linux and shared libraries and different g++ compilers

What is the story regarding having a process on Linux which dlopen()s multiple shared libraries, where the executable and/or the shared libraries were compiled with different C++ compilers (e.g. provided by customers or third parties)?
Am I correct in the following assumptions:
there is only a single namespace for symbols in a Linux process. Symbols are found and resolved only by symbol name. The source of the symbol is random in the presence of an unknown executable (customer-supplied) or customer-supplied shared libraries.
there is no way to make certain that STL/boost symbols are resolved from the correct source, as they are always weak and thus might be overridden.
What are the implications of using multiple copies of (different) libc++ inside the same process (some of them static)?
I don't expect separate libraries to be able to talk to each other via a C++ interface, only via a C interface. What I would like is that one can load shared libraries from different vendors into a single process without them screwing each other up.
I know that this has worked on Windows for decades.
Your comment completely changes your question:
I don't expect it to be able to talk to each other via a C++ interface but only via a C interface. What I expect is, that one can load SharedLibraries from different vendors into a single process and they do not screw up each other. (This btw is working on Windows since decades)
This element of behaviour is largely system-independent. The Windows PE format and Linux ELF are similar enough in design that they don't add any additional constraints or capabilities on this topic. So if your technique was going to work on Windows, then it should also do so on Linux, just replacing .dll files with .so files.
Linux has more standardisation around calling conventions than Windows, so if anything you should find that Linux makes this simpler.
Original Answer
Question:
there is only a single namespace for symbols in a Linux process?
That's correct; there's no such thing as namespaces in Linux's loader.
As you may know, C and C++ are very different languages: C++ has namespaces, C does not. When libraries are loaded (on Linux, Unix, and Windows alike) there is no concept of a namespace.
C++ compilers use name mangling to ensure that names isolated by namespaces in your code do not collide when placed as symbols in the shared object. C compilers don't do this, and don't need to, because there are no namespaces.
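For illustration, under the Itanium C++ ABI used by g++ and clang (the names below are made up):

    // Two functions share the unqualified name f, but the namespace is
    // encoded into each symbol, so they cannot collide in the loader:
    namespace foo { void f(int) {} }   // symbol: _ZN3foo1fEi
    namespace bar { void f(int) {} }   // symbol: _ZN3bar1fEi

    // A C-linkage function is exported under its plain, unmangled name:
    extern "C" void g(int) {}          // symbol: g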
Question:
Symbols are found and resolved only by symbol name. The source of the symbol is random in the presence of an unknown executable (customer supplied) or customer supplied shared libraries.
Let's replace the word "random" with "unpredictable". That's also correct. From Wikipedia:
The C++ language does not define a standard decoration scheme, so each compiler uses its own. C++ also has complex language features, such as classes, templates, namespaces, and operator overloading, that alter the meaning of specific symbols based on context or usage. Meta-data about these features can be disambiguated by mangling (decorating) the name of a symbol. Because the name-mangling systems for such features are not standardized across compilers, few linkers can link object code that was produced by different compilers.
Question:
What is the story regarding having a process on LINUX, which dlopen() multiple shared libraries and the executable and/or the shared libraries compiled with different C++ compilers (e.g. provided by customers or 3rd parties).
You can of course dlopen() a shared object, but dlsym() would be tricky to use because of name mangling. You'd have to inspect the shared object manually to determine the precise symbol names.
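The usual workaround, sketched below under assumed names, is for the library to export a single extern "C" factory function whose unmangled name dlsym() can find predictably:

    #include <dlfcn.h>

    // Interface shared between host and vendor library (hypothetical).
    struct Widget {
        virtual ~Widget() {}
        virtual void run() = 0;
    };

    // Inside the vendor library: an unmangled entry point, so dlsym()
    // can look it up as plain "create_widget".
    extern "C" Widget* create_widget();

    // Host side: resolve the factory by its predictable C name.
    Widget* load_widget()
    {
        void* handle = dlopen("libvendor.so", RTLD_NOW | RTLD_LOCAL);
        if (!handle)
            return nullptr;
        auto create = reinterpret_cast<Widget* (*)()>(
            dlsym(handle, "create_widget"));
        return create ? create() : nullptr;
    }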
What are the implications of using multiple copies of (different) libc++ inside the same process (some of them static)?
If you got that far, then I'd be concerned about memory management first of all. libc++ is responsible for implementing new and delete and converting these into memory requests to the OS. If they behave anything like GNU's malloc() and free(), they might manage their own pool of memory. It would be hard to predict what would happen if you called delete on an object that was created by a different libc++.
It seems that parts of this randomness can be avoided by loading the shared libraries with the following flags passed to dlopen(): RTLD_LOCAL and RTLD_DEEPBIND.
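A minimal sketch of such a call; the library name is a placeholder, and note that RTLD_DEEPBIND is a glibc extension:

    #include <dlfcn.h>   // RTLD_DEEPBIND needs _GNU_SOURCE with gcc;
                         // g++ defines it by default

    // RTLD_LOCAL keeps the library's symbols out of the global scope;
    // RTLD_DEEPBIND makes the library prefer its own symbols over the
    // executable's when resolving its internal references.
    void* handle = dlopen("libvendor.so",
                          RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);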

Get available library functions at runtime

I am working with dynamically linked libraries (.dll) on Windows and shared objects (.so) on Linux.
My goal is to write some code that can, given the absolute path of a library on disk, return a list of all exported functions (the export table) of that library, and ultimately be able to call those functions. This should work on Windows (with .dll) as well as on Linux (with .so).
I am writing a kind of wrapper that delegates function calls to the respective library. I receive a path, a function name, and a list of parameters, which I then want to forward. The thing is: I want to know whether the given function exists before trying to call it.
From here I found a platform-independent way of opening and closing the library, as well as getting a pointer to the function with a given name.
So the thing that remains is getting the names of available functions in the first place.
On this topic, I found this question dealing with the same kind of problem, except that it asks for a Linux-specific solution. The given answer says:
There is no libc function to do that. However, you can write one yourself (or copy/paste the code from a tool like readelf).
This clearly indicates that there are tools that do what I am looking for. The only question is: is there one that works on Windows as well as on Linux? If not, how would I go about this on my own?
Here is a C# implementation (actually, this is the code I want to port to C++) doing what I want, though Windows-only. To me, it appears as if the library structure is handled manually there. If this is the way to go, where can I find the necessary information on the library structure?
So, on unixoids (and both Linux and WinNT have a POSIX subsystem), the dlopen function can be used to load a dynamic library and get function pointers to known symbols by name.
Getting a list of symbols, as far as I know, was never an aspect that POSIX bothered to specify, so long story short: the functions that can do that for you on Linux are specific to the libc used there (GNU libc, mostly), and on Windows to the libc used there. Portable code means having two different codebases for two different libcs!
If you don't want to depend on your libc, you'd have to have a binary object parser (for ELF shared libraries on Linux, PE on Windows) to read the symbol names out of the files. There are actually plenty of those; obviously, WINE has one for PE that is portable (in particular, it works on Linux as well), and every linker (including glibc's runtime linker) under Linux can parse ELF files.
Personally, I'd point to radare2: a fine reverse-engineering framework with plenty of language bindings that is actually meant to analyze binary files and give you exported symbols (as well as being capable of extracting non-exported functions, constructing call graphs, etc.). It has debugging, i.e. jumping-into-functions, capabilities too.
So, knowing now that
I am writing a kind of wrapper that delegates function calls to the respective library. I receive a path, a function name, and a list of parameters, which I then want to forward. The thing is: I want to know whether the given function exists before trying to call it.
things become way easier: you don't actually need to get the list of exports. It's easier and just as fast to simply try.
So, on any POSIX system (and that includes both Windows' POSIX subsystem and Linux), dlopen will open the library and load its symbol table, and dlsym will look up a symbol in that table. If the symbol is not in the table, it simply returns NULL. So you already have all the tables you think you need; just not explicitly, but queryable.
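A minimal sketch of that probe-before-call approach; the int(int) signature is an assumption for the example:

    #include <dlfcn.h>
    #include <cstdio>

    // Calls the named function if the library exports it; returns
    // false if the library or the symbol is missing.
    bool call_if_present(const char* path, const char* name, int arg)
    {
        void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
        if (!handle) {
            std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return false;
        }
        using Fn = int (*)(int);
        Fn fn = reinterpret_cast<Fn>(dlsym(handle, name));
        bool found = (fn != nullptr);   // NULL means: no such symbol
        if (found)
            fn(arg);
        dlclose(handle);
        return found;
    }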

C++: Include vs LoadLibrary()

I am having some trouble understanding why both #include and LoadLibrary() are needed in C++. In C++, #include makes the preprocessor replace the #include line with the contents of the file you are including (usually a header file containing declarations). As far as I understand, this enables me to use the routines I might want from the external libraries the headers belong to.
Why do I then need LoadLibrary()? Can't I just #include the library itself?
Just as a side note: in C#, which I am more familiar with, I just add a reference to a DLL if I want to use types or routines from that DLL in my program. I do not have to #include anything, as the .NET framework apparently searches all the referenced assemblies automatically for the routines I want to use (as specified by the namespace).
Thank you very much in advance.
Edit: Used the word "definitions", but meant "declarations". Now fixed.
Edit 2: Tough to pick one answer, many good replies. Thanks for all contributions.
C++ uses a full separate-compilation model; you can even compile against code which hasn't been written yet. (This often occurs in large projects.) When you include a file, all you are doing is telling the compiler that the functions, etc., exist. You do not provide an implementation (except for inline functions and templates). In order to execute the code, you have to provide the implementation, by linking it into your application. This can occur in several different ways:
- You have the source files; you compile them along with your own sources, and link in the resulting objects.
- You have a static library; you must link against it.
- You have a dynamic library. Here, what you must do depends on the implementation: under Windows, you must link against a .lib stub, and put the .dll somewhere where the runtime will find it when you execute. (Putting it in the same directory as your application is usually a good solution.)
I don't quite understand your need to call LoadLibrary. The only time I've needed it is when I've intentionally avoided using anything in the library directly, and want to load it conditionally, using GetProcAddress to get the addresses of the functions I need.
EDIT:
Since I was asked to clarify "linking": program translation (from the source to an executable) takes place in a number of steps. In traditional terms, each translation unit is "compiled" into an object file, which contains an image of the machine instructions, but with unfilled spaces for external references. For example, if you have:

    extern void function();

in your source (probably via inclusion of a header), and you call function, the compiler will leave the address field of the call instruction blank, since it doesn't know where the function will be located. Linking is the process of taking all of the object files and filling in these blanks. One of the object files will define function, and the linker will establish its actual address in the memory image, and fill in the blank referring to function with the address of function in that image. The result is a complete memory image of the executable. On the early systems I worked on, this was literal: the OS would simply copy the executable file directly into memory, and then jump into it. Things like virtual memory and shared, write-protected code segments make this a little more complicated today, but for statically linked libraries or object files (my first two cases above), the differences aren't that great.
Modern system technologies have blurred the lines somewhat. For example, most Java (and I think C#) compilers don't generate classical object files with machine code, but rather byte code, and the compile and link phases, above, don't take place until runtime. Some C++ compilers also only generate byte code, which will be compiled when the code is "linked". This is done to permit cross-module optimizations. And all modern systems support dynamic linking: some of the blank addresses are left blank until execution time. Dynamic linking can be implicit or explicit: when it is implicit, the link phase will insert information into the executable concerning the libraries it needs, and where to find them, and the OS will link them in, implicitly, either when the executable is loaded, or on demand, triggered by the code attempting to use one of the unfilled address slots. When it is explicit, you normally don't have any explicit reference to the name in your code. In the case of function, above, for example, you wouldn't have any code which directly called function. Your code would, however, load the dynamic library using LoadLibrary (or dlopen under Unix), then request the address of a name using GetProcAddress (or dlsym), and call the function indirectly through the pointer it received.
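As a concrete sketch of the explicit case on Windows (the DLL name and the exported function foo are hypothetical):

    #include <windows.h>

    // Explicit dynamic linking: no import stub is involved; the
    // function's address is resolved by hand at run time.
    int call_plugin(int arg)
    {
        HMODULE mod = LoadLibraryA("plugin.dll");
        if (!mod)
            return -1;                    // library not found

        typedef int (*FooFn)(int);
        FooFn foo = reinterpret_cast<FooFn>(
            GetProcAddress(mod, "foo"));  // look up the exported name
        int result = foo ? foo(arg) : -1;

        FreeLibrary(mod);
        return result;
    }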
The #include directive is, like all preprocessor functionality, merely a text replacement: the line "#include <somefile>" is replaced with the contents of that file.
Typically (but not necessarily) this is used to include a header file which declares the functions that you want to use, i.e. you tell the compiler (which runs after the preprocessor) what the functions you intend to use are named, what parameters they take, and what the return type is. It does not define what the function actually does.
You then also need an implementation of these functions, too. Usually, if you do not implement them in your program, you leave this task to the link stage. You give a list of libraries that your program depends on to the linker, and the linker divines via some implementation-defined way (such as an "import library") what it needs to do to "make it work". The linker will produce some glue code and write some information into the executable that will make the loader automatically load the required libraries. Everything "just works" without you having to do something special.
In some cases, however, you want to postpone the linker stage and do the loading "fully dynamically" by hand rather than automatically. This is when you have to call LoadLibrary() and GetProcAddress. The former brings the DLL into memory and does some setup (e.g. relocation), the latter gives you the address of a function that you want to call.
The #include in your code is still necessary so the compiler knows what to do with that pointer. Without the declarations, you would only have a raw address; the compiler would not know the function's signature, so it would not be possible to call the function in a meaningful way.
One reason why one would want to load a library manually (using LoadLibrary) is that it is more failsafe. If you link a program against a library and the library cannot be found (or a symbol cannot be found), then your application will not start up and the user will see a more or less obscure error message.
If LoadLibrary fails or GetProcAddress doesn't work, your program can in principle still run, albeit with reduced functionality.
Another example of using LoadLibrary might be to load an alternative version of a function from a different library (some programs implement "plugins" that way). The function "looks" the same to the compiler, as declared in the include file, but may behave differently, depending on what is in the loaded binary.
#include brings in source code only: symbol declarations for the compiler. A library (or a DLL) is object code: Use either LoadLibrary or link to a lib file to bring in object code.
LoadLibrary() causes the code module to be loaded from disk into your application's memory space for execution. This allows for dynamically loading code at runtime. You would not use LoadLibrary(), for example, if the code you want to use is compiled into a statically linked library. In that case you would provide the name of the .lib file that contains the code to the linker, and it gets resolved at link time: the code is linked into your .exe, and the .lib does not have to be distributed with the .exe in order for it to execute.
LoadLibrary() creates a dependency on an external DLL, which must be present on the path provided to the call in order for the .exe to execute properly. If LoadLibrary() fails, you must ensure your code handles it appropriately, by either exiting gracefully or providing some other execution alternative. (With implicit linking, by contrast, you provide an import .lib file to the linker, the same as you would for the static library above. That .lib file, however, does not contain code, just entry points for the actual code that resides in the .dll.)
In both cases you must #include the headers for the code you wish to execute. This is required by the compiler in order to build function call signatures properly, based on the type information provided by the header.
C# assemblies contain both type information and IL. A single reference is sufficient to satisfy the need for header information and binding to the code itself.
#include is static; the substitution is done at compile time. LoadLibrary() lets you load a DLL at runtime, for example based on user input.

Catching calls from a program in runtime and mapping them to other calls

A program usually depends on several libraries and might sometimes depend on other programs as well. I look at projects like Wine and wonder how they figure out what calls a program is making.
In a Linux environment, what approaches are used to find out, at runtime, what calls an executable is making, in order to catch them and map them to other calls?
Any code snippets or references to resources for extra reading are greatly appreciated :)
On Linux you're looking for the LD_PRELOAD environment variable. It causes your libraries to be loaded before any requested by the program. If you provide a function definition that matches one loaded by the target program, then your version will be called instead.
You can't really detect what functions a program is calling, though. What you can do is get all the functions in a shared library and implement all of those. You aren't really catching the functions; you are simply reimplementing them.
Projects like Wine do this in some cases, but not in all. They also rewrite some of the dynamic libraries, so when a Win32 program loads some DLL it is actually loading the Wine version and not the native version. This is essentially the same concept of replacing the functions with your own.
Look up LD_PRELOAD for more information.
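As a minimal sketch, here is an interposer for libc's puts; dlsym(RTLD_NEXT, ...) fetches the "real" implementation so the call can be forwarded (RTLD_NEXT is a GNU extension, visible by default with g++):

    // interpose.cpp
    // build: g++ -shared -fPIC interpose.cpp -o interpose.so -ldl
    // run:   LD_PRELOAD=./interpose.so ./target_program
    #include <dlfcn.h>
    #include <cstdio>

    // Our puts() shadows libc's. RTLD_NEXT resolves the next
    // occurrence of the symbol in search order, i.e. the real libc
    // version, so we can log the call and then forward it.
    extern "C" int puts(const char* s)
    {
        using PutsFn = int (*)(const char*);
        static PutsFn real_puts =
            reinterpret_cast<PutsFn>(dlsym(RTLD_NEXT, "puts"));
        std::fprintf(stderr, "[intercepted] puts(\"%s\")\n", s);
        return real_puts(s);
    }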

are runtime linking library globals shared among plugins loaded with dlopen?

I have a C++ program that links at runtime with, let's say, mylib.so. Then the same program uses dlopen()/dlsym() to load a function from myplugin.so, a dynamic library that in turn depends on mylib.so.
My question is: will the program and the function in the plugin access the same globals defined in mylib.so, in the same memory area reserved for the program, or will each be assigned a different, unrelated copy in its own memory space? If the latter is the default behaviour, is it possible to change that?
Thanks in advance =)!
Globals in the main program that does the dlopen should be visible to the code that is dynamically loaded. However, the best advice I've seen to date (especially if you ever want even vaguely portable code) is to pass only function calls across the linker divide, and not to export variables in either direction. It's also best if there is an API for the loaded code to register the interesting parts of its API with the loader (sketched below), e.g. "here is how I provide this SPI for drawing foobars on a baz", as that's a much saner way of doing callbacks than just mashing everything together.
[EDIT]: The other reason for doing this is if you're simulating weak linking on a platform that doesn't support it. That's a lot like the case above, except that it is the main program that builds the SPI out of the API exported by the dynamic library, rather than the .so exporting it explicitly on startup. It's second best really, but you make do with what you've got rather than wishing (well, unless you're prepared to do the work of writing some sort of connection library).
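To make the register-don't-share-globals advice concrete, a minimal sketch under hypothetical names: the host exports one registration hook, and the plugin pushes its function pointers through it after being dlopen'ed.

    // Shared header (hypothetical): the only thing that crosses the
    // linker divide is function calls, never variables.
    extern "C" void host_register(const char* name, void (*fn)());

    // plugin.cpp -- compiled into myplugin.so
    extern "C" void draw_foobar() { /* ... */ }

    extern "C" void plugin_init()   // host calls this after dlopen()
    {
        host_register("draw_foobar", &draw_foobar);
    }

    // Host side, after dlopen("myplugin.so", RTLD_NOW | RTLD_LOCAL):
    //     auto init = reinterpret_cast<void (*)()>(dlsym(handle, "plugin_init"));
    //     if (init) init();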