std::thread weak when using -static-libstdc++, thus causing crash at runtime - c++

I need to build a portable shared object that serves as a plugin for another piece of software on Linux. After some reading on the subject, I concluded that I should build a sysrooted gcc (gcc 5.4.0, if it matters) against a reasonably old glibc (for compatibility with older systems) and link with -static-libstdc++ and -static-libgcc, so that the result depends only on the host's glibc and a few other minor libraries that will always be present.
I did all that, and now I am experiencing a weird crash: a segmentation fault happens where the code constructs a std::thread, and gdb shows that the stack frame is inside libstdc++.so.6 (where it shouldn't be; ldd of my shared object does not list libstdc++.so either). The top of the stack at the crash is:
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff79075e3 in std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 # THIS SHOULD NOT BE HERE RIGHT?
#2 0x00007ffff5a25a5c in std::thread::thread<void (ReferenceAnalytics::*)(std::timed_mutex&), ReferenceAnalytics*&, std::reference_wrapper<std::timed_mutex> >
(this=0x7fffffffcf40, __f=
@0x7fffffffcf60: (void (ReferenceAnalytics::*)(ReferenceAnalytics * const, std::timed_mutex &)) 0x7ffff5a1750c <ReferenceAnalytics::WorkerThreadMethod(std::timed_mutex&)>)
at /home/developer/Toolchains/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/5.4.0/thread:137 # Looks like my toolchain
So I did some reading, and then discovered with nm that my shared object has all the std::thread members (ctor, dtor, swap, ...) defined as weak symbols. I assume this causes a collision if the host that loads the plugin uses the dynamic libstdc++: my calls are then routed there and all hell breaks loose. Is this right?
Further googling and reading did not tell me how to control this, i.e. how to force the std::thread symbols to resolve to the static libstdc++ from my sysrooted gcc.
Moreover, I made a small executable that just calls dlopen on my shared object and then calls a method which internally constructs the thread. If the executable is also built with -static-libstdc++, all is well; if not, the crash happens. So I assume my theory about the weak std::thread symbols being resolved to the host's libstdc++ is correct, but how do I solve this?

If you statically link a DSO against libstdc++ without hiding the libstdc++ symbols, and the main program is linked against libstdc++ as well, then the symbol definitions in the main program will interpose/preempt the definitions in the DSO when it is opened with dlopen.
However, because the main program is not linked against libpthread, the system libstdc++ DSO in the process image saw that the libpthread symbols were unavailable (null) and thus disabled thread support. Your DSO needs thread support, but cannot get it from the system libstdc++.
As an immediate workaround, you can hide all the statically linked libstdc++ symbols in the DSO. Then no interposition will take place, and your DSO will actually use its own copy of libstdc++ instead of the system libstdc++, which has already established that there should not be any thread support in the process.
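The "symbols were unavailable (null)" check can be illustrated with a weak reference. This is only a sketch of the general mechanism, not the code libstdc++ itself uses: the choice of pthread_create as the probed symbol and the helper names thread_support_linked and answer_from_worker are assumptions made for illustration.

```cpp
#include <pthread.h>
#include <thread>

// Request a weak reference: if no loaded object defines pthread_create,
// its address resolves to null instead of producing a link error.
#pragma weak pthread_create

// Hypothetical helper: true if a pthread_create definition was bound.
bool thread_support_linked() {
    auto probe = &pthread_create;
    return probe != nullptr;
}

// Hypothetical helper: only start a thread when the probe succeeded.
// Returns -1 when thread support looks unavailable.
int answer_from_worker() {
    if (!thread_support_linked()) {
        return -1;
    }
    int value = 0;
    std::thread worker([&value] { value = 42; });
    worker.join();
    return value;
}
```

A library that calls through such a reference without checking it ends up jumping to address 0, which matches frame #0 of the backtrace in the question. On newer glibc releases, where the pthread functions live in libc itself, the probe is expected to succeed in every process.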
But this will likely not solve all of your problems because late loading of libpthread via dlopen has its problems. We fixed one bug here:
Segfault after a binary without pthread dlopen()s a library linked with pthread
But your distribution may not have that fix, and I expect there will be other issues, one of them being: The second, statically linked copy of libstdc++ is actually needed here because the system libstdc++ has been loaded without thread support (because libpthread was not loaded when its symbols were bound, causing the crash you observed), so you cannot use it for creating threads. It also has activated optimizations which make the library not thread safe (avoid atomic instructions and things like that).
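A minimal sketch of the symbol-hiding workaround follows, assuming GNU ld and GCC. The file names plugin.cpp and plugin.map, the macro PLUGIN_EXPORT and the function plugin_banner_length are hypothetical; the build line in the comment is one possible setup, not a verified recipe for the asker's toolchain.

```cpp
// plugin.cpp (hypothetical): export exactly one entry point and keep
// everything else, including the statically linked libstdc++, local.
//
// Possible build line (an assumption, adjust to your toolchain):
//   g++ -shared -fPIC plugin.cpp -o plugin.so \
//       -static-libstdc++ -static-libgcc -Wl,--version-script=plugin.map
//
// plugin.map (GNU ld version script):
//   { global: plugin_banner_length; local: *; };
#include <string>

#define PLUGIN_EXPORT extern "C" __attribute__((visibility("default")))

namespace {
// Internal helper; uses libstdc++ (std::string) inside the plugin.
std::string make_banner() { return "plugin ready"; }
}  // namespace

// The only symbol the host is meant to look up with dlsym.
PLUGIN_EXPORT int plugin_banner_length() {
    return static_cast<int>(make_banner().size());
}
```

The version script is used here because "local: *" also covers the weak template instantiations that the compiler emits into the plugin's own object files, which is where the weak std::thread symbols seen with nm come from. Running nm -D on the resulting plugin.so should then list only the exported entry point among the defined symbols.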

Related

"undefined reference to __dso_handle" while linking static library with -nostdlib [duplicate]

I have an unresolved symbol error when trying to compile my program which complains that it cannot find __dso_handle. Which library is this function usually defined in?
Does the following result from nm on libstdc++.so.6 mean it contains that?
I tried to link against it but the error still occurs.
nm libstdc++.so.6 | grep dso
00000000002fc480 d __dso_handle
__dso_handle is a "guard" that is used to identify dynamic shared objects during global destruction.
Realistically, you should stop reading here. If you're trying to defeat object identification by messing with __dso_handle, something is likely very wrong.
However, since you asked where it is defined: the answer is complex. To surface the location of its definition (for GCC), use iostream in a C++ file, and, after that, do extern int __dso_handle;. That should surface the location of the declaration due to a type conflict (see this forum thread for a source).
Sometimes, it is defined manually.
Sometimes, it is defined/supplied by the "runtime" installed by the compiler (in practice, the CRT is usually just a bunch of binary header/entry-point-management code, and some exit guards/handlers). In GCC (not sure if other compilers support this; if so, it'll be in their sources):
Main definition
Testing __dso_handle replacement/tracker example 1
Testing __dso_handle replacement/tracker example 2
Often, it is defined in the stdlib:
Android
BSD
Further reading:
Subtle bugs caused by __dso_handle being unreachable in some compilers
I ran into this problem. Here are the conditions which seem to reliably generate the trouble:
g++ linking without the C/C++ standard library: -nostdlib (typical small embedded scenario).
Defining a statically allocated standard library object; specific to my case is std::vector. Previously this was std::array statically allocated without any problems. Apparently not all std:: statically allocated objects will cause the problem.
Note that I am not using a shared library of any type.
GCC/ARM cross compiler is in use.
If this is your use case then merely add the command line option to your compile/link command line: -fno-use-cxa-atexit
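A likely explanation for why std::vector triggers the error while std::array did not is that only objects with a non-trivial destructor need an exit-time registration, and that registration is what references __dso_handle. The sketch below checks this property at compile time; the helper name needs_exit_registration is made up for illustration.

```cpp
#include <array>
#include <type_traits>
#include <vector>

// Hypothetical helper: a static object of type T needs its destructor
// registered for program exit only if that destructor is non-trivial.
template <typename T>
constexpr bool needs_exit_registration() {
    return !std::is_trivially_destructible<T>::value;
}

// std::array of a trivial type has nothing to destroy at exit.
static_assert(!needs_exit_registration<std::array<int, 16>>(),
              "std::array<int, N> is trivially destructible");

// std::vector owns heap memory, so its destructor must run at exit.
static_assert(needs_exit_registration<std::vector<int>>(),
              "std::vector<int> has a non-trivial destructor");
```

With -fno-use-cxa-atexit the compiler registers such destructors through atexit instead, which is why the reference to __dso_handle disappears.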
Here is a very good link to the __dso_handle usage as 'handle to dynamic shared object'.
There appears to be a typo in the page, but I have no idea who to contact to confirm:
After you have called the objects' constructor destructors GCC automatically calls the function ...
I think this should read "Once all destructors have been called GCC calls the function" ...
One way to confirm this would be to implement the __cxa_atexit function as mentioned and then single step the program and see where it gets called. I'll try that one of these days, but not right now.
Adding to @natersoz's answer:
For me, using -Wabi-tag -D_GLIBCXX_USE_CXX11_ABI=0 alongside -fno-use-cxa-atexit helped compile an old lib. A telltale is if the C++ functions in the error message have std::__cxx11 in them, due to an ABI change.

C++ exceptions and the .eh_frame ELF section

Could a missing or damaged .eh_frame ELF section be the reason exceptions in my C++ code stopped working? Every exception that was previously caught successfully now ends in a call to std::terminate().
My situation:
My zzz.so shared library has try-catch blocks:
try {
    throw Exc();
} catch (const Exc &e) {
    LOG("ok " << e.what());
} catch (...) {
    LOG("all");
}
An executable loads zzz.so (using dlopen) and calls a function in zzz.so.
All the exceptions thrown in zzz.so are successfully caught inside zzz.so and dumped into my log file.
There is another library, aaa.so, that is loaded into a different binary. That aaa.so loads my zzz.so.
In that setup, all the same exceptions thrown in zzz.so lead to a call to std::terminate().
How is that possible?
update
I still don't know how that is possible, but Clang 3.3 (FreeBSD clang version 3.3 (tags/RELEASE_33/final 183502) 20130610) solved the problem.
How is that possible?
When an exception is thrown, control passes to __cxa_throw routine (usually in libstdc++.so), which is then responsible for finding the catch clause and calling destructors along the way, or calling std::terminate if no catch is found.
The answer then is most likely that the first executable (the one where exceptions work) uses libstdc++.so that is capable of decoding .eh_frame in your library, while the second application (the one where exceptions do not work), either uses an older (incompatible) version of libstdc++.so, or links against libstdc++.a, or something along these lines.
Note: the actual work of raising the exception is done by _Unwind_RaiseException in libgcc_s.so.1, so even if both applications use the same libstdc++.so, they may still use different libgcc.
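To see which path the runtime takes, a terminate handler can log the moment the search for a catch clause fails. This is a debugging sketch only; Exc mirrors the type name from the question, while install_terminate_logger and throw_and_catch are hypothetical helpers.

```cpp
#include <cstdio>
#include <cstdlib>
#include <exception>
#include <string>

// Stand-in for the exception type used in the question.
struct Exc : std::exception {
    const char* what() const noexcept override { return "Exc"; }
};

// Log before aborting, so the failing case is visible in the output.
void install_terminate_logger() {
    std::set_terminate([] {
        std::fputs("std::terminate: no matching handler was found\n", stderr);
        std::abort();
    });
}

// Same shape as the try/catch block in the question.
std::string throw_and_catch() {
    try {
        throw Exc();
    } catch (const Exc& e) {
        return std::string("ok ") + e.what();
    } catch (...) {
        return "all";
    }
}
```

When unwinding works, throw_and_catch returns "ok Exc". If the handler fires instead, a breakpoint on it in a debugger shows which copy of __cxa_throw and _Unwind_RaiseException was actually called.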
Update:
Will I benefit from static linking libstdc++ and libgcc into my .so library?
Maybe. TL;DR: it's complicated.
There are a few things to consider:
On any platform other than i386, you would have to build your own copy of libstdc++.a and libgcc.a with -fPIC before you can link them into your zzz.so. Normally these libraries are built without -fPIC, and can't be statically linked into any .so.
Static linking of libstdc++.a into your zzz.so may make it a derived work, and subject to GPL (consult your lawyer).
Even when there is a _Unwind_RaiseException exported from zzz.so, normally there will already be another instance of _Unwind_RaiseException defined in (loaded earlier) libgcc_s.so, and that earlier instance is the one that will be called, rendering your workaround ineffective. To make sure that your copy of _Unwind_RaiseException is called, you would need to link zzz.so with -Bsymbolic, or with a special linker script to make all calls to _Unwind_RaiseException (and everything else from libgcc.a) internal.
Your workaround may fix the problem for zzz.so, but may cause a problem for unrelated yyy.so that is loaded even later, and that wants the system-provided _Unwind_RaiseException, not the one from zzz.so. This is another argument for hiding all libgcc.a symbols and making them internal to zzz.so.
So the short answer is: such workaround is somewhat likely to cause you a lot of pain.

MFXInit() in libmfx.a segfaults when called from shared object

(While Intel's forum is a more natural place to ask this question I'm posting it here hoping for more activity than Intel's total lack thereof -- so far)
I'm unable to create a dynamic link library that uses Intel Media SDK (linux server) to manipulate h264 video and noticed a problem in the design of the MFX library. The way I understand it, programs are supposed to link to static library, like:
$ g++ .... -L/opt/intel/mediasdk/lib/lin_x64 -lmfx
However, this libmfx.a library appears to delegate all calls to a dlopened dynamic library /opt/intel/mediasdk/lib64/libmfxhw64.so. It is worth noting that function names (and signatures) exposed by static and dynamic libraries are identical, which is kind of confusing and dangerous.
While I don't understand the rationale behind this design, it should not be a problem by itself were it not that apparently some static/global initialization from within the library causes havoc when the (static) libmfx.a is included in a shared object, i.e.:
+------+     +-----------+
| main | <-- | mylib.so  |
+------+     |           |          +---------------+
             | libmfx.a  | (dlopen) | libmfxhw64.so |
             |           <-------------             |
             |+---------+|          |+-------------+|
             ||MFXInit()||          || MFXInit()   ||
             ||...      ||          || ...         ||
             ||         ||          ||             ||
             +===========+          +===============+
The above library could be assembled like this:
$ g++ -shared -o mylib.so my1.o my2.o -lmfx
And then (dynamically) linked to main.o like so:
$ g++ -o main main.o mylib.so -ldl
(Note that the additional libdl is necessary to allow libmfx.a to dlopen() libmfxhw64.so.)
Unfortunately, upon the first MFXInit() call, the program causes a segmentation fault (accessing address 0x0000400). GDB backtrace:
#0 0x0000000000000400 in ?? ()
#1 0x00007ffff61fb4cd in MFXInit () from /opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.13
#2 0x00007ffff7bd3a1f in MFX_DISP_HANDLE::LoadSelectedDLL(char const*, eMfxImplType, int, int) () from ./lib-a.so
#3 0x00007ffff7bd12b1 in MFXInit () from ./lib-a.so
#4 0x00007ffff7bd09c8 in test_mfx () at lib.c:12
#5 0x0000000000400744 in main (argc=1, argv=0x7fffffffe0d8) at main.c:8
(Observe that MFXInit() at stackframe #3 is the one in libmfx.a whereas the one at #1 is in libmfxhw64.so.)
Note that there is no crash when mylib is created as a static library. Using breakpoints and the disassembler, I captured the following backtrace for the static build. In both cases frame #1 is at MFXInit+424, but the two builds appear to hit different versions of MFXQueryVersion (absolute addresses are meaningless due to relocation):
#0 0x00007ffff6411980 in MFXQueryVersion () from /opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.13
#1 0x00007ffff640c4cd in MFXInit () from /opt/intel/mediasdk/lib64/libmfxhw64-p.so.1.13
#2 0x000000000040484f in MFX_DISP_HANDLE::LoadSelectedDLL(char const*, eMfxImplType, int, int) ()
#3 0x00000000004020e1 in MFXInit ()
#4 0x0000000000401800 in test_mfx () at lib.c:12
#5 0x0000000000401794 in main (argc=1, argv=0x7fffffffe0e8) at main.c:8
Because both static and shared Intel libs expose the same API functions, I can link straight into libmfxhw64.so guts directly, but I suppose that bypassing the static "dispatcher" is without warranty(?)
Could someone explain Intel's idea behind this design? Specifically, why provide a static library that only delegates to an .so with an identical interface?
Also, it appears that the SEGV is caused by static/global data in either libmfx.a or libmfxhw64.so. Is there a way to force a specific execution order on dynamically loaded static/global sections? What is the best approach to debug these kinds of problems?
Tested with Intel Media SDK R2 (Ubuntu 12) and Intel Media SDK 2015R3-R5 (CentOS 7, 1.13/1.15) on an Intel Haswell i7-4790 @ 3.6GHz
If you have a working Intel MSDK setup, please compile my example code to confirm the issue.
At the very end of the file "readme-dispatcher-linux.pdf" in recent releases of the dispatcher source code, there is this:
There is slight difference between using Dispatcher library from
executable module or from shared object. To mitigate symbol conflict
between itself and SDK shared object on Linux*, application should:
1) link against libdispatch_shared.a instead of libmfx.a
2) define MFX_DISPATCHER_EXPOSED_PREFIX before any SDK includes
I have used this, and it works to address the symbol conflict issue you describe.
You can find this file, if you install "Intel Media Server Studio Professional 2016". There is a free community edition. The source files and the PDF will be found at /opt/intel/mediasdk/opensource/
(OK, since no one seems eager, I'll do the inelegant thing and post an answer to my own question).
After considerable research trying to break the unintentional circular linking, I discovered that the ld option --exclude-libs provides solace. Essentially, I was looking for a way to force removal of any libmfx.a symbols after using them to resolve dependencies in lib.o while creating the DLL. This could be accomplished by creating the so like this:
g++ -shared -o lib-a.so lib.o -L/opt/intel/mediasdk/lib/lin_x64 -lmfx -Wl,--exclude-libs=libmfx
Once the library is created like this, Bob's your uncle:
g++ -o main-so-a main.o lib-a.so -ldl
(Note that libdl is still needed because Intel's MFX (now inside lib-a.so) still uses dlopen to discover libmfxhw64.so)
From the ld man page:
--exclude-libs lib,lib,...
Specifies a list of archive libraries from which symbols should not be
automatically exported. The library names may be delimited by commas or
colons. Specifying "--exclude-libs ALL" excludes symbols in all archive
libraries from automatic export. This option is available only for the
i386 PE targeted port of the linker and for ELF targeted ports. For i386
PE, symbols explicitly listed in a .def file are still exported,
regardless of this option. For ELF targeted ports, symbols affected
by this option will be treated as hidden.
So, essentially, the trick is to make sure that the relevant ELF symbols are marked hidden. Normally this would be handled through #pragmas by the library developers (i.e. Intel), but due to their negligence it needs to be retrofitted in this case.
I suppose the same could have been accomplished with a --version-script map file, but that might have turned out to be more fragile since we want to fully encapsulate libmfx.a anyway.
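For comparison, this is roughly what the pragma-based approach looks like when a library author applies it in their own sources. It is a generic sketch using GCC's visibility pragma, not code from the Intel SDK; the names detail::checksum and mylib_checksum are hypothetical.

```cpp
#include <cstddef>

// Everything declared between push and pop gets hidden visibility, so it
// is not exported from the shared object and cannot clash with a symbol
// of the same name in another library.
#pragma GCC visibility push(hidden)
namespace detail {
std::size_t checksum(const unsigned char* data, std::size_t n) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}
}  // namespace detail
#pragma GCC visibility pop

// The public entry point is explicitly exported.
extern "C" __attribute__((visibility("default")))
std::size_t mylib_checksum(const unsigned char* data, std::size_t n) {
    return detail::checksum(data, n);
}
```

Had the dispatcher's functions been hidden in this way, the identically named MFXInit in libmfxhw64.so would not have been able to resolve back into the copy inside the shared object.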

Strange symbol lookup error in libstdc++

Trying to track down a segfault somewhere in MPI, I got this error:
./mpitest: symbol lookup error: /usr/lib64/libstdc++.so.6: bàþ;# BC_
-------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 8729 on ...
First, I'm used to getting lookup errors when the process is loaded if the library path is wrong. But those all happen before the process starts executing. This happened in the middle of the output from the test. Shouldn't all symbols be resolved by the runtime loader before the process starts?
Second, that symbol looks like garbage. It's certainly not a normal mangled C++ symbol.
Is it possible for memory corruptions (since I am tracking a segfault, it's likely there's something like that going on) to corrupt symbols like this?
This was compiled with icpc 12.0.3 20110309 on a Linux 2.6.18-194.32.1.el5 x86_64 machine.
OpenMPI loads plugins as dynamic shared objects at runtime when MPI_INIT is called. See this FAQ. Therefore symbol lookup happens at that time. So it looks to me as if your OpenMPI's libmpi_cxx.so was built against a different libstdc++ than the one found on the system at runtime.
You can either rebuild OpenMPI or, if the correct libstdc++ is somewhere on your system (not /usr/lib64/libstdc++.so.6), adjust your LD_LIBRARY_PATH. Also, try setting LD_DEBUG=files to see whether you are in fact loading two different copies of libstdc++.
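The same check can be made from inside the process. The sketch below assumes a Linux/glibc system, where dl_iterate_phdr from <link.h> walks the objects currently loaded; the helper name loaded_objects_matching is made up for illustration.

```cpp
#include <cstddef>
#include <link.h>
#include <string>
#include <vector>

namespace {
struct Query {
    std::string fragment;           // substring to look for in each path
    std::vector<std::string> hits;  // matching object paths
};

// Called once per loaded object; returning 0 continues the iteration.
int collect(struct dl_phdr_info* info, std::size_t, void* data) {
    Query* query = static_cast<Query*>(data);
    std::string name = info->dlpi_name ? info->dlpi_name : "";
    if (name.find(query->fragment) != std::string::npos) {
        query->hits.push_back(name);
    }
    return 0;
}
}  // namespace

// Hypothetical helper: paths of loaded objects containing `fragment`.
std::vector<std::string> loaded_objects_matching(const std::string& fragment) {
    Query query{fragment, {}};
    dl_iterate_phdr(collect, &query);
    return query.hits;
}
```

Calling loaded_objects_matching("libstdc++") after MPI_INIT and printing the result would show whether more than one libstdc++ path is mapped into the process.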

Cannot Debug Shared Library - Symbols Not Loading Properly

I am currently writing a small library, and I want to check it for leaks (among other things); however, for some reason, gdb is not loading the library symbols. I have read many other posts on here (and various other places on the internet) about this, however, I cannot seem to find a solution. Here is what is going on:
I am compiling the shared library with the following flags (these are included in both the final shared library as well as all object files):
CFLAGS=-Wall -O0 -g -fPIC
Likewise, I am compiling the binary memtest (the client application for the library) to check for memory leaks and such with these flags
CFLAGS=-Wall -O0 -g
Now, I inserted a NULL pointer into the library to test if I could trace through it and "debug" the pointer (i.e. it's making it crash). So I try to run it through gdb, but it's a no go. The output of info sharedlibrary is the same for both the executable and the core:
(gdb) info sharedlibrary
From To Syms Read Shared Object Library
... Some libraries I am not worried about debugging...
0x00d37340 0x00d423a4 Yes (*) /home/raged/MyLIB/memtest/../lib/libMyLIB.so.0 <--- My lib
.... and some more....
(*): Shared library is missing debugging information.
As you can see, it's not loading the debug information. I am uncertain as to why this is. I have built and linked everything with the -g flag, and I even try -ggdb and -g3 but nothing seems to work properly. When I load in a core dump, this is what I see:
...some libs...
Reading symbols from /home/raged/MyLIB/memtest/../lib/libMyLIB.so.0...done.
Loaded symbols for /home/raged/MyLIB/memtest/../lib/libMyLIB.so.0
Reading symbols from /usr/lib/libstdc++.so.6...(no debugging symbols found)...done.
...some more libs...
Notice how my library does not give a (no debugging symbols found) error - anyone have any ideas why? As I said before, I am unable to debug this through running the program gdb ./memtest or through debugging the core file.
Thanks for your help.
EDIT It may also be important to note, that (if you didn't realize by path) this library is a local shared library (i.e. I'm using -Wl,-rpath to link/load it)
EDIT2 It seems my version of GDB was out-of-date. Now, I have updated to the latest version from the CVS server (I have also tried latest release version 7.2) and it can "load" symbols. My info sharedlibrary now reads this:
0x00e418b0 0x00e4be74 Yes /home/raged/MyLIB/memtest/../lib/libMyLIB.so.0
However, I am still unable to step through any functions (in the shared library) - anyone have any ideas?
EDIT3 I have also tried to step through linking against a static library (libMyLIB.a) but it still isn't working. My OS is CentOS 5.6; does anyone know of any issues with this system? Also, just another confirmation that my symbols are being loaded (it just can't step through any shared lib function for some reason)
(gdb) sharedlibrary MyLIB
Symbols already loaded for /home/raged/MyLIB/memtest/../lib/libMyLIB.so.0
I found the reason this wasn't working: I was calling an old function call to initialize a pointer in my test executable. Since the object was never being created, I could never step into the library. Once I updated the function call, all worked well.
That said, if anyone experiences similar issues while all symbols appear to be loaded, be sure to check that all pointers are initialized properly even if they have the correct type.