Strange symbol lookup error in libstdc++ - c++

Trying to track down a segfault somewhere in MPI, I got this error:
./mpitest: symbol lookup error: /usr/lib64/libstdc++.so.6: bàþ;# BC_
-------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 8729 on ...
First, I'm used to getting lookup errors when the process is loaded if the library path is wrong. But those all happen before the process starts executing. This happened in the middle of the output from the test. Shouldn't all symbols be resolved by the runtime loader before the process starts?
Second, that symbol looks like garbage. It's certainly not a normal mangled C++ symbol.
Is it possible for memory corruption (since I am tracking a segfault, something like that is likely going on) to corrupt symbols like this?
This was compiled with icpc 12.0.3 20110309 on a Linux 2.6.18-194.32.1.el5 x86_64 machine.

OpenMPI loads its plugins as dynamic shared objects at runtime when MPI_INIT is called. See this FAQ. Symbol lookup for those plugins therefore happens at that time, not at process startup. So it looks to me like your OpenMPI's libmpi_cxx.so was built against a different libstdc++ than the one found on the system at runtime.
You can either rebuild OpenMPI or, if the correct libstdc++ is somewhere else on your system (not /usr/lib64/libstdc++.so.6), adjust your LD_LIBRARY_PATH. Also, try setting LD_DEBUG=files to see whether you are in fact loading two different copies of libstdc++.
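If LD_DEBUG output is too noisy, the same check can be done from inside the process. The following is a minimal sketch, assuming Linux with glibc, that uses dl_iterate_phdr to list every loaded shared object whose path contains a given substring. The helper names (Query, collect, loaded_objects_matching) are made up for this illustration.

```cpp
#include <link.h>   // dl_iterate_phdr, dl_phdr_info (glibc)

#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Search state handed to the callback through the opaque `data` pointer.
struct Query {
    const char* needle;
    std::vector<std::string> hits;
};

// Called once per loaded object; records paths that contain the needle.
static int collect(dl_phdr_info* info, std::size_t, void* data) {
    assert(data != nullptr);
    auto* q = static_cast<Query*>(data);
    // dlpi_name is an empty string for the main executable.
    if (info->dlpi_name && std::strstr(info->dlpi_name, q->needle)) {
        q->hits.emplace_back(info->dlpi_name);
    }
    return 0;  // a non-zero return value would stop the iteration
}

// Returns the paths of all loaded shared objects whose name contains `needle`.
std::vector<std::string> loaded_objects_matching(const char* needle) {
    Query q{needle, {}};
    dl_iterate_phdr(collect, &q);
    return q.hits;
}
```

Calling loaded_objects_matching("libstdc++") after MPI_INIT has returned and printing the result should show at most one path. Two different paths would support the theory that two copies of libstdc++ are in the process.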

Related

std::thread weak when using -static-libstdc++, thus causing crash at runtime

I need to build a portable shared object, which is a plugin for another piece of software on Linux. I did some reading on the subject and came to the conclusion that I should build a sysrooted gcc (gcc 5.4.0, if it matters) with a decently old glibc (to provide compatibility with older systems) and link with -static-libstdc++ and -static-libgcc, thus arriving at a point where I have something that only depends on the host's glibc and some other minor stuff which will always be present.
I did all that, and now I am experiencing a weird crash: a segmentation fault happens in a place where the code calls std::thread, and gdb actually shows that the stack frame is inside libstdc++.so.6 (where it shouldn't be; ldd of my shared object also does not list libstdc++.so). The top of the stack at the crash is:
#0 0x0000000000000000 in ?? ()
#1 0x00007ffff79075e3 in std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 # THIS SHOULD NOT BE HERE RIGHT?
#2 0x00007ffff5a25a5c in std::thread::thread<void (ReferenceAnalytics::*)(std::timed_mutex&), ReferenceAnalytics*&, std::reference_wrapper<std::timed_mutex> >
(this=0x7fffffffcf40, __f=
#0x7fffffffcf60: (void (ReferenceAnalytics::*)(ReferenceAnalytics * const, std::timed_mutex &)) 0x7ffff5a1750c <ReferenceAnalytics::WorkerThreadMethod(std::timed_mutex&)>)
at /home/developer/Toolchains/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/5.4.0/thread:137 # Looks like my toolchain
So I did some reading, and then discovered using nm that my shared object has all the std::thread stuff (ctor, dtor, swap, ...) defined as weak symbols. I assume this causes a collision if the host that loads the plugin uses the dynamic libstdc++, so that my calls are routed there and all hell breaks loose. Is this right?
My further attempts at googling and reading did not tell me how to control this, i.e. how to force the std::thread stuff to be resolved to the static libstdc++ in my sysrooted gcc.
Moreover, I made a small executable that just does dlopen on my shared object and then calls a method which internally constructs the thread. If the executable is also built with -static-libstdc++, all is well; if not, the crash happens. So I assume my theory about the weak symbol for std::thread being resolved to the host's libstdc++ is correct, but how do I solve this?
If you statically link a DSO against libstdc++ without hiding the libstdc++ symbols, and the main program is linked against libstdc++ as well, then the symbol definitions in the main program will interpose/preempt the definitions in the DSO when it is opened with dlopen.
However, because the main program is not linked against libpthread, the system libstdc++ DSO in the process image saw that the libpthread symbols were unavailable (null) and thus disabled thread support. Your DSO needs thread support but can't get it from the system libstdc++.
As an immediate workaround, you can hide all the statically linked libstdc++ symbols in the DSO. Then no interposition will take place, and your DSO will actually use the libstdc++ copy in the DSO itself rather than the system copy, which has already established that there should not be any thread support in the process.
But this will likely not solve all of your problems because late loading of libpthread via dlopen has its problems. We fixed one bug here:
Segfault after a binary without pthread dlopen()s a library linked with pthread
But your distribution may not have that fix, and I expect there will be other issues. One of them is that the second, statically linked copy of libstdc++ is actually needed here: the system libstdc++ was loaded without thread support (libpthread was not loaded when its symbols were bound, which caused the crash you observed), so you cannot use it for creating threads. The system copy has also activated optimizations which make the library not thread safe (avoiding atomic instructions and things like that).
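The "symbols were unavailable (null)" mechanism can be illustrated with a weak reference. This is only a sketch of the general technique, not the actual libstdc++ code: hypothetical_optional_hook is an invented symbol that is deliberately never defined, and the weak attribute is a GCC/Clang extension.

```cpp
#include <cassert>

// Hypothetical optional hook, declared weak and never defined anywhere.
// An unresolved weak reference does not fail at link or load time;
// its address is simply null.
extern "C" void hypothetical_optional_hook() __attribute__((weak));

// True only if some object in the process actually provides the symbol.
bool hook_available() {
    return &hypothetical_optional_hook != nullptr;
}

// Invoke the hook; the caller must have checked hook_available() first.
void call_hook() {
    assert(hook_available() && "weak symbol is null; calling it would jump to address 0");
    hypothetical_optional_hook();
}

// Safe wrapper: disables the feature instead of crashing when the hook is absent.
bool call_hook_if_present() {
    if (!hook_available()) {
        return false;
    }
    call_hook();
    return true;
}
```

A library that tests such a reference once and caches the answer will keep the feature disabled even if a provider is loaded later. Calling through the null reference without the check would jump to address 0, which matches frame #0 of the backtrace in the question.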

OS X equivalent of --unresolved-symbols=ignore-in-object-files

On Linux (CentOS) I have occasionally used -Wl,--unresolved-symbols=ignore-in-object-files when building a test application that only depends on parts of some object files even though the full dependency would require a lot more object files to be included. The point is that I know by design any unresolved symbols are never encountered when running the test application (otherwise it should just crash).
On OS X, I found similar options -Wl,-undefined,suppress (or warning, dynamic_lookup),-flat_namespace which allowed me to build the binary, but it failed at run time complaining about dyld: Symbol not found: ... even though the missing symbols are never used during the run (the same application runs perfectly fine on CentOS).
Is there something else to force the application to run (till it crashes if ever an unresolved symbol is encountered) like on Linux?

Symbol lookup error at runtime instead of load time

I have an application which uses a class Foo from an .so shared library. I've come across a problem where at runtime it prints
<appname>: symbol lookup error: <appname>: undefined symbol: <mangled_Foo_symbol_name>
Now, it turned out that the unmangled symbol was for the constructor of the class Foo, and the problem was simply that an old version of the library was loaded, which didn't contain Foo yet.
My question isn't about resolving the error (that's obviously to use the correct library), but why it appears at runtime instead of at time of load / startup.
The line of code causing the error just instantiates an object of class Foo, so I'm not using anything like dlopen here, at least not explicitly / to my knowledge.
In contrast, if I remove the whole library from the load search path, I get this error at startup:
<appname>: error while loading shared libraries: libname.so.2: cannot open shared object file: No such file or directory
When the wrong version of gcc / libstdc++ is on the load path, an error also appears at startup:
<appname>: /path/to/gcc-4.8.0/lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by <appname>)
This "fail fast" behavior is much more desirable. I don't want to run my application for quite a while before I finally realize it's using the wrong library.
What causes the load error to appear at runtime and how can I make it appear immediately?
From the man page of ld.so:
ENVIRONMENT
LD_BIND_NOW (libc5; glibc since 2.1.1) If set to a nonempty string, causes the dynamic linker to resolve all symbols at program startup instead of deferring function call resolution to the point when they are first referenced. This is useful when using a debugger.
LD_WARN (ELF only)(glibc since 2.1.3) If set to a nonempty string, warn about unresolved symbols.
I don't think you can statically link a .so library. If you want to avoid load-time and runtime errors altogether, you have to use static libraries (.a) throughout. If you have neither a static version of the library nor its source, try to find a statifier. After some googling I found a few statifiers, but I don't know how they work, so I'm leaving that part up to you.
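Besides setting LD_BIND_NOW, an application can fail fast by checking for the symbols it needs as soon as it starts. The following is a minimal sketch, assuming glibc and a dynamically linked executable (older glibc versions also need -ldl at link time). The helper names symbol_available and count_missing are invented for this example, and C++ symbols would have to be passed in their mangled form.

```cpp
#include <dlfcn.h>   // dlsym, RTLD_DEFAULT (g++ defines _GNU_SOURCE by default)

#include <cassert>
#include <cstdio>
#include <initializer_list>

// True if `name` resolves in the global scope of the running process,
// i.e. the executable plus every shared library loaded so far.
bool symbol_available(const char* name) {
    assert(name != nullptr);
    return dlsym(RTLD_DEFAULT, name) != nullptr;
}

// Check all required symbols up front and report each missing one.
// Returns the number of missing symbols.
int count_missing(std::initializer_list<const char*> required) {
    int missing = 0;
    for (const char* name : required) {
        if (!symbol_available(name)) {
            std::fprintf(stderr, "missing symbol: %s\n", name);
            ++missing;
        }
    }
    return missing;
}
```

Calling count_missing at the top of main and exiting when it returns non-zero gives an error at startup rather than at the first call into the stale library.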

LD_BIND_NOW: Symbol lookup error but executable still running

I am trying to diagnose linker/runtime errors using setenv LD_BIND_NOW TRUE. When I run the executable with this option enabled, I get the error
lib/libmkl_intel_thread.so: error: symbol lookup error: undefined symbol: DftiFreeDescriptor (fatal)
However, if I then remove the LD_BIND_NOW environment variable, the program executes just fine (until termination, whereupon it reports a memory corruption, though that might be unrelated).
So I am a bit confused: How does the program execute when it has a symbol lookup error? I thought it would have to terminate as the program is written in C++, not Java. (See here for reference.)
Also, does this error imply that my rpath is set incorrectly, or has the MKL so been built improperly? Is there a fix that can be achieved in bounded time?
Firstly, I thought you needed LD_BIND_NOW=1 rather than TRUE, though according to the ld.so man page any nonempty string has the same effect.
Secondly, although your application would not have linked had there been an unresolved symbol, is it possible you've done some form of shared library update so that one of the libraries used now uses a library in turn with an unresolved symbol? Or that it's using a different library to that with which it was linked?

gcc relocation error at runtime

Currently I'm running some multi-threaded code which all compiles with no errors or warnings and I get this error when I execute the code:
relocation error: /lib/x86_64-linux-gnu/libgcc_s.so.1:
1thread_mutex_locXãƨ+�����Ȩ+ ������ƨ+�&쏭Ũ�Ȩ+e
What is a relocation error?
Relocation is the process of adjusting some offsets in the code to the actual memory layout.
Relocations (the places that will be edited by the relocation process, plus a description of each relocation) are generated by the compiler, e.g. for TLS variables, for dynamic library calls, and for PIC/PIE code. The relocation descriptions are stored in the binary file (e.g. in ELF format on Linux).
Relocations are partially resolved in the linking step, by the ld linker program on Linux and by other linkers on other OSes.
But some relocations can't be resolved offline (before the program starts). Such relocations are needed to use ASLR (address space layout randomization) and to load dynamic libraries. So some of them are done just before the program starts, by the program interpreter (ld.so on Linux), which is also called the runtime linker. It loads your program and its dynamic libraries into memory and performs the relocations.
The third place where relocations are done is a call to dlopen() (in libdl.so on Unix). This is the library for loading dynamic libraries at runtime, and because dynamic libraries have relocations, it has to perform them too.
The error message comes from one of these linkers, and if you see it after starting a program, it is the second (ld.so) or third (libdl) case.
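The third case can be observed directly. The following is a minimal sketch, assuming a POSIX system with dlopen (older glibc versions need -ldl at link time), that asks the loader to perform all relocations immediately and returns its error message if that fails. The helper name try_load_now is invented for this example.

```cpp
#include <dlfcn.h>   // dlopen, dlclose, dlerror, RTLD_NOW, RTLD_LOCAL

#include <cassert>
#include <string>

// Try to load `path` with every relocation performed right away (RTLD_NOW).
// Returns an empty string on success, otherwise the loader's error message.
std::string try_load_now(const char* path) {
    assert(path != nullptr);
    dlerror();  // clear any stale error state
    void* handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) {
        const char* msg = dlerror();
        return msg ? msg : "unknown dlopen failure";
    }
    dlclose(handle);
    return std::string();
}
```

With RTLD_LAZY instead of RTLD_NOW, function relocations are deferred until the first call, so an undefined symbol would only surface later, in the middle of the run.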
I can't find the exact place where this message is generated, but it is possibly due to one of the following:
Memory or on-disk data corruption (non-ECC memory or another hardware bug) that made some data wrong. Reboot; run filesystem and md5sum checks; reinstall the packages involved (glibc, libgcc); recompile your application; reseat your memory modules; lower the memory frequency.
An undefined symbol was used. Try setting the environment variable LD_BIND_NOW (if you are on glibc or a derivative) to a non-empty value.
The program corrupted its own memory, e.g. through a stack overflow, a random pointer walk, or something similar. Try valgrind (if you are on Intel).
A synchronization error that lets your program corrupt its own memory. Use valgrind --tool=helgrind (if you are on Intel and have a lot of time to wait).