How to determine in which .so library a given C function lives? - c++

I have this problem all the time in Linux programming. Since the manuals and almost all the source code for Linux are C-centric, every reference to some function needs only an #include <something.h> line, and then the function is accessible from C/C++ code.
But I am programming in assembly language and know almost nothing about C/C++.
In order to be able to call some function, I have to import it from the corresponding .so library.
How to determine the file name of the library? It often differs from the name of the library itself and is not specified in the manuals.
For example, the name of the XLib is actually libX11.so.6. The name of the XShm extension library seems to be libXext.so.6.
Is there an easy way to determine the secret real name of the library, using the provided C manuals and references?

This is another not-100%-accurate method that may give you some ideas as to how to narrow things down a bit. It doesn't exactly fit the question because it uses common Linux utilities instead of man files, but it may still be helpful.
Use your distribution's package management software.
For example, on Arch Linux, if you were interested in a function in GLFW/glfw3.h, you could find out who owns that file:
$ pacman -Qo /usr/include/GLFW/glfw3.h
/usr/include/GLFW/glfw3.h is owned by glfw 3.1-1
Find out which .so files are in that package:
$ pacman -Ql glfw | grep 'so$'
glfw /usr/lib/libglfw.so
And, if needed, find the actual file that link points to:
$ readlink -f /usr/lib/libglfw.so
/usr/lib/libglfw.so.3.1
This will depend on your distribution. I believe on Ubuntu/Debian you'd use dpkg-query instead.
Edit: DevSolar points out in a comment that you can use apt-file search <header> and apt-file list <package> instead of dpkg-query -S <header> and dpkg-query -L <package>. apt-file appears to work even for packages that aren't installed (though it seems slower?).
I also noticed (on my Ubuntu VM at least) that, e.g., libglfw-dev contains the libglfw.so symlink, while libglfw2 contains the actual libglfw.so.2 object.
Once you have a set of .so files, you can check them for whatever function you are interested in:
$ nm -D /usr/lib/libglfw.so | grep "glfwCreateWindow"
0000000000007cd0 T glfwCreateWindow
Note that I pulled this last step from a comment on the previous question and don't fully understand it. Maybe you could even skip the earlier steps and rely on nm and grep alone?
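For example, here is a rough sketch of that idea (assuming your libraries live under /usr/lib; adjust the glob for your distribution, and note that nm -D only lists dynamic symbols). It should print something like:
$ for lib in /usr/lib/lib*.so*; do nm -D "$lib" 2>/dev/null | grep -q ' T glfwCreateWindow$' && echo "$lib"; done
/usr/lib/libglfw.so
/usr/lib/libglfw.so.3.1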

This is not a sure-fire way, but it can help in many cases.
Basically, you can usually find the library name at the bottom of the man page.
E.g., man XCreateWindow says libX11 on the last line. Then you look for libX11.so and use nm or readelf to see all exported functions.
Another example, man XShm says libXext at the bottom. And so on.
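For example, to confirm that the function really is exported from there (the library path and the printed address will differ on your system):
$ nm -D /usr/lib/libX11.so.6 | grep ' XCreateWindow$'
00000000000268b0 T XCreateWindow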
UPDATE
If the function is in section (2) of the man pages, it's a system call (see man man) and is provided by glibc, which would be libc-2.??.so.
Lastly (thanks Basile), if the man page does not mention the library, the function is also most likely provided by glibc.
DISCLAIMER: Again this is not a 100% accurate method -- but it should help in most cases.

You can ask gcc to tell you which file it would use for linking like so:
gcc --print-file-name=libX11.so
Sample output:
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/libX11.so
This file will usually be a symlink, so you'll have to run it through readlink or realpath to get the actual file. For example:
readlink -f $(gcc --print-file-name=libXext.so)
Sample output:
/usr/lib/x86_64-linux-gnu/libXext.so.6.4.0

As I commented, you could use gcc to link your program, and then it should be able to accept -lX11; by using gcc -v instead of gcc you'll find out what is actually linked and how.
However, you have a much more significant issue than finding the lib*.so.*; most C or C++ APIs are described in header files, and these header files also contain symbolic constants (like O_RDONLY for open(2)...) or macros (like WIFEXITED in POSIX wait ...) whose value or expansion you would have to find manually in the header files or the documentation. (Quite often, such constants are either preprocessor #define-d constants or enum values.) Also, some headers (in particular in C++) contain a lot of inlined functions (or macros)!
A possible way might be to generate some C files to find all these constants, enums, macros, inlined functions..., and/or to customize the GCC compiler (e.g. with MELT ...) to find them.
So my message is that for better or worse, the C language is deeply tied to Linux & POSIX.
You might restrict yourself to using only syscalls(2) from your assembler code. Then you won't use libX11 and you won't need any header or constant (except the ones for syscalls, starting from <asm/unistd.h>).
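For instance, to look up the syscall number for write on x86-64 (the exact header path varies by distribution; on Debian multiarch it is under /usr/include/x86_64-linux-gnu):
$ grep -w __NR_write /usr/include/asm/unistd_64.h
#define __NR_write 1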
BTW, in 2015, coding entirely in assembler for performance reasons is a mistake: the compiler generates better code than you reasonably can (as soon as you have more than a few hundred machine instructions). In practice, you can still code in assembler with GCC by using extended asm instructions in your C functions.
Or are you building your own compiler? Then you should have said so in your question!
Read also the Program Library HowTo & the Linux Assembly HowTo

Related

How to know about all the files which are compiled together to make an executable?

We are looking for a procedure through which we can easily list all the files which are compiled together to make an executable.
Use case: suppose we have a large repository and we want to know which files in the repository are compiled to make a particular executable (i.e. a.out).
For example :
dwarfdump a.out | grep "NS uri"
0x0000064a [ 9, 0] NS uri: "/home/main.c"
0x000006dd [ 2, 0] NS uri: "/home/zzzz.c"
0x000006f1 [ 2, 0] NS uri: "/home/yyyy.c"
0x00000705 [ 2, 0] NS uri: "/home/xxxx.c"
0x00000719 [ 2, 0] NS uri: "/home/wwww.c"
but it doesn't list all the header files.
Please suggest.
How to extract source code from an executable with debug symbols available?
You cannot do that. I guess you are on Linux/x86-64 (your question is operating-system, ABI, and debugging-format specific). Of course, you should pass -g (or even -g3) to all the gcc compilation commands for your executable. Without that -g or -g3 option used to compile every translation unit (including perhaps those of shared libraries!) you might not have enough information.
Even with debug information in DWARF format, the ELF executable doesn't contain source code, but only references to source code (e.g. source file path, position as line and column numbers). So the debug information contains stuff like file src/foo.c, line 34, column 5 (but doesn't give anything about the content of src/foo.c near that position). Of course, once gdb knows the file path src/foo.c it is able to read that source file (if available and up to date w.r.t. the executable), so it can list it.
Extracting that debugging metadata is a different question. Once you have understood DWARF, you could use tools like objdump, readelf, addr2line, dwarfdump, or libdwarf; and you could also script gdb (recent versions of GDB can be extended in Python or in Guile) and use it on your ELF executable.
Perhaps you should consider Ian Taylor's libbacktrace. It uses the DWARF information to provide nice looking backtraces at runtime.
BTW, cgdb is (like ddd) only a front-end to gdb, which does all the real work of processing that DWARF information. It is free software; you can study its source code.
I have only a.out and I want to list the file names
You might try dwarfdump -i a.out | grep DW_AT_decl_file, and you could use some GNU awk command instead of grep. You need to dive into the details of the DWARF specification, and you need to understand more about the elf(5) format.
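For instance, a sketch of that awk variant (the column layout of dwarfdump output varies between versions, so check yours first; this just prints the last field of each DW_AT_decl_file line):
$ dwarfdump -i a.out | awk '/DW_AT_decl_file/ { print $NF }' | sort -u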
It doesn't list all the header files
This is expected. Most header files don't contain any code, only declarations (e.g. printf is not implemented in <stdio.h> but in some C source file of your C standard library, e.g. in tree/src/stdio/printf.c if you use musl-libc; it is just declared in /usr/include/stdio.h). DWARF (and other debug information formats) describe the binary code. And some header files get included only to give access to a few preprocessor macros (which get expanded or skipped at preprocessing time).
Maybe you dream of homoiconic programming languages, then try Common Lisp (e.g. with SBCL).
If your question is how to use gdb, then please read the Debugging with GDB manual.
If your question is about decompilers, be aware that it is an impossible task in general (e.g. because of Rice's theorem). BTW, programs inside most Linux distributions are generally free software, so it is quite easy to get the source code (and you could even avoid using proprietary software on Linux).
BTW, you could also do more things at compilation time by passing more flags to gcc. You might pass -H or -M (etc.) to gcc (in addition to -g). You could even consider writing your own GCC plugin to collect the information you want in some database (but that is probably not worth the effort). You could also consider improving your build automation (e.g. adding more into your Makefile) to collect such information. BTW, many large C programs use some metaprogramming techniques by having some .c files, perhaps containing #line directives, generated by tools (e.g. bison) or scripts; then what kind of file path do you want to keep?
We are looking for a procedure through which we can easily list all the files which are compiled together to make an executable.
If you are writing that executable and compiling it from its source code, I would suggest collecting that information at build time. It could be as trivial as passing some -M and/or -H flag to gcc, perhaps into some generated timestamp.c file (see this for inspiration; but your timestamp.c might contain information provided by gcc -M etc...). Your timestamp file might contain git version-control metadata (like that generated in this Makefile). Read also about reproducible builds and about package managers.
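Here is a minimal sketch of that idea (file names are made up; gcc -M emits a make rule, so we strip the target and the line-continuation backslashes to get one path per line):
$ gcc -M main.c | sed -e 's/.*://' -e 's/\\$//' | tr ' ' '\n' | grep -v '^$' | sort -u > compiled-files.txt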

OCaml statically detect dependency on non-pervasives library in standard distribution

Certain modules that ship with OCaml, like Unix and Bigarray, have their own .cmx and .cmxa files in ocamlopt -where (which is ~/.opam/4.03.0/lib/ocaml on my system in my current opam switch).
Is there a way to determine without compiling which source files depend on which of these "special" libraries in the standard distribution? I'm intending to consume this output later in a Makefile.
The following program example.ml
open Unix;;
Unix.system "echo hi";;
can be compiled using ocamlfind ocamlopt -package unix -linkpkg example.ml. I'm not sure how to compile it without going through the ocamlfind wrapper.
I'm wondering if there's a way to statically detect that the unbound-in-this-file module Unix corresponds to "something" in the standard distribution and report unix.cmxa as a dependency. ocamldep does not seem to report it as a dependency by default.
ocamldep -all example.ml just reports that the various object and interface files that can be produced from example.ml depend on example.ml. I was hoping for either an error message complaining that ocamldep doesn't understand the Unix module, or some indication that it's required to build the objects.
$ ocamldep -all example.ml
example.cmo example.cmi : example.ml
example.cmx example.o example.cmi : example.ml
I understand that your question is:
For a given module name, say Unix, how can we find the library which provides it?
Unfortunately there is no such tool (yet).
If we limit the search space to the libraries that come with the OCaml compiler itself, I would do:
$ ocamlobjinfo $HOME/.opam/4.03.0/lib/ocaml/*.cma | grep '^\(File\|Unit name\)'
This will list all the modules defined in each archive. You may or may not find the module name in the result.
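Once you have found the archive that defines the module, you should also be able to link it directly, without going through ocamlfind; e.g., assuming the module turned up in unix.cmxa:
$ ocamlopt unix.cmxa example.ml -o example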
It is impossible in general, since the library you seek may not be standard or may not be installed locally. You can use API search engines like ocamloscope, but they never cover all the OCaml libraries ever written, of course.
Though modules might be packed into libraries with arbitrary names, the module interfaces still preserve a one-to-one mapping between top-level module names and compiled module interface file names. So if you have an error 'Unbound module Xxx', then you can do
find ~/.opam -iname Xxx.cmi
If you didn't find any, then it means that such a library is not installed. And currently there is no well-established way to find out which package provides this module; you can use Google, ask people on mailing lists or discussion forums, or try to use apt-file, hoping that the library is in a standard distribution.
If the search returned exactly one folder, then you're lucky: you got the package. The package may contain object files of different kinds (.cmx for native code, .cmo for bytecode), as well as libraries (.cma is a collection of .cmo, .cmxa is a collection of .cmx, and .cmxs is a dynamically loadable version of .cmxa). The flexibility of OCaml, which is both a boon and a bane, allows any of these files to be missing. Well-mannered libraries usually provide all these files, and follow a naming convention where the package name matches the library name. And if you're using ocamlfind and the folder has a META file, then the name of the folder is the name of the package that you need to pass to ocamlfind in order to link the libraries from this package.
If you have more than one result, then you need to use common sense to determine which of the libraries you need. Alternatively, you may try one and then the other and see which one compiles.

How To Get g++ to list paths to all #included files

I would like to have g++/gcc tell me the paths to everything non-system it is #include-ing in a C++ build. It turns out that is a tough search, as Google misinterprets it about ten different ways.
I want these filenames and paths so I can add them to the search path for Exuberant Ctags. We have a huge project, and if I use ctags on the whole thing it takes about half an hour to generate the tags file and nearly as long for the editor to do a look-up.
We use CMakeLists files to do the compiling. If there is a directive I can paste into the CMakeLists.txt, that would be extra wonderfulness.
I don't really need the default paths and filenames; Jonathan Wakely gave a good tool for that here. I think that pretty much covers the fact that this is a cross-compile job. I don't need the cross-system files either.
Try gcc or g++ with the -H option (to the preprocessor part of it). From the doc:
-H
Print the name of each header file used, in addition to other normal activities. Each name is indented to show how deep in the ‘#include’ stack it is. Precompiled header files are also printed, even if they are found to be invalid; an invalid precompiled header file is printed with ‘...x’ and a valid one with ‘...!’ .
It tells you all the headers which are included. You may filter out (with grep -v or awk) those that you don't want.
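For example (note that -H writes its list to stderr, and the system-header prefix to filter out will vary by system):
$ g++ -H -fsyntax-only main.cpp 2>&1 | grep -v ' /usr/'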
You could also consider developing your GCC plugin to register these headers somewhere (e.g. in your sqlite database), perhaps inspired by this draft report, or the CHARIOT or DECODER European projects. You could also consider using, or extending, the Clang static analyzer.
In contrast to the -M options suggested in Oliver Matthews' answer, it does not tell you more than that (but it does give all the included files).
You need to invoke g++ with the -M option.
From the manual:
Instead of outputting the result of preprocessing, output a rule suitable for make describing the dependencies of the main source file. The preprocessor outputs one make rule containing the object file name for that source file, a colon, and the names of all the included files, including those coming from -include or -imacros command line options.
It's worth reading the manual to consider the other -M sub-options (-MM and -MF in particular may be of use).
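For instance, -MM omits system headers, which is close to what the question asks for (the file names here are hypothetical):
$ g++ -MM main.cpp
main.o: main.cpp widget.h util/strings.h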

C++ binary identification (manifest)

We have a large set of C++ projects (GCC, Linux, mostly static libraries) with many dependencies between them. Then we compile an executable using these libraries and deploy the binary on the front-end. It would be extremely useful to be able to identify that binary. Ideally what we would like to have is a small script that would retrieve the following information directly from the binary:
$ ident binary
binary: Product=PRODUCT_NAME;Version=0.0.1;Build=xxx;User=xxx...
dependency: Product=PRODUCT_NAME1;Version=0.1.1;Build=xxx;User=xxx...
dependency: Product=PRODUCT_NAME2;Version=1.0.1;Build=xxx;User=xxx...
So it should display all the information for the binary itself and for all of its dependencies.
Currently our approach is:
During compilation, for each product we generate Manifest.h and Manifest.cpp and then inject the resulting Manifest.o into the binary.
The ident script parses the target binary, finds the generated data there, and prints this information.
However, this approach is not always reliable across different versions of gcc..
I would like to ask the SO community: is there a better approach to solve this problem?
Thanks for any advice
One of the catches with storing data in source code (your Manifest.h and .cpp) is the size limit for literal data, which is dependent on the compiler.
My suggestion is to use ld. It allows you to store arbitrary binary data in your ELF file (so does objcopy). If you prefer to write your own solution, have a look at libbfd.
Let us say we have a hello.cpp containing the usual C++ "Hello world" example. Now we have the following make file (GNUmakefile):
hello: hello.o hello.om
	$(LINK.cpp) $^ $(LOADLIBES) $(LDLIBS) -o $@

%.om: %.manifest
	ld -r -b binary -o $@ $<

%.manifest:
	echo "$@" > $@
What I'm doing here is separating out the linking stage, because I want the manifest (after conversion to ELF object format) linked into the binary as well. Since I am using suffix rules, this is one way to go; others are certainly possible, including a better naming scheme for the manifests where they also end up as .o files and GNU make can figure out how to create them. Here I'm being explicit about the recipes. So we have .om files, which are the manifests (arbitrary binary data), created from .manifest files. The first recipe converts the binary input into a relocatable ELF object; the second simply pipes a string into the .manifest file.
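You can check that the conversion produced the symbols used further below (the symbol names are derived from the input file name, so yours may differ; the output shown is roughly what GNU objdump prints):
$ objdump -t hello.om | grep _binary
0000000000000000 g       .data  0000000000000000 _binary_hello_manifest_start
000000000000000f g       .data  0000000000000000 _binary_hello_manifest_end
000000000000000f g       *ABS*  0000000000000000 _binary_hello_manifest_size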
Obviously the tricky part in your case isn't storing the manifest data, but rather generating it. And frankly I know too little about your build system to even attempt to suggest a recipe for the .manifest generation.
Whatever you throw into your .manifest file should probably be some structured text that can be interpreted by the script you mention, or that can even be output by the binary itself if you implement a command-line switch (disregarding .so files, and .so files hacked into behaving like ordinary executables when run from the shell).
The above make file doesn't take into account the dependencies - or rather it doesn't help you create the dependency list in any way. You can probably coerce GNU make into helping you with that if you express your dependencies clearly for each goal (i.e. the static libraries etc). But it may not be worth it to take that route ...
Also look at:
C/C++ with GCC: Statically add resource files to executable/library and
Is there a Linux equivalent of Windows' "resource files"?
If you want particular names for the symbols generated from the data (in your case the manifest), you need to use a slightly different route and use the method described by John Ripley here.
How to access the symbols? Easy. Declare them as external (C linkage!) data and then use them:
#include <cstddef>
#include <cstdio>

extern "C" char _binary_hello_manifest_start;
extern "C" char _binary_hello_manifest_end;

int main()
{
    // Compute the length ourselves: the embedded blob is not NUL-terminated.
    const std::ptrdiff_t len = &_binary_hello_manifest_end - &_binary_hello_manifest_start;
    // %.*s limits the number of characters printed to 'len'.
    printf("Hello world: %.*s\n", (int)len, &_binary_hello_manifest_start);
}
The symbols are the exact characters/bytes. You could also declare them as char[], but that would result in problems down the road, e.g. for the printf call.
The reason I am calculating the size myself is that a) I don't know whether the buffer is guaranteed to be zero-terminated and b) I didn't find any documentation on interfacing with the *_size variable.
Side-note: the .* in the format string tells printf to read the maximum number of characters to print (the precision) from the argument list, and then pick the next argument as the string to print out.
You can insert any data you like into a .comment section in your output binary. You can do this with the linker after the fact, but it's probably easier to place it in your C++ code like this:
asm (".section .comment.manifest\n\t"
".string \"hello, this is a comment\"\n\t"
".section .text");
int main() {
....
The asm statement should go outside any function, in this instance. This should work as long as your compiler puts normal functions in the .text section. If it doesn't then you should make the obvious substitution.
The linker should gather all the .comment.manifest sections into one blob in the final binary. You can extract them from any .o or executable with this:
objdump -j .comment.manifest -s example.o
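If you want the raw bytes rather than a hex dump, something like this should also work:
$ objcopy -O binary --only-section=.comment.manifest example.o manifest.txt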
Have you thought about using the standard packaging system of your distro? In our company we have thousands of packages, and hundreds of them are automatically deployed every day.
We are using Debian packages that contain all the necessary information:
A full changelog that includes authors, versions, and short descriptions and timestamps of changes.
Dependency information: a list of all packages that must be installed for the current one to work correctly.
Installation scripts that set up the environment for a package.
I think you may not need to create manifests your own way when a ready-made solution already exists. You can have a look at the Debian package HowTo here.

Determining what object files have caused .dll size increase [C++]

I'm working on a large C++ library that has grown by a significant amount recently. Due to its size, it is not obvious what has caused this size increase.
Do you have any suggestions for tools (MSVC or GCC) that could help determine where the growth has come from?
edit
Things I've tried: dumpbin on the final DLL and the .obj files, creating a map file and ripping through it.
edit again
So objdump along with a Python script seems to have done what I want.
If gcc, objdump. If Visual Studio, dumpbin.
I'd suggest doing a diff of the output of the tool for the old (small) library, vs. the new (large) library.
keysersoze's answer (compare the output of objdump or dumpbin) is correct. Another approach is to tell the linker to produce a map file, and compare the map files for the old and new versions of the DLL.
MSVC: link.exe /MAP
GCC and binutils: ld -M (or gcc -Wl,-M)
On Linux it should be quite easy to see if new files have been added with a recursive diff; they would certainly create an increase in the library size. You can then go and use the size command-line tool on Linux to get the sizes of each of the new object files and sum them up. Then compare that sum to your library increase and check how much it differs.
G'day,
If you have any previous versions of the object file lying around, can you run the size command to see which segment has grown?
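For example (the file names and numbers here are hypothetical):
$ size old/libfoo.so new/libfoo.so
   text    data     bss     dec     hex filename
 812345   20480    4096  836921   cc539 old/libfoo.so
1412345   20480    4096 1436921  15ecf9 new/libfoo.so
Here the growth is almost entirely in the text segment, which would point at added or newly inlined code rather than data.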
Couple of questions:
Are you on a *nix platform or a Windows platform?
Which compiler are you using?
Was the compiler recently changed?
Was the -g flag added recently? (obvious question 1)
Was the object previously stripped? (obvious question 2)
Was the object dynamically linked previously? (obvious question 3)
Edit: If the code is under SCM, can you check out a version of the source that gave you the smaller object? Then compare:
the size of the source trees, by doing a du -sk on the old source tree and then the new source tree, without having built anything.
the number of files, by doing something like find ./tree_top \( -name '*.h' -o -name '*.cpp' \) | wc -l
the location of an increased number of files, by doing find ./tree_top \( -name '*.h' -o -name '*.cpp' \) -print | sort > treelist and then doing the same for the new larger tree. A simple sdiff will then show any large number of new files.
the size of the code base; even a simple count of trailing semi-colons will give you a good basic mechanism for comparison between the two (see the sketch after this list).
the Makefiles or build environment for the project, to see if different options or settings have crept into the build itself.
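A rough sketch of that semicolon count (the tree names are made up; this counts lines ending in a semicolon in each tree):
$ find ./old_tree \( -name '*.h' -o -name '*.cpp' \) -print0 | xargs -0 cat | grep -c ';$'
$ find ./new_tree \( -name '*.h' -o -name '*.cpp' \) -print0 | xargs -0 cat | grep -c ';$'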
HTH
BTW Please post your findings here as I'm sure many people are interested in what you find out.
cheers,