Finding all libraries and header files forming a C++ executable

If I have a C++ source file, gcc can give all its dependencies, in a tree structure, using the -H option. But given only the C++ executable, is it possible to find all libraries and header files that went into its compilation and linking?

If you've compiled the executable with debugging symbols, then yes, you can use the symbols to get the files.
If you have .pdb files (Visual Studio creates them to store debugging information separately) you can use all kinds of programs to open them and see the source files and methods.
You can even open it with a text editor and you'll see, among the gibberish, a list of the functions and source files.
If you're using Linux (or GNU compilers in general), you can use gdb (again, only if debug symbols were enabled at compile time).
Run gdb on your executable, then run the command: info sources
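For example (myprogram is a placeholder name):

gdb ./myprogram
(gdb) info sources

The output lists every source file referenced by the debug information.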
That's an important reason why you should always strip debug symbols (or build without -g) when going into production. You don't want clients to mess around with your sources, functions, and code.

You cannot do that, because the executable might have been built on a machine where the header files (or the C++ code, or the libraries) are private or even generated. Also, if a static library is linked in, you have no reliable way to find out.
In practice however, on Linux, using nm or objdump or ldd on the executable will often (but not always) give you a good clue about the needed libraries.
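For example, a quick first pass might look like this (./myprog is a placeholder):

ldd ./myprog                       # shared libraries the dynamic linker will load
objdump -p ./myprog | grep NEEDED  # DT_NEEDED entries recorded at link time
nm -D ./myprog | grep ' U '        # undefined dynamic symbols, hinting at needed libraries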
Also, some executables dynamically load plugins, e.g. using dlopen, so your question may have no complete answer (since such a plugin is known only at runtime).
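To illustrate, here is a minimal plugin-host sketch (the plugin path is hypothetical and arrives only at runtime, so no static inspection of the binary can reveal it):

#include <cstdio>
#include <dlfcn.h>

int main(int argc, char **argv) {
    if (argc < 2) return 1;
    // The plugin path comes from argv: it is recorded nowhere in this binary.
    void *handle = dlopen(argv[1], RTLD_NOW);
    if (!handle) {
        std::fprintf(stderr, "%s\n", dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}

Build it with g++ host.cc -o host -ldl; running ldd on the result shows libdl (on older glibc) but never the plugin itself.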
Notice also that you might not know if an executable was obtained by compiling C++ code (you might not be able to tell whether it was built from C, C++, D, or OCaml, ... source code, or a mixture of them).
On Linux, if you build an executable with static linking and stripping, people won't be able to easily guess the source programming language that you have used.
BTW, on Linux distributions, it is the role of the package management system to deal with such dependencies.
As answered by Yochai Timmer, if the executable contains debug information (e.g. in DWARF format), you should be able to get a lot more information.

Related

Making sense of .so file: trying to restore poorly versioned source files

If this question is too generic, please tell me so I can delete it.
I have software used in operation that is compiled with linking to a .so file. The file is generated by compiling a set of versioned .c and .cpp sources. The previous developer generated the .so file by compiling a local version of the source files that was modified in unknown ways, and the modified sources are god-knows-where, if anywhere in the system at all. Fortunately it was compiled with debugging symbols, so reading it with gdb is easier.
The software is being used in operation and I need to modify it. Recompiling any known version of it will obviously produce results that differ from the current compiled version in unknown ways. I want to dig as deep as possible into the current .so file to learn what it is doing, so that I can recompile the sources to produce as similar a result as I can. What I have done so far:
readelf --debug-dump=info path/to/file | grep "DW_AT_producer" to see compilation flags and reproduce them in new compilations.
(gdb) info functions to see what functions are defined and compare it with previous versions of code.
Going function by function through the functions listed by the previous command, and: list <function>
Does anyone have any more tips on how to get as much info from the .so file as I can? Since I'm not an expert with gdb yet: am I missing something important?
Edit: by using strip on both files (compiled from the original source and compiled from the mysterious lost source file) I managed to see that most of the differences between them were just debug symbols (which is weird, because it seems both were compiled with the -g option).
There is only one line of difference between them now.
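For reference, that comparison can be scripted like this (file names hypothetical):

strip -o prod.stripped libfoo_prod.so
strip -o rebuilt.stripped libfoo_rebuilt.so
cmp -l prod.stripped rebuilt.stripped | wc -l   # count of differing bytes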
I just found out that "list" just reads the source file from the binary, so list doesn't help me
You are confused: the source is never stored in the binary. GDB's list command is showing the source as it exists in some file on disk.
The info sources command will show where on disk GDB is reading the sources from.
If you are lucky, that's the sources that were used to build the .so binary, and your task is trivial -- compare them to VCS sources to find modifications.
If you are unlucky, the sources GDB reads have been overwritten, and your task is much harder -- effectively you'll need to reverse-engineer the .so binary.
The way I would approach the harder task: build the library from VCS sources, and then for each function compare disas fn between the two versions of .so, and study differences (if any).
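A sketch of that workflow, assuming a function named my_function exists in both builds:

gdb -batch -ex 'file ./libfoo_prod.so' -ex 'disassemble my_function' > prod.asm
gdb -batch -ex 'file ./libfoo_rebuilt.so' -ex 'disassemble my_function' > rebuilt.asm
diff -u prod.asm rebuilt.asm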
P.S. I hope you are also using the exact same version of the compiler that was used to compile the in-production .so, otherwise your task becomes much harder still.

Load and link obj files at Runtime in C/C++ (JIT)

I'm searching for a C or C++ library which can load and link object files (ELF or obj, it doesn't matter) dynamically at runtime. I spent some time searching for such a library, but without success.
What I tried:
LLVM:
Currently my best solution! I used Clang to generate .obj files in LLVM's bitcode format and used its JIT functions to dynamically load and execute the functions. But LLVM is huge, and my PC at home doesn't have the power to compile the complete LLVM just for the JIT. Also I encountered some problems with relocation overflows or unimplemented relocation types.
libjit:
I read that it can load ELF files and link them too. But sadly, I couldn't compile it for Windows, so I couldn't try it.
Nanojit and NativeJit:
It seems like they don't support JITting an object file.
So... What can I do? Do I have to stick around with the LLVM? Are there any alternatives?
I suppose that an analogy that can be taken as a first approximation is that the .bc is similar to an .o (or .obj) file, in that it is just the translation of C++ code to an intermediate language, and that it can contain references to functions not defined in it, to be searched for in libraries.
And that the JIT-ted code is similar to a DLL, in the sense that it will be linked dynamically into the executable it runs in.
You don't need to compile LLVM -- you can download the binaries for LLVM and assorted utilities (like clang) from the LLVM Download Page.
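If you do stay with LLVM, a minimal sketch of the LLJIT route looks roughly like this (untested; the ORC API changes between LLVM releases, and add.o and the symbol add are assumptions):

#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/TargetSelect.h"

int main() {
    llvm::InitializeNativeTarget();
    llvm::InitializeNativeTargetAsmPrinter();

    auto jit = llvm::cantFail(llvm::orc::LLJITBuilder().create());

    // Load a native object file produced earlier, e.g.: clang -c add.c -o add.o
    auto buf = llvm::cantFail(llvm::errorOrToExpected(
        llvm::MemoryBuffer::getFile("add.o")));
    llvm::cantFail(jit->addObjectFile(std::move(buf)));

    // Resolve a symbol and call it (ExecutorAddr::toPtr is LLVM 15+;
    // older releases return a JITEvaluatedSymbol instead).
    auto sym = llvm::cantFail(jit->lookup("add"));
    auto *add = sym.toPtr<int (*)(int, int)>();
    return add(40, 2);
}

Link against the prebuilt libraries with something like llvm-config --cxxflags --ldflags --libs orcjit native.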

Locating "undefined" references in a C/C++ Project

I am building a C++ project on my Ubuntu 64bit system using a provided Makefile, and this project also provides an API library for developers.
The compilation was successful, no errors at all, but when I try to include the API headers provided in the "api" folder in my files, g++ complains about undefined references.
It is not a problem of dependencies (I already built the project successfully); in fact the missing references are to classes and functions provided by the project, but they live in some specific (sub-)folders (I don't know which ones!), I guess in some .so files, and g++ is not finding them, probably because it does not know they are in those specific subfolders.
It is not the first time this has happened when trying to use the APIs of a project, so I think I am missing something, or doing something wrong in general, when trying to use the libraries provided by a project.
In particular, the main problem is that I don't know how to tell the compiler where some classes or data structures are declared, and moreover I don't know how to locate the files that define them.
Usually, a workaround I use to avoid this problem is to run make install (as root or using sudo) so that libraries and APIs are installed in standard folders (like /usr/include or /usr/lib); if I do this then I can include the API headers using #include <library>. But in this last case it didn't work either, perhaps because some required files containing the not-found classes/structures are not properly installed in the right system folders.
Another workaround I sometimes used is to put all the project files in the same folder instead of using the project folder structure, but clearly this is not good! :-)
But I noticed that several other people managed to use the APIs, so apparently they know some way of finding the files containing the "undefined" references and including them in the compilation.
So my general question is: given a "classic" C++ project based on Makefile files and with the usual folder names like src, lib, build, bin, etc., if I want to write C++ files using the libraries provided by the project, but the compiler complains about undefined references, how can I find the files (.so or .o or .cpp) containing such references? Is there any tool to find them? And how can I tell the compiler where they are? Should I use some command-line option for g++, or should I use the #include directive in some smart way?
PS I also tried to use the pkg-config Linux tool to get the right options for compilation, and they were available, but the compiler still complains about the undefined references.
Some more details about the project I tried:
For interested people a link to the project is below:
https://github.com/dreal/dreal3
And instructions on how to build it:
http://dreal.github.io/download/
Look into the -rpath linker option - specifically with the "$ORIGIN" argument. That lets you find libraries relative to your executable location so you don't have to install them to the standard locations but just need to put them somewhere known, relative to the executable. That should help you with one piece of the puzzle.
Note: -Wl, can be used to pass arguments to the linker via g++.
As for pointing the compiler/linker at a library so it can resolve undefined references by using that library, use the -l (that's lowercase L) option to specify the library name and -L to specify directories to search for libraries.
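For example, if the project keeps its headers in ./include and its shared library in ./lib (all names here, including -ldreal, are assumptions about the project's layout), the pieces combine like this:

g++ -I./include -o myprog myprog.cpp -L./lib -ldreal -Wl,-rpath,'$ORIGIN/lib'

The -L/-l pair resolves the undefined references at link time; the -rpath entry lets the program find libdreal.so relative to itself at run time.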
As for looking into a library (.so) file to see what symbols are in there, you have a few tools at your disposal: objdump, nm, readelf and objcopy.
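When a reference is undefined, a small loop like this shows which shared library defines it (the symbol name is hypothetical):

for f in lib/*.so; do
    nm -DC --defined-only "$f" 2>/dev/null | grep -q 'dreal::solve' && echo "$f"
done

nm -C demangles C++ names; for static archives (.a files), drop the -D.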

Difference between code object and executable file

I'm a C++ beginner and I'm studying the basics of the language. There is a topic in my book about the compiler, and my problem is that I cannot understand what the text is saying:
C++ is a compiled language so you need to translate the source code in
a file that the computer can execute. This file is generated by the
compiler and is called the object code ( .obj ), but a program like
the "hello world" program is composed by a part that we wrote and a part
of the C++ library. The linker links these two parts of a program and
produces an executable file ( .exe ).
Why does my book say that the file executed by the computer is the one with the .obj suffix (the object code), and then say that it is the one with the .exe suffix?
Object files are source compiled into binary machine language, but they contain unresolved external references (such as printf). They may need to be linked against other object files, third-party libraries, and almost always against the C/C++ runtime library.
On Unix, object files and executables traditionally share the same container format (COFF historically, ELF on modern systems). The main difference is that object files have unresolved external references, while a.out (executable) files don't.
The C++ specification is a technical document in English. For C++11 have a look inside n3337 (or spend a lot of money to buy the paperback ISO standard). In theory you don't need a computer to run a C++ program (you could use a bunch of human slaves, but that would be unethical, inefficient, and unreliable).
You could have a C++ implementation which is an interpreter, not a compiler (e.g. Ch by SoftIntegration)
If you install Linux on your laptop (which I recommend doing to every student) then you could have several free software C++ compilers, in particular GCC and Clang/LLVM (using g++ and clang commands respectively). Source files are suffixed .cc, or .cxx, or .cpp, or even .C (I prefer .cc), and you could ask the compiler to handle a file of some other suffix as a C++ source file (but that is not conventional). Then, both object files (suffixed .o) and executables share the same ELF format. Conventionally, executables don't have any suffix (e.g. g++ is a binary executable, not doing much except starting other processes like cc1plus -the compiler proper-, as -the assembler-, ld -the linker- etc...)
In all cases I strongly recommend:
to enable all warnings and debug info during compilation (e.g. use g++ -Wall -g ....)
to improve your source code till you get no warnings
to learn how to use the debugger (gdb)
to be able to build your program on the command line
to use a version control system like git
to use a good editor like emacs, gedit, geany, or gvim
once you are writing programs in several source files, learn how to use a builder like make
to learn C++11 (or even perhaps C++14) rather than older C++ standards
to also learn other programming languages (OCaml, Scheme, Haskell, Prolog, Scala, ....) since they would improve your thinking and your way of coding in C++
to study the source code of several free software coded in C++
to read the documentation of every function that you are using, e.g. on cppreference or in man pages (for Linux)
to understand what is undefined behavior (the fact that your program sometimes works does not make it correct).
Concretely, on Linux you could edit your Hello World program (file hello.cc) with gedit or emacs (with a command like gedit hello.cc) etc..., compile it using g++ -Wall -g hello.cc -o hello command, debug it using gdb ./hello, and repeat (don't forget to use git commands for version control).
Sometimes it makes sense to generate some C++ code, e.g. by some shell, Python, or awk script (or even by your own program coded in C++ which generates C++ code!).
Also, understand that an IDE is not a compiler (but runs the compiler for you).
The basic steps for creating an application from a C or C++ source file are as follows:
the source files are created (by a person or generated by a program)
the source files are compiled (which is really two steps, preprocessing and compilation) into object code
the object files that are created by the C/C++ compiler are linked to create the .exe
So you have these steps of transforming one version of the computer program, the source files, to another, the executable. The C++ source is compiled to produce the object files. The object files are then linked to produce the executable file.
In most cases there are several different programs involved in the compile and link process with C and C++. Each program takes in certain files and creates new files.
C/C++ Preprocessor takes in source code files and generates source code files
C/C++ Compiler takes in source code files and generates object code files
the linker takes in object code files and libraries and generates executable files
See What is the difference between - 1) Preprocessor,linker, 2)Header file,library? Is my understanding correct?
Most compiler installations have a driver program that runs these various applications for you. So if you are using gcc, then the gcc program will run first the C++ preprocessor, then the C++ compiler, and then the linker. However, you can modify what gcc does with command line options, telling it to only run the C++ preprocessor, or to only compile the source files but not link them, or to only link the object code files.
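For example, with g++ each stage can be run separately (main.cpp is a placeholder):

g++ -E main.cpp -o main.ii   # preprocess only
g++ -S main.ii -o main.s     # compile to assembly
g++ -c main.s -o main.o      # assemble into object code
g++ main.o -o main           # link into an executable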
A brief history of computer languages and programming
The languages used for programming computers along with the various software development tools have evolved over the years.
The first computers were programmed with numbers entered by switches on a console.
Then people started developing languages and software that could be used to create software more easily and quickly. The first major development was creating assembler language, where each line of source was converted by a computer program into a machine code instruction. Along with this came the development of linkers (which link pieces of machine code together into larger pieces). Assemblers were improved by adding a macro or preprocessor facility, somewhat like the C/C++ preprocessor though designed for assembly language.
Then people created programming languages that looked more like written human language than assembler (FORTRAN and COBOL and ALGOL, for instance). These languages were easier to read, and a single line of source might be converted into several machine instructions, so it was more productive to write computer programs in these languages rather than assembler.
The C programming language was a later refinement using lessons learned from the early programming languages such as FORTRAN. And C used some of the same software development tools, such as linkers, that already existed. Still later C++ was invented, starting off as a refinement of C introducing object-oriented facilities. In fact the first C++ compiler was really a C++ translator, which translated C++ source code to C source code which was then compiled with a C compiler. However, modern C++ is compiled straight to machine code in order to provide the full functionality of the C++ standard with templates, lambdas, and all the other things of C++11 and later.
linkers and loaders
When you run a program you run the executable file. The executable file contains several kinds of information. The first is the machine instructions that are the result of compiling the C++ source code. The other is information that the loader uses in order to know how to load the executable into memory.
In the old days, long ago, all libraries and object files were linked together into an executable file, that file was loaded by the loader, and the loader was pretty simple.
Then people invented shared libraries and dynamic link libraries, and this required both the linker and the loader to become more complex.
The linker became more complex because it had to be able to recognize the difference between using a shared library and a static library and be able to generate an executable file that not only contains the linked object code but also information for the loader about any dynamic libraries.
The loader became more complex because not only does the loader have to load the executable file into memory so that it can start running, the loader must also find any shared libraries or dynamic link libraries that are also needed and load those too. And the loader also has to do a certain amount of linking of the additional components, the shared libraries, so the loader does a lot more than it used to do.
See also
Difference between shared objects (.so), static libraries (.a), and DLL's (.so)?
What is an application binary interface (ABI)?
How to make a SIMPLE C++ Makefile
Object code (within an object file): Output from a compiler intended as input for a linker (for the linker to produce executable code).
Executable: A program ready to be run (executed) on a computer

Why are my Visual Studio .obj files massive in size compared to the output .exe?

As background, I am a developer of an open-source project, a C++ library called openframeworks, that is a wrapper for different libraries, like OpenGL, QuickTime, FreeImage, etc. In the next release, we've added a C++ library called POCO, which is similar to boost in some ways in that it's an alternative for Java foundation-library-type functionality.
I've just noticed that in this latest release, where I've added the POCO library as a statically linked library, the .obj files produced during compilation are really massive - for example, several .obj files for really small .cpp files are 2 MB each. The overall compiled .obj files are about 12 MB or so. On the flip side, the exes that are produced are small - 300 KB to 1 MB.
In comparison, the same library compiled in code::blocks produces .obj files that are roughly the same size as the exe - they are all fairly small.
Is there something happening with linking and the .obj process in Visual Studio that I don't understand? For example, is it doing some kind of smart prelinking, or something else, that's adding to the .obj size? I've experimented a bit with settings, such as incremental linking, etc., and not seen any changes.
Thanks in advance for any ideas to try or insights!
-zach
note: thanks much! I just tried dumpbin, which says "anonymous object" and doesn't return info about the object. This might be the reason why...
note 2: after checking out the above link and removing LTCG (link-time code generation, /GL), the .obj files are much smaller and dumpbin understands them. Thanks again!
I am not a Visual Studio expert by any stretch of the imagination, having hardly used it, but I believe Visual Studio employs link-time optimizations, which can make the resulting code run faster but can cost a lot of space in the libraries. Also, it may be (I don't know the internals) that debugging information isn't stripped until the actual linking phase.
I'm sure someone's going to come with a better/more detailed answer anyway.
Possibly the difference is debug information.
The compiler outputs the debug information into the .obj, but the linker does not put that data into the .exe or .dll. It is either discarded or put into a .pdb.
In any case use the Visual Studio DUMPBIN utility on the .obj files to see what's in them.
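For example (foo.obj is a placeholder):

dumpbin /headers foo.obj
dumpbin /symbols foo.obj

The /headers output shows each section and its size; /symbols lists the COFF symbol table, including debug-related entries.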
Object files need to contain sufficient information for linking. In C++, this is name-based. Two object files refer to the same object (data/function/class) if they use the same name. This implies that all object files must contain names for all objects that might be referenced by other object files. The executable, however, needs only the names visible from outside the library. In the case of a DLL, this means only the exported names. The saving is twofold: there are fewer names, and those names are present only once in the DLL.
Modern C++ libraries will use namespaces. These namespaces mean that object names become longer, as they include the names of the encapsulating namespaces too.
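For example, piping a mangled name through c++filt shows how much of a symbol is namespace qualification (the symbol here is hypothetical):

echo _ZN3foo3bar8MyWidget4drawEv | c++filt
foo::bar::MyWidget::draw()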
The compiled library .obj files will be huge because they must contain all of the functions, classes, and templates that your end users might eventually use.
Executables which link to your library will be smaller because they will include only the compiled code that they require to run. This will usually be a tiny subset of the library.