Making sense of .so file: trying to restore poorly versioned source files

If this question is too generic, please tell me so I can delete it.
I have software in production that links against a .so file. The file is built from a set of versioned .c and .cpp sources. The previous developer generated the .so file from a local copy of the sources that was modified in unknown ways, and the modified sources are god-knows-where, if anywhere on the system at all. Fortunately it was compiled with debugging symbols, so reading it with gdb is easier.
The software is in production and I need to modify it. Recompiling any known version of the sources will obviously produce a library that differs from the currently deployed one in unknown ways. I want to dig as deep as possible into the current .so file to learn what it is doing, so that I can recompile the sources and get a result as close to it as I can. What I have done so far:
readelf --debug-dump=info path/to/file | grep "DW_AT_producer" to see compilation flags and reproduce them in new compilations.
(gdb) info functions to see what functions are defined and compare it with previous versions of code.
Going through the functions listed by the previous command one by one and running: list <function>
Does anyone have any more tips on how to get as much info from the .so file as I can? Since I'm not an expert with gdb yet: am I missing something important?
Edit: by running strip on both files (the one compiled from the original source and the one compiled from the mysterious lost source) I managed to see that most of the differences between them were just debug symbols (which is weird, because it seems both were compiled with the -g option).
There is only one line of difference between them now.

I just found out that "list" just reads the source file from the binary, so list doesn't help me
You are confused: the source is never stored in the binary. GDB list command is showing the source as it exists in some file on disk.
The info sources command will show where on disk GDB is reading the sources from.
If you are lucky, that's the sources that were used to build the .so binary, and your task is trivial -- compare them to VCS sources to find modifications.
If you are unlucky, the sources GDB reads have been overwritten, and your task is much harder -- effectively you'll need to reverse-engineer the .so binary.
The way I would approach the harder task: build the library from VCS sources, and then for each function compare disas fn between the two versions of .so, and study differences (if any).
P.S. I hope you are also using the exact same version of the compiler that was used to compile the in-production .so, otherwise your task becomes much harder still.
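The per-function comparison described above can be partly automated. The following is a minimal sketch, not a definitive tool: it assumes you have saved the text output of objdump -d --no-show-raw-insn for each build, and the helper names (split_functions, changed_functions) are made up for this illustration. It strips instruction addresses so that code which merely moved does not show up as a difference.

```python
import re

# "0000000000001139 <add>:" starts a function in objdump -d output.
HEADER = re.compile(r"^[0-9a-f]+ <([^>]+)>:\s*$")
# "    1139:<TAB>push   %rbp" is one instruction line.
INSN = re.compile(r"^\s*[0-9a-f]+:\t(.*)$")
# Absolute address printed before a symbolic target, e.g. "1030 <puts@plt>".
ABS_TARGET = re.compile(r"\b[0-9a-f]+ (?=<)")


def split_functions(disassembly):
    """Map each function name to its instructions, with addresses removed."""
    functions = {}
    current = None
    for line in disassembly.splitlines():
        header = HEADER.match(line)
        if header:
            current = functions.setdefault(header.group(1), [])
            continue
        insn = INSN.match(line)
        if insn and current is not None:
            # Keep only the last tab-separated field (mnemonic and operands).
            text = insn.group(1).split("\t")[-1].strip()
            current.append(ABS_TARGET.sub("", text))
    return functions


def changed_functions(old_text, new_text):
    """Names of functions whose normalized instructions differ or that exist in only one build."""
    old, new = split_functions(old_text), split_functions(new_text)
    return sorted(n for n in old.keys() | new.keys() if old.get(n) != new.get(n))
```

Functions reported by changed_functions are the ones worth reading side by side with disas in gdb. Differences caused only by a different compiler version or different flags will also show up here, which is one more reason to match the original toolchain first.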

Related

Compiling modules in different directories

I'm trying to follow these instructions to compile a module that depends on another module which I've created: https://ocaml.org/learn/tutorials/modules.html
In my case, I have a module ~/courseFiles/chapter5/moduleA.ml and another module in ~/OCamlCommons/listMethods.ml. I have compiled listMethods.ml using ocamlopt -c listMethods.ml and this seemed to work, it produced a file listMethods.cmx.
The file moduleA.ml contains open ListMethods;;. Now with my terminal located at ~/courseFiles/chapter5 I ran ocamlopt -c moduleA.ml but the terminal returns
Error: Unbound module ListMethods
Now I can understand why it would do this, but the instructions at that site seem to indicate that what I've done is how you're supposed to do this. Presumably I need to pass in the location of either the script or executable files when compiling moduleA.ml, but I'm not sure what the syntax should be. I've tried a few guesses, and guessed about how I could do this with ocamlfind but I haven't succeeded. I tried looking for instructions on compiling modules located in different directories but didn't find anything (or anything I can make sense of anyway).

First of all, the toolkit that is shipped with the OCaml System Distribution (aka the compiler) is very versatile but quite low-level and should be seen as a foundation layer for building more high-level build systems. Therefore, learning it is quite hard and usually makes sense only if you're going to build such systems. It is much easier to learn how to use dune or oasis or ocamlbuild instead. Moreover, it will divert your attention from what actually matters - learning the language.
With all that said, let me answer your question in full detail. OCaml implements a separate compilation scheme, where each compilation unit can be built independently and then linked into a single binary. This scheme is common in C/C++, and in fact the OCaml compiler toolchain is very similar to the C compiler toolchain.
When you run ocamlopt -c x.ml you're creating a compilation unit, and as a result a few files are produced, namely:
x.o - contains the actual compiled machine code
x.cmx - contains optimization data and other compiler-specific information
x.cmi - contains compiled interface to the module X.
In order to compile a module, the compiler doesn't need the code of any other modules used in that module. But what it needs is the typing information, i.e., it needs to know what is the type of List.find function, or a type of any other function that is provided by some module which is external to your module. This information is stored in cmi files, for which (compiled) header files from C/C++ is the closest counterpart. As in C/C++ the compiler is searching for them in the include search path, which by default includes the current directory and the location of the standard library, but could be extended using the -I option (the same as in C/C++). Therefore, if your module is using another module defined in a folder A you need to tell the compiler where to search for it, e.g.,
ocamlopt -I A -c x.ml
The produced object file will not contain any code from external modules. Therefore, once you reach the final stage of compilation - the linking phase - you have to provide the implementation, e.g., if your module X was using a module implemented in a file with relative path A/y.ml, and you have compiled it in that folder, then you need to specify again the location of the compiled implementation, e.g.,
ocamlopt -I A y.cmx x.cmx -o exe
The order is important, all modules used by a module should be specified before that module, otherwise, you will get the "No implementations provided" error.
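The ordering rule above is a topological sort of the module dependency graph. As a minimal sketch (written in Python rather than OCaml, and using a hypothetical helper name, link_command), the link line from the example can be derived from a hand-written dependency map. The map itself is an assumption: you have to supply it, this code does not inspect any OCaml files.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def link_command(deps, include_dirs=(), output="exe"):
    """Build an ocamlopt link command with every module listed after the modules it uses.

    deps maps a module's base name to the set of module base names it depends on.
    """
    # static_order() yields dependencies before the modules that use them.
    order = list(TopologicalSorter(deps).static_order())
    argv = ["ocamlopt"]
    for directory in include_dirs:
        argv += ["-I", directory]
    argv += [name + ".cmx" for name in order]
    argv += ["-o", output]
    return argv


# Module x uses module y, whose compiled files live in folder A.
print(" ".join(link_command({"x": {"y"}, "y": set()}, include_dirs=["A"])))
```

Higher-level build tools compute this order for you, which is a large part of why they are easier to use.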
As you can see, it is a pretty convoluted process, and it is really not worthwhile to invest your time in learning it. So, if you have an option, then use a higher-level tool to build your programs. If not sure, then choose Dune :)

How to find out which c++ standard used in a binary file?

For example, I have a "helloworld" cpp file named main.cpp.
Say I compile it once with the flag -std=c++11 and once more with the flag -std=c++03.
How can I tell which of the two binaries was compiled with the C++11 flag?
Extra: my specific problem is that I have a third-party lib file that I use in my code, but I don't know which "-std" flag I should use.

If it is a third-party library, then there must be some documentation stating the compilation steps to build it from source. Please refer to that.
If there is no such thing available, I am assuming that you at least have access to the source code, please look into the implementation (the header files or the source files), you will probably get more than enough information to figure out if it uses code conforming to C++ 11 standard.
@πάνταῥεῖ, I mean: won't compiling with a different C++ standard leave something in the binary file? - Riopho
If you want to figure it out from the binary, then I would probably use objdump and disassemble the binary with demangling turned on - objdump -dC <binary_name> - (assuming that you are on Linux; I don't know much about Windows). You should be able to get some hints from that.
I am not sure if the compiler leaves any traces in the binary though.
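One place where a trace can survive is the debug information. If the library was built with -g, GCC can record its command-line switches in the DW_AT_producer attribute, which readelf --debug-dump=info prints. The sketch below assumes you have captured that readelf output as text; the function name std_flags is made up for this illustration, and an empty result only means no -std switch was recorded, not that none was used.

```python
import re

# Matches the value of a DW_AT_producer line, skipping an optional
# "(indirect string, offset: 0x...):" prefix that readelf may print.
PRODUCER = re.compile(r"DW_AT_producer\s*:\s*(?:\([^)]*\):\s*)?(.*)")


def std_flags(readelf_output):
    """Collect every -std=... switch recorded in DW_AT_producer strings."""
    found = set()
    for line in readelf_output.splitlines():
        match = PRODUCER.search(line)
        if match:
            found.update(re.findall(r"-std=\S+", match.group(1)))
    return sorted(found)
```

A stripped binary, or one built by a compiler that does not record its switches, will give you nothing this way.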

Finding all libraries and header files forming a C++ executable

If I have a C++ source file, gcc can give all its dependencies, in a tree structure, using the -H option. But given only the C++ executable, is it possible to find all libraries and header files that went into its compilation and linking?

If you've compiled the executable with debugging symbols, then yes, you can use the symbols to get the files.
If you have .pdb files (Visual Studio creates them to store debugging information separately) you can use all kinds of programs to open them and see the source files and methods.
You can even open one with a text editor and you'll see, among the gibberish, a list of the functions and source files.
If you're using Linux (or GNU compilers in general), you can use gdb (again, only if debug symbols were enabled at compilation time).
Run gdb on your executable, then run the command: info sources
That's an important reason why you should always remove that flag when going into production. You don't want clients to mess around with your sources, functions, and code.

You cannot do that, because that executable might have been built on a machine on which the header files (or the C++ code, or the libraries) are private or even generated. Also, if a static library is linked in, you have no reliable way to find out.
In practice, however, on Linux, using nm or objdump or ldd on the executable will often (but not always) give you a good clue about the needed libraries.
Also, some executables are dynamically loading a plugin e.g. using dlopen, so your question might not have any sense (since that plugin is known only at runtime).
Notice also that you might not know if an executable was obtained by compiling some C++ code (you might not be able to tell if it was obtained from C, C++, D, or OCaml, ... source code, or a mixture of them).
On Linux, if you build an executable with static linking and stripping, people won't be able to easily guess the source programming language that you have used.
BTW, on Linux distributions, it is the role of the package management system to deal with such dependencies.
As answered by Yochai Timmer if the executable contains debug information (e.g. in DWARF format) you should be able to get a lot more information.
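As a small illustration of the objdump route mentioned above: objdump -p prints the dynamic section, where each directly required shared library appears as a NEEDED entry. The sketch below assumes you have captured that output as text, and the function name needed_libraries is made up for this example. It says nothing about static libraries, headers, or plugins loaded with dlopen, for the reasons given above.

```python
def needed_libraries(objdump_p_output):
    """Return the shared libraries named in NEEDED entries of the dynamic section."""
    libs = []
    for line in objdump_p_output.splitlines():
        fields = line.split()
        # A NEEDED line has exactly two fields: the tag and the library name.
        if len(fields) == 2 and fields[0] == "NEEDED":
            libs.append(fields[1])
    return libs
```

Unlike ldd, this lists only direct dependencies and does not resolve them to paths on the current machine.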

make SCons compile everything in one gcc line?

I have a rather complex SCons script that compiles a big C++ project.
This gcc manual page says:
The compiler performs optimization based on the knowledge it has of the program. Compiling multiple files at once to a single output file mode allows the compiler to use information gained from all of the files when compiling each of them.
So it's better to give all my files to a single g++ invocation and let it drive the compilation however it pleases.
But SCons does not do this. It calls g++ separately for every single C++ file in the project and then links them using ld.
Is there a way to make SCons do this?

The main reason to have a build system with the ability to express dependencies is to support some kind of conditional/incremental build. Otherwise you might as well just use a script with the one command you need.
That being said, the effect of having gcc/g++ optimize as the manual describes can be substantial, in particular if you use C++ templates often. Good for run-time performance, bad for recompile performance.
I suggest you try and make your own builder doing what you need. Here is another question with an inspirational answer: SCons custom builder - build with multiple files and output one file
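Since SConstruct files are Python, the core of such a builder is just one command line that names every source file. The following is a minimal sketch of that piece only; the helper name single_call_command and its default compiler and flags are assumptions for illustration, not part of SCons.

```python
import shlex


def single_call_command(sources, target, compiler="g++", flags=("-O2",)):
    """Build one compiler command line that takes every source file at once."""
    argv = [compiler, *flags, "-o", target, *sources]
    # Quote each argument so file names with spaces survive the shell.
    return " ".join(shlex.quote(arg) for arg in argv)


print(single_call_command(["main.cpp", "util.cpp"], "app"))
```

In an SConstruct, a string like this could be passed as the action to env.Command(target, sources, action); that wiring is untested here. Note that any change to any source file then recompiles everything, which is the trade-off described above.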

Currently the answer is no.
Logic similar to this was developed for MSVC only.
You can see this in the man page (http://scons.org/doc/production/HTML/scons-man.html) as follows:
MSVC_BATCH When set to any true value, specifies that SCons should
batch compilation of object files when calling the Microsoft Visual
C/C++ compiler. All compilations of source files from the same source
directory that generate target files in a same output directory and
were configured in SCons using the same construction environment will
be built in a single call to the compiler. Only source files that have
changed since their object files were built will be passed to each
compiler invocation (via the $CHANGED_SOURCES construction variable).
Any compilations where the object (target) file base name (minus the
.obj) does not match the source file base name will be compiled
separately.
As always patches are welcome to add this in a more general fashion.

In general this should be left up to the program developer. Compiling everything together as an amalgamation may introduce unintended behaviour, if it even compiles in the first place. If you want this kind of optimisation without editing the source yourself, your best bet is a compiler with interprocedural optimisation, like icc -ipo.
An example where an amalgamation of two .c files would not compile: both files define a static symbol with the same name but different functionality.
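Before attempting an amalgamation, the static-symbol clash described above can be roughly checked for. This is a crude sketch under stated assumptions: it uses a regular expression rather than a C parser, it only looks at static functions that start at the beginning of a line, and the name static_name_clashes is made up for this illustration. It can miss clashes or report harmless ones.

```python
import re
from collections import defaultdict

# "static <type words> name(" at the start of a line, e.g. "static int helper(".
STATIC_FUNC = re.compile(r"^static\s+(?:[\w*]+\s+)+\*?(\w+)\s*\(", re.MULTILINE)


def static_name_clashes(sources):
    """sources maps a file name to its C text.

    Returns the static function names that appear in two or more files,
    each mapped to the sorted list of files that contain it.
    """
    owners = defaultdict(set)
    for filename, text in sources.items():
        for name in STATIC_FUNC.findall(text):
            owners[name].add(filename)
    return {name: sorted(files) for name, files in owners.items() if len(files) > 1}
```

Each file compiles fine on its own because static limits the name to that translation unit; concatenating the files removes that boundary, so the second definition becomes a redefinition error.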

Can you retrieve source from a debug-compiled binary?

I was digging around and found an executable for something I wrote in Visual C++ 6.0 about 8 years ago. I never backed up the source code, but I think I always compiled everything in debug mode. I also vaguely remember hearing somewhere that "you can't decompile an executable into source code unless you have your compiler's debugging symbols or something." The code would have sentimental value, but it's not mission-critical that I retrieve it.
That's the background; here are the questions:
How do I check if an executable was compiled in debug mode or not?
If it is, what information comes with a debug mode executable?
Can I retrieve full source code? Failing that, can I get any substantial improvement when decompiling compared to a release version? If so, how?
Thanks,
-- Michael Burge

I do not believe there is a flag, though you might find something by using PEDUMP, which dumps COFF file formats (Windows EXEs and DLLs). You can infer whether an executable was compiled for debug rather quickly by running Dependency Walker and seeing if your EXE links to any debug DLLs (suffixed with D, e.g. MSVCRT5D.DLL).
FYI, in VC6 Debug and Release are simply named builds, not modes per se; each build is a collection of compiler and linker settings. The EXE is just code; debug EXEs normally have not been optimized, which makes them easy to use with a debugger (versus debugging optimized code). Thus you can compile a Release binary with debug symbols, which is sometimes useful for tracking down errors in optimized code.
Debug EXEs and DLLs did not contain any debugging information but instead had a sidecar PDB file that resided in the same folder and contains all the debugging symbols information that was produced during compilation.
No, source is source and not compiled into the symbols file or executables. There are some amazing decompilers out there that can regenerate decent C versions of your code but they are amazing only in how good the C is, not in how well they can recreate your source.

With Visual Studio, I am afraid you can't, since the debug executable doesn't contain the source. Visual Studio generates pdb files that only contain the mapping between the binary and the source filenames and line numbers, so you still need the source code alongside them. That might be different with gcc, which I think integrates the source itself inside the binaries.

I think that many disassemblers can show the source if a binary is compiled in debug mode. For example, I use OllyDBG and it has an option to show the source, although I've never tried.
