Analyzing shared libraries for duplicate code linkage

We have a large codebase with > 40 projects (in VS lingo) creating several DLLs/SOs (~15) and an EXE.
There are a few Utility projects which are statically linked to create the EXE and also used by most of the DLLs. Ideally, we'd want these Utility projects to be DLLs too, so that the code isn't duplicated in each of the DLLs that depend on them.
Are there any tools to do a binary analysis on the DLLs to see how much duplication exists (code + data)? Getting an estimate of this would help.

No tools, just the one between your ears. Focus on the projects that link a static library and find the libraries that are linked into more than one binary. That is the starting point: any function in such a library may be linked in more than once.
Then you can use the linker's /VERBOSE option, which shows which functions are being linked in from the static library. It produces a lot of output, but each line is brief and easy to parse.
As an alternative, consider using the linker's /MAP option to generate a .map file, which shows in detail which functions got linked into the final executable. Having the same function appear in more than one .map file is your lead that it might be beneficial to put it in a DLL instead. Writing a little program in your favorite scripting language that processes the /VERBOSE output or .map files and finds matches is feasible.
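A minimal sketch of such a script, in Python. It assumes that the public-symbol lines of an MSVC .map file look roughly like `0001:00000010 ?foo@@YAHXZ 0000000180001010 f util:foo.obj`, with the static library name before the colon in the last column. That layout, the regular expression, and the function names are assumptions made for illustration, so check them against what your linker actually emits.

```python
import re
from collections import defaultdict

# Assumed line shape: "SSSS:OOOOOOOO  symbol  address  [f] [i]  lib:object"
PUBLIC_LINE = re.compile(
    r"^\s*[0-9A-Fa-f]{4}:[0-9A-Fa-f]{8}\s+(\S+)\s+[0-9A-Fa-f]{8,16}\s+"
    r"(?:[fi]\s+)*(\S+)\s*$"
)


def symbols_from_static_lib(map_text, lib_name):
    """Return the symbols a .map file attributes to objects from lib_name."""
    found = set()
    for line in map_text.splitlines():
        m = PUBLIC_LINE.match(line)
        if m and m.group(2).startswith(lib_name + ":"):
            found.add(m.group(1))
    return found


def duplicated_symbols(maps, lib_name):
    """maps: {module name: .map text}. Return {symbol: modules} for symbols
    from lib_name that were linked into two or more modules."""
    owners = defaultdict(set)
    for module, text in maps.items():
        for sym in symbols_from_static_lib(text, lib_name):
            owners[sym].add(module)
    return {s: sorted(mods) for s, mods in owners.items() if len(mods) > 1}
```

Feeding it the text of every DLL's .map file gives a per-symbol list of the binaries that each carry their own copy.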

Well, on a Unix/Linux/OSX system you'd do something like
for eachfile in *.exe *.dll ; do
nm "$eachfile" | awk '{print $NF}' | sort -u > "$eachfile.symbols.txt"
done
cat *.symbols.txt | sort | uniq -c > count-duplicate-symbols.txt
sort -rn count-duplicate-symbols.txt | less
The first three lines say "Dump the symbols out of each .exe and .dll file in the current directory, keeping only the symbol name (the last column, so that differing addresses do not hide duplicates); store each dump in a separate file. By the way, if the same name appears multiple times in a single file, just store it once."
The line beginning with cat says "Count the number of times each line appears across all the files we just produced. Write a new file named count-duplicate-symbols.txt that contains every distinct line with its count."
The final line says "Sort this file by count (in decreasing order, so the most-duplicated symbols come first), and pipe it to the terminal so I can read it."
If you wanted to see which source files contained the offending duplicate symbols, you could use grep for that.
Notice that this approach probably won't work for static symbols (functions and variables), and it may produce false positives for things like inline functions which are supposed to appear everywhere. You could filter out symbols appearing in linkonce sections, prettyprint the output with c++filt, etc. etc.
Some of these tools are definitely available for Windows. I don't know if they all are.
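Where those shell tools are not available, the same counting can be sketched in Python. This is only an illustration: it assumes you have already captured the plain (non-demangled) text output of nm for each binary, and the function names are made up here. It skips undefined and weak symbols, since weak ones (typically inline functions and templates) are expected to repeat.

```python
from collections import Counter

# Symbol type letters for undefined (U) and weak (w, W, v, V) symbols
SKIPPED_TYPES = set("UwWvV")


def defined_symbols(nm_output):
    """Extract the set of defined, non-weak symbol names from nm text output."""
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        sym_type, name = fields[-2], fields[-1]
        if len(sym_type) == 1 and sym_type not in SKIPPED_TYPES:
            names.add(name)
    return names


def count_duplicates(nm_outputs):
    """nm_outputs: {binary name: nm text}. Return [(symbol, count)] for
    symbols defined in more than one binary, most frequent first."""
    counts = Counter()
    for text in nm_outputs.values():
        counts.update(defined_symbols(text))
    return sorted(((s, n) for s, n in counts.items() if n > 1),
                  key=lambda item: (-item[1], item[0]))
```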

Related

How can I compile c++ to multiple files?

I have a program (cpp) with many classes. Every class is in separate source file (.h + .cpp).
How can I split the compiled program into multiple files (instead of one big executable file)?
Let's say, one file for every class (same as the code structure).
So that every time there is change in a specific class, I compile only that class, and replace the specific compiled file related to that class.
(Something similar to .DLL files in Windows.)
Example from real life:
I am making TUI interface for managing mysql.
I would like to create mysql text editor (TUI) with ncurses.
the code (class) for creating and managing single window object is in
'textWin.cpp' + 'textWin.h'
the code (class) for managing multiple windows, by creating windows objects from previous class is in winMan.cpp winMan.h
the code (class) for managing mysql database is in :
mysql.cpp mysql.h
and so on...
so, I have the following files:
MyProgram.cpp
- winMan.cpp + winMan.h
- textWin.cpp + textWin.h
- mysql.cpp + mysql.h
- ..
- ..
After g++ compilation, I get one executable file, './MyProgram' (size about 15Mb.) which I deliver to all my customers (1000's of them).
I just found a typo in textWin.cpp, fixed it, and told all customers that there is an update... all of them need to download one big 15Mb file, which consumes a lot of bandwidth and server resources for just a small update.
Is there a way to send to all my customers smaller file, that contains only the compiled code for textWin class ?
I use g++ on Centos7

The gcc compiler will happily take a list of cpp files to compile together to make one executable. You don't need to write a "containing" cpp file. However, you still have the issue that each time it rebuilds them all.
The alternative is to build each sourcefile separately to an object file, then link those all together. Hopefully each of those invocations of the compiler will add up to less time than the single command-line. But how to keep track of which cpp files actually need to be rebuilt?
The usual approach is to use a makefile and a make utility which will check the dates of all the mentioned files. There are a variety of flavours of makefile, and helper makefile engines. Download a simple package like gzip and you can quickly get an idea of how the Makefile is structured. Then there is lots of help online, or you may decide that this is just too much trouble for a project with 5 files in it.
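The date check that make performs can be sketched in a few lines of Python. This only illustrates the basic rule (rebuild when the target is missing or older than something it depends on); real make implementations do considerably more, and the function name is made up for this example.

```python
import os


def needs_rebuild(target, prerequisites):
    """Mimic make's basic rule: rebuild if the target is missing or is
    older than any of its prerequisites."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For the files in the question, needs_rebuild("textWin.o", ["textWin.cpp", "textWin.h"]) would be true right after editing textWin.cpp, while the other object files would be left alone.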

As suggested in the comments by @RSahu
Shared libraries (.so files) are the way to split your compiled code.
here is a small example:
https://www.cprogramming.com/tutorial/shared-libraries-linux-gcc.html

Of course, you could put your texts into separate text files and only deploy those when the error is in one of them. For your special use case, where binary differences must be deployed, this question might be helpful: How do I create binary patches?
Another option, do proper versioning. That way, your customers might be able to decide for themselves. That is, if they need this update.

C++ symbol has different size in shared object

I have been working on a cross platform windowing library aimed to be used for OpenGL specifically, currently focusing on linux. I am making use of glload to manage OpenGL extensions, and this is being compiled, along with other libraries that I will use later, into an .so. This .so is being dynamically loaded as you would expect, but at run time the program gives the following output (manually wrapped so it is easier to read):
_dist/x64-linux-debug/bin/test: Symbol `glXCreateContextAttribsARB' has \
different size in shared object, consider re-linking
Now, obviously I have tried re-linking, going as far as rebuilding the entire project many times (testing things out, not just blindly hoping it will magically make it all better). The program does seem to be willing to run as it will produce some logging output as I would expect it to. I have used nm to confirm that the 'symbol' is in the .so
nm _dist/x64-linux-debug/lib64/libvendor.so | grep glXCreateContextAttribsARB
00000000009e0e78 B glXCreateContextAttribsARB
If I use readelf to look at the symbols being defined I get the following (again, I have manually wrapped the first three lines for formatting sake):
readelf -Ws _dist/x64-linux-debug/bin/test \
_dist/x64-linux-debug/lib64/libvendor.so | \
grep glXCreateContextAttribsARB
348: 000000000062b318 8 OBJECT GLOBAL DEFAULT 26 glXCreateContextAttribsARB
421: 000000000062b318 8 OBJECT GLOBAL DEFAULT 26 glXCreateContextAttribsARB
1370: 00000000009e0e78 8 OBJECT GLOBAL DEFAULT 25 glXCreateContextAttribsARB
17464: 00000000009e0e78 8 OBJECT GLOBAL DEFAULT 25 glXCreateContextAttribsARB
I am afraid that this is about all I can offer to help, as I really do not know what to try or look into. Like I said, I am sure more info will be needed, so please just say and I will provide what I can. I am running these commands from my project root, in case you are wondering.

wilsonmichaelpatrick's answer is mostly correct, but using gdb is likely not the fastest way to find the problem, and will likely not work at all if you have a non-debug build.
First, you should confirm that there in fact is a problem:
readelf -Ws _dist/x64-linux-debug/bin/test _dist/x64-linux-debug/lib64/libvendor.so |
grep glXCreateContextAttribsARB
This should show the symbol being defined in test and libvendor.so, with different size.
Second, re-link test and libvendor.so with -Wl,-y,glXCreateContextAttribsARB flag. That will tell you which object files (or libraries) provide the (different) definitions.
Finally, preprocess the sources that produce above object files with -E and -dD flags, and see what's different between them.
Update:
I need help digesting what it is saying
Don't be helpless. Read man readelf, or just run it by hand. You'll see something like this:
readelf -Ws /bin/date | head -5
Symbol table '.dynsym' contains 75 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FUNC GLOBAL DEFAULT UND __ctype_toupper_loc@GLIBC_2.3 (2)
This tells you the meaning of the data you've got. In particular, this tells you that the size of the symbol in test and in libvendor.so is the same (8). Therefore, the problem is not in these two ELF files, but somewhere else. Run readelf on your other libraries, and look for definition of glXCreateContextAttribsARB that has a different size. Then follow the rest of the procedure.
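Running readelf over many libraries and comparing the Size column by eye gets tedious, so here is a small Python sketch of that comparison. It assumes you have captured the text of readelf -Ws for each file and that the columns are laid out as in the header shown above (Num, Value, Size, Type, Bind, Vis, Ndx, Name); the function names are hypothetical.

```python
def symbol_sizes(readelf_output, symbol):
    """Return the set of sizes that readelf -Ws text reports for defined
    entries of `symbol`."""
    sizes = set()
    for line in readelf_output.splitlines():
        fields = line.split()
        # Expected columns: Num: Value Size Type Bind Vis Ndx Name
        if len(fields) < 8 or not fields[0].rstrip(":").isdigit():
            continue
        size, ndx, name = fields[2], fields[6], fields[7]
        if ndx == "UND":
            continue  # undefined references carry no meaningful size
        if name.split("@")[0] == symbol:
            # readelf may print very large sizes in hex with a 0x prefix
            sizes.add(int(size, 16) if size.startswith("0x") else int(size))
    return sizes


def mismatched(outputs, symbol):
    """outputs: {file name: readelf text}. Return {file: sizes} when the
    files disagree about the symbol's size, else an empty dict."""
    per_file = {name: symbol_sizes(text, symbol) for name, text in outputs.items()}
    distinct = set().union(*per_file.values()) if per_file else set()
    return per_file if len(distinct) > 1 else {}
```

An empty result for test and libvendor.so, as in the question, means the conflicting definition lives in some other library that still has to be added to the comparison.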

The runtime is noticing that glXCreateContextAttribsARB as compiled in the shared object, and glXCreateContextAttribsARB as compiled in the main program (or maybe even some other shared object previously linked) have different sizes. This means that, in the separate builds for the shared object and whatever else references that object, they must be looking at different code (probably in a shared object) where this is defined. Sometimes this occurs because they are looking at different files, sometimes this occurs because of different #defines causing different interpretations of the same file. Whatever the reason, you absolutely need to make sure that the same symbol (e.g. a structure) is defined the same way (i.e. with the same member variables and size) across everything that is linked together at runtime.
It's actually a very good thing that it is refusing to run, as this is a catastrophe when two parts of the code interpret the same bit of memory in different ways at runtime. (Not too much of an exaggeration to say anything could happen if this was allowed to proceed.)
You might want to try just loading up the executable in gdb (without running it) and typing
info types
to see where it is defined, and then load the shared object in gdb (without running it) and doing another info types there to see what each of them thinks it's looking at. If it's the same thing, check the preprocessor directives.

I have faced a tedious issue related to objects of different sizes, so I want to share my experience, even though it is clear to me that it is only one reason that might explain different object sizes, and not necessarily the OP's.
The symptoms were objects of different sizes in debug mode, none in release mode. The linker produced the corresponding warnings. The symbol names were hard to decipher but related to some unnamed static variables in instances of class templates.
The reason was the debug logging feature à la LOG("Do something.");. The LOG macro used the ANSI C macro __FILE__, which expanded to a different path depending on whether the header was included by the application or by the shared library. And this string was exactly the aforementioned unnamed static variable.
Even more tedious was the fact that, due to our make environment, the __FILE__ macro sometimes expanded to, let's say, C:\temp\file.h and sometimes to C:\other\..\temp\file.h, so that building the application and the library from the same place didn't solve the problem either.
I hope this piece of experience might save some of you some time.
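To make the effect concrete, here is a tiny Python sketch using the two example paths from above. It only illustrates the arithmetic: both spellings name the same file, yet a string literal holding one is longer than a literal holding the other, which is what makes the static variable's size differ. The helper names are made up.

```python
import ntpath


def literal_size(path):
    """Bytes a C string literal holding `path` would occupy, including the
    terminating NUL (assuming a single-byte-per-character path)."""
    return len(path.encode("utf-8")) + 1


def same_file(path_a, path_b):
    """True if two Windows-style paths normalize to the same spelling."""
    return ntpath.normpath(path_a) == ntpath.normpath(path_b)


app_view = r"C:\temp\file.h"
lib_view = r"C:\other\..\temp\file.h"
print(same_file(app_view, lib_view))
print(literal_size(app_view), literal_size(lib_view))
```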

In most cases you're probably just linking against the wrong library (a different version). For example, you have libfoo installed twice and link your executable with -L /path/to/version1 -lfoo but during runtime you link with /path/to/version2 (you can see this one with ldd yourprogram).
One reason could be that the executable was linked with -Wl,-rpath,/path/to/version1 but (as recent linker versions do) this set the RUNPATH entry in the dynamic section, while you have LD_LIBRARY_PATH=/path/to/version2. When RUNPATH is set, LD_LIBRARY_PATH takes precedence. In this case delete the library from /path/to/version2 (or remove that path from LD_LIBRARY_PATH).
EXAMPLE
$ minimal
/home/carlo/minimal: Symbol `_ZN6libcwd8libcw_doE' has different size in shared object, consider re-linking
COREDUMP : /home/carlo/projects/libcwd/libcwd/elfxx.cc:2381: void libcwd::elfxx::objfile_ct::load_dwarf(): Assertion `size == sizeof(address)' failed.
(libcwd is smart enough to see it too; aka the problem here is with libcwd):
$ ldd minimal | grep libcwd_r
libcwd_r.so.5 => /usr/local/install/6.0.0-1ubuntu2/lib/libcwd_r.so.5 (0x00007f0b69840000)
$ echo $LD_LIBRARY_PATH
/usr/local/install/6.0.0-1ubuntu2/lib
$ objdump -a -x minimal | grep PATH
RUNPATH /opt/gitache/libcwd_r/888f62c44fd64f1486176bf9e35b36f79612790017c31f95e117fc59743a54ca/lib
Unsetting LD_LIBRARY_PATH or removing libcwd from that path results in
$ unset LD_LIBRARY_PATH
$ ldd minimal | grep libcwd_r
libcwd_r.so.5 => /opt/gitache/libcwd_r/888f62c44fd64f1486176bf9e35b36f79612790017c31f95e117fc59743a54ca/lib/libcwd_r.so.5 (0x00007f11d7298000)
and things work again. Or alternatively I could add to my CMakeLists.txt of the project:
set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -Wl,--disable-new-dtags")
After which we get,
$ objdump -a -x minimal | grep PATH
RPATH /opt/gitache/libcwd_r/888f62c44fd64f1486176bf9e35b36f79612790017c31f95e117fc59743a54ca/lib
which now has precedence over LD_LIBRARY_PATH and therefore also solves the issue. This is not the recommended way however: if you set LD_LIBRARY_PATH you should know what you are doing. If that doesn't work, you should fix LD_LIBRARY_PATH or remove the offending library.
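The precedence rules at play can be summarized in a small Python model. This is a simplified sketch of the search order documented for the glibc dynamic loader, not the loader itself: it leaves out the ld.so cache, secure-execution mode, and the fact that RUNPATH applies only to an object's direct dependencies. The function name and the default directories listed are illustrative.

```python
def library_search_order(rpath=(), runpath=(), ld_library_path=()):
    """Return the directories in the order the loader would try them
    (simplified model)."""
    order = []
    if rpath and not runpath:
        # DT_RPATH is only honoured when DT_RUNPATH is absent
        order.extend(rpath)
    order.extend(ld_library_path)  # environment variable comes next
    order.extend(runpath)          # DT_RUNPATH is searched after it
    order.extend(["/lib", "/usr/lib"])  # simplified default directories
    return order


# With RUNPATH set, a directory in LD_LIBRARY_PATH is tried first:
print(library_search_order(runpath=["/opt/v1/lib"],
                           ld_library_path=["/usr/local/v2/lib"]))
# With RPATH (e.g. after --disable-new-dtags), the embedded path wins:
print(library_search_order(rpath=["/opt/v1/lib"],
                           ld_library_path=["/usr/local/v2/lib"]))
```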

C++ binary identification (manifest)

We have a large set of C++ projects (GCC, Linux, mostly static libraries) with many dependencies between them. Then we compile an executable using these libraries and deploy the binary on the front-end. It would be extremely useful to be able to identify that binary. Ideally what we would like to have is a small script that would retrieve the following information directly from the binary:
$ident binary
$binary : Product=PRODUCT_NAME;Version=0.0.1;Build=xxx;User=xxx...
$ dependency: Product=PRODUCT_NAME1;Version=0.1.1;Build=xxx;User=xxx...
$ dependency: Product=PRODUCT_NAME2;Version=1.0.1;Build=xxx;User=xxx...
So it should display all the information for the binary itself and for all of its dependencies.
Currently our approach is:
During compilation for each product we generate Manifest.h and Manifest.cpp and then inject Manifest.o into binary
ident script parses target binary, finds generated stuff there and prints this information
However, this approach is not always reliable across different versions of gcc.
I would like to ask SO community - is there better approach to solve this problem?
Thanks for any advice

One of the catches with storing data in source code (your Manifest.h and .cpp) is the size limit for literal data, which is dependent on the compiler.
My suggestion is to use ld. It allows you to store arbitrary binary data in your ELF file (so does objcopy). If you prefer to write your own solution, have a look at libbfd.
Let us say we have a hello.cpp containing the usual C++ "Hello world" example. Now we have the following make file (GNUmakefile):
hello: hello.o hello.om
	$(LINK.cpp) $^ $(LOADLIBES) $(LDLIBS) -o $@
%.om: %.manifest
	ld -r -b binary -o $@ $<
%.manifest:
	echo "$@" > $@
What I'm doing here is to separate out the linking stage, because I want the manifest (after conversion to ELF object format) linked into the binary as well. Since I am using pattern rules this is one way to go; others are certainly possible, including a better naming scheme for the manifests where they also end up as .o files and GNU make can figure out how to create those. Here I'm being explicit about the recipe. So we have .om files, which are the manifests (arbitrary binary data) converted to ELF objects, created from .manifest files. The recipe for creating the .manifest itself simply writes a string (the target's own name) into the file.
Obviously the tricky part in your case isn't storing the manifest data, but rather generating it. And frankly I know too little about your build system to even attempt to suggest a recipe for the .manifest generation.
Whatever you throw into your .manifest file should probably be some structured text that can be interpreted by the script you mention or that can even be output by the binary itself if you implement a command line switch (and disregard .so files and .so files hacked into behaving like ordinary executables when run from the shell).
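As one possible shape for that structured text, here is a hedged Python sketch that writes and reads the Key=Value;Key=Value layout shown in the question. The function names are made up, and the restriction that keys and values contain no '=', ';' or newline is an assumption chosen to keep parsing trivial.

```python
def format_manifest(fields):
    """Serialize a dict such as {"Product": "X", "Version": "0.0.1"} into
    'Product=X;Version=0.0.1'."""
    for key, value in fields.items():
        if any(c in str(key) + str(value) for c in "=;\n"):
            raise ValueError(f"unsupported character in {key!r}={value!r}")
    return ";".join(f"{k}={v}" for k, v in fields.items())


def parse_manifest(text):
    """Inverse of format_manifest: turn 'A=1;B=2' back into a dict."""
    return dict(item.split("=", 1) for item in text.strip().split(";") if item)
```

A build step could write format_manifest(...) into the .manifest file, and the ident script could call parse_manifest on whatever it extracts from the binary.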
The above make file doesn't take into account the dependencies - or rather it doesn't help you create the dependency list in any way. You can probably coerce GNU make into helping you with that if you express your dependencies clearly for each goal (i.e. the static libraries etc). But it may not be worth it to take that route ...
Also look at:
C/C++ with GCC: Statically add resource files to executable/library and
Is there a Linux equivalent of Windows' "resource files"?
If you want particular names for the symbols generated from the data (in your case the manifest), you need to use a slightly different route and use the method described by John Ripley here.
How to access the symbols? Easy. Declare them as external (C linkage!) data and then use them:
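#include <cstddef>
#include <cstdio>
extern "C" char _binary_hello_manifest_start;
extern "C" char _binary_hello_manifest_end;
int main(int argc, char** argv)
{
    // Distance between the two linker-generated symbols is the data length
    const std::ptrdiff_t len = &_binary_hello_manifest_end - &_binary_hello_manifest_start;
    // %.*s prints at most len bytes; the data need not be NUL-terminated
    printf("Hello world: %.*s\n", (int)len, &_binary_hello_manifest_start);
}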
The symbols are the exact characters/bytes. You could also declare them as char[], but it would result in problems down the road. E.g. for the printf call.
The reason I am calculating the size myself is because a.) I don't know whether the buffer is guaranteed to be zero-terminated and b.) I didn't find any documentation on interfacing with the *_size variable.
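Side-note: a * in a printf format specification tells printf to read that number from the argument list and then pick the next argument as the string to print. To cap the number of characters printed it has to be used as the precision (%.*s); a bare %*s only sets the minimum field width and still reads up to the terminating zero.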

You can insert any data you like into a .comment section in your output binary. You can do this with the linker after the fact, but it's probably easier to place it in your C++ code like this:
asm (".section .comment.manifest\n\t"
".string \"hello, this is a comment\"\n\t"
".section .text");
int main() {
....
The asm statement should go outside any function, in this instance. This should work as long as your compiler puts normal functions in the .text section. If it doesn't then you should make the obvious substitution.
The linker should gather all the .comment.manifest sections into one blob in the final binary. You can extract them from any .o or executable with this:
objdump -j .comment.manifest -s example.o

Have you thought about using the standard packaging system of your distro? In our company we have thousands of packages and hundreds of them are automatically deployed every day.
We are using Debian packages that contain all the necessary information:
- Full changelog that includes:
  - authors;
  - versions;
  - short descriptions and timestamps of changes.
- Dependency information:
  - a list of all packages that must be installed for the current one to work correctly.
- Installation scripts that set up the environment for a package.
I think you may not need to create manifests in your own way, since a ready-made solution already exists. You can have a look at debian package HowTo here.

How to bundle C/C++ code with C-shell-script?

I have a C shell script that calls two C programs, one after the other, with some file handling before, in between and afterwards.
Now, as such I have three different files - one C shell script and 2 .c files.
I need to give this script to other users. The problem is that I have to distribute three files - which the users must keep in the same folder and then execute the script.
Is there some better way to do this?
[I know I can make one C code file out of those two... but I will still be left with a shell script and a C code. Actually, the two C codes do entirely different things... so I want them to be separate]

Sounds like you're worried that your users aren't savvy enough to figure out how to resolve issues like command not found errors and the like. If you absolutely MUST hide the "complexity" of a collection of files, you could have your script create the other files. In most other circumstances I would suggest that this approach is only going to increase your support workload, since semi-experienced users are less likely to know how to troubleshoot the process.
If you choose to rely on the presence of a compiler on the system that you are running on, you can store the C code as a collection of echo "$STRING" >> file.c commands to create your two C files, which you then compile and use.
If you would rather use pre-compiled programs, the same basic process can be used, except that you use xxd both to generate the strings in your script and to reverse the conversion to give you working binaries. Note: Remember to chmod the binary so that it is executable.

Use the shar command to create a self-extracting archive,
or better yet use unzipsfx with the AUTORUN option.
This provides users with ONE file, and only ONE command to execute (as opposed to one for untarring and one for execution).
NOTE: The unzip command to run should use "-n" option, that way only the first run would extract the files and the subsequent would skip the extraction.

Use a zip or tar file? And you do realize that .c files aren't executable, you need to compile & link them first?

You can include the c code inside the shell script as a here document:
#!/bin/bash
# Quoting the delimiter ('EOF') stops the shell from expanding $ and backticks in the C source
cat > code.c << 'EOF'
line #1
line #2
...
EOF
# compile
# execute
If you want to get fancy, you can test for the existence of the executables and skip compiling them if they exist.
If you are doing much shell programming, the rest of the Advanced Bash-Scripting Guide is worth looking at as well.
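If the C sources change often, pasting them into the script by hand gets error-prone. A hedged Python sketch of a generator that builds such a single-file script is shown here; the function name, the marker naming scheme, and the use of /bin/sh are choices made for this example only.

```python
import shlex


def bundle_script(sources, commands):
    """sources: {file name: C source text}; commands: shell lines to run
    after the sources have been written out. Returns the script text."""
    lines = ["#!/bin/sh", "set -e"]
    for i, (name, text) in enumerate(sources.items()):
        marker = f"EOF_SRC_{i}"
        if marker in text:
            raise ValueError(f"{name} contains the heredoc marker {marker}")
        # Quoted marker: the shell copies the body verbatim, no expansion
        lines.append(f"cat > {shlex.quote(name)} << '{marker}'")
        lines.append(text.rstrip("\n"))
        lines.append(marker)
    lines.extend(commands)
    return "\n".join(lines) + "\n"
```

The commands list would hold the compile steps and the original file handling, so users receive one file and run one command.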

Determining what object files have caused .dll size increase [C++]

I'm working on a large C++ library that has grown by a significant amount recently. Due to its size, it is not obvious what has caused this size increase.
Do you have any suggestions of tools (msvc or gcc) that could help determine where the growth has come from.
edit
Things I've tried: running dumpbin on the final DLL and on the obj files, creating a map file and ripping through it.
edit again
So objdump along with a python script seems to have done what I want.

If gcc, objdump. If visual studio, dumpbin.
I'd suggest doing a diff of the output of the tool for the old (small) library, vs. the new (large) library.
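A plain diff of those dumps can be noisy because addresses shift. One way to narrow it down, sketched in Python below, is to compare per-symbol sizes instead. The sketch assumes text captured from nm -S (GNU binutils) for the old and new library, where each sized symbol appears as "address size type name" with the size in hex; the function names are hypothetical.

```python
def parse_nm_sizes(nm_output):
    """Parse nm -S text ('address size type name') into {name: size}."""
    sizes = {}
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # symbols without a size have fewer columns
        try:
            sizes[fields[3]] = int(fields[1], 16)
        except ValueError:
            continue
    return sizes


def growth(old_output, new_output):
    """Return [(symbol, size delta in bytes)] for every symbol whose size
    changed, largest increase first. New symbols count from zero."""
    old, new = parse_nm_sizes(old_output), parse_nm_sizes(new_output)
    deltas = {n: new.get(n, 0) - old.get(n, 0) for n in set(old) | set(new)}
    return sorted(((n, d) for n, d in deltas.items() if d != 0),
                  key=lambda item: (-item[1], item[0]))
```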

keysersoze's answer (compare the output of objdump or dumpbin) is correct. Another approach is to tell the linker to produce a map file, and compare the map files for the old and new versions of the DLL.
MSVC: link.exe /MAP
GCC and binutils: ld -M (or gcc -Wl,-M)

On Linux it should be quite easy to see if new files have been added with a recursive diff. They would certainly create an increase in the library size. You can then go and use the size command line tool on Linux to get the sizes of each of the new object files and sum them up. Then compare that sum to your library increase and check how much it differs.
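The summing step can be sketched in Python as follows. It assumes the default (Berkeley) layout of the size tool, with columns text, data, bss, dec, hex and filename, and it counts only text and data because bss takes up no space in the file. The function names are made up for illustration.

```python
def parse_size_output(size_output):
    """Parse Berkeley-format `size` output into {file name: text + data}."""
    totals = {}
    for line in size_output.splitlines():
        fields = line.split()
        # The header line starts with the word "text" and is skipped here
        if len(fields) >= 6 and fields[0].isdigit():
            text, data = int(fields[0]), int(fields[1])
            totals[fields[5]] = text + data
    return totals


def unexplained_growth(new_objects_size_output, library_growth):
    """Library growth in bytes not accounted for by the new object files."""
    explained = sum(parse_size_output(new_objects_size_output).values())
    return library_growth - explained
```

A large remainder suggests the growth comes from changes to existing objects or from build options rather than from the newly added files.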

G'day,
If you have any previous versions of the object file lying around, can you run the size command to see which segment has grown?
Couple of questions:
Are you on a *nix platform or a Windows platform?
Which compiler are you using?
Was the compiler recently changed?
Was the -g flag added recently? (obvious question 1)
Was the object previously stripped? (obvious question 2)
Was the object dynamically linked previously? (obvious question 3)
Edit: If the code is under SCM, can you check out a version of the source that gave you the smaller object. Then compare:
the size of the source trees by doing a du -sk on the old source tree and then the new source tree without having built anything.
the number of files by doing something like find ./tree_top \( -name '*.h' -o -name '*.cpp' \) | wc -l
the location of an increased number of files by doing a find ./tree_top \( -name '*.h' -o -name '*.cpp' \) -print | sort > treelist and then do the same for the new larger tree. Doing a simple sdiff will show any large number of new files.
the size of code base, even a simple count of trailing semi-colons will give you a good basic mechanism for comparison between the two.
the Makefiles or build env. for the project to see if different options or settings have crept in to the build itself.
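The file count and the semicolon count from the list above can be gathered in one pass. Here is a hedged Python sketch, with a made-up function name, that treats any line ending in ';' as one statement, which is only a rough proxy for code size.

```python
import os


def count_code(tree_top, extensions=(".h", ".cpp")):
    """Return (number of matching files, number of lines ending in ';')
    found under tree_top."""
    files = 0
    semicolons = 0
    for root, _dirs, names in os.walk(tree_top):
        for name in names:
            if name.endswith(extensions):
                files += 1
                path = os.path.join(root, name)
                with open(path, errors="replace") as fh:
                    semicolons += sum(1 for line in fh
                                      if line.rstrip().endswith(";"))
    return files, semicolons
```

Running it on the old and the new checkout and comparing the two tuples gives a quick first impression of how much the code base itself has grown.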
HTH
BTW Please post your findings here as I'm sure many people are interested in what you find out.
cheers,
