Determining which object files have caused a .dll size increase [C++]

I'm working on a large C++ library that has grown by a significant amount recently. Due to its size, it is not obvious what has caused this growth.
Do you have any suggestions for tools (MSVC or GCC) that could help determine where the growth has come from?
Edit:
Things I've tried: dumpbin on the final DLL and on the .obj files, and creating a map file and ripping through it.
Edit again:
So objdump along with a Python script seems to have done what I want.

If GCC, objdump. If Visual Studio, dumpbin.
I'd suggest doing a diff of the output of the tool for the old (small) library vs. the new (large) library.
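For illustration, here is a rough sketch of that diff in script form (along the lines of the objdump-plus-Python-script route the asker ended up with). It assumes GNU binutils' size is on your PATH and that you still have the .o files from both the old and the new build; the directory arguments are whatever your build produces.
#!/usr/bin/env python3
# Compare per-object sizes between two build trees and print the deltas.
import subprocess
import sys
from pathlib import Path

def object_sizes(build_dir):
    """Map object file name -> total size (the 'dec' column of `size`)."""
    sizes = {}
    for obj in Path(build_dir).rglob("*.o"):
        out = subprocess.run(["size", str(obj)],
                             capture_output=True, text=True, check=True).stdout
        # Second line of `size` output: "   text    data     bss     dec     hex filename"
        sizes[obj.name] = int(out.splitlines()[1].split()[3])
    return sizes

if __name__ == "__main__":
    old, new = object_sizes(sys.argv[1]), object_sizes(sys.argv[2])
    deltas = {name: new.get(name, 0) - old.get(name, 0) for name in set(old) | set(new)}
    for name, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
        if delta:
            print(f"{delta:+10d}  {name}")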

keysersoze's answer (compare the output of objdump or dumpbin) is correct. Another approach is to tell the linker to produce a map file, and compare the map files for the old and new versions of the DLL.
MSVC: link.exe /MAP
GCC and binutils: ld -M (or gcc -Wl,-M); use -Wl,-Map=output.map to write the map to a file instead of stdout.

On Linux it should be quite easy to see whether new files have been added with a recursive diff; they would certainly account for an increase in the library size. You can then use the size command-line tool on Linux to get the size of each of the new object files and sum them up, then compare that sum to the library's increase and see how much it differs.

G'day,
If you have any previous versions of the object file lying around, can you run the size command to see which segment has grown?
A couple of questions:
Are you on a *nix platform or a Windows platform?
Which compiler are you using?
Was the compiler recently changed?
Was the -g flag added recently? (obvious question 1)
Was the object previously stripped? (obvious question 2)
Was the object dynamically linked previously? (obvious question 3)
Edit: If the code is under SCM, can you check out a version of the source that gave you the smaller object? Then compare:
the size of the source trees by doing a du -sk on the old source tree and then on the new source tree, without having built anything.
the number of files by doing something like find ./tree_top \( -name '*.h' -o -name '*.cpp' \) | wc -l
the location of an increased number of files by doing find ./tree_top \( -name '*.h' -o -name '*.cpp' \) -print | sort > treelist and then doing the same for the new, larger tree. A simple sdiff will show any large number of new files.
the size of the code base; even a simple count of trailing semicolons will give you a basic mechanism for comparing the two.
the Makefiles or build environment for the project, to see whether different options or settings have crept into the build itself.
HTH
BTW, please post your findings here, as I'm sure many people are interested in what you find out.
cheers,

Related

SQLite3: increase max columns

I am thinking about using SQLite3, so I started some performance tests.
Because I've got a table with many columns, I want to increase SQLite3's max column limit.
As I read on the SQLite website, the limit is set to 2000 columns by default, and I want to increase it.
I wanted to use sqlite3_limit(), but because I use SQLAPI I cannot use this function.
I've read on some website that I need to change something "inside" SQLite and then recompile it, but I haven't really understood this part, so I wanted to ask:
Is there another way to increase the max column count?
My program runs on Windows 10 and is written in C++.
There is no other way than setting the number of columns you want at compile time.
Luckily, it is not hard.
1) Download the amalgamation source code from the SQLite download page; it is named sqlite-amalgamation-*.zip.
2) Unzip it and change directory into the unzipped folder. You will see sqlite3.c and sqlite3.h among other files.
3) Issue this command to create an object file with the desired number of columns: gcc -I. -O3 -DSQLITE_MAX_COLUMN=5000 -c sqlite3.c -o sqlite3.o. Change 5000 to the number you desire. Note that, according to the comments in sqlite3.c, you shouldn't use a number greater than 32767.
4) Create a static library to link against: ar rcs libsqlite3.a sqlite3.o
Now you can link against the library when compiling your program using -lsqlite3, after placing the library in a location that your linker is aware of (or using -L/path/to/library).
Alternative to steps 3) and 4).
If you don't want to actually create the static library but you just want to drop the sqlite3.h and sqlite3.c in your source code (something that SQLite itself suggests doing with their amalgamation files), then open sqlite3.c with your favourite editor, look for SQLITE_MAX_COLUMN and modify the value in there.
I used gcc via MinGW to do this. The same steps apply to the tools offered by MSVC, just changing the commands for compilation and library creation as appropriate.
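If you would rather script the "edit sqlite3.c directly" alternative than do the edit by hand, a small sketch like the one below works; the regex is deliberately loose because the exact whitespace around the #define can vary between amalgamation releases.
#!/usr/bin/env python3
# Bump the default SQLITE_MAX_COLUMN inside the amalgamation's sqlite3.c.
import re
import sys

def patch_max_column(path, new_limit):
    text = open(path, encoding="utf-8").read()
    patched, count = re.subn(r"(#\s*define\s+SQLITE_MAX_COLUMN\s+)\d+",
                             lambda m: m.group(1) + str(new_limit), text)
    if count == 0:
        sys.exit("SQLITE_MAX_COLUMN definition not found")
    open(path, "w", encoding="utf-8").write(patched)
    print(f"patched {count} definition(s) to {new_limit}")

if __name__ == "__main__":
    patch_max_column("sqlite3.c", int(sys.argv[1]))  # e.g. python patch_columns.py 5000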

How to determine which .so library contains a given C function?

I have this problem all the time in Linux programming. Since all the manuals and almost all the source code for Linux are C-centric, any reference to some function needs only an include <something.h> line and the function is accessible from C/C++ code.
But I am programming in assembly language and know almost nothing about C/C++.
In order to be able to call some function, I have to import it from the corresponding .so library.
How do I determine the file name of the library? It often differs from the name of the library itself and is not specified in the manuals.
For example, the file name of Xlib is actually libX11.so.6. The XShm extension library seems to be libXext.so.6.
Is there an easy way to determine the secret real name of the library, using the provided C manuals and references?
This is another not-100%-accurate method that may give you some ideas as to how you can narrow things down a bit. It doesn't exactly fit the question because it uses common Linux utilities instead of man pages, but it may still be helpful.
Use your distribution's package management software.
For example, on Arch Linux, if you were interested in a function in GLFW/glfw3.h, you could find out who owns that file:
$ pacman -Qo /usr/include/GLFW/glfw3.h
/usr/include/GLFW/glfw3.h is owned by glfw 3.1-1
Find out which .so files are in that package:
$ pacman -Ql glfw | grep 'so$'
glfw /usr/lib/libglfw.so
And, if needed, find the actual file that link points to:
$ readlink -f /usr/lib/libglfw.so
/usr/lib/libglfw.so.3.1
This will depend on your distribution. I believe on Ubuntu/Debian you'd use dpkg-query instead.
Edit: DevSolar points out in a comment that you can use apt-file search <header> and apt-file list <package> instead of dpkg-query -S <header> and dpkg-query -L <package>. apt-file appears to work even for packages that aren't installed (though it seems slower?).
I also noticed (on my Ubuntu VM at least) that, e.g., libglfw-dev contains the libglfw.so symlink, while libglfw2 contains the actual libglfw.so.2 object.
Once you have a set of .so files, you can check them for whatever function you are interested in:
$ nm -D /usr/lib/libglfw.so | grep "glfwCreateWindow"
0000000000007cd0 T glfwCreateWindow
Note that I pulled this last step from a comment on the previous question and don't fully understand it. Maybe you could even skip the earlier steps and rely on nm and grep alone?
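For what it's worth, here is a rough sketch of that "nm and grep alone" idea: walk a library directory, run nm -D --defined-only on every shared object, and report the ones that define the symbol. The default directory and the example symbol are just placeholders, and scanning everything is slow.
#!/usr/bin/env python3
# Find shared libraries whose dynamic symbol table defines a given symbol.
import os
import subprocess
import sys

def libraries_defining(symbol, libdir="/usr/lib"):
    hits = []
    for root, _dirs, files in os.walk(libdir):
        for name in files:
            if ".so" not in name:
                continue
            lib = os.path.join(root, name)
            out = subprocess.run(["nm", "-D", "--defined-only", lib],
                                 capture_output=True, text=True).stdout
            for line in out.splitlines():
                fields = line.split()
                # nm lines look like: "0000000000007cd0 T glfwCreateWindow[@VERSION]"
                if fields and fields[-1].split("@")[0] == symbol:
                    hits.append(lib)
                    break
    return hits

if __name__ == "__main__":
    symbol = sys.argv[1] if len(sys.argv) > 1 else "glfwCreateWindow"
    print("\n".join(libraries_defining(symbol)))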
This is not a surefire way, but it can help in many cases.
Basically, you can usually find the library name at the bottom of the man page.
Eg, man XCreateWindow says libX11 on the last line. Then you look for libX11.so and use nm or readelf to see all exported functions.
Another example, man XShm says libXext at the bottom. And so on.
UPDATE
If the function is in section (2) of the man pages, it's a system call (see man man) and is provided by glibc, which would be libc-2.??.so.
Lastly (thanks Basile), if the man page for the function does not mention a library, the function is also most likely provided by glibc.
DISCLAIMER: Again this is not a 100% accurate method -- but it should help in most cases.
You can ask gcc to tell you which file it would use for linking like so:
gcc --print-file-name=libX11.so
Sample output:
/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../x86_64-linux-gnu/libX11.so
This file will usually be a symlink, so you'll have to pipe it through readlink or realpath to get the actual file. For example:
readlink -f $(gcc --print-file-name=libXext.so)
Sample output:
/usr/lib/x86_64-linux-gnu/libXext.so.6.4.0
As I commented, you could use gcc to link your program, and then it should be able to accept -lX11; by using gcc -v instead of gcc you'll find out what is actually linked and how.
However, you have a much more significant issue than finding the lib*.so.*; most C or C++ APIs are described in header files, and these C or C++ header files also contain symbolic constants (like O_RDONLY for open(2)...) or macros (like WIFEXITED in POSIX wait ...) whose value or expansion you have to find manually in header files or documentation. (Quite often, such constants are either preprocessor #define-d constants or enum values.) Also, some headers - in particular in C++ - contain a lot of inline-d functions (or macros)!
A possible way might be to generate some C files to find all these constants, enums, macros, inlined functions..., and/or to customize the GCC compiler (e.g. with MELT ...) to find them.
So my message is that for better or worse, the C language is deeply tied to Linux & POSIX.
You might restrict yourself to using only syscalls(2) from your assembler code. Then you won't use libX11 and you won't need any header or constant (except the ones for syscalls, starting from <asm/unistd.h>).
BTW, in 2015, coding entirely in assembler for performance reasons is a mistake. The compiler generates better code than you reasonably can (as soon as you have more than a few hundred machine instructions). In practice, you can code in assembler with GCC by using extended asm instructions in your C functions.
Or are you building your own compiler? Then you should have said so in your question!
Read also the Program Library HowTo & the Linux Assembly HowTo

C++ binary identification (manifest)

We have a large set of C++ projects (GCC, Linux, mostly static libraries) with many dependencies between them. Then we compile an executable using these libraries and deploy the binary on the front-end. It would be extremely useful to be able to identify that binary. Ideally what we would like to have is a small script that would retrieve the following information directly from the binary:
$ident binary
$binary : Product=PRODUCT_NAME;Version=0.0.1;Build=xxx;User=xxx...
$ dependency: Product=PRODUCT_NAME1;Version=0.1.1;Build=xxx;User=xxx...
$ dependency: Product=PRODUCT_NAME2;Version=1.0.1;Build=xxx;User=xxx...
So it should display all the information for the binary itself and for all of its dependencies.
Currently our approach is:
During compilation, for each product we generate Manifest.h and Manifest.cpp and then inject Manifest.o into the binary.
The ident script parses the target binary, finds the generated data there, and prints this information.
However, this approach is not always reliable across different versions of gcc.
I would like to ask the SO community: is there a better approach to this problem?
Thanks for any advice
One of the catches with storing data in source code (your Manifest.h and .cpp) is the size limit for literal data, which is dependent on the compiler.
My suggestion is to use ld. It allows you to store arbitrary binary data in your ELF file (so does objcopy). If you prefer to write your own solution, have a look at libbfd.
Let us say we have a hello.cpp containing the usual C++ "Hello world" example. Now we have the following make file (GNUmakefile):
hello: hello.o hello.om
	$(LINK.cpp) $^ $(LOADLIBES) $(LDLIBS) -o $@
%.om: %.manifest
	ld -b binary -o $@ $<
%.manifest:
	echo "$@" > $@
What I'm doing here is separating out the linking stage, because I want the manifest (after conversion to ELF object format) linked into the binary as well. Since I am using pattern rules this is one way to go; others are certainly possible, including a better naming scheme for the manifests where they also end up as .o files and GNU make can figure out how to create them. Here I'm being explicit about the recipe. So we have .om files, which are the manifests (arbitrary binary data), created from .manifest files. That recipe converts the binary input into an ELF object. The recipe for creating the .manifest itself simply pipes a string into the file.
Obviously the tricky part in your case isn't storing the manifest data, but rather generating it. And frankly I know too little about your build system to even attempt to suggest a recipe for the .manifest generation.
Whatever you throw into your .manifest file should probably be some structured text that can be interpreted by the script you mention or that can even be output by the binary itself if you implement a command line switch (and disregard .so files and .so files hacked into behaving like ordinary executables when run from the shell).
The above make file doesn't take into account the dependencies - or rather it doesn't help you create the dependency list in any way. You can probably coerce GNU make into helping you with that if you express your dependencies clearly for each goal (i.e. the static libraries etc). But it may not be worth it to take that route ...
Also look at:
C/C++ with GCC: Statically add resource files to executable/library and
Is there a Linux equivalent of Windows' "resource files"?
If you want particular names for the symbols generated from the data (in your case the manifest), you need to use a slightly different route and use the method described by John Ripley here.
How to access the symbols? Easy. Declare them as external (C linkage!) data and then use them:
#include <cstddef>   // for ptrdiff_t
#include <cstdio>

extern "C" char _binary_hello_manifest_start;
extern "C" char _binary_hello_manifest_end;

int main()
{
    const ptrdiff_t len = &_binary_hello_manifest_end - &_binary_hello_manifest_start;
    // %.*s prints at most len characters, so the data need not be zero-terminated
    printf("Hello world: %.*s\n", (int)len, &_binary_hello_manifest_start);
}
The symbols are the exact characters/bytes. You could also declare them as char[], but it would result in problems down the road. E.g. for the printf call.
The reason I am calculating the size myself is because a.) I don't know whether the buffer is guaranteed to be zero-terminated and b.) I didn't find any documentation on interfacing with the *_size variable.
Side-note: the .* in the format string tells printf to read the maximum number of characters to print (the precision) from an argument and then pick the next argument as the string to print out.
You can insert any data you like into a .comment section in your output binary. You can do this with the linker after the fact, but it's probably easier to place it in your C++ code like this:
asm (".section .comment.manifest\n\t"
".string \"hello, this is a comment\"\n\t"
".section .text");
int main() {
....
The asm statement should go outside any function, in this instance. This should work as long as your compiler puts normal functions in the .text section. If it doesn't then you should make the obvious substitution.
The linker should gather all the .comment.manifest sections into one blob in the final binary. You can extract them from any .o or executable with this:
objdump -j .comment.manifest -s example.o
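And here is a sketch of what an "ident"-style script could look like on top of this approach: it asks objcopy to dump just that section into a temporary file and prints whatever strings it finds. The section and file names are the ones used in this answer; adjust to taste.
#!/usr/bin/env python3
# Dump the .comment.manifest section of a binary and print its strings.
import subprocess
import sys
import tempfile

def read_manifest(binary, section=".comment.manifest"):
    with tempfile.NamedTemporaryFile() as tmp:
        subprocess.run(["objcopy", "-O", "binary",
                        f"--only-section={section}", binary, tmp.name], check=True)
        data = open(tmp.name, "rb").read()
    # .string directives emit NUL-terminated entries, one per manifest line
    return [s.decode("utf-8", "replace") for s in data.split(b"\0") if s]

if __name__ == "__main__":
    for entry in read_manifest(sys.argv[1]):
        print(entry)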
Have you thought about using the standard packaging system of your distro? In our company we have thousands of packages and hundreds of them are automatically deployed every day.
We are using Debian packages that contain all the necessary information:
Full changelog that includes:
authors;
versions;
short descriptions and timestamps of changes.
Dependency information:
a list of all packages that must be installed for the current one to work correctly.
Installation scripts that set up the environment for a package.
I think you may not need to create manifests in your own way when a ready-made solution already exists. You can have a look at the Debian package HowTo here.

Compiling libmagic statically (c/c++ file type detection)

Thanks to the guys that helped me with my previous question (linked just for reference).
I can place the files fileTypeTest.cpp, libmagic.a, and magic in a directory, and I can compile with g++ -lmagic fileTypeTest.cpp -o fileTypeTest. Later, I'll be testing to see if it runs on Windows when compiled with MinGW.
I'm planning on using libmagic in a small GUI application, and I'd like to compile it statically for distribution. My problem is that libmagic seems to require the external file, magic. (I'm actually using my own shortened and compiled version, magic_short.mgc, but I digress.)
A hacky solution would be to code the file into the application, creating (and deleting) the external file as needed. How can I avoid this?
added for clarity:
magic is a text file that describes properties of different filetypes. When asked to identify a file, libmagic searches through magic. There is a compiled version, magic.mgc that works faster. My application only needs to identify a handful of filetypes before deciding what to do with them, so I'll be using my own magic_short file to create magic_short.mgc.
This is tricky, but I suppose you could do it this way... by the way, I have downloaded the libmagic source and have been looking at it...
There's a function in there called magic_read_entries within minifile.c (this is the pure vanilla source that I downloaded from SourceForge) where it reads from the external file.
You could append the magic file (which is found in the /etc directory) to the end of the library, like this: cat magic >> libmagic.a. On my system, magic is 474443 bytes and libmagic.a is 38588 bytes.
In the magic.c file, you would need to change the magichandle_t* magic_init(unsigned flags) function: at the end of the function, add a call to magic_read_entries, and modify that function to read at the offset into the library itself to pull in the data, treat it as a pointer to pointer to chars (char **), and use that instead of reading from the file. Since you know the offset of the data within the library, that should not be difficult.
Now the function magic_read_entries will no longer be used as it originally was, since the data is not going to be read from a separate file anymore. The function magichandle_t* magic_init(unsigned flags) will take care of loading the entries and you should be OK there.
If you need further help, let me know,
Edit:
I have used the old 'libmagic' from sourceforge.net and here is what I did:
Extract the downloaded archive into my home directory; ungzipping/untarring the archive will create a folder called libmagic.
Create a folder within libmagic and call it Test.
Copy the original magic.c and minifile.c into Test.
Using the enclosed diff output highlighting the differences, apply it to the magic.c source.
48a49,51
> #define MAGIC_DATA_OFFSET 0x971C
> #define MAGIC_STAT_LIB_NAME "libmagic.a"
>
125a129,130
> /* magic_read_entries is obsolete... */
> magic_read_entries(mh, MAGIC_STAT_LIB_NAME);
251c256,262
<
---
>
> if (!fseek(fp, MAGIC_DATA_OFFSET, SEEK_SET)){
> if (ftell(fp) != MAGIC_DATA_OFFSET) return 0;
> }else{
> return 0;
> }
>
Then issue make
The magic file (which I copied from /etc, under Slackware Linux 12.2) is concatenated to the libmagic.a file, i.e. cat magic >> libmagic.a. The SHA checksum for magic is (4abf536f2ada050ce945fbba796564342d6c9a61 magic),
here's the exact data for magic
(-rw-r--r-- 1 root root 474443 2007-06-03 00:52 /etc/file/magic) as found on my system.
Here's the diff for the minifile.c source; apply it and rebuild the minifile executable by running make again.
40c40
< magic_read_entries(mh,"magic");
---
> /*magic_read_entries(mh,"magic");*/
It should work then. If not, you will need to adjust the offset into the library for reading by modifying MAGIC_DATA_OFFSET. If you wish, I can stick the magic data file up on pastebin. Let me know.
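If you're unsure what MAGIC_DATA_OFFSET should be for your build, then, assuming the appended data starts right at the end of the original archive, it is simply the size of libmagic.a before the cat; a tiny sketch to print it:
#!/usr/bin/env python3
# Print the offset of the data appended to libmagic.a.
# Run this BEFORE doing `cat magic >> libmagic.a`.
import os
print("#define MAGIC_DATA_OFFSET 0x%X" % os.path.getsize("libmagic.a"))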
Hope this helps,
Best regards,
Tom.
I can tell you how to link a library in statically - you simply pass the path to the .a file at the end of your g++ command - .a files are just archives of compiled objects (.o). Running "ldd fileTypeTest" will show you the dynamically linked libraries - ${libdir}/libmagic.so shouldn't be in it.
As for linking in an external data file... I don't know - can you not package the application (.deb|.rpm|.tar.bz2)? On Windows, I'd write an installer using NSIS.
In the past I've built self-extracting archives. Basically it is a .exe file consisting of a .zip archive and code to unzip it. Download the .exe, run it, and poof! You can have as many files as you want.
http://en.wikipedia.org/wiki/Self-extracting_archive

Is there a way to parse a dependency tree from a build script output?

I have an inherited project that uses a build script (not make) to build and link the project with various libraries.
When it performs a build, I would like to parse the build output to determine which static libraries are actually being linked into the final executable and where they are coming from.
The script is compiling and linking with GNU tools.
You might try using the nm tool. Given the right options, it will look at a binary (archive or linked image) and tell you what objects were linked into it.
Actually, here's a one-liner I use at work:
#!/bin/sh
nm -Ag $* | sed 's/^.*\/\(.*\.a\):/\1/' | sort -k 3 | grep -v ' U '
to find the culprits for undefined symbols. Just chop off the last grep expression and it should pretty much give you what you want.
Static libraries make life more difficult in this regard. With dynamic libraries, you could just have used ldd on the resulting executable and been done with it. The best bet would be some kind of configuration file. Alternatively, you could try to look for -l arguments to gcc/ld; those are used to specify libraries. You could write a script for extracting them from the output, though I suspect that you will have to do it manually, because by the time you know what the script should look for you probably already know the answer.
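A rough sketch of such a script, assuming you have captured the build output to a log file (the file name here is hypothetical): it simply scans for -l and -L arguments and for any .a files named directly on the link lines.
#!/usr/bin/env python3
# List libraries mentioned in a captured build/link log.
import re
import sys

def linked_libraries(log_path):
    libs, paths, archives = set(), set(), set()
    for line in open(log_path, encoding="utf-8", errors="replace"):
        libs.update(re.findall(r"(?:^|\s)-l(\S+)", line))
        paths.update(re.findall(r"(?:^|\s)-L(\S+)", line))
        archives.update(re.findall(r"\S+\.a\b", line))
    return libs, paths, archives

if __name__ == "__main__":
    libs, paths, archives = linked_libraries(sys.argv[1] if len(sys.argv) > 1 else "build.log")
    print("-l libraries:", " ".join(sorted(libs)))
    print("-L paths    :", " ".join(sorted(paths)))
    print("archives    :", " ".join(sorted(archives)))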
It is probably possible to do something useful using e.g. Perl, but you would have to provide more details. On the other hand, it could be easier to simply analyze the script...