How to reduce the size of a WASM binary? (C++)

I have a project written in C++, and the target platform has a 256KB limit on binary size.
The toolchain is wasi-sdk-16.0 clang++; we use this compiler to compile the source code into a WASM-format binary. In this step we compile the sources with the following CXX_FLAGS:
-fstack-protector-strong -Os -D_FORTIFY_SOURCE=2 -fPIC -Wformat -Wformat-security -Wl,-z,relro,-z,now -Wno-psabi
Then we strip the binary with:
strip -s output_binary.wasm
After the steps above, the compiled binary size is 254KB.
Then we use wamrc from WAMR to compile the WASM binary for the AOT runtime; the command is shown below.
wamrc --enable-indirect-mode --disable-llvm-intrinsics -o output.aot output.wasm
The output binary size becomes 428KB, much bigger than the limit (256KB).
After googling, I used wasm-opt to reduce the size:
wasm-opt -Oz output.wasm -o output.wasm
The size became only 4KB smaller (almost useless).
I tried to confirm how much my own code affects the binary size, so I wrote a minimal sample that only calls the standard C++ library:
#include <vector>
#include <iostream>

int main() {
    std::vector<int> vec = {1, 2, 3, 4, 5};
    for (auto v : vec) {
        std::cout << v << " ";
    }
}
The compiled binary is already 205KB.
I also tried to use a binary size profiler (twiggy) to track the size of each part of the binary, but the tool failed to open it.
So I want to ask:
If just including two standard C++ headers pushes the binary up against the size limit, how can I strip the C++ standard library down to only the functions I actually use? (I cannot use a strip-unused-functions flag, because my project is a library provided to others.) Or is the C++ standard library really the main driver of the binary size?
Are there any other compiler flags, strip flags, or other optimization tools that can significantly reduce the binary size?

One thing I didn't see mentioned in your post is wabt's wasm-strip. I am not familiar enough with it to know whether it does more than the plain strip command, but it may be worth a try. On a Debian system you can install it with apt install wabt (wasm-strip ships as part of the wabt package).
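If I understand the tool correctly, it strips custom sections in place, so the invocation would simply be:
wasm-strip output_binary.wasm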
From the few micro-benchmarks I have seen on the internet, C++ wasm binaries carry a large overhead. For a totally-not-scientific slide you can watch this.
If, for whatever reason, you cannot coax the language into producing smaller binaries, you may also try optimizing at the link level, as busybox does.

I solved this issue by simply replacing iostream and fstream with cstdio. That reduced the size from 254KB to 85KB, because iostream pulls in too many templates.
Using iostream | Count of functions (readelf -Ws) | Size of binary
Yes            | 685                              | 254KB
No             | 76                               | 85KB
Specifying compiler flags such as -Oz also reduces some size, but the main factor is the amount of code generated from templates. So do not use the C++ stream API (or any other template-heavy general-purpose library) when there are binary size limitations. (Plain C libraries are always a safe bet.)
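For illustration, here is the earlier sample rewritten against cstdio; a minimal sketch, and exact savings will vary with the toolchain:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> vec = {1, 2, 3, 4, 5};
    for (int v : vec) {
        std::printf("%d ", v); // printf avoids instantiating iostream's template machinery
    }
}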

Related

How can I isolate the compiled logic of a C++ program from the rest of the file header data?

I want to observe the difference in opcode binary output between two compiled versions of a very basic C++ program. For example, 2 + 2 = ?, with no libraries called. Being new to compiled programs, I expected the compiled output to be a tiny file of binary opcodes with a few small headers, but the headers are large.
simple.cpp
int main()
{
    unsigned int a = 2;
    unsigned int b = 2;
    unsigned int c = a + b;
}
compiler:
g++ -std=c++0x simple.cpp -o simple
Is there a format I can export to that doesn't contain headers, just the opcode binary that we instruct the machine to execute? If not, what bytes or location in the resulting file can I look for to isolate the relevant logic from the program?
I need the machine code, not assembly, since my project is the analysis of differently obfuscated versions of a source file to attempt to recognize one based on the other. A complex subject of questionable feasibility, but nevertheless that's why I'm asking how to isolate the machine code and not just the assembly: to test the analysis against the true machine code output.
I tried googling the header structure but can't seem to find much info.
Looking at ld(1): GNU linker - Linux man page, you will find that you can use the --oformat=output-format option to specify the output format.
binary is a format that doesn't have headers.
Then, looking at gcc(1): GNU project C/C++ compiler - Linux man page, you will find that you can use the -Wl option to pass options to the linker.
The -nostdlib option is also useful to avoid extra things being added.
Combining these, you can try this command:
g++ -std=c++0x simple.cpp -nostdlib -Wl,--oformat=binary -o simple
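Another route, if you prefer to keep the normal link step, is to extract just the raw bytes of the code section from the linked executable with objcopy afterwards (a sketch; the section to dump may differ by target):
objcopy -O binary -j .text simple simple.bin   # dump only the .text section as raw opcodes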

g++ compilation of a separately preprocessed file gives error depending on the architecture

I am using g++ version 4.1.2 on an x86_64 GNU/Linux architecture. The code base is very large and I don't have a sufficient understanding of the makefiles used in the project. The code compiles fine as it is.
For debugging purposes, I need to preprocess (g++ -E) a few source files individually and then re-compile them. I am giving the required include paths using -I. Ideally the compilation should go fine.
But I am getting a few discrepancies in standard headers, such as:
- typedef unsigned long size_t; causes errors with the operator new() declaration generated by the compiler (if I change it to unsigned int manually, the error disappears)
- in library functions like unsigned long numeric_limits<>::max(), the compiler complains about big constants such as 922...807L; it reports that the integer constant is too large for the long type
- a mismatched declaration of __errno_location() gives a compiler error
I am having a hard time finding out what is going wrong. Why does compilation go fine when I run make on the unchanged file, and why do the standard headers start complaining when I use g++ -I <> -E on an individual file?
(Note that there is no problem with the code we have written; it's just from the standard library side. I tried locating the stddef.h that has the unsigned int typedef, but that only fixes the first problem.)
Any idea how to fix these errors would be highly appreciated.
Don't preprocess and compile separately, or if you must, use consistent compiler options and a consistent environment.
It sounds as though you're running the preprocessor on a 32-bit machine (or using the -m32 option) and then compiling on a 64-bit machine.
When compiling the output of the preprocessor, make sure that you use the -fpreprocessed compiler option so that the preprocessor will not run again.
If you don't pass that option, certain constructs that produce identifiers looking like macros may get expanded again into something they shouldn't be expanded to. It's hard for me to come up with a case that shows a difference (I'm sure I could, but it would take a bit of puzzling out and would be pretty contrived). However, the implementation headers may well use some arcane macro techniques that are sensitive to this option.
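A consistent two-step flow would look roughly like this (file names and flags are placeholders; the key point is that both steps target the same word size):
g++ -m64 -I include/ -E foo.cpp -o foo.ii    # preprocess only
g++ -m64 -fpreprocessed -c foo.ii -o foo.o   # compile without preprocessing again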

GotoBLAS2 performance

I've got some code that performs a packed symmetric matrix inversion and multiplication using the LAPACK routines DPPTRF, DPPTRI, and DSPMV. Here is an older topic in which you can see the C++ code I use to invoke the LAPACK routines.
My code currently assembles a symmetric matrix which is mostly populated along the diagonal.
I am testing different BLAS and LAPACK implementations, comparing GotoBLAS2 against the reference LAPACK implementation from netlib.
Here is how I compile the netlib LAPACK code. I select the .f code files from the source and compile them all into a compact static library like this:
$ ls
ddot.f dpptrf.f dscal.f dspr.f dtpsv.f lsame.f
dgemm.f dpptri.f dspmv.f dtpmv.f dtptri.f xerbla.f
$ gfortran -c *.f
$ ar rcs liblapack_lite.a *.o
I can then link this lib to my C++ application using -llapack_lite -lgfortran.
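For example (assuming the archive sits in the current directory; names are placeholders):
g++ main.cpp -L. -llapack_lite -lgfortran -o myapp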
I then tried GotoBLAS2. I got it from here. The package contained scripts that automatically compiled a massive 19MB static lib. It works fine with my existing code when linked in with -lgoto2_nehalemp-r1.13.
This seemed to go well at first. With GotoBLAS2, on large problem sets (inverting 1000x1000 or larger matrices) I saw about a 6x performance increase. Since GotoBLAS is threaded for my architecture and the reference LAPACK is single-threaded, I thought this was reasonable. The system monitor also showed >300% CPU usage to corroborate this.
Here's where it gets weird. I thought: what if I optimize the reference implementation?
I recompiled my lapack_lite lib like this: gfortran -c -O3 *.f
My lapack_lite lib now outperforms GotoBLAS2 even on a 3200x3200 matrix inversion, using only one thread. It also consumes ~80MB less RAM.
The subsequent packed matrix-vector multiply is faster with GotoBLAS, however.
How is this even remotely possible? Did the make configuration of the GotoBLAS package fail to pass an optimization switch to gfortran?

How to remove unused C/C++ symbols with GCC and ld?

I need to severely optimize the size of my executable (ARM development), and I noticed that in my current build scheme (gcc + ld) unused symbols are not getting stripped.
Using arm-strip --strip-unneeded on the resulting executables/libraries doesn't change the output size of the executable (I have no idea why; maybe it simply can't).
What would be the way (if it exists) to modify my build pipeline so that unused symbols are stripped from the resulting file?
I wouldn't even think of this, but my current embedded environment isn't very "powerful", and saving even 500K out of 2M results in a very nice loading performance boost.
Update:
Unfortunately the current gcc version I use doesn't have the -dead-strip option, and -ffunction-sections... + --gc-sections for ld doesn't make any significant difference in the resulting output.
I'm shocked that this even became a problem, because I was sure that gcc + ld should automatically strip unused symbols (why would they even keep them?).
For GCC, this is accomplished in two stages:
First, compile the code telling the compiler to separate it into distinct sections within the translation unit. This will be done for functions, classes, and external variables by using the following two compiler flags:
-fdata-sections -ffunction-sections
Link the translation units together using the linker optimization flag (this causes the linker to discard unreferenced sections):
-Wl,--gc-sections
So if you had one file called test.cpp with two functions declared in it, but one of them unused, you could omit the unused one with the following command to gcc (g++):
gcc -Os -fdata-sections -ffunction-sections test.cpp -o test -Wl,--gc-sections
(Note that -Os is an additional compiler flag that tells GCC to optimize for size.)
If this thread is to be believed, you need to supply -ffunction-sections and -fdata-sections to gcc, which will put each function and data object in its own section. Then you pass --gc-sections to GNU ld to remove the unused sections.
You'll want to check the docs for your versions of gcc & ld.
However, for me (OS X gcc 4.0.1) I find these for ld:
-dead_strip
Remove functions and data that are unreachable by the entry point or exported symbols.
-dead_strip_dylibs
Remove dylibs that are unreachable by the entry point or exported symbols. That is, it suppresses the generation of load commands for dylibs which supplied no symbols during the link. This option should not be used when linking against a dylib which is required at runtime for some indirect reason, such as the dylib having an important initializer.
And this helpful option
-why_live symbol_name
Logs a chain of references to symbol_name. Only applicable with -dead_strip. It can help debug why something that you think should be dead-stripped is not removed.
There's also a note in the gcc/g++ man that certain kinds of dead code elimination are only performed if optimization is enabled when compiling.
While these options/conditions may not hold for your compiler, I suggest you look for something similar in your docs.
Programming habits can help too; e.g. add static to functions that are not accessed outside a specific file; use shorter names for symbols (this can help a bit, though likely not much); use const char x[] where possible... This paper, though it talks about dynamic shared objects, contains suggestions that, if followed, can help make your final binary output smaller (if your target is ELF).
The answer is -flto. You have to pass it to both your compilation and link steps, otherwise it doesn't do anything.
It actually works very well - reduced the size of a microcontroller program I wrote to less than 50% of its previous size!
Unfortunately it did seem a bit buggy - I had instances of things not being built correctly. It may have been due to the build system I'm using (QBS; it's very new), but in any case I'd recommend you only enable it for your final build if possible, and test that build thoroughly.
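Concretely, "pass it to both steps" means something like the following (file names are placeholders):
gcc -Os -flto -c foo.c -o foo.o    # compile each unit with LTO information embedded
gcc -Os -flto -c bar.c -o bar.o
gcc -Os -flto foo.o bar.o -o app   # the link step also needs -flto to run the optimizer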
While not strictly about symbols: if going for size, always compile with the -Os and -s flags. -Os optimizes the resulting code for minimum executable size and -s removes the symbol table and relocation information from the executable.
Sometimes, if small size is desired, playing around with different optimization flags may (or may not) make a difference. For example, toggling -ffast-math and/or -fomit-frame-pointer may at times save you even dozens of bytes.
It seems to me that the answer provided by Nemo is the correct one. If those instructions do not work, the issue may be related to the version of gcc/ld you're using. As an exercise I compiled an example program using the instructions detailed here:
#include <stdio.h>

void deadcode() { printf("This is d dead codez\n"); }

int main(void) { printf("This is main\n"); return 0; }
Then I compiled the code using progressively more aggressive dead-code removal switches:
gcc -Os test.c -o test.elf
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections
gcc -Os -fdata-sections -ffunction-sections test.c -o test.elf -Wl,--gc-sections -Wl,--strip-all
These compilation and linking parameters produced executables of 8457, 8164, and 6160 bytes, respectively, with the most substantial contribution coming from the --strip-all switch. If you cannot produce similar reductions on your platform, then maybe your version of gcc does not support this functionality. I'm using gcc (4.5.2-8ubuntu4) and ld (2.21.0.20110327) on Linux Mint, kernel 2.6.38-8-generic, x86_64.
strip --strip-unneeded only operates on the symbol table of your executable. It doesn't actually remove any executable code.
The standard libraries achieve the result you're after by splitting all of their functions into separate object files, which are combined using ar. If you then link the resulting archive as a library (i.e. give the option -l your_library to ld), ld will only include the object files, and therefore the symbols, that are actually used.
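A minimal sketch of that one-object-file-per-function approach (names are placeholders):
gcc -c f1.c f2.c                 # one object file per function/source file
ar rcs libmylib.a f1.o f2.o      # bundle them into a static archive
gcc main.c -L. -lmylib -o app    # ld pulls in only the objects actually referenced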
You may also find some of the responses to this similar question of use.
I don't know if this will help with your current predicament, as it is a recent feature, but you can specify the visibility of symbols in a global manner. Passing -fvisibility=hidden -fvisibility-inlines-hidden at compile time can help the linker later get rid of unneeded symbols. If you're producing an executable (as opposed to a shared library), there's nothing more to do.
More information (and a fine-grained approach for e.g. libraries) is available on the GCC wiki.
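A small sketch of how that looks for a library (function and file names are hypothetical):

// mylib.cpp, built with: g++ -fvisibility=hidden -fvisibility-inlines-hidden -shared -fPIC mylib.cpp -o libmylib.so
__attribute__((visibility("default")))
int exported_api(int x) { return 2 * x; }    // explicitly re-exported

int internal_helper(int x) { return x + 1; } // hidden by the global default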
From the GCC 4.2.1 manual, section -fwhole-program:
Assume that the current compilation unit represents the whole program being compiled. All public functions and variables, with the exception of main and those merged by attribute externally_visible, become static functions and in effect are optimized more aggressively by interprocedural optimizers. While this option is equivalent to proper use of the static keyword for programs consisting of a single file, in combination with the --combine option this flag can be used to compile most smaller-scale C programs, since the functions and variables become local to the whole combined compilation unit, not to the single source file itself.
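For the old GCC releases that manual covers (--combine was removed from later GCC versions in favor of LTO), usage looked roughly like this:
gcc -O2 --combine -fwhole-program main.c util.c -o app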
You can use strip on a binary (e.g. an executable) to strip all symbols from it.
Note: it changes the file itself and doesn't create a copy.
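So if you want to keep an unstripped copy for debugging, copy it first:
cp app app.debug   # keep a copy with symbols
strip app          # strip the deployed binary in place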

Detect GCC compile-time flags of a binary

Is there a way to find out what gcc flags a particular binary was compiled with?
A quick look at the GCC documentation doesn't turn anything up.
The Boost guys are some of the smartest C++ developers out there, and they resort to naming conventions because this is generally not possible any other way (the executable could have been created in any number of languages, by any number of compiler versions, after all).
(Added much later): It turns out GCC has this feature in 4.3, if you ask for it when you compile the code:
A new command-line switch -frecord-gcc-switches ... causes the command line that was used to invoke the compiler to be recorded into the object file that is being created. The exact format of this recording is target and binary file format dependent, but it usually takes the form of a note section containing ASCII text.
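On ELF targets the recorded text can be dumped back out with binutils; for instance (the section name may vary by target; .GNU.command.line is what I'd expect):
gcc -frecord-gcc-switches -O2 tt.c -o tt
readelf -p .GNU.command.line tt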
Experimental proof:
diciu$ gcc -O2 /tmp/tt.c -o /tmp/a.out.o2
diciu$ gcc -O3 /tmp/tt.c -o /tmp/a.out.o3
diciu$ diff /tmp/a.out.o3 /tmp/a.out.o2
diciu$
I take that as a no, since the binaries are identical.
I'm the one who asked Brian to post this originally. My question had to do with the samba binary. I found out that you can run smbd -b to get information on how it was built.
I don't think so.
You can see if it has debug symbols, which means -g was used ;) But I can't think of any way to find out, for example, which directories were searched for include headers.
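A quick sanity check along those lines (both are standard file/binutils invocations):
file ./a.out                      # "not stripped" means the symbol table is still present
readelf -S ./a.out | grep debug   # .debug_* sections suggest it was built with -g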
Maybe a better answer is possible if you only target a specific flag; e.g. if you only want to know whether the flag "..." was set when this binary was compiled. In that case, what flag would that be?