GotoBLAS2 performance - c++

I've got some code which performs a packed symmetric matrix inversion and multiplication using the LAPACK routines DPPTRF, DPPTRI, and DSPMV. Here is an older topic in which you can see the C++ code I use to invoke the LAPACK routines.
My code currently assembles a symmetric matrix which is mostly populated along the diagonal.
I am testing different BLAS and LAPACK implementations and I am comparing GotoBLAS2 with the reference LAPACK implementation from netlib.
Here is how I compile the netlib LAPACK code. I select the .f code files from source, and compile them all into a compact static library like this:
$ ls
ddot.f dpptrf.f dscal.f dspr.f dtpsv.f lsame.f
dgemm.f dpptri.f dspmv.f dtpmv.f dtptri.f xerbla.f
$ gfortran -c *.f
$ ar rcs liblapack_lite.a *.o
I can then link this lib to my C++ application using -llapack_lite -lgfortran.
I then tried GotoBLAS2. I got it from here. The package contained scripts that automatically compiled a massive 19MB static lib. It works great with my existing code when linked via -lgoto2_nehalemp-r1.13.
At first this seemed to go well. With GotoBLAS2, on large problem sets (inverting 1000x1000 or larger matrices) I saw roughly a 6x performance increase. Since GotoBLAS2 is threaded for my architecture and the reference LAPACK is single-threaded, this seemed reasonable; the system monitor corroborated it by showing >300% CPU usage.
Here's where it gets weird. I thought: what if I optimize the reference implementation?
I recompile my lapack_lite lib like this: gfortran -c -O3 *.f
My lapack_lite lib now outperforms GotoBLAS2 even on a 3200x3200 matrix inversion, using only one thread. It also consumes ~80MB less RAM.
The subsequent packed matrix-vector multiply does happen faster with GotoBLAS, however.
How is this even remotely possible? Did the make configuration of the GotoBLAS package fail to use an optimization switch with gfortran?

Related

How to reduce size of WASM binary?

I have a project written in C++, and the target platform limits the deployed binary size to 256KB.
The toolchain is wasi-sdk-16.0 clang++; we use this compiler to compile the sources into a binary in WASM format. In this step we compile with the following CXX_FLAGS.
-fstack-protector-strong -Os -D_FORTIFY_SOURCE=2 -fPIC -Wformat -Wformat-security -Wl,-z,relro,-z,now -Wno-psabi
Then we strip the binary with
strip -s output_binary.wasm
After the steps above, the compiled binary size is 254KB.
Then we use wamrc from WAMR to compile the WASM binary for the AOT runtime; the command is shown below.
wamrc --enable-indirect-mode --disable-llvm-intrinsics -o output.aot output.wasm
The output binary size becomes 428KB, much bigger than the limitation (256KB).
After some googling, I used wasm-opt to reduce the size:
wasm-opt -Oz output.wasm -o output.wasm
The size became about 4KB smaller (almost useless).
I tried to confirm how much my own code affects the binary size, so I wrote a minimal sample that only uses the standard C++ library:
#include <vector>
#include <iostream>
int main() {
    std::vector<int> vec = {1,2,3,4,5};
    for (auto v : vec) {
        std::cout << v << " ";
    }
}
The compiled binary is already 205KB.
I also tried to use a binary size profiler (twiggy) to track the size of each part of the binary, but the tool failed to open it.
So I want to ask
Given that merely including two standard C++ headers brings the binary close to the size limit, how can I strip the C++ standard library down to just the functions I actually use? (I cannot use a strip-unused-functions flag because my project is a library provided to others.) Or is it really the C++ standard library that dominates the binary size?
Are there any other compiler flags, strip flags, or other optimization tools that can significantly reduce the binary size?
One of the things I didn't see mentioned in your post is wabt's wasm-strip. I am not familiar enough with it to know whether it does more than the plain strip command, but it may be worth a try. You can install it with apt install wabt on a Debian system.
From the few micro-benchmarks I see on the internet, C++ wasm binaries have large overhead. For a totally-not-scientific slide you can watch this.
If, for whatever reason, you cannot work the language to produce smaller binaries, you may try to optimize also at the link level, as busybox does.
I solved this issue by simply replacing iostream and fstream with cstdio. This reduced the size from 254KB to 85KB, because iostream instantiates too many templates.
Using iostream | Count of functions (with readelf -Ws) | Size of binary
Yes            | 685                                   | 254KB
No             | 76                                    | 85KB
Specifying compiler flags such as -Oz also shaves off some size, but the main factor is the amount of code generated from templates. So do not use the C++ stream API (or any other template-heavy general-purpose library) when there are binary size limitations. (C libraries are always a safe bet.)

explicitly link intel icpc openmp

I have the Intel compiler installed at $HOME/tpl/intel. When I compile a simple hello_omp.cpp with OpenMP enabled:
#include <omp.h>
#include <iostream>
int main()
{
    #pragma omp parallel
    {
        std::cout << "Hello World" << std::endl;
    }
    return 0;
}
I compile with ~/tpl/intel/bin/icpc -O3 -qopenmp hello_omp.cpp but when I run I get the following error:
./a.out: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory.
How can I explicitly link against the appropriate Intel library during the make process, without using LD_LIBRARY_PATH?
You have two simple solutions for your problem:
Linking statically with the Intel run time libraries:
~/tpl/intel/bin/icpc -O3 -qopenmp -static_intel hello_omp.cpp
Pros: you don't have to care where the Intel run-time environment is installed on the machine where you run the binary, or even whether it is installed at all;
Cons: your binary becomes bigger, and you won't be able to select a different (ideally more recent) run-time environment even when one is available.
Adding the search path for dynamic library into the binary using the linker option -rpath:
~/tpl/intel/bin/icpc -O3 -qopenmp -Wl,-rpath=$HOME/tpl/intel/lib/intel64 hello_omp.cpp
Notice the use of -Wl, to transmit the option to the linker.
I suspect this is closer to what you were after than the first solution I proposed, so I'll let you weigh the pros and cons of each for yourself.
The Intel compiler ships a compilervars.sh script in its bin directory; when sourced, it sets the appropriate environment variables (LD_LIBRARY_PATH, LIBRARY_PATH, and PATH) to the directories hosting the OpenMP runtime library and other compiler-specific libraries such as libsvml (the short vector math library) and libimf (a more optimized version of libm).

Can we use g++-compiled code to do performance analysis with Solaris Studio's Performance Analyzer?

I am getting the following error while running the collect command
$ collect -c on sample
bit (warning): Cannot operate on /home/user1/ANALYSIS/SAMPLE_PROGRAM/sample. Please recompile it on a machine with Solaris10 update 5 or higher (or OpenSolaris version snv_52 or higher). If using an older OS, try -xbinopt=prepare (SPARC only).
The sample program was built with the following g++ flags:
g++ -c -Wall -g3 -m64 -pthread -O2 -DSOLARIS -DSS_64BIT_SERVER
The sample program is simple and contains only the following code:
while (true)
{
    sleep (10);
}
I was just trying to see whether g++-compiled C++ code can be used with the collect command, as we have a huge g++-compiled binary which we would rather not recompile with the Solaris Studio C++ compilers.
I don't think so. The Studio option -xbinopt=prepare adds special code to binaries so they can use performance counters. I haven't used it for years... as far as I remember, -xbinopt=prepare makes the binary write performance data to files in the CWD (or a directory specified by another parameter), and later you can use those data with -xbinopt=use.
The workflow is: compile first with prepare, then run to collect data, then recompile with the collected performance data to get better-optimized code. Similar to a JIT compiler, but at compile time.

How to debug GCC/LD linking process for STL/C++

I'm working on a bare-metal cortex-M3 in C++ for fun and profit. I use the STL library as I needed some containers. I thought that by simply providing my allocator it wouldn't add much code to the final binary, since you get only what you use.
I actually didn't even expect any linking process at all with the STL (given my allocator), as I thought it was all template code.
I am compiling with -fno-exceptions, by the way.
Unfortunately, about 600KB or more are added to my binary. I looked up which symbols end up in the final binary with nm, and the list seemed like a joke to me; it is so long I won't try to paste it, though some of the symbols are weak.
I also looked in the .map file generated by the linker and I even found the scanf symbols
.text
0x000158bc 0x30 /CodeSourcery/Sourcery_CodeBench_Lite_for_ARM_GNU_Linux/bin/../arm-none-linux-gnueabi/libc/usr/lib/libc.a(sscanf.o)
0x000158bc __sscanf
0x000158bc sscanf
0x000158bc _IO_sscanf
And:
$ arm-none-linux-gnueabi-nm binary | grep scanf
000158bc T _IO_sscanf
0003e5f4 T _IO_vfscanf
0003e5f4 T _IO_vfscanf_internal
000164a8 T _IO_vsscanf
00046814 T ___vfscanf
000158bc T __sscanf
00046814 T __vfscanf
000164a8 W __vsscanf
000158bc T sscanf
00046814 W vfscanf
000164a8 W vsscanf
How can I debug this? First I wanted to understand what exactly GCC is invoking for linking (I'm linking through GCC). I know that if a symbol is found in a text section, the whole section is pulled in, but even so that seems like too much.
Any suggestion on how to tackle this would really be appreciated.
Thanks
Using GCC's -v and -Wl,-v options will show you the linker commands (and version info of the linker) being used.
Which version of GCC are you using? I made some changes for GCC 4.6 (see PR 44647 and PR 43863) to reduce code size and help embedded systems. There's still an outstanding enhancement request (PR 43852) to allow disabling the inclusion of the IO symbols you're seeing - some of them come from the verbose terminate handler, which prints a message when the process is terminated with an active exception. If you're not using exceptions, then some of that code is useless to you.
The problem is not the STL, it is the Standard Library.
The STL itself is pure (in a way), but the Standard Library also includes all those streams packages, and it seems that you managed to pull in libc as well...
The problem is that the Standard Library was never meant to be picked apart, so there may not have been much concern for re-using stuff from the C Standard Library...
You should first try to identify which files are pulled in when you compile (using strace, for example); this way you can verify that you only ever use header-only files.
Then you can try to remove the linking that occurs. There are options to pass to gcc to specify that you would like a standard-library-free build, something like -nostdlib for example, however I am not well versed enough in those to instruct you exactly here.

How to compile MATLAB code to be used with QT C++

I have 2 sets of code:
MATLAB code, and
QT C++ code.
I have tried to compile the MATLAB code to a C++ library using the mcc command with the msvc2008 compiler. For my Qt C++ code, I'm using MinGW to compile. However, when I try to add in the MATLAB-converted C++ code, there seem to be a lot of problems.
Is it possible to mix these two types of code together? Does anyone have any experiences using a combination of these languages?
I have tried to use Octave, but I would rather not recode my MATLAB code; I am looking for an alternative that runs the MATLAB code directly.
NB: I need to use MinGW in Qt as it is a requirement, and for the MATLAB mcc command I only have the choice of the msvc compiler. It would be best if I could make the program standalone for portability. The reason I need MATLAB code is that there are some nice matrix math manipulation functions I need, and it is easier for me to do research using MATLAB.
When you compile MATLAB code using mcc (by default, or when using the -m option), you get an executable. So from your C++ file, you can invoke that MATLAB-generated executable with a C/C++ call such as exec.
If you use the -l option (using mcc), you get a shared library, and header.
For instance if you type (in matlab):
mcc -l test.m -W cpplib:test.h
This should produce a shared library test.lib or test.so, and a header test.h
In test.h you should have a line similar to this:
bool MW_CALL_CONV mlxTest(int nlhs, mxArray *plhs[], int nrhs, mxArray *prhs[]);
You can call your matlab function using that.
In addition, you have to add both the shared libraries and the headers to your msvc project.
I fixed the mxInt64 and mxUint64 issue by adding typedefs so the code recognizes those as 64-bit signed and unsigned integers.