What are these functions given by Intel Advisor? (C++)

I'm trying to use Intel Advisor to understand the hotspots in my application.
These are the compile and linker flags that I'm using:
INTEL_OPT=-O3 -simd -xCORE-AVX2 -parallel -ipo -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
This is a sample image taken from the tutorial: [image omitted]
This is a screenshot from my application: [image omitted]
I don't understand what all these functions before __clone, [stack], _start and __libc_start_main are.

James is correct: things like __clone, [stack], _start and __libc_start_main correspond to the CRT, Cray system libraries (if you use the Cray environment), OpenMP runtime internals, or general system calls.
Also, your profile does not seem to have any vectorization info enabled (an empty "why no vectorization" column, no peel/remainder breakdown, no SIMD efficiency metrics, and so on). Since your compilation flags seem reasonable, my next guess is that you are either stripping the debug info into a separate file or using a fairly old ICC version. Removing -ipo may also help recover the missing information.
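For example, a sketch of the first thing I would try: the same flags from the question with -ipo removed and everything else unchanged:
INTEL_OPT=-O3 -simd -xCORE-AVX2 -parallel -qopenmp -fargument-noalias -ansi-alias -no-prec-div -fp-model fast=2
INTEL_PROFILE=-g -qopt-report=5 -Bdynamic -shared-intel -debug inline-debug-info -qopenmp-link dynamic -parallel-source-info=2 -ldl
If the Survey report then fills in the SIMD columns, -ipo was likely what kept Advisor from mapping loops back to your source.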

Related

g++ arm-none-eabi upgrade from 4.9 to gcc 8.2: generated binary no longer fits in flash

I recently updated my Linux laptop from Ubuntu 16.04 to 18.04.
I had an STM32 (Cortex-M4) Makefile-based project that compiled correctly with the arm-none-eabi g++ version provided by Ubuntu. The generated file required 47620 bytes in the .text section.
With the Ubuntu upgrade, I have also installed an up-to-date version of gcc (from ARM website). Version is 8.2.1.
When I compile the same project (make clean && make), the generated binary does not fit in flash any more (97424 bytes required, more than twice as much!). The project is exactly the same (sources, linker script, startup files, Makefile).
The compiler options are: -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -DSTM32F303x8 -DARMCM4 -O0 -g -Wall -fexceptions -Wno-deprecated.
The linker options are -mthumb -mcpu=cortex-m4 -Tstm32f303K8.ld -mfloat-abi=hard -mfpu=fpv4-sp-d16 --specs=nosys.specs -lm -Wl,--start-group -lm -Wl,--end-group -Wl,--gc-sections -Lsys -Xlinker -Map=test.elf.map
When I look at the generated .map file, all the user functions take approximately the same size (the new version even saves 8 bytes!). But after those, it includes C++-specific parts, and one of them is more than 26 KB (from the map file):
.text 0x00000000080079e8 0x683c /usr/local/gcc-arm-none-eabi-8-2018-q4-major/bin/../lib/gcc/arm-none-eabi/8.2.1/../../../../arm-none-eabi/lib/thumb/v7e-m+fp/hard/libstdc++.a(cp-demangle.o)
0x000000000800e13c __cxa_demangle
Note: there is no problem with C-only projects, only with C++. The libraries included are the same (gcc 4.9.3 -> armv7e-m/fpu, and gcc 8.2.1 -> thumb/v7e-m+fp/hard):
libm.a libstdc++.a libc.a libnosys.a libgcc.a
Is there a way to get rid of that so that I can compile and flash my (not so old) project?
I found a solution: use libstdc++_nano instead of the implicit libstdc++. With that, the code size is reduced from 84 KB to 26 KB!
LDFLAGS += -lstdc++_nano
It just works. Thanks @Henrik, @Matthieu and @EOF for your support!
It might be related to exception handling, as std::terminate(), which is used with exceptions, might call the demangling routine. If you don't need exceptions, then try disabling them with -fno-exceptions as described here.
Another solution might be to look at the GCC headers:
Demangling routine.
ABI-mandated entry point in the C++ runtime library for demangling.
[...]
returns a pointer to the start of the NUL-terminated demangled
name, or NULL if the demangling fails. The caller is
responsible for deallocating this memory using free.
The prototype is:
char*
__cxa_demangle(const char* __mangled_name, char* __output_buffer,
size_t* __length, int* __status);
So you could probably just supply your own dummy function returning NULL (given that all library functions are weak and can be overridden). I'd advise you to look at the disassembled code first, though, and find out how and why __cxa_demangle is being called in the first place, since just discarding functionality might change behaviour.
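A minimal sketch of such a stub, assuming the weak-symbol override works as described above (the -2 status code means "invalid mangled name" in the Itanium C++ ABI):
#include <cstddef>

// Stub replacing the library's __cxa_demangle so cp-demangle.o is never
// linked in. It always reports failure; callers must handle a NULL return.
extern "C" char* __cxa_demangle(const char* /*mangled_name*/,
                                char* /*output_buffer*/,
                                std::size_t* length, int* status)
{
    if (length) *length = 0;
    if (status) *status = -2;  // -2: not a valid name under the C++ ABI
    return NULL;               // demangling "failed"
}
Link this into the project and the 26 KB cp-demangle.o should drop out of the map; it is worth verifying in test.elf.map.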
They also give other advice in this forum post, which might be useful for you as well:
Optimize for size with -Os instead of -O0 (or use -Og instead if you prefer easily debuggable code; it is often both smaller and faster than -O0).
Optimize at link-time with -flto while compiling and linking.
Maybe disable RTTI with -fno-rtti if it is not used.
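Putting the suggestions in this thread together, a sketch of how the flags from the question might look when tuned for size (whether -fno-exceptions and -fno-rtti are safe depends on whether the project relies on them; benchmark and test):
CXXFLAGS: -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -DSTM32F303x8 -DARMCM4 -Os -flto -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti -g -Wall
LDFLAGS: -mthumb -mcpu=cortex-m4 -Tstm32f303K8.ld -mfloat-abi=hard -mfpu=fpv4-sp-d16 --specs=nosys.specs -flto -Wl,--gc-sections -lstdc++_nano -lm
-ffunction-sections and -fdata-sections are added here so that -Wl,--gc-sections can actually discard unused code and data.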

Equivalent of mpif90 --showme for Cray Fortran Wrapper ftn

I am currently compiling code on a HPC system that was set up by Cray. To call Fortran, C, and C++ compilers it is suggested to use ftn, cc, and CC compiler wrappers provided by Cray.
Now, I would like to know which options the ftn wrapper adds to the actual compiler call (in my case to ifort, but it should not matter). From working with MPI wrappers I know the option --showme to get this information:
> mpif90 --showme
pgf90 -I/opt/openmpi/pgi/ib/include -fast -I/opt/openmpi/pgi/ib/lib -L/opt/openmpi/pgi/ib/lib -lmpi_f90 -lmpi_f77 -lmpi -libverbs -lrt -lnsl -lutil -ldl -lm -lrt -lnsl -lutil
## example from another HPC system; MPI wrapper around the Portland Group Fortran compiler
I am looking for an option like --OPTION_TO_GET_APPENDED_FLAGS that provides the same information for the ftn wrapper:
> ftn --OPTION_TO_GET_APPENDED_FLAGS
ifort -one_option -O2 -another_option
Because it is Friday afternoon local time, all colleagues with knowledge of this topic have already left for the weekend (as has the cluster support team).
Thanks in advance for the answers.
On the Cray system I am using (Cray Linux Environment (CLE), 27th Apr. 2016), the appropriate option is -craype-verbose:
ftn -craype-verbose
> ifort -xCORE-AVX2 -static -D__CRAYXC [...]
It is documented on the man page, which I had only quickly scanned before asking this question:
-craype-verbose
Print the command which is forwarded to compiler invocation.

GNU Fortran compiler optimisation flags for Ivy Bridge architecture

May I please ask for your suggestions on GNU Fortran compiler (v6.3.0) flags to optimise the code for the Ivy Bridge architecture (Intel Xeon CPU E5-2697 v2, Ivy Bridge @ 2.7 GHz)?
At the moment I’m compiling the code with the following flags:
-O3 -march=ivybridge -mtune=ivybridge -ffast-math -mavx -m64 -w
Unless you use intrinsics specific to Ivy Bridge, the Sandy Bridge flag is sufficient. I expect you should find some advantage by additionally setting -funroll-loops --param max-unroll-times=2.
Sometimes -O2 -ftree-vectorize will work out better than -O3.
If you have complex data types, you will want to check against -fno-cx-limited-range, as the default under -ffast-math may be too aggressive.
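Putting those suggestions together, a sketch of a flag set to benchmark against the original (-mavx can be dropped because -march=ivybridge already implies AVX; whether each change helps is workload-dependent):
gfortran -O3 -march=ivybridge -mtune=ivybridge -funroll-loops --param max-unroll-times=2 -ffast-math -fno-cx-limited-range -m64 -w
and, as a second candidate, the same line with -O2 -ftree-vectorize in place of -O3.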

g++ enables wrong flags at -Os

At the moment I am doing some experiments with the GNU C++ compiler and the -Os optimization option for minimal code size. I checked the enabled compiler flags at -Os with the following command:
g++ -c -Q -Os --help=optimizers | grep "enabled"
I got this list of enabled options:
-faggressive-loop-optimizations [enabled]
-falign-functions [enabled]
-falign-jumps [enabled]
-falign-labels [enabled]
-falign-loops [enabled]
-fasynchronous-unwind-tables [enabled]
...
This seems a bit strange, because I also looked up here which flags should be enabled at -Os, and under the -Os section it is written that all the -falign- options should be disabled for code minimization.
Q: So is this a bug, or am I doing something wrong here? Because after reading what the -falign- flags do, I really think they should be disabled at -Os!
My gcc version is 4.9.2 and I am working on Arch Linux.
Thanks in advance for helping :)
Q: So is this a bug, or am I doing something wrong here? Because after reading what the -falign- flags do, I really think they should be disabled at -Os
I think Hans did a good job of finding part of the problem. It's definitely a documentation bug. But no one from GCC commented on why -Os enables them, so you might not have all of the information.
Older ARM devices were very intolerant of unaligned accesses; that included ARMv4 and, I think, ARMv5. If you performed an unaligned access, you would get a SIGBUS (been there, done that, got the T-shirt).
Modern ARM devices fix up unaligned accesses like x86 processors do, so you no longer get a SIGBUS. Instead, you just take the performance penalty.
You should try specifying an architecture in case those options are an artifact of older ARM device support, for example -march=armv7. If you find them still enabled on ARMv6 and ARMv7, then it could still be a bug. It depends on whether the GCC team decided the trade-off was worthwhile for ARM (code size vs. performance penalty).
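A quick sanity check (a sketch, assuming an ARM cross compiler such as arm-none-eabi-g++ is available) is to rerun the query from the question with an explicit architecture and compare the alignment flags:
g++ -c -Q -Os --help=optimizers | grep falign
arm-none-eabi-g++ -c -Q -Os -march=armv7 --help=optimizers | grep falign
If the -falign- options show up as enabled for both targets, the behaviour is not an ARM-specific artifact.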

Can I enable vectorization only for one part of the code?

Is there a way to enable vectorization only for some part of the code, like a pragma directive? Basically, something that behaves as if -ftree-vectorize were enabled only while compiling that part of the code. #pragma simd, for example, is not available with gcc...
The reason is that benchmarking showed that with -O3 (which enables vectorization) the timings were worse than with -O2. But there are some parts of the code for which we would like the compiler to try vectorizing loops.
One solution I could use would be to restrict the compiler flag to one file.
Yes, this is possible. You can either disable it for a whole module or for individual functions; you can't, however, do this for particular loops.
For individual functions use
__attribute__((optimize("no-tree-vectorize"))).
For whole modules, -O3 automatically enables -ftree-vectorize. I'm not sure how to disable it once it's enabled, but you can use -O2 instead. If you want to use all of -O3 except -ftree-vectorize, then do this:
gcc -c -Q -O3 --help=optimizers > /tmp/O3-opts
gcc -c -Q -O2 --help=optimizers > /tmp/O2-opts
diff /tmp/O2-opts /tmp/O3-opts | grep enabled
And then include all the options except for -ftree-vectorize.
Edit: I don't see -fno-tree-vectorize in the man pages, but it works anyway, so you can do -O3 -fno-tree-vectorize.
Edit: the OP actually wants to enable vectorization for particular functions or whole modules. In that case, for individual functions __attribute__((optimize("tree-vectorize"))) can be used, and for whole modules -O2 -ftree-vectorize.
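For example, a minimal sketch of the per-function attribute (the function and loop are made up for illustration; compile the file with plain -O2):
// scale.cpp -- build with: g++ -O2 -c scale.cpp
// The attribute turns on GCC's tree vectorizer for this one function,
// even though the rest of the file is compiled at plain -O2.
__attribute__((optimize("tree-vectorize")))
void scale(float* __restrict a, const float* __restrict b, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];  // straightforward loop the vectorizer handles
}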
Edit (from Antonio): in theory there is a pragma directive to enable tree-vectorizing all functions that follow:
#pragma GCC optimize("tree-vectorize")
But it seems not to work with my g++ compiler, maybe because of the bug mentioned here:
How to enable optimization in G++ with #pragma. On the other hand, the function attribute works.