__xl_pow, __xl_log, __xl_exp in perf report - fortran

I work on performance profiling of an application (app) that was compiled with the IBM XL Fortran compiler for the IBM POWER8 processor. This is part of the output of perf report:
3.88% app app [.] __xl_pow
2.91% app app [.] __xl_log
1.81% app app [.] __xl_exp
What are these functions shown in the profile? My hypothesis is that these are the implementations of pow(), log() and exp() that are supplied with the compiler (see a similar discussion). Is that correct?

When you enable an optimization level of -O3 or higher, the XL compilers replace several libm function calls with calls to a high performance library shipped with the compiler. The __xl_* function calls you're seeing are coming from that library. If you don't want them, for example because their precision is sometimes slightly different from the libm calls, compile with -qstrict=library.
Note: Even with -qstrict=library, XL Fortran might still call its own functions for pow(), but these functions have the same precision as libm's pow().


how does assembler convert from assembly to machine code?

I know this has been asked many times, but I am looking for a simple interpretation.
Let's say I have some assembly code that a C++ compiler generated.
Now the assembler kicks in and it has to transform the assembly code into machine code.
Question 1). Will the C++ assembler compiler look at the table where each assembly instruction has the corresponding machine code instruction?
Question 2). If the C++ program runs on an Intel processor, then the assembler needs to take a look at the table published by the Intel team, right? Because in the end, the C++ program runs on an Intel processor.
Question 3). If I am right about question 2, then how is it possible that a program written in C++ can run on a computer which uses an Intel processor and on a computer which uses an AMD processor?
Please try to limit your questions to one question per question. Nevertheless, let me try and answer them.
Question 1
An “assembly compiler” is called an “assembler.” Assembly is assembled, not compiled. And the assembler is not specific to C++. It is specific to the architecture and can only be used to assemble assembly programs for that architecture.
Yes, assemblers are usually implemented by having a large table mapping instruction mnemonics to the operation codes (opcodes) they correspond to. This table also tells the assembler what operands the instruction takes and how the operands are encoded. There can be multiple entries for the same mnemonic if the mnemonic corresponds to multiple instructions.
It is, however, not a requirement to do it this way. Assemblers may choose different approaches or combine tables with pre- and postprocessing steps.
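As a rough illustration of that table-driven idea, here is a toy sketch in C++ (not taken from any real assembler; the mnemonics, opcode values and names are all made up for the example):

#include <cstdio>
#include <cstring>

// One table entry per instruction form: mnemonic, opcode byte, operand count.
struct InsnEntry {
    const char   *mnemonic;
    unsigned char opcode;
    int           operands;
};

// Hypothetical single-byte opcodes for an imaginary architecture.
static const InsnEntry table[] = {
    { "nop", 0x90, 0 },
    { "inc", 0x40, 1 },
    { "add", 0x01, 2 },
};

// Look a mnemonic up in the table. A real assembler would then go on to
// encode the operands according to the entry's description.
static int lookup(const char *mnemonic) {
    for (const InsnEntry &entry : table)
        if (std::strcmp(entry.mnemonic, mnemonic) == 0)
            return entry.opcode;
    return -1;  // unknown mnemonic: report an assembly error
}

int main() {
    std::printf("\"add\" encodes to opcode 0x%02x\n", (unsigned)lookup("add"));
}

A real assembler additionally resolves labels, emits relocations and handles multiple encodings per mnemonic, but the lookup above is the core of the answer to Question 1.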
Question 2
This is correct. Processor vendors generally provide documentation for their processors in which all instructions and their instruction encodings are listed. For Intel, this information can be found in the Intel Software Development Manuals. Note that while the processor vendor provides such specifications, it is the job of the assembler author to translate these documents into tables for use by the assembler. This is traditionally done manually but recently, people have started automatically translating manuals into tables.
Question 3
Both Intel and AMD produce processors of the amd64 (also called x86-64, IA32e, Intel 64, EM64T, and other things) architecture. So a program written for an Intel processor generally also runs on an AMD processor.
Note that there are tiny differences between Intel's and AMD's implementation of this architecture. Your compiler is aware of them and won't generate code that can behave differently between the two.
There are also various instruction set extensions available on some but not all amd64 processors. Programs using these will only run on processors that have these instruction set extensions. However, unless you specifically tell your compiler to make use of such extensions, it won't use any of them and your code will run on amd64 processors of any vendor.
Will the C++ assembler
There is no "the C++" assembler. An assembler generally doesn't need to know anything about a higher level languages (if any) that were compiled to the assembly code.
... look at the table where each assembly instruction has the corresponding machine code instruction?
Nothing says that there has to be a "table" but sure, an assembler supporting multiple CPU architectures could do that.
If the C++ program runs on an Intel processor, then the assembler needs to take a look at the table published by the Intel team, right?
Such a table would likely be written by the authors of the assembler program rather than the CPU vendor. It would be based on manuals published by the vendor.
how is it possible that a program written in C++ can run on a computer which uses an Intel processor and on a computer which uses an AMD processor?
Intel, AMD and VIA all make CPUs that implement the same(ish) instruction set, called x86-64. An assembler targeting the x86-64 instruction set should produce code that works on any CPU supporting the x86-64 instruction set.
There are a few small variations between the different implementations, and the assemblers (and compilers) must be designed in a way that takes such differences into consideration if the program is to work on all those systems. Example: early Intel 64 CPUs lack the NX bit (according to Wikipedia, which doesn't cite a source). A program that is to work on those CPUs mustn't use that feature.

Why do I see __scalbnf in my profiler?

I am profiling some C++ code with perf, and I see that __scalbnf and __wrap_scalbnf are taking up a good chunk of the run time. I looked up what these functions are, and my best guess is I am calling them via a call to std::exp. However I'd like to be able to confirm this. Is there a place where I can see the C++ code implementing std::exp to confirm this? Or what is the best way for me (a C++ amateur) to start digging into this and understanding what is happening?
Thank you.
Set a breakpoint on __scalbn. Run your program. Look at a backtrace (in GDB, bt). The call tree will show that exp() is a parent function for __scalbn.
If a function has multiple callers, the first hit might not be from the "hot" function you're profiling.
To actually figure out which higher-up function (including its children) is responsible for using a lot of time, see linux perf: how to interpret and find hotspots. Top-down profiling can find expensive functions that do all their work in calls to other functions, even when those other functions also have "innocent" callers. (e.g. memcpy is heavily used and often unavoidable, but what you'd want to find are callers that use it too much and could be optimized better. Or not called at all.)
And BTW, yes glibc's math lib exp() implementation does internally use __scalbn. I'm not sure how bad the implementation is, but I don't see an asm version for x86-64, only this pure C version. https://code.woboq.org/userspace/glibc/sysdeps/ieee754/dbl-64/wordsize-64/s_scalbn.c.html. (For __scalbnl(long double) there's https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/s_scalbnl.S.html, using the x87 fscale instruction for 80-bit floats. But there are only i386 asm files for the other sizes. And IA-64 (Itanium), but not x86-64).
glibc does have some vectorized EXP code, though, like the SSE4 SVML version https://code.woboq.org/userspace/glibc/sysdeps/x86_64/fpu/multiarch/svml_d_exp2_core_sse4.S.html#_ZGVbN2v_exp_sse4.
If you want higher-performance exp() without perfect accuracy, see Fastest Implementation of Exponential Function Using AVX (that's for float, not double. I forget if there's an SO answer with a double version).
Also related: Efficient implementation of log2(__m256d) in AVX2.
To confirm that std::exp is the reason for __scalbnf and __wrap_scalbnf, you can replace the std::exp calls by either:
an identity function that returns the input value
or by an alternative exp implementation (for example fm_exp, found here)
Then, if you still see __scalbnf and __wrap_scalbnf in the profiler output, it means it's not coming from std::exp.
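As a minimal sketch of that experiment (the name my_exp is invented here; it is just a stand-in, not a real exp implementation):

#include <cmath>

// Stand-in used only for the profiling experiment. Numerically wrong on
// purpose: we only care whether __scalbnf/__wrap_scalbnf disappear from
// the profile once std::exp is out of the picture.
static inline float my_exp(float x) {
    return x;                // identity function
    // return std::exp(x);   // restore this line to get the real behaviour back
}

Point the call sites you suspect at my_exp instead of std::exp, rebuild, and re-run the profiler; the numeric results will be wrong, but the presence or absence of __scalbnf in the new profile answers the question.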

How to use compiler builtin functions without Standard C library

I know that some functions like sin, cos, min, max and memcpy may be treated not as normal functions but instead replaced by built-in functions (which may be more efficient than mere inline function calls when the replacement is an actual processor instruction or instructions, such as directly using the FSIN instruction for the standard sin function when compiling for an x86 with a floating-point unit).
The question: I would like to use the power of built-in functions (in C/C++, mostly in mingw/gcc, maybe other compilers), but I do not want to link to libc, the Standard C Library.
Is it possible to use builtins without linking to libc?
Are there any command line flags needed to optimize those symbols as built-ins?
(Related to previous, but rephrased)
Will they be automatically recognized by name, or are compiler flag(s) necessary to enable usage of built-ins?
@randomusername has already explained the usage of the __builtin_ prefix for many common Standard C Library functions. I recommend using #define to make the change, while keeping your code clean.
#include <math.h>
#define cos __builtin_cos
#define sin __builtin_sin
#define printf __builtin_printf
...
printf("Distance is %f\n", cos(M_PI/4.0) * 7);
...
No Standard C Library
Now, to not use the Standard C Library (which means not linking to it, and not including the typical startup and exit code stubs), well, with GCC that is possible with the -nostdlib option, which is equivalent to -nostartfiles plus -nodefaultlibs.
The issue is that you then have to replace all the library functions you would normally use, including system calls (or their wrappers / macros from glibc) for any kernel-based functions.
I don't know of a portable or robust method that works across processors or even necessarily different families (sysenter vs. syscall (instruction) vs. int 0x80 for various 32- and 64-bit x86 processors). There are issues with ELF Auxiliary Vectors (Elf32_auxv_t) and vDSO (virtual ELF dynamic shared object) that it may be possible to address to create a portable solution; I don't know.
Entry Point
I believe all GCC environments use the same default entry point, which is the label/function _start. This is normally included in the "Startup files" and then calls the traditional C/C++ entry point of main. So you would need to replace it with a minimal stub of your own (which can be in C).
Program termination
I don't know how to replace _exit(rc) or a similar function required to correctly terminate the program in a portable fashion. For example, in a Linux environment it needs to make a system call to the kernel function SYS_exit (aka __NR_exit or sys_exit).
#include <stddef.h>   /* freestanding header, provided by the compiler itself */

int main(int argc, char *argv[]);        /* normally declared by the startup code */
void your_exit_replacement(int rc);

void _start(void) {
    int rc;
    /* Get command line arguments if necessary */
    rc = main(0, NULL);
    your_exit_replacement(rc);
}
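For what it's worth, here is one possible shape for that replacement. This is only a sketch: it assumes Linux on x86-64 and GCC-style inline assembly, and it hard-codes the x86-64 exit system call number; other architectures and kernels use different numbers and calling conventions.

/* Sketch only: Linux x86-64 specific exit replacement via a raw system call. */
void your_exit_replacement(int rc) {
    __asm__ volatile ("syscall"
                      : /* no outputs */
                      : "a" (60),   /* rax = 60 = __NR_exit on x86-64 */
                        "D" (rc)    /* rdi = exit status */
                      : "rcx", "r11", "memory");
    __builtin_unreachable();        /* the exit system call does not return */
}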
Alternatives
Normally user processes (i.e. application programs, as opposed to operating system kernels or drivers) accept the overhead of linking the startup files and the overhead needed to enable dynamic linking to the Standard C Library, because memory is considered cheap and readily available enough that, for any real (actually-does-something) application, the memory saving is not worthwhile. In the embedded domain, where it is not as acceptable to just assume plenty of memory is available, the alternative is to use a minimal libc replacement. For Linux there are several available (e.g. musl, uClibc, dietlibc); I don't know if there is one available for mingw or for Windows-compatible open source replacements (ReactOS, Wine).
Further
For further information, from a Linux platform point of view, there is a nice introduction, "Hello from a libc-free world!" Part 1 and Part 2, by Jessica McKellar, blogging at Oracle. There are also a number of related questions, and some (in some cases partial) answers, here at Stack Overflow about using -nostdlib in various circumstances.
Where to go from here depends on your goals: education, embedded, tiny program (Linux ELF executable) or Windows PE executable competitions.
Microsoft Windows
There are various articles for a Microsoft Windows environment dealing with .COM and .EXE executables, and Windows PE, but typically using Microsoft's Visual Studio environment or assembly. The "classics" are Matt Pietrek's Under the Hood column "Reduce EXE and DLL Size with LIBCTINY.LIB" (January 2001 issue of MSDN Magazine) and "Remove Fatty Deposits from Your Applications Using Our 32-Bit Liposuction Tools" from the October 1996 Microsoft Systems Journal. Another article, which I haven't read myself but which appears to include explanations, is "Reducing Executable Size".
Let's say you wanted to replace the function cos; all you have to do is replace every occurrence of cos in your code with __builtin_cos. The same goes for any other function that you can replace with the compiler's version. Just prepend __builtin_ to the name.
For more information consult the gcc manual.

Why is the LLVM execution engine faster than compiled code?

I have a compiler which targets LLVM, and I provide two ways to run the code:
Run it automatically. This mode compiles the code to LLVM and uses the ExecutionEngine JIT to compile it into machine code on-the-fly and run it without ever generating an output file.
Compile it and run separately. This mode outputs an LLVM .bc file, which I manually optimise (with opt), compile to native assembly (with llc), assemble to machine code and link (with gcc), and run.
I was expecting approach #2 to be faster than approach #1, or at least the same speed, but running a few speed tests, I am surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.
Both cases are running the same LLVM source code. With approach #1, I haven't yet bothered to run any LLVM optimisation passes (which is why I was expecting it to be slower). With approach #2, I am running opt with -std-compile-opts and llc with -O3, to maximise optimisation, yet it isn't getting anywhere near #1. Here is an example run of the same program:
#1 without optimisation: 11.833s
#2 without optimisation: 22.262s
#2 with optimisation (-std-compile-opts and -O3): 18.823s
Is the ExecutionEngine doing something special that I don't know about? Is there any way for me to optimise the compiled code to achieve the same performance as the ExecutionEngine JIT?
It is normal for a VM with a JIT to run some applications faster than a compiled application. That's because a VM with a JIT is like a simulator that simulates a virtual computer, and also runs a compiler in real time. Because both tasks are built into the VM with the JIT, the machine simulator can feed information to the compiler so that the code can be recompiled to run more efficiently. The information that it provides is not available to statically compiled code.
This effect has also been noted with Java VMs and with Python's PyPy VM, among others.
Another issue is code alignment and other such optimizations. Nowadays CPUs are so complex that it's hard to predict which techniques will result in faster execution of the final binary.
As a real-life example, let's consider Google's Native Client. I mean the original NaCl compilation approach, not the one involving LLVM (because, as far as I know, there is currently a move towards supporting both "native client" and (modified) "LLVM bitcode" code).
As you can see in presentations (check out youtube.com) or in papers, like Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even though their alignment technique makes the code size bigger, in some cases such alignment of instructions (for example with NOPs) gives better cache hit rates.
Aligning instructions with NOPs and instruction reordering are known techniques in parallel computing, and they show their impact here as well.
I hope this answer gives an idea of how many circumstances can influence code execution speed, and that there are many possible reasons for differences between pieces of code, each of which needs investigation. Nevertheless, it's an interesting topic, so if you find some more details, don't hesitate to edit your answer and let us know in a postscript what more you have found (maybe with a link to a whitepaper or dev blog with the new findings). Benchmarks are always welcome; take a look at http://llvm.org/OpenProjects.html#benchmark.

C/C++ usage of special CPU features

I am curious: do new compilers use some of the extra features built into new CPUs, such as MMX, SSE, 3DNow! and so on?
I mean, the original 8086 had no FPU at all, so a compiler that old could not even use one, but new compilers can, since an FPU is part of every new CPU. So, do new compilers use new CPU features?
Or, perhaps it is more correct to ask: do the new C/C++ standard library functions use new features?
Thanks for the answers.
EDIT:
OK, so, if I understand all of you correctly, even some standard operations, especially with floating-point numbers, can be done faster using SSE.
In order to use it, I must enable this feature in my compiler, if it supports it. If it does, I must be sure that the targeted platform supports those features.
In the case of some system libraries that require top performance, such as OpenGL, DirectX and so on, this support may be built into the system.
By default, for compatibility reasons, the compiler doesn't use it, but you can add this support using special C functions delivered by, for example, Intel. This seems to be the best way, since you can directly control whether and when you use the special features of the desired platform, and write applications that support multiple CPUs.
gcc will support newer instructions via command line arguments. See here for more info. To quote:
GCC can take advantage of the additional instructions in the MMX, SSE, SSE2, SSE3 and 3dnow extensions of recent Intel and AMD processors. The options -mmmx, -msse, -msse2, -msse3 and -m3dnow enable the use of these extra instructions, allowing multiple words of data to be processed in parallel. The resulting executables will only run on processors supporting the appropriate extensions--on other systems they will crash with an Illegal instruction error (or similar).
These instructions are not part of any ISO C/C++ standards. They are available through compiler intrinsics, depending on the compiler used.
For MSVC, see http://msdn.microsoft.com/en-us/library/26td21ds(VS.80).aspx
For GCC, you could look at http://developer.apple.com/hardwaredrivers/ve/sse.html
AFAIK, SSE intrinsics are the same between GCC and MSVC.
Compilers generally aim to produce code for a minimal set of processor features. They also provide compilation switches that allow you to target specific processors. In this manner, they can sell more compilers (to those folks with old processors as well as the trendy folk with new ones).
You will need to study the documentation that came with your compiler.
Sometimes the runtime library will contain multiple implementations of a feature, and the library will dynamically choose between implementations when the program is run. The overhead might be the cost of a function pointer call instead of a direct function call, but the benefit could be much greater when using a CPU-specific optimised function.
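As a sketch of that dispatch pattern (the function names here are invented, and __builtin_cpu_supports is a GCC/Clang builtin for x86, used just as one possible way to detect the feature):

#include <cstring>
#include <cstddef>

typedef void (*copy_fn)(void *, const void *, std::size_t);

// Two hypothetical implementations; imagine the second one using a
// CPU-specific instruction set such as AVX2 internally.
static void copy_baseline(void *dst, const void *src, std::size_t n) { std::memcpy(dst, src, n); }
static void copy_fancy(void *dst, const void *src, std::size_t n)    { std::memcpy(dst, src, n); }

// Decide once, based on the CPU the program is actually running on.
static copy_fn resolve_copy() {
    return __builtin_cpu_supports("avx2") ? copy_fancy : copy_baseline;
}

void copy(void *dst, const void *src, std::size_t n) {
    static copy_fn impl = resolve_copy();  // resolved on first use
    impl(dst, src, n);                     // the indirect call mentioned above
}

glibc does something similar for functions like memcpy, just via the dynamic linker's IFUNC mechanism rather than an explicit function pointer in user code.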
JIT compilers (for VM languages such as Java and C#) take this one step further and compile the bytecode for the specific CPU that it's running on. This gives your own code the benefit of specific CPU optimisation. This is one reason why Java code can actually be faster than compiled C code, because the Java JIT compiler can delay its optimisation decisions until the program is run on the actual target machine. A C compiler must make those decisions without always knowing what the target CPU is. Furthermore, JIT compilers evolve and can make your program faster over time without you having to do anything.
If you use the Intel C compiler, and set sufficiently high optimisation options, you will find that some of your loops get 'vectorised', which means the compiler has rewritten them to use SSE-style instructions.
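For instance, a simple loop of the following shape (a generic example, not from any particular codebase) is a typical auto-vectorisation candidate once the compiler is allowed to emit SSE/AVX instructions:

// With icc or gcc at a high optimisation level (and SSE/AVX enabled), a loop
// like this is typically rewritten to process several floats per instruction.
void scale(float *out, const float *in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];
}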
If you want to use SSE operations directly, you use the intrinsics defined in the 'xmmintrin.h' header file; say
#include <xmmintrin.h>

void example(void) {
    __m128 U, V, W;
    float ww[4];
    V = _mm_set1_ps(1.5f);                  /* V = (1.5, 1.5, 1.5, 1.5) */
    U = _mm_set_ps(0.0f, 1.0f, 2.0f, 3.0f); /* note: arguments fill from the highest lane down */
    W = _mm_add_ps(U, V);                   /* element-wise addition of four floats */
    _mm_storeu_ps(ww, W);                   /* store the four results into ww */
}
Different compilers will use different new features. Visual Studio will use SSE/SSE2, and I believe the Intel compiler will support the very latest CPU features. You should, of course, be wary about the market penetration of your favourite feature.
As for what your favourite standard library uses, that depends on what it was compiled with. However, the C++ standard library is typically compiled as part of your own translation units, since it's very heavily templated, so if you enable SSE2, the C++ standard library should use it. As for the CRT, that depends on what it was compiled with.
There are generally two ways a compiler can generate code that uses special features like these:
When the compiler itself is compiled, you configure it to generate code for a particular architecture, and it can take advantage of any features it knows that architecture will have. For example, if gcc is configured for an Intel processor new enough (or is that "not old enough"?) to contain an integrated FPU, it will generate floating-point instructions.
When the compiler is invoked, flags or parameters can specify the type of features available to the processor that will run the program, and then the compiler will know it is safe to use these features. If the flags aren't present, it will generate equivalent code without using the special instructions provided by those features.
If you're talking about code written in C/C++, the new features are exploited if you tell your compiler to do so. By default, your compiler probably targets "plain x86" (naturally with an FPU :) ), usually optimized for the most widespread processor generation at the moment, but still able to run on older processors.
If you want the compiler to generate code that also takes the new instruction sets into consideration, you should tell it to do so with the appropriate command line switch/project setting; for example, for Visual C++ the option to enable SSE/SSE2 instruction generation is /arch.
Notice that many features of new instruction sets cannot be exploited directly in "normal" code, so you are usually provided with compiler intrinsics to operate on the particular data types native to the new instruction sets.
Intel provides updated CPUID example code every time they release a new CPU so that you can check for the new features, and has done so for as long as I can remember. At least this is what I found the first time I thought about this same question myself.
Using CPUID to Detect the presence of SSE 4.1 and SSE 4.2 Instruction Sets
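For reference, a minimal sketch of such a check with GCC or Clang on x86, using the compiler-provided <cpuid.h> helper (MSVC instead offers __cpuid in <intrin.h>), might look like this:

#include <cpuid.h>   // GCC/Clang helper header for the CPUID instruction (x86 only)
#include <cstdio>

int main() {
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {      // CPUID leaf 1: feature flags
        std::printf("SSE4.1: %s\n", (ecx & (1u << 19)) ? "yes" : "no");
        std::printf("SSE4.2: %s\n", (ecx & (1u << 20)) ? "yes" : "no");
    }
    return 0;
}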
As new compilers are released, they add support for the new features directly; VS2010, for example.
Visual C++ Code Generation in Visual Studio 2010