After reading around the subject, there is overwhelming evidence from numerous sources that using standard C or C++ casts to convert from floating point to integer on Intel is very slow. In order to meet the ANSI/ISO specification, Intel CPUs need to execute a large number of instructions, including those needed to switch the rounding mode of the FPU hardware.
There are a number of workarounds described in various documents, but the cleanest and most portable seems to be the lrint() call added to the C99 and C++0x standards. Many documents say that a compiler should inline-expand these functions when optimization is enabled, leading to code which is faster than either a conventional cast or a function call.
I even found references to gcc feature-tracking bugs to add this inline expansion to the gcc optimizer, but in my own performance tests I have been unable to get it to work. All my attempts show lrint performance to be much slower than a simple C or C++ style cast. Examining the assembly output of the compiler, and disassembling the compiled objects, always shows an explicit call to an external lrint() or lrintf() function.
The gcc versions I have been working with are 4.4.3 and 4.6.1, and I have tried a number of flag combinations on 32-bit and 64-bit x86 targets, including options to explicitly enable SSE.
How do I get gcc to inline expand lrint, and give me fast conversions?
The lrint() function may raise domain and range errors. One possible way the libc deals with such errors is setting errno (see C99/C11 section 7.12.1). The overhead of the error checking can be quite significant and in this particular case seems to be enough for the optimizer to decide against inlining.
The gcc flag -fno-math-errno (which is part of -ffast-math) will disable these checks. It might be a good idea to look into -ffast-math if you do not rely on standards-compliant handling of floating-point semantics, in particular NaNs and infinities...
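For illustration, here is a minimal test case (my own sketch, assuming an x86-64 target and a reasonably recent gcc) to verify whether the inline expansion actually happens:

/* lrint_test.c -- check whether gcc inline-expands lrintf().
   Compile with:  gcc -O2 -fno-math-errno -S lrint_test.c
   With -fno-math-errno, lrint_test.s should contain a single conversion
   instruction (cvtss2si on x86-64) instead of a call to lrintf; without
   the flag, an explicit call usually remains. */
#include <math.h>

long to_long(float x)
{
    return lrintf(x);   /* rounds according to the current rounding mode */
}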
Have you tried the -finline-functions flag to gcc?
You can also direct GCC to try to integrate all “simple enough” functions into their callers with the option -finline-functions.
see http://gcc.gnu.org/onlinedocs/gcc/Inline.html
This tells gcc to consider every function for inlining, but not every function will actually be inlined. The compiler uses heuristics to determine whether a function is small enough to be worth inlining; note also that recursive functions will generally not be inlined this way.
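As a quick way to see those heuristics in action, -Winline makes gcc warn whenever a function declared inline is not actually inlined (a sketch; the function names are my own):

/* inline_test.c -- compile with:
   gcc -O2 -finline-functions -Winline -S inline_test.c */
static inline int square(int x) { return x * x; }

int sum_of_squares(int a, int b)
{
    /* small enough that gcc's size heuristic should accept it;
       a -Winline warning here would mean it was rejected */
    return square(a) + square(b);
}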
My understanding is that compilers follow certain semantics that decide whether or not a function should be expanded inline. For example, if the callee unconditionally returns a value (no if/else-if before the return), it may be expanded in the caller itself. Similarly, function call overhead can also guide this expansion. (I may be completely wrong.)
Similarly, the hardware parameters like cache-usage may also play a role in expansion.
As a programmer, I want to understand these semantics and the algorithms which guide inline expansion. Ultimately, I should be able to write (or recognize) code that surely will be inlined (or not inlined). I don't mean to override the compiler, or that I think I could write code better than the compiler itself. The question is rather to understand the internals of compilers.
EDIT: Since I use gcc/g++ in my work, we can limit the scope to these two alone. Though I was of the opinion that there would be several things common across compilers in this context.
You don't need to understand the inlining (or other optimization) criteria, because by definition (assuming that the optimizing compiler is not buggy in that respect), inlined code should behave the same as non-inlined code.
Your first example (callee unconditionally returning a value) is in practice certainly wrong, in the sense that several compilers are able to inline conditional returns.
For example, consider this f.c file:
static int fact (int n) {
    if (n <= 0)
        return 1;
    else
        return n * fact (n - 1);
}

int foo () {
    return fact (10);
}
Compile it with gcc -O3 -fverbose-asm -S f.c; the resulting f.s assembly file contains only one function (foo), the fact function has completely gone, and the call fact(10) has been inlined (recursively) and replaced by the constant 3628800 (constant folding).
With GCC (the current version is GCC 5.2, as of July 2015), assuming you ask it to optimize (e.g. compile with gcc -O2 or g++ -O2 or -O3), the inlining decision is not easy to understand. The compiler would very probably make better inlining decisions than you could. There are many internal heuristics guiding it (so no simple few guiding principles, but some heuristics to inline, others to avoid inlining, and probably some meta-heuristics to choose between them). Read about optimize options (-finline-limit=...) and function attributes.
You might use the always_inline and gnu_inline and noinline (and also noclone) function attributes, but I don't recommend doing that in general.
You could disable inlining with noinline, but very often the resulting code would be slower. So don't do that...
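For reference, a minimal sketch of how these GCC attributes are spelled (the function names are illustrative only):

/* Force inlining regardless of the heuristics (use sparingly): */
static inline __attribute__((always_inline)) int add1(int x) { return x + 1; }

/* Forbid inlining (and cloning) of a function: */
static __attribute__((noinline, noclone)) int add2(int x) { return x + 2; }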
The key point is that the compiler is better optimizing and inlining than what you reasonably can, so trust it to inline and optimize well.
Optimizing compilers (see also this) can (and do) inline functions even without you knowing that, e.g. they are sometimes inlining functions not marked inline or not inlining some functions marked inline.
So no, you don't want to "understand these semantics and the algorithms which guide inline expansion"; they are too difficult... and vary from one compiler to another (even from one version to another). If you really want to understand why GCC is inlining (this means spending months of work, and I believe you should not lose your time on that), use -fdump-tree-all and other dump flags, instrument the compiler using MELT (which I am developing), and dive into the source code (since GCC is free software).
You'll need more than your lifetime, or at least several dozen years, to understand all of GCC (more than ten million lines of source code) and how it optimizes. By the time you understood something, the GCC community would have worked on new optimizations, etc.
BTW, if you compile and link an entire application or library with gcc -flto -O3 (e.g. with make CC='gcc -flto -O3'), the GCC compiler will do link-time optimization and inline some calls across translation units (e.g. in f1.c you call foo defined in f2.c, and some of the calls to foo in f1.c would get inlined).
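A minimal two-file sketch of that cross-translation-unit case (hypothetical file names):

/* f2.c */
int foo(int x) { return x * 2; }

/* f1.c */
#include <stdio.h>
extern int foo(int);
int main(void)
{
    printf("%d\n", foo(21));   /* candidate for inlining at link time */
    return 0;
}

/* Build with: gcc -flto -O3 f1.c f2.c -o demo */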
The compiler optimizations do take into account cache sizes (for deciding about inlining, unrolling, register allocation & spilling, and other optimizations), in particular when compiling with gcc -mtune=native -O3.
Unless you force the compiler (e.g. by using the noinline or always_inline function attributes in GCC, which is often wrong and would produce worse code), you'll never be able in practice to guess that a given code chunk will certainly be inlined. Even people working on GCC middle-end optimizations cannot guess that reliably! So you cannot reliably understand -and predict- the compiler's behavior in practice; hence don't even waste your time trying.
Look also into MILEPOST GCC; by using machine learning techniques to tune some GCC parameters, they have been able to sometimes get astonishing performance improvements, but they certainly cannot explain or understand them.
If you need to understand your particular compiler while coding some C or C++, your code is probably wrong (e.g. probably could have some undefined behavior). You should code against some language specification (either the C11 or C++14 standards, or the particular GCC dialect e.g. -std=gnu11 documented and implemented by your GCC compiler) and trust your compiler to be faithful w.r.t. that specification.
Inlining is like copy-paste. There aren't so many gotchas that will prevent it from working, but it should be used judiciously. If it gets out of control, the program will become bloated.
Most compilers use a heuristic based on the "size" of the function. Since this is usually applied before any code generation pass, the number of AST nodes may be used as a proxy for size. A function that includes inlined calls needs to include them in its own size, or inlining can go totally out of control. However, AST nodes that will not generate instructions should not prevent inlining. It can be difficult to tell what will generate a "move" instruction and what will generate nothing.
Since modern C++ tends to involve lots of functions that perform conceptual rearrangement with no underlying instructions, the difficulty is telling the difference between no instructions, "just a few" moves, and enough move instructions to cause a problem. The only way to tell for a particular instance is to run the program in a debugger and/or read the disassembly.
In typical C++ code, we mostly just assume that the inliner is working hard enough. For performance-critical situations, you can't just eyeball it or assume that anything is working optimally. Detailed performance analysis at the disassembly level is essential.
-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project, and after profiling with Intel VTune
I noticed that very small functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline)
an example function is a reinterpret cast, 2 numeric additions, and a ternary statement
I BELIEVE these are being inlined on Windows, but due to the profiling, think they are not being inlined on Linux under gcc 4.6.2
I am attempting to get an ICC build working on Linux (it works on Windows), but that'll take a little time
until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC
what led me to this is that this is a rewrite of a significant section of code, and on Windows, the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!
First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch, and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++, the compiler will be able to generate a tiny amount of code for it. This could potentially inhibit the compiler from optimizing it. In fact, you could try reworking the ternary code into a normal if statement and see if that helps your performance at all.
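To illustrate the suggested rework (a hypothetical stand-in for the function described in the chat above, not the actual code):

static inline int clamp_add(int a, int b, int limit)
{
    int sum = a + b;
    return (sum > limit) ? limit : sum;   /* original ternary form */
}

static inline int clamp_add_if(int a, int b, int limit)
{
    int sum = a + b;
    if (sum > limit)                      /* reworked as a plain if */
        return limit;
    return sum;
}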
Then once you've moved on to further diagnostics, an easy thing to try is to use strings <binary> | grep function and see if the function name shows up in the binary at all. If it doesn't then it's definitely being inlined (although even if it shows up it could be strictly debug information and not actual code). There are other tools such as nm, readelf, elfdump, and dump that can introspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb, and ask it to disassemble the code at the file and line at the point where the function call is made. Then you can read the disassembly code to see what the compiler did. Most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.
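For instance (a sketch with made-up functions, assuming a build with optimization and debug info):

/* sketch.c -- build with: gcc -O3 -g sketch.c -o sketch
   Then:  gdb ./sketch  and  (gdb) disassemble compute
   If tiny() was inlined, compute's disassembly shows just the addition
   (e.g. a lea instruction); a remaining call instruction to tiny
   means inlining failed. */
static int tiny(int x) { return x + 1; }

int compute(int v)          /* the call site we want to inspect */
{
    return tiny(v);
}

int main(int argc, char **argv)
{
    (void)argv;
    return compute(argc);
}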
I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, it calls __sin_pentium4, otherwise it calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. This is controlled by a function called _set_SSE2_enable. This is a public symbol documented here:
Enables or disables the use of Streaming SIMD Extensions 2 (SSE2)
instructions in CRT math routines. (This function is not available on
x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions that this affects only certain transcendental functions, but looking at the disassembly, it seems to affect every one of them.
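Since _set_SSE2_enable returns the resulting state, it can also double as a runtime probe (a sketch for 32-bit MSVC builds):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Request the SSE2 code paths; the return value is nonzero
       if they are actually enabled on this CPU/CRT combination. */
    int sse2 = _set_SSE2_enable(1);
    printf("SSE2 math routines %s\n", sse2 ? "enabled" : "not available");
    return 0;
}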
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
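For example, a reproducible sqrt built directly on the SSE2 sqrtsd instruction (a sketch using intrinsics):

#include <emmintrin.h>

static double sse2_sqrt(double x)
{
    /* sqrtsd computes the square root of the low element of the second
       argument; the upper element is taken from the first argument and
       is irrelevant here. */
    return _mm_cvtsd_f64(_mm_sqrt_sd(_mm_set_sd(x), _mm_set_sd(x)));
}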
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly over time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you do not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then declaring extern <type> ___use_sse2_mathfcns; and checking whether it is true would clearly work.
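Something along these lines (completely unportable, and a sketch only; the exact spelling of the symbol at the C level depends on the CRT's name decoration):

extern int ___use_sse2_mathfcns;   /* spelling as seen in the CRT assembly;
                                      the C-level name may need one less
                                      leading underscore */

int using_sse2_paths(void)
{
    return ___use_sse2_mathfcns != 0;
}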
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those - that's the only way to guarantee the same result. Or, use some suitable inline assembler (or transcendental) code to calculate selected sin, cos, etc. values, and compare those with the sin() and cos() functions provided by the library.
Are the following functions executed in a single clock cycle?
__builtin_popcount
__builtin_ctz
__builtin_clz
Also, what is the number of clock cycles for the ll (64-bit) versions of the same?
Are they portable? Why or why not?
Do these functions execute in a single clock-cycle?
Not necessarily. On architectures where they can be implemented with a single instruction, they will typically be the fastest way to compute that function (but still not necessarily a single clock cycle). On architectures where they cannot be implemented as a single instruction, their performance is less certain.
On my processor (a Core 2 Duo), __builtin_ctz and __builtin_clz can be implemented with a single instruction (Bit Scan Forward and Bit Scan Reverse). However, __builtin_popcount cannot be implemented with a single instruction on my processor. For __builtin_popcount, gcc 4.7.2 calls a library function, while clang 3.1 generates an inline instruction sequence (implementing this bit twiddling hack). Clearly, the performance of those two implementations will not be the same.
Are they portable?
They are not portable across compilers. They originated with GCC (as far as I know), and are also implemented in some other compilers such as Clang.
Compilers that do support these functions may provide them for multiple architectures, but implementation quality (performance) is likely to vary.
__builtin functions like this are used to access specific machine instructions in a somewhat easier way than using inline assembly. It makes sense to use them if you need the highest performance and are willing to sacrifice portability to get it, or to provide an alternate implementation for compilers or platforms where these functions are not available. If optimal low-level performance is your goal, you should also check the assembly output of the compiler, to determine whether it really is generating the instruction that you expect it to use.
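For instance, a portable fallback for __builtin_popcount (a sketch of the classic bit-twiddling reduction mentioned above):

#include <stdint.h>

static unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums */
    return (x * 0x01010101u) >> 24;                   /* total count */
}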
You can get a first idea of what your compiler does with these builtins by compiling with -O3 -march=native -S into assembler code. There you can check whether each resolves to just one assembler instruction. Even if it does, this is no guarantee that it executes in one cycle; to know the real cost, you'd have to measure.
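For example (a minimal probe; the exact instruction to expect, such as bsf or tzcnt, depends on -march):

/* ctz_test.c -- compile with: gcc -O3 -march=native -S ctz_test.c,
   then inspect ctz_test.s */
unsigned ctz(unsigned x)
{
    return (unsigned) __builtin_ctz(x);   /* result is undefined for x == 0 */
}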
When I compile an application with Intel's compiler it is slower than when I compile it with GCC. The Intel compiler's output is more than 2x slower. The application contains several nested loops. Are there any differences between GCC and the Intel compiler that I am missing? Do I need to turn on some other flags to improve the Intel compiler's performance? I expected the Intel compiler to be at least as fast as GCC.
Compiler Versions:
Intel version 12.0.0 20101006
GCC version 4.4.4 20100630
The compiler flags are the same with both compilers:
-O3 -openmp -parallel -mSSE4.2 -Wall -pthread
I have no experience with the Intel compiler, so I can't answer whether you are missing some flags or not.
However, from what I recall, recent versions of gcc are generally as good at optimizing code as icc (sometimes better, sometimes worse, although most sources seem to indicate generally better), so you might have run into a situation where icc is particularly bad. Examples of what optimizations each compiler can do can be found here and here. Even if gcc is not generally better, you could simply have a case which gcc recognizes for optimization and icc doesn't. Compilers can be very picky about what they optimize and what not, especially regarding things like autovectorization.
If your loop is small enough, it might be worth comparing the generated assembly code between gcc and icc. Also, if you show some code, or at least tell us what you are doing in your loop, we might be able to give you better speculations about what leads to this behaviour. If it's a relatively small loop, it is likely a case of icc missing one optimization (or some, but probably not many) which either has inherently good potential (prefetching, autovectorization, unrolling, loop-invariant motion, ...) or which enables other optimizations (primarily inlining).
Note that I'm only talking about optimization potential when I compare gcc to icc. In the end, icc might typically generate faster code than gcc, but not so much because it does more optimizations: rather because it has a faster standard library implementation, and because it is smarter about where to optimize (at high optimization levels gcc gets a little bit overeager, or at least it used to, about trading code size for theoretical runtime improvements; this can actually hurt performance, e.g. when the carefully unrolled and vectorized loop is only ever executed with 3 iterations).
I normally use -inline-level=1 -inline-forceinline to make sure that functions which I have explicitly declared inline actually do get inlined. Other than that I would expect ICC performance to be at least as good as with gcc. You will need to profile your code to see where the performance difference is coming from. If this is Linux then I recommend using Zoom, which you can get on a free 30 day evaluation.