The performance of my inner loop is suffering due to the NaN tests in std::complex operator*. I know this because if I replace the operator* by my own manual multiplication function that does not include the NaN tests, my code runs twice as fast(!) on real-world workloads. I am happy that there will be no NaNs in the data I am processing, so I do not care about this pedantic IEEE compliance and would like to eliminate it to gain speed.
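For reference, the manual multiplication I am comparing against is along these lines (a sketch, not my exact code):

#include <complex>

// Sketch of a complex multiply without the NaN/infinity fix-ups performed by
// the library operator* (illustrative, not my exact function).
static inline std::complex<float> fast_mul(const std::complex<float> &a,
                                           const std::complex<float> &b)
{
    return std::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                               a.real() * b.imag() + a.imag() * b.real());
}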
What I am not sure about is the best way to cause that NaN-checking code to be eliminated. I had thought that -ffinite-math-only should do the trick, but (examining the disassembly) it does not seem to have any effect. Indeed, using all of "-Ofast -ffast-math -funsafe-math-optimizations -ffinite-math-only" out of desperation does not have any effect. Is there some compiler option I could use that would eliminate the undesirable NaN tests?
Frustratingly, I am 99% sure my code used to compile to the "fast" version, which leads me to think that such an option does exist. Using the manual multiplication I am exactly replicating fine-grained benchmark measurements I made a few days ago. However, I cannot easily know exactly what has changed about the compile process, because this is a python extension and I noticed the performance drop when I upgraded from anaconda python 3.5 to 3.7. I can only imagine that the compiler options (or something else) being used behind the scenes have changed, but it's not that easy to go back and check (I'd have to downgrade my entire install and hope that I can reproduce the old performance).
I have naively tried subclassing std::complex, and redefining operator* the way I want it. That does work, but I am finding that I need to overload a fair few other functions, e.g. conj(), to avoid accidentally reverting to the base std::complex type. This feels like a clunky workaround rather than the "right" way to do this.
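To illustrate what I mean by clunky, the workaround looks roughly like this (a sketch; I ended up needing more overloads than shown here):

#include <complex>

// Sketch of the subclassing workaround (illustrative only).
struct fast_complex : std::complex<float>
{
    using std::complex<float>::complex;
    fast_complex(const std::complex<float> &z) : std::complex<float>(z) {}
};

// NaN-check-free multiply, same body as fast_mul above.
inline fast_complex operator*(const fast_complex &a, const fast_complex &b)
{
    return std::complex<float>(a.real() * b.real() - a.imag() * b.imag(),
                               a.real() * b.imag() + a.imag() * b.real());
}

// conj() (and friends) must also be overloaded, or expressions silently
// decay back to std::complex<float> and the checked operator* returns.
inline fast_complex conj(const fast_complex &z)
{
    return std::complex<float>(z.real(), -z.imag());
}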
Can anyone suggest a good way to achieve what I want (causing the compiler to eliminate the NaN checks in std::complex operator*, in order to get a performance gain)?
Edit: 'codegorilla' requested a reproducible example. The following simple code demonstrates the problem, but see the note below about how my issue seems to be compiler-specific.
#include <complex>
#include <cstdio>   // printf
#include <cstdlib>  // rand

int main(void)
{
    std::complex<float> a(rand(), rand()); // input that cannot be optimized away
    std::complex<float> b(rand(), rand());
    std::complex<float> c = a * b;
    printf("%lf\n", c.real()); // output that cannot be optimized away
    return 0;
}
Disassemble with
gcc -Ofast -S test.cpp -o test.s; open test.s
I have been working on OS X, where gcc -v reports Apple LLVM version 8.1.0 (clang-802.0.42). However, I now see that this appears to be a compiler-specific issue - on gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1), things compile as I would expect (optimized code is generated with -Ofast but not with -O3).
With my compiler, a call to __ZNSt3__1mlIfEENS_7complexIT_EERKS3_S5_ can be seen (and I suspect much of the overall performance issue is that this is not inlined as a compact operation). Inspection of the disassembly for that function confirms that, as per the source for std::complex, the majority of the function is performing checks and corrections for edge cases involving NaNs. This is what I hope to be able to eliminate (which then enables function inlining and subsequent optimizations and, ultimately, a measured doubling of code speed).
Related
For a school assignment I have to convert some C++ code to assembly (ARMv8). Then I have to compile the C++ code using GCC's -O0, -O1, -O2, -O3 and -Os optimization levels, record the time, and compare it with the execution time of my assembly code. As far as I know, -O3 should be faster than -O1 and -O2. However, I find that -O2 is the fastest, followed by -O1, -O3, -Os, -O0. Is that usual? (The measured times are around 30 seconds.)
Notice that GCC has many other optimization flags.
There is no guarantee that -O3 gives faster code than -O2; the compiler applies more optimization passes, but they are all heuristics and might be unsuccessful (or even slightly slow down your particular code). Hence it does happen that -O3 produces slightly slower code than -O2 on some particular input source code.
You could try a more recent version of GCC (the latest, as of November 2017, is GCC 7; GCC 8 will come out in a few months). You could also try a better -march= or -mtune= option.
Finally, with your own GCC plugin, you might add your own optimization pass, or change the order (and the set) of applied optimization passes (there are several hundred different optimization passes in GCC). But it takes a lot of work (perhaps a year or two) to be able to extend GCC.
You could tune optimization parameters, and some project (MILEPOST) has even used machine learning techniques to improve them.
See also slides and references on my (old) GCC MELT documentation.
Yes, it is usual. Take the -Ox optimization levels as guidelines. On average, they produce the optimization that is advertised, but a lot depends on the style in which the code is written, the memory layout, as well as the compiler itself.
Sometimes, you need to try and fail many times before getting the optimal code.
-O2 indeed gives the best optimization in most cases.
My understanding is that compilers follow certain semantics that decide whether or not a function should be expanded inline. For example, if the callee unconditionally returns a value (no if/else-if around the return), it may be expanded in the caller itself. Similarly, function call overhead can also guide this expansion. (I may be completely wrong.)
Similarly, the hardware parameters like cache-usage may also play a role in expansion.
As a programmer, I want to understand these semantics and the algorithms which guide inline expansion. Ultimately, I should be able to write (or recognize) code that will surely be inlined (or not inlined). I don't mean to override the compiler, or that I think I could write better code than the compiler itself. The question is rather about understanding the internals of compilers.
EDIT: Since I use gcc/g++ in my work, we can limit the scope to these two alone. Though I was of the opinion that there would be several things common across compilers in this context.
You don't need to understand the inlining (or other optimization) criteria, because by definition (assuming the optimizing compiler is not buggy in that respect), inlined code should behave the same as non-inlined code.
Your first example (callee unconditionally returning a value) is in practice certainly wrong, in the sense that several compilers are able to inline conditional returns.
For example, consider this f.c file:
static int fact (int n) {
    if (n <= 0) return 1;
    else
        return n * fact (n - 1);
}

int foo () {
    return fact (10);
}
Compile it with gcc -O3 -fverbose-asm -S f.c; the resulting f.s assembly file contains only one function (foo), the fact function is completely gone, and the call fact(10) has been inlined (recursively) and replaced by 3628800 through constant folding.
With GCC (the current version is GCC 5.2, as of July 2015), assuming you ask it to optimize (e.g. compile with gcc -O2 or g++ -O2 or -O3), the inlining decision is not easy to understand. The compiler will very probably make better inlining decisions than you can. There are many internal heuristics guiding it (so there is no simple small set of guiding principles, but some heuristics to inline, others to avoid inlining, and probably some meta-heuristics to choose between them). Read about the optimize options (-finline-limit=...) and function attributes.
You might use the always_inline and gnu_inline and noinline (and also noclone) function attributes, but I don't recommend doing that in general.
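For reference, the attribute syntax looks like this (a minimal sketch; add and sub are just placeholder functions):

// Force inlining of add (GCC warns if it cannot honour always_inline):
__attribute__((always_inline)) inline int add(int a, int b) { return a + b; }

// Forbid inlining of sub:
__attribute__((noinline)) int sub(int a, int b) { return a - b; }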
You could disable inlining with noinline, but very often the resulting code would be slower. So don't do that...
The key point is that the compiler is better at optimizing and inlining than you reasonably can be, so trust it to inline and optimize well.
Optimizing compilers (see also this) can (and do) inline functions even without you knowing that, e.g. they are sometimes inlining functions not marked inline or not inlining some functions marked inline.
So no, you don't want to "understand these semantics and the algorithms which guide inline expansion", they are too difficult ... and vary from one compiler to another (even one version to another). If you really want to understand why GCC is inlining (this means spending months of work, and I believe you should not lose your time on that), use -fdump-tree-all and other dump flags, instrument the compiler using MELT -which I am developing-, dive into the source code (since GCC is a free software).
You'll need more than your life time, or at least several dozens of years, to understand all of GCC (more than ten millions lines of source code) and how it is optimizing. By the time you understood something, the GCC community would have worked on new optimizations, etc...
BTW, if you compile and link an entire application or library with gcc -flto -O3 (e.g. with make CC='gcc -flto -O3'), the GCC compiler will do link-time optimization and inline some calls across translation units (e.g. in f1.c you call foo defined in f2.c, and some of the calls to foo in f1.c would get inlined).
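A minimal sketch of that cross-file situation (hypothetical file contents):

// f2.c
int foo(int x) { return x * 2; }

// f1.c
int foo(int);
int bar(int y) { return foo(y) + 1; }   // with gcc -flto -O3 this call can be inlined across the two files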
The compiler optimizations do take into account cache sizes (for deciding about inlining, unrolling, register allocation & spilling and other optimizations), in particular when compiling with gcc -mtune=native -O3.
Unless you force the compiler (e.g. by using the noinline or always_inline function attributes in GCC, which is often wrong and would produce worse code), you'll never be able in practice to guess whether a given code chunk will certainly be inlined. Even people working on GCC middle-end optimizations cannot guess that reliably! So you cannot reliably understand - or predict - the compiler's behavior in practice, hence don't waste your time trying.
Look also into MILEPOST GCC; by using machine learning techniques to tune some GCC parameters, they have been able to sometimes get astonishing performance improvements, but they certainly cannot explain or understand them.
If you need to understand your particular compiler while coding some C or C++, your code is probably wrong (e.g. probably could have some undefined behavior). You should code against some language specification (either the C11 or C++14 standards, or the particular GCC dialect e.g. -std=gnu11 documented and implemented by your GCC compiler) and trust your compiler to be faithful w.r.t. that specification.
Inlining is like copy-paste. There aren't so many gotchas that will prevent it from working, but it should be used judiciously. If it gets out of control, the program will become bloated.
Most compilers use a heuristic based on the "size" of the function. Since this is usually before any code generation pass, the number of AST nodes may be used as a proxy for size. A function that includes inlined calls needs to include them in its own size, or inlining can go totally out of control. However, AST nodes that will not generate instructions should not prevent inlining. It can be difficult to tell what will generate a "move" instruction and what will generate nothing.
Since modern C++ tends to involve lots of functions that perform conceptual rearrangement with no underlying instructions, the difficulty is telling the difference between no instructions, "just a few" moves, and enough move instructions to cause a problem. The only way to tell for a particular instance is to run the program in a debugger and/or read the disassembly.
Mostly in typical C++ code, we just assume that the inliner is working hard enough. For performance-critical situations, you can't just eyeball it or assume that anything is working optimally. Detailed performance analysis at the disassembly level is essential.
-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project and after profiling with intel VTune
i noticed that very insignificant functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline)
an example function is a reinterpret cast, 2 numeric additions, and a ternary statement
i BELIEVE these are being inlined in Windows, but due to the profiling, think they are not being inlined in linux under gcc 4.6.2
I am attempting to get an ICC build working in linux (works in windows), but that'll take a little time
until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC
what led me to this is that this is a rewrite of a significant section of code, and on Windows, the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!
First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++ the compiler will be able to generate a tiny amount of code for it. This could potentially inhibit the compiler from optimizing it. In fact, you could try reworking the ternary code into a normal if statement, as sketched below, and see if that helps your performance at all.
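For instance (a hypothetical sketch, not the asker's actual code), the two forms to compare would be:

// Hypothetical sketch: the same selection written with a ternary and as a
// plain if/else, so the generated code for each can be compared.
int select_ternary(bool c, int a, int b)
{
    return c ? a + 1 : b - 1;
}

int select_if(bool c, int a, int b)
{
    if (c)
        return a + 1;
    return b - 1;
}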
Then once you've moved on to further diagnostics, an easy thing to try is to use strings <binary> | grep function and see if the function name shows up in the binary at all. If it doesn't then it's definitely being inlined (although even if it shows up it could be strictly debug information and not actual code). There are other tools such as nm, readelf, elfdump, and dump that can introspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb, and ask it to disassemble the code at the file and line at the point where the function call is made. Then you can read the disassembly code to see what the compiler did. Most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.
After reading around the subject, there is overwhelming evidence from numerous sources that using standard C or C++ casts to convert from floating point to integer numbers on Intel is very slow. In order to meet the ANSI/ISO specification, Intel CPUs need to execute a large number of instructions, including those needed to switch the rounding mode of the FPU hardware.
There are a number of workarounds described in various documents, but the cleanest and most portable seems to be the lrint() call added to the C99 and C++0x standards. Many documents say that a compiler should inline-expand these functions when optimization is enabled, leading to code which is faster than a conventional cast or a function call.
I even found references to gcc feature-tracking bugs about adding this inline expansion to the gcc optimizer, but in my own performance tests I have been unable to get it to work. All my attempts show lrint performance to be much slower than a simple C or C++ style cast. Examining the assembly output of the compiler, and disassembling the compiled objects, always shows an explicit call to an external lrint() or lrintf() function.
The gcc versions I have been working with are 4.4.3 and 4.6.1, and I have tried a number of flag combinations on 32bit and 64bit x86 targets, including options to explicitly enable SSE.
How do I get gcc to inline expand lrint, and give me fast conversions?
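For reference, the kind of conversion I am benchmarking looks roughly like this (a sketch, not my actual test code):

#include <math.h>   // lrint

// Conventional cast: truncates toward zero; on x87 this historically means
// switching the FPU rounding mode around the conversion.
long convert_cast(double x)  { return (long)x; }

// lrint: rounds using the current rounding mode; the hope is that the
// compiler inline-expands it to a single instruction instead of a call.
long convert_lrint(double x) { return lrint(x); }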
The lrint() function may raise domain and range errors. One possible way the libc deals with such errors is setting errno (see C99/C11 section 7.12.1). The overhead of the error checking can be quite significant and in this particular case seems to be enough for the optimizer to decide against inlining.
The gcc flag -fno-math-errno (which is part of -ffast-math) will disable these checks. It might be a good idea to look into -ffast-math if you do not rely on standards-compliant handling of floating-point semantics, in particular NaNs and infinities...
Have you tried the -finline-functions flag to gcc?
You can also direct GCC to try to integrate all “simple enough” functions into their callers with the option -finline-functions.
see http://gcc.gnu.org/onlinedocs/gcc/Inline.html
This tells gcc to consider all functions for inlining, but not all of them will actually be inlined.
The compiler uses some heuristics to determine whether a function is small enough to be inlined. One more thing: recursive functions are also not going to be inlined here.
I'm having a real strange problem using GCC for ARM with the optimizations turned on.
Compiling my C++ application without the optimizations produces an executable that
at runtime outputs the expected results. As soon as I turn on the
optimizations - that is -O1 - my application fails to produce the expected results.
I tried for a couple of days to spot the problem but I'm clueless.
I eliminated any uninitialized variables from my code, I corrected the spots where
strict aliasing could cause problems but still I do not have the proper results.
I'm using GCC 4.2.0 for ARM (the processor is an ARM926ej-s) and running the app
on a Montavista Linux distribution.
Below are the flags I'm using:
-O1 -fno-unroll-loops -fno-merge-constants -fno-omit-frame-pointer -fno-toplevel-reorder \
-fno-defer-pop -fno-function-cse -Wuninitialized -Wstrict-aliasing=3 -Wstrict-overflow=3 \
-fsigned-char -march=armv5te -mtune=arm926ej-s -ffast-math
As soon as I strip the -O1 flag and recompile/relink the application I get the proper output results. As you can see from the flags I tried to disable any optimization I thought it might cause problems but still no luck.
Does anyone have any pointers on how I could further tackle this problem?
Thanks
Generally speaking, if you say "optimization breaks my program", it is 99.9% certain that your program is broken. Enabling optimizations only uncovers the faults in your code.
You should also go easy on the optimization options. Only in very specific circumstances will you need anything else beyond the standard options -O0, -O2, -O3 and perhaps -Os. If you feel you do need more specific settings than that, heed the mantra of optimizations:
Measure, optimize, measure.
Never go by "gut feeling" here. Prove that a certain non-standard optimization option does significantly benefit your application, and understand why (i.e., understand exactly what that option does, and why it affects your code).
This is not a good place to navigate blindfolded.
And seeing how you use the most defensive option (-O1), then disable half a dozen optimizations, and then add -ffast-math, leads me to assume you're currently doing just that.
Well, perhaps one-eyed.
But the bottom line is: If enabling optimization breaks your code, it's most likely your code's fault.
EDIT: I just found this in the GCC manual:
-ffast-math: This option should never be turned on by any -O option
since it can result in incorrect output for programs which depend on
an exact implementation of IEEE or ISO rules/specifications for math
functions.
This does say, basically, that your -O1 -ffast-math could indeed break correct code. However, even if taking away -ffast-math removes your current problem, you should at least have an idea why. Otherwise you might merely exchange your problem now with a problem at a more inconvenient moment later (like, when your product breaks at your client's location). Is it really -ffast-math that was the problem, or do you have broken math code that is uncovered by -ffast-math?
-ffast-math should be avoided if possible. Just use -O1 for now and drop all the other optimisation switches. If you still see problems then it's time to start debugging.
Without seeing your code, it's hard to get more specific than "you probably have a bug".
There are two scenarios where enabling optimizations changes the semantics of the program:
there is a bug in the compiler, or
there is a bug in your code.
The latter is probably the most likely. Specifically, you probably rely on Undefined Behavior somewhere in your program. You rely on something that just so happen to be true when you compile using this compiler on this computer with these compiler flags, but which isn't guaranteed by the language. And so, when you enable optimizations, GCC is under no obligation to preserve that behavior.
Show us your code. Or step through it in the debugger until you get to the point where things go wrong.
I can't be any more specific. It might be a dangling pointer, uninitialized variables, breaking the aliasing rules, or even just doing one of the many things that yield undefined results (like i = i++)
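For instance, here is a hypothetical sketch of the strict-aliasing case (illustrative only, not the asker's code):

#include <cstdio>

// Hypothetical sketch: writing an object of one type through a pointer to an
// unrelated type is undefined behaviour, and its visible effect can change
// once the optimizer assumes the two cannot alias.
static int value = 1;

static int change_and_read(float *f)
{
    value = 2;
    *f = 3.0f;        // if f really points at 'value', this breaks the aliasing rules
    return value;     // the optimizer may assume *f cannot alias 'value' and just return 2
}

int main()
{
    // At -O0 this typically prints the bit pattern of 3.0f; with -O2 (which
    // enables -fstrict-aliasing) it may print 2 instead.
    std::printf("%d\n", change_and_read((float *)&value));
    return 0;
}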
Try to make a minimal test case. Rewrite the program, removing things that don't affect the error. It's likely that you'll discover the bug yourself in the process, but if you don't, you should have a one-screen example program you can post.
Incidentally, if, as others have speculated, it is -ffast-math which causes your trouble (i.e. compiling with just -O1 works fine), then it is likely you have some math in there you should rewrite anyhow. It's a bit of an over-simplification, but -ffast-math permits the compiler to rearrange computations as if they were abstract mathematical numbers - even though doing so on real hardware may produce slightly different results, since floating point numbers aren't exact. Relying on that kind of floating point detail is likely to be unintentional.
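A tiny sketch of the kind of reordering sensitivity involved (illustrative values):

#include <cstdio>

// Floating-point addition is not associative; -ffast-math lets the compiler
// reorder it anyway.
int main()
{
    float a = 1e20f, b = -1e20f, c = 1.0f;
    std::printf("%f\n", (a + b) + c);   // 1.000000
    std::printf("%f\n", a + (b + c));   // 0.000000 -- c is absorbed into b before the cancellation
    return 0;
}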
If you want to understand the bug, a minimal test-case is critical in any case.