GCC 4.6.2 inlining behavior - c++

-- snipped from chat.so --
I am stuck with gcc 4.6.2 on a certain project and after profiling with intel VTune
i noticed that very insignificant functions were not being inlined (or at least showed up under hotspots, which I assumed meant a failed inline)
an example function is a reinterpret cast, 2 numeric additions, and a ternary statement
i BELIEVE these are being inlined in Windows, but due to the profiling, think they are not being inlined in linux under gcc 4.6.2
I am attempting to get an ICC build working in linux (works in windows), but that'll take a little time
until then, does anyone know if GCC 4.6.2 is that different from VS2010 in terms of relatively simple compiler optimizations? I've turned on -O3 in GCC
what led me to this is that this is a rewrite of a significant section of code, and on Windows, the performance is approximately equal or a little slower, while on Linux it is at least 2x as slow.
The most informative answer would help me understand the steps required to verify inlining across platforms and how best to approach this situation as I understand these things are extremely situation-specific.
EDIT: Also, assuming that business-specific reasons force me to stick with GCC 4.6.2, what can I do about this without rewriting the code to make it less maintainable?
Thanks!

First the super-obvious for completeness: Are you absolutely sure that all the files doing the probably non-inlined calls were compiled with -O3?
The gcc and VS compiler and tool chains are sufficiently different that it wouldn't surprise me at all if their optimizers behaved rather differently.
Next let me observe that the ternary operator can be very deceiving. Ternary operators are almost certainly going to create a branch and potentially constructor calls, conversions, etc. Don't assume that just because it's a terse operator in C++ the compiler will be able generate a tiny amount of code for it. This could potentially inhibit the compiler from optmizing it. In fact, you could try reworking the ternary code into a normal if statement and see if that helps your performance at all.
Then once you've moved on to further diagnostics, an easy thing to try is to use strings <binary> | grep function and see if the function name shows up in the binary at all. If it doesn't then it's definitely being inlined (although even if it shows up it could be strictly debug information and not actual code). There are other tools such as nm, readelf, elfdump, and dump that can introspect binaries for symbols as well. You would need to see which tools are available on your platform and then try to use them to find the function(s) in question.
Another idea is to load the compiled binary into gdb, and ask it to disassemble the code at the file and line at the point where the function call is made. Then you can read the disassembly code to see what the compiler did. Most of the code should actually be fairly obvious. You will likely see something like a call instruction if an actual function call was made.

Related

Slow performance of std::complex operator* on some compilers

The performance of my inner loop is suffering due to the NaN tests in std::complex operator*. I know this because if I replace the operator* by my own manual multiplication function that does not include the NaN tests, my code runs twice as fast(!) on real-world workloads. I am happy that there will be no NaNs in the data I am processing, so I do not care about this pedantic IEEE compliance and would like to eliminate it to gain speed.
What I am not sure about is the best way to cause that NaN-checking code to be eliminated. I had thought that -ffinite-math-only should do the trick, but (examining the disassembly) it does not seem to have any effect. Indeed, using all of "-Ofast -ffast-math -funsafe-math-optimizations -ffinite-math-only" out of desperation does not have any effect. Is there some compiler option I could use that would eliminate the undesirable NaN tests?
Frustratingly, I am 99% sure my code used to compile to the "fast" version, which leads me to think that such an option does exist. Using the manual multiplication I am exactly replicating fine-grained benchmark measurements I made a few days ago. However, I cannot easily know exactly what has changed about the compile process, because this is a python extension and I noticed the performance drop when I upgraded from anaconda python 3.5 to 3.7. I can only imagine that the compiler options (or something else) being used behind the scenes have changed, but it's not that easy to go back and check (I'd have to downgrade my entire install and hope that I can reproduce the old performance).
I have naively tried subclassing std::complex, and redefining operator* the way I want it. That does work, but I am finding that I need to overload a fair few other functions, e.g. conj(), to avoid accidentally reverting to the base std::complex type. This feels like a clunky workaround rather than the "right" way to do this.
Can anyone suggest a good way to achieve what I want (causing the compiler to eliminate the NaN checks in std::complex operator*, in order to get a performance gain)?
Edit: 'codegorilla' requested a reproducible example. The following simple code demonstrates the problem but see note below about how my issue seems to be compiler-specific.
#include <complex>
int main(void)
{
std::complex<float> a(rand(), rand()); // input that cannot be optimized away
std::complex<float> b(rand(), rand());
std::complex<float> c = a * b;
printf("%lf\n", c.real()); // output that cannot be optimized away
return 0;
}
Disassemble with
gcc -Ofast -S test.cpp -o test.s; open test.s
I have been working on OS X, where gcc -v reports Apple LLVM version 8.1.0 (clang-802.0.42). However, I now see that this appears to be a compiler-specific issue - things compile as I would expect on gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) things compile as I would expect (optimized code generated with -Ofast but not -O3)
With my compiler, a call to __ZNSt3__1mlIfEENS_7complexIT_EERKS3_S5_ can be seen (and I suspect much of the overall performance issue is that this is not inlined as a compact operation). Inspection of the disassembly for that function confirms that, as per the source for std::complex, the majority of the function is performing checks and corrections for edge cases involving NaNs. This is what I hope to be able to eliminate (which then enables function inlining and subsequent optimizations and, ultimately, a measured doubling of code speed).

Does building the compiler from source result in better optimization?

Consider this simple case scenario:
I download the pre-built binaries of a C++ compiler (say CLang or GCC or anything else) for my generic OS (that is not windows). I compile my code which consists of some computationally expensive mathematical calculation with optimization flag -O3 and I have an execution time of T1.
On a different attempt, this time instead of using pre-built binaries I download the source code and build the compiler by myself on my generic machine. I compile the same code with the same optimization flag, achieving execution time T2?
Will T2 < T1 or they will be more or less the same?
In other words, is the execution time independent from the way that compiler is built?
The compiler's optimization of your code is the result of the behavior of the compiler, not the performance of the compiler.
As long as the compiler has the same behavioral design, it will produce exactly the same output.
Generally the same compiler version should generate the same assembler code given the same C or C++ code input. However there are certain things that might further affect the code that is being execute when you run the compiler.
A distro might have backported (or even created own) patches from other versions.
Modern compilers often have library depenencies (e.g. cloog) that may have different behaviour in different versions, causing the compiler to make code generation decisions based on essentially other data
These libraries may (in some compiler versions) be optional at compile time (might need to give --enable switches to configure, or configure tries to autodetect them).
Compiler switches like -march=native will look on what hardware you compile and try to optimize accordingly.
a time limit in the compilers optimizer triggers, essentially making better optimizations on better machines; or the same for memory (I don't think thats to be found in modern compilers anymore though)
That said, even the same assembler might perform different on yours and their machine, e.g. because one is optimized for AMD, the other for intel.
In my opinion, and in theory, compilation speed can be faster, since you can say to "compiler which compile the compiler", "please target to my computer, and you can use my computer's processor's own machine code to optimize".
But I think compiler's optimization cannot be faster.. To make compiler's optimization faster, I think we need put something like new technology into compiler, not just re-compile.
That depends on how that compiler is implemented and on your platform, but the answer will be most likely "no".
If your platform provides specific functionality that can improve the performance of your program, the optimizer in your compiler might use that functionality to produce a faster program. The optimizer can do so only if the compiler writer was aware of the functionality and has implemented special treatment for your platform in the optimizer. If that is the case, the detection might be done dynamically in the optimizer, meaning any build of the optimizer can detect the platform and optimize your code. Only if the detection has to occur at compiletime of the optimizer for some reason, recompiling it on your platform could give that advantage. But if such a better build for your plaform exists, the compiler vendor most likely has provided binaries for it.
So, with all these ifs, it's unlikely that your program will be any faster when you recompile the compiler on your platform. There is a chance, however, that the compiler will be a bit faster if it is optimized to your platform rather than a generic binary, resulting on shorter compiletimes.

Is it possible to convert C/C++ source code to assembly?

Is it possible to somehow convert a simple C or C++ code (by simple I mean: taking some int as input, printing some simple shapes dependent on that int as output) to assembly language? If there isn't I'll just do it manually but since I'm gonna be doing it for processors like Intel 8080, it just seemed a bit tedious. Can you somehow automate the process?
Also, if there is a way, how good (as in: elegant) would the output assembly file source code be when compared to just translating it manually?
Most compilers will let you produce assembly output. For a couple of obvious examples, Clang and gcc/g++ use the -S flag, and MS VC++ uses the -Fa flag to do so.
A few compilers don't support this directly (e.g., if memory serves Watcom didn't). The ones I've seen like this had you produce an object file, and then included a disassembler that would produce an assembly language file from the object file. I don't remember for sure, but it wouldn't surprise me if this is what you'd need to do with the Digital Mars compiler.
To somebody who's accustomed to writing assembly language, the output from most compilers typically tends to look at least somewhat inelegant, especially on a CPU like an x86 that has quite a few registers that are now really general purpose, but have historically had more specific meanings. For example, if some piece of code needs both a pointer and a counter, a person would probably put the pointer in ESI or EDI, and the counter in ECX. The compiler might easily reverse those. That'll work fine, but an experienced assembly language programmer will undoubtedly find it more readable using ESI for the pointer and ECX for the counter.
Take look at gcc -S:
gcc -S hello.c # outputs hello.s file
Other compilers that maintain at lest partial gcc compatibility may also accept this flag. LLVM's clang, for example, does.
Well, yes there is such a program. It's called "Compiler"
To answer your edit: The elegance of the output depends on the optimization of your compiler. Usually compilers do not generate code we humans would call "elegant".
Most folks here are right, but seem to have missed the note about 8080 (no wonder, it's not in the title :). However, Google comes to the rescue as always - looking for compiler for 8080 produces some nice results like these:
http://www.bdsoft.com/resources/bdsc.html
http://tack.sourceforge.net/
Most of these are pretty old and might be poorly maintained. You might also try 8085 which is fairly similar
(by simple I mean: taking some int as input, printing some simple shapes dependent on
that int as output) to assembly language?
Looking at the output of an x86 compiler is not going to be very helpful, since inputting and outputting are done by a C or C++ library. With an 8080 there is no such library so you have to develop your own I/O routines for some particular hardware. That's lots and lots of additional work.

Get optimized source code from GCC

I have a task to create optimized C++ source code and give it to friend for compilation. It means, that I do not control the final compilation, I just write the source code of C++ program.
I know, that a can make optimization during compilation with -O1 (and -O2 and others) options of GCC. But how can I get this optimized source code instead of compiled program? I am not able to configure parameters of my friend's compiler, that is why I need to make a good source on my side.
The optimizations performed by GCC are low level, that means you won't get C++ code again but assembly code in best case. But you won't be able to convert it or something.
In sum: Optimize the source code on code level, not on object level.
You could ask GCC to dump its internal (Gimple, ...) representations, at various "stages". The middle-end of GCC is made of hundreds of passes, and you could ask GCC to dump them, with arguments like -fdump-tree-all or -fdump-gimple-all; beware that you can get hundreds of dump files for a single compilation!
However, GCC internal representations are quite low level, and you should not expect to understand them without reading a lot of material.
The dump options I am mentionning are mostly useful to those working inside GCC, or extending it thru plugins coded in C or extensions coded in MELT (a high-level domain specific language to extend GCC). I am not sure they will be very useful to your friend. However, they can be useful to make you understand that optimization passes do a lot of complex processing.
And don't forget that premature optimization is evil : you should first make your program run correctly, then benchmark and profile it, at last optimize the few parts worth of your efforts. You probably won't be able to write correct & efficient programs without testing and running them yourself, before giving them to your friend.
Easy - choose the best algorithm possible, let the rest be handled by the optimizer.
Optimizing the source code is different than optimizing the binary. You optimize the source code, the compiler will optimize the binary.
For anything more than algorithm choice, you'll need to do some profiling. Sure, there are practices that can speed up code speed, but some make the code less readable. Only optimize when you have to, and after you measure.

Verifying compiler optimizations in gcc/g++ by analyzing assembly listings

I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic to me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code or articles specific to the GCC toolchain that talk about this problem?
GCC's optimization passes work on an intermediary representation of your code in a format called GIMPLE.
Using the -fdump-* family of options, you can ask GCC to output intermediary states of the tree.
For example, feed this to gcc -c -fdump-tree-all -O3
unsigned fib(unsigned n) {
if (n < 2) return n;
return fib(n - 2) + fib(n - 1);
}
and watch as it gradually transforms from simple exponential algorithm into a complex polynomial algorithm. (Really!)
A useful technique is to run the code under a good sampling profiler, e.g. Zoom under Linux or Instruments (with Time Profiler instrument) under Mac OS X. These profilers not only show you the hotspots in your code but also map source code to disassembled object code. Highlighting a source line shows the (not necessarily contiguous) lines of generated code that map to the source line (and vice versa). Online opcode references and optimization tips are a nice bonus.
Instruments: developer.apple.com
Zoom: www.rotateright.com
Not gcc, but when debugging in Visual Studio you have the option to intersperse assembly and source, which gives a good idea of what has been generated for what statement. But sometimes it's not quite aligned correctly.
The output of the gcc tool chain and objdump -dS isn't at the same granularity. This article on getting gcc to output source and assembly has the same options as you are using.
Adding the -L option (eg, gcc -L -ahl) may provide slightly more intelligible listings.
The equivalent MSVC option is /FAcs (and it's a little better because it intersperses the source, machine language, and binary, and includes some helpful comments).
About one third of my job consists of doing just what you're doing: juggling C code around and then looking at the assembly output to make sure it's been optimized correctly (which is preferred to just writing inline assembly all over the place).
Game-development blogs and articles can be a good resource for the topic since games are effectively real-time applications in constant memory -- I have some notes on it, so does Mike Acton, and others. I usually like to keep Intel's instruction set reference up in a window while going through listings.
The most helpful thing is to get a good ground-level understanding of assembly programming generally first -- not because you want to write assembly code, but because having done so makes reading disassembly much easier. I've had a hard time finding a good modern textbook though.
In order to output the optimizations applied you can use:
-fopt-info-optimized
To see those that have not been applied
-fopt-info-missed
Beware that the output is sent to standard error stream so to see it you actually have to redirect that : ( hint 2>&1 )
Here is nice example of :
g++ -O3 -std=c++11 -march=native -mtune=native
-fopt-info-optimized h2d.cpp -o h2d 2>&1
h2d.cpp:225:3: note: loop vectorized
h2d.cpp:213:3: note: loop vectorized
h2d.cpp:198:3: note: loop vectorized
h2d.cpp:186:3: note: loop vectorized
You can check the interleaved output, when having applied -g with objdump -dS|c++filt , but that will not get you that far.Enjoy!
Zoom from RotateRight ( http://rotateright.com ) is mentioned in another answer, but to expand on that: it shows you the mapping of source to assembly in what they call the "code browser". It's incredibly handy even if you're not an asm expert because they have also integrated assembly documentation into the app. And the assembly listing is annotated with comments and timing for several CPU types.
You can just open your object or executable file with Zoom and take a look at what the compiler has done with your code.
Victor, in your case the optimization you are looking for is just a smaller allocation of local memory on the stack. You should see a smaller allocation at function entry and a smaller deallocation at function exit if the space used by the empty class is optimized away.
As for the general question, I've been reading (and writing) assembly language for more than (gulp!) 30 years and all I can say is that it takes practice, especially to read the output of a compiler.
Instead of trying to read through an assembler dump, run your program inside a debugger. You can pause execution, single-step through instructions, set breakpoints on the code you want to check, etc. Many debuggers can display your original C code alongside the generated assembly so you can more easily see what the compiler did to optimize your code.
Also, if you are trying to test a specific compiler optimization you can create a short dummy function that contains the type of code that fits the optimization you are interested in (and not much else, the simpler it is the easier the assembly is to read). Compile the program once with optimizations on and once with them off; comparing the generated assembly code for the dummy function between builds should show you what the compiler's optimizers did.