Is it possible to profile C/C++ inline functions by looking at the assembly? - c++

I want to measure the performance, and more generally the behaviour (how much assembly is generated, etc.), of some inline functions I use throughout my project. Besides timing them with a profiler, is it possible to look at the overall code expansion of the functions that use those inline ones?
I tried looking at the Disassembly panel while debugging in Visual C++ and MinGW (through NetBeans). In a debug build every use of an inline function shows up as a call in the assembly, so nothing is inlined. If I enable optimizations, the assembly changes so much that I cannot even put breakpoints inside those functions.
Do you know of any compiler settings (in GCC or VC, for example, to optimize only inline functions), books (I have "Efficient C++", which covers inlining by measuring timings), or anything else that would help me understand the topic better?

Here is the link to the compiler switch in VS. If you just want to test inlining, enable only this optimization.
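Besides the global switch, inlining can also be controlled per function, which helps when you want to compare the two variants in the disassembly. The following is only a sketch: the DEMO_* macro names and both functions are made up for illustration, and only the underlying compiler keywords (__forceinline, __declspec(noinline), and the GCC always_inline / noinline attributes) are real.

```cpp
#include <cassert>

// Hypothetical helper macros (not from any library) wrapping the
// compiler-specific spellings for forcing or forbidding inlining.
#if defined(_MSC_VER)
#  define DEMO_FORCE_INLINE __forceinline
#  define DEMO_NO_INLINE    __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define DEMO_FORCE_INLINE inline __attribute__((always_inline))
#  define DEMO_NO_INLINE    __attribute__((noinline))
#else
#  define DEMO_FORCE_INLINE inline
#  define DEMO_NO_INLINE
#endif

// The small function whose expansion we want to study.
DEMO_FORCE_INLINE int scale_and_add(int x, int k, int b) { return x * k + b; }

// Kept out of line so it is easy to locate in the disassembly;
// its body then shows how scale_and_add was expanded.
DEMO_NO_INLINE int caller_with_inlining(const int* data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += scale_and_add(data[i], 3, 1);
    return sum;
}

// Tiny usage check: (1*3+1) + (2*3+1) + (3*3+1) = 21.
inline void demo_inline_check() {
    const int data[] = {1, 2, 3};
    assert(caller_with_inlining(data, 3) == 21);
}
```

Swapping DEMO_FORCE_INLINE for DEMO_NO_INLINE on scale_and_add and diffing the two listings gives a rough idea of how much code the inlining adds or removes.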

Tools like Intel VTune can profile inlined functions. They use the debugging info in the binary to map the instruction pointers back to the function "from which the code was derived" even when there is no actual call anymore in the assembly.
You can see this effect while looking at annotated assembly with some tools as well - the source code for several functions will be mixed together which reflects the inlining.
This process isn't perfect, since "to what function does this particular instruction belong" becomes almost a philosophical question rather than a technical one for some types of inlining (indeed, instructions may effectively be "shared" among several functions).

Related

How to pick the right numbers when it comes to function inlining in C++?

When developing C++ programs with clang or gcc, the inlining metrics are chosen by default. How can a user choose the inlining parameters, such as the maximum size of the inlinee method or the container size, to get the best out of the program? Does the programmer have to look at, for example, the size of the produced executable or the number of virtual methods? How should the inlining metrics be chosen?
If you need to spend time on micro-optimization, output the assembly and see whether your function is inlined in the contexts where you use it. There are online tools that do this with multiple versions of gcc and clang. You will get the hang of it after a few experiments.
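A minimal sketch of such an experiment, assuming a file called inline_check.cpp (the file name and both functions are made up for illustration). The extern "C" is only there to keep the symbol name unmangled so it is easy to find in the listing.

```cpp
#include <cassert>

// Candidate for inlining.
static inline int square(int x) { return x * x; }

// extern "C" keeps the name readable in the assembly output.
extern "C" int sum_of_squares(int n) {
    int total = 0;
    for (int i = 1; i <= n; ++i) total += square(i);
    return total;
}

// Tiny usage check: 1 + 4 + 9 = 14.
inline void demo_sum_check() { assert(sum_of_squares(3) == 14); }

// Inspecting the result with GCC (clang accepts the same options):
//   g++ -O2 -S inline_check.cpp -o inline_check.s
// Then read the body of sum_of_squares in inline_check.s: if no call to
// square appears there, the call was inlined. Repeating with -O0 shows
// the non-inlined version for comparison.
```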

What are guiding principles of expansion of callee inside the caller (Inlining - Compiler Optimization) [duplicate]

My understanding is that compilers follow certain semantics that decide whether or not a function should be expanded inline. For example, if the callee returns a value unconditionally (no if/else-if before the return), it may be expanded in the caller itself. Similarly, function call overhead can also guide this expansion. (I may be completely wrong.)
Similarly, hardware parameters like cache usage may also play a role in expansion.
As a programmer, I want to understand these semantics and the algorithms which guide inline expansion. Ultimately, I should be able to write (or recognize) code that will surely be inlined (or not inlined). I don't mean to override the compiler, nor do I think I could write better code than the compiler itself. The question is rather about understanding the internals of compilers.
EDIT: Since I use gcc/g++ in my work, we can limit the scope to these two alone, though I was of the opinion that several things would be common across compilers in this context.
You don't need to understand the inlining (or other optimization) criteria, because by definition (assuming that the optimizing compiler is not buggy in that respect), inlined code should behave the same as non-inlined code.
Your first example (callee unconditionally returning a value) is in practice certainly wrong, in the sense that several compilers are able to inline conditional returns.
For example, consider this f.c file:
static int fact (int n) {
    if (n <= 0) return 1;
    else
        return n * fact (n - 1);
}
int foo () {
    return fact (10);
}
Compile it with gcc -O3 -fverbose-asm -S f.c; the resulting f.s assembly file contains only one function (foo). The fact function is completely gone: the call fact(10) has been inlined (recursively) and replaced (by constant folding) with 3628800.
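If you want to double-check the folded value without reading assembly, a C++11 variant can ask the compiler to evaluate it. This is a different mechanism (constexpr evaluation is mandated by the language, whereas the folding above is an optimizer decision), so take it only as a sketch confirming the number:

```cpp
#include <cassert>

// Same factorial, written as a single return so it is valid C++11 constexpr.
constexpr int fact(int n) {
    return n <= 0 ? 1 : n * fact(n - 1);
}

// Evaluated by the compiler; the build fails if the value were different.
static_assert(fact(10) == 3628800, "10! should be 3628800");

int foo() { return fact(10); }

// Tiny runtime check of the same value.
inline void demo_fact_check() { assert(foo() == 3628800); }
```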
With GCC (the current version is GCC 5.2 in July 2015), assuming you ask it to optimize (e.g. compile with gcc -O2 or g++ -O2 or -O3), the inlining decision is not easy to understand. The compiler would very probably make better inlining decisions than you can. There are many internal heuristics guiding it (so there are no simple guiding principles, but some heuristics to inline, others to avoid inlining, and probably some meta-heuristics to choose between them). Read about the optimize options (-finline-limit=...) and function attributes.
You might use the always_inline and gnu_inline and noinline (and also noclone) function attributes, but I don't recommend doing that in general.
You could disable inlining with noinline, but very often the resulting code would be slower. So don't do that...
The key point is that the compiler is better at optimizing and inlining than you can reasonably be, so trust it to inline and optimize well.
Optimizing compilers (see also this) can (and do) inline functions even without you knowing that, e.g. they are sometimes inlining functions not marked inline or not inlining some functions marked inline.
So no, you don't want to "understand these semantics and the algorithms which guide inline expansion"; they are too difficult ... and vary from one compiler to another (even from one version to another). If you really want to understand why GCC is inlining (this means spending months of work, and I believe you should not waste your time on that), use -fdump-tree-all and other dump flags, instrument the compiler using MELT (which I am developing), or dive into the source code (since GCC is free software).
You'll need more than your lifetime, or at least several decades, to understand all of GCC (more than ten million lines of source code) and how it optimizes. By the time you understood something, the GCC community would have worked on new optimizations, etc...
BTW, if you compile and link an entire application or library with gcc -flto -O3 (e.g. with make CC='gcc -flto -O3'), the GCC compiler would do link-time optimization and inline some calls across translation units (e.g. in f1.c you call foo defined in f2.c, and some of the calls to foo in f1.c would get inlined).
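A small sketch of that two-file situation, written in C++ for illustration. The two "files" are shown in one listing (which also compiles as a single unit); the bodies of foo and sum_foo are invented, only the -flto option itself comes from GCC.

```cpp
#include <cassert>

// ---- f2.cpp (hypothetical file): defines foo ----
int foo(int x) { return 2 * x + 1; }

// ---- f1.cpp (hypothetical file): sees only a declaration of foo ----
int foo(int x);  // would normally come from a shared header

int sum_foo(int n) {
    int total = 0;
    // Without LTO this is an opaque call when compiling f1.cpp alone;
    // with -flto the optimizer may see the body of foo and inline it here.
    for (int i = 0; i < n; ++i) total += foo(i);
    return total;
}

// Tiny usage check: foo(0) + foo(1) + foo(2) = 1 + 3 + 5 = 9.
inline void demo_lto_check() { assert(sum_foo(3) == 9); }

// Build sketch when the two parts really live in separate files:
//   g++ -O3 -flto -c f1.cpp f2.cpp
//   g++ -O3 -flto f1.o f2.o -o app
```

Whether a given call actually gets inlined is still up to the compiler; the flag only makes it possible.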
The compiler optimizations do take cache sizes into account (when deciding about inlining, unrolling, register allocation & spilling and other optimizations), in particular when compiling with gcc -mtune=native -O3.
Unless you force the compiler (e.g. by using the noinline or always_inline function attributes in GCC, which is often wrong and would produce worse code), you'll never be able in practice to guess that a given code chunk would certainly be inlined. Even people working on GCC middle-end optimizations cannot guess that reliably! So you cannot reliably understand and predict the compiler's behavior in practice, hence don't even waste your time trying.
Look also into MILEPOST GCC; by using machine learning techniques to tune some GCC parameters, they have been able to sometimes get astonishing performance improvements, but they certainly cannot explain or understand them.
If you need to understand your particular compiler while coding some C or C++, your code is probably wrong (e.g. probably could have some undefined behavior). You should code against some language specification (either the C11 or C++14 standards, or the particular GCC dialect e.g. -std=gnu11 documented and implemented by your GCC compiler) and trust your compiler to be faithful w.r.t. that specification.
Inlining is like copy-paste. There aren't so many gotchas that will prevent it from working, but it should be used judiciously. If it gets out of control, the program will become bloated.
Most compilers use a heuristic based on the "size" of the function. Since this is usually before any code generation pass, the number of AST nodes may be used as a proxy for size. A function that includes inlined calls needs to include them in its own size, or inlining can go totally out of control. However, AST nodes that will not generate instructions should not prevent inlining. It can be difficult to tell what will generate a "move" instruction and what will generate nothing.
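To make the "include inlined calls in its own size" point concrete, here is a deliberately naive toy model. It is not how GCC, clang or any real compiler decides; the ToyFunction type, the threshold rule and the cost of 1 node for a non-inlined call are all assumptions made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy description of a function: its own AST node count plus its call sites.
struct ToyFunction {
    std::string name;
    int own_nodes;                     // nodes in the body, excluding callees
    std::vector<std::size_t> callees;  // indices of earlier table entries
};

// Bottom-up pass: a callee is "inlined" when its effective size is within
// the threshold, and then its whole effective size is charged to the caller.
// A call that is not inlined is charged as a single node.
// Assumes callees always appear before their callers (no recursion).
std::vector<int> effective_sizes(const std::vector<ToyFunction>& fns, int threshold) {
    std::vector<int> size(fns.size(), 0);
    for (std::size_t i = 0; i < fns.size(); ++i) {
        int total = fns[i].own_nodes;
        for (std::size_t c : fns[i].callees)
            total += (size[c] <= threshold) ? size[c] : 1;
        size[i] = total;
    }
    return size;
}

// Example table: top calls mid and leaf, mid calls leaf twice.
std::vector<ToyFunction> demo_table() {
    return {
        {"leaf", 3, {}},
        {"mid", 4, {0, 0}},
        {"top", 5, {1, 0}},
    };
}

inline void demo_size_check() {
    // Threshold 10: mid grows to 4+3+3 = 10 and is still inlined into top.
    assert(effective_sizes(demo_table(), 10) == (std::vector<int>{3, 10, 18}));
    // Threshold 8: mid (10) is now too big, so top only pays 1 for the call.
    assert(effective_sizes(demo_table(), 8) == (std::vector<int>{3, 10, 9}));
}
```

If mid were measured by its own 4 nodes only, it would always look cheap, which is exactly the runaway growth the paragraph above warns about.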
Since modern C++ tends to involve lots of functions that perform conceptual rearrangement with no underlying instructions, the difficulty is telling the difference between no instructions, "just a few" moves, and enough move instructions to cause a problem. The only way to tell for a particular instance is to run the program in a debugger and/or read the disassembly.
Mostly in typical C++ code, we just assume that the inliner is working hard enough. For performance-critical situations, you can't just eyeball it or assume that anything is working optimally. Detailed performance analysis at the disassembly level is essential.

Does inline assembly mess with portability?

Suppose you've written a portable C++ code which runs smoothly on different platforms. To make some modifications to optimize performance, you use inline assembly inside your code. Is it a good practice (compiler optimization set aside) or will it make troubles for portability?
Obviously it breaks portability - the code will only work on the specific architecture the assembly language is for. Also, it's normally a waste of time - the compiler's optimiser is almost certainly better at writing assembler code than you are.
Obviously the inline assembly isn't even close to portable. To maintain any portability at all, you generally have to use an #ifdef (or something on that order) to determine when to use it at all.
My own preference is to segregate the assembly language into a separate file, and in the makefile decide whether to build the portable version or the assembly language version.
It depends.
If you have only x86 assembly, your application won't ever run on ARM or as native x64. To solve this, you can surround it with #ifdef's depending on the architecture. This is the approach cross-platform, highly optimized libraries such as h264 use. In most cases, though, it's not worth it. Just use very specific C and it will behave very similarly to native assembly.
The other obvious choice is to only implement inline assembly on certain architectures, and keep the original (unoptimized) C++ for any other architecture, rather than trying to generate assembly for all architectures. (Suitably #ifdefed, of course.) Then you get the benefit of the optimization on the one architecture, with the basic functionality on all.
However, when we've done this on projects I've worked on in the past, this was the worst part to maintain - some other piece of code would change, and exactly what was being passed into the isolated function(s) would change, and the original C++ and assembly wouldn't match any more, and there was much wailing and gnashing of teeth.
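A minimal sketch of the #ifdef approach, including a cross-check between the two versions, since drifting apart is the maintenance problem described above. The function names and the DEMO_HAVE_X86_ASM macro are made up; the assembly path assumes GCC or clang extended asm on x86.

```cpp
#include <cassert>
#include <cstdint>

// Portable reference implementation, always compiled.
inline std::uint32_t byte_swap32_portable(std::uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
#define DEMO_HAVE_X86_ASM 1
// x86-only version using the bswap instruction.
inline std::uint32_t byte_swap32(std::uint32_t x) {
    __asm__("bswap %0" : "+r"(x));  // x is both input and output
    return x;
}
#else
#define DEMO_HAVE_X86_ASM 0
// Every other compiler/architecture falls back to the portable code.
inline std::uint32_t byte_swap32(std::uint32_t x) {
    return byte_swap32_portable(x);
}
#endif

// Cross-check that whichever version was selected agrees with the reference.
inline bool byte_swap32_matches_reference() {
    const std::uint32_t samples[] = {0u, 1u, 0x11223344u, 0x80000001u, 0xFFFFFFFFu};
    for (std::uint32_t s : samples)
        if (byte_swap32(s) != byte_swap32_portable(s)) return false;
    return true;
}

inline void demo_bswap_check() {
    assert(byte_swap32(0x11223344u) == 0x44332211u);
    assert(byte_swap32_matches_reference());
}
```

Running a check like byte_swap32_matches_reference in the test suite on every supported platform is one cheap way to notice when the C++ and assembly versions stop matching.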

Runtime optimization of static languages: JIT for C++?

Is anyone using JIT tricks to improve the runtime performance of statically compiled languages such as C++? It seems like hotspot analysis and branch prediction based on observations made during runtime could improve the performance of any code, but maybe there's some fundamental strategic reason why making such observations and implementing changes during runtime are only possible in virtual machines. I distinctly recall overhearing C++ compiler writers mutter "you can do that for programs written in C++ too" while listening to dynamic language enthusiasts talk about collecting statistics and rearranging code, but my web searches for evidence to support this memory have come up dry.
Profile guided optimization is different than runtime optimization. The optimization is still done offline, based on profiling information, but once the binary is shipped there is no ongoing optimization, so if the usage patterns of the profile-guided optimization phase don't accurately reflect real-world usage then the results will be imperfect, and the program also won't adapt to different usage patterns.
You may be interested in looking for information on HP's Dynamo. That system focused on native binary -> native binary translation, but since C++ is almost exclusively compiled to native code, I suppose that's exactly what you are looking for.
You may also want to take a look at LLVM, which is a compiler framework and intermediate representation that supports JIT compilation and runtime optimization, although I'm not sure if there are actually any LLVM-based runtimes that can compile C++ and execute + runtime optimize it yet.
I have done that kind of optimization quite a lot in recent years. It was for a graphics rendering API that I implemented. Since the API defined several thousand different drawing modes, a general-purpose function was way too slow.
I ended up writing my own little Jit-compiler for a domain specific language (very close to asm, but with some high level control structures and local variables thrown in).
The performance improvement I got was between factor 10 and 60 (depended on the complexity of the compiled code), so the extra work paid off big time.
On the PC I would not start to write my own jit-compiler but use either LIBJIT or LLVM for the jit-compilation. It wasn't possible in my case due to the fact that I was working on a non mainstream embedded processor that is not supported by LIBJIT/LLVM, so I had to invent my own.
The answer is more likely: no one did more than PGO for C++ because the benefits are likely unnoticeable.
Let me elaborate: JIT engines/runtimes have both benefits and drawbacks from their developers' point of view: they have more information at runtime but much less time to analyze it.
Some optimizations are really expensive, and you are unlikely to see them applied without a huge impact on startup time. These include loop unrolling, auto-vectorization (which in most cases is also based on loop unrolling), and instruction selection (to use SSE4.1 on CPUs that support SSE4.1) combined with instruction scheduling and reordering (to make better use of superscalar CPUs). These kinds of optimizations combine well with C-like code (which is accessible from C++).
The only full-blown compiler architecture that does advanced compilation (as far as I know) is Java HotSpot, along with architectures built on similar principles that use tiered compilation (Azul's Java systems, the currently popular JaegerMonkey JS engine).
But one of the biggest optimizations at runtime is the following:
Polymorphic inline caching. If the first run of a loop sees certain types, then on the next run the loop code is specialized for the types seen previously: the JIT inserts a guard and makes the inlined types the default branch. Starting from this specialized form, an SSA-based engine applies constant folding/propagation, inlining and dead-code elimination, and, depending on how "advanced" the JIT is, performs a more or less refined CPU register assignment.
As you may notice, a JIT (HotSpot-style) mostly improves branchy code, and with runtime information it can do better than C++ code; but a static compiler, which has time on its side for analysis and instruction reordering, will likely get slightly better performance on simple loops. Also, the parts of C++ code that need to be fast typically tend not to be OOP, so the information available to JIT optimizations would not bring such an amazing improvement.
Another advantage of JITs is that a JIT works across assemblies, so it has more information if it wants to do inlining.
Let me elaborate: let's say that you have a base class A and just one implementation of it, namely B, which lives in another package/assembly/gem/etc. and is loaded dynamically.
When the JIT sees that B is the only implementation of A, it can replace the calls through A with B's code everywhere in its internal representation, so the method calls no longer go through a dispatch (a vtable lookup) but become direct calls. Those direct calls may also be inlined. For example, if B has a method getLength() which returns 2, all calls of getLength() may be reduced to the constant 2 everywhere. In the end, C++ code will not be able to skip the virtual call to B from another DLL.
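To illustrate what such a guarded, specialized call site looks like, here is a hand-written C++ imitation. It is only a sketch: a real JIT generates this at runtime and typically compares a class or vtable pointer, whereas typeid is used here as a stand-in. The class C and the function length_with_guard are made up for the example.

```cpp
#include <cassert>
#include <typeinfo>

struct A {
    virtual ~A() {}
    virtual int getLength() const = 0;
};

struct B : A {
    int getLength() const override { return 2; }
};

// Hypothetical second implementation that shows up later.
struct C : A {
    int getLength() const override { return 7; }
};

// Roughly what a JIT could emit after observing only B at this call site.
int length_with_guard(const A& a) {
    if (typeid(a) == typeid(B)) {
        // Guard passed: the qualified call is resolved statically, so the
        // compiler is free to inline it down to the constant 2.
        return static_cast<const B&>(a).B::getLength();
    }
    // Guard failed: fall back to ordinary virtual dispatch.
    return a.getLength();
}

inline void demo_guard_check() {
    B b;
    C c;
    assert(length_with_guard(b) == 2);
    assert(length_with_guard(c) == 7);
}
```

The guard is what keeps the specialization safe when another implementation is loaded later; a static C++ compiler usually cannot assume B is the only one unless it is told so, for example with final.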
Some implementations of C++ do not support optimizing across multiple .cpp files (although today the -flto flag in recent versions of GCC makes this possible). But if you are a C++ developer concerned about speed, you will likely put all the performance-sensitive classes in the same static library or even in the same file, so the compiler can inline them nicely. The extra information that a JIT has by design is then provided by the developer, so there is no performance loss.
Visual Studio has an option for doing runtime profiling, which can then be used for optimization of the code.
"Profile Guided Optimization"
Microsoft Visual Studio calls this "profile guided optimization"; you can learn more about it at MSDN. Basically, you run the program a bunch of times with a profiler attached to record its hotspots and other performance characteristics, and then you can feed the profiler's output into the compiler to get appropriate optimizations.
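The same record-then-recompile cycle exists in other toolchains. As a sketch, here is what it could look like with GCC instead of Visual Studio, using its -fprofile-generate and -fprofile-use options; the file name pgo_demo.cpp, the functions and the assumption about which branch is hot are all invented for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Branchy function: which path is hot depends entirely on the input data,
// which is the kind of fact a recorded profile can tell the compiler.
int classify(int v) {
    if (v < 0) return -1;    // assumed rare in the training run
    if (v > 1000) return 1;  // assumed rare in the training run
    return 0;                // assumed to be the common case
}

// Counts the values that fall outside the range [0, 1000].
int count_outliers(const int* values, std::size_t n) {
    int outliers = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (classify(values[i]) != 0) ++outliers;
    return outliers;
}

inline void demo_pgo_check() {
    const int values[] = {-5, 3, 2000, 10};
    assert(count_outliers(values, 4) == 2);
}

// Workflow sketch (a main() that feeds representative data is assumed):
//   g++ -O2 -fprofile-generate pgo_demo.cpp -o pgo_demo   # instrumented build
//   ./pgo_demo                                            # run typical workloads
//   g++ -O2 -fprofile-use pgo_demo.cpp -o pgo_demo        # rebuild with the profile
```

The result is only as good as the training runs: if they do not resemble real usage, the compiler optimizes for the wrong paths.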
I believe LLVM attempts to do some of this. It attempts to optimize across the whole lifetime of the program (compile-time, link-time, and run-time).
Reasonable question - but with a doubtful premise.
As in Nils' answer, sometimes "optimization" means "low-level optimization", which is a nice subject in its own right.
However, it is based on the concept of a "hot-spot", which has nowhere near the relevance it is commonly given.
Definition: a hot-spot is a small region of code where a process's program counter spends a large percentage of its time.
If there is a hot-spot, such as a tight inner loop occupying a lot of time, it is worth trying to optimize at the low level, if it is in code that you control (i.e. not in a third-party library).
Now suppose that inner loop contains a call to a function, any function. Now the program counter is not likely to be found there, because it is more likely to be in the function. So while the code may be wasteful, it is no longer a hot-spot.
There are many common ways to make software slow, of which hot-spots are one. However, in my experience, that is the only one of which most programmers are aware, and the only one to which low-level optimization applies.
See this.

Register allocation rules in code generated by major C/C++ compilers

I remember some rules from some time ago (pre-32-bit Intel processors), when it was quite common (at least for me) to have to analyze the assembly output generated by C/C++ compilers (in my case, Borland/Turbo at that time) to find performance bottlenecks, and to safely mix assembly routines with C/C++ code. Things like using the SI register for the this pointer, AX being used for return values, which registers should be preserved when an assembly routine returns, etc.
Now I was wondering if there's some reference for the more popular C/C++ compilers (Visual C++, GCC, Intel...) and processors (Intel, ARM, ...), and if not, where to find the pieces to create one. Ideas?
You are asking about "application binary interface" (ABI) and calling conventions. These are typically set by operating systems and libraries, and enforced by compilers and linkers. Google for "ABI" or "calling convention." Some starting points from Wikipedia and Debian for ARM.
Agner Fog's "Calling Conventions" document summarizes, amongst other things, the Windows and Linux 64 and 32-bit ABIs: http://www.agner.org/optimize/calling_conventions.pdf. See Table 4 on p.10 for a summary of register usage.
One warning from personal experience: don't embed assumptions about the ABI in inline assembly. If you write a function in inline assembly that assumes return and/or parameter transfer in particular registers (e.g. eax, rdi, rsi), it will break if/when the function is inlined by the compiler.
The Open Watcom C/C++ compiler supports two calling conventions: register-based (the default) and stack-based (very close to what other compilers use). The User's Guide for this compiler describes both and is available for free online, together with the compiler itself. You may find these topics in the User's Guide especially helpful:
10.4.1 Passing Arguments Using Register-Based Calling Conventions
10.4.6 Using Stack-Based Calling Conventions
10.5 Calling Conventions for 80x87-based Applications
Well, today, if optimisation is turned on, there aren't any. But GCC allows you to declare that your assembly instruction should use a particular variable regardless of whether it's in a register or not, or even to force GCC to put that variable into a register usable with your instruction. You can also declare which registers your inline assembly block reserves for itself (so the compiler generates appropriate save/restore code around your inline piece, if needed).
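A minimal sketch of that GCC feature (extended asm with operand constraints and a clobber list), assuming GCC or clang on x86 with the default AT&T syntax. The function add_ints is made up, and other targets simply fall back to plain C++. Because the operands are declared instead of assumed to sit in particular registers, the function keeps working when the compiler inlines it.

```cpp
#include <cassert>

inline int add_ints(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __asm__("addl %[src], %[dst]"
            : [dst] "+r"(a)  // read-write operand; the compiler picks the register
            : [src] "r"(b)   // input operand in any general-purpose register
            : "cc");         // the instruction modifies the flags
    return a;
#else
    return a + b;            // portable fallback for other compilers/targets
#endif
}

inline void demo_add_check() {
    assert(add_ints(2, 3) == 5);
    assert(add_ints(-4, 1) == -3);
}
```

Any fixed register the block overwrites (for example "eax") would be listed in the same place as "cc", so that the compiler saves and restores it when necessary.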
I believe, but am by no means sure, that GCC uses the Itanium ABI for most of its function; the incompatibilities between it and the ABI it uses are documented.