Why not mark everything inline? - c++

First off, I am not looking for a way to force the compiler to inline the implementation of every function.
To reduce the number of misguided answers, make sure you understand what the inline keyword actually means. Here is a good description: inline vs static vs extern.
So my question: why not mark every function definition inline? I.e., ideally, the only compilation unit would be main.cpp, or possibly a few more for the functions that cannot be defined in a header file (pimpl idiom, etc.).
The theory behind this odd request is it would give the optimizer maximum information to work with. It could inline function implementations of course, but it could also do "cross-module" optimization as there is only one module. Are there other advantages?
Has anyone tried this with a real application? Did the performance increase? Decrease?
What are the disadvantages of marking all function definitions inline?
Compilation might be slower and will consume much more memory.
Iterative builds are broken, the entire application will need to be rebuilt after every change.
Link times might be astronomical.
All of these disadvantages only affect the developer. What are the runtime disadvantages?
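For readers who skipped the linked description, here is a minimal sketch of what "marking a definition inline" means in this question. The header name and both functions are hypothetical, made up purely for illustration. The keyword is about allowing the same definition to appear in several translation units; it is not an instruction to inline the call.

```cpp
// Sketch of a header, e.g. a hypothetical "geometry.h".

// `inline` allows this definition to appear in every translation unit that
// includes the header; the linker keeps a single copy. Whether calls to it
// are actually expanded in place is still the optimizer's decision.
inline int area(int width, int height) {
    return width * height;
}

// Without `inline`, the definition below would typically cause a
// duplicate-symbol link error as soon as two .cpp files include the header:
//
//   int perimeter(int width, int height) { return 2 * (width + height); }
```

Any .cpp file that includes such a header can call area(3, 4) directly, and the optimizer sees the function body at every call site.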

Did you really mean #include everything? That would give you only a single module and let the optimizer see the entire program at once.
Actually, Microsoft's Visual C++ does exactly this when you use the /GL (Whole Program Optimization) switch, it doesn't actually compile anything until the linker runs and has access to all code. Other compilers have similar options.

SQLite uses this idea. During development it uses a traditional source structure, but for actual use there is one huge C file (112k lines). They do this for maximum optimization and claim about a 5-10% performance improvement:
http://www.sqlite.org/amalgamation.html

We (and some other game companies) did try it by making one uber-.CPP that #included all the others; it's a known technique. In our case, it didn't seem to affect runtime much, but the compile-time disadvantages you mention turned out to be utterly crippling. With a half-hour compile after every single change, it becomes impossible to iterate effectively. (And this is with the app divvied up into over a dozen different libraries.)
We tried making a different configuration such that we would have multiple .objs while debugging and then have the uber-CPP only in release-opt builds, but then ran into the problem of the compiler simply running out of memory. For a sufficiently large app, the tools simply are not up to compiling a multimillion line cpp file.
We tried LTCG as well, and that provided a small but nice runtime boost, in the rare cases where it didn't simply crash during the link phase.

Interesting question! You are certainly right that all of the listed disadvantages are specific to the developer. I would suggest, however, that a disadvantaged developer is far less likely to produce a quality product. There may be no runtime disadvantages, but imagine how reluctant a developer will be to make small changes if each compile takes hours (or even days) to complete.
I would look at this from a "premature optimization" angle: modular code in multiple files makes life easier for the programmer, so there is an obvious benefit to doing things this way. Only if a specific application turns out to run too slow, and it can be shown that inlining everything makes a measured improvement, would I even consider inconveniencing the developers. Even then, it would be after a majority of the development has been done (so that it can be measured) and would probably only be done for production builds.

This is semi-related, but note that Visual C++ does have the ability to do cross-module optimization, including inline across modules. See http://msdn.microsoft.com/en-us/library/0zza0de8%28VS.80%29.aspx for info.
To add an answer to your original question, I don't think there would be a downside at run time, assuming the optimizer was smart enough (hence why it was added as an optimization option in Visual Studio). Just use a compiler smart enough to do it automatically, without creating all the problems you mention. :)

Little benefit
On a good compiler for a modern platform, inline will affect only a very few functions. It is just a hint to the compiler; modern compilers are fairly good at making this decision themselves, and the overhead of a function call has become rather small (often, the main benefit of inlining is not reducing call overhead but opening up further optimizations).
Compile time
However, since inline also changes semantics, you will have to #include everything into one huge compile unit. This usually increases compile time significantly, which is a killer on large projects.
Code Size
If you move away from current desktop platforms and their high-performance compilers, things change a lot. In this case, the increased code size generated by a less clever compiler will be a problem, so much so that it makes the code significantly slower. On embedded platforms, code size is usually the first restriction.
Still, some projects can and do profit from "inline everything". It gives you the same effect as link time optimization, at least if your compiler doesn't blindly follow the inline.

That's pretty much the philosophy behind Whole Program Optimization and Link Time Code Generation (LTCG) : optimization opportunities are best with global knowledge.
From a practical point of view it's sort of a pain because now every single change you make will require a recompilation of your entire source tree. Generally speaking you need an optimized build less frequently than you need to make arbitrary changes.
I tried this in the Metrowerks era (it's pretty easy to set up with a "Unity" style build) and the compilation never finished. I mention it only to point out that it's a workflow setup that's likely to tax the toolchain in ways its authors weren't anticipating.

It is done already in some cases. It is very similar to the idea of unity builds, and the advantages and disadvantages are not far from what you describe:
more potential for the compiler to optimize
link time basically goes away (if everything is in a single translation unit, there is nothing to link, really)
compile time goes, well, one way or the other. Incremental builds become impossible, as you mentioned. On the other hand, a complete build is going to be faster than it would be otherwise (as every line of code is compiled exactly once. In a regular build, code in headers ends up being compiled in every translation unit where the header is included)
But in cases where you already have a lot of header-only code (for example if you use a lot of Boost), it might be a very worthwhile optimization, both in terms of build time and executable performance.
As always though, when performance is involved, it depends. It's not a bad idea, but it's not universally applicable either.
As far as build time goes, you have basically two ways to optimize it:
minimize the number of translation units (so your headers are included in fewer places), or
minimize the amount of code in headers (so that the cost of including a header in multiple translation units decreases)
C code typically takes the second option, pretty much to its extreme: almost nothing apart from forward declarations and macros are kept in headers.
C++ often lies around the middle, which is where you get the worst possible total build time (but PCH's and/or incremental builds may shave some time off it again), but going further in the other direction, minimizing the number of translation units can really do wonders for the total build time.

The assumption here is that the compiler cannot optimize across functions. That is a limitation of specific compilers and not a general problem. Using this as a general solution for a specific problem might be bad. The compiler may very well just bloat your program with what could have been reusable functions at the same memory address (getting to use the cache) being compiled elsewhere (and losing performance because of the cache).
Big functions in general cost on optimization, there is a balance between the overhead of local variables and the amount of code in the function. Keeping the number of variables in the function (both passed in, local, and global) to within the number of disposable variables for the platform results in most everything being able to stay in registers and not have to be evicted to ram, also a stack frame is not required (depends on the target) so function calling overhead is noticeably reduced. Hard to do in real world applications all the time, but the alternative a small number of big functions with lots of local variables the code is going to spend a significant amount of time evicting and loading registers with variables to/from ram (depends on the target).
Try LLVM; it can optimize across the entire program, not just function by function. Release 27 had caught up to gcc's optimizer, at least for a test or two (I didn't do exhaustive performance testing), and 28 is out, so I assume it is better. Even with a few files, the number of tuning knob combinations is too large to mess with. I find it best not to optimize at all until you have the whole program in one file, then perform your optimization, giving the optimizer the whole program to work with. That is basically what you are trying to do with inlining, but without the baggage.

Suppose foo() and bar() both call some helper(). If everything is in one compilation unit, the compiler might choose not to inline helper(), in order to reduce total instruction size. This causes foo() to make a non-inlined function call to helper().
The compiler doesn't know that a nanosecond improvement to the running time of foo() adds $100/day to your bottom line in expectation. It doesn't know that a performance improvement or degradation of anything outside of foo() has no impact on your bottom line.
Only you as the programmer know these things (after careful profiling and analysis of course). The decision not to inline bar() is a way of telling the compiler what you know.

The problem with inlining is that you want high performance functions to fit in cache. You might think function call overhead is the big performance hit, but in many architectures a cache miss will blow the couple pushes and pops out of the water. For example, if you have a large (maybe deep) function that needs to be called very rarely from your main high performance path, it could cause your main high performance loop to grow to the point where it doesn't fit in L1 icache. That will slow your code down way, way more than the occasional function call.
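One way to act on this is to keep the rarely executed code out of the hot loop explicitly. The sketch below is only an illustration: the function names and the EXAMPLE_NOINLINE macro are made up for this example, and it relies on the compiler-specific noinline attributes of GCC/Clang and MSVC rather than on anything in standard C++.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper macro wrapping the compiler-specific "do not inline" hint.
#if defined(_MSC_VER)
#define EXAMPLE_NOINLINE __declspec(noinline)
#else
#define EXAMPLE_NOINLINE __attribute__((noinline))
#endif

// Rare path: string formatting and exception setup generate a fair amount of
// code, so it is kept out of line and away from the loop body.
EXAMPLE_NOINLINE void report_negative(std::size_t index, int value) {
    throw std::invalid_argument("negative value " + std::to_string(value) +
                                " at index " + std::to_string(index));
}

// Hot path: the loop body stays small because the rare case is only a call.
long long sum_non_negative(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] < 0) {
            report_negative(i, values[i]);
        }
        total += values[i];
    }
    return total;
}
```

Whether this helps in a given program is something only a profiler can confirm; the point is just that "do not inline" is sometimes the hint worth giving.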

Related

Deciding where to place a function implementation

Let me first state that I know that inline does not mean that the compiler will always inline a function...
In C++ there really are two places for a non-template non-constexpr function implementation to go:
A header, definition should be inline
A source file
There are benefits/negatives to placing the implementation in one or the other:
inline function definition
compiler can inline the function
slower compiler times both due to having to parse definitions and include implementation dependencies.
multiple copies of a function between multiple users on the same site
source file definition
compiler can never inline the function (maybe that's not true with LTO?)
can avoid recompilation if the file hasn't changed
one copy per site
I am in the midst of writing a reusable math library where inlining can offer significant speedups. I only have test code and snippets to work with right now, so profiling isn't an option for helping me decide. Are there any rules - or just rules of thumb - on deciding where to define the function? Are there certain types of functions, like those with exceptions, which are known to always generate large amounts of code that should be relegated to a source file?
If you have no data, keep it simple.
Libraries that suck to develop don't get finished, and those that suck to use don't get used. So split h/cpp by default; that makes build times slower and development faster.
Then get data. Write tests and see if you get significant speedups from inlining. Then go and learn how to profile, realize your speedups were spurious, and write better tests.
How to profile and determine what is spurious and what is microbenchmark noise is between a chapter of a book and a book in length. Read SO questions about performance in C++ and you'll at least learn the 10 most common ways to microbenchmark are not accurate.
For general rules, smallish bits of code in tight loops benefit from inlining, as do cases where external vectorization is plausible, and where false aliasing could block compiler optimizations.
Often you can hoist the benefits of inlining into your library by offering vector operations.
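A minimal sketch of that idea, assuming a hypothetical math library with a scaling operation (the function name is invented for illustration): instead of exposing only a per-element function that every caller invokes in its own loop, the library offers a batch operation whose loop lives inside the library.

```cpp
#include <cstddef>
#include <vector>

// Batch entry point. This could live in a .cpp file: the call overhead is
// paid once per batch, and the loop body is visible to the optimizer that
// builds the library, which leaves it free to vectorize.
void scale_all(std::vector<float>& values, float factor) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        values[i] *= factor;
    }
}
```

Callers then write scale_all(samples, 0.5f) once instead of calling a tiny function per element across the library boundary.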
Generally speaking, if you are statically linking (as opposed to DLL/DSO methods), then the compiler/linker will basically ignore inline and do what's sensible.
The old rule of thumb (which everyone seems to ignore) is that inline should only be used for small functions. The one problem with inlining is that all too often I see people doing some timed test, e.g.
auto startTime = getTime();
for(int i = 0; i < BIG_NUM; ++i)
{
    doThing();
}
auto endTime = getTime();
The immediate conclusion from that test is that inline is good for performance everywhere. But that isn't the case.
Inlining also increases the size of your compiled exe. This has a nasty side effect in that it increases the burden placed on the instruction and uop caches, which can cause a performance loss. So in the case of a large scale app, more often than not you'll find that removing inline from commonly used functions can actually be a performance win.
One of the nastiest problems with inline is that if it's applied to the wrong method, it's very hard to get a profiler to point out a hot spot - It's just a little warmer than needed in multiple points in the codebase.
My rule of thumb - if the code for a method can fit on one line, inline it. If the code doesn't fit on one line, put it in the cpp file until a profiler indicates moving it to the header would be beneficial.
The rule of thumb I work by is simple: No function definitions in headers, and all function definitions in a source file, unless I have a specific reason to do otherwise.
Generally speaking, C++ code (like code in many languages) is easier to maintain if there is a clear separation of interface from implementation. Maintenance effort is (quite often) a cost driver in non-trivial programs, because it translates into developer time and salary costs. In C++, interface is represented by declarations of functions (without definition), type declarations, struct and class definition, etc i.e. the things that are typically placed in a header, if the intent is to use them in more than one source file. Changing the interface (e.g. changing a function's argument types or return type, adding a member to a class, etc) means that everything which depends on that interface must be recompiled. In the long run, it often works out that header files need to change less often than source files - as long as interface is kept separate from implementation. Whenever a header changes, all source files which use that header (i.e. that #include it) must be recompiled. If a header doesn't change, but a function definition changes, then only the source file which contains the changed function definition, needs to be recompiled.
In large projects (say, with hundreds of source and header files) this sort of thing can make the difference between incremental builds taking a few seconds or a few minutes to recompile a few changed source files versus significantly longer to recompile a large number of source files because a header they all depend on has changed.
Then the focus can be on getting the code working correctly - in the sense of producing the same observable output given a set of inputs, meeting its functional requirements, and passing suitable test cases.
Once the code is working correctly enough, attention can turn to other program characteristics, such as performance. If profiling shows that a function is called many times and represents a performance hot-spot, then you can look at options for improving performance. One option that MIGHT be relevant for improving performance of a program that is otherwise correct is to selectively inline functions. But, every time this is done, it amounts to deciding to accept a greater maintenance burden in order to get performance. But it is necessary to have evidence of the need (e.g. by profiling).
Like most rules of thumb, this one has exceptions. For example, templated functions or classes in C++ do generally need to be defined in headers since, more often than not, the compiler needs to see their definition in order to instantiate the template. However, that is not a justification for inlining everything (and it is not a justification for turning every class or function into a template).
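To illustrate the template exception, here is a small sketch with a made-up function name. The definition has to be visible wherever the template is used, because the compiler generates the int or double version at the point of use.

```cpp
// Sketch of a header-only template (hypothetical name). If only a declaration
// were visible, a caller's translation unit could not instantiate
// clamp_to<int> or clamp_to<double> without an explicit instantiation
// provided elsewhere.
template <typename T>
T clamp_to(T value, T low, T high) {
    if (value < low) {
        return low;
    }
    if (high < value) {
        return high;
    }
    return value;
}
```

Function templates like this one do not need the inline keyword; the language already permits their definitions to appear in multiple translation units.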
Without profiling or other evidence, I would rarely bother to inline functions. Inlining is a hint to the compiler, which the compiler may well ignore, so the effort of inlining may not even be worth it. Doing such a thing without evidence may achieve nothing - in which case it is simply premature optimisation.

Increase Program Speed By Avoiding Functions? (C++)

When it comes to procedural programming, functional decomposition is ideal for maintaining complicated code. However, functions are expensive: adding to the call stack, passing parameters, storing return addresses. All of this takes extra time! When speed is crucial, how can I get the best of both worlds? I want a highly decomposed program without any unnecessary overhead introduced by function calls. I'm familiar with the keyword "inline", but that seems to be only a suggestion to the compiler, and if used incorrectly by the programmer it will yield an even slower program. I'm using g++, so will the -O3 flag optimize away my functions that call functions that call functions?
I just wanted to know, if my concerns are valid and if there are any methods to combat this issue.
First, as always when dealing with performance issues, you should try and measure what are your bottlenecks with a profiler. The first thing coming out is usually not function calls and by a large margin. If you did this, then please read on.
Then, you can anticipate a bit which functions you want inlined by using the inline keyword. The compiler is usually smart enough to know what to inline and what not to inline (it can inline functions you forgot, and may not inline some you marked if it thinks it won't help).
If (really) you still want to improve performance on function calls and want to force inlining, some compilers allow you to do so (see this question). Please consider that massive inlining may actually decrease performance: your code will use a lot of memory and you may get more cache misses on the code than before (which is not good).
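As a rough sketch of what forcing inlining looks like: the wrapper macro and the function below are hypothetical names, and the attributes used are the compiler-specific ones from GCC/Clang (always_inline) and MSVC (__forceinline), not standard C++.

```cpp
// Hypothetical wrapper macro around the compiler-specific "force inline" hints.
#if defined(_MSC_VER)
#define EXAMPLE_FORCE_INLINE __forceinline
#else
#define EXAMPLE_FORCE_INLINE inline __attribute__((always_inline))
#endif

// A small function of the kind where forcing the decision may be defensible
// once a profiler has shown the call to matter.
EXAMPLE_FORCE_INLINE float mix(float a, float b, float t) {
    return a + (b - a) * t;
}
```

Even these hints can be refused in some situations (recursion, for example), so it is worth checking the generated code or re-measuring rather than assuming the call is gone.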
If it's a specific piece of code you're worried about you can measure the time yourself. Just run it in a loop a large number of times and get the system time before and after. Use the difference to find the average time of each call.
As always the numbers you get are subjective, since they will vary depending on your system and compiler. You can compare the times you get from different methods to see which is generally faster, such as replacing the function with a macro. My guess is however you won't notice much difference, or at the very least it will be inconsequential.
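A minimal sketch of such a timing loop, using std::chrono::steady_clock from the standard library. The helper name average_nanoseconds is made up for this example, and as other answers point out, a naive loop like this is easily distorted by the optimizer and by cache effects, so treat the numbers as rough.

```cpp
#include <chrono>
#include <cstdint>

// Runs `work` the given number of times and returns the average wall-clock
// time per call in nanoseconds. Hypothetical helper, not a library API.
template <typename Work>
double average_nanoseconds(Work&& work, std::uint64_t iterations) {
    if (iterations == 0) {
        return 0.0;
    }
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = end - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

The work being timed should have an observable side effect (for example, accumulating into a variable that is used afterwards); otherwise an optimizing compiler may remove the loop body entirely and the measurement means nothing.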
If you don't know where the slowdown is follow J.N's advice and use a code profiler and optimise where it's needed. As a rule of thumb always pass large objects to functions by reference or const reference to avoid copy times.
I highly doubt speed is that crucial, but my suggestion would be to use preprocessor macros.
For example
#define max(a,b) ( (a) > (b) ? (a) : (b) )
This would seem obvious to me, but I don't consider myself an expert in C++, so I may have misunderstood the question.
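One caveat with the macro approach is worth showing. The sketch below uses made-up names (MAX_MACRO, max_value) to avoid clashing with the standard library. Even with every argument parenthesized, a macro evaluates its arguments textually, so an argument with a side effect can run twice, while a small function evaluates each argument once and is just as easy for the compiler to inline.

```cpp
// Function-like macro: the arguments are pasted into the expression, so the
// larger argument is evaluated twice.
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

// Small function template: each argument is evaluated exactly once, and
// optimizing compilers will normally inline something this small.
template <typename T>
constexpr T max_value(T a, T b) {
    return a > b ? a : b;
}

// With int i = 5:  MAX_MACRO(i++, 3) yields 6 and leaves i == 7.
// With int j = 5:  max_value(j++, 3) yields 5 and leaves j == 6.
```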

Compile code as a single automatically merged file to allow the compiler better code optimization

Suppose you have a program in C, C++ or any other language that employs the "compile-objects-then-link-them" scheme.
When your program is not small, it is likely to comprise several files, in order to ease code management (and shorten compilation time). Furthermore, after a certain degree of abstraction you likely have a deep call hierarchy. Especially at the lowest level, where tasks are most repetitive and most frequent, you want to impose a general framework.
However, if you fragment your code into different object files and use a very abstract architecture for your code, it might hurt performance (which is bad if you or your supervisor emphasizes performance).
One way to circumvent this might be extensive inlining. This is the approach of template meta-programming: in each translation unit you include all the code of your general, flexible structures, and count on the compiler to counteract performance issues. I want to do something similar without templates, say, because they are too hard to handle or because you use plain C.
You could write all your code into one single file. That would be horrible. What about writing a script which merges all your code into one source file and compiles it? (This requires that your source files are not too wildly written.) Then a compiler could probably apply much more optimization (inlining, dead code elimination, compile-time arithmetic, etc.).
Do you have any experience with or objections against this "trick"?
Pointless with a modern compiler. MSVC, GCC and clang all support link-time code generation (GCC and clang call it 'link-time optimisation'), which allows for exactly this. Plus, combining multiple translation units into one large makes you unable to parallelise the compilation process, and (at least in case of C++) makes RAM usage go through the roof.
in each translation unit you include all the code of your general, flexible structures, and count on the compiler to counteract performance issues.
This is not a feature, and it's not related to performance in any way. It's an annoying limitation of compilers and the include system.
This is a semi-valid technique; IIRC KDE used to use this to speed up compilation back in the day when most people had one CPU core. There are caveats though: if you decide to do something like this, you need to write your code with it in mind.
Some samples of things to watch out for:
Anonymous namespaces - namespace { int x; }; in two source files.
Using-declarations that affect following code. using namespace foo; in a .cpp file can be OK - the appended sources may not agree
The C version of anon namespaces, static globals. static int i; at file scope in several cpp files will cause problems.
#define's in .cpp files - will affect source files that don't expect it
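To make the file-scope name clash concrete, here is a sketch of what a merged file might look like once the sources are written with merging in mind. The two "files" and all names are hypothetical. Had both files originally declared static int counter; at file scope, simply concatenating them would fail with a redefinition error; giving each file its own named namespace for private details is one way to avoid that.

```cpp
// ---- contents that would come from a hypothetical a.cpp ----
namespace a_detail {
namespace {
int counter = 0;  // originally: static int counter = 0;
}
}  // namespace a_detail

int next_a() {
    return ++a_detail::counter;
}

// ---- contents that would come from a hypothetical b.cpp ----
namespace b_detail {
namespace {
int counter = 100;  // originally: static int counter = 100;
}
}  // namespace b_detail

int next_b() {
    return ++b_detail::counter;
}
```

The same discipline applies to the other items in the list: avoid using-directives at file scope, and #undef any macro a source file defines for its own use.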
Modern compilers/linkers are fully able to optimize across translation units (link-time code generation) - I don't think you'll see any noticeable difference using this approach.
It would be better to profile your code for bottlenecks, and apply inlining and other speed hacks only where appropriate. Optimization should be performed with a scalpel, not with a shotgun.
Though it is not suggested, using #include statements for C files is essentially the same as appending the entire contents of the included file in the current one.
This way, if you include all of your files in one "master file", that file will essentially be compiled as if all the source code were appended to it.
SQLite does that with its amalgamation source file; have a look at:
http://www.sqlite.org/amalgamation.html
Do you mind if I share some experience about what makes software slow, especially when the call tree gets bushy? The cost to enter and exit functions is almost totally insignificant except for functions that
do very little computation and (especially) do not call any further functions,
and are actually in use for a significant fraction of the time (i.e. random-time samples of the program counter are actually in the function for 10% or more of the time).
So in-lining helps performance only for a certain kind of function.
However, your supervisor could be right that software with layers of abstraction have performance problems.
It's not because of the cycles spent entering and leaving functions.
It's because of the temptation to write function calls without real awareness of how long they take.
A function is a bit like a credit card. It begs to be used. So it's no mystery that with a credit card you spend more than you would without it.
However, it's worse with functions, because functions call functions call functions, over many layers, and the overspending compounds exponentially.
If you get experience with performance tuning like this, you come to recognize the design approaches that result in performance problems. The one I see over and over is too many layers of abstraction, excess notification, overdesigned data structure, stuff like that.

Profile optimised C++/C code

I have some heavily templated C++ code that I am working with. I can compile and profile with AMD tools and Sleepy in debug mode. However, without optimisation most of the time is spent in the templated code and STL. With optimised compilation, all the profiling tools that I know of produce garbage information. Does anybody know a good way to profile optimised native code?
PS1:
The code that I am writing is also heavily templated. Most of the time spent in the unoptimised code will be optimised away. I am talking about 96-97% of the run time being spent in templated code without optimisation. This is going to corrupt the accuracy of the profiling. And yes, I can change much of the templated code, or at least find which part of the templated code is causing the most trouble, and I can do better in those places.
You should focus on the code you wrote because that is what you can change, time spent in STL is irrelevant, just ignore it and focus on the callers of that code. If too much time is spent in STL you probably can call some other STL primitive instead of the current one.
Profiling unoptimized code is less interesting, but you can still get some information. If the algorithms used in some parts of the code are totally flawed, it will show up even there. But you should be able to get useful information from any good profiling tool on optimized code. What tools do you use exactly, and why do you call their output garbage?
Also it's usually easy enough to instrument your code by hand and find out exactly which parts are efficient and which are not. It's just a matter of calling timer functions (or reading cycle count of processor if possible) at well chosen points. I usually do that from unit tests to have reproducible results, but all depends of the specifics of your program.
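A minimal sketch of that kind of hand instrumentation, using only the standard library. The type names SectionStats and ScopedTimer are invented for this example: a small RAII object reads the clock when a scope is entered and adds the elapsed time to a per-section total when the scope is left.

```cpp
#include <chrono>

// Accumulated measurements for one instrumented section (hypothetical type).
struct SectionStats {
    double total_ms = 0.0;   // total wall-clock time spent in the section
    unsigned long hits = 0;  // number of times the section was entered
};

// Measures the lifetime of the enclosing scope and adds it to `stats`.
class ScopedTimer {
public:
    explicit ScopedTimer(SectionStats& stats)
        : stats_(stats), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto end = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::milli> elapsed = end - start_;
        stats_.total_ms += elapsed.count();
        ++stats_.hits;
    }

    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    SectionStats& stats_;
    std::chrono::steady_clock::time_point start_;
};
```

Placing ScopedTimer timer(parse_stats); at the top of a function or block is enough to instrument it. The timer itself has a cost, so it suits coarse sections better than tiny functions called millions of times.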
Tools or instrumenting code are the easy part of optimization. The hard part is finding ways to get faster code where it's needed.
What do you mean by "garbage information"?
Profiling is only really meaningful on optimized builds, so tools are designed to work with them -- thus if you're getting meaningless results, it's probably due to the profiler not finding the right symbols, or needing to instrument the build.
In the case of Intel VTune, for example, I found I got impossible results from the sampler unless I explicitly told it where to find the PDBs for the executable I was tuning. In the instrumented version, I had to fiddle with the settings until it was reliably putting probes into the function calls.
When @kriss says
You should focus on the code you wrote
because that is what you can change
that's exactly what I was going to say.
I would add that in my opinion it is easier to do performance tuning first on code compiled without optimization, and then later turn on the optimizer, for the same reason. If something you can fix is costing excess time, it will cost proportionally excess time regardless of what the compiler does, and it's easier to find it in code that hasn't been scrambled.
I don't look for such code by measuring time. If the excess time is, say, 20%, then what I do is randomly pause it several times. As soon as I see something that can obviously be improved on 2 or more samples, then I've found it. It's an oddball method, but it doesn't really miss anything. I do measure the overall time before and after to see how much I saved. This can be done multiple times until you can't find anything to fix. (BTW, if you're on Linux, Zoom is a more automated way to do this.)
Then you can turn on the optimizer and see how much it gives you, but when you see what changes you made, you can see there's really no way the compiler could have done it for you.

Should I use a function in a situation where it would be called an extreme number of times?

I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to directly coding it in each time?
Thanks,
-Faken
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required or could benefit from hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70 % of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
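As a sketch of what the refactoring might look like, here is a stand-in for the long equation: a quadratic whose coefficients are bundled in a struct. The names and the formula are hypothetical, since the question does not show its math. A small function like this is a typical inlining candidate, and after inlining the struct usually does not need to exist in memory at all.

```cpp
// Hypothetical bundle for the "lot of variables" the computation needs.
struct Coefficients {
    double a;
    double b;
    double c;
};

// Evaluates a*x^2 + b*x + c using Horner's form. Small enough that an
// optimizing compiler will normally expand it at each call site.
inline double evaluate(const Coefficients& k, double x) {
    return (k.a * x + k.b) * x + k.c;
}
```

Each call site then shrinks to something like evaluate(k, x), and the formula lives in exactly one place.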
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This may/will also depend on the compiler and compiler flags you use. Depending on the compiler, if you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force it to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.