huge executables because of debugging symbols, why? - c++

We have been developing a large financial application at a bank. It started out as 150k lines of really bad code; by a month ago it was down to a little more than half that, but the size of the executable was still huge. I expected that, since we were mostly just making the code more readable: the templated code was still generating plenty of object code, we were just being more efficient with our effort.
The application is broken into about 5 shared objects and a main. One of the bigger shared objects was 40MB and grew to 50MB even while the code shrank.
I wasn't entirely surprised that the code started to grow, because after all we are adding some functionality. But I was surprised that it grew by 20%. Certainly no one came close to writing 20% of the code, so it's hard for me to imagine how it grew that much. That module is kind of hard for me to analyze, but on Friday I got a new data point that sheds some light.
There are perhaps 10 feeds to SOAP servers. The code is autogenerated, badly. Each service had one parser class with exactly the same code, something like:
#include <boost/shared_ptr.hpp>
#include <xercesstuff...>

class ParserService1 {
public:
    void parse() {
        try {
            Service1ContentHandler* p = new Service1ContentHandler( ... );
            parser->setContentHandler(p);
            parser->parse();
        } catch (SAX...) {
            ...
        }
    }
};
These classes were completely unnecessary; a single function works. Each ContentHandler class had been autogenerated with the same 7 or 8 variables, which I was able to share with inheritance.
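For illustration only, here is roughly what the consolidated version might look like; the names (ServiceContentHandlerBase, parseService) are hypothetical, not the project's actual code, and Xerces initialization (XMLPlatformUtils) is assumed to happen elsewhere:

#include <memory>
#include <string>
#include <xercesc/sax/SAXException.hpp>
#include <xercesc/sax2/DefaultHandler.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
#include <xercesc/sax2/XMLReaderFactory.hpp>

using namespace xercesc;

class ServiceContentHandlerBase : public DefaultHandler {
protected:
    // the handful of variables every autogenerated handler duplicated
    std::string currentElement_;
    std::string characterBuffer_;
};

// One free function replaces every ParserServiceN::parse().
void parseService(ServiceContentHandlerBase& handler, const char* xmlFile) {
    std::unique_ptr<SAX2XMLReader> reader(XMLReaderFactory::createXMLReader());
    reader->setContentHandler(&handler);
    try {
        reader->parse(xmlFile);
    } catch (const SAXException&) {
        // error handling elided, as in the original snippet
    }
}

Each per-service handler then derives from the base and adds only what actually differs between services.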
So I was expecting the size of the code to go down when I removed the parser classes from the code. But with only 10 services, I wasn't expecting it to drop from 38MB to 36MB. That's an outrageous number of symbols.
The only thing that I can think of is that each parser was including boost::shared_ptr, some Xerces parser stuff, and that somehow, the compiler and linker are storing all those symbols repeatedly for each file. I'm curious to find out in any case.
So, can anyone suggest how I would go about tracking down why a simple modification like this should have so much impact? I can use nm on a module to look at the symbols inside, but that's going to generate a painful, huge amount of semi-readable stuff.
Also, when a colleague ran her code with my new library, the user time went from 1m55 seconds to 1m25 seconds. The real time is highly variable, because we are waiting on slow SOAP servers (IMHO, SOAP is an incredibly poor replacement for CORBA...) but the CPU time is quite stable. I would have expected a slight boost from reducing the code size that much, but the bottom line is, on a server with massive memory, I was really surprised that the speed was impacted so much, considering I didn't change the architecture of the XML processing itself.
I'm going to take it much further on Tuesday, and hopefully will get more information, but if anyone has some idea of how I could get this much improvement, I'd love to know.
Update:
I verified that, in fact, having debugging symbols in the task does not appear to change the run time at all. I did this by creating a header file that included lots of stuff, including the two that had the effect here: boost shared pointers and some of the Xerces XML parser. There appears to be no runtime performance hit (I checked because there were differences of opinion between two answers). However, I also verified that including header files creates debugging symbols for each instance, even though the stripped binary size is unchanged. So if you include a given file, even if you don't use it at all, there is a fixed number of symbols emitted into that object file that are not folded together at link time even though they are presumably identical.
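A sketch of what such a torture header might look like; the exact include list here is mine, not the original's, and only matters in that it pulls in heavyweight headers:

// includetorture.h - sketch of the test header described above
#ifndef INCLUDETORTURE_H
#define INCLUDETORTURE_H

#include <boost/shared_ptr.hpp>
#include <xercesc/sax2/SAX2XMLReader.hpp>
// ... any other heavyweight headers whose symbol overhead you want to measure

#endif // INCLUDETORTURE_H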
My code looks like:
#include "includetorture.h"
void f1()
{
f2(); // call the function in the next file
}
The size with my particular include files was about 100k per source file. Presumably, if I had included more, it would be higher. The total executable with the includes was ~600k; without them, about 9k. I verified that the growth is linear in the number of files doing the including, but the stripped code is the same size regardless, as it should be.
Clearly I was mistaken in thinking this was the reason for the performance gain. I think I have accounted for that now. Even though I didn't remove much code, I did streamline a lot of big XML string processing and reduced the path through the code considerably, and that is presumably the reason.

You can use the readelf utility on Linux, or dumpbin on Windows, to find the exact amount of space used by various kinds of data in the exe file. Though I don't see why the executable size is worrying you: debugging symbols use ABSOLUTELY NO memory at run-time!

It seems you are using a lot of C++ classes with inline methods. If these classes have high visibility, this inline code will bloat the whole application. I bet your link times have increased as well. Try reducing the number of inline methods and moving the code to the .cpp files. This will reduce the size of your object files and the exe file, and reduce link times.
The trade-off in this case is, of course, reduced size of compilation units versus execution time.
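For example (a hypothetical Quote class, not from the poster's code), moving a method body out of the header means it is compiled, and its debug information emitted, only once:

// quote.h: only the declaration is visible to every includer
class Quote {
public:
    double midPrice() const;          // body no longer lives in the header
private:
    double bid_ = 0.0;
    double ask_ = 0.0;
};

// quote.cpp: the definition (and its debug info) is compiled exactly once
double Quote::midPrice() const { return (bid_ + ask_) / 2.0; }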

I don't have the exact answer you are expecting, but let me share my experience.
It is pretty common for the size of executable files to differ enormously with and without debugging symbols. I cannot explain why in detail, but just think of all the crazy things that modern debuggers let you do with your code; all of that is thanks to debugging symbols.
The difference in size is so big that if you are, say, loading some shared libraries dynamically, then the sheer loading time of the file could explain the difference in performance you found.
Indeed, this is a pretty "internal" aspect of compilers, and just to give you an example, years back I was quite unhappy with the huge executable files that GCC 4 produced in comparison to GCC 3; then I simply got used to it (and my HD grew in size, too).
All in all, I would not mind, because you are supposed to use builds with debugging symbols only during development, where it should not be an issue. In deployment, no debugging symbols should be there, and you will see how much the files shrink.

Related

OOP - Do I over complicate things?

I was looking at some of my projects and comparing them to things I've seen on github and I feel like I over-think things. I like OOP but I feel like I make too many files, too many classes.
For example, on a small project I had of a game of checkers, I had so many files that could maybe all go into one file/class. How do I know when I have over-thought my solutions? Here is what some of my files look like:
|src
| |- player.cpp
| |- piece.cpp
| |- color.cpp
| |- ...
And of course, there are many more files that will deal with things like rules, setting up the game, the GUI, etc. But in this short example you can see how my projects can and will get very large. Is it common to write things in this way? Or should I simply write a player.cpp file that contains multiple classes that, in this case, are related and would handle piece/color/king information, etc.?
Yes, distributing your code to multiple files is a good practice, since it makes your project maintainable.
I can see your concerns on a small project (is the overhead worth it?), but in really big projects, if you don't do it that way, you will end up with people scrolling forever in a large file and searching through the file to find what they are looking for.
Try to keep your files compact, and one class per file, where every class is robust and its goal is clear.
Sometimes we write free functions rather than classes. It would not be wise to have a file for every small, inline function; that would increase the number of files without reason. It is better to have a family of related functions inside one file (functions related to printing, for example).
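A hypothetical sketch of that grouping (the names are illustrative, not from your project): one header collects the whole family of printing helpers instead of one file per function.

// print_utils.h - one file for the family of printing helpers
#ifndef PRINT_UTILS_H
#define PRINT_UTILS_H

#include <iosfwd>

class Board;    // forward declarations keep this header lightweight
class Player;

void printBoard(std::ostream& out, const Board& board);
void printPlayer(std::ostream& out, const Player& player);
void printScore(std::ostream& out, const Player& p1, const Player& p2);

#endif // PRINT_UTILS_H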
In the end, the ideal balance between the size and number of files is probably a matter of opinion, but I hope I made myself clear.
You are actually asking two distinct questions: "what is a good granularity for separating functionality into classes" and "what is good practice for organizing the project file structure". Both are rather broad.
The answer to the first one would probably be to follow the single responsibility principle. The answer to the second one would be to make the folder structure resemble the namespace structure (as in Boost, for example). Your current approach of storing everything in the src folder is not good for C++, because it leads to longer file names to prevent name collisions when classes with the same name appear in different namespaces. Larger projects indeed tend to have too many files, as one class can require 4-5 files. And that leads to yet another question, of selecting the appropriate granularity for projects...
People tend to worry a lot about "too many classes" or "too many files", but that's largely a historical carryover. 40 years ago when we wrote programs on punch cards, and had to carry large trays and boxes of them (and not drop them!), this certainly would have been a concern. 35 years ago when the biggest hard drive you could get for a PC was 33MB, this was a concern. Today, when you wouldn't consider buying a PC with less than 512GB of SSD, and have access to terabytes and petabytes of online storage, the number of files and number of bytes taken up by the programs are essentially immaterial to the development process.
What that means to us humans is that we should use this abundance of capacity to improve other aspects of our code. If more files helps you understand the code better, use more files. If longer file names help you understand the code better, use longer file names. If following a rule like "one .cpp and one .h file per class" helps people maintain the codebase, then follow the rule. The key is to focus on truly important issues, such as "what makes this code more maintainable, more readable, more understandable to me and my team?"
Another way to approach this is to ask whether the "number of files" is a useful metric for determining if code is maintainable. While a number that is obviously too low for the app would be concerning, I wouldn't be able to tell you if 10 or 100 or 1000 was an appropriate number (at least without knowing the number of classes they contain). Therefore, it doesn't appear to be a useful metric.
Does that mean a program should have 1000 files all piled into a single folder, all compiling and linking into a single library or executable file? It depends, but it seems that 1000 classes in the same namespace would be a bit crowded and the resultant program might be too monolithic. At some point you'll probably want to refactor the architecture into smaller, more cohesive packages, each with an appropriate area of responsibility. Of course, nobody can tell you what that magic number is, as it's completely application dependent. But it's not the number of files that drives a decision like this, it's that the files should be related to each other logically or architecturally.
Each class should be designed and programmed to accomplish one, and only one, thing
Because each class is designed to have only a single responsibility, many classes are used to build an entire application

What are the drawbacks of single source project structures?

I'm new at my current company and working on a project written by my direct team lead. The company usually doesn't work with C++, but there is production code written by my coworker in C/C++. It's just us who know how to code in C++ (me and my lead, so no third opinion can be involved).
After I got enough insight into the project, I realized the whole structure is... special.
It actually consists of a single compilation unit, where the makefile lists main.hpp as the only source.
This header file then includes all the source files the project consists of, so it looks like a really big list of this:
#include "foo.cpp"
#include "bar.cpp"
While trying to understand the logic behind it, I realized that it does indeed work for this project, as it's just an interface where each unit can operate without accessing any other unit. At some point I asked him what his reasons were for doing it this way.
I got a defensive reaction with the argument
Well, it is working, isn't it? You are free to do it your way if you think that's better for you.
And that's what I'm doing now, simply because I'm really having trouble with thinking into this structure. So for now I'm applying the "usual" structure to the implementation I'm writing right now, while doing only mandatory changes to the whole project, to demonstrate how I would have designed it.
I think there are a lot of drawbacks, starting with the way it mixes the compiler's and linker's jobs through the project structure, which can't serve us well, up to optimizations that will probably end in redundant or obscure results, not to mention that a clean build of the project takes ~30 minutes, which I think might be caused by the structure as well. But I'm lacking the knowledge to name real, and not just hypothetical, issues with this.
And as his argument "It works my way, doesn't it?" holds true, I would like to be able to explain to him why it's a bad idea anyway, rather than coming across as the new nit-picky guy.
So what problems could actually be caused by such a project structure?
Or am I overreacting and such a structure is totally fine?
not to mention that a (clean) build of the project takes ~30 minutes
The main drawback is that a change to any part of the code requires the entire program to be recompiled from scratch. If the compilation takes a minute, this would probably not be significant, but if it takes 30 minutes, it's going to be painful; it destroys the make a change -> compile -> test workflow.
not to mention that a clean build of the project takes ~30 minutes
Having separate translation units is actually typically quite a bit slower to compile from scratch, but you only need to recompile each unit when it changes, which is the main advantage. Of course, it is easy to destroy this advantage by mistake, by including a massive, often-changing header in all translation units. Separate translation units take a bit of care to get right.
These days, with multi-core CPUs, the slower build from scratch is mitigated by the parallelism that multiple translation units allow (the disadvantage may even be overcome if the size of the individual translation units happens to hit a sweet spot and there are enough cores; you'll need some thorough research to find out).
Another potential drawback is that the entire compilation process must fit in memory. This is only a problem when that exceeds the free memory on your developers' workstations.
In conclusion: the problem is that the one-massive-source-file approach does not scale well to big projects.
Now, a word about the advantages, for fairness:
up to optimizations will probably end in redundant or obscure results
Actually, a single translation unit is easier to optimize than separate ones. This is because some optimizations, inline expansion in particular, are not possible across translation units: they depend on definitions that are not in the translation unit currently being compiled.
This optimization advantage has been mitigated since link-time optimization became available in stable releases of popular compilers, as long as you're able and willing to use a modern compiler and to enable link-time optimization (which might not be enabled by default).
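A tiny illustration of why (hypothetical files): when the compiler builds caller.cpp, it only sees the declaration of price(), so without LTO it has nothing to inline.

// pricing.h
double price(double spot, double rate);

// pricing.cpp
double price(double spot, double rate) { return spot * (1.0 + rate); }

// caller.cpp
#include "pricing.h"
double quote(double spot) {
    return price(spot, 0.02);   // an out-of-line call unless LTO kicks in
}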
PS. It's very unconventional to name the single source file with the extension .hpp.
The first thing I would like to mention are the advantages of a project with a Single Compilation Unit (SCU):
Drastic compilation time reduction. This is actually one of the primary reasons to switch to SCU. When you have a regular project with n translation units, compilation time grows roughly linearly with each new translation unit added, while with SCU it grows roughly logarithmically, and adding new units to a large project hardly affects compilation time.
Compilation memory reduction, both disk and RAM. The "big" translation unit will obviously occupy considerably more memory than each individual "small" translation unit containing only part of the project; however, the cumulative size of the small ones will greatly exceed the size of the "big" one.
Some optimization benefits. Obviously you get "everything is inline" automatically.
No more fear of "compilation from scratch". This is very important because it is what CI server performs.
Now to disadvantages:
Developers must maintain strict header organization discipline: header guards, consistent ordering of #include directives, mandatory inclusion of the headers directly required by the current file, proper forward declarations, consistent naming conventions, etc. (see the sketch after this list). The problem is that there are no tools to help developers with this, and even minor violations of the header organization may lead to messy failed-build logs.
Potential increase in the total number of files in the project. See this answer for details.
No more "it's compiling" excuses for wooden sword fencing.
P.S.
The SCU organization in your case is kind of "soft-boiled". By that I mean the project still has translation units that are not proper headers. Typically such a scenario happens when an old project is being converted to SCU.
If building your project with SCU takes ~30 minutes, I have a feeling that either it is not the fault of the project organization (it could be antivirus, no SSD, recursive template bloat) or it would take several hours without SCU.
Some numbers: compilation time dropped from ~14 minutes to ~20 seconds, with a 3x executable size reduction (the result of converting an existing project to SCU, from my experience).
Real-world use cases: CppCon 2014: Nicolas Fleury, "C++ in Huge AAA Games"; Chromium Jumbo / Unity builds ("it can save hours for a full build").
I might be exaggerating a bit, but it seems to me that the entire concept of "multiple translation units" (and static libraries as well) should be left in the past.

Performance with non executed code

Maybe my question is stupid, but I didn't find any answer and I really want to know. When a program has functions that are never called (for example, they are only prepared for future implementation), I think the compiler still reads those lines (at minimum the function declarations). That would be no problem, but what about performance in bigger projects? Is there anything we should avoid (for example certain allocations or include files) that has a bigger impact?
For example:
// never called/used
class abc {
    ...
};

// never called/used
float function_A(float x, int y) {
    ...
}

int main() {
    ...
}
This is just a short example, but I think everyone knows what I mean.
Thank you very much!
Current compiler implementations will not generate code for some kinds of functions, as you can read here. Unused code is typically not a performance hit, especially if you only declare and do not define the functions. Only functions with a lot of instructions can be a performance hit; for that, I recommend reading about instruction caching.
In bigger libraries you should care about include files. If you use and (more importantly) include them intelligently, you can gain compile-time performance. That is, use forward declarations in header files, and include the full headers only in the .cpp files.
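A hypothetical illustration of that advice: car.h compiles without ever seeing engine.h, so editing engine.h does not force everything that includes car.h to rebuild.

// car.h
#ifndef CAR_H
#define CAR_H

class Engine;               // forward declaration instead of #include "engine.h"

class Car {
public:
    explicit Car(Engine& engine);
    void start();
private:
    Engine* engine_;        // a pointer only needs the forward declaration
};

#endif // CAR_H

// car.cpp would #include both "car.h" and "engine.h"; the full definition is only needed there.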
Another thing is, if you split your code across a few source files, the linker can skip whole .o files (which the compiler creates during compilation) at link time if they are not used.
Hope this helped you a bit
If you mean application performance, leaving in unused code will have no impact. The compiler does dead code elimination. But having to go through more code, the compiler will slow down slightly, so you will have to wait a bit longer for program compilation. Not including unused header files is a good idea, as one header file can pull in dozens or hundreds of others. (But precompiled headers can also help with that.)
Instruction caching may still be an issue if unremoved dead code reduces locality of the program as a whole.
Imagine two functions A and B, where B is called from A repeatedly. If A and B fit into the same cache line, calling B from A is unlikely to produce a cache miss. But if a third function is placed in between the two functions by the linker so that A and B are not on the same cache line anymore, cache misses when calling B are becoming more likely, reducing overall execution speed.
While the effect may not be measurable very well and depend on a lot of other factors, reducing dead code is generally a good idea.
If the compiler can detect it as dead code, it will remove it completely and probably print a warning. If not, it will increase the object code size. With static linkage, the linker will remove unused functions.

Compile code as a single, automatically merged file to allow the compiler better code optimization

Suppose you have a program in C, C++, or any other language that employs the compile-objects-then-link-them scheme.
When your program is not small, it is likely to comprise several files, in order to ease code management (and shorten compilation time). Furthermore, after a certain degree of abstraction you likely have a deep call hierarchy. Especially at the lowest level, where tasks are the most repetitive and most frequent, you want to impose a general framework.
However, if you fragment your code into different object files and use a very abstract architecture for your code, it might hurt performance (which is bad if you or your supervisor emphasizes performance).
One way to circumvent this might be extensive inlining - this is the approach of template meta-programming: in each translation unit you include all the code of your general, flexible structures, and count on the compiler to counteract performance issues. I want to do something similar without templates - say, because they are too hard to handle, or because you use plain C.
You could write all your code in one single file. That would be horrible. What about writing a script which merges all your code into one source file and compiles it? That requires that your source files are not written too wildly. Then the compiler could probably apply much more optimization (inlining, dead code elimination, compile-time arithmetic, etc.).
Do you Have any experience with or objections against this "trick"?
Pointless with a modern compiler. MSVC, GCC and Clang all support link-time code generation (GCC and Clang call it "link-time optimisation"), which allows for exactly this. Plus, combining multiple translation units into one large one makes it impossible to parallelise the compilation process, and (at least in the case of C++) makes RAM usage go through the roof.
in each translation unit you include all the code of your general, flexible structures, and count on the compiler to counteract performance issues.
This is not a feature, and it's not related to performance in any way. It's an annoying limitation of compilers and the include system.
This is a semi-valid technique; IIRC, KDE used to use it to speed up compilation back in the day when most people had one CPU core. There are caveats though: if you decide to do something like this, you need to write your code with it in mind.
Some samples of things to watch out for:
Anonymous namespaces - namespace { int x; }; in two source files.
Using-directives that affect the following code: using namespace foo; in a .cpp file can be OK on its own, but the appended sources may not agree.
The C version of anonymous namespaces, static globals: static int i; at file scope in several .cpp files will cause problems (see the sketch after this list).
#define's in .cpp files - will affect source files that don't expect it
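For instance, a hypothetical sketch of the static-global/anonymous-namespace clash: each file below is fine on its own, but once both are #included into the same uber-.cpp they share one translation unit and the build fails with redefinition errors.

// a.cpp
static int counter = 0;                 // private to a.cpp when compiled separately
namespace { bool verbose = false; }

// b.cpp
static int counter = 0;                 // redefinition once merged with a.cpp
namespace { bool verbose = true; }      // same clash: both anonymous namespaces merge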
Modern compilers/linkers are fully able to optimize across translation units (link-time code generation) - I don't think you'll see any noticeable difference using this approach.
It would be better to profile your code for bottlenecks, and apply inlining and other speed hacks only where appropriate. Optimization should be performed with a scalpel, not with a shotgun.
Though it is not recommended in general, using #include statements for C files is essentially the same as appending the entire contents of the included file to the current one.
This way, if you include all of your files in one "master file", that file will essentially be compiled as if all the source code were appended to it.
SQlite does that with its Amalgamation source file, have a look at:
http://www.sqlite.org/amalgamation.html
Do you mind if I share some experience about what makes software slow, especially when the call tree gets bushy? The cost to enter and exit functions is almost totally insignificant except for functions that
do very little computation and (especially) do not call any further functions,
and are actually in use for a significant fraction of the time (i.e. random-time samples of the program counter are actually in the function for 10% or more of the time).
So in-lining helps performance only for a certain kind of function.
However, your supervisor could be right that software with layers of abstraction can have performance problems.
It's not because of the cycles spent entering and leaving functions.
It's because of the temptation to write function calls without real awareness of how long they take.
A function is a bit like a credit card. It begs to be used. So it's no mystery that with a credit card you spend more than you would without it.
However, it's worse with functions, because functions call functions call functions, over many layers, and the overspending compounds exponentially.
If you get experience with performance tuning like this, you come to recognize the design approaches that result in performance problems. The one I see over and over is too many layers of abstraction, excess notification, overdesigned data structure, stuff like that.

Why not mark everything inline?

First off, I am not looking for a way to force the compiler to inline the implementation of every function.
To reduce the level of misguided answers, make sure you understand what the inline keyword actually means. Here is a good description: inline vs static vs extern.
So my question is: why not mark every function definition inline? That is, ideally the only compilation unit would be main.cpp, or possibly a few more for the functions that cannot be defined in a header file (pimpl idiom, etc.).
The theory behind this odd request is it would give the optimizer maximum information to work with. It could inline function implementations of course, but it could also do "cross-module" optimization as there is only one module. Are there other advantages?
Has anyone tried this with a real application? Did the performance increase? Decrease?!?
What are the disadvantages of marking all function definitions inline?
Compilation might be slower and will consume much more memory.
Iterative builds are broken, the entire application will need to be rebuilt after every change.
Link times might be astronomical
All of these disadvantages only affect the developer. What are the runtime disadvantages?
Did you really mean #include everything? That would give you only a single module and let the optimizer see the entire program at once.
Actually, Microsoft's Visual C++ does exactly this when you use the /GL (Whole Program Optimization) switch; it doesn't actually compile anything until the linker runs and has access to all the code. Other compilers have similar options.
SQLite uses this idea. During development it uses a traditional source structure, but for actual use there is one huge C file (112k lines). They do this for maximum optimization, and claim about a 5-10% performance improvement.
http://www.sqlite.org/amalgamation.html
We (and some other game companies) did try it via making one uber-.cpp that #included all the others; it's a known technique. In our case, it didn't seem to affect runtime much, but the compile-time disadvantages you mention turned out to be utterly crippling. With a half-hour compile after every single change, it becomes impossible to iterate effectively. (And this is with the app divvied up into over a dozen different libraries.)
We tried making a different configuration such that we would have multiple .objs while debugging and then have the uber-CPP only in release-opt builds, but then ran into the problem of the compiler simply running out of memory. For a sufficiently large app, the tools simply are not up to compiling a multimillion line cpp file.
We tried LTCG as well, and that provided a small but nice runtime boost, in the rare cases where it didn't simply crash during the link phase.
Interesting question! You are certainly right that all of the listed disadvantages are specific to the developer. I would suggest, however, that a disadvantaged developer is far less likely to produce a quality product. There may be no runtime disadvantages, but imagine how reluctant a developer will be to make small changes if each compile takes hours (or even days) to complete.
I would look at this from a "premature optimization" angle: modular code in multiple files makes life easier for the programmer, so there is an obvious benefit to doing things this way. Only if a specific application turns out to run too slow, and it can be shown that inlining everything makes a measured improvement, would I even consider inconveniencing the developers. Even then, it would be after a majority of the development has been done (so that it can be measured) and would probably only be done for production builds.
This is semi-related, but note that Visual C++ does have the ability to do cross-module optimization, including inline across modules. See http://msdn.microsoft.com/en-us/library/0zza0de8%28VS.80%29.aspx for info.
To add an answer to your original question, I don't think there would be a downside at run time, assuming the optimizer was smart enough (hence why it was added as an optimization option in Visual Studio). Just use a compiler smart enough to do it automatically, without creating all the problems you mention. :)
Little benefit
On a good compiler for a modern platform, inline will affect only a very few functions. It is just a hint to the compiler; modern compilers are fairly good at making this decision themselves, and the overhead of a function call has become rather small (often, the main benefit of inlining is not to reduce call overhead, but to open up further optimizations).
Compile time
However, since inline also changes semantics, you will have to #include everything into one huge compilation unit. This usually increases compile time significantly, which is a killer on large projects.
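To spell out the semantic point (with a hypothetical helper): an inline function must be defined, identically, in every translation unit that uses it, which in practice forces the definition into a header that everyone includes.

// clamp.h
#ifndef CLAMP_H
#define CLAMP_H

inline double clamp(double x, double lo, double hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

#endif // CLAMP_H

// a.cpp and b.cpp may both #include "clamp.h"; the duplicate definitions are
// legal (and merged by the linker) only because the function is declared inline.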
Code Size
If you move away from current desktop platforms and their high-performance compilers, things change a lot. In this case, the increased code size generated by a less clever compiler will be a problem - so much so that it can make the code significantly slower. On embedded platforms, code size is usually the first restriction.
Still, some projects can and do profit from "inline everything". It gives you the same effect as link time optimization, at least if your compiler doesn't blindly follow the inline.
That's pretty much the philosophy behind Whole Program Optimization and Link Time Code Generation (LTCG): optimization opportunities are best with global knowledge.
From a practical point of view it's sort of a pain because now every single change you make will require a recompilation of your entire source tree. Generally speaking you need an optimized build less frequently than you need to make arbitrary changes.
I tried this in the Metrowerks era (it's pretty easy to setup with a "Unity" style build) and the compilation never finished. I mention it only to point out that it's a workflow setup that's likely to tax the toolchain in ways they weren't anticipating.
It is already done in some cases. It is very similar to the idea of unity builds, and the advantages and disadvantages are not far from what you describe:
more potential for the compiler to optimize
link time basically goes away (if everything is in a single translation unit, there is nothing to link, really)
compile time goes, well, one way or the other. Incremental builds become impossible, as you mentioned. On the other hand, a complete build is going to be faster than it would be otherwise (as every line of code is compiled exactly once; in a regular build, code in headers ends up being compiled in every translation unit where the header is included)
But in cases where you already have a lot of header-only code (for example if you use a lot of Boost), it might be a very worthwhile optimization, both in terms of build time and executable performance.
As always though, when performance is involved, it depends. It's not a bad idea, but it's not universally applicable either.
As far as build time goes, you have basically two ways to optimize it:
minimize the number of translation units (so your headers are included in fewer places), or
minimize the amount of code in headers (so that the cost of including a header in multiple translation units decreases)
C code typically takes the second option, pretty much to its extreme: almost nothing apart from forward declarations and macros are kept in headers.
C++ often lies around the middle, which is where you get the worst possible total build time (but PCH's and/or incremental builds may shave some time off it again), but going further in the other direction, minimizing the number of translation units can really do wonders for the total build time.
The assumption here is that the compiler cannot optimize across functions. That is a limitation of specific compilers and not a general problem. Using this as a general solution for a specific problem might be bad. The compiler may very well just bloat your program with what could have been reusable functions at the same memory address (getting to use the cache) being compiled elsewhere (and losing performance because of the cache).
Big functions in general cost in optimization; there is a balance between the overhead of local variables and the amount of code in the function. Keeping the number of variables in the function (passed in, local, and global) within the number of registers available on the platform means most everything can stay in registers and does not have to be evicted to RAM; a stack frame may not be required either (depending on the target), so function-calling overhead is noticeably reduced. That is hard to do in real-world applications all the time, but the alternative, a small number of big functions with lots of local variables, means the code will spend a significant amount of time evicting and loading registers with variables to/from RAM (depending on the target).
Try LLVM; it can optimize across the entire program, not just function by function. Release 2.7 had caught up to GCC's optimizer, at least for a test or two (I didn't do exhaustive performance testing), and 2.8 is out, so I assume it is better. Even with a few files, the number of tuning-knob combinations is too many to mess with. I find it best not to optimize at all until you have the whole program in one file, then perform your optimization, giving the optimizer the whole program to work with - basically what you are trying to do with inlining, but without the baggage.
Suppose foo() and bar() both call some helper(). If everything is in one compilation unit, the compiler might choose not to inline helper(), in order to reduce total instruction size. This causes foo() to make a non-inlined function call to helper().
The compiler doesn't know that a nanosecond improvement to the running time of foo() adds $100/day to your bottom line in expectation. It doesn't know that a performance improvement or degradation of anything outside of foo() has no impact on your bottom line.
Only you as the programmer know these things (after careful profiling and analysis of course). The decision not to inline bar() is a way of telling the compiler what you know.
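A compact sketch of that scenario, with bodies invented purely for illustration:

static double helper(double x) {        // shared work
    return x * x + 1.0;
}

double foo(double x) {                  // hot path: every nanosecond matters
    return helper(x) * 0.5;
}

double bar(double x) {                  // cold path: called rarely
    return helper(x) + 42.0;
}

// With everything in one translation unit, the optimizer weighs both callers and
// may leave helper() out of line. Keeping bar() in a separate unit (i.e. not
// "inlining" it into this one) leaves the compiler free to fold helper() into foo().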
The problem with inlining is that you want high performance functions to fit in cache. You might think function call overhead is the big performance hit, but in many architectures a cache miss will blow the couple pushes and pops out of the water. For example, if you have a large (maybe deep) function that needs to be called very rarely from your main high performance path, it could cause your main high performance loop to grow to the point where it doesn't fit in L1 icache. That will slow your code down way, way more than the occasional function call.