If I want a subroutine to be inlined in the calling routine, where may I keep it? Need it be in the same module or file? Can inlining be done with subroutines from different object files? May the answer be compiler dependent?
This is not controlled by the Fortran standard. The processor can do as it sees fit.
It will definitely depend on the compiler settings.
Commonly, internal functions will be inlined. But many other functions are often inlined as well, at least if they happen to be in the same source file or module.
But even inlining from other source files / compiled object files is not out of the question. That can be, and often is, done during link-time optimization (https://gcc.gnu.org/wiki/LinkTimeOptimization). These optimizations are either included in certain compiler flags (like -fast) or can be enabled separately (-flto, -ipo).
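As a hedged sketch (file names hypothetical; -flto is the GCC/gfortran spelling, -ipo the Intel one), enabling cross-file inlining might look like:

    gfortran -O2 -flto -c mysub.f90              # embeds extra IR in the object file
    gfortran -O2 -flto -c main.f90
    gfortran -O2 -flto mysub.o main.o -o prog    # cross-file inlining can happen here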
Related
I am aware that the keyword inline has useful properties e.g. for keeping template specializations inside a header file.
On the other hand I have often read that inline is almost useless as hint for the compiler to actually inline functions.
Further, the keyword is of limited use inside a cpp file, since the compiler needs to see the body of a function marked with the inline keyword wherever it is called.
Hence I am a little confused about the "automatic" inlining capabilities of modern compilers (namely GCC 4.4.3). When I define a function inside a cpp file, can the compiler inline it anyway if it deems that inlining makes sense for the function, or do I rob it of some optimization capabilities? (Which would be fine for the majority of functions, but important to know for small ones called very often.)
Within a compilation unit the compiler will have no problem inlining functions (even if they are not marked as inline). Across compilation units it is harder, but modern compilers can do it.
Use of the inline keyword has little effect on 'modern' compilers and whether they actually inline functions; the compiler has better heuristics than the human mind (unless you specify flags to force it one way or the other, which is usually a bad idea, as humans are bad at making this decision).
Microsoft Visual C++ has been able to do so at least since Visual Studio 2005. They call it "Whole Program Optimization" or "Link-Time Code Generation". In this implementation, the compiler does not actually produce machine code, but writes an intermediate representation of the C++ code into the object files. The linker then merges all of the code into one huge code unit and performs the actual compilation.
GCC has been able to do this since at least version 4.5, with major improvements coming in GCC 4.7. To my knowledge the feature is still considered somewhat experimental (at least insofar as many Linux distributions do not use it). GCC's implementation works very similarly, by first writing its GIMPLE intermediate representation into the object files, then compiling all of the object files into a single object file which is then passed to the linker (this allows GCC to continue to work with existing linkers).
Many big C++ projects also do what is now being called "unity builds". Instead of passing hundreds of individual C++ source files into the compiler, one source file is created that includes all the other source files in the project. The original intent behind this is to decrease compilation times (since headers etc. do not have to be parsed over and over), but as a side-effect, it will have the same outcome as the LTO/LTCG techniques mentioned above: giving the compiler perfect visibility into all functions in all compilation units.
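As a minimal sketch (file names hypothetical), a unity build file is just an ordinary translation unit that textually includes the project's other source files:

    // unity.cpp -- the only file handed to the compiler.
    // Including the .cpp files (not just headers) puts every function
    // definition into one translation unit, so the optimizer sees them all.
    #include "renderer.cpp"
    #include "physics.cpp"
    #include "audio.cpp"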
I jump between being impressed by my C++ compiler's (MSVC 2010) ingenuity and its stupidity. Some code that did pixel format conversion via templates, which would have resolved into 5-10 assembly instructions when properly inlined, got bloated into kilobytes(!) of nested function calls. At other times, it inlines so aggressively that whole classes disappear even though they contained non-trivial functionality.
This depends on your compilation flags. With -combine and -fwhole-program, gcc will do function inlining across cpp boundaries. I'm not sure how much the linker will do if you compile into multiple object files.
The standard dictates nothing about how a function is inlined. Compilers can inline functions when they have access to their implementation. If you only have a header plus compiled binaries, inlining is impossible. If it's in the same module (i.e. translation unit), the compiler can inline the function even if it is in the cpp file.
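A minimal sketch of the same-file case (hypothetical names), assuming an optimizing build:

    // math.cpp -- square() lives in the same translation unit as its caller,
    // so the compiler can inline it even though it is not marked inline.
    static int square(int x) { return x * x; }

    int sum_of_squares(int a, int b) {
        return square(a) + square(b);   // likely inlined at -O2
    }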
I have many classes written in .h and .cu files, so I tried relocatable device code (-rdc=true). That took about 12 seconds. Then I tried combining the code, using header-only classes and removing -rdc=true; it took only 2 seconds.
What the code does is sha1(some string) 0x40000 times, which is used in WinRAR encryption.
Why is that? It's OK for now, but my project will become larger, and separate compilation would be useful. Is it normal for -rdc=true to slow down performance?
If the code of a function is located in a separate translation unit, i.e. not in a header visible to the translation unit you are calling from, then no inlining can occur. In that case the function call will be more expensive. You might want to relocate your time-critical functions to a header file, with the inline keyword, so that the compiler has the opportunity to inline them.
Separate compilation might also lead to the use of local address space for parameters (see http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#abstracting-abi for parameter passing), which is much more expensive than registers, as this table shows: http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#operand-costs.
Moving some methods from your class implementation file into the header file, with the inline keyword, to avoid linking issues might be a solution.
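A minimal sketch of that pattern (hypothetical names; in CUDA the same idea applies to __device__ functions moved into a header):

    // sha1_utils.h -- the definition lives in the header, marked inline so it
    // can appear in every translation unit without violating the ODR.
    inline unsigned rotl(unsigned v, int n) {   // assumes 0 < n < 32
        return (v << n) | (v >> (32 - n));
    }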
It is possible that separate compilation causes this slowdown. The compiler may not have enough information to apply certain optimizations (all link-time information is missing). Apparently nvcc still does not perform those optimizations at the link stage.
I've been working on a project at work where there are loads and loads of code in the header files. If I were using Visual Studio this wouldn't be an issue, as it has pre-compiled headers etc., but this is Linux GCC code.
Anyway, it's starting to become a bit of an issue with compilation times. Of course the templates are going to have to remain in the headers etc., but most of this code could be extracted into implementation files and linked against as a static library. All of the projects use these headers and get compiled each time, so it makes sense to create a static lib.
Are implementations in the header files inlined, or is this only a hint, like the inline keyword? This code is VERY time critical and I'm concerned about moving the implementations out of the headers. Can I achieve the same thing if I use the inline keyword as opposed to having implementations in header files?
** UPDATE **
I know that inline is only a hint to the compiler. I'm not in control of everything in the project and I just want to move everything out of the headers into a library without affecting performance. Is this actually going to be a try it and see thing? I just want to keep performance exactly the same but enhance compile time.
The inline keyword is only a hint to the compiler that it may wish to inline that function. Its real purpose is to allow you to legally "violate" the one definition rule.
In order to inline a function, its body has to be visible at the point of call, which typically means that if you move the function to an implementation file it may not be inlined anymore.
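For example (hypothetical names), this header can be included by several .cpp files; inline both makes the repeated definition legal and keeps the body visible at every call site:

    // clamp.h
    inline int clamp01(int x) {
        return x < 0 ? 0 : (x > 1 ? 1 : x);
    }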
But keep in mind that most likely large functions in the header will not be inlined anyway. Also consider that in many cases inlined functions may actually be slower than called functions due to a variety of architecture-specific issues.
inline is a hint for optimization, but it is also used to work around ODR.
Consider using whole program optimization / link-time optimization instead. It allows you to have implementation in multiple files and basically everything has the same opportunity to be optimized (and inlined) as if it were in the same translation unit.
Your compile times become much quicker, but link times usually suffer, sometimes quite a bit. You don't need to enable it for debug builds though, so it can bring a pretty immediate improvement to dev time.
"If I were using Visual Studio this wouldn't be an issue, as this has pre-compiled headers etc."
GCC has them too.
http://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html
The inline keyword does NOT mean that the function has to be implemented "in the line" where you define it. As you know, it is a hint to the compiler to try compiling as if the few lines of the function were at the place where you call it, thus avoiding the overhead of saving the return address, a vtable lookup, etc.
Thinking it is called faster just because it is in the header is wishful thinking (on the part of whoever originally wrote the code).
Try it: move the implementation to the cpp file in a minimal example: main, one object, one inline function, one call to it. Implement it once in the header and once in a cpp file, then look at the assembly. With optimization enabled (and, across files, link-time optimization), you will typically see no difference.
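A minimal sketch of that experiment (hypothetical names):

    // widget.h -- variant A: body in the header (implicitly inline)
    struct Widget {
        int value;
        int doubled() const { return value * 2; }
    };

    // Variant B: declare doubled() in widget.h and move the body to
    // widget.cpp, then build both variants with optimizations (e.g. -O2 -S)
    // and compare the assembly generated for main().

    // main.cpp
    int main() {
        Widget w{21};
        return w.doubled();
    }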
I have a big class with lots of utility functions. Those functions are very small and I would like them inlined.
The problem is they are all in a source file and should stay in the source file and not move to the header file (so I don't need to recompile everything every time one changes).
If I mark them as inline I get "symbols not found" linker errors.
Is there a way to make them inline or do I need to blindly trust the link time optimizer?
I need the code to be portable between clang 3 and gcc 4.6, but compiler-based #defines are OK (so an answer that shows how to do it in only one of the compilers is fine too).
[These] functions are very small and I would like them inlined. [But] I don't [want] to recompile everything every time one changes.
You can't have both of these things. If a function is inlined, then you have no choice but to recompile all its callers when it changes. That's how inlining works. Even if you used a link-time optimizer to do it automatically at link time, you would still be paying the compilation-time cost of reprocessing all the callers.
AFAIK neither gcc 4.6 nor clang 3 have link-time optimizers that are up to scratch, by the way.
Editorial aside: No compiler that I know of has heuristics that are good enough to make manual inline annotations unnecessary, yet. Even VS2010, which I mentioned in the comments as an example of a link-time optimizer that is up to scratch, still needs quite a bit of advice about what to inline.
Place the implementation in an .inl file. Have only the necessary .cpp files #include it. This way, changes to the implementation (touching the .inl file) trigger a recompile only of the dependent .cpp files. Changes to the declaration (touching the .h file) trigger a recompile of all files consuming the declaration.
This is a common practice for both inline functions and for template implementations. .inl files should be viewed basically as 'included cpp' files.
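A minimal sketch of the layout (hypothetical names):

    // util.h -- declaration only; every consumer includes this.
    inline int fast_add(int a, int b);

    // util.inl -- the implementation; only the .cpp files that actually call
    // fast_add() include it, so touching it recompiles just those files.
    inline int fast_add(int a, int b) { return a + b; }

    // caller.cpp
    #include "util.h"
    #include "util.inl"
    int use() { return fast_add(1, 2); }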
If you want a function to be inlined, you have to place it in a header file, unless it's only used in the same source file. The reason is that the compiler needs the actual function definition in order to place it "inline" wherever it's called, and then compile it.
If you need the function to be implicitly inline, place its definition inside the class definition.
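For example:

    // Defining the member inside the class definition makes it implicitly
    // inline, so every includer sees the body at the call site.
    class Vec {
        float x = 0, y = 0;
    public:
        float dot(const Vec& o) const { return x * o.x + y * o.y; }
    };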
One option could be to place the inline functions in a precompiled header file, which will speed up compilation. But because of the nature of inline functions, all the places where they are used will still have to be recompiled when they change.
http://gcc.gnu.org/onlinedocs/gcc/Precompiled-Headers.html
I am wondering whether there is any difference between inlining functions on a linker level or compiler level in terms of execution speed?
E.g. if I have all my functions in .cpp files and rely on the linker to do inlining, will this inlining potentially be less efficient than, say, defining some functions in the headers for selected inlining at the compiler level, or unity builds without any linking and all inlining done by the compiler?
If the linker is just as efficient, why would one then still bother inlining functions explicitly at the compiler level? Is that just for convenience, say when there is just a one-line constructor and one can't be bothered with a .cpp file?
I suppose this might depend on the compiler, in which case I would be most interested in Visual C++ (Windows) and gcc (Linux).
The general rule is: all else being equal, the later in the pipeline (compiling -> linking -> (maybe) JIT -> execution) an optimization is done, the more data the optimizer has and the better the optimization it can perform. So unless the optimizer is dumb, you should expect better results when inlining is done by the linker; the linker will know more about the invocation context and do better optimization.
Generally, by the time the linker is run, your source has already been compiled into machine code. The linker's job is to take all the code fragments and link them together (possibly fixing addresses along the way). In such a case, there is no room for performing inlining.
But all is not lost. GCC provides a mechanism for link-time optimization (using the -flto option when compiling and linking). This causes gcc to produce a bytecode that can then be compiled and linked by the linker into a single executable. Since the bytecode contains more information than optimized machine code, the linker can now perform radical optimization on the whole codebase, something the compiler alone cannot do.
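A minimal sketch of that workflow (hypothetical names):

    // helper.cpp
    int helper(int x) { return x * 2 + 1; }

    // main.cpp
    int helper(int);                  // only the declaration is visible here
    int main() { return helper(20); }

    // Compile each file with -flto and link with -flto so the link-stage
    // optimizer can inline helper() into main() across the two objects:
    //   g++ -O2 -flto -c helper.cpp
    //   g++ -O2 -flto -c main.cpp
    //   g++ -O2 -flto helper.o main.o -o prog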
See the GCC documentation on -flto for more details. Not too sure about VC++, though.
Inlining is normally performed within a single translation unit (.cpp file). When you call functions in another file, they’re never inlined.
Link Time Optimization (LTO) changes this, allowing inlining to work across translation units. It should always be equal to or better than regular linking (sometimes very significantly so) in terms of how efficient the generated code is.
The reason both options are still available is that LTO can take a large amount of RAM and CPU – I’ve had VC++ take several minutes on linking a large C++ project before. Sometimes it’s not worth it to enable until you ship. You could also run out of address space with a large enough project, as it has to load all that bytecode into RAM.
For writing efficient code, nothing changes – all the same rules apply with LTO. It is potentially more efficient to explicitly define an inline function in a header file versus depending on LTO to inline it. The inline keyword only provides a hint so there’s no guarantee, but it might nudge it into being inlined where normally (with or without LTO) it wouldn’t be.
If the function is inlined, there would be no difference.
I believe the main reason for having inline functions defined in the headers is history. Another is portability. Until recently most compilers did not do link-time code generation, so having the functions in the headers was a necessity. That of course affects code bases started more than a couple of years ago.
Also, if you still target some compilers that don't support link-time code generation, you don't have a choice.
As an aside, I have in one case been forced to add a pragma to ask one specific compiler not to inline an init() function defined in one .cpp file, but potentially called from many places.
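The exact pragma is compiler-specific and not shown here; as a hedged sketch, GCC offers a noinline attribute for this purpose (MSVC has __declspec(noinline)):

    // init.cpp -- ask GCC not to inline this function at its call sites.
    __attribute__((noinline)) void init() {
        // ... one-time setup shared by many callers ...
    }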