I recently discovered the use of pure functions and subroutines in Fortran. From what the Fortran manual indicates, it seems that most of my subroutines can actually be defined as pure, since I always specify the intent of all the arguments, and I usually don't have "save", "pause", or external I/O in most of my subroutines.
My question is then: should I do it? I was wondering whether the compiler optimizes pure subroutines better, or whether it just does not matter, or whether it can even make things worse.
Thanks!
You work with the compiler to generate good code, and the more information you provide the compiler, the better a job the two of you can do together.
Whether it's labelling any dummy arguments you don't change with intent(in), using parameter for constants, explicitly making pure any subprogram that has no side effects, or using forall when you don't care about the order in which a loop is calculated, being more explicit about what you want to happen benefits you because:
the compiler can now flag more errors at compile time - hey, you modified that argument you said was intent(in), or you modified that module variable in a pure subroutine
your code is clearer to the next person to come to it without knowing what it's supposed to be doing (and that person could well be you three months later)
the compiler can be more aggressive with optimization (if the compiler has a guarantee from you that nothing is going to change, it can crank up the optimization).
Of those three benefits, the optimization is probably not the most important; in the case of pure subroutines, a smart compiler can probably tell through static analysis alone that your subroutine has no side effects. Still, the more guarantees you can give it, the better a job it can do of optimizing your code while maintaining correctness.
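As a minimal sketch of all three declarations working together (module and names are hypothetical):

module kinematics
  implicit none
  real, parameter :: g = 9.81           ! parameter: guaranteed constant

contains

  pure subroutine accelerate(v, a, dt)  ! pure: no I/O, no saved state
    real, intent(inout) :: v            ! intent(inout): may be read and changed
    real, intent(in)    :: a, dt        ! intent(in): the compiler flags any write
    v = v + (a - g) * dt
  end subroutine accelerate

end module kinematics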
As far as I know, it just does not matter in sequential mode. But if you activate "auto parallelization" options, a compiler may sometimes take advantage of the PURE declaration to parallelize (multi-thread) loops containing calls to pure subroutines; it cannot take that risk if the subroutines are not pure.
For the same reason, the PURE declaration is also useful to the programmer who wants to set parallelization directives manually (OpenMP, for instance), because the risk of trouble with such procedures is rather limited. It is often possible to parallelize loops with calls to non-pure subroutines, but this requires careful verification...
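For instance, a hedged sketch reusing the hypothetical accelerate from above - the OpenMP directive is the programmer's own assertion, which the PURE attribute makes easy to justify:

subroutine step_all(v, a, dt, n)
  use kinematics, only: accelerate
  implicit none
  integer, intent(in) :: n
  real, intent(in)    :: a(n), dt
  real, intent(inout) :: v(n)
  integer :: i

  ! Safe to parallelize: each iteration touches only its own element,
  ! and accelerate, being PURE, has no hidden state or I/O.
  !$omp parallel do
  do i = 1, n
     call accelerate(v(i), a(i), dt)
  end do
  !$omp end parallel do
end subroutine step_all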
Related
Let me first state that I know that inline does not mean that the compiler will always inline a function...
In C++ there really are two places for a non-template non-constexpr function implementation to go:
A header, definition should be inline
A source file
There are benefits/negatives to placing the implementation in one or the other (a short sketch of both placements follows this list):
inline function definition
compiler can inline the function
slower compile times, both due to having to parse the definitions and due to including the implementation's dependencies.
multiple copies of the function across the translation units that use it
source file definition
compiler can never inline the function (maybe that's not true with LTO?)
can avoid recompilation if the file hasn't changed
a single copy of the function for the whole program
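For concreteness, the two placements look something like this (hypothetical function names):

// math_utils.h - option 1: inline definition in the header
inline double square_inline(double x) { return x * x; }

// math_utils.h - option 2: declaration only...
double square_outofline(double x);

// math_utils.cpp - ...with the definition in the source file
double square_outofline(double x) { return x * x; }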
I am in the midst of writing a reusable math library where inlining can offer significant speedups. I only have test code and snippets to work with right now, so profiling isn't an option for helping me decide. Are there any rules - or just rules of thumb - on deciding where to define the function? Are there certain types of functions, like those with exceptions, which are known to always generate large amounts of code that should be relegated to a source file?
If you have no data, keep it simple.
Libraries that suck to develop don't get finished, and those that suck to use don't get used. So split h/cpp by default; that keeps build times short and development fast.
Then get data. Write tests and see if you get significant speedups from inlining. Then go and learn how to profile, realize your speedups were spurious, and write better tests.
How to profile and determine what is spurious and what is microbenchmark noise is somewhere between a chapter and a whole book in length. Read SO questions about performance in C++ and you'll at least learn that the ten most common ways to microbenchmark are not accurate.
For general rules, smallish bits of code in tight loops benefit from inlining, as do cases where external vectorization is plausible, and where false aliasing could block compiler optimizations.
Often you can hoist the benefits of inlining into your library by offering vector operations.
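For example - a sketch with hypothetical names - instead of having users call a scalar function in their own loops, expose the loop itself, so the hot path lives inside the library where it can be inlined and vectorized:

#include <cstddef>

// scalar building block: small enough to inline
inline double scale_one(double x, double factor) { return x * factor; }

// vector operation: the loop is in the library, so the compiler can
// inline scale_one and vectorize here regardless of how callers build
void scale_all(double* xs, std::size_t n, double factor)
{
    for (std::size_t i = 0; i < n; ++i)
        xs[i] = scale_one(xs[i], factor);
}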
Generally speaking, if you are statically linking (as opposed to DLL/DSO methods), then the compiler/linker will basically ignore inline and do what's sensible.
The old rule of thumb (which everyone seems to ignore) is that inline should only be used for small functions. The one problem with inlining is that all too often I see people doing some timed test, e.g.
#include <chrono>

auto startTime = std::chrono::steady_clock::now();
for (int i = 0; i < BIG_NUM; ++i)   // BIG_NUM and doThing() stand in for the code under test
{
    doThing();
}
auto endTime = std::chrono::steady_clock::now();   // elapsed: endTime - startTime
The immediate conclusion from that test is that inline is good for performance everywhere. But that isn't the case.
inlining also increases the size of your compiled exe. This has a nasty side effect in that it increases the burden placed on the instruction and uop caches, which can cause a performance loss. So in the case of a large scale app, more often than not you'll find that removing inline from commonly used functions can actually be a performance win.
One of the nastiest problems with inline is that if it's applied to the wrong method, it's very hard to get a profiler to point out a hot spot - the code is just a little warmer than it needs to be at multiple points in the codebase.
My rule of thumb - if the code for a method can fit on one line, inline it. If the code doesn't fit on one line, put it in the cpp file until a profiler indicates moving it to the header would be beneficial.
The rule of thumb I work by is simple: No function definitions in headers, and all function definitions in a source file, unless I have a specific reason to do otherwise.
Generally speaking, C++ code (like code in many languages) is easier to maintain if there is a clear separation of interface from implementation. Maintenance effort is (quite often) a cost driver in non-trivial programs, because it translates into developer time and salary costs.

In C++, interface is represented by declarations of functions (without definitions), type declarations, struct and class definitions, etc., i.e. the things that are typically placed in a header if the intent is to use them in more than one source file. Changing the interface (e.g. changing a function's argument types or return type, adding a member to a class, etc.) means that everything which depends on that interface must be recompiled. In the long run, it often works out that header files need to change less often than source files - as long as interface is kept separate from implementation.

Whenever a header changes, all source files which use that header (i.e. that #include it) must be recompiled. If a header doesn't change, but a function definition changes, then only the source file containing the changed function definition needs to be recompiled.
In large projects (say, with hundreds of source and header files), this sort of thing can make the difference between an incremental build taking a few seconds to recompile a few changed source files, and taking significantly longer to recompile a large number of source files because a header they all depend on has changed.
Then the focus can be on getting the code working correctly - in the sense of producing the same observable output given a set of inputs, meeting its functional requirements, and passing suitable test cases.
Once the code is working correctly enough, attention can turn to other program characteristics, such as performance. If profiling shows that a function is called many times and represents a performance hot-spot, then you can look at options for improving performance. One option that MIGHT be relevant for improving performance of a program that is otherwise correct is to selectively inline functions. But every time this is done, it amounts to accepting a greater maintenance burden in exchange for performance, so it is necessary to have evidence of the need (e.g. from profiling).
Like most rules of thumb, this one has exceptions. For example, templated functions or classes in C++ do generally need to be defined in headers (effectively inline) since, more often than not, the compiler needs to see their definition in order to instantiate the template. However, that is not a justification for inlining everything (and it is not a justification for turning every class or function into a template).
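A sketch of why (hypothetical names): the compiler needs the template's body at every instantiation site, so the definition normally lives in the header:

// clamp.h - the definition must be visible wherever it's instantiated
template <typename T>
T clamp_value(T v, T lo, T hi)
{
    return v < lo ? lo : (hi < v ? hi : v);
}

// user.cpp
// int r = clamp_value(5, 0, 3);  // instantiates clamp_value<int> from the header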
Without profiling or other evidence, I would rarely bother to inline functions. Inlining is a hint to the compiler, which the compiler may well ignore, so the effort of inlining may not even be worth it. Doing such a thing without evidence may achieve nothing - in which case it is simply premature optimisation.
I'm currently delving into Fortran and I've come across the pure keyword specifying functions/subroutines that have no side effects.
I have a book, Fortran 90/95 by S Chapman, which introduces the pure keyword but strangely provides no "good coding practice" advice on its use.
I'm wondering how liberally one should use this keyword in their procedures. Just by looking around, it's obvious to me that most procedures without side effects don't bother to include the pure keyword.
So where is it best used? Only in procedures one wants to completely guarantee have no side effects? Or perhaps in procedures one plans to convert to elemental procedures later? (As elemental procedures must first be pure.)
PURE is required in some cases - for example, for procedures called within specification expressions or from FORALL or DO CONCURRENT constructs. PURE is required in these cases to give the Fortran processor flexibility in the ordering of procedure invocations while still having a reasonably deterministic outcome from a particular stretch of code.
Beyond those required cases, whether to use PURE or not is basically a question of style, which is somewhat subjective.
There are costs to use of PURE (inability to do I/O within the procedure, inability to call procedures that are not PURE) and benefits (a pure procedure written today can be called from a context written tomorrow that requires a pure procedure; and because PURE procedures have no side effects, the implications of invoking such a procedure may be clearer to a reader of the code). The trade-off between the two depends on specifics.
The standard gives Fortran processors considerable leeway in how they evaluate expressions and function references within expressions. It definitely constrains programs in some ways around side effects of function execution and modification of function arguments. The requirements on a pure function are consistent with that leeway and those constraints; consequently, some people use a style where most functions are pure. Again, it may still depend on specifics, and exceptions may have to exist for things like C interoperability or interaction with external APIs.
As suggested by chw21, the primary motivation of PURE is to allow the compiler to better optimize. In particular, the lack of PURE for a function will prevent parallelization due to unknown side effects. Note that PURE subroutines, unlike functions, may have INTENT(INOUT) arguments, but there is still the restriction on side effects (and that a PURE procedure can call only other PURE procedures.)
Up through Fortran 2003, ELEMENTAL procedures are implicitly PURE. Fortran 2008 adds an IMPURE prefix that can be used with ELEMENTAL procedures to disable that aspect.
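A short sketch pulling together the required contexts and the ELEMENTAL relationship (all names hypothetical):

module demo
  implicit none
contains

  pure function square(x) result(y)      ! PURE: callable from DO CONCURRENT
    real, intent(in) :: x
    real :: y
    y = x * x
  end function square

  elemental function halve(x) result(y)  ! ELEMENTAL implies PURE (before F2008's IMPURE)
    real, intent(in) :: x
    real :: y
    y = 0.5 * x
  end function halve

end module demo

program use_demo
  use demo
  implicit none
  real :: a(4) = [1.0, 2.0, 3.0, 4.0]
  integer :: i

  do concurrent (i = 1:4)    ! only PURE procedures may be referenced here
     a(i) = square(a(i))
  end do

  a = halve(a)               ! elemental: applied element-wise to the whole array
  print *, a
end program use_demo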
I've found questions (like this one) asking what [[carries_dependency]] does, and that's not what I'm asking here.
I want to know when you shouldn't use it, because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code. One comment said the code can be equal or slower, but the poster didn't elaborate.
I imagine appropriate places to use this are on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread, and that it shouldn't be used on callbacks or thread entry points.
Can someone comment on my understanding and elaborate on the subject in general, of when and when not to use it?
EDIT: I know there's this tome on the subject, should any other reader be interested; it may contain my answer, but I haven't had the chance to read through it yet.
In modern C++ you should generally not use std::memory_order_consume or [[carries_dependency]] at all. They're essentially deprecated while the committee comes up with a better mechanism that compilers can practically implement.
And that hopefully doesn't require sprinkling [[carries_dependency]] and kill_dependency all over the place.
2016-06 P0371R1: Temporarily discourage memory_order_consume
It is widely accepted that the current definition of memory_order_consume in the standard is not useful. All current compilers essentially map it to memory_order_acquire. The difficulties appear to stem both from the high implementation complexity, from the fact that the current definition uses a fairly general definition of "dependency", thus requiring frequent and inconvenient use of the kill_dependency call, and from the frequent need for [[carries_dependency]] annotations. Details can be found in e.g. P0098R0.
Notably, in C++ x - x still carries a dependency, but most compilers would naturally break the dependency and replace that expression with a constant 0. Compilers also sometimes turn data dependencies into control dependencies when they can prove something about value ranges after a branch.
On modern compilers that just promote mo_consume to mo_acquire, fully aggressive optimizations can always happen; there's never anything to gain from [[carries_dependency]] and kill_dependency even in code that uses mo_consume, let alone in other code.
This strengthening to mo_acquire has potentially-significant performance cost (an extra barrier) for real use-cases like RCU on weakly-ordered ISAs like POWER and ARM. See this video of Paul E. McKenney's CppCon 2015 talk C++ Atomics: The Sad Story of memory_order_consume. (Link includes a summary).
If you want real dependency-ordering read-only performance, you have to "roll your own", e.g. by using mo_relaxed and checking the asm to verify it compiled to asm with a dependency. (Avoid doing anything "weird" with such a value, like passing it across functions.) DEC Alpha is basically dead and all other ISAs provide dependency ordering in asm without barriers, as long as the asm itself has a data dependency.
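A hedged sketch of the "roll your own" approach (hypothetical names; whether it's correct depends on checking the generated asm, as noted above):

#include <atomic>

struct Node { int payload; };

std::atomic<Node*> g_ptr{nullptr};  // assumed published by some writer thread

int read_payload()
{
    // Relaxed load: on ARM/POWER the dereference below is data-dependent on
    // the loaded pointer, so the hardware orders it after the load. The
    // compiler makes no such promise, hence "check the asm".
    Node* p = g_ptr.load(std::memory_order_relaxed);
    return p ? p->payload : 0;
}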
If you don't want to roll your own and live dangerously, it might not hurt to keep using mo_consume in "simple" use-cases where it should be able to work; perhaps some future mo_consume implementation will have the same name and work in a way that's compatible with C++11.
There is ongoing work on making a new consume, e.g. 2018's http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html
because the answers I've read all make it sound like you can plaster this code everywhere and magically you'd get equal or faster code
The only way you can get faster code is when that annotation allows the omission of a fence.
So the only case where it could possibly be useful is:
your program uses consume ordering on an atomic load operation, in important, frequently executed code;
the "consume value" isn't just used immediately and locally, but also passed to other functions;
the target CPU gives specific guarantees for consuming operations (as strong as placing a fence before that operation, but applying just to that operation);
the compiler writers take their job seriously: they manage to translate high-level-language consumption of a value into CPU-level consumption, to get the benefit of the CPU's guarantees.
That's a bunch of necessary conditions to possibly get measurably faster code.
(And the latest trend in the C++ community is to give up on inventing a proper compilation scheme that's safe in all cases, and to come up with a completely different way for the user to instruct the compiler to produce code that "consumes" values, with much more explicit, naively translatable C++ code.)
One comment said the code can be equal or slower, but the poster didn't elaborate.
Of course, annotations of the kind that you can randomly put on programs simply cannot make code more efficient in general! That would be too easy, and also self-contradictory.
Either an annotation specifies a constraint on your code - that is, a promise to the compiler - and you can't put it anywhere it doesn't correspond to a guarantee in the code (like noexcept in C++, or restrict in C), or it would break the code in various ways (an exception in a noexcept function stops the program; aliasing of restricted pointers can cause funny miscompilation and bad behavior - formerly the behavior was simply undefined in that case). The compiler can then use such an annotation to optimize the code in specific ways.
Or the annotation doesn't constrain the code in any way; then the compiler can't count on anything, and the annotation creates no new optimization opportunities.
If an annotation gets you more efficient code in some cases at no risk of breaking the program, then you must potentially get less efficient code in other cases. That's true in general, and specifically true of consume semantics, which impose the previously described constraints on the translation of C++ constructs.
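A tiny illustration of the first kind of annotation (hypothetical function): noexcept is a promise the code must keep, and the compiler gets to optimize in exchange:

// Promise: this never throws. If it ever did, std::terminate would run.
// In exchange, callers and the compiler may drop exception-handling paths.
int sum_checked(int a, int b) noexcept
{
    return a + b;
}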
I imagine appropriate places to use this are on any function return or parameter that is a pointer or reference and that will be passed or returned within the calling thread
No, the one and only case where it might be useful is when the intended calling function will probably use consume memory order.
I am not sure if this is a valid comparison or a valid statement, but over the years I have heard folks claim that programs written in C++ generally take longer to compile than the same programs written in C, and that applications coded in C++ are generally slower at run time than ones written in C.
Is there any truth in these statements?
Apart from reaping the benefits of the OOP flexibility that C++ provides, should the above comparison be given consideration purely from a compilation/execution-time perspective?
I hope that this doesn't get closed as too generic or vague; it is just an attempt to learn the actual facts about statements I have been hearing over the years from many programmers (predominantly C programmers).
I'll answer one specific part of the question that's pretty objective: C++ code that uses templates is going to be slower to compile than C code. If you don't use templates (though you probably will if you use the standard library), compilation times should be very similar.
EDIT:
In terms of runtime it's much more subjective. Even though C may be a somewhat lower-level language, C++ optimizers are getting really good, and C++ lends itself to representing real-world concepts more naturally. If it's easier to represent your requirements in code (as I'd argue it is in C++), it's often easier to write better (and more performant) code than you would in another language. I don't think there's any objective data showing C or C++ to be faster in all possible cases. I would actually suggest picking your language based on project needs, and then writing it in that language. If things are too slow, profile and proceed with the normal performance-improvement techniques.
The relative runtime speed is a bit harder to predict. At one time, when most people thought of C++ as being all about inheritance and used virtual functions a lot (even when they weren't particularly appropriate), code written in C++ was typically a little slower than equivalent C.
With (what most of us would consider) modern C++, the reverse tends to be true: templates give enough more compile-time flexibility that you can frequently produce code that's noticeably faster than any reasonable equivalent in C. In theory you could always avoid that by writing specialized code equivalent to the result of "expanding" a template -- but in reality, doing so is exceedingly rare, and quite expensive.
There is something of a tendency for C++ to be written somewhat more generally as well - just for example, reading data into std::string or std::vector (or std::vector<std::string>) so the user can enter an arbitrary amount of data without buffer overflow or the data simply being truncated at some point. In C it's a lot more common to see somebody just code up a fixed-size buffer, and if you enter more than that, it either overflows or truncates. Obviously enough, you pay something for that - the C++ code typically ends up using dynamic allocation (new), which is typically slower than just defining an array. OTOH, if you write C to accomplish the same thing, you end up writing a lot of extra code, and it typically runs about the same speed as the C++ version.
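A minimal sketch of the C++ side of that contrast - it reads arbitrarily many lines with no fixed-size buffer to overflow or truncate:

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> lines;
    std::string line;

    // each string and the vector grow dynamically as needed
    while (std::getline(std::cin, line))
        lines.push_back(line);

    std::cout << "read " << lines.size() << " lines\n";
}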
In other words, it's pretty easy to write C that's noticeably faster for things like benchmarks and single-use utilities, but the speed advantage evaporates in real code that has to be robust. In the latter case, about the best you can usually hope for is that the C code is equivalent to a C++ version, and in all honesty doing even that well is fairly unusual (at least IME).
Comparing compilation speed is no easier. On one hand, it's absolutely true that templates can be slow -- at least with most compilers, instantiating templates is quite expensive. On a line-for-line basis, there's no question that C will almost always be faster than anything in C++ that uses templates much. The problem with that is that a line-for-line comparison rarely makes much sense -- 10 lines of C++ may easily be equivalent to hundreds or even thousands of lines of C. As long as you look only at compile time (not development time), the balance probably favors C anyway, but certainly not by nearly as dramatic a margin as might initially seem to be the case. This also depends heavily on the compiler: just for example, clang does a lot better than gcc in this respect (and gcc has improved a lot in the last few years too).
The run time of C++ versus C would suffer only if you use certain C++-specific features. Exceptions and virtual function calls add to run time compared to returning error code and direct calls. On the other hand, if you find yourself using function pointers in C (as, say, GTK does) you are already paying at least some of the price for virtual functions. And checking error code after each function return will consume time too - you don't do it when you use exceptions.
On the other hand, inlining and templates in C++ may allow you to do a lot of work compile-time - work that C defers to run time. In some cases, C++ may end up faster than C.
If you compile the same code as C and C++, there should be no difference.
If you let the compiler do the job for you, like expanding templates, that will take some time. If you do the same in C, with cut-and-paste or some intricate macros, it will take up your time instead.
In some cases, inline expansion of templates will actually result in code that is more specialized and runs faster than the equivalent C code. Like here:
http://www2.research.att.com/~bs/new_learning.pdf
Or this report showing that many C++ features have no runtime cost:
http://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf
Although it's an old question, I'd like to add my five cents here, as I'm probably not the only one who finds it via a search engine.
I can't comment on compilation speed but on execution speed:
To the best of my knowledge, there is only one feature in C++ that costs performance even if you don't use it: exceptions, because they prevent a few compiler optimizations (which is the reason noexcept was introduced in C++11). However, if you need some kind of error-checking mechanism, then exceptions are probably more efficient than the combination of return-value checking and a lot of if-else statements. This is especially true if you have to escalate an error up the stack.
Anyway, if you turn off exceptions during compilation, C++ introduces no overhead except in places where you deliberately use the associated features (e.g. you don't have to pay for polymorphism if you don't use virtual functions), and most features introduce no runtime overhead at all (overloading, templates, namespaces, and so on).
On the other hand, most forms of generic code will be much faster in C++ than the equivalent in C, because C++ provides the built-in mechanisms (templates and classes) for it. A typical example is C's qsort vs C++'s std::sort. The C++ version is usually much faster because, inside sort, the comparator is known at compile time, which at least saves a call through a function pointer and in the best case allows a lot of additional compiler optimizations.
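A minimal sketch of that comparison (hypothetical wrapper function):

#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort calls this through a function pointer on every comparison
int cmp_int(const void* a, const void* b)
{
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);
}

void sort_both(std::vector<int>& v, std::vector<int>& w)
{
    std::qsort(v.data(), v.size(), sizeof(int), cmp_int);
    // std::sort sees the comparator at compile time and can inline it
    std::sort(w.begin(), w.end(), [](int a, int b) { return a < b; });
}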
That being said, the "problem" with C++ is that it's easy to hide complexity from the user, such that seemingly innocent code might be much slower than expected. This is mainly due to operator overloading, polymorphism, and constructors/destructors, but even a simple call to a member function hides the passed this pointer, which also isn't a no-op.
Considering operator overloading: when you see a * in C, you know it is (on most architectures) a single, cheap assembler instruction; in C++, on the other hand, it can be a complex function call (think of matrix multiplication). That doesn't mean you could implement the same functionality in C any faster, but in C++ you don't directly see that it might be an expensive operation.
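As a sketch (hypothetical Matrix3 type), the innocent-looking * below dispatches to a user-defined operator that does 27 multiply-adds:

#include <array>

struct Matrix3
{
    std::array<double, 9> m{};  // row-major 3x3, zero-initialized
};

Matrix3 operator*(const Matrix3& a, const Matrix3& b)
{
    Matrix3 r;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                r.m[i * 3 + j] += a.m[i * 3 + k] * b.m[k * 3 + j];
    return r;
}

Matrix3 combine(const Matrix3& a, const Matrix3& b)
{
    return a * b;  // looks like one cheap instruction; it isn't
}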
Destructors are a similar case: in "modern" C++ you will hardly see any explicit destruction via delete, but any local variable that goes out of scope might trigger an expensive call to a (virtual) destructor without a single line of code indicating it (ignoring }, of course).
And finally, some people (especially those coming from Java) tend to write complex class hierarchies with lots of virtual functions, where each call to such a function is a hidden indirect function call that is difficult or impossible to optimize.
So while hiding complexity from the programmer is in general a good thing, it sometimes has an adverse effect on runtime if the programmer is not aware of the costs of these "easy to use" constructs.
As a summary, I would say that C++ makes it easier for inexperienced programmers to write slow code (because they don't directly see the inefficiencies in a program), but it also allows good programmers to write "good", correct, and fast code faster than they could in C - which gives them more time to think about optimizations where they really are necessary.
P.S.:
Two things I haven't mentioned (probably among others that I simply forgot) are C++'s capacity for complex compile-time computation (thanks to templates and constexpr) and C's restrict keyword. This is because I haven't used either of them in time-critical programs yet, so I can't comment on their general usefulness or their real-world performance benefit.
C applications are generally faster to compile and execute than C++ applications; put another way, C++ applications tend to be slower at runtime and much slower to compile than C programs.
Look at these links:
C vs C++ (a)
C vs C++ (b)
Efficient C/C++
C++ Compile Time and RunTime Performance
I have a section of my program that contains a large amount of math with some rather long equations. It's long and unsightly, and I wish to replace it with a function. However, this chunk of code is used an extreme number of times in my code and also requires a lot of variables to be initialized.
If I'm worried about speed, is the cost of calling the function and initializing the variables negligible here, or should I stick to directly coding it in each time?
Thanks,
-Faken
Most compilers are smart about inlining reasonably small functions to avoid the overhead of a function call. For functions big enough that the compiler won't inline them, the overhead for the call is probably a very small fraction of the total execution time.
Check your compiler documentation to understand its specific approach. Some older compilers required, or could benefit from, hints that a function is a candidate for inlining.
Either way, stick with functions and keep your code clean.
Are you asking if you should optimize prematurely?
Code it in a maintainable manner first; if you then find that this section is a bottleneck in the overall program, worry about tuning it at that point.
You don't know where your bottlenecks are until you profile your code. Anything you can assume about your code hot spots is likely to be wrong. I remember once I wanted to optimize some computational code. I ran a profiler and it turned out that 70 % of the running time was spent zeroing arrays. Nobody would have guessed it by looking at the code.
So, first code clean, then run a profiler, then optimize the rough spots. Not earlier. If it's still slow, change algorithm.
Modern C++ compilers generally inline small functions to avoid function call overhead. As far as the cost of variable initialization, one of the benefits of inlining is that it allows the compiler to perform additional optimizations at the call site. After performing inlining, if the compiler can prove that you don't need those extra variables, the copying will likely be eliminated. (I assume we're talking about primitives, not things with copy constructors.)
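A hedged sketch of what that means in practice (hypothetical function and names) - after inlining, the temporaries typically vanish:

// small helper the optimizer can inline
inline double poly_eval(double x, double a, double b, double c)
{
    const double x2 = x * x;  // temporary the optimizer can fold away
    return a * x2 + b * x + c;
}

double use_it(double x)
{
    // after inlining this compiles to straight-line arithmetic:
    // no call overhead, no separate variable initialization
    return poly_eval(x, 2.0, -3.0, 0.5);
}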
The only way to answer that is to test it. Without knowing more about the proposed function, nobody can really say whether the compiler can/will inline that code or not. This will also depend on the compiler and the compiler flags you use. If you find that it's really a problem, you may be able to use different flags, a pragma, etc., to force the function to be generated inline even if it wouldn't be otherwise.
Without knowing how big the function would be, and/or how long it'll take to execute, it's impossible to guess how much effect on speed it'll have if it isn't generated inline.
With both of those being unknown, none of us can really guess at how much effect moving the code into a function will have. There might be none, or little or huge.