Fortran Print Statements Performance Effects


I just inherited some old Fortran code that has print statements everywhere (when it runs, the matrix streams by). I know these print statements are useless because I cannot tell what the program is printing as it is going by so fast. But is there a significant performance impact to having a lot of print statements in a Fortran program (i.e. does an overly verbose program take longer to execute)? It seems like it would as it is another line to execute, but I don't know if it is significant.

In general, yes, I/O is "relatively costly" to execute since you have to do things like formatting numbers - especially floating point numbers, even if those procedures are highly optimized. However, one of the biggest costs (the system call to actually perform the I/O after the buffer to write has been prepared) is amortized in good compilers/runtimes since the I/O statements are usually buffered by default. This helps cut down the number of system calls significantly, thus reducing delays caused by frequent context switching between your app and the OS.
That said, if you are worried about the performance hit, why don't you try commenting out every PRINT or WRITE statement and see how the runtime changes? Or, even better, profile your application and see how much time is spent in I/O and related routines.

Related

How to do benchmarking for C/C++ code accurately?

I'm asking about the answers to this question. In my answer, I first just took the time before and after the loops and printed the difference. But after an update in response to @cigien's answer, it seems that I benchmarked inaccurately by not warming up the code.
What is warming up the code? I think what happened is that the string was moved into the cache first, which made the benchmark results for the following loops close to each other. In my old answer, the first benchmark result was slower than the others, I think because it took more time to move the string into the cache. Am I correct? If not, what does warming up actually do to the code? And, generally speaking, what else should I have done besides warming up to get more accurate results? In other words, how do I benchmark C++ code correctly (and C too, if the same approach applies)?

To give you an example of warm-up, I recently benchmarked some NVIDIA CUDA kernel calls:
The execution speed seems to increase over time, probably for several reasons, such as the fact that the GPU frequency is variable (to save power and keep cool).
Sometimes a slower call has an even worse impact on the next call, so the benchmark can be misleading.
If you need to feel safe about these points, I advise you to:
- reserve all the dynamic memory (like vectors) first
- make a for loop to do the same work several times before a measurement
- this implies initializing the input data (especially random data) only once before the loop and copying it each time inside the loop, to ensure that you do the same work
- if you deal with complex objects with a cache, I advise you to pack them in a struct and to make an array of this struct (with the same construction or cloning technique), in order to ensure that the same work is done on the same starting data in the loop
- you can avoid the for loop and the copying of the data IF you alternate two calls very often and assume that the effects of the behavior differences will cancel each other out, for example in a simulation of continuous data like positions
Concerning the measurement tools, I've always faced problems with high_resolution_clock on different machines, such as inconsistent durations. By contrast, the Windows QueryPerformanceCounter is very good.
I hope that helps!
EDIT
I forgot to add that, as said in the comments, compiler optimizations can be annoying to deal with. The simplest way I've found is to increment a variable that depends on some non-trivial operations on both the warm-up and the measured data, in order to force the computation to actually be performed, in sequence, as much as possible.
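The advice above (warm up first, repeat the same work on the same data, and keep the results "used") might be sketched as follows. This is only an illustration under assumptions: `work`, `bench` and `BenchResult` are hypothetical names, the workload is a trivial sum, and `std::chrono::steady_clock` is chosen here for portability; it is not the timer the answer recommends. Because this workload does not modify its input, no per-iteration copy is needed.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical workload used only for illustration: sums a vector.
long long work(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0LL);
}

struct BenchResult {
    double avg_ns;   // average time per measured run, in nanoseconds
    long long sink;  // accumulated results, returned so they count as "used"
};

BenchResult bench(const std::vector<int>& input, int warmup_runs, int measured_runs) {
    long long sink = 0;

    // Warm-up: same work on the same data; results are kept but not timed.
    for (int i = 0; i < warmup_runs; ++i) sink += work(input);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < measured_runs; ++i) sink += work(input);
    const auto stop = std::chrono::steady_clock::now();

    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {total_ns / measured_runs, sink};
}
```

Printing or otherwise consuming `sink` after the call gives the compiler less room to discard the loops, although an aggressive optimizer may still simplify a workload this simple.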

To what extent DLLs speed up calculations in a code such as loops etc

I have code with a loop that counts to 10000000000, and within that loop I do some calculations with conditional operators (if etc.). It takes about 5 minutes to reach that number. So my question is: can I reduce the time it takes by creating a DLL and calling that DLL's functions to do the calculation and return the values to the main program? Will it make a difference in the time it takes to do the calculations? Further, will it improve the overall efficiency of the program?

By a “dll” I assume you mean going from managed .net code to un-managed “native” compiled code. Yes, this approach can help.
It very much depends. Remember, the loop itself likely takes only about 25 seconds on a typical i3 (that is the cost and overhead of looping to 10 billion while doing almost nothing else).
And I assume you went to the project, then Compile. On that screen select Advanced Compile Options. There you want to check remove integer overflow checks. Make sure your loop vars are integers for speed.
At that point the “base” loop that does nothing will drop from about 20 seconds down to about 6 seconds.
So that is the base loop speed; now it comes down to what we are doing inside of that loop.
At this point, note that .net DOES HAVE a JIT (a just-in-time native compiler). This means your source code goes to IL (intermediate language) code and then in turn that code does get compiled down to native x86 machine code. So this “does” get the source code down to REAL machine code level. However, a JIT is certainly NOT as efficient, nor can it spend “time” optimizing the code, since the JIT has to work on the “fly” without you noticing it. So c++ (or VB6, which runs as fast as c++ when native compiled) can certainly run faster, but the question then is by how much?
The optimizing compiler might get another doubling in speed for the actual LOOPING code etc.
However, in BOTH cases (using .net managed code, or code compiled down to native Intel code), they BOTH LIKELY are calling the SAME routines to do the math!
In other words, if 80% of the time is spent in “library” code that does the math etc., then calling such code from c++ or calling such code from .net will make VERY LITTLE difference since the BULK of the work is spent in the same system code!
The above concept is really “supervisor” mode vs. your application mode.
In other words, the amount of time spent in your code vs. that of using system “library” code means that the bulk of the heavy lifting is occurring in supervisor code. That means jumping from .net to native c++/vb6 dll’s will NOT yield much in the way of performance.
So I would first ensure loops and array index refs in your code are integer types. The above tip of turning off overflow checks will likely get you “close” to the speed of using a .dll. Worse, the time to “shuffle” the data to and from that external .dll sub will often cost you MORE than the time saved on the processing side.
And if your routines are doing database or file i/o, then all bets are off, as that is VERY different problem.
So I would first test/try your application with [x] remove integer overflow checks checked (that is, with the overflow checks turned off). And make sure during testing that you use ctrl-F5 in place of F5 to run your code without DEBUGGING. The above overflow check option will NOT show increased speed when in debug mode.
So it is hard to tell; it really depends on how much math (especially floating-point calls) you are doing (supervisor code) vs. just moving values around in arrays. If more of the code is moving things around, then I suggest the integer optimizing above, and going to a .dll likely will not help much.
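As a rough back-of-the-envelope check of the argument above, take the answer's illustrative figures at face value: 80% of the run time is spent in shared library code, so a fraction p = 0.2 is spent in your own loop code, and native compilation doubles the speed of that part only (s = 2). These are assumed numbers, not measurements. The overall speedup S is then:

```latex
\[
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
  = \frac{1}{0.8 + \dfrac{0.2}{2}}
  = \frac{1}{0.9} \approx 1.11
\]
```

That is roughly an 11% improvement overall, before subtracting whatever it costs to move data to and from the DLL.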

Couldn't you use "Parallel.ForEach" and split this huge loop into equal pieces?
Or try working with some BackgroundWorkers or even Threads (more than 1!!) to get the most out of the CPU and reduce the time spent.

How negligible is the time for ofstream

I have a C++ program where I do various experiments and during these experiments I output some values to a file using ofstream. The structure is basically:
Start timer
output to a file using ofstream (the output is, at most, a few words)
do some experimental work
Stop timer
My question, which is a bit broad, is: can I ignore the time that the ofstream takes, or is it not negligible? Or, I guess, does it depend?

First of all, from your pseudo code, you could just start the timer after the file output :-) But I'm guessing it's not like that in the real app.
Beyond that, it's obviously a matter of "it depends". If you aren't outputting all that much, and the code you're interested in runs for minutes, then the output obviously won't make much of a difference. If, on the other hand, you are trying to catch runtimes measured in microseconds, you'll probably be mostly measuring the ofstream.
You could try doing various magics, like running the actual output on a thread, or just adding your messages to a previously allocated char array and outputting that at the end. However, everything incurs some runtime penalty; nothing is ever free.
Since you're not interested in measuring the actual output time, you could compile a version without the output to do measurements, and a version with the output to debug the code. EDIT: or make that a runtime option. Nothing is ever free, but an "if (OutputEnabled)" is pretty close to "free" :-)
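Two of the ideas above, collecting messages in previously allocated memory and making output a runtime option, could be combined along these lines. This is a minimal sketch, not a definitive implementation: `DeferredLog` and its members are hypothetical names, and a `std::string` with reserved capacity stands in for the preallocated char array.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: collects messages in memory and writes them in one go.
class DeferredLog {
public:
    DeferredLog(bool enabled, std::size_t reserve_bytes) : enabled_(enabled) {
        if (enabled_) buffer_.reserve(reserve_bytes);  // allocate up front
    }

    // When disabled this costs a single branch.
    void add(const std::string& message) {
        if (!enabled_) return;
        buffer_ += message;
        buffer_ += '\n';
    }

    const std::string& contents() const { return buffer_; }

    // Intended to be called after the timer has been stopped.
    bool flush_to(const std::string& path) const {
        if (!enabled_) return true;
        std::ofstream out(path);
        out << buffer_;
        out.flush();
        return out.good();
    }

private:
    bool enabled_;
    std::string buffer_;
};
```

Appending can still reallocate if the reserved size is exceeded, so the reserve should be generous for the experiment being timed.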

It mostly depends on what the ofstream does: as long as it just stores data in its internal buffer it will be fast, but if its buffer fills up and it actually calls the OS API to perform the write, the time spent could be much bigger.
But obviously everything depends on how long the "experimental work" takes in comparison to the I/O you perform, both in the case where it just writes the data to the internal buffer and when the stream is flushed; as suggested in the comment, you should time the two things independently, to see how one time compares to the other.

Something is negligible compared to something else. You compare to nothing else in your question.
I had already asked a question here to check the validity of my statement below, and the conclusion was that I should not rely on such a classification, though it still gives a very rough first evaluation (which may be false in some cases):
Stack operations are 10x faster than heap memory creations that are 10x faster than graphical device operations that are 10x faster than I/O operations ( writing to a file on hard drive for example ) that are 10x faster than net communication operation...
It is only a rough estimation. Everything has to be re-evaluated each time you code.
If the time spent writing to the ofstream doesn't impact the whole mechanism, then it can be considered negligible.
If it does impact your whole program mechanism, then it cannot be considered negligible. Obviously.

Function size vs execution speed

I remember hearing somewhere that "large functions might have higher execution times" because of code size, and CPU cache or something like that.
How can I tell if function size is imposing a performance hit for my application? How can I optimize against this? I have a CPU intensive computation that I have split into (as many threads as there are CPU cores). The main thread waits until all of the worker threads are finished before continuing.
I happen to be using C++ on Visual Studio 2010, but I'm not sure that's really important.
Edit:
I'm running a ray tracer that shoots about 5,000 rays per pixel. I create (cores-1) threads (1 per extra core), split the screen into rows, and give each row to a CPU thread. I run the trace function on each thread about 5,000 times per pixel.
I'm actually looking for ways to speed this up. It is possible for me to reduce the size of the main tracing function by refactoring, and I want to know if I should expect to see a performance gain.
A lot of people seem to be answering the wrong question here. I'm looking for an answer to this specific question: even if you think I can probably do better by optimizing the contents of the function, I want to know if there is a function size/performance relationship.

It's not really the size of the function, it's the total size of the code that gets cached when it runs. You aren't going to speed things up by splitting code into a greater number of smaller functions, unless some of those functions aren't called at all in your critical code path, and hence don't need to occupy any cache. Besides, any attempt you make to split code into multiple functions might get reversed by the compiler, if it decides to inline them.
So it's not really possible to say whether your current code is "imposing a performance hit". A hit compared with which of the many, many ways that you could have structured your code differently? And you can't reasonably expect changes of that kind to make any particular difference to performance.
I suppose that what you're looking for is instructions that are rarely executed (your profiler will tell you which they are), but are located in the close vicinity of instructions that are executed a lot (and hence will need to be in cache a lot, and will pull in the cache line around them). If you can cluster the commonly-executed code together, you'll get more out of your instruction cache.
Practically speaking though, this is not a very fruitful line of optimization. It's unlikely you'll make much difference. If nothing else, your commonly-executed code is probably quite small and adjacent already, it'll be some small number of tight loops somewhere (your profiler will tell you where). And cache lines at the lowest levels are typically small (of the order of 32 or 64 bytes), so you'd need some very fine re-arrangement of code. C++ puts a lot between you and the object code, that obstructs careful placement of instructions in memory.
Tools like perf can give you information on cache misses - most of those won't be for executable code, but on most systems it really doesn't matter which cache misses you're avoiding: if you can avoid some then you'll speed your code up. Perhaps not by a lot, unless it's a lot of misses, but some.
Anyway, in what context did you hear this? The most common one I've heard it come up in is the idea that function inlining is sometimes counter-productive, because sometimes the overhead of the code bloat is greater than the function call overhead avoided. I'm not sure, but profile-guided optimization might help with that, if your compiler supports it. A fairly plausible profile-guided optimization is to preferentially inline at call sites that are executed a larger number of times, leaving colder code smaller, with less overhead to load and fix up in the first place, and (hopefully) less disruptive to the instruction cache when it is pulled in. Somebody with far more knowledge of compilers than me will have thought hard about whether that's a good profile-guided optimization, and therefore decided whether or not to implement it.
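One way to act on the "rarely executed instructions next to hot ones" idea by hand is to move the rare path into a separate function that the compiler is asked not to inline. The sketch below is an assumption-laden illustration: `COLD_NOINLINE`, `report_bad_sample` and `sum_valid` are hypothetical names, the attributes are compiler-specific (GCC/Clang and MSVC spellings are shown), and nothing guarantees that the resulting code layout is actually faster, so it should be measured.

```cpp
#include <cstddef>
#include <cstdio>

#if defined(__GNUC__) || defined(__clang__)
#  define COLD_NOINLINE __attribute__((noinline, cold))
#elif defined(_MSC_VER)
#  define COLD_NOINLINE __declspec(noinline)
#else
#  define COLD_NOINLINE
#endif

// Rarely executed path, kept out of line so its instructions do not sit
// between the hot instructions of the loop below.
COLD_NOINLINE void report_bad_sample(std::size_t index, double value) {
    std::fprintf(stderr, "bad sample %lu: %f\n",
                 static_cast<unsigned long>(index), value);
}

// Hot path: a small, tight loop.
double sum_valid(const double* samples, std::size_t count) {
    double total = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        if (samples[i] < 0.0) {  // expected to be rare
            report_bad_sample(i, samples[i]);
            continue;
        }
        total += samples[i];
    }
    return total;
}
```

As the answer notes, profile-guided optimization can make this kind of decision automatically where the compiler supports it.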

Unless you're going to hand-tune to the assembly level, to include locking specific lines of code in cache, you're not going to see a significant execution difference between one large function and multiple small functions. In both cases, you still have the same amount of work to perform and that's going to be your bottleneck.
Breaking things up into multiple smaller functions will, however, be easier to maintain and easier to read -- especially 6 months later when you've forgotten what you did in the first place.

Function size is unlikely to be a bottleneck in your application. What you do in the function is much more important than its physical size. There are some things your compiler can do with small functions that it cannot do with large functions (namely inlining), but usually this isn't a huge difference anyway.
You can profile the code to see where the real bottleneck is. I suspect calling a large function is not the problem.
You should, however, break up the function into smaller functions for code readability reasons.

It's not really about function size, but about what you do in it. Depending on what you do, there is possibly some way to optimize it.

Profiling code built from ifort 11.1 yields __powr8i4 routine, what is it?

I built a Fortran code with Intel 11.1. I built it with the -p option in order to produce profiling data. When I check these results, there are some routines present that aren't a part of my code. I assume they were put there by Intel. They include:
__powr8i4
__intel_new_memset
__intel_fast_memset
__intel_fast_memset.J
__intel_fast_memcpy
__intel_new_memcpy
__intel_fast_memcpy.J
There are others, too. When I build the code without optimization, the code doesn't spend much time in them. Except that results show __powr8i4 being used 3.3% of the time. However, when I build the code with optimization, this number goes way up to about 35%. I can't seem to find out what these routines are, but they are confusing my results because I want to know where to look to optimize my code.

Most programs spend a lot of their cycles in the calling of subroutines, often library subroutines, so if you look only at exclusive (self) time, you will see what you are seeing.
So point 1 is look at inclusive (self plus callees) time.
Now, if the profiler is a "CPU profiler", it will probably be blind to I/O time. That means your program might be spending most of its time reading or writing, but the profiler will give you no clue about that.
So point 2 is use a profiler that works on "wall clock" time, not "CPU" time, unless you are sure you are not doing much I/O. (Sometimes you think you're not doing I/O, but in some subroutine several layers deep, guess what - it's doing I/O.)
Many profilers try to produce a call-graph, and if your program does not contain recursion, and if the profiler has access to all the routines in your code, that can be helpful in identifying the subroutine calls in your code that account for a lot of time.
However, if routine A is large and calls B in several places, the profiler won't tell you which lines of code to look at.
Point 3 is use a profiler that gives you line-level inclusive time percentage, if possible.
(Percentage is the most useful number, because that tells you how much overall time you would save if you could somehow remove that line of code. Also, it is not much affected by competing processes in the system.)
One example of such a profiler is Zoom.
It may be that after you do all this, you don't see much you could do to speed up the code.
However, if you could see how certain properties of the data might affect performance, you might find there were further speedups you could get. Profilers are unable to look at data.
What I do is randomly sample the state of the program under the debugger, and see if I can really understand what it is doing at each sample.
You can find things that way that you can't find any other way.
(Some people say this is not accurate, but it is accurate - about what matters. What matters is what the problem is, not precisely how much it costs.)
And that is point 4.
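The difference between "CPU" time and "wall clock" time in point 2 can be seen with a small experiment. This sketch is written in C++ rather than Fortran, uses hypothetical names (`Elapsed`, `measure_blocking_wait`), and uses a sleep as a stand-in for blocking I/O. It also assumes a platform where `std::clock` reports processor time; on some runtimes (MSVC's, notably) it tracks elapsed wall time instead.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Elapsed {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time charged to the process
};

// Stand-in for blocking I/O: the thread sleeps and uses almost no CPU.
Elapsed measure_blocking_wait(int wait_ms) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    std::this_thread::sleep_for(std::chrono::milliseconds(wait_ms));

    const std::clock_t cpu_stop = std::clock();
    const auto wall_stop = std::chrono::steady_clock::now();

    Elapsed e;
    e.wall_ms =
        std::chrono::duration<double, std::milli>(wall_stop - wall_start).count();
    e.cpu_ms = 1000.0 * static_cast<double>(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    return e;
}
```

On such a platform, a wait of a few hundred milliseconds should show up almost entirely in `wall_ms` and barely at all in `cpu_ms`, which is the blind spot of a CPU-only profiler.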
