Profiling a Fortran subroutine line by line

I have written a large Fortran program (using the new standard) and I am currently trying to make it run faster. I have managed to streamline most of the routines using gprof, but I have one very large subroutine that organizes the calculation and now takes almost 50% of the CPU time. I am sure there are several bottlenecks inside this routine, but I have not found any compile or run-time options that let me see where the time is spent inside it. I would like at least a simple count of how many times each line is executed, or how much CPU time is spent executing each line. Maybe valgrind is a better tool? It was very useful for eliminating memory leaks.

A workaround that I have found is the intrinsic subroutine cpu_time. This does not do any profiling automatically, but if you are willing to invest some manual effort you can call cpu_time before and after the statement(s) you want to profile. The difference between the two times gives you the total time needed to execute the statement(s) between the two calls. If the statement(s) are inside a loop, you can accumulate these differences and print the total outside the loop.
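For example, a minimal sketch of this pattern (the loop and the work inside it are placeholders, not code from the original program):

program time_block
  implicit none
  real :: t_start, t_finish, total, x
  integer :: i

  total = 0.0
  x = 0.0
  do i = 1, 1000000
     call cpu_time(t_start)
     x = x + sqrt(real(i))            ! the statement(s) being timed
     call cpu_time(t_finish)
     total = total + (t_finish - t_start)
  end do
  print *, 'accumulated time in timed statements (s):', total
  print *, 'x =', x                   ! use x so the loop is not optimized away
end program time_block

Keep in mind that cpu_time has limited resolution, so timing a single cheap statement per iteration mostly measures the calls themselves; it works best around larger blocks of work.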

This is a little old-school, but I like the OProfile Linux toolset.
If you have a Fortran program prog, then running
operf -gl prog
will run prog and also use kernel profiling to produce a profile and call graph of prog.
These can then be fed to something like KCachegrind to view them as a nice nested rectangle plot. For converting from operf output to KCachegrind input I use a slightly modified version of this python script.

The gcov tool in GCC gives a nice overview of an individual subroutine in my code, showing how many times each line is executed. The file containing the subroutine to be "covered" must be compiled with
gfortran -c -fprofile-arcs -ftest-coverage -g subr.F90
and to link the program I must add -lgcov as the LAST library.
After running the program I can use
gcov subr.F90
to create a file subr.F90.gcov
with information on the number of times each line in the subroutine has been executed. That should make it possible to discover bottlenecks in the subroutine. It is a nice complement to gprof, which gives the time spent in each subroutine; since my program has more than 50,000 lines of code, it is nice to be able to select just a few subroutines for this "line by line" investigation.
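For illustration only, an excerpt of such a subr.F90.gcov file looks roughly like this (the counts and line numbers are made up; '-' marks a non-executable line and '#####' marks a line that was never executed):

        -:    5:subroutine subr(n, a)
   200000:    8:  do i = 1, n
   199800:    9:     a(i) = a(i) + 1.0
    #####:   14:  call rarely_used_branch()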

Related

gprof - Top times spent in `std::bad_variant_access::~bad_variant_access()`?

I'm trying to measure the performance of some code I wrote. I compiled it with g++ with the -pg flag, and when I run gprof after executing it, I reliably get std::bad_variant_access::~bad_variant_access() among the most time-consuming (% time) functions.
No exceptions are actually thrown by the program, and if I set a breakpoint on bad_variant_access::~bad_variant_access() in gdb, it never triggers and the program finishes in one go.
Is there any way to backtrace where all these mysterious calls come from? Could gprof misfire and confuse functions?

Execute fortran with more than one thread

I am working with a FORTRAN 77 program with a lot of iterations in a lot of loops, on Ubuntu/Linux with gfortran. To compile and execute it I simply use
gfortran program.f
./executable.out
Using htop during the execution, I see that only one of the cores is working on that process.
Is there any option or flag one can use at compile time or at run time to force it to use more than one core/thread, so that it runs much faster?
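The usual route with gfortran is OpenMP: mark the parallelizable loops with directives and compile with -fopenmp (for fixed-form .f sources the !$omp sentinel must start in column 1). A minimal sketch with placeholder loop and array names:

program omp_demo
  use omp_lib
  implicit none
  integer, parameter :: n = 10000000
  integer :: i
  double precision, allocatable :: a(:)

  allocate(a(n))
!$omp parallel do
  do i = 1, n
     a(i) = sqrt(dble(i))
  end do
!$omp end parallel do
  print *, 'max threads available:', omp_get_max_threads()
  print *, 'a(n) =', a(n)
end program omp_demo

Compile and run with, for example:
gfortran -O2 -fopenmp omp_demo.f90 -o omp_demo
OMP_NUM_THREADS=4 ./omp_demo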

How can I get the number of instructions executed by a program?

I have written and cross-compiled a small C++ program, and I can run it on an ARM board or on a PC. Since ARM and a PC have different instruction set architectures, I want to compare them. Is it possible to get the number of executed instructions of this C++ program for both ISAs?
What you need is a profiler. perf would be an easy one to use. It will give you the number of instructions executed, which is the best metric if you want to compare ISA efficiency.
Check the tutorial here.
You need to use: perf stat ./your_binary
Look for the instructions metric. This approach uses a register in your CPU's performance monitoring unit (PMU) that counts the number of instructions.
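For example (the binary name is a placeholder; append :u to an event name to count only user-space instructions):

perf stat -e instructions,cycles ./your_binary

The summary is printed to stderr; look for the line labelled instructions.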
Are you trying to get the number of static instructions or dynamic instructions? So, for instance, if you have the following loop (pseudocode):
for (i = 0 to N):
    a[i] = b[i] + c[i]
The static instruction count will be just under 10 instructions, give or take depending on your ISA, but the dynamic count would depend on N, on the branch-prediction implementation, and so on.
So for static count I would recommend using objdump, as per recommendations in the comments. You can find the entry and exit labels of your subroutine and count the number of instructions in between.
For dynamic instruction count, I would recommend one of two things:
You can simulate running that code using an instruction set simulator (there are open-source ISA simulators for both ARM and x86 out there; Gem5, for instance, implements both of them, and there are others that support one or the other).
Your second option is to run the code natively on the target system and set up performance counters in the CPU to report the dynamic instruction count. You would reset the counter before executing your code and read it afterwards (there might be some noise associated with calling your subroutine and returning from it, but you should be able to isolate that).
Hope this helps :)
objdump -dw mybinary | wc -l
On Linux and friends, this gives a good approximation of the number of instructions in an executable, library or object file. This is a static count, which is of course completely different from runtime behavior.
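To restrict the static count to a single routine, as suggested in the answer above, one rough approach with GNU tools is to cut the disassembly at the function's label; my_func below is a placeholder for the (possibly mangled) symbol name:

objdump -d mybinary | sed -n '/<my_func>:/,/^$/p' | grep -cE '^[[:space:]]*[0-9a-f]+:'

This relies on objdump printing a blank line after each function's disassembly and on instruction lines starting with an address, so treat the result as an approximation.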
Linux:
valgrind --tool=callgrind ./program 1 > /dev/null

Does GDB support "run time sampling", or is there a user "extension" that does it?

Motivation: I can't get the Google CPU profiler to work on the machine where the code runs (with my last breath I curse libunwind :)), so I was wondering if gdb supports high-frequency random pausing of the program execution, storing the name of the function where the break occurred, and counting how many times it paused in function x.
That is what I would call "run time sampling"; there is probably a more precise/smarter name for it.
I looked at oprofile, but it is too complicated to a) figure out whether it can do this and b) figure out how to do it.
EDIT: apparently the correct name is "statistical sampling method".
EDIT2: the reason why I'm offering a bounty for this is that I see some people on SO recommending doing a manual break 10-20 times and examining the stack with bt...
That seems very wasteful when it comes to time, so I guesstimate some smart people have automated it. :)
EDIT3: gprof won't cut it... I tried running it recently on an ARM system and the output was trash... :( I guess its trouble with multithreading is the reason for that...
You can manually sample in GDB by pausing it at run time.
What you seem to think you want is gprof, but
if your goal is to make the program as fast as possible, then I would suggest the following:
A high frequency of sampling is not helpful.
Counting the number of samples where the program counter is in function X is not helpful, except in artificially small programs.
If you follow that link, you will see the reasons why, and directions for how to do it successfully.
GDB would not do this well, although you could certainly hack something up that gave wildly inaccurate results.
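That said, if you want to automate the manual-pause approach (the "poor man's profiler" idea), a crude sketch with gdb in batch mode looks like this; <pid>, the sample count and the sleep interval are placeholders, and attaching this way is heavyweight compared to a real sampler:

for i in $(seq 1 20); do
    gdb -batch -p <pid> -ex 'thread apply all bt' 2>/dev/null
    sleep 0.5
done > samples.txt

Afterwards you can grep samples.txt for frame lines and pipe them through sort | uniq -c to see which functions show up in the most stacks.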
I'd suggest Valgrind's "Callgrind" plugin. As a bonus it requires absolutely no recompilation or other special setup. All you need is valgrind installed on your system, and debug information in your program (or full symbol information, at least; I'm not sure).
You then invoke your program like this:
valgrind --tool=callgrind <your program command line>
When it's done there will be a file named callgrind.out.<pid> in the current directory. You can read and visualise this file with a very nice GUI tool called kcachegrind (usually you have to install it separately).
The only problem is that, because callgrind slows down the execution of your program, the time spent in system calls can appear smaller (in percentage terms) than it really would be. By default, callgrind does not include system time in its counters, so the values it gives you are a real comparison of the code in your program, if not the actual time spent 'under' each function. This can be confusing at first, so if that happens, try adding --collect-systime=yes.
I'm not sure what the state of callgrind on ARM might be. ARMv7 is listed as a supported platform, but the documentation only says "fairly complete", whatever that means.
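If installing kcachegrind is not convenient, valgrind also ships a command-line annotator that prints a sorted cost listing; the actual output file name will of course differ:

callgrind_annotate callgrind.out.<pid> | less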

Newbie: Performance Analysis through the command line

I am looking for a performance analysis tool with the following properties:
1. Free.
2. Runs on Windows.
3. Does not require using the GUI (i.e. can be run from the command line or by using some library in any programming language).
4. Runs on some x86-based architecture (preferably Intel).
5. Can measure the running time of my C++, MinGW-compiled, program, except for the time spent in a few certain functions I specify (and all calls emanating from them).
6. Can measure the amount of memory used by my program, except for the memory allocated by those functions I specified in (5) and all calls emanating from them.
A tool that has properties (1) to (5) (without 6) would still be very valuable to me.
My goal is to be able to compare the running time and memory usage of different programs in a consistent way (i.e. the main requirement is that timing the same program twice would return roughly the same results).
MinGW should already ship with a gprof tool. To use it you just need to compile with the correct flags set; I think it was -g -pg.
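A rough sketch of that workflow (file and program names are placeholders; gmon.out is written in the directory the program exits from):

g++ -g -pg -O2 -o myprog.exe myprog.cpp
myprog.exe
gprof myprog.exe gmon.out > profile.txt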
For heap analysis (free), you can use umdh.exe, which is a full heap dumper; you can also compare consecutive memory snapshots to check for leakage over time. You'd have to filter the output yourself to remove functions that are not of interest, however.
I know that's not exactly what you were asking for in (6), but it might be useful. I think filtering like this is not going to be so common in freeware.