Execute fortran with more than one thread - fortran

I am working with a FORTRAN 77 program that runs many iterations inside many loops, on Ubuntu/Linux with gfortran, and to compile and execute it I simply use
gfortran program.f
./executable.out
Watching htop during execution, I see that only one of the cores is working on that process.
Is there any option or flag, either at compile time or at run time, that would force it to use more than one core/thread so it runs much faster?
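There is no single flag that reliably turns an arbitrary serial program into a parallel one, but the usual low-effort route is OpenMP: mark loops whose iterations are independent with directives and compile with -fopenmp. A minimal sketch in fixed-form Fortran (the array and loop body are invented for illustration, not taken from the question):

c     Sketch only: array and loop body are invented for illustration.
c     The c$omp lines are plain comments unless built with -fopenmp.
      program demo
      implicit none
      integer i
      double precision a(100000)
c$omp parallel do
      do i = 1, 100000
         a(i) = dble(i)*dble(i)
      end do
c$omp end parallel do
      print *, a(100000)
      end

Build and run with, for example, gfortran -O2 -fopenmp program.f -o executable.out and OMP_NUM_THREADS=4 ./executable.out. This only helps when the loop iterations really are independent; gfortran's -ftree-parallelize-loops option for automatic loop parallelization is discussed further down this page.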

Related

Profiling a Fortran subroutine line by line

I have written a large Fortran program (using the new standard) and I am currently in the process of trying to make it run faster. I have managed to streamline most of the routines using gprof, but I have one very large subroutine, organizing the calculation, that now takes almost 50% of the CPU time. I am sure there are several bottlenecks inside this routine, but I have not managed to find any compile or run-time parameters that show me where the time is spent inside it. I would like at least a simple count of how many times each line is executed, or how much CPU time is spent executing each line. Maybe valgrind is a better tool? It was very useful for eliminating memory leaks.
A workaround that I have found is to use the cpu_time intrinsic. Although this doesn't do profiling automatically, if you are willing to invest some manual effort, you can call cpu_time before and after the statements you want to profile. The difference between these times gives you the total time needed to execute the statement(s) between the two calls to cpu_time. If the statement(s) are inside a loop, you can add up these differences and print the total time outside the loop.
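As a concrete illustration of that manual approach (work_step below is a made-up placeholder for the statements being profiled, not part of the original program):

! Sketch of the manual-timing idea described above; work_step stands in
! for whatever statements are being timed.
program time_block
  implicit none
  integer :: i
  real :: t0, t1, total
  total = 0.0
  do i = 1, 1000
     call cpu_time(t0)
     call work_step(i)             ! the statements you want to time
     call cpu_time(t1)
     total = total + (t1 - t0)     ! accumulate the per-iteration cost
  end do
  print *, 'CPU time spent in work_step:', total, 'seconds'
contains
  subroutine work_step(n)
    integer, intent(in) :: n
    ! stand-in for the real work
  end subroutine work_step
end program time_block

Note that cpu_time reports processor time; if you want wall-clock time instead (for example when threads are involved), the system_clock intrinsic is an alternative.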
This is a little old-school, but I like the OProfile Linux toolset.
If you have a fortran program prog, then running
operf -gl prog
will run prog and also use kernel profiling to produce a profile and call graph of prog.
These can then be fed to something like KCachegrind to view them as a nice nested rectangle plot. For converting from operf output to KCachegrind input I use a slightly modified version of this python script.
The gcov tool in GCC gives a nice overview of an individual subroutine in my code, showing how many times each line is executed. The file containing the subroutine to be "covered" must be compiled with
gfortran -c -fprofile-arcs -ftest-coverage -g subr.F90
and to link the program I must add -lgcov as the LAST library.
After running the program I can use
gcov subr.F90
to create a file subr.F90.gcov
with information on the number of times each line in the subroutine has been executed. That should make it possible to discover bottlenecks in the subroutine. This is a nice complement to gprof, which gives the time spent in each subroutine; as my program has more than 50,000 lines of code, it is nice to be able to select just a few subroutines for this line-by-line investigation.

Detecting that MPI is not being used when running with mpirun/mpiexec

I am writing a program (in C++11) that can optionally be run in parallel using MPI. The project uses CMake for its configuration, and CMake automatically disables MPI if it cannot be found and displays a warning message about it.
However, I am worried about a perfectly plausible use case whereby a user configures and compiles the program on an HPC cluster, forgets to load the MPI module, and does not notice the warning. That same user might then try to run the program, notice that mpirun is not found, load the MPI module, but forget to recompile. If the user then runs the program with mpirun, this will work, but the program will just run a number of times without any parallelization, as MPI was disabled at compile time. To prevent the user from thinking the program is running in parallel, I would like to make the program display an error message in this case.
My question is: how can I detect that my program is being run in parallel without using MPI library functions (as MPI was disabled at compile time)? mpirun just launches the program a number of times, but does not tell the processes it launches about being run in parallel, as far as I know.
I thought about letting the program write some test file, and then check if that file already exists, but apart from the fact that this might be tricky to do due to concurrency problems, there is no guarantee that mpirun will even launch the various processes on nodes that share a file system.
I also considered using a system variable to communicate between the two processes, but as far as I know, there is no system independent way of doing this (and again, this might cause concurrency issues, as there is no way to coordinate system calls between the various processes).
So at the moment I have run out of ideas, and I would very much appreciate any suggestions that might help me achieve this. Preferred solutions should be operating-system independent, although a UNIX-only solution would already be of great help.
Basically, you want to detect, in your non-MPI code path, whether you are being run by mpirun or a similar launcher. There is a very similar question, "How can my program detect whether it was launched via mpirun", that already presents one non-portable solution.
Check for environment variables that are set by mpirun. See e.g.:
http://www.open-mpi.org/faq/?category=running#mpi-environmental-variables
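Although the project in the question is C++, the check itself is just an environment-variable lookup. Here is a minimal sketch of the idea in Fortran, to match the rest of this page; OMPI_COMM_WORLD_SIZE and PMI_SIZE are merely example names set by Open MPI and PMI-based launchers, not an exhaustive or guaranteed list:

! Sketch only: detect an mpirun-style launcher without calling MPI.
program detect_mpirun
  implicit none
  if (launched_under_mpirun()) then
     print *, 'It looks like this copy was started by an MPI launcher.'
  end if
contains
  logical function launched_under_mpirun()
    character(len=32) :: val
    integer :: length, status
    launched_under_mpirun = .false.
    ! A robust check would go through a longer list of candidate variables.
    call get_environment_variable('OMPI_COMM_WORLD_SIZE', val, length, status)
    if (status == 0) launched_under_mpirun = .true.
    call get_environment_variable('PMI_SIZE', val, length, status)
    if (status == 0) launched_under_mpirun = .true.
  end function launched_under_mpirun
end program detect_mpirun

An equivalent check in C++ can be done with std::getenv.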
As another option, you could get the process ID of the parent process and its process name, and compare that name with a list of known MPI launcher binaries such as orted, slurmstepd, or hydra [1]. Everything about that is unfortunately again non-portable.
Since launching itself is not clearly defined by the MPI standard, there cannot be a standard way to detect it.
[1] This list is only from memory; please don't take it literally.
From a user-experience point of view, I would argue that always showing a clear message about how the program is being run, such as:
Running FancySimulator serially. If you see this as part of mpirun, rebuild FancySimulator with FANCYSIM_MPI=True.
or
Running FancySimulator in parallel with 120 MPI processes.
would "solve" the problem. A user getting 120 garbled messages will hopefully notice.

Software parallelization with OpenMP versus Affinity scheduling?

Scenario: I have a program that could easily be parallelized using OpenMP; say the main loop of the program is a for loop over independent data, so parallelizing it would be trivial. Currently, however, I don't parallelize it and instead use affinity scheduling.
This program performs work on some input files specified by a folder in the command line arguments. To run this program in parallel, someone can create a bat file like so:
start /affinity 1 "1" bat1
start /affinity 2 "2" bat2
start /affinity 3 "3" bat3
start /affinity 4 "4" bat4
where bat1 through bat4 are batch files that each call main.exe with a different input folder. So in this case there would be 4 instances of main.exe running on input_folder1, input_folder2, input_folder3, and input_folder4 respectively.
What would the benefits of using a library like OpenMP be instead of affinity scheduling? I figure:
Less memory usage, single stack and heap for a single program instance as opposed to n instances of a program for n cores
Better scaling
But should I actually expect to see a performance boost? If so, why?
If your problem is simply parallel, with no interaction among the data in the separate input files, then you would probably not see a speedup with OpenMP, and you might even see a slow-down, since memory allocation and various other things then have to be thread-safe. Single-threaded processes can gain a lot of efficiency, and in fact do with GNU libc, where linking in POSIX threads support means you also get a slower implementation of malloc.

Why does mpirun duplicate the program by default?

I am new to Open MPI and have trouble understanding the concepts. (I found this pretty helpful.)
1 - Could anyone briefly explain why we use Open MPI? To my understanding, Open MPI is used to parallelize those sections of the code which can run in parallel.
2 - Why does mpirun duplicate a single program? Simply because my laptop is dual core?
3 - What changes do I need to make to the code so it runs correctly? I mean ONE program parallelized across the two available cores, not two similar copies of the same program.
MPI is primarily of benefit in a multiple-machine environment, where you must run multiple processes.
It requires heavy modification of the program.
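To make that concrete: mpirun just starts N identical copies of the executable, and it is the program's own MPI calls that tell each copy which share of the work is its own. A minimal sketch (the loop and its bounds are made up for illustration; mpif90 is one common compiler wrapper name):

! Sketch of the SPMD model that mpirun assumes: every launched copy runs the
! same program, and MPI calls split the work between the copies.
program mpi_demo
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, rank, ierr)
  call mpi_comm_size(mpi_comm_world, nprocs, ierr)

  ! Each copy processes only its own slice of the iterations.  Without logic
  ! like this, every copy would simply repeat the whole job, which is exactly
  ! the "duplication" the question describes.
  do i = rank + 1, 1000, nprocs
     ! ... work on iteration i in this process only ...
  end do

  print *, 'rank', rank, 'of', nprocs, 'done'
  call mpi_finalize(ierr)
end program mpi_demo

Build with something like mpif90 demo.f90 and run with mpirun -np 2 ./a.out.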

C++/g++: Concurrent program

I was given a C++ program (source) that is said to run in parallel. However, if I compile it with g++ (I am using Ubuntu 10.04 and g++ 4.4.3) and run it, one of my two CPU cores is under full load while the other is doing "nothing".
So I spoke to the one who gave me the program. I was told that I had to set specific flags for g++ in order to get the program compiled for 2 CPU cores. However, if I look at the code I'm not able to find any lines that point to parallelism.
So I have two questions:
Are there any C++ intrinsics for multithreaded applications, i.e. is it possible to write parallel code without any extra libraries (because I did not find any non-standard libraries included)?
Is it true that there are indeed flags for g++ that tell the compiler to compile the program for 2 CPU cores and to compile it so it runs in parallel (and if: what are they)?
AFAIK there are no compiler flags designed to make a single-threaded application exploit parallelism (it is definitely a nontrivial transformation), with the exception of automatic parallelization of loop iterations (-ftree-parallelize-loops), which still must be enabled carefully. Even if there is no explicit thread creation, there may be OpenMP directives in the code that parallelize particular instruction sequences.
Look for the occurrence of "thread" and/or "std::thread" in the source code.
The current C++ language standard has no support for multi-processing in the language or the standard library. The proposed C++0x standard does have some support for threads, locks etc. I am not aware of any flags for g++ that would magically make your program do multi-processing, and it's hard to see what such flags could do.
The only things I can think of are openMosix and LinuxPMI (the successor of openMosix). If the code uses processes, then the process "migration" technique makes it possible to run the processes on different machines (which must have the specified Linux distribution installed).
Check for threads (grep -i thread) and processes (grep fork) in your code. If neither exists, then check for MPI. MPI requires some extra configuration, as I recall (I only used it for some homework assignments at university).
As mentioned, gcc (and other compilers) implement some forms of parallelism with OpenMP via pragmas.