This question is about compiling OpenMP-capable, fixed-form Fortran 77 code (combined with some C libraries) with gfortran -fopenmp.
This answer explains that when an OpenMP directive runs past column 72 and has to be continued on the next line, the continuation line must start with the c$omp& sentinel. For example,
code A
C$OMP PARALLEL SHARED(Lm,Mm, pm,pn, f,f_q, fnd_rmask,rmask, dm_u,dn_v,
& iA_q)
is incorrect fixed-form Fortran 77,
whereas this webpage and this answer say that the correct form is
code B
C$OMP PARALLEL SHARED(Lm,Mm, pm,pn, f,f_q, fnd_rmask,rmask, dm_u,dn_v,
C$OMP& iA_q)
However, I have to live with code A (don't ask me why now; I can explain if someone is interested), which gives me an error with the gfortran compiler (screenshot attached). This answer also says that ifort does not give any error even if the continuation line does not start with the c$omp& sentinel, as in code A. (I do not have ifort and have not tried it myself.)
My question: is there a way (or any compiler flag) by which I can make gfortran compile happily with code A? If ifort can live with it, can't gfortran too? I can't believe that there is no compiler directive to override all of this. (This does not mean I am questioning the abilities and principles of gfortran developers)
Without changes to your source code, the answer to your first question is NO.
The answer to your second question is maybe. At the moment, gfortran does not support this Intel extension. gfortran is part of GCC, which is open-source software, so you can download the source and add a new option, say, -fIntel-openmp-syntax. Once you have this working, your submitted patch may be committed to the source code repository.
I am trying to profile a function of OpenMx, an R package containing C++ and Fortran code, for CPU time. My operating system is OS X 10.10. I have read the section regarding this topic in the R manual. This section and this post lead me to try Instruments. Here is what I did
Opened Instruments
Chose the Time Profiler Template
Pressed Record
Started my R script using RStudio
I get the following output: [output image missing]. The command-line tool sample returns the same output.
The problem is that it looks like omxunsafedgemm_ would be called directly from the Main Thread. However, this is a low level Fortran function. It is always called by a C++ function called omxDGEMM. In this example omxDGEMM is first called by omxCallRamExpection (so almost at the bottom of the call tree). The total time of omxDGEMM is 0. Thus, the profiling information is currently useless.
In the original version of the package, omxDGEMM is defined as inline. I changed this in the hope that it would resolve the issue, but it did not. omxunsafedgemm is called by omxDGEMM like this:
F77_CALL(omxunsafedgemm)(&transa, &transb,
&(nrow), &(ncol), &(nmid),
&alpha, a->data, &(a->leading),
b->data, &(b->leading),&beta, result->data, &(result->leading));
Any ideas how to obtain a sensible profiler output?
This problem is caused by the -O2 flag of the gfortran compiler, which R uses by default. The -O2 flag turns on all the optimizations that -O1 enables, and more (see gcc manual page 98). One of the optimization flags that -O1 enables is -fomit-frame-pointer. Instruments needs the frame pointers to find the parent of a call frame (see this talk).
Thus, changing
FFLAGS = -g -O2 $(LTO)
to
FFLAGS = -g -O2 -fno-omit-frame-pointer $(LTO)
in ${R_HOME}/etc/Makeconf resolves the issue. For me R_HOME=/Library//Frameworks/R.framework/Versions/3.2/Resources
Simply omitting the -O2 also solves the issue but makes OpenMx considerably slower (200 vs 30 seconds in my case).
If the OpenMx binary came from the OpenMx website via getOpenMx.R then it would have been compiled with gcc/gfortran. If it came from CRAN it would have been compiled with the OS X compilers LLVM etc (but it would lack parallel computation because OpenMP is not compatible with LLVM). So you could try the other binary to see if the tags for profiling are better. Please let us know which version you were using and whether changing version helped.
I am trying to write a massively parallel Monte Carlo code, part of which will be exported to a Xeon Phi coprocessor. To ensure that I am using the coprocessor efficiently, I would like to see which parts of my code the compiler, currently gfortran, is able to vectorize. I understand I can do this with the ifort option -vec-report. However, I won't have access to the coprocessor for about a month, so I am stuck with gfortran for the time being, and I would like to start optimizing now if possible. Unfortunately, I cannot seem to find the command-line flag for gfortran that tells me which parts of the code are being vectorized. Is there one? If so, what is it?
thanks
You can check whether -fopt-info suits your needs.
You can get more output by using -fopt-info-all, which includes information on both successful and missed optimizations.
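To make the reports from the -fopt-info family concrete, here is a minimal sketch in C rather than Fortran (the option is a general GCC option, so it is not specific to one front end). The function name saxpy_like and the file name vec_demo.c are made up for this illustration. Compiling it with something like gcc -O3 -fopt-info-vec -c vec_demo.c should, on a reasonably recent GCC, print a note about the loop being vectorized; the exact wording, and whether vectorization happens at all, depends on the GCC version and the target.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example loop: y[i] = a * x[i] + y[i].
   'restrict' promises that x and y do not overlap, which removes one
   common obstacle to vectorization. */
void saxpy_like(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Guard against accidental NULL arguments in debug builds. */
    assert(x != NULL && y != NULL);

    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Swapping -fopt-info-vec for -fopt-info-vec-missed should instead report the loops that the vectorizer gave up on.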
The vectorizer can be instructed to be verbose and report what it does:
-ftree-vectorizer-verbose=n
where a larger integer n means a more verbose report. Note that newer GCC releases deprecate this option in favor of the -fopt-info family.
For more see http://gcc.gnu.org/projects/tree-ssa/vectorization.html
(It took me 1 minute to google it).
I need some pointers to solve a problem that I can describe only in a limited way. I was given a code written in F77 by a senior scientist. I can't post the code on a public forum because of ownership issues. It's not big (750 lines), but given the implicit declarations and goto statements, it is very unreadable. Hence I am having trouble finding the source of the error. Here is the problem:
When I compile the code with ifort, it runs fine and gives me sensible numbers but when I compile it with gfortran, it compiles fine but does not give me the right answer. The code is a numerical root finder for a complex plasma physics problem. The ifort compiled version finds the root but the gfortran compiled version fails to find the root.
Any ideas on how to proceed looking for a solution? I will update the question to reflect the actual problem when I find one.
Some things to investigate, not necessarily in the order I would try them:
Use your compiler(s) to check everything that your compiler(s) are capable of checking including and especially array-bounds (for run-time confidence) and subroutine argument matching.
Use of uninitialised variables.
The kinds of real, complex and integer variables; the compilers (or your compilation options) may default to different kinds.
Common blocks, equivalences, entry, ... other now deprecated or obsolete features.
Finally, perhaps not a matter for immediate investigation but something you ought to do sooner (right choice) or later (wrong choice), make the effort to declare IMPLICIT NONE in all scopes and to write explicit declarations for all entities.
I have a serial Fortran code that works fine. Once I compile the same code with ifort -parallel and run it, it gives wrong results and overflows. I would expect that with the -parallel flag the Intel compiler selects only the loops that are safe to parallelize, so I should get exactly the same results as for the serial code, but that did not happen. Even stranger, I then disabled parallelization of every do loop in my code using !DEC$ NOPARALLEL, compiled the code with ifort -parallel to make sure that none of the loops was parallelized, and ran it. Surprisingly, I got the same wrong results and overflow, although this should be exactly equivalent to the serial code.
Can anyone explain this behaviour, or is it just an Intel compiler deficiency?
Greetings.
Sorry to say this, but it's unlikely to be an Intel compiler problem; it's a pretty good compiler (no, I don't work for Intel, but I do use their compilers).
Yes, I am capable of explaining this sort of behaviour, but without sight of your program anything I suggest will be wrong.
Answers were given to this identical question on the Intel Fortran Forum: http://software.intel.com/en-us/forums/topic/269743
EDIT: I revised the link, since as stated in the comment, the original link is now dead.
I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic to me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code or articles specific to the GCC toolchain that talk about this problem?
GCC's optimization passes work on an intermediate representation of your code in a format called GIMPLE.
Using the -fdump-* family of options, you can ask GCC to output intermediate states of the tree.
For example, feed this to gcc -c -fdump-tree-all -O3:
unsigned fib(unsigned n) {
if (n < 2) return n;
return fib(n - 2) + fib(n - 1);
}
and watch as it gradually transforms from a simple exponential algorithm into a complex polynomial algorithm. (Really!)
A useful technique is to run the code under a good sampling profiler, e.g. Zoom under Linux or Instruments (with Time Profiler instrument) under Mac OS X. These profilers not only show you the hotspots in your code but also map source code to disassembled object code. Highlighting a source line shows the (not necessarily contiguous) lines of generated code that map to the source line (and vice versa). Online opcode references and optimization tips are a nice bonus.
Instruments: developer.apple.com
Zoom: www.rotateright.com
Not gcc, but when debugging in Visual Studio you have the option to intersperse assembly and source, which gives a good idea of what has been generated for what statement. But sometimes it's not quite aligned correctly.
The output of the gcc tool chain and objdump -dS isn't at the same granularity. This article on getting gcc to output source and assembly has the same options as you are using.
Adding the assembler's -L option (e.g., gcc -Wa,-L,-ahl) may provide slightly more intelligible listings.
The equivalent MSVC option is /FAcs (and it's a little better because it intersperses the source, machine language, and binary, and includes some helpful comments).
About one third of my job consists of doing just what you're doing: juggling C code around and then looking at the assembly output to make sure it's been optimized correctly (which is preferred to just writing inline assembly all over the place).
Game-development blogs and articles can be a good resource for the topic since games are effectively real-time applications in constant memory -- I have some notes on it, so does Mike Acton, and others. I usually like to keep Intel's instruction set reference up in a window while going through listings.
The most helpful thing is to get a good ground-level understanding of assembly programming generally first -- not because you want to write assembly code, but because having done so makes reading disassembly much easier. I've had a hard time finding a good modern textbook though.
In order to output the optimizations applied you can use:
-fopt-info-optimized
To see those that have not been applied
-fopt-info-missed
Beware that the output is sent to the standard error stream, so to capture it you have to redirect it (hint: 2>&1).
Here is a nice example:
g++ -O3 -std=c++11 -march=native -mtune=native \
-fopt-info-optimized h2d.cpp -o h2d 2>&1
h2d.cpp:225:3: note: loop vectorized
h2d.cpp:213:3: note: loop vectorized
h2d.cpp:198:3: note: loop vectorized
h2d.cpp:186:3: note: loop vectorized
If you compiled with -g, you can also inspect the interleaved source and assembly with objdump -dS | c++filt, but that will not get you very far. Enjoy!
Zoom from RotateRight ( http://rotateright.com ) is mentioned in another answer, but to expand on that: it shows you the mapping of source to assembly in what they call the "code browser". It's incredibly handy even if you're not an asm expert because they have also integrated assembly documentation into the app. And the assembly listing is annotated with comments and timing for several CPU types.
You can just open your object or executable file with Zoom and take a look at what the compiler has done with your code.
Victor, in your case the optimization you are looking for is just a smaller allocation of local memory on the stack. You should see a smaller allocation at function entry and a smaller deallocation at function exit if the space used by the empty class is optimized away.
As for the general question, I've been reading (and writing) assembly language for more than (gulp!) 30 years and all I can say is that it takes practice, especially to read the output of a compiler.
Instead of trying to read through an assembler dump, run your program inside a debugger. You can pause execution, single-step through instructions, set breakpoints on the code you want to check, etc. Many debuggers can display your original C code alongside the generated assembly so you can more easily see what the compiler did to optimize your code.
Also, if you are trying to test a specific compiler optimization you can create a short dummy function that contains the type of code that fits the optimization you are interested in (and not much else, the simpler it is the easier the assembly is to read). Compile the program once with optimizations on and once with them off; comparing the generated assembly code for the dummy function between builds should show you what the compiler's optimizers did.
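As a minimal sketch of the dummy-function approach described above (the function names and the file name dummy.c are hypothetical), the following keeps the code under test as small as possible. One way to compare builds, assuming GCC, is gcc -O0 -S dummy.c -o dummy_O0.s followed by gcc -O2 -S dummy.c -o dummy_O2.s, and then diffing the two .s files.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dummy function, kept deliberately small so that its
   assembly is easy to read: sums the first n elements of v. */
int sum_array(const int *v, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

/* Tiny self-check, so that both builds can also confirm that the
   optimized and unoptimized versions agree on the result. */
void sum_array_selfcheck(void)
{
    static const int v[] = {1, 2, 3, 4, 5};
    assert(sum_array(v, 5) == 15);
}
```

Keeping the self-check in a separate function leaves the assembly of sum_array itself free of assertion code, so the two listings differ only in what the optimizer did to the loop.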