I am trying to profile a function of OpenMx, an R package containing C++ and Fortran code, for CPU time. My operating system is OS X 10.10. I have read the section regarding this topic in the R manual. This section and this post lead me to try Instruments. Here is what I did
Opened Instruments
Chose the Time Profiler Template
Pressed Record
Started my R script using RStudio
I get the following output: . The command line tool sample returns the same output.
The problem is that it looks like omxunsafedgemm_ would be called directly from the Main Thread. However, this is a low level Fortran function. It is always called by a C++ function called omxDGEMM. In this example omxDGEMM is first called by omxCallRamExpection (so almost at the bottom of the call tree). The total time of omxDGEMM is 0. Thus, the profiling information is currently useless.
In the original version of the package omxDGEMM is defined as inline. I changed this in the hope that it would resolve the issue. This was not the case. omxunsafedgemm is called by omxDGEMM like that
F77_CALL(omxunsafedgemm)(&transa, &transb,
&(nrow), &(ncol), &(nmid),
&alpha, a->data, &(a->leading),
b->data, &(b->leading),&beta, result->data, &(result->leading));
Any ideas how to obtain a sensible profiler output?
This problem is caused by the -O2 flag of the gfortran compiler, which R uses per default. The -O2 flag turns on all optimization steps that the -O1 flag enables and more (see gcc manual page 98). One of the optimization flags that the -O1 flags enables is -fomit-frame-pointer. Instruments needs the frame pointers to know the parent of a call frame (see this talk).
Thus, changing
FFLAGS = -g -O2 $(LTO) to
FFLAGS = -g -O2 -fno-omit-frame-pointer $(LTO)
in ${R_HOME}/etc/Makeconf resolves the issue. For me R_HOME=/Library//Frameworks/R.framework/Versions/3.2/Resources
Simply omitting the -O2 also solves the issue but makes OpenMx considerably slower (200 vs 30 seconds in my case).
If the OpenMx binary came from the OpenMx website via getOpenMx.R then it would have been compiled with gcc/gfortran. If it came from CRAN it would have been compiled with the OS X compilers LLVM etc (but it would lack parallel computation because OpenMP is not compatible with LLVM). So you could try the other binary to see if the tags for profiling are better. Please let us know which version you were using and whether changing version helped.
Related
This question is related to compiling OpenMP capable Fotran77 (combined with some C libraries) fixed form code with gfortran -fopenmp.
This answer discusses that while continuing to the next line in case the required column exceeds 72, the correct directive to use in the next line for an OpenMP capable code is the c$omp& sentinel. For example,
code A
C$OMP PARALLEL SHARED(Lm,Mm, pm,pn, f,f_q, fnd_rmask,rmask, dm_u,dn_v,
& iA_q)
is an incorrect fixed form Fotran77 code portion.
Whereas, this webpage and this answer says that the correct form is
code B
C$OMP PARALLEL SHARED(Lm,Mm, pm,pn, f,f_q, fnd_rmask,rmask, dm_u,dn_v,
C$OMP& iA_q)
However, there is a need where I will have to live with code A (don't ask me now, I can explain if someone is interested) which gives me an error with the gfortran compiler (screenshot attached). This answer also says that ifort does not give any error even if we do not start the next line with the c$omp& sentinel similar to code A. (I do not have ifort and have not tried it myself.)
My question: is there a way (or any compiler flag) by which I can make gfortran compile happily with code A? If ifort can live with it, can't gfortran too? I can't believe that there is no compiler directive to override all of this. (This does not mean I am questioning the abilities and principles of gfortran developers)
Without changes to your source code, the answer to your first question is NO.
The answer to your second question is maybe. At the moment, gfortran does not support an Intel extension. gfortran is part of GCC, which is open-source software. You can download the software. Add an new option, say, -fIntel-openmp-syntax. Once you have this working, your submitted patch may be committed to the source code repository.
I'm working on big project most files are longer than 7000 lines. If I'm using -fno-inline option compilation time going down 3 times.
Actual numbers:
w/o -fno-inline - 340 sec
w/ -fno-inline ~ 115 sec
I didn't find anything about -fno-inline impact on compilation performance. Is there any explanation to this ?
Some background:
I use MACROSes pretty extensively (for logging purposes)
There is one global Exception try / catch block inherited from old code (need to rework this piece)
There are few try/catch blocks inside, mostly to catch exceptions from stof/stoi
I tested compilation time with and w/o (-pipe, -O0 to -O3, -g / no -g, -ggdb / no ggdb). Nothing brings compilation time so down as -fno-inline.
I'm working on big project most files are longer than 7000 lines.
That is a bit quite big. You might (I am not sure) win a bit of compilation time by avoiding files bigger than 5KLOC (by splitting large C++ files, bigger than 8KLOC, in several ones), and by compiling in parallel several translation units at the same time (using make -j or ninja). This requires some refactoring work. On the other hand, with genuine C++, don't have too small files (because standard container headers like <vector> could include many thousands lines; you might also consider having a precompiled header). Pragmatically 3KLOC to 7KLOC per C++ source file is a nice trade-off in practice.
Use the -ftime-report option to g++ to get a detailed timing of each compilation phase (or passes). You may need to understand the internals of GCC to decipher the obtained table.
I didn't find anything about -fno-inline impact on compilation performance. Is there any explanation to this ?
Inline expansion happens several times inside GCC. It generally works on some GIMPLE or SSA internal representation. Of course, inlining is improving the runtime performance of your program. By disabling it, you could lose 50% of speed in your executable (and perhaps even more, since inline member functions such as getters and setters are extensively used in C++, notably in standard container templates).
FWIW, my old GCC MELT web pages (GCC MELT is now a dead project) have several slides and reference explaining GCC internals, and I am just writing right now (october 2018) the draft of a technical report on bismon (funded now by the CHARIOT H2020 project); that draft happens to have a section §1.3.2 explaining some interesting GCC optimizations.
See also CppCon 2017 talk: Matt Godbolt “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
Because of a school assignment I have to convert a C++ code to assembly(ARMv8). Then I have to compile the C++ code using GCC's -O0,-O1,-O2,-O3 and -Os optimizations, write down the time and compare with the execute time of my assembly code. As, I think I know -O3 have to be faster than -O1 and -O2. However, I get that -O2 is the fastest, then are -O1,-O3,-Os,-O0. Is that usual? (Calculated times are about 30 seconds).
Notice that GCC has many other optimization flags.
There is no guarantee that -O3 gives faster code than -O2; a compiler can apply more optimization passes, but they are all heuristics and might be unsuccessful (or even slow down slightly your particular code). Hence it does happen that -O3 gives some slightly slower code than -O2 (on some particular input source code).
You could try a more recent version of GCC (the latest -in November 2017- is GCC 7, GCC 8 will go out in few months). You could also try some better -march= or -mtune= option.
At last, with your GCC plugin, you might add your own optimization pass, or change the order (and the set) of applied optimization passes (there are several hundreds different optimization passes in GCC). But you'll need a lot of work (perhaps a year or two) to be able to extend GCC.
You could tune optimization parameters, and some project (MILEPOST) has even used machine learning techniques to improve them.
See also slides and references on my (old) GCC MELT documentation.
Yes, it is usual. Take the -Ox optimization as guide-lines. In average, they produce optimization that is advertise, but a lot depends on the style in which the code is written, memory layout, as well as the compiler itself.
Sometimes, you need to try and fail many times before getting the optimal code.
-O2 indeed gives the best optimization in most of the cases.
Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
Naturally, I use g++ -O3, but there are additional flags that may make the code run faster, e.g. -ffast-math and others, some of which are hardware-dependent.
Does anyone know some code I can put in my configure.ac file (GNU autotools), so that the flags will be added to the Makefile automatically by the ./configure command?
In addition to automatically determining the best flags, I would be interested in some useful compiler flags that are good to use as a default for most optimized executables.
Update: Most people suggest to just try different flags and select the best ones empirically. For that method, I'd have a follow-up question: Is there a utility that lists all compiler flags that are possible for the machine I'm running on (e.g. tests if SSE instructions are available etc.)?
I don't think you can do this at configure-time, but there is at least one program which attempts to optimize gcc option flags given a particular executable and machine. See http://www.coyotegulch.com/products/acovea/ for example.
You might be able to use this with some knowledge of your target machine(s) to find a good set of options for your code.
Um - yes. This is possible. Look into profile-guided optimization.
some compilers provide "-fast" option to automatically select most aggressive optimization for given compilation host. http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler
Unfortunately, g++ does not provide similar flags.
as a follow-up to your next question, for g++ you can use -mtune option together with -O3 which will give you reasonably fast defaults. Challenge then is to find processor type of your compilation host. you may want to look on autoconf macro archive, to see somebody wrote necessary tests. otherwise, assuming linux, you have to parse /proc/cpuinfo to get processor type
After some googling, I found this script: gcccpuopt.
On one of my machines (32bit), it outputs:
-march=pentium4 -mfpmath=sse
On another machine (64bit) it outputs:
$ ./gcccpuopt
Warning: The optimum *32 bit* architecture is reported
-m32 -march=core2 -mfpmath=sse
So, it's not perfect, but might be helpful.
See also -mcpu=native/-mtune=native gcc options.
Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
No.
You could compile your program with a large assortment of compiler options, then benchmark each and every version, then select the one that is "fastest," but that's hardly reliable and probably not useful for your program.
This is a solution that works for me, but it does take a little while to set up. In "Python Scripting for Computational Science" by Hans Petter Langtangen (an excellent book in my opinion), an example is given of using a short python script to do numerical experiments to determine the best compiler options for your C/Fortran/... program. This is described in Chapter 1.1.11 on "Nested Heterogeneous Data Structures".
Source code for examples from the book are freely available at http://folk.uio.no/hpl/scripting/index.html (I'm not sure of the license, so will not reproduce any code here), and in particular you can find code for a similar numerical test in the code in TCSE3-3rd-examples.tar.gz in the file src/app/wavesim2D/F77/compile.py , which you could use as a base for writing a script which is appropriate for a particular system/language (C++ in your case).
Optimizing your app is mainly your job, not the compiler's.
Here's an example of what I'm talking about.
Once you've done that, IF your app is compute-bound, with hotspots in your code (not in library code) THEN the compiler optimizations for speed will make some difference, so you can try different flag combinations.
I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic to me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code or articles specific to the GCC toolchain that talk about this problem?
GCC's optimization passes work on an intermediary representation of your code in a format called GIMPLE.
Using the -fdump-* family of options, you can ask GCC to output intermediary states of the tree.
For example, feed this to gcc -c -fdump-tree-all -O3
unsigned fib(unsigned n) {
if (n < 2) return n;
return fib(n - 2) + fib(n - 1);
}
and watch as it gradually transforms from simple exponential algorithm into a complex polynomial algorithm. (Really!)
A useful technique is to run the code under a good sampling profiler, e.g. Zoom under Linux or Instruments (with Time Profiler instrument) under Mac OS X. These profilers not only show you the hotspots in your code but also map source code to disassembled object code. Highlighting a source line shows the (not necessarily contiguous) lines of generated code that map to the source line (and vice versa). Online opcode references and optimization tips are a nice bonus.
Instruments: developer.apple.com
Zoom: www.rotateright.com
Not gcc, but when debugging in Visual Studio you have the option to intersperse assembly and source, which gives a good idea of what has been generated for what statement. But sometimes it's not quite aligned correctly.
The output of the gcc tool chain and objdump -dS isn't at the same granularity. This article on getting gcc to output source and assembly has the same options as you are using.
Adding the -L option (eg, gcc -L -ahl) may provide slightly more intelligible listings.
The equivalent MSVC option is /FAcs (and it's a little better because it intersperses the source, machine language, and binary, and includes some helpful comments).
About one third of my job consists of doing just what you're doing: juggling C code around and then looking at the assembly output to make sure it's been optimized correctly (which is preferred to just writing inline assembly all over the place).
Game-development blogs and articles can be a good resource for the topic since games are effectively real-time applications in constant memory -- I have some notes on it, so does Mike Acton, and others. I usually like to keep Intel's instruction set reference up in a window while going through listings.
The most helpful thing is to get a good ground-level understanding of assembly programming generally first -- not because you want to write assembly code, but because having done so makes reading disassembly much easier. I've had a hard time finding a good modern textbook though.
In order to output the optimizations applied you can use:
-fopt-info-optimized
To see those that have not been applied
-fopt-info-missed
Beware that the output is sent to standard error stream so to see it you actually have to redirect that : ( hint 2>&1 )
Here is nice example of :
g++ -O3 -std=c++11 -march=native -mtune=native
-fopt-info-optimized h2d.cpp -o h2d 2>&1
h2d.cpp:225:3: note: loop vectorized
h2d.cpp:213:3: note: loop vectorized
h2d.cpp:198:3: note: loop vectorized
h2d.cpp:186:3: note: loop vectorized
You can check the interleaved output, when having applied -g with objdump -dS|c++filt , but that will not get you that far.Enjoy!
Zoom from RotateRight ( http://rotateright.com ) is mentioned in another answer, but to expand on that: it shows you the mapping of source to assembly in what they call the "code browser". It's incredibly handy even if you're not an asm expert because they have also integrated assembly documentation into the app. And the assembly listing is annotated with comments and timing for several CPU types.
You can just open your object or executable file with Zoom and take a look at what the compiler has done with your code.
Victor, in your case the optimization you are looking for is just a smaller allocation of local memory on the stack. You should see a smaller allocation at function entry and a smaller deallocation at function exit if the space used by the empty class is optimized away.
As for the general question, I've been reading (and writing) assembly language for more than (gulp!) 30 years and all I can say is that it takes practice, especially to read the output of a compiler.
Instead of trying to read through an assembler dump, run your program inside a debugger. You can pause execution, single-step through instructions, set breakpoints on the code you want to check, etc. Many debuggers can display your original C code alongside the generated assembly so you can more easily see what the compiler did to optimize your code.
Also, if you are trying to test a specific compiler optimization you can create a short dummy function that contains the type of code that fits the optimization you are interested in (and not much else, the simpler it is the easier the assembly is to read). Compile the program once with optimizations on and once with them off; comparing the generated assembly code for the dummy function between builds should show you what the compiler's optimizers did.