How to determine what gfortran is vectorizing - fortran

I am trying to write a massively parallel Monte Carlo code, part of which will be exported to a Xeon Phi coprocessor. To ensure that I am using the coprocessor efficiently, I would like to see which parts of my code the compiler, currently gfortran, is able to vectorize. I understand I can do this with the ifort option -vec-report. However, I won't have access to the coprocessor for about a month and am therefore stuck with gfortran for the time being, and I would like to start optimizing now if possible. Unfortunately, I cannot seem to find the command-line flag for gfortran that tells me which parts of the code are being vectorized. Is there one? If so, what is it?
thanks

You can try -fopt-info, if it suits your needs.
You can get more output by using -fopt-info-all, which includes information on both successful and missed optimizations.
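For a concrete picture, here is a minimal sketch (the file name and the loop are invented for illustration) of the kind of report gfortran prints for a trivial, vectorizable loop; note that the report goes to standard error:
! saxpy.f90 -- hypothetical example file
subroutine saxpy(n, a, x, y)
  implicit none
  integer, intent(in) :: n
  real, intent(in) :: a, x(n)
  real, intent(inout) :: y(n)
  integer :: i
  ! simple unit-stride loop, a prime candidate for vectorization
  do i = 1, n
     y(i) = a*x(i) + y(i)
  end do
end subroutine saxpy
Compiling with vectorization enabled and only the vectorizer's messages selected,
gfortran -O3 -c -fopt-info-vec saxpy.f90
should print something along the lines of (exact line and column numbers will differ):
saxpy.f90:9:3: note: loop vectorized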

The vectorizer can be instructed to be verbose and report what it does:
-ftree-vectorizer-verbose=n
where a larger integer n means a more verbose report. (In recent GCC releases this option has been superseded by the -fopt-info flags mentioned above.)
For more see http://gcc.gnu.org/projects/tree-ssa/vectorization.html
(It took me 1 minute to google it).

Related

D compiler profiling

How to figure out which part of my d code takes long time to compile?
I tried to use valgrind, but the method names were not very insightful: 87% of the time was spent in <cycle 7>, and 40% of the time in _D4ddmd5lexer5Lexer4scanMFPS4ddmd6tokens5TokenZv.
I'm looking for something like this: 40% of the time was spent on xy.d, out of that 80% of the time took compiling various instantiations of template xyz and a reason is because it used memcpy 99% of the time.
I'm interested profiling both DMD and LDC.
As the D compiler front end is written in D, profiling it with conventional tools will be rather hard compared to something like C++. I have had some success using tools like gdb and valgrind on Linux and tools like VisualD on Windows; Mac users are kind of SOL.
You have five other options:
Stop trying to find the specific function in the compiler and turn to common knowledge about the problem (see below)
Use a tool like https://github.com/CyberShadow/DBuildStat. It doesn't give you the exact answer you're asking about, but if you're trying to get a large project to compile faster it's better than nothing.
Use the -v flag to try and see which parts of your program take a while. Granted, this is a very brute force approach and can take you a while.
Modify the makefile of the DMD front end to use the -profile switch. Every time you run DMD you will then get a profile file with a lot of information (a rough sketch of what the switch does follows this list). Granted, I don't think this has ever been tried. Your mileage may vary.
Try to ask the LDC team about this on their Github issues page. IIRC they made a patched version for profiling that they used for the Weka.io codebase.
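To illustrate the mechanism behind the makefile suggestion above: -profile is D's built-in instrumenting profiler, shown here on a toy program rather than on DMD itself (file names are hypothetical):
dmd -profile hello.d
./hello
On exit, the instrumented binary writes a trace.log file with per-function call counts and timings; a dmd rebuilt with -profile in its makefile would do the same every time you compile something with it.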
When I say turn to common knowledge, I mean that your slow compilation is likely due to a few common problems. For example, when an SQL query is taking too long, my first reaction is not to try to profile the MySQL server code. Here are a couple of the most common issues:
CTFE, while it speeds up your runtime, is slow. This is especially true if you're doing recursive templates like allSatisfy or you're using functions like ctRegex. If you're doing heavy CTFE and you want faster compiles at the price of possibly slower code, consider switching those to run-time calls.
DMD doesn't (yet) ignore symbols which aren't used in your program, meaning if you import a module, code-gen will happen for all of the functions in the module. This is true even for selective imports. If you don't use them the linker will prune the functions from the resulting executable, but the compiler still took time to compile them. Avoid imports like import std.algorithm; or import std.range;. Instead use package specific imports like import std.algorithm.iteration : map;.

Why is the LLVM execution engine faster than compiled code?

I have a compiler which targets LLVM, and I provide two ways to run the code:
Run it automatically. This mode compiles the code to LLVM and uses the ExecutionEngine JIT to compile it into machine code on-the-fly and run it without ever generating an output file.
Compile it and run separately. This mode outputs an LLVM .bc file, which I manually optimise (with opt), compile to native assembly (with llc), compile to machine code and link (with gcc), and run - roughly the command sequence sketched below.
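For reference, the second pipeline is presumably something along these lines (file names are placeholders; the flags are the ones mentioned below):
opt -std-compile-opts prog.bc -o prog.opt.bc
llc -O3 prog.opt.bc -o prog.s
gcc prog.s -o prog
./prog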
I was expecting approach #2 to be faster than approach #1, or at least the same speed, but running a few speed tests, I am surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.
Both cases are running the same LLVM source code. With approach #1, I haven't yet bothered to run any LLVM optimisation passes (which is why I was expecting it to be slower). With approach #2, I am running opt with -std-compile-opts and llc with -O3, to maximise optimisation, yet it isn't getting anywhere near #1. Here is an example run of the same program:
#1 without optimisation: 11.833s
#2 without optimisation: 22.262s
#2 with optimisation (-std-compile-opts and -O3): 18.823s
Is the ExecutionEngine doing something special that I don't know about? Is there any way for me to optimise the compiled code to achieve the same performance as the ExecutionEngine JIT?
It is normal for a VM with a JIT to run some applications faster than a statically compiled application. That's because a VM with a JIT is like a simulator that simulates a virtual computer while also running a compiler in real time. Because both tasks are built into the VM, the machine simulator can feed information to the compiler so that the code can be recompiled to run more efficiently. That information is not available to statically compiled code.
This effect has also been noted with Java VMs and with Python's PyPy VM, among others.
Another issue is code alignment and other optimizations. Nowadays CPUs are so complex that it's hard to predict which techniques will result in faster execution of the final binary.
As a real-life example, consider Google's Native Client - I mean the original NaCl compilation approach, not the one involving LLVM (because, as far as I know, there is currently a direction towards supporting both "native client" and (modified) "LLVM bitcode" code).
As you can see in presentations (check out youtube.com) or in papers like Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even their alignment technique makes the code size bigger, yet in some cases such alignment of instructions (for example, padding with no-ops) gives better cache hit rates.
Aligning instructions with no-ops and reordering instructions are well known in parallel computing, and they show their impact here as well.
I hope this answer gives an idea of how many circumstances might influence code execution speed, and that there are many possible reasons for differences between pieces of code, each of which needs investigation. Nevertheless, it's an interesting topic, so if you find some more details, don't hesitate to edit your answer and let us know in a postscript what you have found (maybe a link to a whitepaper or dev blog with new findings). Benchmarks are always welcome - take a look: http://llvm.org/OpenProjects.html#benchmark .

Automatically find compiler options for fastest exe on given machine?

Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
Naturally, I use g++ -O3, but there are additional flags that may make the code run faster, e.g. -ffast-math and others, some of which are hardware-dependent.
Does anyone know some code I can put in my configure.ac file (GNU autotools), so that the flags will be added to the Makefile automatically by the ./configure command?
In addition to automatically determining the best flags, I would be interested in some useful compiler flags that are good to use as a default for most optimized executables.
Update: Most people suggest just trying different flags and selecting the best ones empirically. For that method, I have a follow-up question: is there a utility that lists all compiler flags that are possible for the machine I'm running on (e.g. one that tests whether SSE instructions are available, etc.)?
I don't think you can do this at configure-time, but there is at least one program which attempts to optimize gcc option flags given a particular executable and machine. See http://www.coyotegulch.com/products/acovea/ for example.
You might be able to use this with some knowledge of your target machine(s) to find a good set of options for your code.
Um - yes. This is possible. Look into profile-guided optimization.
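A minimal sketch of the usual GCC profile-guided-optimization workflow, assuming a single-file program (file names and the workload are placeholders):
g++ -O3 -fprofile-generate myprog.cpp -o myprog
./myprog representative_input.dat
g++ -O3 -fprofile-use myprog.cpp -o myprog
The instrumented first build writes .gcda profile data when it runs; the second build uses that data to guide inlining, branch layout, and similar decisions.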
Some compilers provide a "-fast" option to automatically select the most aggressive optimizations for the given compilation host; see http://en.wikipedia.org/wiki/Intel_C%2B%2B_Compiler
Unfortunately, g++ does not provide similar flags.
As a follow-up to your next question: for g++ you can use the -mtune option together with -O3, which will give you reasonably fast defaults. The challenge then is to find the processor type of your compilation host. You may want to look at the autoconf macro archive to see whether somebody has written the necessary tests. Otherwise, assuming Linux, you have to parse /proc/cpuinfo to get the processor type.
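For example, on a typical x86 Linux box (the exact field names in /proc/cpuinfo vary by architecture):
grep -m1 'model name' /proc/cpuinfo
which prints something like "model name : Intel(R) Core(TM)2 Duo CPU ..." that you can then map to a suitable -march/-mtune value.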
After some googling, I found this script: gcccpuopt.
On one of my machines (32bit), it outputs:
-march=pentium4 -mfpmath=sse
On another machine (64bit) it outputs:
$ ./gcccpuopt
Warning: The optimum *32 bit* architecture is reported
-m32 -march=core2 -mfpmath=sse
So, it's not perfect, but might be helpful.
See also -mcpu=native/-mtune=native gcc options.
Is there a method to automatically find the best compiler options (on a given machine), which result in the fastest possible executable?
No.
You could compile your program with a large assortment of compiler options, then benchmark each and every version, then select the one that is "fastest," but that's hardly reliable and probably not useful for your program.
This is a solution that works for me, but it does take a little while to set up. In "Python Scripting for Computational Science" by Hans Petter Langtangen (an excellent book in my opinion), an example is given of using a short python script to do numerical experiments to determine the best compiler options for your C/Fortran/... program. This is described in Chapter 1.1.11 on "Nested Heterogeneous Data Structures".
Source code for examples from the book are freely available at http://folk.uio.no/hpl/scripting/index.html (I'm not sure of the license, so will not reproduce any code here), and in particular you can find code for a similar numerical test in the code in TCSE3-3rd-examples.tar.gz in the file src/app/wavesim2D/F77/compile.py , which you could use as a base for writing a script which is appropriate for a particular system/language (C++ in your case).
Optimizing your app is mainly your job, not the compiler's.
Here's an example of what I'm talking about.
Once you've done that, if your app is compute-bound, with hotspots in your own code (not in library code), then the compiler optimizations for speed will make some difference, so you can try different flag combinations.

Intel Fortran Compiler "-parallel" Not Working

I have a serial Fortran code that works fine. Once I compile the same code using ifort -parallel and run it, it gives wrong results and overflows. I would expect that with the -parallel flag the Intel compiler is capable of selecting the loops that are safe to parallelize, so I should get exactly the same results as with the serial code, which did not happen. The even stranger behaviour is that I went ahead and disabled parallelization of all the do loops in my code using !DEC$ NOPARALLEL, compiled the code with ifort -parallel to make sure that none of the loops was parallelized, and then ran it. Surprisingly, I got the same wrong results and overflow, although the latter should be exactly equivalent to the serial code.
Is anyone capable of explaining this behaviour, or is it just an Intel compiler deficiency?
Greetings.
Sorry to say this, but it's unlikely to be an Intel compiler problem - it's a pretty good compiler (no, I don't work for Intel, but I do use their compilers).
Yes, I am capable of explaining this sort of behaviour, but without sight of your program anything I suggest will be wrong.
Answers were given to this identical question on the Intel Fortran Forum: http://software.intel.com/en-us/forums/topic/269743
EDIT: I revised the link, since as stated in the comment, the original link is now dead.

Verifying compiler optimizations in gcc/g++ by analyzing assembly listings

I just asked a question related to how the compiler optimizes certain C++ code, and I was looking around SO for any questions about how to verify that the compiler has performed certain optimizations. I was trying to look at the assembly listing generated with g++ (g++ -c -g -O2 -Wa,-ahl=file.s file.c) to possibly see what is going on under the hood, but the output is too cryptic to me. What techniques do people use to tackle this problem, and are there any good references on how to interpret the assembly listings of optimized code or articles specific to the GCC toolchain that talk about this problem?
GCC's optimization passes work on an intermediary representation of your code in a format called GIMPLE.
Using the -fdump-* family of options, you can ask GCC to output intermediary states of the tree.
For example, feed this to gcc -c -fdump-tree-all -O3
unsigned fib(unsigned n) {
if (n < 2) return n;
return fib(n - 2) + fib(n - 1);
}
and watch as it gradually transforms from a simple exponential algorithm into a complex polynomial algorithm. (Really!)
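The dump files land next to the source, one per pass, with names of the form fib.c.<number>t.<pass> (the exact numbering and pass names vary between GCC versions); the final tree-level dump is usually the most readable:
gcc -c -O3 -fdump-tree-all fib.c
cat fib.c.*t.optimized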
A useful technique is to run the code under a good sampling profiler, e.g. Zoom under Linux or Instruments (with Time Profiler instrument) under Mac OS X. These profilers not only show you the hotspots in your code but also map source code to disassembled object code. Highlighting a source line shows the (not necessarily contiguous) lines of generated code that map to the source line (and vice versa). Online opcode references and optimization tips are a nice bonus.
Instruments: developer.apple.com
Zoom: www.rotateright.com
Not gcc, but when debugging in Visual Studio you have the option to intersperse assembly and source, which gives a good idea of what has been generated for what statement. But sometimes it's not quite aligned correctly.
The output of the gcc tool chain and of objdump -dS aren't at the same granularity. This article on getting gcc to output source and assembly uses the same options as you are using.
Adding the -L option (e.g., gcc -L -ahl) may provide slightly more intelligible listings.
The equivalent MSVC option is /FAcs (and it's a little better because it intersperses the source, machine language, and binary, and includes some helpful comments).
About one third of my job consists of doing just what you're doing: juggling C code around and then looking at the assembly output to make sure it's been optimized correctly (which is preferred to just writing inline assembly all over the place).
Game-development blogs and articles can be a good resource for the topic since games are effectively real-time applications in constant memory -- I have some notes on it, so does Mike Acton, and others. I usually like to keep Intel's instruction set reference up in a window while going through listings.
The most helpful thing is to get a good ground-level understanding of assembly programming generally first -- not because you want to write assembly code, but because having done so makes reading disassembly much easier. I've had a hard time finding a good modern textbook though.
In order to output the optimizations applied you can use:
-fopt-info-optimized
To see those that have not been applied
-fopt-info-missed
Beware that the output is sent to the standard error stream, so to capture it you have to redirect it (hint: 2>&1).
Here is a nice example:
g++ -O3 -std=c++11 -march=native -mtune=native -fopt-info-optimized h2d.cpp -o h2d 2>&1
h2d.cpp:225:3: note: loop vectorized
h2d.cpp:213:3: note: loop vectorized
h2d.cpp:198:3: note: loop vectorized
h2d.cpp:186:3: note: loop vectorized
You can check the interleaved output by compiling with -g and then running objdump -dS | c++filt, but that will not get you very far. Enjoy!
Zoom from RotateRight ( http://rotateright.com ) is mentioned in another answer, but to expand on that: it shows you the mapping of source to assembly in what they call the "code browser". It's incredibly handy even if you're not an asm expert because they have also integrated assembly documentation into the app. And the assembly listing is annotated with comments and timing for several CPU types.
You can just open your object or executable file with Zoom and take a look at what the compiler has done with your code.
Victor, in your case the optimization you are looking for is just a smaller allocation of local memory on the stack. You should see a smaller allocation at function entry and a smaller deallocation at function exit if the space used by the empty class is optimized away.
As for the general question, I've been reading (and writing) assembly language for more than (gulp!) 30 years and all I can say is that it takes practice, especially to read the output of a compiler.
Instead of trying to read through an assembler dump, run your program inside a debugger. You can pause execution, single-step through instructions, set breakpoints on the code you want to check, etc. Many debuggers can display your original C code alongside the generated assembly so you can more easily see what the compiler did to optimize your code.
Also, if you are trying to test a specific compiler optimization you can create a short dummy function that contains the type of code that fits the optimization you are interested in (and not much else, the simpler it is the easier the assembly is to read). Compile the program once with optimizations on and once with them off; comparing the generated assembly code for the dummy function between builds should show you what the compiler's optimizers did.