I do not understand the gprof documentation regarding how to compile a program for profiling. With g++, is it required to compile with the -g option (debugging information) in addition to the -pg option, or not? I get different results in each case, and I would like to see where the bottlenecks in my application are in release mode, not in debug mode, where the compiler leaves out many optimizations (e.g. inlining).
The documentation shows that you can do either, noting that you need -g for line-by-line profiling. So if you want to profile under release conditions and can accept not having line-by-line results, you should be able to compile without -g.
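As a hedged sketch (the file and program names are illustrative), a release-mode profiling build would look like this:

g++ -O2 -pg -o myapp main.cpp    # flat profile and call graph, no line-by-line data
./myapp                          # running the program writes gmon.out
gprof myapp gmon.out > analysis.txt

# If you do want line-by-line profiling, add -g and pass -l to gprof:
g++ -O2 -g -pg -o myapp main.cpp
gprof -l myapp gmon.out > analysis.txt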
I'm taking a MOOC course on data structures and algorithms. I would like to use C++, and I need to set my compiler options to the following:
g++ -pipe -O2 -std=c++14 -lm
I'm currently using MS Visual Studio 2017 on Windows. Is it even possible? Do I need to do a custom build? The following paragraph from the MOOC mentions that Windows users may have to use Cygwin, but I have no clue what that means. Can anybody shed some light on a feasible way to do this?
Your solution will be compiled as follows. We recommend that when
testing your solution locally, you use the same compiler flags for
compiling. This will increase the chances that your program behaves in
the same way on your machine and on the testing machine (note that a
buggy program may behave differently when compiled by different
compilers, or even by the same compiler with different flags).
C++ (g++ 5.2.1). File extensions: .cc, .cpp. Flags
g++ -pipe -O2 -std=c++14 -lm
If your C/C++ compiler does not recognize the "-std=c++14" flag, try
replacing it with "-std=c++11" or "-std=c++0x" flag or compiling
without this flag at all (all starter solutions can be compiled
without it). On Linux and MacOS, you probably have the required
compiler. On Windows, you may use your favorite compiler or install an
environment such as cygwin.
As they said, you can install Cygwin (https://cygwin.com/install.html).
The rough equivalents for MSVC would be: /std:c++14 /O2 (there is no counterpart to -lm, since the math functions are already part of the MSVC runtime).
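As a hedged sketch from a Developer Command Prompt (the /EHsc flag for standard C++ exception handling and the file name are my own additions, not part of the course requirements):

cl /std:c++14 /O2 /EHsc main.cpp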
You can set compiler options for each project in its Property Pages
dialog box. In the left pane, select Configuration Properties, C/C++
and then choose the compiler option category.
For the full list of options by category, see: https://learn.microsoft.com/de-de/cpp/build/reference/compiler-options-listed-by-category
I'm reading this tutorial about code vectorization using Intel Advisor. In particular, on this page they suggest to:
Build the target sample application in release mode ... compiler options: -O2 -g
And further on:
To build your own applications to produce the most accurate and
complete Vectorization Advisor analysis results, build an optimized
binary in release mode using the following settings.
-g -O2 (or higher) -qopt-report=5 -vec -simd -qopenmp
Now, I have a couple of questions:
I thought that in release mode we didn't produce any debug information (which is produced in "debug mode"), so no -g should be included.
The weirdest thing is that the Makefile provided with the sample code (vec_samples in /opt/intel/advisor_*/...) uses only -g -O2 and does not include any of the other options. Why?
The relevant entry point to fresh Intel Advisor tutorials is Getting Started, where you can pick and choose appropriate sub-tutorials. The Vectorization Advisor sub-tutorial for Linux can be found here. It says precisely that:
-qopt-report=5 : necessary for version 15.0 of the Intel compiler; unnecessary for version 16.0 and higher
With regard to -vec, -simd and -qopenmp, the tutorial slightly confuses the flags needed for proper Advisor functioning (-g, -O2, optionally -qopt-report) with the flags needed for "proper" compiler functioning (-vec, -simd and -qopenmp). The latter are just flags controlling the compiler's vector code generation; they have nothing to do with Advisor's profiling capabilities, so you may or may not use them.
To give you a deeper understanding: there is an important feature in Advisor called Intel Advisor Survey "Compiler Integration".
This feature leverages data that is similar, but not identical, to the opt-report.
In order to make this feature work you need:
Intel Compiler 14.x beta, 15.x, 16.x or 17.x
-g (enable debug info) and -O2 or higher (enable some optimization)
Optional (for Intel Compiler 15.x only): -qopt-report=5
All other features in Intel Advisor work equally well regardless of compiler version (item 1 above) or opt-report flags and versions (item 3 above), but they all still require -g (part of item 2 above). -O2 is not needed for some features, but it is typically pointless to analyze the performance of binaries compiled with -O0 or -O1.
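Putting the above together, a hedged sketch of an Advisor-friendly build line (icpc as the Intel C++ driver; file names are illustrative):

icpc -g -O2 -qopt-report=5 -o myapp main.cpp   # Intel Compiler 15.x
icpc -g -O2 -o myapp main.cpp                  # Intel Compiler 16.x and higher; -qopt-report=5 is unnecessary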
I am building a project using a library (openFrameworks), and the default compiler options for the release target included the -O2 flag, which I have never used. Until recently I thought nothing of it, because everything seemed to be working. I then began testing on machines that were not used in development, and the program crashed (it didn't even get to any of my debug statements).
Recompiling on the target machine itself makes the executable work correctly. Is the -O2 flag possibly causing this? I get no errors or warnings when recompiling on the target machine, so I'm not quite sure why this is happening. The reason I suspect the -O2 flag is because it's the only one I've never used that's enabled in the project.
I have not yet tested whether it happens with the -O1 or -O3 flags.
I am on Windows 7 and all my tests have been on Windows 7 and Windows 8 systems, compiled using MinGW (TDM-GCC) 4.8.1 in Code::Blocks.
The -O2 flag: Optimize even more than the -O flag. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify '-O2'. As compared to '-O', this option increases both compilation time and the performance of the generated code.
'-O2' turns on all optimization flags specified by '-O'. It also turns on the following optimization flags:
-fforce-mem -foptimize-sibling-calls -fstrength-reduce -fcse-follow-jumps -fcse-skip-blocks -frerun-cse-after-loop -frerun-loop-opt -fgcse -fgcse-lm -fgcse-sm -fdelete-null-pointer-checks -fexpensive-optimizations -fregmove -fschedule-insns -fschedule-insns2 -fsched-interblock -fsched-spec -fcaller-saves -fpeephole2 -freorder-blocks -freorder-functions -fstrict-aliasing -falign-functions -falign-jumps -falign-loops -falign-labels
The result is that any architecture or OS difference can cause a memory fault or a bad branch. An optimized executable or library is going to exploit the specific hardware available on the platform it was built for. If you have an apples-to-apples comparison where hardware and OS are identical, you have a shot at it working, but even then it is hard to make a determination until you try it at run time. The problem is that the behavior depends on the specifics of the compiler and cannot be generalized.
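As an illustration (a hedged sketch; whether the project actually sets an architecture flag is an assumption on my part), this kind of portability problem usually comes from architecture-specific flags combined with optimization, not from -O2 itself:

g++ -O2 -march=native -o app main.cpp    # may emit instructions (e.g. SSE4, AVX) absent on other machines
g++ -O2 -mtune=generic -o app main.cpp   # safer choice when distributing the binary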
I need to profile my C++ code, and valgrind --tool=callgrind is a phenomenal tool for that. I was wondering, however, whether I should be profiling my code with -g -pg -O1 or -g -pg -O3 (GCC 4.4.7). The latter gives a more accurate depiction of my program's performance, but I worry that -O3 will confuse the profiler and obfuscate which source functions are the actual bottlenecks. Perhaps I am just scared of old wives' tales, but I figured I should ask to be sure before running a potentially several-hour test.
This other Stack Overflow thread may clear your mind: optimization flags when profiling
The problem is not profiling with optimization, but debugging with optimization (-g -pg).
As quantdev said, you should "always use the same options as the one used to create production binaries", and you are not going to create a production binary with debug information.
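For what it's worth, a hedged sketch of one common approach (file names illustrative): callgrind does not need -pg at all, and -g only affects the debug info used to annotate the results, not the generated code:

g++ -O3 -g -o myapp main.cpp          # -g adds debug info without changing the optimized code
valgrind --tool=callgrind ./myapp     # writes callgrind.out.<pid>
callgrind_annotate callgrind.out.*    # or open the output file in KCachegrind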
If the thread is not enough, let us know.
I have written a benchmark method to test my C++ program (which searches a game tree), and I am noticing that compiling with the "LLVM compiler 2.0" option in Xcode 4.0.2 gives me a significantly faster binary than compiling with the latest version of clang++ from MacPorts.
If I understand correctly, I am using a Clang front-end and LLVM back-end in both cases. Has Apple made improvements to their clang/llvm distribution to produce faster binaries for Mac OS? I can't find much information about the project.
Here are the benchmarks my program produces for various compilers, all using -O3 optimization (higher is better):
(Xcode) "gcc 4.2": 38.7
(Xcode) "llvm gcc 4.2": 51.2
(Xcode) "llvm compiler 2.0": 50.6
g++-mp-4.6: 43.4
clang++: 40.6
Also, how do I compile from the terminal with the clang/LLVM that Xcode is using? I can't find the command.
EDIT: The scores I output are "thousands of games per second" which are calculated over a long enough run of the program. Scores are very consistent over multiple runs, and recent major algorithmic improvements have given me 1% - 5% speed ups, for example. A 25% speed up of 40 to 50 is huge for my program.
UPDATE: I wasn't invoking clang++ from the command line with -flto. Now when I compare clang++ -O3 -flto to /Developer/usr/bin/clang++ -O3 -flto from the command line the results are closer, but the Apple one is still 6.5% faster.
Now how to enable link time optimization for gcc? When I try g++ -flto I get the following error:
cc1plus: error: LTO support has not been enabled in this configuration
Apple LLVM Compiler should be available under /Developer/usr/bin/clang.
I can't think of any particular reason why MacPorts clang++ would generate slower code... I would check whether you're passing in comparable command-line options. One thing that would make a large difference is if you're producing 32-bit code with one compiler, and 64-bit code with the other.
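For instance (a hedged check; the binary and file names are illustrative), you can verify the architecture of each build with the file command, or force it explicitly with Apple clang's -arch flag:

file ./myapp                                  # reports e.g. "Mach-O 64-bit executable x86_64"
clang++ -O3 -arch x86_64 -o myapp main.cpp    # force a 64-bit build
clang++ -O3 -arch i386 -o myapp main.cpp      # force a 32-bit build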
If your GCC build has no LTO support, then you need to build it yourself:
http://solarianprogrammer.com/2012/07/21/compiling-gcc-4-7-1-mac-osx-lion/
For LTO you need to add 'libelf' to the instructions.
http://sourceforge.net/apps/trac/mingw-w64/wiki/LTO%20and%20GCC
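Once you have an LTO-enabled GCC, a hedged sketch of a link-time-optimized build (file names illustrative; -flto must be passed both when compiling and when linking):

g++ -O3 -flto -c a.cpp
g++ -O3 -flto -c b.cpp
g++ -O3 -flto -o myapp a.o b.o    # cross-module optimization happens again at link time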
The exact speed of an algorithm can depend on all kinds of things that are totally outside your control and the compiler's. You may have a loop whose execution time depends on precisely how the instructions are aligned in memory, in a way that the compiler couldn't predict. I have seen cases where a loop could enter different "states" with different execution times per iteration (so after a context switch, it could enter a state where each iteration took either 12 or 13 cycles, rather randomly). This can all be coincidence.
You might also be using different libraries, which is quite possibly the reason. On Mac OS X, they are using a new and presumably faster implementation of std::string and std::vector, for example.