Callgrind on O1 or O3 for performance profiling?

Callgrind on O1 or O3 for performance profiling? - c++

I need to profile my C++ code, and valgrind --tool=callgrind is a phenomenal tool for that. I was wondering, however, if I should be profiling my code with -g -pg -O1 or -g -pg -O3 (GCC 4.4.7)? The latter gives a more accurate depiction of my program's performance, but I worry that -O3 will confuse the profiler and obfuscate what source functions are the actual bottlenecks. Perhaps I am just scared of old wive's tales, but I figured I should ask to be sure before running a potentially several hour test.

This thread in another stackoverflow may clear your mind: optimization flags when profiling
The problem is not profiling with optimization, but debugging with optimization (-g -pg).
As quantdev said, you should "always use the same options as the one used to create production binaries", and you are ont going to create a production binary with debug information.
If the thread is not enough, let us know.

Related

Valgrind flags, debug vs release compilation

On a Jenkins instance, I need Valgrind to check if there are particular problems in a C++ compiled binary. However, I only need a yes / no answer, not a stack trace for example. If they are any problems, I will launch valgrind on the faulty code with debug flags activated on my personal machine. The build is managed with CMake on a Linux running machine (targeting gcc).
If I compile my code with -DCMAKE_BUILD_TYPE=Release on the Jenkins instance, will Valgrind detect the same problems in the binary as with -DCMAKE_BUILD_TYPE=Debug ?

Valgrind works by instrumenting and replacing parts of your code at runtime, like redirecting calls to memory allocation functions. For doing this it does not rely on debug information, but it might get confused by optimized code:
If you are planning to use Memcheck: On rare occasions, compiler
optimisations (at -O2 and above, and sometimes -O1) have been observed
to generate code which fools Memcheck into wrongly reporting
uninitialised value errors, or missing uninitialised value errors. We
have looked in detail into fixing this, and unfortunately the result
is that doing so would give a further significant slowdown in what is
already a slow tool. So the best solution is to turn off optimisation
altogether.
(from the Valgrind manual)
Since the Release build type is uses optimizations, that makes it a bad fit for your case.

bcc64 optimizations -O1 vs -O2 still slower than bcc32 by 40% and more

I have a product consisting of a VCL executable plus a Standard C++ DLL, all built with C++ Builder XE4. I publish in 32-bit and 64-bit versions.
When doing performance testing with release builds, the 64-bit version runs much more slowly... 40% more slowly.
I understand that I need to have optimizations turned on for the performance testing to be meaningful. XE4 allows me to set (mutually exclusively):
-O1 = smallest possible code
-O2 = fastest possible code
I have built using each of these, but the results are unchanged.
I see from postings here that Linux/g++ programmers use -O3 (smallest AND fastest?) (see 64-bit executable runs slower than 32-bit version). But -O3 is not an option for my environment.
Are there other compiler settings I should be looking at?
Thanks for your help.

The main downside of 64bit mode is that pointers double in size. Alignment rules might also lead classes/structs to be bigger. Maybe your code just barely fit into cache in 32bit mode, but not 64. This is esp. likely if your code uses a lot of pointers.
Another possibility is that you call some external library, and your 32bit version of it has some asm speedups, but the 64bit version doesn't.
Use a profiler to see what's actually slow in your 64bit version. For Windows, Intel's VTUNE is maybe a good choice. You can see where your code is having a lot of cache misses. Comparing total cache misses between 32bit and 64bit should shed some light.
Re: -O1 vs. -O2: Different compilers have different meanings for options. gcc and clang have:
-Os: optimize for code size
-O0: minimal / no optimization (most things get stored/reloaded from RAM after every step)
-O1: some optimization without taking a lot of extra compile time
-O2: more optimizations
-O3: even more optimizations, including auto-vectorizing
Clang doesn't seem to document its optimization options, so I assume it mirrors gcc. (There are options to report on optimizations it did, and to use profile-guided optimization.) See the latest version of the gcc manual (online) for more descriptions of optimization options: e.g.
-Ofast: -O3 -ffast-math (and maybe "unsafe" optimizations.)
-Og: optimize without breaking debugging. Recommended for the edit/compile/debug cycle.
-funroll-loops: can help in some tight loops, but isn't enabled even at -O3. Don't use for everything, because larger code size can lead to I-cache misses which hurt more. -fprofile-use does enable this, so ideally just use PGO.
-fblah-blah: there are a ton more specific options. Usually just use -O3 to pick the recommended set.

Does the Rust compiler have a profiling option?

I have a Rust program that isn't running as fast as I think it should. Is there a way to tell the compiler to instrument the binary to generate profiling information?
I mean something like GCC's -p and -pg options or GHC's -prof.

The compiler doesn't support any form of instrumentation specifically for profiling (like -p/-pg/-prof), but compiled Rust programs can be profiled under tools that do not require custom instrumentation, such as Instruments on OS X, and perf or callgrind on Linux.
I believe such tools support using DWARF debuginfo (as emitted by -g) to provide more detailed performance diagnostics (per-line etc.), but enabling optimisations play havoc with the debug info, and it's never really worked for me. When I'm analysing performance, diving into the asm is very common.
Making this easier would be really nice, and tooling is definitely a post-1.0 priority.

There's no direct switch that I'm aware of. However, I've successfully compiled my code with optimizations enabled as well as debugging symbols. I can then use OS X's Instruments to profile the code. Other people have used KCachegrind on Linux systems to the same effect.

Optimization issue

I'm developing a controller program used to run a humanoid kidsize robot. The OS is debian 6 and whole programs are written in C++11. CPU is a 1GHz VorteX86 SD and its architecture is Intel i486.
I need to compile my code with maximum possible optimization. currently I'm using gcc with 3rd level optimization flag and i486 optimization tunning:
g++ -std=c++0x -O3 -march=i486 -mtunes=i486
I'm wondering if its possible to gain more optimized code or not. I searched around about optimization flags and compiler benchmarks, but didn't find any...
My question is which compiler for C++ is generates faster code? Specially for i486 architecture.
Current candidates are: ICC XE, GCC 4.6, EkoPath

An option which typically makes the code faster is -funroll-loops

See the documentation. There are too many permutations to test them all; maybe give Acovea a try, which tests for the best one with a genetic approach.
If you have many floating points optimizations, you may try -ffast-math or -Ofast, which includes -ffast-math. However, you lose IEEE floating math compliance.

Compiling in g++ for gprof

I do not understand the documentation for gprof regarding how to compile your program for profiling with gprof. In g++, is it required to compile with the -g option (debugging information) in a addition to the -pg option or not. In each case I get different results, and I would like to see where the bottlenecks in my application are in release mode, not in debug mode, where many optimizations are left out by the compiler (e.g. inlining)

The documentation shows that you can do either, noting that you need -g for line by line profiling. So if you want to profile under release conditions, and can accept not doing line-by-line, you should be able to compile without -g.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js