callgrind slow with instrumentation turned off - c++

I am using callgrind to profile a Linux multi-threaded app, and mostly it's working great. I start it with instrumentation off (--instr-atstart=no) and then, once setup is done, I turn it on with callgrind_control -i on. However, when I change certain configurations to try to profile a different part of the app, it starts running extremely slowly even before I turn instrumentation on. Basically, part of the code that would take a few seconds in normal operation takes over an hour with callgrind (instrumentation turned off). Any ideas as to why that might be and how to go about debugging/resolving the slowness?

Callgrind is a tool built on Valgrind. Valgrind is basically a dynamic binary translator (libVEX, part of Valgrind). It decodes every instruction and JIT-compiles it into a stream of instructions for the same CPU.
As far as I know, the Valgrind implementation has no way to enable this translation for an already running process, so dynamic translation is active the whole time, from the start of the program. It can't be turned off either.
Tools are built on Valgrind by adding instrumentation code. The "Nul" tool (Nulgrind, selected with --tool=none) is the tool that adds no instrumentation. But every tool uses Valgrind, so dynamic translation is active all the time. Turning instrumentation on and off in Callgrind only turns the additional instrumentation on and off.
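The same toggle is also available from inside the program through the client-request macros in valgrind/callgrind.h, as an alternative to callgrind_control -i on. A minimal sketch, assuming the Valgrind development headers may or may not be installed: InstrumentedRegion and profiled_sum are hypothetical names, and the no-op fallback macros exist only so the file still builds without the headers. Translation stays active either way, so this does not remove the base slowdown described above.

```cpp
#include <cassert>
#include <cstdint>

// Use the real client-request macros when the Valgrind headers are installed.
#if defined(__has_include)
#  if __has_include(<valgrind/callgrind.h>)
#    include <valgrind/callgrind.h>
#  endif
#endif

// Fallback (assumption of this sketch): no-op stand-ins so the code still
// builds on machines without the Valgrind development headers.
#ifndef CALLGRIND_START_INSTRUMENTATION
#  define CALLGRIND_START_INSTRUMENTATION ((void)0)
#  define CALLGRIND_STOP_INSTRUMENTATION ((void)0)
#endif

// Hypothetical RAII helper: instrumentation is requested on construction and
// switched off again on destruction.
struct InstrumentedRegion {
    InstrumentedRegion() { CALLGRIND_START_INSTRUMENTATION; }
    ~InstrumentedRegion() { CALLGRIND_STOP_INSTRUMENTATION; }
    InstrumentedRegion(const InstrumentedRegion&) = delete;
    InstrumentedRegion& operator=(const InstrumentedRegion&) = delete;
};

// Hypothetical hot path: sums 0..n-1 while instrumentation is enabled.
std::uint64_t profiled_sum(std::uint64_t n) {
    assert(n <= 0xFFFFFFFFULL);  // keeps the sum within 64 bits
    InstrumentedRegion region;
    std::uint64_t total = 0;
    for (std::uint64_t i = 0; i < n; ++i) total += i;
    return total;
}
```

Run natively, the macros do nothing observable; under valgrind --tool=callgrind --instr-atstart=no, only the code inside the region should be instrumented.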
The virtual CPU implemented by Valgrind is limited; there is an (incomplete) list of limitations at http://valgrind.org/docs/manual/manual-core.html#manual-core.limits. Most of the limitations concern floating-point operations, which may be emulated incorrectly.
Is the change connected with floating-point operations? Or with other listed limitations?
You should also know that "Valgrind serialises execution so that only one thread is running at a time" (from the same page, manual-core.html).
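One pattern that tends to suffer badly from this serialisation is a busy-wait loop: the spinning thread keeps running while the thread it is waiting for makes no progress. Whether that is what happens in your configuration is an assumption, since the question does not say. A minimal sketch of a wait loop that yields on every iteration so that another thread gets a chance to run (spin_until_set is a hypothetical name):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical helper: polls `flag` until it becomes true and returns the
// number of times it yielded. max_spins bounds the wait, so a flag that is
// never set results in a return value of max_spins instead of a hang.
std::uint64_t spin_until_set(const std::atomic<bool>& flag,
                             std::uint64_t max_spins = 100000000ULL) {
    assert(max_spins > 0);
    std::uint64_t spins = 0;
    while (!flag.load(std::memory_order_acquire) && spins < max_spins) {
        // Give up the CPU so the thread that sets the flag can be scheduled.
        std::this_thread::yield();
        ++spins;
    }
    return spins;
}
```

Valgrind also has a --fair-sched=yes option that changes how the serialising lock is handed between threads; it may be worth trying before changing any code.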

Related

Getting consistent callgrind output over multiple runs

I've been using valgrind to do some profiling on a C++ codebase, and based on my understanding it's performing sampling of individual calls to build its profile data. As a result, I can profile the same application twice using something like this:
valgrind --tool=callgrind myprogram.exe
and get different results each time. For instance, my last two runs gave me the following output at the bottom:
==70741==
==70741== Events : Ir
==70741== Collected : 196823780
==70741==
==70741== I refs: 196,823,780
and
==70758==
==70758== Events : Ir
==70758== Collected : 195728098
==70758==
==70758== I refs: 195,728,098
This all makes complete sense to me. However, I'm currently in the process of optimizing my codebase and I'd like to be able to make a change and determine whether it improves performance. Due to sampling, it seems that running callgrind alone will not be sufficient, since I can get different numbers on each run. As a result, it would be difficult to determine whether my latest run was faster just due to random sampling or because my change actually made a significant difference (in the statistical sense, why not?).
My question is: Is there a way I can force callgrind to be consistent in its sampling? Or is there some more appropriate higher-level tool that would allow me to understand how a refactor affects performance? I'm currently developing on Mac OS X Sierra.
Callgrind does not use a sampling technique. Instead, it counts exactly the instructions that are executed. So, if a program is run twice under callgrind and does exactly the same thing, it will give the same result. However, as soon as your program does non-trivial things (e.g. uses a lot of libraries), those libraries might do slightly different things depending on, e.g., the clock, the system load, or the contents of the environment.
So the differences you see are not due to callgrind; they arise because your program and/or the libraries it uses really are doing slightly different things each time.
You can use a callgrind output file visualisation or reporting tool, such as kcachegrind, to analyse the difference between two runs. You might then see where these differences originate and eliminate them. Otherwise, you should be able to determine the effect of your changes by looking only at the cost of the functions you are interested in.
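One way to eliminate such differences in code you control is to pin every source of run-to-run variation, for example by seeding random number generators with a fixed value instead of the clock or std::random_device. A minimal sketch: count_sort_comparisons is a hypothetical stand-in for the code being profiled, and the comparison count is only a rough proxy for executed work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical workload: fills a vector from a seeded generator, sorts it,
// and returns how many comparisons the sort performed.
std::uint64_t count_sort_comparisons(std::uint32_t seed, std::size_t n) {
    assert(n > 0);
    std::mt19937 rng(seed);  // fixed seed: the same sequence on every run
    std::vector<std::uint32_t> data(n);
    for (auto& value : data) value = static_cast<std::uint32_t>(rng());

    std::uint64_t comparisons = 0;
    std::sort(data.begin(), data.end(),
              [&comparisons](std::uint32_t a, std::uint32_t b) {
                  ++comparisons;
                  return a < b;
              });
    return comparisons;
}
```

With the same seed and the same standard library, two runs do the same amount of work, so any remaining difference in the Ir count comes from somewhere else (start-up code, environment, libraries).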

Measure or profile use of AVX2 (and other advanced instruction set) instructions by a program

We are chasing some weird hardware failures on AMD Threadrippers. I came across some evidence that AVX2/AVX-512 instructions can lead to weird behaviour (https://news.ycombinator.com/item?id=22382946).
Is there a generic way of measuring or profiling the use of AVX2/AVX-512 instructions by a running program or machine? For now it would be enough for me to get a ballpark figure for how many of these instructions are being used in a given time frame. I do not necessarily need to pin it down to the actual program using them. The more detailed the profiling/attribution of AVX2/AVX-512 instruction use by program or time, the better.
I would prefer tools that run in Linux.

How to get Valgrind to not instrument a specific shared object?

I'm using a proprietary library to import data; it uses the GPU (OpenGL) to transform the data.
Running my program through Valgrind (memcheck) causes the data import to take 8-12 hours (instead of a fraction of a second). I need to do my Valgrind sessions overnight (and leave my screen unlocked all night, since the GPU stuff pauses while the screen is locked). This is causing a lot of frustration.
I'm not sure if this is related, but Valgrind shows thousands of out-of-bound read/write errors in the driver for my graphics card:
==10593== Invalid write of size 4
==10593== at 0x9789746: ??? (in /usr/lib/x86_64-linux-gnu/dri/i965_dri.so)
(I know how to suppress those warnings).
I have been unable to find any ways of selectively instrumenting code, or excluding certain shared libraries from being instrumented. I remember using a tool on Windows 20 years or so ago that would skip instrumenting selected binaries. It seems this is not possible with memcheck:
Is it possible to make valgrind ignore certain libraries? -- 2010, says this is not possible.
Can I make valgrind ignore glibc libraries? -- 2010, solutions are to disable warnings.
Restricting Valgrind to a specific function -- 2011, says it's not possible.
using valgrind at specific point while running program -- 2013, no answers.
Valgrind: disable conditional jump (or whole library) check -- 2013, solutions are to disable warnings.
...unless things have changed in the last 6 or 7 years.
My question is: Is there anything at all that can be done to speed up the memory check? Or to not check memory accesses in certain parts of the program?
Right now the only solution I see is to modify the program to read data directly from disk, but I'd rather test the actual program I'm planning to deploy. :)
No, this is not possible. When you run an application under Valgrind it is not running natively under the OS but rather in a virtual environment.
Some of the tools like Callgrind have options to control the instrumentation. However, even with the instrumentation off the application under test is still running under the Valgrind virtual environment.
There are a few things you can do to make things less slow:
Test an optimized build of your application. Line numbers in error reports may be less accurate as a result, however.
Turn off leak detection (--leak-check=no).
Avoid costly options like --track-origins=yes.
The sanitizers (e.g. AddressSanitizer) are faster and can also detect stack buffer overflows, but at the cost of requiring compile-time instrumentation.
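If modifying the program is acceptable, the workaround mentioned in the question (reading the data directly from disk) can be limited to Valgrind runs by using the RUNNING_ON_VALGRIND client request from valgrind/valgrind.h. A minimal sketch, assuming the Valgrind headers may or may not be installed: ImportPath, under_valgrind and choose_import_path are hypothetical names, and the caveat from the question still applies, since the checked run then no longer exercises the GPU import.

```cpp
// Use the real client request when the Valgrind headers are installed.
#if defined(__has_include)
#  if __has_include(<valgrind/valgrind.h>)
#    include <valgrind/valgrind.h>
#  endif
#endif

// Fallback (assumption of this sketch): without the header, always report
// "not running under Valgrind".
#ifndef RUNNING_ON_VALGRIND
#  define RUNNING_ON_VALGRIND 0
#endif

// Hypothetical choice between the real GPU import and a pre-converted file.
enum class ImportPath { Gpu, CachedFile };

// True only when the process is being run by Valgrind.
inline bool under_valgrind() { return RUNNING_ON_VALGRIND != 0; }

// Keeps the decision separate from the detection so it can be tested natively.
constexpr ImportPath choose_import_path(bool valgrind_active) {
    return valgrind_active ? ImportPath::CachedFile : ImportPath::Gpu;
}
```

A caller would write choose_import_path(under_valgrind()) at the point where the import starts; everything else in the program stays on the deployed code path.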

Is callgrind profiling influenced by other processes?

I'd like to profile my application using callgrind. Since it takes a very long time, in the meantime I go on with web browsing, compiling, and other intensive tasks on the same machine.
Am I biasing the profiling results? I'm expecting that, since valgrind uses a simulated CPU, other external processes should not interfere with valgrind execution. Am I right?
By default, Callgrind does not record anything related to time, so you can expect all collected metrics to (mostly) be independent of other processes on the machine. As the Callgrind manual states,
By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
As such, the metrics Callgrind reports should only depend on what instructions the program is executing on the (simulated) CPU, not on how much time such instructions take. Indeed, the output of Callgrind can often be somewhat misleading, as the simulated CPU might operate differently from the real one (particularly when it comes to branch prediction).
The Callgrind paper presented at ICCS 2004 is very clear about this as well:
We note that the simulation is not able to predict consumed wall clock time, as this would need a detailed simulation of the microarchitecture.
In any case, however, the simulated CPU is unaffected by what the real CPU is doing.
The reason is straightforward.
Like you said, your program is not executed on your machine at all.
Instead, at runtime, Valgrind dynamically translates your program; that is, it disassembles the binary into "UCode" for a simulated machine, adds analysis code (called instrumentation), then generates binary code that executes the simulation.
The addition of analysis code is what makes instruction counting (in Callgrind), memory checking (in Memcheck), and all other plugins possible.
Therein lies the twist, however.
Naturally there are limits to how isolated the program can run in such a dynamic simulation.
First, your program might interact with other programs.
While the time spent for doing so is irrelevant (as it is not accounted for), the return codes of inter-process communication can certainly change, depending on what else is going on in the system.
Second, most system calls need to be run untranslated, and their return codes can change as well, leading to different execution paths in your program and, thus, slightly different metrics being collected. (As an aside, Callgrind offers an option to record the wall clock time spent during syscalls, which will always be affected by what else goes on in the system.)
More details about these restrictions can be found in the PhD Dissertation of Nicholas Nethercote ("Dynamic Binary Analysis and Instrumentation").

How to profile multi-threaded C++ application on Linux?

I used to do all my Linux profiling with gprof.
However, with my multi-threaded application, its output appears to be inconsistent.
Now, I dug this up:
http://sam.zoy.org/writings/programming/gprof.html
However, it's from a long time ago and in my gprof output, it appears my gprof is listing functions used by non-main threads.
So, my questions are:
In 2010, can I easily use gprof to profile multi-threaded Linux C++ applications? (Ubuntu 9.10)
What other tools should I look into for profiling?
Edit: added another answer on poor man's profiler, which IMHO is better for multithreaded apps.
Have a look at oprofile. The profiling overhead of this tool is negligible and it supports multithreaded applications, as long as you don't want to profile mutex contention (which is a very important part of profiling multithreaded applications).
Have a look at poor man's profiler. Surprisingly, few other tools do both CPU profiling and mutex contention profiling for multithreaded applications. PMP does both, and it doesn't even require installing anything (as long as you have gdb).
Try the modern Linux profiling tool, perf (perf_events); see https://perf.wiki.kernel.org/index.php/Tutorial and http://www.brendangregg.com/perf.html:
perf record ./application
# generates profile file perf.data
perf report
Have a look at Valgrind.
As Paul R said, have a look at Zoom. You can also use lsstack, which is a low-tech approach but surprisingly effective compared to gprof.
Added: Since you clarified that you are running OpenGL at 33ms, my prior recommendation stands. In addition, what I personally have done in situations like that is both effective and non-intuitive. Just get it running with a typical or problematic workload, and just stop it, manually, in its tracks, and see what it's doing and why. Do this several times.
Now, if it only occasionally misbehaves, you would like to stop it only while it's misbehaving. That's not easy, but I've used an alarm-clock interrupt set for just the right delay. For example, if one frame out of 100 takes more than 33ms, at the start of a frame, set the timer for 35ms, and at the end of a frame, turn it off. That way, it will interrupt only when the code is taking too long, and it will show you why. Of course, one sample might miss the guilty code, but 20 samples won't miss it.
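The alarm-clock trick described above can be sketched with a POSIX interval timer. This is a minimal illustration under stated assumptions, not the author's actual code: run_frame_with_watchdog, on_frame_alarm and g_frame_overran are hypothetical names, the timer measures wall-clock time via ITIMER_REAL, and the handler only records that the budget was exceeded.

```cpp
#include <cassert>
#include <chrono>
#include <signal.h>
#include <sys/time.h>

// Set by the signal handler when a frame exceeds its budget.
static volatile sig_atomic_t g_frame_overran = 0;

// In a real session you would put a debugger breakpoint here (or call abort()
// to get a core dump) so the stack shows what the slow frame was doing.
extern "C" void on_frame_alarm(int) { g_frame_overran = 1; }

// Runs one frame with a one-shot wall-clock timer armed for `budget`.
// Returns true if the timer fired, i.e. the frame took longer than `budget`.
template <class Frame>
bool run_frame_with_watchdog(Frame&& frame, std::chrono::milliseconds budget) {
    assert(budget.count() > 0);  // an all-zero it_value would disarm the timer

    struct sigaction action = {};
    action.sa_handler = on_frame_alarm;
    sigemptyset(&action.sa_mask);
    sigaction(SIGALRM, &action, nullptr);

    g_frame_overran = 0;
    itimerval timer = {};
    timer.it_value.tv_sec = static_cast<time_t>(budget.count() / 1000);
    timer.it_value.tv_usec =
        static_cast<suseconds_t>((budget.count() % 1000) * 1000);
    setitimer(ITIMER_REAL, &timer, nullptr);  // start of frame: arm the timer

    frame();

    itimerval off = {};
    setitimer(ITIMER_REAL, &off, nullptr);    // end of frame: turn it off
    return g_frame_overran != 0;
}
```

For the 33ms case, the frame body would be wrapped as run_frame_with_watchdog(render_frame, std::chrono::milliseconds(35)), where render_frame stands for whatever the application calls per frame. In a multi-threaded program the signal may be delivered to any thread that has not blocked SIGALRM, so the stack you care about may belong to a different thread than the one the handler runs on.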
I tried valgrind and gprof. It is a crying shame that neither of them works well with multi-threaded applications. Later, I found Intel VTune Amplifier. The good thing is, it handles multi-threading well, works with most of the major languages, works on Windows and Linux, and has many great profiling features. Moreover, the application itself is free. However, it only works with Intel processors.
You can run pstack at random moments, e.g. 10 or 20 times, to capture the stack at each of those points.
The most typical stack is where the application spends most of the time (according to experience, we can assume a Pareto distribution).
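Picking out the most typical stack from a handful of samples is a simple counting exercise. A minimal sketch, assuming each captured stack has already been flattened into one string such as "main;render;draw" (that encoding and the name most_common_stack are assumptions for illustration, not something pstack produces directly):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns the stack that occurs most often together with its sample count.
// Ties are resolved in favour of the lexicographically smallest stack.
std::pair<std::string, std::size_t>
most_common_stack(const std::vector<std::string>& samples) {
    assert(!samples.empty());
    std::map<std::string, std::size_t> counts;
    for (const auto& stack : samples) ++counts[stack];

    std::pair<std::string, std::size_t> best{"", 0};
    for (const auto& entry : counts) {
        if (entry.second > best.second) {
            best = std::make_pair(entry.first, entry.second);
        }
    }
    return best;
}
```

The count divided by the number of samples gives a rough estimate of the fraction of time spent in that stack; with only 10 or 20 samples the estimate is coarse, but it is usually enough to point at the dominant cost.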
You can combine that knowledge with strace or truss (Solaris) to trace system calls, and pmap for the memory print.
If the application runs on a dedicated system, you also have sar to measure CPU, memory, I/O, etc., to profile the overall system.
Since you didn't mention non-commercial, may I suggest Intel's VTune. It's not free but the level of detail is very impressive (and the overhead is negligible).
Putting a slightly different twist on matters, you can actually get a pretty good idea of what's going on in a multithreaded application using ftrace and kernelshark. Collect the right trace, press the right buttons, and you can see the scheduling of individual threads.
Depending on your distro's kernel you may have to build a kernel with the right configuration (but I think that a lot of them have it built in these days).
Microprofile is another possible answer to this. It requires hand-instrumentation of the code, but it seems like it handles multi-threaded code pretty well. And it also has special hooks for profiling graphics pipelines, including what's going on inside the card itself.