What is the effect of Callgrind in code perfomance? - c++

I am using Callgrind in order to see how many times specific functions are called. However, I am also interested in the execution time.
I know the programs take much longer when running on Callgrind, since it has to take information. However, what I am surprised about is that how the time changes. In my case, I am running two different versions of the Fast Marching Method (FMM and Simplified FMM) on 2D and 3D grids. The results are as follow:
In 2D the ratio FMM/SFMM is not kept at all, but at least it is always >1 (it takes always longes for FMM than for SFMM). However, in 3D the effect of Callgrind is completely the opposite, the times are completely changed: SFMM takes shorter will callgrind but longer in regular execution.
The compilation I am using (-Ofast, -fno-finite-math-only) is the same all the time and the same binaries are being run in callgrind and regular running ./bin-name
The time measuring functions are those from std::chrono.
Therefore, the question is: as I am using the same binary in all cases, how is it possible that the same binary behaves so differently? Are the other data I am getting (function calls,% time cost, etc) reliable in this case? Callgrind-like results were expected when running the binaries with regular execution command.
EDIT: in the implemtation, the main change is that in FMM I am using the Boost Fibonacci heap and in the SFMM I am using a small modification with a Boost Priority Queue.
Thank you!

Related

Getting consistent callgrind output over multiple runs

I've been using valgrind to do some profiling on a C++ codebase, and based on my understanding it's performing sampling of individual calls to build its profile data. As a result, I can profile the same application twice using something like this:
valgrind --tool=callgrind myprogram.exe
and get different results each time. For instance, my last two runs I did gave me the following output at the bottom:
==70741==
==70741== Events : Ir
==70741== Collected : 196823780
==70741==
==70741== I refs: 196,823,780
and
==70758==
==70758== Events : Ir
==70758== Collected : 195728098
==70758==
==70758== I refs: 195,728,098
this all makes complete sense to me. However, I'm currently in the process of optimizing my codebase and I'd like to be able to make a change and determine if it does improve performance. Due to sampling, it seems that running callgrind alone will not be sufficient since I can get different numbers on each run. As a result, it would be difficult to determine if my latest run just ran faster just due to random sampling, or my change actually made a significant difference (in the statistical sense, why not?).
My question is: Is there a way I can force callgrind to be consistent in it's sampling? Or, is there some more appropriate higher-level tool that would allow me to understand how a refactor affects performance? I'm currently developing in Mac OSX Sierra.
callgrind does not use sampling technique. Instead, it exactly "counts" the instructions that are executed. So, if a program is run twice under callgrind and does exactly the same, then it will give the same result. However, as soon as your program is doing non trivial things (e.g. uses a lot of libraries), such libraries might do slightly different things, depending e.g. on the clock or system load or content of env or ...
So, the differences you see are not due to callgrind, it is because really your program and/or the libraries it is using are each time doing slightly different things.
You can use some callgrind output file visualisation or reporting tools, such as kcachegrind to analyse the difference between 2 runs. You might then see what is the origin of these differences and eliminate them. Otherwise, you should be able to determine the effect of your changes by only looking at the cost of the functions you are interested in.

Is callgrind profiling influenced by other processes?

I'd like to profile my application using callgrind. Now, since it takes a very long time, in the meanwhile I go on with web-browsing, compiling and other intensive tasks on the same machine.
Am I biasing the profiling results? I'm expecting that, since valgrind uses a simulated CPU, other external processes should not interfere with valgrind execution. Am I right?
By default, Callgrind does not record anything related to time, so you can expect all collected metrics to (mostly) be independent of other processes on the machine. As the Callgrind manual states,
By default, the collected data consists of the number of instructions executed, their relationship to source lines, the caller/callee relationship between functions, and the numbers of such calls.
As such, the metrics Callgrind reports should only depend on what instructions the program is executing on the (simulated) CPU - not on how much time such instructions take. Indeed, many times the output of Callgrind can be somewhat misleading, as the simulated CPU might operate different to the real one (particularly when it comes to branch prediction).
The Callgrind paper presented at ICCS 2004 is very clear about this as well:
We note that the simulation is not able to predict consumed wall clock time, as this would need a detailed simulation of the microarchitecture.
In any case, however, the simulated CPU is unaffected by what the real CPU is doing.
The reason is straightforward.
Like you said, your program is not executed on your machine at all.
Instead, at runtime, Valgrind dynamically translates your program, that is, it disassembles the binary into "UCode" for an simulated machine, adds analysis code (called instrumentation), then generates binary code that executes the simulation.
The addition of analysis code is what makes instruction counting (in Callgrind), memory checking (in Memcheck), and all other plugins possible.
Therein lies the twist, however.
Naturally there are limits to how isolated the program can run in such a dynamic simulation.
First, your program might interact with other programs.
While the time spent for doing so is irrelevant (as it is not accounted for), the return codes of inter-process communication can certainly change, depending on what else is going on in the system.
Second, most system calls need to be run untranslated and their return codes can change as well -- leading to different execution paths of your program and, thus, slightly different metrics being collected. (As an aside, Calgrind offers an option to record the wall clock time spent during syscalls, which will always be affected by what else goes on in the system).
More details about these restrictions can be found in the PhD Dissertation of Nicholas Nethercote ("Dynamic Binary Analysis and Instrumentation").

Why does my logging library cause performance tests to run faster?

I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many venues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem or that the cause will be obvious to someone who has more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lockless queue (Boost.Lockless) and a back-end thread that performs string formatting and writes the log entries to disk.
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Run it with the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job to make it run for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation. But maybe I would be able to interpret them differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but results were still weird and anyway I couldn't do the change in any of the multithreaded variants.
Compare the generated assembly code. There were differences but the logging build was clearly doing more things. Nothing stood out as being an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to, std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.

C++ performance weirdness w/ OpenGL

I am rewriting some rendering C code in C++. The old C code basically computes everything it needs and renders it at each frame. The new C++ code instead pre-computes what it needs and stores that as a linked list.
Now, actual rendering operations are translations, colour changes and calls to GL lists.
While executing the operations in the linked list should be pretty straightforward, it would appear that the resulting method call takes longer than the old version (which computes everything each time - I have of course made sure that the new version isn't recomputing).
The weird thing? It executes less OpenGL operations than the old version. But it gets weirder. When I added counters for each type of operation, and a good old printf at the end of the method, it got faster - both gprof and manual measurements confirm this.
I also bothered to take a look at the assembly code generated by G++ in both cases (with and without trace), and there is no major change (which was my initial suspicion) - the only differences are a few more stack words allocated for counters, increasing said counters, and preparing for printf followed by a jump to it.
Also, this holds true with both -O2 and -O3. I am using gcc 4.4.5 and gprof 2.20.51 on Ubuntu Maverick.
I guess my question is: what's happening? What am I doing wrong? Is something throwing off both my measurements and gprof?
By spending time in printf, you may be avoiding stalls in your next OpenGL call.
Without more information, it is difficult to know what is happening here, but here are a few hints:
Are you sure the OpenGL calls are the same? You can use some tool to compare the calls issued. Make sure there was no state change introduced by the possibly different order things are done.
Have you tried to use a profiler at runtime? If you have many objects, the simple fact of chasing pointers while looping over the list could introduce cache misses.
Have you identified a particular bottleneck, either on the CPU side or GPU side?
Here is my own guess on what could be going wrong. The calls you send to your GPU take some time to complete: the previous code, by mixing CPU operations and GPU calls, made CPU and GPU work in parallel; on the contrary the new code first makes the CPU computes many things while the GPU is idling, then feeds the GPU with all the work to get done while the CPU has nothing left to do.

Using gprof with sockets

I have a program I want to profile with gprof. The problem (seemingly) is that it uses sockets. So I get things like this:
::select(): Interrupted system call
I hit this problem a while back, gave up, and moved on. But I would really like to be able to profile my code, using gprof if possible. What can I do? Is there a gprof option I'm missing? A socket option? Is gprof totally useless in the presence of these types of system calls? If so, is there a viable alternative?
EDIT: Platform:
Linux 2.6 (x64)
GCC 4.4.1
gprof 2.19
The socket code needs to handle interrupted system calls regardless of profiler, but under profiler it's unavoidable. This means having code like.
if ( errno == EINTR ) { ...
after each system call.
Take a look, for example, here for the background.
gprof (here's the paper) is reliable, but it only was ever intended to measure changes, and even for that, it only measures CPU-bound issues. It was never advertised to be useful for locating problems. That is an idea that other people layered on top of it.
Consider this method.
Another good option, if you don't mind spending some money, is Zoom.
Added: If I can just give you an example. Suppose you have a call-hierarchy where Main calls A some number of times, A calls B some number of times, B calls C some number of times, and C waits for some I/O with a socket or file, and that's basically all the program does. Now, further suppose that the number of times each routine calls the next one down is 25% more times than it really needs to. Since 1.25^3 is about 2, that means the entire program takes twice as long to run as it really needs to.
In the first place, since all the time is spent waiting for I/O gprof will tell you nothing about how that time is spent, because it only looks at "running" time.
Second, suppose (just for argument) it did count the I/O time. It could give you a call graph, basically saying that each routine takes 100% of the time. What does that tell you? Nothing more than you already know.
However, if you take a small number of stack samples, you will see on every one of them the lines of code where each routine calls the next.
In other words, it's not just giving you a rough percentage time estimate, it is pointing you at specific lines of code that are costly.
You can look at each line of code and ask if there is a way to do it fewer times. Assuming you do this, you will get the factor of 2 speedup.
People get big factors this way. In my experience, the number of call levels can easily be 30 or more. Every call seems necessary, until you ask if it can be avoided. Even small numbers of avoidable calls can have a huge effect over that many layers.