perf samples do not add up to 100% - profiling

When analyzing with Linux perf:
perf record -g -e cycles my-executable
perf report
... the outermost invocation is shown to only use 50.6% of execution time, although the manual asserts it should always be 100%. Where do these samples go?
I'm aware a similar question exists; I think my problem is different because my application runs only for ~5s, and I do not expect the core to idle during this time.

Related

attaching callgrind/valgrind to program in mid-way of its execution

I use Valgrind to detect issues by running a program from the beginning.
Now I have memory/performance issues in a very specific moment in the program. Unfortunately there is no feasible way to make a shortcut to this place from the start.
Is there a way to instrument a C++ program (with Valgrind/Callgrind) midway through its execution, like attaching to the process?
Already answered here:
How use callgrind to profiling only a certain period of program execution?
There is no way to use valgrind on an already running program.
For callgrind, you can however somewhat speed it up by only recording data later during execution.
For this, you might e.g. look at callgrind options
--instr-atstart=no|yes Do instrumentation at callgrind start [yes]
--collect-atstart=no|yes Collect at process/thread start [yes]
--toggle-collect=<func> Toggle collection on enter/leave function
You can also control such aspects from within your program.
Refer to valgrind/callgrind user manual for more details.
There are two things that Callgrind does that slow down execution.
Counting operations (the collect part)
Simulating the cache and branch predictor (the instrumentation part). This depends on the --cache-sim and --branch-sim options, which both default to "no". If you are using these options and you disable instrumentation then I expect that there will be some impact on the accuracy of the modelling as the cache and predictor won't be "warm" when they are toggled.
There are a few other approaches that you could use.
Use the client request mechanisms. This would require you to include a Valgrind header and add a few lines that use the Valgrind macros CALLGRIND_START_INSTRUMENTATION/CALLGRIND_STOP_INSTRUMENTATION and CALLGRIND_TOGGLE_COLLECT to start and stop instrumentation/collection. See the manual for details. Then just run your application under Valgrind with --instr-atstart=no --collect-atstart=no
Use the gdb monitor commands. In this case you would have two terminals. In the first you would run Valgrind with --instr-atstart=no --collect-atstart=no --vgdb=yes. In the second terminal run gdb yourapplication then from the gdb prompt (gdb) target remote | vgdb. Then you can use the monitor commands as described in the manual, which include collect and instrumentation control.
callgrind_control, part of the Valgrind distribution. To be honest I've never used this.
I recently did some profiling using the first technique without cache/branch sim. When I used Callgrind on the whole run I saw a 23x runtime increase compared to running outside of Callgrind. When I profiled only the one function that I wanted to analyze, this fell to about a 5x slowdown. Obviously this will vary greatly case by case.
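The first technique above (client requests) can be sketched roughly as follows. This is only a minimal sketch, not code from the answer: hot_region and run_profiled are hypothetical names, and the fallback macro definitions are an assumption added so the file still compiles on machines where the Valgrind headers are not installed.

```cpp
// Sketch: restrict Callgrind measurement to one region via client requests.
#include <numeric>
#include <vector>

#if __has_include(<valgrind/callgrind.h>)
#include <valgrind/callgrind.h>
#else
// Fallback (assumption): no-op macros when the Valgrind headers are absent.
#define CALLGRIND_START_INSTRUMENTATION
#define CALLGRIND_STOP_INSTRUMENTATION
#define CALLGRIND_TOGGLE_COLLECT
#endif

// Hypothetical stand-in for the code you actually want to profile.
long hot_region(const std::vector<int>& data) {
    return std::accumulate(data.begin(), data.end(), 0L);
}

// Turns instrumentation and collection on only around hot_region.
long run_profiled(const std::vector<int>& data) {
    CALLGRIND_START_INSTRUMENTATION;  // instrumentation on
    CALLGRIND_TOGGLE_COLLECT;         // collection on (it starts off)
    long result = hot_region(data);
    CALLGRIND_TOGGLE_COLLECT;         // collection off again
    CALLGRIND_STOP_INSTRUMENTATION;   // instrumentation off
    return result;
}
```

Call run_profiled from your program and start it with valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no so that nothing is measured outside the marked region. Outside Valgrind the macros do nothing useful and the program runs at close to normal speed.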
Thanks all for help.
It seems this has already been answered here: How use callgrind to profiling only a certain period of program execution?
but not marked as answered for some reason.
Summary:
Starting callgrind with instrumentation off
valgrind --tool=callgrind --instr-atstart=no <PROG>
Controlling instrumentation (can be executed in another shell)
callgrind_control -i on/off

profiling linux application with perf record

I've been trying to profile my C++ application in Linux by following this article on perf record. My understanding is all I need to do is run perf record program [program_options], where program is the program executable and [program options] are the arguments I want to pass to the program. However, when I try to profile my application like this:
perf record ./csvJsonTransducer -enable-AVX-deletion test.csv testout.json
perf returns almost immediately with a report. It takes nearly 30 seconds to run ./csvJsonTransducer -enable-AVX-deletion test.csv testout.json without perf, though, and I want perf to monitor my program for the entirety of its execution, not return immediately. Why is perf returning so quickly? How can I make it take the entire run of my program into account?
Your command seems OK. Try changing the paranoid level in /proc/sys/kernel/perf_event_paranoid. Setting this parameter to -1 (as root) should solve permission issues:
echo "-1" > /proc/sys/kernel/perf_event_paranoid
You can also try to set the event that you want to monitor with perf record. The default event is cycles (if supported). Check man perf-list.
Try the command:
perf record -e cycles ./csvJsonTransducer -enable-AVX-deletion test.csv testout.json
to force the monitoring of cycles.

Timing in OpenMPI

Quick question about timing in OpenMPI. I see that with qstat -a I can show the wall time, and with qstat I can see the CPU time. Is it possible to have these two values written to the output file when the job is done so I can check the performance of my code?
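One option that does not depend on the batch system is to have the program record both values itself and write them out when it finishes. The following is a minimal sketch under that assumption; JobTimer is a hypothetical helper, it uses only the C++ standard library rather than any MPI or qstat facility, and std::clock covers only the calling process, so in an MPI job each rank would report its own CPU time.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>

// Hypothetical helper: starts timing on construction.
struct JobTimer {
    std::chrono::steady_clock::time_point wall_start =
        std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    // Elapsed wall-clock time in seconds.
    double wall_seconds() const {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - wall_start)
            .count();
    }

    // CPU time consumed by this process in seconds.
    double cpu_seconds() const {
        return static_cast<double>(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    }

    // Write both values to an already opened output file.
    void report(std::FILE* out) const {
        std::fprintf(out, "wall time: %.3f s, CPU time: %.3f s\n",
                     wall_seconds(), cpu_seconds());
    }
};
```

Create a JobTimer at the top of main and call report(stdout), or pass the job's output file, just before exiting. A wall time much larger than the CPU time suggests the process spent most of its time waiting rather than computing.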

perf.data file has no samples

I am using perf 3.0.4 on Ubuntu 11.10. Its record command works well and reports 256 samples collected on the terminal. But when I run perf report, it gives me the following error:
perf.data file has no samples
I searched a lot for the solution but no success yet.
This thread has some useful information: http://www.spinics.net/lists/linux-perf-users/msg01436.html
It seems that if you are running in a VM that does not expose the PMU to the guest, the default collection (-e cycles) won't work. Try running with -e cpu-clock. According to that thread, the OP had the same problem also in a real host running Ubuntu 10.04, so it might solve it for you too...
The number of samples reported by the perf record command is an approximation and not the correct number of events (see perf wiki here).
To get the accurate number of events, dump the raw file and use wc -l to count the number of results:
perf report -D -i perf.data | grep RECORD_SAMPLE | wc -l
This command should report 0 in your case where perf report says it can't find events.
Let us know more about how you use perf record: which event you are sampling, on which hardware, and with which program.
EDIT: you can try first to increase the sampling period or frequency with the -c or -F options
Whenever I run into this on a machine where perf record has worked in the past, it is because I have left something else running that uses the performance counters, e.g., I have perf top running in another terminal tab.
In this case, it seems that perf record simply doesn't record any PMU related samples.

Linux time sample based profiler

short version:
Is there a good time based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop, spawning c++filt to demangle a c++ name. I only stumbled upon the code by accident while chasing down another bottleneck. The OProfile didn't show anything unusual about the code so I almost ignored it but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle. The runtime went from more than a minute to a little over a second. About a x60 speed up.
Is there a way I could have configured OProfile to flag the popen call? As the profile data stands now, OProfile thinks the bottleneck was the heap and std::string calls (which, BTW, once optimized dropped the runtime to less than a second, more than a x2 speed up).
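For readers unfamiliar with the replacement mentioned above, calling abi::__cxa_demangle in-process instead of spawning c++filt might look roughly like this. It is a minimal sketch assuming GCC or Clang with <cxxabi.h>; the demangle wrapper is a hypothetical helper, not the asker's actual code.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>   // std::free
#include <memory>
#include <string>

// Demangle a C++ symbol name in-process, with no popen and no child process.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // With a null output buffer the result is allocated with malloc,
    // so it has to be released with free.
    std::unique_ptr<char, void (*)(void*)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    if (status == 0 && out) {
        return std::string(out.get());
    }
    return mangled;
}
```

For example, demangle("_ZN3foo3barEv") yields "foo::bar()". Each call is a plain library call, whereas the popen version pays for a process spawn on every iteration of the loop.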
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only logs its samples to the currently running process. I'd like it to always log its samples to the process I'm profiling. So if the process is currently switched out (blocking on IO or a popen call) OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.
Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize for each line of code that appears on samples, the percent of samples the line appears on. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.
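The per-line summary described above can be sketched as follows. This is a minimal sketch that assumes the stack samples have already been collected as lists of "file:line" strings; StackSample and percent_of_samples are hypothetical names, and no particular profiler produces exactly this format.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// One stack sample: the source locations on the call stack at the moment
// the program was interrupted (hypothetical "file:line" strings).
using StackSample = std::vector<std::string>;

// For each location, the percentage of samples on which it appears.
std::map<std::string, double> percent_of_samples(
    const std::vector<StackSample>& samples) {
    std::map<std::string, double> result;
    if (samples.empty()) {
        return result;
    }
    std::map<std::string, int> counts;
    for (const StackSample& sample : samples) {
        // Count a location once per sample, even if recursion repeats it.
        std::set<std::string> unique(sample.begin(), sample.end());
        for (const std::string& location : unique) {
            ++counts[location];
        }
    }
    for (const auto& entry : counts) {
        result[entry.first] = 100.0 * entry.second / samples.size();
    }
    return result;
}
```

A line that shows up on 75% of wall-clock samples is responsible for roughly 75% of the elapsed time, whether that time is spent computing or blocked in a call such as popen.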
Try Zoom - I believe it will let you profile all processes - it would be interesting to know if it highlights your problem in this case.
I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler
Quickly hacked up trivial sampling profiler for linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends backtrace(3) to a file on SIGUSR1, and then converts it to annotated source.
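The mechanism described above, dumping a backtrace to a file whenever SIGUSR1 arrives, can be sketched like this. It is not the linked tool's code, only a minimal illustration assuming glibc's <execinfo.h>; install_sampler, on_sigusr1 and g_sample_fd are hypothetical names, and converting the raw dump to annotated source is left out.

```cpp
#include <execinfo.h>  // backtrace, backtrace_symbols_fd (glibc)
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

namespace {

int g_sample_fd = -1;  // file that receives the raw stack dumps

// Signal handler: append the current call stack to the sample file.
void on_sigusr1(int) {
    void* frames[64];
    int depth = backtrace(frames, 64);
    // backtrace_symbols_fd writes directly to the descriptor and,
    // unlike backtrace_symbols, does not call malloc.
    backtrace_symbols_fd(frames, depth, g_sample_fd);
    const char separator[] = "--\n";
    ssize_t written = write(g_sample_fd, separator, sizeof(separator) - 1);
    (void)written;  // nothing sensible to do on failure inside a handler
}

}  // namespace

// Opens the sample file and installs the handler. Returns true on success.
bool install_sampler(const char* path) {
    g_sample_fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (g_sample_fd < 0) {
        return false;
    }
    // The first backtrace call may load libgcc dynamically, so trigger
    // that here rather than inside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction action = {};
    action.sa_handler = on_sigusr1;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESTART;
    return sigaction(SIGUSR1, &action, nullptr) == 0;
}
```

Call install_sampler once at startup, then send SIGUSR1 to the process periodically from another shell (for example with kill -USR1 and the process ID). Because the signal is delivered on wall-clock time, samples also land on blocking calls. Linking with -rdynamic makes the symbol names in the dump more useful.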
After trying everything suggested here (except for the now-defunct Zoom, which is still available as a huge file from Dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed above in some of the answers either wouldn't build for me or didn't work for me. I spent all day trying stuff... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey, interrupting the target process with GDB periodically and harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.