mysterious rtm abort using haswell tsx - c++

I'm experimenting with the TSX extensions in Haswell by adapting an existing medium-sized (thousands of lines) codebase to use GCC's transactional memory extensions (which indirectly use Haswell TSX on this machine) instead of coarse-grained locks. I am using GCC's transactional_memory extensions, not writing my own _xbegin / _xend directly, and I am running with ITM_DEFAULT_METHOD=htm.
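To make the setup concrete, here is a rough sketch of the kind of conversion involved (the names are made up for illustration, not from the real codebase); I compile with g++ -fgnu-tm and run with ITM_DEFAULT_METHOD=htm so libitm tries the hardware path first:
// Before: coarse-grained lock around a shared update.
#include <mutex>

int counters[64];
std::mutex big_lock;

void bump_locked(int i) {
    std::lock_guard<std::mutex> guard(big_lock);
    ++counters[i];
}

// After: GCC TM region (requires -fgnu-tm). With ITM_DEFAULT_METHOD=htm,
// libitm first attempts this as a hardware (RTM) transaction and only falls
// back to a software method after repeated aborts.
void bump_tm(int i) {
    __transaction_atomic {
        ++counters[i];
    }
}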
I'm having trouble getting it to run fast enough, because I see high rates of hardware transaction aborts for mysterious reasons. As shown below, these aborts are due neither to conflicts nor to capacity limitations.
Here is the perf command I used to quantify the failure rate and underlying causes:
perf stat \
-e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
-e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
-e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
-e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
-e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
-e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
-e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
-e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
-e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
-e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
-e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
-e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
-e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
-e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
-e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \
./myprogram -th 1 -reps 3000000
So, the program runs some code with transactions in it 30 million times. Each request involves one gcc __transaction_atomic block. There is only one thread in this run.
This particular perf command captures most of the relevant tsx performance events described in the Intel software developers manual vol 3.
The output from perf stat is the following:
0 tx_mem_abort_capacity_write [26.66%]
0 tx_mem_abort_conflict [26.65%]
29,937,894 tx_exec_misc1 [26.71%]
0 tx_exec_misc2 [26.74%]
0 tx_exec_misc3 [26.80%]
0 tx_exec_misc4 [26.92%]
0 tx_exec_misc5 [26.83%]
29,906,632 rtm_retired_start [26.79%]
0 rtm_retired_commit [26.70%]
29,985,423 rtm_retired_aborted [26.66%]
0 rtm_retired_aborted_misc1 [26.75%]
0 rtm_retired_aborted_misc2 [26.73%]
29,927,923 rtm_retired_aborted_misc3 [26.71%]
0 rtm_retired_aborted_misc4 [26.69%]
176 rtm_retired_aborted_misc5 [26.67%]
10.583607595 seconds time elapsed
As you can see from the output:
The rtm_retired_start count is 30 million (matches input to program)
The rtm_retired_aborted count is about the same (no commits at all)
The abort_conflict and abort_capacity counts are 0, so these are not the reasons. Also, recall that only one thread is running, so conflicts should be rare anyway.
The only actual leads here are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, which are somewhat similar in description.
The Intel manual (vol 3) defines rtm_retired_aborted_misc3 counters:
code: C9H 20H
mnemonic: RTM_RETIRED.ABORTED_MISC3
description: Number of times an RTM execution aborted due to HLE unfriendly instructions.
The definition for tx_exec_misc1 has some similar words:
code: 5DH 01H
mnemonic: TX_EXEC.MISC1
description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
I checked the assembly location of the aborts using perf record / perf report with high-precision (PEBS) support for rtm_retired_aborted. The location has a mov instruction from register to register; no weird instruction names are seen nearby.
Update:
Here are two things I've tried since then:
1) The tx_exec_misc1 and rtm_retired_aborted_misc3 signature we see here can be obtained, for example, by a dummy block of the form
for (int i = 0; i < 10000000; i++) {
    __transaction_atomic {
        _xabort(1);
    }
}
or one of the form
for (int i = 0; i < 10000000; i++) {
    __transaction_atomic {
        printf("hello");
        fflush(stdout);
    }
}
In both cases, the perf counters look similar to what I see. However, in both cases the perf report for -e cpu/tx-abort/ points to the intuitively correct assembly lines: an xabort instruction for the first example and a syscall for the second. In the real codebase, the perf report points to a stack push right at the start of a function:
: 00000000004167e0 <myns::myfun()>:
100.00 : 4167e0: push %rbp
0.00 : 4167e1: mov %rsp,%rbp
0.00 : 4167e4: push %r15
2) I have also run the same command under the Intel Software Development Emulator. It turns out that the problem goes away in that case: I get no aborts as far as the application is concerned.

Though it's been a while, I found this unanswered question while searching, so here's the answer: this is a hardware bug in Haswell and early Broadwell chips.
The particular hardware erratum assigned by Intel is HSW136, and it is not fixable with microcode updates. Indeed, I think it was in stepping 4 that the feature was no longer reported as available by the cpuid instruction, even though the (faulty) silicon to implement it was still on the chip.
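If you want to quickly check whether a given part is affected, a small standalone sketch along these lines (assuming an RTM-capable toolchain; compile with -mrtm) runs a trivial transaction in a loop and reports the commit rate, which will be essentially zero on chips hit by this erratum or with TSX disabled:
#include <immintrin.h>
#include <cstdio>

int main() {
    const int reps = 100000;
    int commits = 0;
    volatile int x = 0;
    for (int i = 0; i < reps; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            x = x + 1;   // trivial transactional write
            _xend();     // commit
            ++commits;
        }
        // On abort, control returns to _xbegin() with an abort code in
        // status, so the branch above is simply not taken.
    }
    std::printf("committed %d of %d transactions\n", commits, reps);
    return 0;
}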

Related

Inconsistency when benchmarking two contiguous measurements

I was benchmarking a function and I saw that some iterations were slower than others.
After some tests I tried to benchmark two contiguous measurements, and I still got some weird results.
The code is on wandbox.
For me the important part is :
using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
    auto t1 = clock::now();
    auto t2 = clock::now();
}
The loop is optimized away as we can see on godbolt.
call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()
The code was compiled with:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) on an Intel® Xeon® W-2195 Processor.
I was the only user on the machine and I tried running with and without high priority (nice or chrt), and the result was the same.
The result I got with 100 000 000 iterations was:
The Y-axis is in nanoseconds, it's the result of the line
std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
These 4 lines make me think of: no cache miss, an L1 cache miss, an L2 cache miss and an L3 cache miss (even if the "L3 cache miss" line seems to be too close to the L2 line).
I am not sure why there would be cache misses; maybe it's the storage of the result, but that's not in the measured code.
I have tried running the program 10 000 times with a loop of 1500, because the L1 cache of this processor is:
lscpu | grep L1
L1d cache: 32K
L1i cache: 32K
And 1500 * 16 bits = 24 000 bits = 3 000 bytes, which is less than 32K, so there shouldn't be any cache misses.
And the results:
I still have my 4 lines (and some noise).
So if it's really a cache miss, I don't have any idea why it is happening.
I don't know if it's useful for you, but I ran:
sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load ./a.out 1000
with the values 1 000 / 10 000 / 100 000 / 1 000 000.
I got between 4.30% and 4.70% of all L1-dcache hits, which seems pretty decent to me.
So the questions are:
What is the cause of these slowdowns?
How do I produce a good benchmark of a function when I can't even get a constant time for a no-op?
PS: I don't know if I am missing useful information / flags; feel free to ask!
How reproduce:
The code:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv)
{
    int statSize = 1000;
    using clock = std::chrono::steady_clock;
    if (argc == 2)
    {
        statSize = std::atoi(argv[1]);
    }
    std::vector<uint16_t> temps;
    temps.reserve(statSize);
    for (int i = 0; i < statSize; i++)
    {
        auto t1 = clock::now();
        auto t2 = clock::now();
        temps.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    }
    for (auto t : temps)
        std::cout << (int)t << std::endl;
    return (0);
}
Build:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
Generate output (sudo needed):
In this case I run the program 10 000 times. Each run takes 100 measurements, and I remove the first one, which is always about 5 times slower:
for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99 >> fast_1_000_000_uint16_100 ; done
Generate graph:
cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"
The result I have on my machine:
Where I am after Zulan's answer and all the comments
The current_clocksource is set to tsc and no switch is seen in dmesg; command used:
dmesg -T | grep tsc
I used this script to disable HyperThreading (HT),
then
grep -c proc /proc/cpuinfo
=> 18
Subtract 1 from the last result to obtain the last available core:
=> 17
Edit /etc/default/grub and add isolcpus=(last result) to GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="isolcpus=17"
Finally:
sudo update-grub
reboot
// reexecute the script
Now I can use:
taskset -c 17 ./a.out XXXX
So I run the program (a loop of 100 iterations) 10 000 times:
for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100 > ./core17/run_$i 2>&1 ; done
Check if there is any Involuntary context switches:
grep -L "Involuntary context switches: 0" result/* | wc -l
=> 0
There are none, good. Let's plot:
for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"
Result :
There are still 22 measurements greater than 1000 (while most values are around 20) that I don't understand.
Next step, TBD:
Do the part
sudo nice -n -17 perf record...
of Zulan's answer.
I can't reproduce it with these particular clustered lines, but here is some general information.
Possible causes
As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least
The scheduling tick timer
Kernel tasks that are bound to a certain core
Your task may be migrated from one core to another for an arbitrary reason
You can use isolcpus and taskset to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz=full to disable the scheduling tick. You should also disable hyperthreading to get exclusive access to a core from a hardware thread.
Except for taskset, which I absolutely recommend for any performance measurement, these are quite unusual measures.
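If you would rather pin from inside the benchmark than wrap it in taskset, the same effect can be obtained with sched_setaffinity; a minimal Linux-only sketch (core 17 here is just the isolated core from the question):
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;                      // g++ defines _GNU_SOURCE, which
    CPU_ZERO(&set);                     // sched.h needs for cpu_set_t / CPU_SET
    CPU_SET(17, &set);                  // 17 = the isolated core from above
    if (sched_setaffinity(0 /* this process */, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return 1;
    }
    // ... run the measurement loop here ...
    return 0;
}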
Measure instead of guessing
If there is a suspicion what could be happening, you can usually setup a measurement to confirm or disprove that hypothesis. perf and tracepoints are great for that. For example, we can start with looking at scheduling activity and some interrupts:
sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...
perf script will now list every occurrence. To correlate that with slow iterations, you can use perf probe and a slightly modified benchmark:
void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
    (void)count;
}

...

auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
if (count > 100) {
    record_slow(count);
}
temps.push_back(count);
And compile with -g
sudo perf probe -x ./a.out record_slow count
Then add -e probe_a:record_slow to the call to perf record. Now if you are lucky, you find some close events, e.g.:
a.out 14888 [005] 51213.829062: irq:softirq_entry: vec=1 [action=TIMER]
a.out 14888 [005] 51213.829068: probe_a:record_slow: (559354aec479) count=9029
Be aware: while this information will likely explain some of your observations, you enter a world of even more puzzling questions and oddities. Also, while perf is pretty low-overhead, there may be some perturbation of what you measure.
What are we benchmarking?
First of all, you need to be clear what you actually measure: The time to execute std::chrono::steady_clock::now(). It's actually good to do that to figure out at least this measurement overhead as well as the precision of the clock.
That's actually a tricky point. The cost of this function, with clock_gettime underneath, depends on your current clocksource1. If that's tsc you're fine - hpet is much slower. Linux may switch quietly2 from tsc to hpet during operation.
What to do to get stable results?
Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use the appropriate methods (mean, quantiles), sometimes you may want to exclude outliers.
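For instance, a small sketch (with made-up sample values) of summarizing repeated measurements with the median and a high quantile instead of the mean, so a few disturbed iterations don't skew the result:
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

double quantile(std::vector<double> v, double q) {
    std::sort(v.begin(), v.end());
    return v[static_cast<std::size_t>(q * (v.size() - 1))];
}

int main() {
    // Pretend these are per-iteration timings in nanoseconds.
    std::vector<double> samples = {20, 21, 20, 19, 22, 20, 1500, 21};
    std::printf("median = %.1f ns, p95 = %.1f ns\n",
                quantile(samples, 0.5), quantile(samples, 0.95));
    return 0;
}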
If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.
Yes, benchmarking right is very complicated; you need to consider a lot of aspects, especially when you get closer to the hardware and your kernel times get very short. Fortunately there's some help; for example, Google's benchmark library provides a lot of help in terms of doing the right amount of repetitions and also in terms of having experiment factors.
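For what it's worth, a minimal sketch (not from the question) of measuring the steady_clock::now() overhead with Google Benchmark; it picks the iteration count itself and can report aggregates over repetitions (build with -lbenchmark -lpthread):
#include <benchmark/benchmark.h>
#include <chrono>

static void BM_ClockNow(benchmark::State& state) {
    for (auto _ : state) {
        auto t = std::chrono::steady_clock::now();
        benchmark::DoNotOptimize(t);   // keep the call from being optimized away
    }
}
// 10 repetitions; the library reports mean, median and stddev across them.
BENCHMARK(BM_ClockNow)->Repetitions(10)->ReportAggregatesOnly(true);
BENCHMARK_MAIN();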
1 /sys/devices/system/clocksource/clocksource0/current_clocksource
2 Actually it's in dmesg as something like
clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:

Perf Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (msr/tsc/)

I am using perf to monitor the system for certain events. However, I get the following error and I have no idea where it comes from, as the event is listed in perf list:
sudo perf record -e msr/tsc/ -a
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (msr/tsc/).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?
How can I check whether CONFIG_PERF_EVENTS=y kernel support is configured?
Some test results:
sudo dmesg | grep "perf\|pmu"
[ 0.029179] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 0.029179] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
[ 9475.406967] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 9475.990901] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 9476.886941] perf: interrupt took too long (3942 > 3932), lowering kernel.perf_event_max_sample_rate to 50500
[76057.268195] perf: interrupt took too long (4934 > 4927), lowering kernel.perf_event_max_sample_rate to 40500
[167368.007839] perf: interrupt took too long (6171 > 6167), lowering kernel.perf_event_max_sample_rate to 32250
[168338.165608] perf: interrupt took too long (7804 > 7713), lowering kernel.perf_event_max_sample_rate to 25500
perf list |grep msr
msr/aperf/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/pperf/ [Kernel PMU event]
msr/smi/ [Kernel PMU event]
msr/tsc/
sudo uname -a
Linux bla 4.9.0-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
sudo /proc/config.gz returns command not found
Any help/ideas are appreciated.
It is possible to collect this information with perf record by using group sampling. For example, the following command
perf record -a -e '{cycles,msr/aperf/,msr/tsc/}:S'
collects values of all three events based on the cycle (first counter) overflow. The undocumented :S modifier is necessary; it makes sure that only the leader of the group triggers samples. To see this information, use perf report --group (the parameter may not be necessary). I'm afraid the actual values for each sample are only visible in the very verbose perf script -D.
There was a patch introduced in perf to support MSR Performance Monitoring Units. These MSR PMUs support free-running MSR counters. These counters include time and frequency-based counters like TSC, IA32_APERF, IA32_MPERF and IA32_PPERF.
These MSR events do not support sampling modes, as is visible from this line of code in the Linux kernel (v4.9) source.
Snippet:
if (event->attr.sample_period) /* no sampling */
    return -EINVAL;
perf_events can instrument in three ways (counting events, sampling events and bpf events). Remember that when you run perf record, you are invoking the sampling mode; even though you do not explicitly specify a sampling period, sampling internally happens at a default frequency.
To count msr events, you need to run perf_events in counting/aggregation mode. You use perf stat for this:
perf stat -e msr/tsc/ -a
^C
Performance counter stats for 'system wide':
34,83,07,96,035 msr/tsc/
2.419151644 seconds time elapsed
Read this to understand more about counting and sampling events/modes.

How to run record instruction-history and function-call-history in GDB?

(EDIT: per the first answer below, the current "trick" seems to be using an Atom processor. But I hope some gdb guru can answer whether this is a fundamental limitation, or whether adding support for other processors is on the roadmap?)
Reverse execution seems to be working in my environment: I can reverse-continue, see a plausible record log, and move around within it:
(gdb) start
...Temporary breakpoint 5 at 0x8048460: file bang.cpp, line 13.
Starting program: /home/thomasg/temp/./bang
Temporary breakpoint 5, main () at bang.cpp:13
13 f(1000);
(gdb) record
(gdb) continue
Continuing.
Breakpoint 3, f (d=900) at bang.cpp:5
5 if(d) {
(gdb) info record
Active record target: record-full
Record mode:
Lowest recorded instruction number is 1.
Highest recorded instruction number is 1005.
Log contains 1005 instructions.
Max logged instructions is 200000.
(gdb) reverse-continue
Continuing.
Breakpoint 3, f (d=901) at bang.cpp:5
5 if(d) {
(gdb) record goto end
Go forward to insn number 1005
#0 f (d=900) at bang.cpp:5
5 if(d) {
However the instruction and function histories aren't available:
(gdb) record instruction-history
You can't do that when your target is `record-full'
(gdb) record function-call-history
You can't do that when your target is `record-full'
And the only target type available is full, the other documented type "btrace" fails with "Target does not support branch tracing."
So quite possibly it just isn't supported for this target, but as it's a mainstream modern one (gdb 7.6.1-ubuntu, on amd64 Linux Mint "Petra" running an "Intel(R) Core(TM) i5-3570") I'm hoping that I've overlooked a crucial step or config?
It seems that there is no other solution except a CPU that supports it.
More precisely, your CPU has to support Intel Processor Trace (Intel PT). This can be checked in Linux with:
grep intel_pt /proc/cpuinfo
See also: https://unix.stackexchange.com/questions/43539/what-do-the-flags-in-proc-cpuinfo-mean
These commands only work in record btrace mode.
In the GDB source commit beab5d9, it is nat/linux-btrace.c:kernel_supports_pt that checks if we can enter btrace. The following checks are carried out:
check if /sys/bus/event_source/devices/intel_pt/type exists and read the type
do a syscall (SYS_perf_event_open, &attr, child, -1, -1, 0); with the read type, and see if it returns >=0. TODO: why not use the C wrapper?
The first check fails for me: the file does not exist.
Kernel side
cd into the kernel 4.1 source and:
git grep '"intel_pt"'
we find arch/x86/kernel/cpu/perf_event_intel_pt.c which sets up that file. In particular, it does:
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
goto fail;
so intel_pt is a pre-requisite.
How I've found kernel_supports_pt
First grep for:
git grep 'Target does not support branch tracing.'
which leads us to btrace.c:btrace_enable. After a quick debug with:
gdb -q -ex start -ex 'b btrace_enable' -ex c --args /home/ciro/git/binutils-gdb/install/bin/gdb --batch -ex start -ex 'record btrace' ./hello_world.out
Virtual box does not support it either: Extract execution log from gdb record in a VirtualBox VM
Intel SDE
Intel SDE 7.21 already has this CPU feature, checked with:
./sde64 -- cpuid | grep 'Intel processor trace'
But I'm not sure if the Linux kernel can be run on it: https://superuser.com/questions/950992/how-to-run-the-linux-kernel-on-intel-software-development-emulator-sde
Other GDB methods
More generic questions, with less efficient software solutions:
call graph: List of all function calls made in an application
instruction trace: Displaying each assembly instruction executed in gdb
At least a partial answer (for the "am I doing it wrong" aspect) - from gdb-7.6.50.20140108/gdb/NEWS
* A new record target "record-btrace" has been added. The new target
uses hardware support to record the control-flow of a process. It
does not support replaying the execution, but it implements the
below new commands for investigating the recorded execution log.
This new recording method can be enabled using:
record btrace
The "record-btrace" target is only available on Intel Atom processors
and requires a Linux kernel 2.6.32 or later.
* Two new commands have been added for record/replay to give information
about the recorded execution without having to replay the execution.
The commands are only supported by "record btrace".
record instruction-history prints the execution history at
instruction granularity
record function-call-history prints the execution history at
function granularity
It's not often that I envy the owner of an Atom processor ;-)
I'll edit the question to refocus upon the question of workarounds or plans for future support.

Unix Command For Benchmarking Code Running K times

Suppose I have a code executed in Unix this way:
$ ./mycode
My question is: is there a way I can time the running time of my code executed K times? The value of K = 1000, for example.
I am aware of the Unix "time" command, but that only executes 1 instance.
to improve/clarify on Charlie's answer:
time (for i in $(seq 10000); do ./mycode; done)
try
$ time ( your commands )
write a loop to go in the parens to repeat your command as needed.
Update
Okay, we can solve the command line too long issue. This is bash syntax, if you're using another shell you may have to use expr(1).
$ time (
> while ((n++ < 100)); do echo "n = $n"; done
> )
real 0m0.001s
user 0m0.000s
sys 0m0.000s
Just a word of advice: Make sure this "benchmark" comes close to your real usage of the executed program. If this is a short living process, there could be a significant overhead caused by the process creation alone. Don't assume that it's the same as implementing this as a loop within your program.
To enhance some other responses a little bit: some of them (those based on seq) may cause a "command line too long" error if you decide to test, say, one million times. The following does not have this limitation:
time ( a=0 ; while test $a -lt 10000 ; do echo $a ; a=`expr $a + 1` ; done)
Another solution to the "command line too long" problem is to use a C-style for loop within bash:
$ for ((i=0;i<10;i++)); do echo $i; done
This works in zsh as well (though I bet zsh has some niftier way of using it, I'm just still new to zsh). I can't test others, as I've never used any other.
forget time, hyperfine will do exactly what you are looking for: https://github.com/sharkdp/hyperfine
% hyperfine 'sleep 0.3'
Benchmark 1: sleep 0.3
Time (mean ± σ): 310.2 ms ± 3.4 ms [User: 1.7 ms, System: 2.5 ms]
Range (min … max): 305.6 ms … 315.2 ms 10 runs
Linux perf stat has a -r repeat_count option. Its output only gives you the mean and standard deviation for each HW/software event, not min/max as well.
It doesn't discard the first run as a warm-up or anything either, but it's somewhat useful in a lot of cases.
Scroll to the right for the stddev results like ( +- 0.13% ) for cycles. Less variance in that than in task-clock, probably because CPU frequency was not fixed. (I intentionally picked a quite short run time, although with Skylake hardware P-state and EPP=performance, it should be ramping up to max turbo quite quickly even compared to a 34 ms run time. But for a CPU-bound task that's not memory-bound at all, its interpreter loop runs at a constant number of clock cycles per iteration, modulo only branch misprediction and interrupts. --all-user is counting CPU events like instructions and cycles only for user-space, not inside interrupt handlers and system calls / page-faults.)
$ perf stat --all-user -r5 awk 'BEGIN{for(i=0;i<1000000;i++){}}'
Performance counter stats for 'awk BEGIN{for(i=0;i<1000000;i++){}}' (5 runs):
34.10 msec task-clock # 0.984 CPUs utilized ( +- 0.40% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
178 page-faults # 5.180 K/sec ( +- 0.42% )
139,277,791 cycles # 4.053 GHz ( +- 0.13% )
360,590,762 instructions # 2.58 insn per cycle ( +- 0.00% )
97,439,689 branches # 2.835 G/sec ( +- 0.00% )
16,416 branch-misses # 0.02% of all branches ( +- 8.14% )
0.034664 +- 0.000143 seconds time elapsed ( +- 0.41% )
awk here is just a busy-loop to give us something to measure. If you're using this to microbenchmark a loop or function, construct it to have minimal startup overhead as a fraction of total run time, so perf stat event counts for the whole run mostly reflect the code you wanted to time. Often this means building a repeat-loop into your own program, to loop over the initialized data multiple times.
See also Idiomatic way of performance evaluation? - timing very short things is hard due to measurement overhead. Carefully constructing a repeat loop that tells you something interesting about the throughput or latency of your code under test is important.
Run-to-run variation is often a thing, but back-to-back runs like this will often have less variation within the group than runs separated by half a second and an up-arrow/return. Perhaps something to do with transparent hugepage availability, or choice of alignment? Usually small microbenchmarks like this aren't sensitive to the file getting evicted from the pagecache.
(The +- range printed by perf is just I think one standard deviation based on the small sample size, not the full range it saw.)
If you're worried about the overhead of constantly loading and unloading the executable into process space, I suggest you set up a RAM disk and time your app from there.
Back in the 70's we used to be able to set a "sticky" bit on the executable and have it remain in memory.. I don't know of a single unix which now supports this behaviour as it made updating applications a nightmare.... :o)

How do I profile C++ code running on Linux?

How do I find areas of my code that run slowly in a C++ application running on Linux?
If your goal is to use a profiler, use one of the suggested ones.
However, if you're in a hurry and you can manually interrupt your program under the debugger while it's being subjectively slow, there's a simple way to find performance problems.
Just halt it several times, and each time look at the call stack. If there is some code that is wasting some percentage of the time, 20% or 50% or whatever, that is the probability that you will catch it in the act on each sample. So, that is roughly the percentage of samples on which you will see it. There is no educated guesswork required. If you do have a guess as to what the problem is, this will prove or disprove it.
You may have multiple performance problems of different sizes. If you clean out any one of them, the remaining ones will take a larger percentage, and be easier to spot, on subsequent passes. This magnification effect, when compounded over multiple problems, can lead to truly massive speedup factors.
Caveat: Programmers tend to be skeptical of this technique unless they've used it themselves. They will say that profilers give you this information, but that is only true if they sample the entire call stack, and then let you examine a random set of samples. (The summaries are where the insight is lost.) Call graphs don't give you the same information, because
They don't summarize at the instruction level, and
They give confusing summaries in the presence of recursion.
They will also say it only works on toy programs, when actually it works on any program, and it seems to work better on bigger programs, because they tend to have more problems to find. They will say it sometimes finds things that aren't problems, but that is only true if you see something once. If you see a problem on more than one sample, it is real.
P.S. This can also be done on multi-thread programs if there is a way to collect call-stack samples of the thread pool at a point in time, as there is in Java.
P.P.S As a rough generality, the more layers of abstraction you have in your software, the more likely you are to find that that is the cause of performance problems (and the opportunity to get speedup).
Added: It might not be obvious, but the stack sampling technique works equally well in the presence of recursion. The reason is that the time that would be saved by removal of an instruction is approximated by the fraction of samples containing it, regardless of the number of times it may occur within a sample.
Another objection I often hear is: "It will stop someplace random, and it will miss the real problem".
This comes from having a prior concept of what the real problem is.
A key property of performance problems is that they defy expectations.
Sampling tells you something is a problem, and your first reaction is disbelief.
That is natural, but you can be sure if it finds a problem it is real, and vice-versa.
Added: Let me make a Bayesian explanation of how it works. Suppose there is some instruction I (call or otherwise) which is on the call stack some fraction f of the time (and thus costs that much). For simplicity, suppose we don't know what f is, but assume it is either 0.1, 0.2, 0.3, ... 0.9, 1.0, and the prior probability of each of these possibilities is 0.1, so all of these costs are equally likely a-priori.
Then suppose we take just 2 stack samples, and we see instruction I on both samples, designated observation o=2/2. This gives us new estimates of the frequency f of I, according to this:
Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f >= x)  P(f >= x | o=2/2)
0.1     1    1             0.1            0.1               0.25974026
0.1     0.9  0.81          0.081          0.181             0.47012987
0.1     0.8  0.64          0.064          0.245             0.636363636
0.1     0.7  0.49          0.049          0.294             0.763636364
0.1     0.6  0.36          0.036          0.33              0.857142857
0.1     0.5  0.25          0.025          0.355             0.922077922
0.1     0.4  0.16          0.016          0.371             0.963636364
0.1     0.3  0.09          0.009          0.38              0.987012987
0.1     0.2  0.04          0.004          0.384             0.997402597
0.1     0.1  0.01          0.001          0.385             1
                           P(o=2/2)       0.385
The last column says that, for example, the probability that f >= 0.5 is 92%, up from the prior assumption of 60%.
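For reference, the rows above are just Bayes' rule with a likelihood of x^2 (the chance of landing on I in both of 2 independent samples):
P(o=2/2 | f=x) = x^2
P(o=2/2) = sum over x of P(o=2/2 | f=x) * P(f=x) = 0.1 * (1 + 0.81 + ... + 0.01) = 0.385
P(f=x | o=2/2) = x^2 * P(f=x) / 0.385
and the last column accumulates those posteriors from the top down to give P(f >= x | o=2/2).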
Suppose the prior assumptions are different. Suppose we assume P(f=0.1) is .991 (nearly certain), and all the other possibilities are almost impossible (0.001). In other words, our prior certainty is that I is cheap. Then we get:
Prior
P(f=x)  x    P(o=2/2|f=x)  P(o=2/2&&f=x)  P(o=2/2&&f >= x)  P(f >= x | o=2/2)
0.001   1    1             0.001          0.001             0.072727273
0.001   0.9  0.81          0.00081        0.00181           0.131636364
0.001   0.8  0.64          0.00064        0.00245           0.178181818
0.001   0.7  0.49          0.00049        0.00294           0.213818182
0.001   0.6  0.36          0.00036        0.0033            0.24
0.001   0.5  0.25          0.00025        0.00355           0.258181818
0.001   0.4  0.16          0.00016        0.00371           0.269818182
0.001   0.3  0.09          0.00009        0.0038            0.276363636
0.001   0.2  0.04          0.00004        0.00384           0.279272727
0.991   0.1  0.01          0.00991        0.01375           1
                           P(o=2/2)       0.01375
Now it says P(f >= 0.5) is 26%, up from the prior assumption of 0.6%. So Bayes allows us to update our estimate of the probable cost of I. If the amount of data is small, it doesn't tell us accurately what the cost is, only that it is big enough to be worth fixing.
Yet another way to look at it is called the Rule Of Succession.
If you flip a coin 2 times, and it comes up heads both times, what does that tell you about the probable weighting of the coin?
The respected way to answer is to say that it's a Beta distribution, with average value (number of hits + 1) / (number of tries + 2) = (2+1)/(2+2) = 75%.
(The key is that we see I more than once. If we only see it once, that doesn't tell us much except that f > 0.)
So, even a very small number of samples can tell us a lot about the cost of instructions that it sees. (And it will see them with a frequency, on average, proportional to their cost. If n samples are taken, and f is the cost, then I will appear on nf+/-sqrt(nf(1-f)) samples. Example, n=10, f=0.3, that is 3+/-1.4 samples.)
Added: To give an intuitive feel for the difference between measuring and random stack sampling:
There are profilers now that sample the stack, even on wall-clock time, but what comes out is measurements (or hot path, or hot spot, from which a "bottleneck" can easily hide). What they don't show you (and they easily could) is the actual samples themselves. And if your goal is to find the bottleneck, the number of them you need to see is, on average, 2 divided by the fraction of time it takes.
So if it takes 30% of time, 2/.3 = 6.7 samples, on average, will show it, and the chance that 20 samples will show it is 99.2%.
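To sanity-check that 99.2% figure under the stated assumptions (each sample independently lands on the problem with probability 0.3, and "showing it" means seeing it on at least 2 of the 20 samples):
P(at least 2 of 20) = 1 - 0.7^20 - 20 * 0.3 * 0.7^19 ≈ 1 - 0.0008 - 0.0068 ≈ 0.992.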
Here is an off-the-cuff illustration of the difference between examining measurements and examining stack samples.
The bottleneck could be one big blob like this, or numerous small ones, it makes no difference.
Measurement is horizontal; it tells you what fraction of time specific routines take.
Sampling is vertical.
If there is any way to avoid what the whole program is doing at that moment, and if you see it on a second sample, you've found the bottleneck.
That's what makes the difference - seeing the whole reason for the time being spent, not just how much.
Use Valgrind with the following options:
valgrind --tool=callgrind ./(Your binary)
This generates a file called callgrind.out.x. Use the kcachegrind tool to read this file. It will give you a graphical analysis of things with results like which lines cost how much.
I assume you're using GCC. The standard solution would be to profile with gprof.
Be sure to add -pg to compilation before profiling:
cc -o myprog myprog.c utils.c -g -pg
I haven't tried it yet but I've heard good things about google-perftools. It is definitely worth a try.
Related question here.
A few other buzzwords if gprof does not do the job for you: Valgrind, Intel VTune, Sun DTrace.
Newer kernels (e.g. the latest Ubuntu kernels) come with the new 'perf' tools (apt-get install linux-tools) AKA perf_events.
These come with classic sampling profilers (man-page) as well as the awesome timechart!
The important thing is that these tools can be system profiling and not just process profiling - they can show the interaction between threads, processes and the kernel and let you understand the scheduling and I/O dependencies between processes.
The answer to run valgrind --tool=callgrind is not quite complete without some options. We usually do not want to profile 10 minutes of slow startup time under Valgrind and want to profile our program when it is doing some task.
So this is what I recommend. Run program first:
valgrind --tool=callgrind --dump-instr=yes -v --instr-atstart=no ./binary > tmp
Now when it works and we want to start profiling we should run in another window:
callgrind_control -i on
This turns profiling on. To turn it off and stop whole task we might use:
callgrind_control -k
Now we have some files named callgrind.out.* in current directory. To see profiling results use:
kcachegrind callgrind.out.*
I recommend clicking on the "Self" column header in the next window; otherwise it shows "main()" as the most time-consuming task. "Self" shows how much time each function itself took, not together with its dependents.
I would use Valgrind and Callgrind as a base for my profiling tool suite. What is important to know is that Valgrind is basically a Virtual Machine:
(Wikipedia) Valgrind is in essence a virtual machine using just-in-time (JIT) compilation techniques, including dynamic recompilation. Nothing from the original program ever gets run directly on the host processor. Instead, Valgrind first translates the program into a temporary, simpler form called Intermediate Representation (IR), which is a processor-neutral, SSA-based form. After the conversion, a tool (see below) is free to do whatever transformations it would like on the IR, before Valgrind translates the IR back into machine code and lets the host processor run it.
Callgrind is a profiler built upon that. Its main benefit is that you don't have to run your application for hours to get a reliable result. Even a one-second run is sufficient to get rock-solid, reliable results, because Callgrind is a non-probing profiler.
Another tool built upon Valgrind is Massif. I use it to profile heap memory usage. It works great. What it does is give you snapshots of memory usage -- detailed information about WHAT holds WHAT percentage of memory, and WHO put it there. Such information is available at different points in time of the application run.
This is a response to Nazgob's Gprof answer.
I've been using Gprof the last couple of days and have already found three significant limitations, one of which I've not seen documented anywhere else (yet):
It doesn't work properly on multi-threaded code, unless you use a workaround
The call graph gets confused by function pointers. Example: I have a function called multithread() which enables me to multi-thread a specified function over a specified array (both passed as arguments). Gprof however, views all calls to multithread() as equivalent for the purposes of computing time spent in children. Since some functions I pass to multithread() take much longer than others my call graphs are mostly useless. (To those wondering if threading is the issue here: no, multithread() can optionally, and did in this case, run everything sequentially on the calling thread only).
It says here that "... the number-of-calls figures are derived by counting, not sampling. They are completely accurate...". Yet I find my call graph giving me 5345859132+784984078 as call stats to my most-called function, where the first number is supposed to be direct calls, and the second recursive calls (which are all from itself). Since this implied I had a bug, I put in long (64-bit) counters into the code and did the same run again. My counts: 5345859132 direct, and 78094395406 self-recursive calls. There are a lot of digits there, so I'll point out the recursive calls I measure are 78bn, versus 784m from Gprof: a factor of 100 different. Both runs were single threaded and unoptimised code, one compiled -g and the other -pg.
This was GNU Gprof (GNU Binutils for Debian) 2.18.0.20080103 running under 64-bit Debian Lenny, if that helps anyone.
Survey of C++ profiling techniques: gprof vs valgrind vs perf vs gperftools
In this answer, I will use several different tools to analyze a few very simple test programs, in order to concretely compare how those tools work.
The following test program is very simple and does the following:
main calls fast and maybe_slow 3 times, one of the maybe_slow calls being slow
The slow call of maybe_slow is 10x longer, and dominates runtime if we consider calls to the child function common. Ideally, the profiling tool will be able to point us to the specific slow call.
both fast and maybe_slow call common, which accounts for the bulk of the program execution
The program interface is:
./main.out [n [seed]]
and the program does O(n^2) loops in total. seed is just to get different output without affecting runtime.
main.c
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t __attribute__ ((noinline)) common(uint64_t n, uint64_t seed) {
    for (uint64_t i = 0; i < n; ++i) {
        seed = (seed * seed) - (3 * seed) + 1;
    }
    return seed;
}

uint64_t __attribute__ ((noinline)) fast(uint64_t n, uint64_t seed) {
    uint64_t max = (n / 10) + 1;
    for (uint64_t i = 0; i < max; ++i) {
        seed = common(n, (seed * seed) - (3 * seed) + 1);
    }
    return seed;
}

uint64_t __attribute__ ((noinline)) maybe_slow(uint64_t n, uint64_t seed, int is_slow) {
    uint64_t max = n;
    if (is_slow) {
        max *= 10;
    }
    for (uint64_t i = 0; i < max; ++i) {
        seed = common(n, (seed * seed) - (3 * seed) + 1);
    }
    return seed;
}

int main(int argc, char **argv) {
    uint64_t n, seed;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 1;
    }
    if (argc > 2) {
        seed = strtoll(argv[2], NULL, 0);
    } else {
        seed = 0;
    }
    seed += maybe_slow(n, seed, 0);
    seed += fast(n, seed);
    seed += maybe_slow(n, seed, 1);
    seed += fast(n, seed);
    seed += maybe_slow(n, seed, 0);
    seed += fast(n, seed);
    printf("%" PRIX64 "\n", seed);
    return EXIT_SUCCESS;
}
gprof
gprof requires recompiling the software with instrumentation, and it also uses a sampling approach together with that instrumentation. It therefore strikes a balance between accuracy (sampling is not always fully accurate and can skip functions) and execution slowdown (instrumentation and sampling are relatively fast techniques that don't slow down execution very much).
gprof is built into GCC/binutils, so all we have to do is compile with the -pg option to enable gprof. We then run the program normally with a size CLI parameter that produces a run of reasonable duration, a few seconds (10000):
gcc -pg -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time ./main.out 10000
For educational reasons, we will also do a run without optimizations enabled. Note that this is useless in practice, as you normally only care about optimizing the performance of the optimized program:
gcc -pg -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out 10000
First, time tells us that the execution time with and without -pg were the same, which is great: no slowdown! I have however seen accounts of 2x - 3x slowdowns on complex software, e.g. as shown in this ticket.
Because we compiled with -pg, running the program produces a gmon.out file containing the profiling data.
We can observe that file graphically with gprof2dot as asked at: Is it possible to get a graphical representation of gprof results?
sudo apt install graphviz
python3 -m pip install --user gprof2dot
gprof main.out > main.gprof
gprof2dot < main.gprof | dot -Tsvg -o output.svg
Here, the gprof tool reads the gmon.out trace information, and generates a human readable report in main.gprof, which gprof2dot then reads to generate a graph.
The source for gprof2dot is at: https://github.com/jrfonseca/gprof2dot
We observe the following for the -O0 run:
and for the -O3 run:
The -O0 output is pretty much self-explanatory. For example, it shows that the 3 maybe_slow calls and their child calls take up 97.56% of the total runtime, although execution of maybe_slow itself without children accounts for 0.00% of the total execution time, i.e. almost all the time spent in that function was spent on child calls.
TODO: why is main missing from the -O3 output, even though I can see it in a bt in GDB? Missing function from GProf output. I think it is because gprof is also sampling-based in addition to its compiled instrumentation, and the -O3 main is just too fast and got no samples.
I chose SVG output instead of PNG because SVG is searchable with Ctrl + F and the file size can be about 10x smaller. Also, the width and height of the generated image can be humongous, with tens of thousands of pixels, for complex software, and GNOME eog 3.28.1 bugs out in that case for PNGs, while SVGs get opened by my browser automatically. gimp 2.8 worked well though, see also:
https://askubuntu.com/questions/1112641/how-to-view-extremely-large-images
https://unix.stackexchange.com/questions/77968/viewing-large-image-on-linux
https://superuser.com/questions/356038/viewer-for-huge-images-under-linux-100-mp-color-images
but even then, you will be dragging the image around a lot to find what you want, see e.g. this image from a "real" software example taken from this ticket:
Can you find the most critical call stack easily with all those tiny unsorted spaghetti lines going over one another? There might be better dot options I'm sure, but I don't want to go there now. What we really need is a proper dedicated viewer for it, but I haven't found one yet:
View gprof output in kcachegrind
Which is the best replacement for KProf?
You can however use the color map to mitigate those problems a bit. For example, on the previous huge image, I finally managed to find the critical path on the left when I made the brilliant deduction that green comes after red, followed finally by darker and darker blue.
Alternatively, we can also observe the text output of the gprof built-in binutils tool which we previously saved at:
cat main.gprof
By default, this produces an extremely verbose output that explains what the output data means. Since I can't explain better than that, I'll let you read it yourself.
Once you have understood the data output format, you can reduce verbosity to show just the data without the tutorial with the -b option:
gprof -b main.out
In our example, outputs were for -O0:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
100.35 3.67 3.67 123003 0.00 0.00 common
0.00 3.67 0.00 3 0.00 0.03 fast
0.00 3.67 0.00 3 0.00 1.19 maybe_slow
Call graph
granularity: each sample hit covers 2 byte(s) for 0.27% of 3.67 seconds
index % time self children called name
0.09 0.00 3003/123003 fast [4]
3.58 0.00 120000/123003 maybe_slow [3]
[1] 100.0 3.67 0.00 123003 common [1]
-----------------------------------------------
<spontaneous>
[2] 100.0 0.00 3.67 main [2]
0.00 3.58 3/3 maybe_slow [3]
0.00 0.09 3/3 fast [4]
-----------------------------------------------
0.00 3.58 3/3 main [2]
[3] 97.6 0.00 3.58 3 maybe_slow [3]
3.58 0.00 120000/123003 common [1]
-----------------------------------------------
0.00 0.09 3/3 main [2]
[4] 2.4 0.00 0.09 3 fast [4]
0.09 0.00 3003/123003 common [1]
-----------------------------------------------
Index by function name
[1] common [4] fast [3] maybe_slow
and for -O3:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls us/call us/call name
100.52 1.84 1.84 123003 14.96 14.96 common
Call graph
granularity: each sample hit covers 2 byte(s) for 0.54% of 1.84 seconds
index % time self children called name
0.04 0.00 3003/123003 fast [3]
1.79 0.00 120000/123003 maybe_slow [2]
[1] 100.0 1.84 0.00 123003 common [1]
-----------------------------------------------
<spontaneous>
[2] 97.6 0.00 1.79 maybe_slow [2]
1.79 0.00 120000/123003 common [1]
-----------------------------------------------
<spontaneous>
[3] 2.4 0.00 0.04 fast [3]
0.04 0.00 3003/123003 common [1]
-----------------------------------------------
Index by function name
[1] common
As a very quick summary for each section e.g.:
0.00 3.58 3/3 main [2]
[3] 97.6 0.00 3.58 3 maybe_slow [3]
3.58 0.00 120000/123003 common [1]
centers around the function that is left-indented (maybe_slow). [3] is the ID of that function. Above the function are its callers, and below it the callees.
For -O3, notice, as in the graphical output, that maybe_slow and fast don't have a known parent, which is what the documentation says <spontaneous> means.
I'm not sure if there is a nice way to do line-by-line profiling with gprof: `gprof` time spent in particular lines of code
valgrind callgrind
valgrind runs the program through the valgrind virtual machine. This makes the profiling very accurate, but it also produces a very large slowdown of the program. I have also mentioned kcachegrind previously at: Tools to get a pictorial function call graph of code
callgrind is the valgrind's tool to profile code and kcachegrind is a KDE program that can visualize cachegrind output.
First we have to remove the -pg flag to go back to normal compilation, otherwise the run actually fails with Profiling timer expired, and yes, this is so common that I did it and there was a Stack Overflow question for it.
So we compile and run as:
sudo apt install kcachegrind valgrind
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time valgrind --tool=callgrind --dump-instr=yes \
--collect-jumps=yes ./main.out 10000
I enable --dump-instr=yes --collect-jumps=yes because this also dumps information that enables us to view a per assembly line breakdown of performance, at a relatively small added overhead cost.
Off the bat, time tells us that the program took 29.5 seconds to execute, so we had a slowdown of about 15x on this example. Clearly, this slowdown is going to be a serious limitation for larger workloads. On the "real world software example" mentioned here, I observed a slowdown of 80x.
The run generates a profile data file named callgrind.out.<pid> e.g. callgrind.out.8554 in my case. We view that file with:
kcachegrind callgrind.out.8554
which shows a GUI that contains data similar to the textual gprof output:
Also, if we go on the bottom right "Call Graph" tab, we see a call graph which we can export by right clicking it to obtain the following image with unreasonable amounts of white border :-)
I think fast is not showing on that graph because kcachegrind must have simplified the visualization, since that call takes up too little time; this will likely be the behavior you want on a real program. The right-click menu has some settings to control when to cull such nodes, but I couldn't get it to show such a short call after a quick attempt. If I click on fast in the left window, it does show a call graph with fast, so that stack was actually captured. No one has yet found a way to show the complete call graph: Make callgrind show all function calls in the kcachegrind callgraph
TODO on complex C++ software, I see some entries of type <cycle N>, e.g. <cycle 11> where I'd expect function names, what does that mean? I noticed there is a "Cycle Detection" button to toggle that on and off, but what does it mean?
perf from linux-tools
perf seems to use exclusively Linux kernel sampling mechanisms. This makes it very simple to setup, but also not fully accurate.
sudo apt install linux-tools
time perf record -g ./main.out 10000
This added 0.2s to execution, so we are fine time-wise, but I still don't see much of interest, after expanding the common node with the keyboard right arrow:
Samples: 7K of event 'cycles:uppp', Event count (approx.): 6228527608
Children Self Command Shared Object Symbol
- 99.98% 99.88% main.out main.out [.] common
common
0.11% 0.11% main.out [kernel] [k] 0xffffffff8a6009e7
0.01% 0.01% main.out [kernel] [k] 0xffffffff8a600158
0.01% 0.00% main.out [unknown] [k] 0x0000000000000040
0.01% 0.00% main.out ld-2.27.so [.] _dl_sysdep_start
0.01% 0.00% main.out ld-2.27.so [.] dl_main
0.01% 0.00% main.out ld-2.27.so [.] mprotect
0.01% 0.00% main.out ld-2.27.so [.] _dl_map_object
0.01% 0.00% main.out ld-2.27.so [.] _xstat
0.00% 0.00% main.out ld-2.27.so [.] __GI___tunables_init
0.00% 0.00% main.out [unknown] [.] 0x2f3d4f4944555453
0.00% 0.00% main.out [unknown] [.] 0x00007fff3cfc57ac
0.00% 0.00% main.out ld-2.27.so [.] _start
So then I try to benchmark the -O0 program to see if that shows anything, and only now, at last, do I see a call graph:
Samples: 15K of event 'cycles:uppp', Event count (approx.): 12438962281
Children Self Command Shared Object Symbol
+ 99.99% 0.00% main.out [unknown] [.] 0x04be258d4c544155
+ 99.99% 0.00% main.out libc-2.27.so [.] __libc_start_main
- 99.99% 0.00% main.out main.out [.] main
- main
- 97.54% maybe_slow
common
- 2.45% fast
common
+ 99.96% 99.85% main.out main.out [.] common
+ 97.54% 0.03% main.out main.out [.] maybe_slow
+ 2.45% 0.00% main.out main.out [.] fast
0.11% 0.11% main.out [kernel] [k] 0xffffffff8a6009e7
0.00% 0.00% main.out [unknown] [k] 0x0000000000000040
0.00% 0.00% main.out ld-2.27.so [.] _dl_sysdep_start
0.00% 0.00% main.out ld-2.27.so [.] dl_main
0.00% 0.00% main.out ld-2.27.so [.] _dl_lookup_symbol_x
0.00% 0.00% main.out [kernel] [k] 0xffffffff8a600158
0.00% 0.00% main.out ld-2.27.so [.] mmap64
0.00% 0.00% main.out ld-2.27.so [.] _dl_map_object
0.00% 0.00% main.out ld-2.27.so [.] __GI___tunables_init
0.00% 0.00% main.out [unknown] [.] 0x552e53555f6e653d
0.00% 0.00% main.out [unknown] [.] 0x00007ffe1cf20fdb
0.00% 0.00% main.out ld-2.27.so [.] _start
TODO: what happened in the -O3 execution? Is it simply that maybe_slow and fast were too fast and did not get any samples? Does it work well with -O3 on larger programs that take longer to execute? Did I miss some CLI option? I found out about -F to control the sample frequency in Hertz, and I turned it up to the maximum allowed by default, -F 39500 (it could be increased with sudo), but I still don't see clear calls.
One cool thing about perf is the FlameGraph tool from Brendan Gregg which displays the call stack timings in a very neat way that allows you to quickly see the big calls. The tool is available at: https://github.com/brendangregg/FlameGraph and is also mentioned on his perf tutorial at: http://www.brendangregg.com/perf.html#FlameGraphs When I ran perf without sudo I got ERROR: No stack counts found so for now I'll be doing it with sudo:
git clone https://github.com/brendangregg/FlameGraph
sudo perf record -F 99 -g -o perf_with_stack.data ./main.out 10000
sudo perf script -i perf_with_stack.data | FlameGraph/stackcollapse-perf.pl | FlameGraph/flamegraph.pl > flamegraph.svg
but in such a simple program the output is not very easy to understand, since we can't easily see either maybe_slow or fast on that graph:
On a more complex example it becomes clear what the graph means:
TODO: there are a lot of [unknown] functions in that example, why is that?
Other perf GUI interfaces which might be worth it include:
Eclipse Trace Compass plugin: https://www.eclipse.org/tracecompass/
But this has the downside that you have to first convert the data to the Common Trace Format, which can be done with perf data --to-ctf, but it needs to be enabled at perf build time / you need a new enough perf, neither of which is the case for the perf in Ubuntu 18.04
https://github.com/KDAB/hotspot
The downside of this is that there seems to be no Ubuntu package, and building it requires Qt 5.10 while Ubuntu 18.04 is at Qt 5.9.
But David Faure mentions in the comments that there is now an AppImage package, which might be a convenient way to use it.
gperftools
Previously called "Google Performance Tools", source: https://github.com/gperftools/gperftools Sample based.
First install gperftools with:
sudo apt install google-perftools
Then, we can enable the gperftools CPU profiler in two ways: at runtime, or at build time.
At runtime, we have to set LD_PRELOAD to point to libprofiler.so, which you can find with locate libprofiler.so, e.g. on my system:
gcc -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so \
CPUPROFILE=prof.out ./main.out 10000
Alternatively, we can build the library in at link time, dispensing with passing LD_PRELOAD at runtime:
gcc -Wl,--no-as-needed,-lprofiler,--as-needed -ggdb3 -O3 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
CPUPROFILE=prof.out ./main.out 10000
See also: gperftools - profile file not dumped
The nicest way to view this data I've found so far is to make pprof output the same format that kcachegrind takes as input (yes, the Valgrind-project-viewer-tool) and use kcachegrind to view that:
google-pprof --callgrind main.out prof.out > callgrind.out
kcachegrind callgrind.out
After running with either of those methods, we get a prof.out profile data file as output. We can view that file graphically as an SVG with:
google-pprof --web main.out prof.out
which gives us a familiar call graph like the other tools, but with the clunky unit of number of samples rather than seconds.
Alternatively, we can also get some textual data with:
google-pprof --text main.out prof.out
which gives:
Using local file main.out.
Using local file prof.out.
Total: 187 samples
187 100.0% 100.0% 187 100.0% common
0 0.0% 100.0% 187 100.0% __libc_start_main
0 0.0% 100.0% 187 100.0% _start
0 0.0% 100.0% 4 2.1% fast
0 0.0% 100.0% 187 100.0% main
0 0.0% 100.0% 183 97.9% maybe_slow
See also: How to use google perf tools
Instrument your code with raw perf_event_open syscalls
I think this is the same underlying subsystem that perf uses, but you could of course attain even greater control by explicitly instrumenting your program at compile time with events of interest.
This is likely too hardcore for most people, but it's kind of fun. Minimal runnable example at: Quick way to count number of instructions executed in a C program
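To give a flavor of what that looks like, here is a sketch (not the linked example; the event choice and the loop are just illustrative) that counts retired instructions around a region of interest in counting mode:
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  // illustrative event choice
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    // Count for the calling thread, on any CPU.
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t x = 0;
    for (int i = 0; i < 1000000; ++i) x += i;  // region of interest

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count))
        std::printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}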
Intel VTune
https://en.wikipedia.org/wiki/VTune
This seems to be closed source and x86-only, but it is likely to be amazing from what I've heard. I'm not sure how free it is to use, but it seems to be free to download. TODO evaluate.
Tested in Ubuntu 18.04, gprof2dot 2019.11.30, valgrind 3.13.0, perf 4.15.18, Linux kernel 4.15.0, FlameGraph 1a0dc6985aad06e76857cf2a354bd5ba0c9ce96b, gperftools 2.5-2.
Use Valgrind, callgrind and kcachegrind:
valgrind --tool=callgrind ./(Your binary)
generates callgrind.out.x. Read it using kcachegrind.
Use gprof (add -pg):
cc -o myprog myprog.c utils.c -g -pg
(not so good for multi-threaded code or function pointers)
Use google-perftools:
Uses time sampling; I/O and CPU bottlenecks are revealed.
Intel VTune is the best (free for educational purposes).
Others: AMD CodeAnalyst (since replaced with AMD CodeXL), OProfile, 'perf' tools (apt-get install linux-tools)
For single-threaded programs you can use igprof, The Ignominous Profiler: https://igprof.org/ .
It is a sampling profiler, along the lines of the... long... answer by Mike Dunlavey, which will gift wrap the results in a browsable call stack tree, annotated with the time or memory spent in each function, either cumulative or per-function.
Also worth mentioning are
HPCToolkit (http://hpctoolkit.org/) - Open-source, works for parallel programs and has a GUI with which to look at the results multiple ways
Intel VTune (https://software.intel.com/en-us/vtune) - If you have intel compilers this is very good
TAU (http://www.cs.uoregon.edu/research/tau/home.php)
I have used HPCToolkit and VTune and they are very effective at finding the long pole in the tent and do not need your code to be recompiled (except that you have to use -g -O or RelWithDebInfo type build in CMake to get meaningful output). I have heard TAU is similar in capabilities.
I'm actually a bit surprised not many mentioned google/benchmark. While it is a bit cumbersome to pin down the specific area of code, especially if the code base is large, I found it really helpful when used in combination with callgrind.
IMHO, identifying the piece that is causing the bottleneck is the key here. However, I'd try to answer the following questions first and choose a tool based on that:
is my algorithm correct?
are there locks that are proving to be bottlenecks?
is there a specific section of code that's proving to be a culprit?
how is I/O handled, and is it optimized?
valgrind with the combination of callgrind and kcachegrind should provide a decent estimation of the points above, and once it's established that there are issues with some section of code, I'd suggest doing a micro-benchmark; google benchmark is a good place to start (a minimal sketch follows below).
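For illustration, here is a minimal sketch of what such a micro-benchmark could look like with google/benchmark; the function work() and its argument sizes are made-up placeholders standing in for the suspect section of your code.
#include <benchmark/benchmark.h>

// Hypothetical function under test: replace with the code you suspect.
static long work(long n) {
    long sum = 0;
    for (long i = 0; i < n; ++i)
        sum += i % 7;
    return sum;
}

static void BM_Work(benchmark::State& state) {
    // The framework decides how many iterations to run to get stable timings.
    for (auto _ : state)
        benchmark::DoNotOptimize(work(state.range(0)));
}
// Run the benchmark for two made-up input sizes.
BENCHMARK(BM_Work)->Arg(1 << 10)->Arg(1 << 20);

BENCHMARK_MAIN();
Build with something like g++ -O2 bench.cpp -lbenchmark -lpthread -o bench and run ./bench.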
These are the two methods I use for speeding up my code:
For CPU bound applications:
Use a profiler in DEBUG mode to identify questionable parts of your code
Then switch to RELEASE mode and comment out the questionable sections of your code (stub them with nothing) until you see changes in performance.
For I/O bound applications:
Use a profiler in RELEASE mode to identify questionable parts of your code.
N.B.
If you don't have a profiler, use the poor man's profiler. Hit pause while debugging your application. Most developer suites will break into assembly with commented line numbers. You're statistically likely to land in a region that is eating most of your CPU cycles.
For CPU, the reason for profiling in DEBUG mode is that if you tried profiling in RELEASE mode, the compiler would reduce math, vectorize loops, and inline functions, which tends to glob your code into an un-mappable mess when it's assembled. An un-mappable mess means your profiler will not be able to clearly identify what is taking so long, because the assembly may not correspond to the source code under optimization. If you need the performance of RELEASE mode (e.g. for timing-sensitive code), disable debugger features as needed to keep performance usable.
For I/O-bound, the profiler can still identify I/O operations in RELEASE mode because I/O operations are either externally linked to a shared library (most of the time) or in the worst case, will result in a sys-call interrupt vector (which is also easily identifiable by the profiler).
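On Linux, the poor man's profiler mentioned in the N.B. above can also be scripted rather than hitting pause by hand: repeatedly grab stack traces with gdb and see which frames show up most often. A rough sketch, assuming the target is main.out and is already running; the sample count and sleep interval are arbitrary:
for i in $(seq 10); do
  gdb -p "$(pidof main.out)" -batch -ex 'set pagination off' -ex 'thread apply all bt' 2>/dev/null
  sleep 0.5
done > stacks.txt
# The functions that appear most often in stacks.txt are where the time goes.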
You can use the iprof library:
https://gitlab.com/Neurochrom/iprof
https://github.com/Neurochrom/iprof
It's cross-platform and also lets you measure the performance of your application in real time. You can even couple it with a live graph.
Full disclaimer: I am the author.
You can use a logging framework like loguru, since it includes timestamps and total uptime, which can be used nicely for profiling.
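For example, a minimal sketch assuming loguru's LOG_SCOPE_F macro, which logs a message on scope entry and the elapsed time on scope exit; maybe_slow here is just a placeholder function name:
#include <loguru.hpp>   // https://github.com/emilk/loguru, compile/link loguru.cpp alongside

void maybe_slow() {
    LOG_SCOPE_F(INFO, "maybe_slow");   // prints the duration of this scope when it exits
    // ... the work you want to time ...
}

int main(int argc, char* argv[]) {
    loguru::init(argc, argv);          // timestamps in the log are relative to startup
    LOG_F(INFO, "starting");
    maybe_slow();
    LOG_F(INFO, "done");
}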
At work we have a really nice tool that helps us monitor what we want in terms of scheduling. It has been useful numerous times.
It's in C++ and must be customized to your needs. Unfortunately I can't share code, just concepts.
You use a "large" volatile buffer containing timestamps and event ID that you can dump post mortem or after stopping the logging system (and dump this into a file for example).
You retrieve the so-called large buffer with all the data and a small interface parses it and shows events with name (up/down + value) like an oscilloscope does with colors (configured in .hpp file).
You customize the amount of events generated to focus solely on what you desire. It helped us a lot for scheduling issues while consuming the amount of CPU we wanted based on the amount of logged events per second.
You need 3 files:
toolname.hpp // interface
toolname.cpp // code
tool_events_id.hpp // Events ID
The concept is to define events in tool_events_id.hpp like this:
// EVENT_NAME ID BEGIN_END BG_COLOR NAME
#define SOCK_PDU_RECV_D 0x0301 //#D00301 BGEEAAAA # TX_PDU_Recv
#define SOCK_PDU_RECV_F 0x0302 //#F00301 BGEEAAAA # TX_PDU_Recv
You also define a few functions in toolname.hpp:
#define LOG_LEVEL_ERROR 0
#define LOG_LEVEL_WARN 1
// ...
void init(void);
void probe(id,payload);
// etc
Anywhere in your code, you can use:
toolname<LOG_LEVEL>::log(EVENT_NAME,VALUE);
The probe function uses a few assembly lines to retrieve the clock timestamp ASAP and then sets an entry in the buffer. We also have an atomic increment to safely find an index at which to store the log event.
Of course, the buffer is circular.
Hope the idea is not obfuscated by the lack of sample code.
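To make the idea concrete, here is an illustrative sketch of what such a probe could look like. This is not the answer's actual tool; the names, the buffer size, and the use of the x86 rdtsc intrinsic for the timestamp are all assumptions.
#include <atomic>
#include <cstdint>
#include <x86intrin.h>   // __rdtsc(), x86-only

struct Entry {
    uint64_t tsc;        // raw timestamp counter value
    uint32_t event_id;   // e.g. SOCK_PDU_RECV_D
    uint32_t payload;
};

constexpr std::size_t kBufSize = 1 << 20;    // power of two, so wrap-around is a mask
static Entry g_buffer[kBufSize];             // the "large" buffer, dumped post mortem
static std::atomic<uint64_t> g_index{0};     // atomic increment to claim a slot

inline void probe(uint32_t event_id, uint32_t payload) {
    uint64_t tsc = __rdtsc();                // grab the clock as early as possible
    uint64_t slot = g_index.fetch_add(1, std::memory_order_relaxed) & (kBufSize - 1);
    g_buffer[slot] = Entry{tsc, event_id, payload};   // circular: old entries get overwritten
}
The dump-and-parse side (writing the buffer to a file and the oscilloscope-like viewer) is omitted here.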
Use debugging software.
How do you identify where the code is running slowly? Think of an obstacle while you are in motion: it decreases your speed.
Likewise, operations such as unwanted reallocations in loops, buffer overflows, searching, and memory leaks consume execution power and adversely affect the performance of the code.
Be sure to add -pg to compilation before profiling:
g++ your_prg.cpp -pg or cc my_program.cpp -g -pg, as per your compiler.
I haven't tried it yet, but I've heard good things about google-perftools. It is definitely worth a try.
valgrind --tool=callgrind ./(Your binary)
This will generate a file called callgrind.out.x (the -pg route above produces gmon.out for gprof instead). You can then use kcachegrind to read the callgrind file. It will give you a graphical analysis of things, with results like which lines cost how much, I think.
As no one has mentioned Arm MAP, I'd add it, since I have personally used MAP successfully to profile a C++ scientific program.
Arm MAP is the profiler for parallel, multithreaded or single threaded C, C++, Fortran and F90 codes. It provides in-depth analysis and bottleneck pinpointing to the source line. Unlike most profilers, it's designed to be able to profile pthreads, OpenMP or MPI for parallel and threaded code.
MAP is commercial software.
Use the -pg flag when compiling and linking the code, then run the executable. While the program executes, profiling data is collected in the file gmon.out (next to the executable, e.g. a.out).
There are two different types of profiling:
1- Flat profiling:
By running the command gprof --flat-profile a.out you get the following data:
- what percentage of the overall time was spent in the function,
- how many seconds were spent in a function, including and excluding calls to sub-functions,
- the number of calls,
- the average time per call.
2- Graph profiling:
Use the command gprof --graph a.out to get the following data for each function:
- In each section, one function is marked with an index number.
- Above the function, there is a list of the functions that call it.
- Below the function, there is a list of the functions that it calls.
To get more info, you can look at https://sourceware.org/binutils/docs-2.32/gprof/
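Putting it together, a typical end-to-end run might look like this (a sketch; myprog.c is a placeholder name, and gprof picks up gmon.out from the current directory):
gcc -g -pg -o myprog myprog.c          # compile and link with -pg
./myprog                               # running it writes gmon.out
gprof --flat-profile myprog gmon.out   # flat profile
gprof --graph myprog gmon.out          # call graph profile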