Perf Error: The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (msr/tsc/) - profiling

I am using perf to monitor the system for certain events. However, I get the following error and I have no idea where it comes from, as the event is listed in perf list:
sudo perf record -e msr/tsc/ -a
Error:
The sys_perf_event_open() syscall returned with 22 (Invalid argument) for event (msr/tsc/).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?
How can I check whether CONFIG_PERF_EVENTS=y kernel support is configured?
Some test results:
sudo dmesg | grep "perf\|pmu"
[ 0.029179] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[ 0.029179] ENERGY_PERF_BIAS: View and update with x86_energy_perf_policy(8)
[ 9475.406967] perf: interrupt took too long (2509 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 9475.990901] perf: interrupt took too long (3146 > 3136), lowering kernel.perf_event_max_sample_rate to 63500
[ 9476.886941] perf: interrupt took too long (3942 > 3932), lowering kernel.perf_event_max_sample_rate to 50500
[76057.268195] perf: interrupt took too long (4934 > 4927), lowering kernel.perf_event_max_sample_rate to 40500
[167368.007839] perf: interrupt took too long (6171 > 6167), lowering kernel.perf_event_max_sample_rate to 32250
[168338.165608] perf: interrupt took too long (7804 > 7713), lowering kernel.perf_event_max_sample_rate to 25500
perf list |grep msr
msr/aperf/ [Kernel PMU event]
msr/mperf/ [Kernel PMU event]
msr/pperf/ [Kernel PMU event]
msr/smi/ [Kernel PMU event]
msr/tsc/
sudo uname -a
Linux bla 4.9.0-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
sudo /proc/config.gz returns command not found
Any help/ideas are appreciated.

It is possible to collect this information with perf record by using group sampling. For example, the following command
perf record -a -e '{cycles,msr/aperf/,msr/tsc/}:S'
collects values of all three events based on the cycle (first counter) overflow. The undocumented :S modifier is necessary; it makes sure that only the leader of the group triggers samples. To see this information, use perf report --group (depending on your perf version the parameter may not be necessary). I'm afraid the actual values for each sample are only visible in the very verbose perf script -D.
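Putting the pieces together, one possible session (the output file is just perf's default perf.data) could look like this:
perf record -a -e '{cycles,msr/aperf/,msr/tsc/}:S'   # sample on cycles; aperf/tsc are read with each sample
perf report --group                                   # show the three counters side by side per symbol
perf script -D                                        # very verbose raw dump containing the per-sample values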

There was a patch introduced in perf to support MSR Performance Monitoring Units. These MSR PMUs support free-running MSR counters. These counters include time and frequency-based counters like TSC, IA32_APERF, IA32_MPERF and IA32_PPERF.
These MSR events do not support sampling modes, as can be seen from this check in the Linux kernel (v4.9) source code.
Snippet:
if (event->attr.sample_period) /* no sampling */
    return -EINVAL;
perf_events can instrument in three ways (counting events, sampling events and BPF events). Remember that when you run perf record, you are invoking the sampling mode. Even though you do not explicitly specify a sampling period, internally sampling happens at a default sampling frequency.
To count msr events, you need to run perf_events in counting/aggregation mode. Use perf stat for this:
perf stat -e msr/tsc/ -a
^C
Performance counter stats for 'system wide':
34,83,07,96,035 msr/tsc/
2.419151644 seconds time elapsed
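If what you actually want is periodic monitoring rather than one aggregate count, perf stat also has an interval mode; for example (the 1000 ms interval is only an illustration):
sudo perf stat -I 1000 -e msr/tsc/ -a   # prints one counting line per second until interrupted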
Read this to understand more about counting and sampling events/modes.

Related

why does use "perf stat -B -e branches,branch-misses ./a.out" to get one process branch prediction miss is always zero in centos7 [duplicate]

I want to evaluate the performance of one process via the "branch-misses" hardware event. But when I use perf stat to get "branch-misses" data, it always returns 0, simply because my OS is running in KVM.
Since it is troublesome for me to get a real machine to do the test, I want to know whether there is any method to get "branch-misses" with "perf stat" while running in KVM.
I really need your help. Thanks a lot.

Inconsistency when benchmarking two contiguous measurements

I was benchmarking a function and I see that some iterations are slower than others.
After some tests I tried to benchmark two contiguous measurements and I still got some weird results.
The code is on wandbox.
For me the important part is :
using clock = std::chrono::steady_clock;
// ...
for (int i = 0; i < statSize; i++)
{
    auto t1 = clock::now();
    auto t2 = clock::now();
}
The loop is optimized away as we can see on godbolt.
call std::chrono::_V2::steady_clock::now()
mov r12, rax
call std::chrono::_V2::steady_clock::now()
The code was compiled with:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
and gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) on an Intel® Xeon® W-2195 Processor.
I was the only user on the machine, and I tried to run with and without high priority (nice or chrt); the result was the same.
The result I got with 100 000 000 iterations was:
The Y-axis is in nanoseconds; it's the result of the line
std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
These 4 lines make me think of: no cache miss / L1 cache miss / L2 cache miss / L3 cache miss (even if the "L3 cache miss" line seems to be too close to the L2 line).
I am not sure why there would be cache misses; maybe the storage of the result, but it's not in the measured code.
I have tried to run the program 10 000 times with a loop of 1500, because the L1 cache of this processor is:
lscpu | grep L1
L1d cache: 32K
L1i cache: 32K
And 1500 * 16 bits = 3 000 bytes, which is far less than 32K, so there shouldn't be cache misses.
And the results:
I still have my 4 lines (and some noise).
So if it's really a cache miss, I don't have any idea why it is happening.
I don't know if it's useful for you, but I ran:
sudo perf stat -e cache-misses,L1-dcache-load-misses,L1-dcache-load ./a.out 1000
With the values 1 000 / 10 000 / 100 000 / 1 000 000
I got between 4.30 and 4.70% of all L1-dcache hits, which seems pretty decent to me.
So the questions are:
What is the cause of these slowdowns?
How can I produce a qualitative benchmark of a function when I can't even get a constant time for a no-op?
PS: I don't know if I am missing useful information / flags, feel free to ask!
How to reproduce:
The code:
#include <iostream>
#include <chrono>
#include <vector>
int main(int argc, char **argv)
{
    int statSize = 1000;
    using clock = std::chrono::steady_clock;
    if (argc == 2)
    {
        statSize = std::atoi(argv[1]);
    }
    std::vector<uint16_t> temps;
    temps.reserve(statSize);
    for (int i = 0; i < statSize; i++)
    {
        auto t1 = clock::now();
        auto t2 = clock::now();
        temps.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count());
    }
    for (auto t : temps)
        std::cout << (int)t << std::endl;
    return (0);
}
Build:
g++ bench.cpp -Wall -Wextra -std=c++11 -O3
Generate output (sudo needed):
In this case I run the program 10 000 times. Each run takes 100 measurements, and I remove the first one, which is always about 5 times slower:
for i in {1..10000} ; do sudo nice -n -17 ./a.out 100 | tail -n 99 >> fast_1_000_000_uint16_100 ; done
Generate graph:
cat fast_1_000_000_uint16_100 | gnuplot -p -e "plot '<cat'"
The result I have on my machine:
Where I am after Zulan's answer and all the comments
The current_clocksource is set to tsc and no switch is seen in dmesg; command used:
dmesg -T | grep tsc
I used this script to disable HyperThreading (HT),
then
grep -c proc /proc/cpuinfo
=> 18
Subtract 1 from the last result to obtain the last available core:
=> 17
Edit /etc/default/grub and add isolcpus=(last result) to GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="isolcpus=17"
Finally:
sudo update-grub
reboot
// reexecute the script
Now I can use:
taskset -c 17 ./a.out XXXX
So I run a loop of 100 iterations 10 000 times.
for i in {1..10000} ; do sudo /usr/bin/time -v taskset -c 17 ./a.out 100 > ./core17/run_$i 2>&1 ; done
Check if there is any Involuntary context switches:
grep -L "Involuntary context switches: 0" result/* | wc -l
=> 0
There are none, good. Let's plot:
for i in {1..10000} ; do cat ./core17/run_$i | head -n 99 >> ./no_switch_taskset ; done
cat no_switch_taskset | gnuplot -p -e "plot '<cat'"
Result :
There are still 22 measures greater than 1000 (when most values are around 20) that I don't understand.
Next step, TBD
Do the part:
sudo nice -n -17 perf record...
of Zulan's answer.
I can't reproduce it with these particular clustered lines, but here is some general information.
Possible causes
As discussed in the comments, nice on a normal idle system is just a best effort. You still have at least:
The scheduling tick timer
Kernel tasks that are bound to a certain core
Your task may be migrated from one core to another for an arbitrary reason
You can use isolcpus and taskset to get exclusive cores for certain processes to avoid some of that, but I don't think you can really get rid of all the kernel tasks. In addition, use nohz_full to disable the scheduling tick. You should also disable hyperthreading to get exclusive access to a core from a hardware thread.
Except for taskset, which I absolutely recommend for any performance measurement, these are quite unusual measures.
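For reference, a sketch of what this could look like on the setup from the question (core 17 is whatever core you decided to isolate):
# /etc/default/grub  (illustrative values, matching the question's core-17 setup)
GRUB_CMDLINE_LINUX="isolcpus=17 nohz_full=17"
# then: sudo update-grub && reboot, and pin the benchmark:
taskset -c 17 ./a.out 100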
Measure instead of guessing
If there is a suspicion of what could be happening, you can usually set up a measurement to confirm or disprove that hypothesis. perf and tracepoints are great for that. For example, we can start with looking at scheduling activity and some interrupts:
sudo nice -n -17 perf record -o perf.data -e sched:sched_switch -e irq:irq_handler_entry -e irq:softirq_entry ./a.out ...
perf script will now list every occurrence. To correlate that with slow iterations you can use perf probe and a slightly modified benchmark:
void __attribute__((optimize("O0"))) record_slow(int64_t count)
{
    (void)count;
}
...
auto count = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
if (count > 100) {
    record_slow(count);
}
temps.push_back(count);
And compile with -g
sudo perf probe -x ./a.out record_slow count
Then add -e probe_a:record_slow to the call to perf record. Now if you are lucky, you find some close events, e.g.:
a.out 14888 [005] 51213.829062: irq:softirq_entry: vec=1 [action=TIMER]
a.out 14888 [005] 51213.829068: probe_a:record_slow: (559354aec479) count=9029
Be aware: while this information will likely explain some of your observation, you enter a world of even more puzzling questions and oddities. Also, while perf is pretty low-overhead, there may be some perturbation on what you measure.
What are we benchmarking?
First of all, you need to be clear what you actually measure: The time to execute std::chrono::steady_clock::now(). It's actually good to do that to figure out at least this measurement overhead as well as the precision of the clock.
That's actually a tricky point. The cost of this function, with clock_gettime underneath, depends on your current clocksource1. If that's tsc you're fine - hpet is much slower. Linux may switch quietly2 from tsc to hpet during operation.
What to do to get stable results?
Sometimes you might need to do benchmarks with extreme isolation, but usually that's not necessary even for very low-level micro-architecture benchmarks. Instead, you can use statistical effects: repeat the measurement. Use the appropriate methods (mean, quantiles), sometimes you may want to exclude outliers.
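As a minimal sketch of that idea with the files produced in the question (assuming one nanosecond value per line, as in no_switch_taskset), the median and a high quantile can be pulled out with standard tools:
sort -n no_switch_taskset > sorted_times
n=$(wc -l < sorted_times)
sed -n "$(( (n + 1) / 2 ))p" sorted_times    # median
sed -n "$(( n * 99 / 100 ))p" sorted_times   # 99th percentile; the outliers live above this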
If the measurement kernel is not significantly longer than timer precision, you will have to repeat the kernel and measure outside to get a throughput rather than a latency, which may or may not be different.
Yes - benchmarking right is very complicated; you need to consider a lot of aspects, especially when you get closer to the hardware and your kernel times get very short. Fortunately there's some help, for example Google's benchmark library provides a lot of help in terms of doing the right amount of repetitions and also in terms of handling experiment factors.
1 /sys/devices/system/clocksource/clocksource0/current_clocksource
2 Actually it's in dmesg as something like
clocksource: timekeeping watchdog on CPU: Marking clocksource 'tsc' as unstable because the skew is too large:

mysterious rtm abort using haswell tsx

I'm experimenting with the tsx extensions in Haswell by adapting an existing medium-sized (1000's of lines) codebase to use GCC transactional memory extensions (which indirectly use Haswell tsx on this machine) instead of coarse-grained locks. I am using GCC's transactional_memory extensions, not writing my own _xbegin / _xend directly. I am using ITM_DEFAULT_METHOD=htm.
I'm having issues getting it to work fast enough because I get high rates of hardware transaction aborts for mysterious reasons. As shown below, these aborts are not due to conflicts nor due to capacity limitations.
Here is the perf command I used to quantify the failure rate and underlying causes:
perf stat \
-e cpu/event=0x54,umask=0x2,name=tx_mem_abort_capacity_write/ \
-e cpu/event=0x54,umask=0x1,name=tx_mem_abort_conflict/ \
-e cpu/event=0x5d,umask=0x1,name=tx_exec_misc1/ \
-e cpu/event=0x5d,umask=0x2,name=tx_exec_misc2/ \
-e cpu/event=0x5d,umask=0x4,name=tx_exec_misc3/ \
-e cpu/event=0x5d,umask=0x8,name=tx_exec_misc4/ \
-e cpu/event=0x5d,umask=0x10,name=tx_exec_misc5/ \
-e cpu/event=0xc9,umask=0x1,name=rtm_retired_start/ \
-e cpu/event=0xc9,umask=0x2,name=rtm_retired_commit/ \
-e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp \
-e cpu/event=0xc9,umask=0x8,name=rtm_retired_aborted_misc1/ \
-e cpu/event=0xc9,umask=0x10,name=rtm_retired_aborted_misc2/ \
-e cpu/event=0xc9,umask=0x20,name=rtm_retired_aborted_misc3/ \
-e cpu/event=0xc9,umask=0x40,name=rtm_retired_aborted_misc4/ \
-e cpu/event=0xc9,umask=0x80,name=rtm_retired_aborted_misc5/ \
./myprogram -th 1 -reps 3000000
So, the program runs some code with transactions in it 30 million times. Each request involves one gcc __transaction_atomic block. There is only one thread in this run.
This particular perf command captures most of the relevant tsx performance events described in the Intel software developers manual vol 3.
The output from perf stat is the following:
0 tx_mem_abort_capacity_write [26.66%]
0 tx_mem_abort_conflict [26.65%]
29,937,894 tx_exec_misc1 [26.71%]
0 tx_exec_misc2 [26.74%]
0 tx_exec_misc3 [26.80%]
0 tx_exec_misc4 [26.92%]
0 tx_exec_misc5 [26.83%]
29,906,632 rtm_retired_start [26.79%]
0 rtm_retired_commit [26.70%]
29,985,423 rtm_retired_aborted [26.66%]
0 rtm_retired_aborted_misc1 [26.75%]
0 rtm_retired_aborted_misc2 [26.73%]
29,927,923 rtm_retired_aborted_misc3 [26.71%]
0 rtm_retired_aborted_misc4 [26.69%]
176 rtm_retired_aborted_misc5 [26.67%]
10.583607595 seconds time elapsed
As you can see from the output:
The rtm_retired_start count is 30 million (matches input to program)
The rtm_retired_aborted count is about the same (no commits at all)
The abort_conflict and abort_capacity counts are 0, so these are not the reasons. Also, recall that only one thread is running, so conflicts should be rare.
The only actual leads here are the high values of tx_exec_misc1 and rtm_retired_aborted_misc3, which are somewhat similar in description.
The Intel manual (vol 3) defines the rtm_retired_aborted_misc3 counter:
code: C9H 20H
mnemonic: RTM_RETIRED.ABORTED_MISC3
description: Number of times an RTM execution aborted due to HLE unfriendly instructions.
The definition for tx_exec_misc1 has some similar words:
code: 5DH 01H
mnemonic: TX_EXEC.MISC1
description: Counts the number of times a class of instructions that may cause a transactional abort was executed. Since this is the count of execution, it may not always cause a transactional abort.
I checked the assembly location for the aborts using perf record / perf report with high-precision (PEBS) support for rtm_retired_aborted. The location has a mov instruction from register to register. No weird instruction names are seen nearby.
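For reference, the record/report pair I mean is along these lines (reusing the raw event encoding and the /pp precise modifier from the perf stat invocation above):
perf record -e cpu/event=0xc9,umask=0x4,name=rtm_retired_aborted/pp ./myprogram -th 1 -reps 3000000
perf report   # then annotate to see which instruction the aborts are attributed to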
Update:
Here are two things I've tried since then:
1) The tx_exec_misc1 and rtm_retired_aborted_misc3 signature we see here can be obtained, for example, by a dummy block of the form
for (int i = 0; i < 10000000; i++){
    __transaction_atomic{
        _xabort(1);
    }
}
or one of the form
for (int i = 0; i < 10000000; i++){
    __transaction_atomic{
        printf("hello");
        fflush(stdout);
    }
}
In both cases, the perf counters look similar to what I see. However, in both cases the perf report for -e cpu/tx-abort/ points to the intuitively correct assembly lines: an xabort instruction for the first example and a syscall one for the second one. In the real codebase, the perf report points to a stack push right at the start of a function:
: 00000000004167e0 <myns::myfun()>:
100.00 : 4167e0: push %rbp
0.00 : 4167e1: mov %rsp,%rbp
0.00 : 4167e4: push %r15
2) I have also run the same command under the Intel software development emulator. It turns out that the problem goes away in that case: I get no aborts as far as the application is concerned.
Though it's been a while, I found this unanswered question while searching, so here's the answer: this is a hardware bug in Haswell and early Broadwell chips.
The particular hardware erratum assigned by Intel is HSW136, and is not fixable using microcode updates. Indeed, I think it was in stepping 4 that the feature was no longer reported as available by the cpuid instruction, even when there was (faulty) silicon on the chip to implement it.
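If you want to check which situation your own part is in, the advertised feature flag and the stepping are both visible in /proc/cpuinfo, for example:
grep -cw rtm /proc/cpuinfo       # 0 means RTM is not advertised by cpuid (as on the later steppings mentioned above)
grep -m1 stepping /proc/cpuinfo  # stepping of the chip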

How to run record instruction-history and function-call-history in GDB?

(EDIT: per the first answer below, the current "trick" seems to be using an Atom processor. But I hope some gdb guru can answer whether this is a fundamental limitation, or whether adding support for other processors is on the roadmap?)
Reverse execution seems to be working in my environment: I can reverse-continue, see a plausible record log, and move around within it:
(gdb) start
...Temporary breakpoint 5 at 0x8048460: file bang.cpp, line 13.
Starting program: /home/thomasg/temp/./bang
Temporary breakpoint 5, main () at bang.cpp:13
13 f(1000);
(gdb) record
(gdb) continue
Continuing.
Breakpoint 3, f (d=900) at bang.cpp:5
5 if(d) {
(gdb) info record
Active record target: record-full
Record mode:
Lowest recorded instruction number is 1.
Highest recorded instruction number is 1005.
Log contains 1005 instructions.
Max logged instructions is 200000.
(gdb) reverse-continue
Continuing.
Breakpoint 3, f (d=901) at bang.cpp:5
5 if(d) {
(gdb) record goto end
Go forward to insn number 1005
#0 f (d=900) at bang.cpp:5
5 if(d) {
However the instruction and function histories aren't available:
(gdb) record instruction-history
You can't do that when your target is `record-full'
(gdb) record function-call-history
You can't do that when your target is `record-full'
And the only target type available is full; the other documented type "btrace" fails with "Target does not support branch tracing."
So quite possibly it just isn't supported for this target, but as it's a mainstream modern one (gdb 7.6.1-ubuntu, on amd64 Linux Mint "Petra" running an "Intel(R) Core(TM) i5-3570") I'm hoping that I've overlooked a crucial step or config?
It seems that there is no other solution except a CPU that supports it.
More precisely, your CPU has to support Intel Processor Tracing (Intel PT). This can be checked in Linux with:
grep intel_pt /proc/cpuinfo
See also: https://unix.stackexchange.com/questions/43539/what-do-the-flags-in-proc-cpuinfo-mean
The commands only work in record btrace mode.
In the GDB source commit beab5d9, it is nat/linux-btrace.c:kernel_supports_pt that checks if we can enter btrace. The following checks are carried out:
check if /sys/bus/event_source/devices/intel_pt/type exists and read the type
do a syscall(SYS_perf_event_open, &attr, child, -1, -1, 0) with the type that was read, and see if it returns >= 0. TODO: why not use the C wrapper?
The first check fails for me: the file does not exist.
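That first check is easy to reproduce by hand from a shell:
cat /sys/bus/event_source/devices/intel_pt/type
# a number means the intel_pt PMU is registered; "No such file or directory" is the failure described above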
Kernel side
cd into the kernel 4.1 source and:
git grep '"intel_pt"'
we find arch/x86/kernel/cpu/perf_event_intel_pt.c which sets up that file. In particular, it does:
if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
goto fail;
so intel_pt is a pre-requisite.
How I've found kernel_supports_pt
First grep for:
git grep 'Target does not support branch tracing.'
which leads us to btrace.c:btrace_enable. After a quick debug with:
gdb -q -ex start -ex 'b btrace_enable' -ex c --args /home/ciro/git/binutils-gdb/install/bin/gdb --batch -ex start -ex 'record btrace' ./hello_world.out
VirtualBox does not support it either: Extract execution log from gdb record in a VirtualBox VM
Intel SDE
Intel SDE 7.21 already has this CPU feature, checked with:
./sde64 -- cpuid | grep 'Intel processor trace'
But I'm not sure if the Linux kernel can be run on it: https://superuser.com/questions/950992/how-to-run-the-linux-kernel-on-intel-software-development-emulator-sde
Other GDB methods
More generic questions, with less efficient software solutions:
call graph: List of all function calls made in an application
instruction trace: Displaying each assembly instruction executed in gdb
At least a partial answer (for the "am I doing it wrong" aspect) - from gdb-7.6.50.20140108/gdb/NEWS
* A new record target "record-btrace" has been added. The new target
uses hardware support to record the control-flow of a process. It
does not support replaying the execution, but it implements the
below new commands for investigating the recorded execution log.
This new recording method can be enabled using:
record btrace
The "record-btrace" target is only available on Intel Atom processors
and requires a Linux kernel 2.6.32 or later.
* Two new commands have been added for record/replay to give information
about the recorded execution without having to replay the execution.
The commands are only supported by "record btrace".
record instruction-history prints the execution history at
instruction granularity
record function-call-history prints the execution history at
function granularity
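So, on a processor that supports it, the usage should look roughly like this (a sketch, not a transcript from my machine):
(gdb) start
(gdb) record btrace
(gdb) continue
...
(gdb) record function-call-history
(gdb) record instruction-history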
It's not often that I envy the owner of an Atom processor ;-)
I'll edit the question to refocus upon the question of workarounds or plans for future support.

Capture CPU and Memory usage dynamically

I am running a shell script to execute a C++ application, which measures the performance of an API. I can capture the latency (time taken to return a value for a given set of parameters) of the API, but I also wish to capture the CPU and memory usage alongside, at intervals of say 5-10 seconds.
Is there a way to do this without affecting the performance of the system too much, and within the same script? I have found many examples where one can do this outside (independently) of the script we are running, but not one where we can do it within the same script.
If you are looking to capture CPU and memory utilization dynamically for the entire Linux box, then the following commands can help you too:
CPU
vmstat -n 15 10| awk '{now=strftime("%Y-%m-%d %T "); print now $0}'> CPUDataDump.csv &
vmstat is used for collection of CPU counters
-n makes vmstat print the header only once; 15 is the delay value, which means stats will be collected every 15 seconds
then 10 is the number of intervals; there will be 10 iterations in this example
awk '{now=strftime("%Y-%m-%d %T "); print now $0}' dumps the timestamp of each iteration
in the end, the output goes to the dump file, and & lets the script continue while the collection runs in the background
Memory
free -m -s 10 10 | awk '{now=strftime("%Y-%m-%d %T "); print now $0}'> DataDumpMemoryfile.csv &
free is for mem stats collection
-m this is for units of mem (you can use -b for bytes, -k for kilobytes, -g for gigabytes)
then 10 is the number of intervals (there would be 10 iterations in this example)
awk '{now=strftime("%Y-%m-%d %T "); print now $0}' dumps the timestamp of each iteration
in the end, the output goes to the dump file, and & lets the script continue while the collection runs in the background
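To address the "within the same script" part of the question, a minimal sketch of a wrapper (file names, the 5-second interval and ./my_api_benchmark are placeholders for your own) could be:
#!/bin/bash
# start the collectors in the background, run the application, then stop them
vmstat -n 5 > cpu_dump.txt &     # CPU counters every 5 seconds until killed
VMSTAT_PID=$!
free -m -s 5 > mem_dump.txt &    # memory snapshot every 5 seconds until killed
FREE_PID=$!
./my_api_benchmark               # the C++ application whose latency is already measured
kill "$VMSTAT_PID" "$FREE_PID"   # stop the collectors once the run is over
The awk timestamping shown above can be piped into either collector in the same way if per-line timestamps are needed.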
I'd suggest using the 'time' command and also the 'vmstat' command. The first will give the CPU usage of the executable's run, and the second a periodic (i.e. once per second) dump of the system's CPU/memory/IO.
Example:
time dd if=/dev/zero bs=1K of=/dev/null count=1024000
1024000+0 records in
1024000+0 records out
1048576000 bytes (1.0 GB) copied, 0.738194 seconds, 1.4 GB/s
0.218u 0.519s 0:00.73 98.6% 0+0k 0+0io 0pf+0w <== that's time result