Sky-high iTLB-load-misses when profiling

I am trying to identify the bottleneck in my code using perf and ocperf.
If I do a 'detailed stat' run on my binary, two statistics are reported in red text, which I suppose means they are too high.
L1-dcache-load-misses is in red at 28.60%
iTLB-load-misses is in red at 425.89%
# ~bram/src/pmu-tools/ocperf.py stat -d -d -d -d -d ./bench ray
perf stat -d -d -d -d -d ./bench ray
Loaded 455 primitives.
Testing ray against 455 primitives.
Performance counter stats for './bench ray':
9031.444612 task-clock (msec) # 1.000 CPUs utilized
15 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
292 page-faults # 0.032 K/sec
28,786,063,163 cycles # 3.187 GHz (61.47%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
55,742,952,563 instructions # 1.94 insns per cycle (69.18%)
3,717,242,560 branches # 411.589 M/sec (69.18%)
18,097,580 branch-misses # 0.49% of all branches (69.18%)
10,230,376,136 L1-dcache-loads # 1132.751 M/sec (69.17%)
2,926,349,754 L1-dcache-load-misses # 28.60% of all L1-dcache hits (69.21%)
145,843,523 LLC-loads # 16.148 M/sec (69.32%)
49,512 LLC-load-misses # 0.07% of all LL-cache hits (69.33%)
<not supported> L1-icache-loads
260,144 L1-icache-load-misses # 0.029 M/sec (69.34%)
10,230,376,830 dTLB-loads # 1132.751 M/sec (69.34%)
1,197 dTLB-load-misses # 0.00% of all dTLB cache hits (61.59%)
2,294 iTLB-loads # 0.254 K/sec (61.55%)
9,770 iTLB-load-misses # 425.89% of all iTLB cache hits (61.51%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
9.032234014 seconds time elapsed
My questions:
What would be a reasonable figure for L1 data cache misses?
What would be a reasonable figure for iTLB-load-misses?
Why can iTLB-load-misses exceed 100%? In other words: why is iTLB-load-misses exceeding iTLB-loads? I've even seen it spike as high as 568%
Also, my machine has a Haswell CPU; I would have expected the stalled-cycles stats to be included. Why are they reported as <not supported>?

Related

Which is better for CPU usage: waiting for a function's result with std::future::wait(), or checking a flag and sleeping for a while in a loop?

Q1: which uses less CPU, std::future::wait() or checking a flag in a while loop?
std::atomic_bool isRunning{false};

void foo() {
    isRunning.store(true);
    doSomethingTimeConsuming();
    isRunning.store(false);
}

std::future<void> f = std::async(std::launch::async, foo);

use std::future::wait():

if (f.valid())
    f.wait();

check flag in a while loop:

using namespace std::chrono_literals;
if (f.valid()) {
    while (isRunning.load())
        std::this_thread::sleep_for(1ms);
}
Q2: does the conclusion also apply to std::thread::join() or std::condition_variable::wait()?
Thanks in advance.
std::this_thread::sleep_for keeps waking the waiting thread at times unrelated to when the result becomes ready. On average, the latency between the result being ready and the waiter noticing it is half the sleep_for timeout.
std::future::wait is more efficient because it blocks in the kernel until the result is ready, without making the repeated unnecessary syscalls that the std::this_thread::sleep_for loop does.
If you run the two versions with
void doSomethingTimeConsuming() {
    std::this_thread::sleep_for(1s);
}
under perf stat, the results for std::future::wait are:
1.803578 task-clock (msec) # 0.002 CPUs utilized
2 context-switches # 0.001 M/sec
0 cpu-migrations # 0.000 K/sec
116 page-faults # 0.064 M/sec
6,356,215 cycles # 3.524 GHz
4,511,076 instructions # 0.71 insn per cycle
835,604 branches # 463.304 M/sec
22,313 branch-misses # 2.67% of all branches
Whereas for std::this_thread::sleep_for(1ms):
11.715249 task-clock (msec) # 0.012 CPUs utilized
901 context-switches # 0.077 M/sec
6 cpu-migrations # 0.512 K/sec
118 page-faults # 0.010 M/sec
40,177,222 cycles # 3.429 GHz
25,401,055 instructions # 0.63 insn per cycle
2,286,806 branches # 195.199 M/sec
156,400 branch-misses # 6.84% of all branches
I.e. in this particular test, sleep_for burns roughly 6 times as many CPU cycles.
Note that there is a race condition between isRunning.load() and isRunning.store(true): if the waiter checks the flag before the async thread has set it, the loop exits immediately. A fix is to initialize isRunning{true};.
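For reference, here is a self-contained sketch of the polling variant with that fix applied, using the same 1 s doSomethingTimeConsuming as above; this is only an illustration for reproducing the measurement, not the exact benchmark that produced the numbers.

#include <atomic>
#include <chrono>
#include <future>
#include <thread>

// Initialized to true so the waiter cannot observe "not running"
// before foo() has had a chance to set the flag.
std::atomic_bool isRunning{true};

void doSomethingTimeConsuming() {
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

void foo() {
    isRunning.store(true);
    doSomethingTimeConsuming();
    isRunning.store(false);
}

int main() {
    using namespace std::chrono_literals;
    auto f = std::async(std::launch::async, foo);
    if (f.valid()) {
        while (isRunning.load())              // poll the flag,
            std::this_thread::sleep_for(1ms); // sleeping 1 ms per iteration
    }
}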

For the same task, why do more threads lead to fewer instructions?

Code
I ran my program 30 times, with the n passed to run_and_join_threads() changed from 1 to 30 accordingly.
Note that the jobs passed to run_and_join_threads() were populated in exactly the same way in each execution.
void do_job(JobQueue *jobs) {
    Job job;
    while ((job = jobs->pop()))
        job();
    // control flow reaches here when jobs->pop() returns nullptr,
    // which means all the jobs have been done
}

void run_and_join_threads(int n, JobQueue &jobs) {
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (int i = 0; i < n; ++i)
        threads.push_back(std::thread(do_job, &jobs));
    // synchronization
    for (int i = 0; i < n; ++i)
        threads[i].join();
}
JobQueue.h
#ifndef JOB_QUEUE_H
#define JOB_QUEUE_H

#include <functional>
#include <queue>
#include <mutex>

typedef std::function<void (void)> Job;

// Its methods are all atomic.
class JobQueue {
    std::queue<Job> jobs;
    std::mutex mtx;
public:
    void push(Job job);
    // pop removes the "oldest" job in the queue and returns it.
    // pop returns nullptr if there are no more jobs left in the queue.
    Job pop();
};

#endif
JobQueue.cc
#include "JobQueue.h"
using namespace std;
void JobQueue::push(Job job) {
    mtx.lock();
    jobs.push(job);
    mtx.unlock();
}

Job JobQueue::pop() {
    Job job = nullptr;
    mtx.lock();
    if (!jobs.empty()) {
        job = jobs.front();
        jobs.pop();
    }
    mtx.unlock();
    return job;
}
Chart
I use perf stat -e instructions:u ./my_program to record the number of instructions during my program's execution.
I then found a negative correlation between the number of threads and the number of user instructions.
My Thoughts
Since the "real task" remains the same, more threads should lead to more thread construction and destruction, resulting in more instructions, but that's not what the chart shows. I tried to google the keywords in the title, but had no luck.
compilation options: -std=c++14 -pthread -Wextra -Werror -MMD
gcc version: 8.2.1 20180831
Output of --per-thread when n = 10
hw4-9525 8,524.37 msec task-clock:u # 0.153 CPUs utilized
hw4-9524 8,082.77 msec task-clock:u # 0.145 CPUs utilized
hw4-9522 7,824.93 msec task-clock:u # 0.140 CPUs utilized
hw4-9519 7,782.65 msec task-clock:u # 0.139 CPUs utilized
hw4-9518 7,734.42 msec task-clock:u # 0.138 CPUs utilized
hw4-9517 7,722.12 msec task-clock:u # 0.138 CPUs utilized
hw4-9520 7,636.99 msec task-clock:u # 0.137 CPUs utilized
hw4-9425 11,899.78 msec task-clock:u # 0.213 CPUs utilized
hw4-9521 7,585.14 msec task-clock:u # 0.136 CPUs utilized
hw4-9526 7,580.60 msec task-clock:u # 0.136 CPUs utilized
hw4-9523 7,306.57 msec task-clock:u # 0.131 CPUs utilized
hw4-9425 0 context-switches:u # 0.000 K/sec
hw4-9517 0 context-switches:u # 0.000 K/sec
hw4-9518 0 context-switches:u # 0.000 K/sec
hw4-9519 0 context-switches:u # 0.000 K/sec
hw4-9520 0 context-switches:u # 0.000 K/sec
hw4-9521 0 context-switches:u # 0.000 K/sec
hw4-9522 0 context-switches:u # 0.000 K/sec
hw4-9523 0 context-switches:u # 0.000 K/sec
hw4-9524 0 context-switches:u # 0.000 K/sec
hw4-9525 0 context-switches:u # 0.000 K/sec
hw4-9526 0 context-switches:u # 0.000 K/sec
hw4-9425 0 cpu-migrations:u # 0.000 K/sec
hw4-9517 0 cpu-migrations:u # 0.000 K/sec
hw4-9518 0 cpu-migrations:u # 0.000 K/sec
hw4-9519 0 cpu-migrations:u # 0.000 K/sec
hw4-9520 0 cpu-migrations:u # 0.000 K/sec
hw4-9521 0 cpu-migrations:u # 0.000 K/sec
hw4-9522 0 cpu-migrations:u # 0.000 K/sec
hw4-9523 0 cpu-migrations:u # 0.000 K/sec
hw4-9524 0 cpu-migrations:u # 0.000 K/sec
hw4-9525 0 cpu-migrations:u # 0.000 K/sec
hw4-9526 0 cpu-migrations:u # 0.000 K/sec
hw4-9425 9,332 page-faults:u # 1144.724 M/sec
hw4-9520 7,487 page-faults:u # 918.404 M/sec
hw4-9526 7,408 page-faults:u # 908.714 M/sec
hw4-9522 7,401 page-faults:u # 907.855 M/sec
hw4-9518 7,386 page-faults:u # 906.015 M/sec
hw4-9524 7,362 page-faults:u # 903.071 M/sec
hw4-9521 7,348 page-faults:u # 901.354 M/sec
hw4-9525 7,258 page-faults:u # 890.314 M/sec
hw4-9517 7,253 page-faults:u # 889.700 M/sec
hw4-9519 7,153 page-faults:u # 877.434 M/sec
hw4-9523 6,194 page-faults:u # 759.797 M/sec
hw4-9425 24,365,706,871 cycles:u # 2988857.145 GHz
hw4-9524 19,199,338,912 cycles:u # 2355116.623 GHz
hw4-9518 18,658,195,691 cycles:u # 2288736.452 GHz
hw4-9522 18,565,304,421 cycles:u # 2277341.801 GHz
hw4-9520 18,524,344,417 cycles:u # 2272317.378 GHz
hw4-9519 18,452,590,959 cycles:u # 2263515.629 GHz
hw4-9521 18,384,181,678 cycles:u # 2255124.099 GHz
hw4-9517 18,169,025,051 cycles:u # 2228731.578 GHz
hw4-9526 17,957,925,085 cycles:u # 2202836.674 GHz
hw4-9523 17,689,877,988 cycles:u # 2169956.262 GHz
hw4-9525 20,380,269,586 cycles:u # 2499977.312 GHz
hw4-9524 35,930,781,858 instructions:u # 1.88 insn per cycle
hw4-9425 31,238,610,254 instructions:u # 1.63 insn per cycle
hw4-9522 34,856,962,399 instructions:u # 1.82 insn per cycle
hw4-9518 34,794,129,974 instructions:u # 1.82 insn per cycle
hw4-9520 34,565,759,122 instructions:u # 1.81 insn per cycle
hw4-9519 34,521,122,564 instructions:u # 1.81 insn per cycle
hw4-9521 34,389,796,009 instructions:u # 1.80 insn per cycle
hw4-9517 33,823,905,990 instructions:u # 1.77 insn per cycle
hw4-9525 38,084,271,354 instructions:u # 1.99 insn per cycle
hw4-9526 33,682,632,175 instructions:u # 1.76 insn per cycle
hw4-9523 33,147,549,812 instructions:u # 1.73 insn per cycle
hw4-9525 6,113,561,884 branches:u # 749929530.566 M/sec
hw4-9425 5,978,592,665 branches:u # 733373322.423 M/sec
hw4-9524 5,765,141,950 branches:u # 707190060.107 M/sec
hw4-9522 5,593,987,998 branches:u # 686195195.687 M/sec
hw4-9518 5,583,032,551 branches:u # 684851328.824 M/sec
hw4-9520 5,546,955,396 branches:u # 680425868.769 M/sec
hw4-9519 5,541,456,246 branches:u # 679751307.023 M/sec
hw4-9521 5,518,407,713 branches:u # 676924023.050 M/sec
hw4-9517 5,427,113,316 branches:u # 665725254.544 M/sec
hw4-9526 5,407,241,325 branches:u # 663287626.012 M/sec
hw4-9523 5,318,730,317 branches:u # 652430286.226 M/sec
hw4-9525 66,142,537 branch-misses:u # 1.18% of all branches
hw4-9524 61,835,669 branch-misses:u # 1.10% of all branches
hw4-9518 61,243,167 branch-misses:u # 1.09% of all branches
hw4-9520 60,266,206 branch-misses:u # 1.07% of all branches
hw4-9521 59,396,966 branch-misses:u # 1.06% of all branches
hw4-9522 59,227,658 branch-misses:u # 1.05% of all branches
hw4-9519 59,210,503 branch-misses:u # 1.05% of all branches
hw4-9526 57,983,090 branch-misses:u # 1.03% of all branches
hw4-9517 57,910,215 branch-misses:u # 1.03% of all branches
hw4-9523 56,251,632 branch-misses:u # 1.00% of all branches
hw4-9425 32,626,137 branch-misses:u # 0.58% of all branches

Significant performance difference of std clock between different machines

While testing something else, I stumbled across behavior that I haven't managed to figure out yet.
Let's look at this snippet:
#include <iostream>
#include <chrono>
int main() {
    int i = 0;
    using namespace std::chrono_literals;
    auto const end = std::chrono::system_clock::now() + 5s;
    while (std::chrono::system_clock::now() < end) {
        ++i;
    }
    std::cout << i;
}
I've noticed that the counts heavily depend on the machine I execute it on.
I've compiled with gcc 7.3, gcc 8.2, and clang 6.0, using -std=c++17 -O3.
On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~3e8
but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~8e6
Now this is a difference that I would like to understand, so I've checked with perf stat -d
on the i7:
4999.419546 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%)
33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%)
7,397,994,820 branches:u # 1479.771 M/sec (62.53%)
34,788 branch-misses:u # 0.00% of all branches (62.58%)
10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%)
13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%)
3,944 LLC-loads:u # 0.789 K/sec (24.95%)
1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)
5.003180401 seconds time elapsed
4.969048000 seconds user
0.016557000 seconds sys
Xeon:
5001.000000 task-clock (msec) # 0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%)
794,184,899 instructions # 0.05 insn per cycle (62.51%)
188,083,219 branches # 37.609 M/sec (62.49%)
85,924 branch-misses # 0.05% of all branches (62.51%)
269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%)
246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%)
13,327 LLC-loads # 0.003 M/sec (49.99%)
7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)
5.006139971 seconds time elapsed
What pops out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to use these diagnostics to come up with an explanation.
And to add a bit more weirdness to the problem, when trying to debug I've also compiled statically on one machine and executed on the other.
On the Xeon the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or the i7.
Doing the same thing on the i7, the counter actually drops from ~3e8 to ~2e7.
So in the end I'm now left with two questions:
Why do I see such a significant difference between the two machines?
Why does a statically linked executable perform worse, when I would expect the opposite?
Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.
perf interestingly shows different numbers though:
5002.000000 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%)
392,228,592 instructions:u # 0.96 insn per cycle (62.51%)
115,475,503 branches:u # 23.086 M/sec (62.51%)
26,355 branch-misses:u # 0.02% of all branches (62.53%)
115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%)
42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%)
88 LLC-loads:u # 0.018 K/sec (49.96%)
2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)
5.005940327 seconds time elapsed
0.533000000 seconds user
4.469000000 seconds sys
It's interesting that there are no context switches any more and instructions per cycle went up significantly, but the cycles, and therefore the clock, are very low!
I've been able to reproduce the respective measurements on the two machines, thanks to @Imran's comment above. (Posting this answer to close the question; if Imran posts one, I'm happy to accept his instead.)
It is indeed related to the available clocksource. The Xeon, unfortunately, had the notsc flag in its kernel parameters, which is why the tsc clocksource wasn't available and wasn't selected.
Thus, for anyone running into this problem (a small C++ helper that prints both values is sketched after this list):
1. Check your clocksource in /sys/devices/system/clocksource/clocksource0/current_clocksource
2. Check the available clocksources in /sys/devices/system/clocksource/clocksource0/available_clocksource
3. If you can't find tsc, run dmesg | grep tsc to check your kernel parameters for notsc
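If you'd rather check from code, here is a minimal C++ sketch that just prints the two sysfs files mentioned above (assuming a Linux system where these paths exist):

#include <fstream>
#include <iostream>
#include <string>

static void print_file(const char *path) {
    std::ifstream in(path);
    std::string line;
    std::getline(in, line); // each file holds a single line
    std::cout << path << ": " << line << '\n';
}

int main() {
    print_file("/sys/devices/system/clocksource/clocksource0/current_clocksource");
    print_file("/sys/devices/system/clocksource/clocksource0/available_clocksource");
}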

Extract single line from command output in terminal

I would like to extract the line containing 'seconds time elapsed' from the perf stat output, for a logging script that I am working on.
I do not want to write the output to a file and then search the file; I would like to do it using 'grep' or something similar.
Here is what I have tried:
perf stat -r 10 echo "Sample_String" | grep -eE "seconds time elapsed"
For which I get
grep: seconds time elapsed: No such file or directory
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
Performance counter stats for 'echo Sample_String' (10 runs):
0.254533 task-clock (msec) # 0.556 CPUs utilized ( +- 0.98% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.220 M/sec ( +- 0.53% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000457686 seconds time elapsed ( +- 1.08% )
And I tried this
perf stat -r 10 echo "Sample_String" > grep -eE "seconds time elapsed"
For which I got
Performance counter stats for 'echo Sample_String -eE seconds time elapsed' (10 runs):
0.262585 task-clock (msec) # 0.576 CPUs utilized ( +- 1.11% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.214 M/sec ( +- 0.36% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000456035 seconds time elapsed ( +- 1.05% )
I am new to tools like grep, awk and sed, and I hope someone can help me out. Again, I do not want to write the output to a file and then search the file.
The problem here is that the output you want is sent to stderr instead of standard output.
You can see this by redirecting stderr to /dev/null: the only output left is the one from the echo command.
~/ perf stat -r 10 echo "Sample_String" 2>/dev/null
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
In order to do what you want, you have to redirect perf's stderr to standard output and hide the original standard output. This way, perf's output is sent to the grep command.
~/ perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null | grep 'seconds time elapsed'
0,013137361 seconds time elapsed ( +- 96,08% )
Looks like your desired output is going to stderr. Try:
perf stat -r 10 echo "Sample_String" 2>&1 | grep "seconds time elapsed"
This could work the way you intend:
grep -e "seconds time elapsed" <<< "$(perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null)"
Output:
0.000544399 seconds time elapsed ( +- 2.05% )

Why does linking to librt swap performance between g++ and clang?

I just found this answer from @tony-d with benchmark code to test virtual function call overhead. I checked his benchmark using g++:
$ g++ -O2 -o vdt vdt.cpp -lrt
$ ./vdt
virtual dispatch: 150000000 0.128562
switched: 150000000 0.0803207
overheads: 150000000 0.0543323
...
I got better performance than he did (the ratio is about 2), but then I checked with clang:
$ clang++-3.7 -O2 -o vdt vdt.cpp -lrt
$ ./vdt
virtual dispatch: 150000000 0.462368
switched: 150000000 0.0569544
overheads: 150000000 0.0509332
...
Now the ratio goes up to about 70!
I then noticed the -lrt command line argument, and after a bit of googling about librt I tried without it for g++ and clang:
$ g++ -O2 -o vdt vdt.cpp
$ ./vdt
virtual dispatch: 150000000 0.4661
switched: 150000000 0.0815865
overheads: 150000000 0.0543611
...
$ clang++-3.7 -O2 -o vdt vdt.cpp
$ ./vdt
virtual dispatch: 150000000 0.155901
switched: 150000000 0.0568319
overheads: 150000000 0.0492521
...
As you can see, the performance figures are swapped.
From what I found about librt, it is needed for clock_gettime and other time-related functions (maybe I am wrong; correct me in that case!), but the code compiles fine without -lrt, and the timings look correct from what I can see.
Why does linking or not linking librt affect that code so much?
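(For background: clock_gettime is the POSIX timing call in question; on glibc before 2.17 it lived in librt, which is why -lrt shows up in build lines. A minimal usage sketch, just to show what the flag historically provided:)

#include <stdio.h>
#include <time.h> // clock_gettime, CLOCK_MONOTONIC (POSIX)

int main() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts); // resolved from librt on glibc < 2.17
    printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}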
Information about my system and compilers:
$ g++ --version
g++-5 (Ubuntu 5.3.0-3ubuntu1~14.04) 5.3.0 20151204
Copyright (C) 2015 Free Software Foundation, Inc.
$ clang++-3.7 --version
Debian clang version 3.7.1-svn254351-1~exp1 (branches/release_37) (based on LLVM 3.7.1)
Target: x86_64-pc-linux-gnu
Thread model: posix
$ uname -a
Linux ****** 3.13.0-86-generic #130-Ubuntu SMP Mon Apr 18 18:27:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
I would guess that it is connected with the optimizer (if -lrt is specified, the optimizer has more data because of linking with the library, and it can optimize differently).
As for the differences: with my g++ (4.8.4) I have the same results with and without -lrt, but with clang (3.4-1ubuntu3) there is a difference. I ran this through perf stat with the following results:
$ g++ -O2 -o vdt vdt.cpp -lrt && perf stat -d ./vdt
virtual dispatch: 150000000 1.2304
switched: 150000000 0.131782
overheads: 150000000 0.0842732
virtual dispatch: 150000000 1.13689
switched: 150000000 0.137304
overheads: 150000000 0.0854806
virtual dispatch: 150000000 1.19261
switched: 150000000 0.133561
overheads: 150000000 0.0969093
Performance counter stats for './vdt':
4068.861539 task-clock (msec) # 0.961 CPUs utilized
1,068 context-switches # 0.262 K/sec
0 cpu-migrations # 0.000 K/sec
431 page-faults # 0.106 K/sec
11,977,128,883 cycles # 2.944 GHz [40.18%]
6,088,274,331 stalled-cycles-frontend # 50.83% frontend cycles idle [39.92%]
3,984,855,636 stalled-cycles-backend # 33.27% backend cycles idle [39.98%]
6,581,309,599 instructions # 0.55 insns per cycle
# 0.93 stalled cycles per insn [50.06%]
1,506,617,848 branches # 370.280 M/sec [50.12%]
303,871,937 branch-misses # 20.17% of all branches [49.88%]
2,708,080,460 L1-dcache-loads # 665.562 M/sec [49.94%]
559,844,530 L1-dcache-load-misses # 20.67% of all L1-dcache hits [50.28%]
0 LLC-loads # 0.000 K/sec [40.05%]
0 LLC-load-misses # 0.00% of all LL-cache hits [39.98%]
4.232477683 seconds time elapsed
$ g++ -O2 -o vdt vdt.cpp && perf stat -d ./vdt
virtual dispatch: 150000000 1.11517
switched: 150000000 0.14231
overheads: 150000000 0.0840234
virtual dispatch: 150000000 1.11355
switched: 150000000 0.130082
overheads: 150000000 0.116934
virtual dispatch: 150000000 1.16225
switched: 150000000 0.13281
overheads: 150000000 0.0798615
Performance counter stats for './vdt':
4050.314222 task-clock (msec) # 0.993 CPUs utilized
707 context-switches # 0.175 K/sec
0 cpu-migrations # 0.000 K/sec
402 page-faults # 0.099 K/sec
12,213,599,260 cycles # 3.015 GHz [39.72%]
6,987,416,990 stalled-cycles-frontend # 57.21% frontend cycles idle [40.25%]
4,675,829,189 stalled-cycles-backend # 38.28% backend cycles idle [40.17%]
6,611,623,206 instructions # 0.54 insns per cycle
# 1.06 stalled cycles per insn [50.54%]
1,505,162,879 branches # 371.616 M/sec [50.48%]
298,748,152 branch-misses # 19.85% of all branches [50.30%]
2,710,580,651 L1-dcache-loads # 669.227 M/sec [50.04%]
551,212,908 L1-dcache-load-misses # 20.34% of all L1-dcache hits [49.86%]
3 LLC-loads # 0.001 K/sec [39.62%]
0 LLC-load-misses # 0.00% of all LL-cache hits [40.01%]
4.080288324 seconds time elapsed
$ clang++ -O2 -o vdt vdt.cpp -lrt && perf stat -d ./vdt
virtual dispatch: 150000000 0.276252
switched: 150000000 0.11926
overheads: 150000000 0.0733678
virtual dispatch: 150000000 0.249832
switched: 150000000 0.0892711
overheads: 150000000 0.117108
virtual dispatch: 150000000 0.247705
switched: 150000000 0.109486
overheads: 150000000 0.0762541
Performance counter stats for './vdt':
1347.887606 task-clock (msec) # 0.989 CPUs utilized
222 context-switches # 0.165 K/sec
0 cpu-migrations # 0.000 K/sec
430 page-faults # 0.319 K/sec
3,558,892,668 cycles # 2.640 GHz [42.47%]
1,316,787,839 stalled-cycles-frontend # 37.00% frontend cycles idle [42.61%]
438,592,926 stalled-cycles-backend # 12.32% backend cycles idle [40.57%]
6,388,507,180 instructions # 1.80 insns per cycle
# 0.21 stalled cycles per insn [50.49%]
1,514,291,853 branches # 1123.456 M/sec [50.19%]
1,095,265 branch-misses # 0.07% of all branches [48.66%]
2,485,922,557 L1-dcache-loads # 1844.310 M/sec [47.99%]
577,213,257 L1-dcache-load-misses # 23.22% of all L1-dcache hits [48.20%]
2 LLC-loads # 0.001 K/sec [40.51%]
0 LLC-load-misses # 0.00% of all LL-cache hits [40.17%]
1.362403811 seconds time elapsed
$ clang++ -O2 -o vdt vdt.cpp && perf stat -d ./vdt
virtual dispatch: 150000000 1.0894
switched: 150000000 0.0849747
overheads: 150000000 0.0726611
virtual dispatch: 150000000 1.03949
switched: 150000000 0.0849843
overheads: 150000000 0.0768674
virtual dispatch: 150000000 1.07786
switched: 150000000 0.0893431
overheads: 150000000 0.0725624
Performance counter stats for './vdt':
3667.235804 task-clock (msec) # 0.993 CPUs utilized
356 context-switches # 0.097 K/sec
0 cpu-migrations # 0.000 K/sec
402 page-faults # 0.110 K/sec
11,052,067,182 cycles # 3.014 GHz [39.98%]
5,346,555,173 stalled-cycles-frontend # 48.38% frontend cycles idle [40.10%]
3,480,506,097 stalled-cycles-backend # 31.49% backend cycles idle [40.09%]
6,351,819,740 instructions # 0.57 insns per cycle
# 0.84 stalled cycles per insn [50.07%]
1,524,106,229 branches # 415.601 M/sec [50.17%]
299,296,742 branch-misses # 19.64% of all branches [50.05%]
2,393,484,447 L1-dcache-loads # 652.667 M/sec [49.93%]
554,010,640 L1-dcache-load-misses # 23.15% of all L1-dcache hits [49.88%]
0 LLC-loads # 0.000 K/sec [40.33%]
0 LLC-load-misses # 0.00% of all LL-cache hits [39.83%]
3.692786417 seconds time elapsed
What I can see is that there is some difference in branch prediction (branch-misses) with clang (again, presumably down to the optimizer).