Q1: Which uses less CPU: waiting on a std::future with wait(), or checking a flag in a while loop?
std::atomic_bool isRunning{false};

void foo() {
    isRunning.store(true);
    doSomethingTimeConsuming();
    isRunning.store(false);
}

std::future<void> f = std::async(std::launch::async, foo);
Using std::future::wait():

if (f.valid())
    f.wait();

Checking the flag in a while loop:

if (f.valid()) {
    while (isRunning.load())
        std::this_thread::sleep_for(1ms);
}
Q2: Does the conclusion also apply to std::thread::join() or std::condition_variable::wait()?
Thanks in advance.
std::this_thread::sleep_for keeps waking the waiting thread, almost always before the result is ready. On average, the latency between the result becoming ready and the waiter noticing it is half the sleep_for period.
std::future::wait is more efficient because it blocks in the kernel until the result is ready, instead of issuing a syscall every millisecond the way the sleep_for loop does.
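Regarding Q2: on typical implementations (e.g. glibc on Linux) std::thread::join and std::condition_variable::wait also block in the kernel rather than poll, so the same reasoning applies. A minimal sketch of replacing the polling loop with a condition variable (the names done and worker are illustrative, not from the question):

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool done = false;  // protected by m

void worker() {
    // doSomethingTimeConsuming();
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    cv.notify_one();  // wake the waiter once, when the work is done
}

int main() {
    std::thread t(worker);
    {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return done; });  // sleeps until notified, no periodic wake-ups
    }
    t.join();
}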
If you run the two versions with
void doSomethingTimeConsuming() {
    std::this_thread::sleep_for(1s);
}
under perf stat, the results for std::future::wait are:
1.803578 task-clock (msec) # 0.002 CPUs utilized
2 context-switches # 0.001 M/sec
0 cpu-migrations # 0.000 K/sec
116 page-faults # 0.064 M/sec
6,356,215 cycles # 3.524 GHz
4,511,076 instructions # 0.71 insn per cycle
835,604 branches # 463.304 M/sec
22,313 branch-misses # 2.67% of all branches
Whereas for std::this_thread::sleep_for(1ms):
11.715249 task-clock (msec) # 0.012 CPUs utilized
901 context-switches # 0.077 M/sec
6 cpu-migrations # 0.512 K/sec
118 page-faults # 0.010 M/sec
40,177,222 cycles # 3.429 GHz
25,401,055 instructions # 0.63 insn per cycle
2,286,806 branches # 195.199 M/sec
156,400 branch-misses # 6.84% of all branches
That is, in this particular test the sleep_for version burns roughly six times as many CPU cycles.
Note that there is a race condition between isRunning.load() and isRunning.store(true): the waiting thread may read the flag before the async task has set it to true and exit the loop immediately. One fix is to initialize it as isRunning{true};.
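A minimal sketch of the flag-polling variant with that race removed, assuming the flag is simply initialized to true (still less efficient than std::future::wait, as measured above):

#include <atomic>
#include <chrono>
#include <future>
#include <thread>

std::atomic_bool isRunning{true};   // start true so the waiter cannot observe
                                    // "finished" before foo() has even started

void doSomethingTimeConsuming() {
    std::this_thread::sleep_for(std::chrono::seconds(1));
}

void foo() {
    doSomethingTimeConsuming();
    isRunning.store(false);         // signal completion
}

int main() {
    using namespace std::chrono_literals;
    auto f = std::async(std::launch::async, foo);
    while (isRunning.load())        // still wakes up every millisecond
        std::this_thread::sleep_for(1ms);
    // the future's destructor blocks until foo() has returned
}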
Code
I ran my program 30 times, with the n passed to run_and_join_threads() going from 1 to 30 accordingly.
Note that the jobs passed to run_and_join_threads() were populated in exactly the same way in every execution.
void do_job(JobQueue *jobs) {
    Job job;
    while ((job = jobs->pop()))
        job();
    // control flow reaches here when jobs->pop() returns nullptr,
    // which means all the jobs have been done
}

void run_and_join_threads(int n, JobQueue &jobs) {
    vector<thread> threads;
    threads.reserve(n);
    for (int i = 0; i < n; ++i)
        threads.push_back(thread(do_job, &jobs));
    // synchronization
    for (int i = 0; i < n; ++i)
        threads[i].join();
}
JobQueue.h
#ifndef JOB_QUEUE_H
#define JOB_QUEUE_H

#include <functional>
#include <queue>
#include <mutex>

typedef std::function<void (void)> Job;

// All of its methods are thread-safe.
class JobQueue {
    std::queue<Job> jobs;
    std::mutex mtx;
public:
    void push(Job job);

    // pop removes the "oldest" job in the queue and returns it.
    // pop returns nullptr if there are no more jobs left in the queue.
    Job pop();
};

#endif
JobQueue.cc
#include "JobQueue.h"
using namespace std;
void JobQueue::push(Job job) {
mtx.lock();
jobs.push(job);
mtx.unlock();
}
Job JobQueue::pop() {
Job job = nullptr;
mtx.lock();
if (!jobs.empty()) {
job = jobs.front();
jobs.pop();
}
mtx.unlock();
return job;
}
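For context, this is roughly how the pieces fit together; the actual jobs are not shown in this post, so the lambda and the job count in this driver are just placeholders:

// main.cc (hypothetical driver, assumes JobQueue.h and the functions above)
#include <atomic>
#include <cstdlib>
#include "JobQueue.h"

void run_and_join_threads(int n, JobQueue &jobs);

std::atomic<long> sink{0};

int main(int argc, char **argv) {
    int n = argc > 1 ? std::atoi(argv[1]) : 1;   // number of worker threads
    JobQueue jobs;
    for (int i = 0; i < 1000000; ++i)            // placeholder jobs
        jobs.push([] { sink.fetch_add(1, std::memory_order_relaxed); });
    run_and_join_threads(n, jobs);
}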
Chart
I use perf stat -e instructions:u ./my_program to record the number of user-space instructions executed during each run.
I then found that there is a negative correlation between the number of threads and the number of user instructions.
My Thoughts
Since the "real task" remains the same, more threads should lead to more thread construction and destruction, which results in more instructions, but that's not the case from the chart. I tried to google with the keywords in the title, but no luck.
compilation options: -std=c++14 -pthread -Wextra -Werror -MMD
gcc version: 8.2.1 20180831
Output of perf stat --per-thread when n = 10
hw4-9525 8,524.37 msec task-clock:u # 0.153 CPUs utilized
hw4-9524 8,082.77 msec task-clock:u # 0.145 CPUs utilized
hw4-9522 7,824.93 msec task-clock:u # 0.140 CPUs utilized
hw4-9519 7,782.65 msec task-clock:u # 0.139 CPUs utilized
hw4-9518 7,734.42 msec task-clock:u # 0.138 CPUs utilized
hw4-9517 7,722.12 msec task-clock:u # 0.138 CPUs utilized
hw4-9520 7,636.99 msec task-clock:u # 0.137 CPUs utilized
hw4-9425 11,899.78 msec task-clock:u # 0.213 CPUs utilized
hw4-9521 7,585.14 msec task-clock:u # 0.136 CPUs utilized
hw4-9526 7,580.60 msec task-clock:u # 0.136 CPUs utilized
hw4-9523 7,306.57 msec task-clock:u # 0.131 CPUs utilized
hw4-9425 0 context-switches:u # 0.000 K/sec
hw4-9517 0 context-switches:u # 0.000 K/sec
hw4-9518 0 context-switches:u # 0.000 K/sec
hw4-9519 0 context-switches:u # 0.000 K/sec
hw4-9520 0 context-switches:u # 0.000 K/sec
hw4-9521 0 context-switches:u # 0.000 K/sec
hw4-9522 0 context-switches:u # 0.000 K/sec
hw4-9523 0 context-switches:u # 0.000 K/sec
hw4-9524 0 context-switches:u # 0.000 K/sec
hw4-9525 0 context-switches:u # 0.000 K/sec
hw4-9526 0 context-switches:u # 0.000 K/sec
hw4-9425 0 cpu-migrations:u # 0.000 K/sec
hw4-9517 0 cpu-migrations:u # 0.000 K/sec
hw4-9518 0 cpu-migrations:u # 0.000 K/sec
hw4-9519 0 cpu-migrations:u # 0.000 K/sec
hw4-9520 0 cpu-migrations:u # 0.000 K/sec
hw4-9521 0 cpu-migrations:u # 0.000 K/sec
hw4-9522 0 cpu-migrations:u # 0.000 K/sec
hw4-9523 0 cpu-migrations:u # 0.000 K/sec
hw4-9524 0 cpu-migrations:u # 0.000 K/sec
hw4-9525 0 cpu-migrations:u # 0.000 K/sec
hw4-9526 0 cpu-migrations:u # 0.000 K/sec
hw4-9425 9,332 page-faults:u # 1144.724 M/sec
hw4-9520 7,487 page-faults:u # 918.404 M/sec
hw4-9526 7,408 page-faults:u # 908.714 M/sec
hw4-9522 7,401 page-faults:u # 907.855 M/sec
hw4-9518 7,386 page-faults:u # 906.015 M/sec
hw4-9524 7,362 page-faults:u # 903.071 M/sec
hw4-9521 7,348 page-faults:u # 901.354 M/sec
hw4-9525 7,258 page-faults:u # 890.314 M/sec
hw4-9517 7,253 page-faults:u # 889.700 M/sec
hw4-9519 7,153 page-faults:u # 877.434 M/sec
hw4-9523 6,194 page-faults:u # 759.797 M/sec
hw4-9425 24,365,706,871 cycles:u # 2988857.145 GHz
hw4-9524 19,199,338,912 cycles:u # 2355116.623 GHz
hw4-9518 18,658,195,691 cycles:u # 2288736.452 GHz
hw4-9522 18,565,304,421 cycles:u # 2277341.801 GHz
hw4-9520 18,524,344,417 cycles:u # 2272317.378 GHz
hw4-9519 18,452,590,959 cycles:u # 2263515.629 GHz
hw4-9521 18,384,181,678 cycles:u # 2255124.099 GHz
hw4-9517 18,169,025,051 cycles:u # 2228731.578 GHz
hw4-9526 17,957,925,085 cycles:u # 2202836.674 GHz
hw4-9523 17,689,877,988 cycles:u # 2169956.262 GHz
hw4-9525 20,380,269,586 cycles:u # 2499977.312 GHz
hw4-9524 35,930,781,858 instructions:u # 1.88 insn per cycle
hw4-9425 31,238,610,254 instructions:u # 1.63 insn per cycle
hw4-9522 34,856,962,399 instructions:u # 1.82 insn per cycle
hw4-9518 34,794,129,974 instructions:u # 1.82 insn per cycle
hw4-9520 34,565,759,122 instructions:u # 1.81 insn per cycle
hw4-9519 34,521,122,564 instructions:u # 1.81 insn per cycle
hw4-9521 34,389,796,009 instructions:u # 1.80 insn per cycle
hw4-9517 33,823,905,990 instructions:u # 1.77 insn per cycle
hw4-9525 38,084,271,354 instructions:u # 1.99 insn per cycle
hw4-9526 33,682,632,175 instructions:u # 1.76 insn per cycle
hw4-9523 33,147,549,812 instructions:u # 1.73 insn per cycle
hw4-9525 6,113,561,884 branches:u # 749929530.566 M/sec
hw4-9425 5,978,592,665 branches:u # 733373322.423 M/sec
hw4-9524 5,765,141,950 branches:u # 707190060.107 M/sec
hw4-9522 5,593,987,998 branches:u # 686195195.687 M/sec
hw4-9518 5,583,032,551 branches:u # 684851328.824 M/sec
hw4-9520 5,546,955,396 branches:u # 680425868.769 M/sec
hw4-9519 5,541,456,246 branches:u # 679751307.023 M/sec
hw4-9521 5,518,407,713 branches:u # 676924023.050 M/sec
hw4-9517 5,427,113,316 branches:u # 665725254.544 M/sec
hw4-9526 5,407,241,325 branches:u # 663287626.012 M/sec
hw4-9523 5,318,730,317 branches:u # 652430286.226 M/sec
hw4-9525 66,142,537 branch-misses:u # 1.18% of all branches
hw4-9524 61,835,669 branch-misses:u # 1.10% of all branches
hw4-9518 61,243,167 branch-misses:u # 1.09% of all branches
hw4-9520 60,266,206 branch-misses:u # 1.07% of all branches
hw4-9521 59,396,966 branch-misses:u # 1.06% of all branches
hw4-9522 59,227,658 branch-misses:u # 1.05% of all branches
hw4-9519 59,210,503 branch-misses:u # 1.05% of all branches
hw4-9526 57,983,090 branch-misses:u # 1.03% of all branches
hw4-9517 57,910,215 branch-misses:u # 1.03% of all branches
hw4-9523 56,251,632 branch-misses:u # 1.00% of all branches
hw4-9425 32,626,137 branch-misses:u # 0.58% of all branches
While testing something else, I stumbled across something that I haven't managed to figure out yet.
Let's look at this snippet:
#include <iostream>
#include <chrono>

int main() {
    int i = 0;
    using namespace std::chrono_literals;
    auto const end = std::chrono::system_clock::now() + 5s;
    while (std::chrono::system_clock::now() < end) {
        ++i;
    }
    std::cout << i;
}
I've noticed that the counts heavily depend on the machine I execute it on.
I've compiled with gcc 7.3, 8.2, and clang 6.0 with -std=c++17 -O3.
On i7-4790 (4.17.14-arch1-1-ARCH kernel): ~3e8
but on a Xeon E5-2630 v4 (3.10.0-514.el7.x86_64): ~8e6
Now this is a difference that I would like to understand, so I've checked with perf stat -d:
on the i7:
4999.419546 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
120 page-faults:u # 0.024 K/sec
19,605,598,394 cycles:u # 3.922 GHz (49.94%)
33,601,884,120 instructions:u # 1.71 insn per cycle (62.48%)
7,397,994,820 branches:u # 1479.771 M/sec (62.53%)
34,788 branch-misses:u # 0.00% of all branches (62.58%)
10,809,601,166 L1-dcache-loads:u # 2162.171 M/sec (62.41%)
13,632 L1-dcache-load-misses:u # 0.00% of all L1-dcache hits (24.95%)
3,944 LLC-loads:u # 0.789 K/sec (24.95%)
1,034 LLC-load-misses:u # 26.22% of all LL-cache hits (37.42%)
5.003180401 seconds time elapsed
4.969048000 seconds user
0.016557000 seconds sys
Xeon:
5001.000000 task-clock (msec) # 0.999 CPUs utilized
42 context-switches # 0.008 K/sec
2 cpu-migrations # 0.000 K/sec
412 page-faults # 0.082 K/sec
15,100,238,798 cycles # 3.019 GHz (50.01%)
794,184,899 instructions # 0.05 insn per cycle (62.51%)
188,083,219 branches # 37.609 M/sec (62.49%)
85,924 branch-misses # 0.05% of all branches (62.51%)
269,848,346 L1-dcache-loads # 53.959 M/sec (62.49%)
246,532 L1-dcache-load-misses # 0.09% of all L1-dcache hits (62.51%)
13,327 LLC-loads # 0.003 M/sec (49.99%)
7,417 LLC-load-misses # 55.65% of all LL-cache hits (50.02%)
5.006139971 seconds time elapsed
What stands out is the low number of instructions per cycle on the Xeon, as well as the nonzero context switches, which I don't understand. However, I wasn't able to turn these diagnostics into an explanation.
And to add a bit more weirdness to the problem, while trying to debug this I also compiled statically on one machine and executed on the other.
On the Xeon, the statically compiled executable gives a ~10% lower count, with no difference between compiling on the Xeon or the i7.
Doing the same thing on the i7, the counter actually drops from ~3e8 to ~2e7.
So in the end I'm now left with two questions:
Why do I see such a significant difference between the two machines?
Why does a statically linked executable perform worse, when I would expect the opposite?
Edit: after updating the kernel on the CentOS 7 machine to 4.18, we actually see an additional drop from ~8e6 to ~5e6.
Interestingly, perf shows different numbers though:
5002.000000 task-clock:u (msec) # 0.999 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
119 page-faults:u # 0.024 K/sec
409,723,790 cycles:u # 0.082 GHz (50.00%)
392,228,592 instructions:u # 0.96 insn per cycle (62.51%)
115,475,503 branches:u # 23.086 M/sec (62.51%)
26,355 branch-misses:u # 0.02% of all branches (62.53%)
115,799,571 L1-dcache-loads:u # 23.151 M/sec (62.51%)
42,327 L1-dcache-load-misses:u # 0.04% of all L1-dcache hits (62.50%)
88 LLC-loads:u # 0.018 K/sec (49.96%)
2 LLC-load-misses:u # 2.27% of all LL-cache hits (49.98%)
5.005940327 seconds time elapsed
0.533000000 seconds user
4.469000000 seconds sys
It's interesting that there are no context switches any more and instructions per cycle went up significantly, but the cycle count, and therefore the clock rate, is very low!
I've been able to reproduce the respective measurements on the two machines, thanks to @Imran's comment above. (I'm posting this answer to close the question; if Imran posts one, I'm happy to accept his instead.)
It is indeed related to the available clocksource. The Xeon, unfortunately, had the notsc flag in its kernel parameters, which is why the tsc clocksource wasn't available and a slower one was selected.
Thus for anyone running into this problem:
1. check your clocksource in /sys/devices/system/clocksource/clocksource0/current_clocksource
2. check available clocksources in /sys/devices/system/clocksource/clocksource0/available_clocksource
3. If you can't find tsc there, check dmesg | grep tsc and your kernel parameters for notsc
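To see the effect directly, a quick (hand-rolled, not from the original post) check is to time a single std::chrono::system_clock::now() call: with the tsc clocksource the read typically stays in user space via the vDSO and costs a few nanoseconds, while with hpet or acpi_pm every read becomes a slow system call, which is what makes the counting loop collapse:

#include <chrono>
#include <iostream>

int main() {
    using namespace std::chrono;
    constexpr int N = 1000000;
    long long sink = 0;                       // keeps the result observable
    auto t0 = steady_clock::now();
    for (int i = 0; i < N; ++i)
        sink += system_clock::now().time_since_epoch().count();
    auto t1 = steady_clock::now();
    std::cout << duration_cast<nanoseconds>(t1 - t0).count() / N
              << " ns per system_clock::now() call (ignore: " << sink % 10 << ")\n";
}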
I am trying to establish the bottleneck in my code using perf and ocperf.
If I do a 'detailed stat' run on my binary, two statistics are reported in red text, which I suppose means they are too high.
L1-dcache-load-misses is in red at 28.60%
iTLB-load-misses is in red at 425.89%
# ~bram/src/pmu-tools/ocperf.py stat -d -d -d -d -d ./bench ray
perf stat -d -d -d -d -d ./bench ray
Loaded 455 primitives.
Testing ray against 455 primitives.
Performance counter stats for './bench ray':
9031.444612 task-clock (msec) # 1.000 CPUs utilized
15 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
292 page-faults # 0.032 K/sec
28,786,063,163 cycles # 3.187 GHz (61.47%)
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
55,742,952,563 instructions # 1.94 insns per cycle (69.18%)
3,717,242,560 branches # 411.589 M/sec (69.18%)
18,097,580 branch-misses # 0.49% of all branches (69.18%)
10,230,376,136 L1-dcache-loads # 1132.751 M/sec (69.17%)
2,926,349,754 L1-dcache-load-misses # 28.60% of all L1-dcache hits (69.21%)
145,843,523 LLC-loads # 16.148 M/sec (69.32%)
49,512 LLC-load-misses # 0.07% of all LL-cache hits (69.33%)
<not supported> L1-icache-loads
260,144 L1-icache-load-misses # 0.029 M/sec (69.34%)
10,230,376,830 dTLB-loads # 1132.751 M/sec (69.34%)
1,197 dTLB-load-misses # 0.00% of all dTLB cache hits (61.59%)
2,294 iTLB-loads # 0.254 K/sec (61.55%)
9,770 iTLB-load-misses # 425.89% of all iTLB cache hits (61.51%)
<not supported> L1-dcache-prefetches
<not supported> L1-dcache-prefetch-misses
9.032234014 seconds time elapsed
My questions:
What would be a reasonable figure for L1 data cache misses?
What would be a reasonable figure for iTLB-load-misses?
Why can iTLB-load-misses exceed 100%? In other words: why is iTLB-load-misses exceeding iTLB-loads? I've even seen it spike as high as 568%
Also, my machine has a Haswell CPU; I would have expected the stalled-cycles stats to be supported.
I would like to extract the line containing 'seconds time elapsed' from the perf stat output, for a logging script that I am working on.
I do not want to write the output to a file and then search the file; I would like to do it using grep or something similar.
Here is what I have tried:
perf stat -r 10 echo "Sample_String" | grep -eE "seconds time elapsed"
For which I get
grep: seconds time elapsed: No such file or directory
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
echo: Broken pipe
Performance counter stats for 'echo Sample_String' (10 runs):
0.254533 task-clock (msec) # 0.556 CPUs utilized ( +- 0.98% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.220 M/sec ( +- 0.53% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000457686 seconds time elapsed ( +- 1.08% )
And I tried this
perf stat -r 10 echo "Sample_String" > grep -eE "seconds time elapsed"
For which I got
Performance counter stats for 'echo Sample_String -eE seconds time elapsed' (10 runs):
0.262585 task-clock (msec) # 0.576 CPUs utilized ( +- 1.11% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.214 M/sec ( +- 0.36% )
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.000456035 seconds time elapsed ( +- 1.05% )
I am new to tools like grep, awk and sed. I hope someone can help me out.
The problem here is that the output you want is sent to stderr instead of standard output.
You can see this by redirecting stderr to /dev/null: the only output left is from the echo command itself.
~/ perf stat -r 10 echo "Sample_String" 2>/dev/null
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
Sample_String
In order to do what you want, you have to redirect perf's stderr to standard output and discard the original standard output. That way, only perf's output is sent to the grep command.
~/ perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null | grep 'seconds time elapsed'
0,013137361 seconds time elapsed ( +- 96,08% )
Looks like your desired output is going to stderr. Try:
perf stat -r 10 echo "Sample_String" 2>&1 | grep "seconds time elapsed"
This could work the way you intend:
grep -e "seconds time elapsed" <<< "$(perf stat -r 10 echo "Sample_String" 2>&1 >/dev/null)"
Output:
0.000544399 seconds time elapsed ( +- 2.05% )