The Windows function QueryThreadCycleTime() gives the number of "CPU clock cycles" used by a given thread. The Windows manual boldly states
Do not attempt to convert the CPU clock cycles returned by QueryThreadCycleTime to elapsed time.
I would like to do exactly this, for most Intel and AMD x86_64 CPUs.
It doesn't need to be very accurate, because you can't expect perfection from cycle counters like RDTSC anyway.
I just need some kludgey way to get the time factor seconds / QueryThreadCycleTime for the CPUs.
First, I imagine that QueryThreadCycleTime uses RDTSC internally.
I imagine that on some CPUs, constant rate TSC is used, so changing the actual clock rate (e.g. with variable-frequency CPU power management) doesn't affect the time/TSC factor.
On other CPUs, that rate might change, so I'd have to query this factor periodically.
Why do I need this?
Before anyone cites the XY Problem, I should note that I'm not really interested in alternative solutions.
This is because I have two hard requirements for profiling that no other method meets.
It should only measure thread time, so sleep(1) should not return 1 second, but a busy loop lasting 1 second should. In other words, the profiler should not say that a task ran for 10ms when its thread was only active for 1ms. This is the reason I cannot use QueryPerformanceCounter().
It needs a precision better than 1/64 seconds, which is the precision given by GetThreadTimes(). The tasks I'm profiling might run for only a few microseconds.
Minimal reproducible example
As requested by @Ted Lyngmo, the goal is to implement computeFactor().
#include <stdio.h>
#include <windows.h>
double computeFactor();
int main() {
ULONG64 start, end; // QueryThreadCycleTime expects a PULONG64
QueryThreadCycleTime(GetCurrentThread(), &start);
// insert task here, such as an actual workload or sleep(1)
QueryThreadCycleTime(GetCurrentThread(), &end);
printf("%lf\n", (end - start) * computeFactor());
return 0;
}
Do not attempt to convert the CPU clock cycles returned by QueryThreadCycleTime to elapsed time.
I would like to do exactly this.
Your wish is obviously Denied!
A workaround, that will do something close to what you want, could be to create one thread with a steady_clock that samples QueryThreadCycleTime and/or GetThreadTimes at some specified frequency. Here's an example of how it could be done with a sampling thread taking a sample of both once every second.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <thread>
#include <vector>
#include <Windows.h>
using namespace std::literals::chrono_literals;
struct FTs_t {
FILETIME CreationTime, ExitTime, KernelTime, UserTime;
ULONG64 CycleTime;
};
using Sample = std::vector<FTs_t>;
std::ostream& operator<<(std::ostream& os, const FILETIME& ft) {
std::uint64_t bft = (std::uint64_t(ft.dwHighDateTime) << 32) + ft.dwLowDateTime; // high half is the upper 32 bits
return os << bft;
}
std::ostream& operator<<(std::ostream& os, const Sample& smp) {
size_t tno = 0;
for (const auto& fts : smp) {
os << " tno:" << std::setw(3) << tno << std::setw(10) << fts.KernelTime
<< std::setw(10) << fts.UserTime << std::setw(16) << fts.CycleTime << "\n";
++tno;
}
return os;
}
// the sampling thread
void ft_sampler(std::atomic<bool>& quit, std::vector<std::thread>& threads, std::vector<Sample>& samples) {
auto tp = std::chrono::steady_clock::now(); // for steady sampling
FTs_t fts;
while (quit == false) {
Sample s;
s.reserve(threads.size());
for (auto& th : threads) {
if (QueryThreadCycleTime(th.native_handle(), &fts.CycleTime) &&
GetThreadTimes(th.native_handle(), &fts.CreationTime,
&fts.ExitTime, &fts.KernelTime, &fts.UserTime)) {
s.push_back(fts);
}
}
samples.emplace_back(std::move(s));
tp += 1s; // add a second since we last sampled and sleep until that time_point
std::this_thread::sleep_until(tp);
}
}
// a worker thread
void worker(std::atomic <bool>& quit, size_t payload) {
volatile std::uintmax_t x = 0;
while (quit == false) {
for (size_t i = 0; i < payload; ++i) ++x;
std::this_thread::sleep_for(1us);
}
}
int main() {
std::atomic<bool> quit_sampling = false, quit_working = false;
std::vector<std::thread> threads;
std::vector<Sample> samples;
size_t max_threads = std::thread::hardware_concurrency() > 1 ? std::thread::hardware_concurrency() - 1 : 1;
// start some worker threads
for (size_t tno = 0; tno < max_threads; ++tno) {
threads.emplace_back(std::thread(&worker, std::ref(quit_working), (tno + 100) * 100000));
}
// start the sampling thread
auto smplr = std::thread(&ft_sampler, std::ref(quit_sampling), std::ref(threads), std::ref(samples));
// let the threads work for some time
std::this_thread::sleep_for(10s);
quit_sampling = true;
smplr.join();
quit_working = true;
for (auto& th : threads) th.join();
std::cout << "Took " << samples.size() << " samples\n";
size_t s = 0;
for (const auto& smp : samples) {
std::cout << "Sample " << s << ":\n" << smp << "\n";
++s;
}
}
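Once samples like these are available, a rough seconds-per-cycle factor could be estimated by dividing the CPU time consumed between two samples by the cycles consumed over the same interval. The following is only a sketch under assumptions: the function names filetime_to_100ns and seconds_per_cycle are made up for this illustration, and the FILETIME halves are passed as plain integers so that the snippet compiles without <Windows.h>. It relies on FILETIME values counting 100-nanosecond units. Because GetThreadTimes advances in coarse steps, the two samples should be seconds apart rather than milliseconds, and the result is only as stable as the cycle counter itself.

```cpp
#include <cassert>
#include <cstdint>

// Combine the two 32-bit halves of a FILETIME (dwHighDateTime, dwLowDateTime)
// into one 64-bit count of 100 ns units.
inline std::uint64_t filetime_to_100ns(std::uint32_t high, std::uint32_t low) {
    return (std::uint64_t(high) << 32) | low;
}

// Estimate seconds per cycle from two samples of the same thread.
// cycles0/cycles1: values reported by QueryThreadCycleTime
// cpu0/cpu1: kernel + user time in 100 ns units, taken at the same moments
inline double seconds_per_cycle(std::uint64_t cycles0, std::uint64_t cycles1,
                                std::uint64_t cpu0, std::uint64_t cpu1) {
    assert(cycles1 > cycles0 && cpu1 >= cpu0);
    const double seconds = double(cpu1 - cpu0) * 100e-9;  // 100 ns -> s
    return seconds / double(cycles1 - cycles0);
}
```

Multiplying a later cycle delta by this factor gives an approximate thread time in seconds; on CPUs without a constant-rate counter the factor would have to be re-estimated periodically.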
Related
I have the code sample below to measure the execution time of some piece of code:
int main()
{
auto before = chrono::steady_clock::now();
Sleep(30000);
auto after = chrono::steady_clock::now();
int duration = (std::chrono::duration_cast<std::chrono::seconds> ((after - before)).count());
cout << duration << endl;
return 0;
}
Normally it works fine and prints out 30 in the cout statement.
However, during testing I observed that if the computer were to go to sleep in between the auto before = ... statement and the auto after = ... statement (due to inactivity or whatever other reason), then the printed out time also counts the entire time the machine was asleep. This makes perfect sense since we are comparing a timepoint from before the machine going to sleep and one with after.
So my question is how can I make it so that the duration the machine was asleep is not counted in my final duration? Probably will need a ticker that doesn't increment while machine is asleep rather than timepoint measurements but I'm not aware of such a ticker.
This is a Windows specific question. As I understand, MacOS has mach_absolute_time which is exactly what I'm looking for in windows. I'm using MSVC 19.29.30147.0 as my compiler.
After looking around and testing it out, the solution is to use QueryUnbiasedInterruptTime
Running the following code snippet, I manually put my machine to sleep while the program was stuck on the sleep statement and I observed that the second print out consistently outputs 15 seconds regardless of how long I leave my machine in a sleeping state. However, the first print-out that uses GetTickCount64 will include the amount of time the machine was asleep.
int main()
{
ULONGLONG before_query, after_query= 0;
QueryUnbiasedInterruptTime(&before_query);
auto beforeticks = GetTickCount64();
Sleep(15000);
QueryUnbiasedInterruptTime(&after_query);
auto afterticks = GetTickCount64();
cout << "Ticks from gettickcount64 is " << (double (afterticks-beforeticks))/1000 << endl;
cout << "Unbiased time measure is " << double(after_query - before_query) / 10000000 << endl; // 100 ns units -> seconds, without integer truncation
return 0;
}
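QueryUnbiasedInterruptTime reports its value in 100-nanosecond units, which is where the division by 10000000 comes from. As a small sketch (the names interrupt_ticks and unbiased_seconds are invented for this illustration), the same conversion can be expressed with std::chrono so that the unit is carried by the type rather than by a magic number:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ratio>

// One tick of the interrupt-time counter is 100 ns, i.e. 1/10,000,000 s.
using interrupt_ticks =
    std::chrono::duration<std::int64_t, std::ratio<1, 10000000>>;

// Convert two raw counter readings into fractional seconds.
inline double unbiased_seconds(std::uint64_t before, std::uint64_t after) {
    assert(after >= before);
    const interrupt_ticks delta(static_cast<std::int64_t>(after - before));
    return std::chrono::duration<double>(delta).count();
}
```

With the readings from the snippet above, unbiased_seconds(before_query, after_query) would yield the elapsed awake time including its fractional part.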
You are correct that the easiest way is to use a counter that is incremented each second. This is easily implemented with threads:
#include <thread>
#include <atomic>
#include <chrono>
using namespace std::literals::chrono_literals;
class ellapsed_counter {
std::atomic<bool> finished = false;
std::atomic<unsigned int> value = 0;
std::thread worker { [this] {
while(!finished) {
value++;
std::this_thread::sleep_for(1s);
}
} };
public:
void finish() noexcept {
finished = true;
if(worker.joinable()) worker.join();
}
unsigned int ellapsed() const noexcept { return value; }
};
This will keep incrementing at 1 s intervals (probably with some error) as long as the process is running, and should stop doing so while the machine is asleep.
You can use it like this:
#include <iostream>
int main(int argc, const char *argv[]) {
ellapsed_counter counter;
unsigned int last = 0, count = 0;
while(count < 10) {
count = counter.ellapsed();
if(count != last) {
last = count;
std::cout << count << std::endl;
}
}
counter.finish();
return 0;
}
This will count from 1 to 10 seconds and exit.
There was once a question, since deleted, to which I had written a long answer; the author declined to undelete it.
So I am posting a short summary of that question here and answering it myself right away, just to share my results.
The question was: given a std::bitset<65536> that is processed bit by bit (by some formula) inside an inner loop, how can we speed up this computation?
The outer loop just calls the inner loop many times (say 50,000 times), and the outer loop can't be processed in parallel, because each iteration depends on the results of the previous one.
Example code of this process:
std::bitset<65536> bits{};
uint64_t hash = 0;
for (size_t i = 0; i < 50000; ++i) {
// Process Bits
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
hash = Hash(bits, hash);
}
Code above is just one sample way of processing, it is not a real case. The real case is such that many times we process std::bitset<65536> somehow in such a way that all bits can be processed independently.
The question is how we can process bits in parallel as fast as possible inside inner loop.
One important Note that formula that modifies bits is generic, meaning that we don't know it in advance and can't make SIMD instructions out of it.
But what we know is that all bits can be processed independently. And that we need to parallelize this processing. Also we can't parallelize outer loop as each its iteration depends on results of previous iteration.
Another note: std::bitset<65536> is quite small, just 1K 64-bit words. This means that directly using a pool of std::thread or std::async threads will not work, because each thread's work would take only around 50-200 nanoseconds, far too little compared with the time needed to start and stop threads and send work to them. Even std::mutex takes 75 nanoseconds on my Windows machine (although 20 nanoseconds on Linux), so using std::mutex is also a big overhead.
One may assume that ModifyBit() function above takes around same time for each bit, otherwise there is no understanding on how to schedule balanced parallelization of a loop, only by slicing it into very many tiny tasks hoping that longer tasks will be balanced out by several shorter one.
I implemented a fairly large and complex solution for your task, but it runs very fast. On my 4-core (8 hardware threads) laptop I get a 6x multi-core speedup compared to the single-threaded version (your version of the code).
Main idea of solution below is to implement very fast multi core Thread-Pool for running arbitrary tasks that has small overhead. My implementation can handle up to 1-10 Million tasks per second (depending on CPU speed and cores count).
Regular way of asynchronously starting multiple tasks is through usage of std::async or just by creating std::thread. Both these ways are considerably slower than my own implementation. They can't give throughput of 5 Million tasks per second like my implementation gives. And your code needs millions of tasks per second to be run for good speed. That's why I implemented everything from scratch.
After fast thread pool is implemented now we can slice your 64K bitset into smaller sub-sets and process these sub-sets in parallel. I sliced 64K bitset into 16 equal parts (see BitSize / 16 in code), you can set other amount of parts equal to power of two, but not too many, otherwise thread pool overhead will be too large. Usually it is good to slice into amount of parts that is equal to twice the amount of hardware threads (or 4 times amount of cores).
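The slicing step on its own can be sketched as follows. This is a simplified illustration rather than the code used further down: split_ranges is a hypothetical helper that only computes the half-open index ranges, leaving out the pool and the per-bit work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into at most `parts` half-open ranges [begin, end) of
// near-equal size. Fewer ranges are returned when n is too small.
inline std::vector<std::pair<std::size_t, std::size_t>>
split_ranges(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    const std::size_t block = (n + parts - 1) / parts;  // ceiling division
    std::vector<std::pair<std::size_t, std::size_t>> out;
    for (std::size_t begin = 0; begin < n; begin += block)
        out.emplace_back(begin, std::min(begin + block, n));
    return out;
}
```

For a 65536-bit set and 16 parts, every range covers 4096 bits, so each task touches 64 whole 64-bit words.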
I implemented several classes in C++ code. AtomicMutex class uses std::atomic_flag in order to implement very fast replacement for mutex that is based on spin-locking approach. This AtomicMutex is used to protect queue of tasks submitted for running on thread pool.
RingBuffer class is based on std::vector and implements simple and fast queue to store any objects. It is implemented using two pointers (head and tail), pointing into vector. When new element is added to queue then tail pointer is advanced to the right, if this pointer reaches end of vector then it wraps around to 0-th position. Same way when element is taken out from queue then head pointer also advances to the right with wrap around. RingBuffer is used to store thread pool tasks.
Queue class is a wrapper around RingBuffer, but with AtomicMutex protection. This spin-lock mutex is used to protect simultaneous adding/taking elements to/from queue from multiple workers' threads.
Pool implements multi-core pool of tasks itself. It creates as many worker threads as there are CPU hardware threads (double amount of cores) minus one. Each worker thread just polls new tasks from queue and executes them immediately. Main thread adds new tasks to queue. Pool also has Wait() capability to wait till all current tasks are finished, this waiting is used as barrier to wait till whole 64K bitset is processed (all sub-parts are processed). Pool accepts any lambdas (function closures) to be run. You can see that 64K bitset sliced into smaller parts is processed by doing pool.Emplace(lambda) and later pool.Wait() is used to wait till all sub-parts are finished. Exceptions from pool workers are collected and reported to user if there is any error. When doing Wait() pool runs tasks also inside main thread not to waste one core for just waiting of tasks to finish.
Timings reported in console are done by std::chrono module.
There is an ability to run both versions - single-threaded (your original version) and multi-threaded using all cores. Switch between single/multi is done by passing MultiThreaded = true template parameter to function ProcessBitset().
#include <cstdint>
#include <atomic>
#include <vector>
#include <array>
#include <queue>
#include <functional>
#include <thread>
#include <future>
#include <exception>
#include <stdexcept>
#include <optional>
#include <memory>
#include <iostream>
#include <iomanip>
#include <bitset>
#include <string>
#include <chrono>
#include <algorithm>
#include <any>
#include <type_traits>
class AtomicMutex {
class LockerC;
public:
void lock() {
while (f_.test_and_set(std::memory_order_acquire))
//f_.wait(true, std::memory_order_acquire)
;
}
void unlock() {
f_.clear(std::memory_order_release);
//f_.notify_all();
}
LockerC Locker() { return LockerC(*this); }
private:
class LockerC {
public:
LockerC() = delete;
LockerC(AtomicMutex & mux) : pmux_(&mux) { mux.lock(); }
LockerC(LockerC const & other) = delete;
LockerC(LockerC && other) : pmux_(other.pmux_) { other.pmux_ = nullptr; }
~LockerC() { if (pmux_) pmux_->unlock(); }
LockerC & operator = (LockerC const & other) = delete;
LockerC & operator = (LockerC && other) = delete;
private:
AtomicMutex * pmux_ = nullptr;
};
std::atomic_flag f_ = ATOMIC_FLAG_INIT;
};
template <typename T>
class RingBuffer {
public:
RingBuffer() : buf_(1 << 8), last_(buf_.size() - 1) {}
T & front() { return buf_[first_]; }
T const & front() const { return buf_[first_]; }
T & back() { return buf_[last_]; }
T const & back() const { return buf_[last_]; }
size_t size() const { return size_; }
bool empty() const { return size_ == 0; }
template <typename ... Args>
void emplace(Args && ... args) {
while (size_ >= buf_.size()) {
std::rotate(&buf_[0], &buf_[first_], &buf_[buf_.size()]);
first_ = 0;
last_ = buf_.size() - 1;
buf_.resize(buf_.size() * 2);
}
++size_;
++last_;
if (last_ >= buf_.size())
last_ = 0;
buf_[last_] = T(std::forward<Args>(args)...);
}
void pop() {
if (size_ == 0)
return;
--size_;
++first_;
if (first_ >= buf_.size())
first_ = 0;
}
private:
std::vector<T> buf_;
size_t first_ = 0, last_ = 0, size_ = 0;
};
template <typename T>
class Queue {
public:
size_t Size() const { return q_.size(); }
bool Empty() const { return q_.size() == 0; }
template <typename ... Args>
void Emplace(Args && ... args) {
auto lock = m_.Locker();
q_.emplace(std::forward<Args>(args)...);
}
T Pop(std::function<void()> const & on_empty = []{},
std::function<void()> const & on_full = []{}) {
while (true) {
if (q_.empty()) {
on_empty();
continue;
}
auto lock = m_.Locker();
if (q_.empty()) {
on_empty();
continue;
}
on_full();
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
}
std::optional<T> TryPop() {
auto lock = m_.Locker();
if (q_.empty())
return std::nullopt;
T val = std::move(q_.front());
q_.pop();
return std::move(val);
}
private:
AtomicMutex m_;
RingBuffer<T> q_;
};
class RunInDestr {
public:
RunInDestr(std::function<void()> const & f) : f_(f) {}
~RunInDestr() { f_(); }
private:
std::function<void()> const & f_;
};
class Pool {
public:
struct FinishExc {};
struct Worker {
std::unique_ptr<std::atomic<bool>> pdone = std::make_unique<std::atomic<bool>>(true);
std::unique_ptr<std::exception_ptr> pexc = std::make_unique<std::exception_ptr>();
std::unique_ptr<std::thread> thr;
};
Pool(size_t nthreads = size_t(-1)) {
if (nthreads == size_t(-1))
nthreads = std::thread::hardware_concurrency() - 1;
std::cout << "Pool has " << nthreads << " worker threads." << std::endl;
for (size_t i = 0; i < nthreads; ++i) {
workers_.emplace_back(Worker{});
workers_.back().thr = std::make_unique<std::thread>(
[&, pdone = workers_.back().pdone.get(), pexc = workers_.back().pexc.get()]{
try {
std::function<void()> f_done = [pdone]{
pdone->store(true, std::memory_order_relaxed);
}, f_empty = [this]{
CheckFinish();
}, f_full = [pdone]{
pdone->store(false, std::memory_order_relaxed);
};
while (true) {
RunInDestr set_done(f_done);
tasks_.Pop(f_empty, f_full)();
}
} catch (...) {
exc_.store(true, std::memory_order_relaxed);
*pexc = std::current_exception();
}
});
}
}
~Pool() {
Wait();
Finish();
}
void CheckExc() {
if (!exc_.load(std::memory_order_relaxed))
return;
Finish();
throw std::runtime_error("Pool: Exception occurred!");
}
void Finish() {
finish_ = true;
for (auto & w: workers_)
try {
w.thr->join();
if (*w.pexc)
std::rethrow_exception(*w.pexc);
} catch (FinishExc const &) {}
workers_.clear();
}
template <typename ... Args>
void Emplace(Args && ... args) {
CheckExc();
tasks_.Emplace(std::forward<Args>(args)...);
}
void Wait() {
while (true) {
auto task = tasks_.TryPop();
if (!task)
break;
(*task)();
}
while (true) {
bool done = true;
for (auto & w: workers_)
if (!w.pdone->load(std::memory_order_relaxed)) {
done = false;
break;
}
if (done)
break;
}
CheckExc();
}
private:
void CheckFinish() {
if (finish_)
throw FinishExc{};
}
Queue<std::function<void()>> tasks_;
std::vector<Worker> workers_;
std::atomic<bool> finish_ = false; // read by workers while the main thread writes it
std::atomic<bool> exc_ = false;
};
template <bool MultiThreaded = true, size_t BitSize>
void ProcessBitset(Pool & pool, std::bitset<BitSize> & bset,
std::string const & businessLogicCriteria) {
static size_t constexpr block = BitSize / 16;
for (int j = 0; j < BitSize; j += block) {
auto task = [&bset, j]{
int const hi = std::min(j + block, BitSize);
for (int i = j; i < hi; ++i) {
if (i % 2 == 0)
bset[i] = 0;
else
bset[i] = 1;
}
};
if constexpr(MultiThreaded)
pool.Emplace(std::move(task));
else
task();
}
if constexpr(MultiThreaded)
pool.Wait();
}
static auto const gtb = std::chrono::high_resolution_clock::now();
double Time() {
return std::chrono::duration_cast<std::chrono::duration<double>>(
std::chrono::high_resolution_clock::now() - gtb).count();
}
void Compute() {
Pool pool;
std::bitset<65536> bset;
std::string businessLogicCriteria;
int const hi = 50000;
for (int j = 0; j < hi; ++j) {
if ((j & 0x1FFF) == 0 || j + 1 >= hi)
std::cout << j / 1000 << "K (" << std::fixed << std::setprecision(3) << Time() << " sec), " << std::flush;
ProcessBitset(pool, bset, businessLogicCriteria);
businessLogicCriteria = "...";
}
}
void TimeMeasure() {
size_t constexpr A = 1 << 16, B = 1 << 5;
{
Pool pool;
auto const tb = Time();
int64_t volatile x = 0;
for (size_t i = 0; i < A; ++i) {
for (size_t j = 0; j < B; ++j)
pool.Emplace([&]{ x = x + 1; });
pool.Wait();
}
std::cout << "AtomicPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << A * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (A * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (A * B) << std::endl;
}
{
auto const tb = Time();
//size_t const nthr = std::thread::hardware_concurrency();
size_t constexpr C = A / 8;
std::vector<std::future<void>> asyncs;
int64_t volatile x = 0;
for (size_t i = 0; i < C; ++i) {
asyncs.clear();
for (size_t j = 0; j < B; ++j)
asyncs.emplace_back(std::async(std::launch::async, [&]{ x = x + 1; }));
asyncs.clear();
}
std::cout << "AsyncPool time " << std::fixed << std::setprecision(3) << (Time() - tb)
<< " sec, speed " << C * B / (Time() - tb) / 1000.0 << " empty K-tasks/sec, "
<< 1'000'000 / (C * B / (Time() - tb)) << " sec/M-task, no-collisions "
<< std::setprecision(7) << double(x) / (C * B) << std::endl;
}
}
int main() {
try {
TimeMeasure();
Compute();
return 0;
} catch (std::exception const & ex) {
std::cout << "Exception: " << ex.what() << std::endl;
return -1;
} catch (...) {
std::cout << "Unknown Exception!" << std::endl;
return -1;
}
}
Output for 4 cores (8 hardware threads):
Pool has 7 worker threads.
AtomicPool time 0.903 sec, speed 2321.831 empty K-tasks/sec, 0.431 sec/M-task, no-collisions 0.9999967
AsyncPool time 0.982 sec, speed 266.789 empty K-tasks/sec, 3.750 sec/M-task, no-collisions 0.9999123
Pool has 7 worker threads.
0K (0.074 sec), 8K (0.670 sec), 16K (1.257 sec), 24K (1.852 sec), 32K (2.435 sec), 40K (2.984 sec), 49K (3.650 sec), 49K (3.711 sec),
For comparison, below are the timings of the single-threaded version, which is 6x slower:
0K (0.125 sec), 8K (3.786 sec), 16K (7.754 sec), 24K (11.202 sec), 32K (14.662 sec), 40K (18.056 sec), 49K (21.470 sec), 49K (21.841 sec),
You have this inner loop you want to parallelize:
for (size_t j = 0; j < bits.size(); ++j)
bits[j] = ModifyBit(i, j, hash, bits[j]);
So a good idea is to split it into chunks, and have multiple threads do each chunk in parallel. You can submit chunks to workers easily with a std::atomic<int> counter that increments to identify which chunk to work on. You can also make sure the threads all stop working after one loop before starting the next with a std::barrier:
std::bitset<65536> bits{};
std::thread pool[8]; // Change size accordingly
std::atomic<int> task_number{0};
constexpr std::size_t tasks_per_loop = 32; // Arbitrarily chosen
constexpr std::size_t block_size = (bits.size()+tasks_per_loop-1) / tasks_per_loop;
// (only written to by one thread by the barrier, so not atomic)
uint64_t hash = 0;
int i = 0;
std::barrier barrier(std::size(pool), [&]() noexcept { // completion function must be nothrow-invocable
task_number = 0;
++i;
hash = Hash(bits, hash);
});
for (std::thread& t : pool) {
t = std::thread([&]{
while (i < 50000) {
for (int t; (t = task_number++) < tasks_per_loop;) {
int block_start = t * block_size;
int block_end = std::min(block_start + block_size, bits.size());
for (int j = block_start; j < block_end; ++j) {
bits[j] = ModifyBit(i, j, hash, bits[j]);
}
}
// Wait for other threads to finish and hash
// to be calculated before starting next loop
barrier.arrive_and_wait();
}
});
}
for (std::thread& t : pool) t.join();
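The chunk-claiming part of this approach can be tried in isolation. The sketch below is an assumption-laden simplification, not the code above: parallel_for_chunks is a hypothetical helper, it needs only C++11, and it starts fresh threads on every call instead of keeping a persistent pool synchronized by std::barrier, so it shows how the atomic counter hands out work but not the low per-iteration overhead.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Run fn(index) for every index in [0, n). The range is cut into `chunks`
// blocks, and `nthreads` threads claim blocks one at a time from a shared
// atomic counter until none are left.
template <typename Fn>
void parallel_for_chunks(std::size_t n, std::size_t chunks,
                         std::size_t nthreads, Fn fn) {
    assert(chunks > 0 && nthreads > 0);
    const std::size_t block = (n + chunks - 1) / chunks;
    std::atomic<std::size_t> next{0};
    auto work = [&] {
        // fetch_add hands every chunk number to exactly one thread
        for (std::size_t c; (c = next.fetch_add(1)) < chunks;) {
            const std::size_t begin = std::min(c * block, n);
            const std::size_t end = std::min(begin + block, n);
            for (std::size_t i = begin; i < end; ++i) fn(i);
        }
    };
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t + 1 < nthreads; ++t) pool.emplace_back(work);
    work();  // the calling thread claims chunks too
    for (auto& th : pool) th.join();
}
```

When the target is a std::bitset, chunk boundaries should fall on word boundaries, since two threads writing different bits of the same word would race; 65536 bits divided into 32 chunks gives 2048 bits per chunk, which satisfies that.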
(The seemingly easy way of parallelizing the for loop with OpenMP #pragma omp parallel for seemed slower with some testing, perhaps because the tasks were so small)
Here it is against your implementation running similar code: https://godbolt.org/z/en76Kv4nn
And on my machine, running this a few times with 1 million iterations took 28 to 32 seconds with my approach and 44 to 50 seconds with your general thread pool approach (granted this is much less general because it can't execute arbitrary std::function<void()> tasks).
I want to find out how much time a certain function in my C++ program takes to execute on Linux. Afterwards, I want to make a speed comparison. I saw several time functions but ended up with this one from Boost.Chrono:
process_user_cpu_clock, captures user-CPU time spent by the current process
What is not clear to me: if I use the above function, will I get only the time the CPU spent on that function?
Secondly, I could not find any example of using the above function. Can anyone please show me how to use it?
P.S.: Right now, I am using std::chrono::system_clock::now() to get the time in seconds, but this gives me different results every time because of varying CPU load.
It is a very easy-to-use method in C++11. You have to use std::chrono::high_resolution_clock from <chrono> header.
Use it like so:
#include <chrono>
/* Only needed for the sake of this example. */
#include <iostream>
#include <thread>
void long_operation()
{
/* Simulating a long, heavy operation. */
using namespace std::chrono_literals;
std::this_thread::sleep_for(150ms);
}
int main()
{
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
auto t1 = high_resolution_clock::now();
long_operation();
auto t2 = high_resolution_clock::now();
/* Getting number of milliseconds as an integer. */
auto ms_int = duration_cast<milliseconds>(t2 - t1);
/* Getting number of milliseconds as a double. */
duration<double, std::milli> ms_double = t2 - t1;
std::cout << ms_int.count() << "ms\n";
std::cout << ms_double.count() << "ms\n";
return 0;
}
This will measure the duration of the function long_operation.
Possible output:
150ms
150.068ms
Working example: https://godbolt.org/z/oe5cMd
Here's a function that will measure the execution time of any function passed as argument:
#include <chrono>
#include <utility>
typedef std::chrono::high_resolution_clock::time_point TimeVar;
#define duration(a) std::chrono::duration_cast<std::chrono::nanoseconds>(a).count()
#define timeNow() std::chrono::high_resolution_clock::now()
template<typename F, typename... Args>
double funcTime(F func, Args&&... args){
TimeVar t1=timeNow();
func(std::forward<Args>(args)...);
return duration(timeNow()-t1);
}
Example usage:
#include <iostream>
#include <algorithm>
#include <string>
typedef std::string String;
//first test function doing something
int countCharInString(String s, char delim){
int count=0;
String::size_type pos = s.find_first_of(delim);
while ((pos = s.find_first_of(delim, pos)) != String::npos){
count++;pos++;
}
return count;
}
//second test function doing the same thing in different way
int countWithAlgorithm(String s, char delim){
return std::count(s.begin(),s.end(),delim);
}
int main(){
std::cout<<"norm: "<<funcTime(countCharInString,"precision=10",'=')<<"\n";
std::cout<<"algo: "<<funcTime(countWithAlgorithm,"precision=10",'=');
return 0;
}
Output:
norm: 15555
algo: 2976
In Scott Meyers's book I found an example of a universal generic lambda expression that can be used to measure function execution time (C++14):
auto timeFuncInvocation =
[](auto&& func, auto&&... params) {
// get time before function invocation
const auto& start = std::chrono::high_resolution_clock::now();
// function invocation using perfect forwarding
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
// get time after function invocation
const auto& stop = std::chrono::high_resolution_clock::now();
return stop - start;
};
The problem is that you are measuring only one execution, so the results can vary a lot. To get a reliable result, you should measure a large number of executions.
According to Andrei Alexandrescu lecture at code::dive 2015 conference - Writing Fast Code I:
Measured time: tm = t + tq + tn + to
where:
tm - measured (observed) time
t - the actual time of interest
tq - time added by quantization noise
tn - time added by various sources of noise
to - overhead time (measuring, looping, calling functions)
According to what he said later in the lecture, you should take the minimum over this large number of executions as your result.
I encourage you to look at the lecture in which he explains why.
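Taking the minimum fits the decomposition above: each noise term only adds time, so the smallest observation is the one closest to t plus the fixed overhead. A minimal sketch of that idea, assuming a hypothetical helper named min_time_of and using std::chrono::steady_clock, could look like this:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Run fn `runs` times and return the shortest duration observed.
template <typename Fn>
std::chrono::steady_clock::duration min_time_of(Fn&& fn, std::size_t runs) {
    assert(runs > 0);
    auto best = std::chrono::steady_clock::duration::max();
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto elapsed = std::chrono::steady_clock::now() - start;
        if (elapsed < best) best = elapsed;  // keep the least-disturbed run
    }
    return best;
}
```

For very short functions each timed run should itself loop over several invocations, otherwise the clock's own resolution and call overhead dominate the measurement.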
Also there is a very good library from google - https://github.com/google/benchmark.
This library is simple to use and powerful. You can check out some of Chandler Carruth's lectures on YouTube where he uses this library in practice, for example CppCon 2017: Chandler Carruth “Going Nowhere Faster”.
Example usage:
#include <iostream>
#include <chrono>
#include <vector>
auto timeFuncInvocation =
[](auto&& func, auto&&... params) {
// get time before function invocation
const auto& start = std::chrono::high_resolution_clock::now();
// function invocation using perfect forwarding
for(auto i = 0; i < 100000/*largeNumber*/; ++i) {
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
}
// get time after function invocation
const auto& stop = std::chrono::high_resolution_clock::now();
return (stop - start)/100000/*largeNumber*/;
};
void f(std::vector<int>& vec) {
vec.push_back(1);
}
void f2(std::vector<int>& vec) {
vec.emplace_back(1);
}
int main()
{
std::vector<int> vec;
std::vector<int> vec2;
std::cout << timeFuncInvocation(f, vec).count() << std::endl;
std::cout << timeFuncInvocation(f2, vec2).count() << std::endl;
std::vector<int> vec3;
vec3.reserve(100000);
std::vector<int> vec4;
vec4.reserve(100000);
std::cout << timeFuncInvocation(f, vec3).count() << std::endl;
std::cout << timeFuncInvocation(f2, vec4).count() << std::endl;
return 0;
}
EDIT:
Of course, you always need to remember that your compiler may or may not optimize something out. Tools like perf can be useful in such cases.
A simple program to find the execution time of a function:
#include <iostream>
#include <ctime> // time_t
#include <cstdio>
void function()
{
for(long int i=0;i<1000000000;i++)
{
// do nothing
}
}
int main()
{
time_t begin,end; // time_t is a datatype to store time values.
time (&begin); // note time before execution
function();
time (&end); // note time after execution
double difference = difftime (end,begin);
printf ("time taken for function() %.2lf seconds.\n", difference );
return 0;
}
Easy way for older C++, or C:
#include <time.h> // includes clock_t and CLOCKS_PER_SEC
int main() {
clock_t start, end;
start = clock();
// ...code to measure...
end = clock();
double duration_sec = double(end-start)/CLOCKS_PER_SEC;
return 0;
}
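The timing precision in seconds is at best 1.0/CLOCKS_PER_SEC. Note that clock() reports the processor time used by the program rather than elapsed wall-clock time.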
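To see how processor time and elapsed time differ for the same piece of code, both can be recorded side by side. The sketch below uses invented names (Timing, time_cpu_and_wall) and comes with caveats: std::clock() covers the whole process rather than a single thread or function, and Microsoft's C runtime is documented as reporting elapsed wall-clock time from clock(), so the distinction mainly shows up on Linux and similar systems.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

struct Timing {
    double cpu_seconds;   // processor time reported by std::clock()
    double wall_seconds;  // elapsed time reported by steady_clock
};

// Run fn once and report both processor time and elapsed time.
template <typename Fn>
Timing time_cpu_and_wall(Fn&& fn) {
    const std::clock_t c0 = std::clock();
    const auto w0 = std::chrono::steady_clock::now();
    fn();
    const auto w1 = std::chrono::steady_clock::now();
    const std::clock_t c1 = std::clock();
    // clock() returns (clock_t)(-1) when processor time is unavailable
    assert(c0 != std::clock_t(-1) && c1 != std::clock_t(-1));
    return {double(c1 - c0) / CLOCKS_PER_SEC,
            std::chrono::duration<double>(w1 - w0).count()};
}
```

A function that mostly sleeps or waits on I/O would be expected to show a wall time much larger than its CPU time, while a busy loop on one thread should show the two close together.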
#include <iostream>
#include <chrono>
void function()
{
// code here;
}
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
function();
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << duration << "\n";
return 0;
}
This Worked for me.
Note:
The high_resolution_clock is not implemented consistently across different standard library implementations, and its use should be avoided. It is often just an alias for std::chrono::steady_clock or std::chrono::system_clock, but which one it is depends on the library or configuration. When it is a system_clock, it is not monotonic (e.g., the time can go backwards).
For example, for gcc's libstdc++ it is system_clock, for MSVC it is steady_clock, and for clang's libc++ it depends on configuration.
Generally one should just use std::chrono::steady_clock or std::chrono::system_clock directly instead of std::chrono::high_resolution_clock: use steady_clock for duration measurements, and system_clock for wall-clock time.
Here is an excellent header only class template to measure the elapsed time of a function or any code block:
#ifndef EXECUTION_TIMER_H
#define EXECUTION_TIMER_H
#include <chrono>
#include <iostream>
#include <sstream>
#include <type_traits>
template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
std::chrono::high_resolution_clock,
std::chrono::steady_clock>;
private:
const Clock::time_point mStart = Clock::now();
public:
ExecutionTimer() = default;
~ExecutionTimer() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Destructor Elapsed: "
<< std::chrono::duration_cast<Resolution>( end - mStart ).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
inline void stop() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Stop Elapsed: "
<< std::chrono::duration_cast<Resolution>(end - mStart).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
}; // ExecutionTimer
#endif // EXECUTION_TIMER_H
Here are some uses of it:
int main() {
{ // empty scope to display ExecutionTimer's destructor's message
// displayed in milliseconds
ExecutionTimer<std::chrono::milliseconds> timer;
// function or code block here
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::microseconds> timer;
// code block here...
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::nanoseconds> timer;
// code block here...
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::seconds> timer;
// code block here...
timer.stop();
}
return 0;
}
Since the class is a template, we can easily specify the unit in which the time is measured and displayed. This is a handy utility class template for benchmarking and is very easy to use.
If you want to save time and lines of code, you can make measuring the function execution time a one-line macro:
a) Implement a time-measuring class as already suggested above (here is my implementation for Android):
class MeasureExecutionTime{
private:
const std::chrono::steady_clock::time_point begin;
const std::string caller;
public:
// Initializers are listed in member declaration order (begin, then caller).
MeasureExecutionTime(const std::string& caller):begin(std::chrono::steady_clock::now()),caller(caller){}
~MeasureExecutionTime(){
const auto duration=std::chrono::steady_clock::now()-begin;
LOGD("ExecutionTime")<<"For "<<caller<<" is "<<std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()<<"ms";
}
};
b) Add a convenient macro that uses the current function name as TAG (using a macro here is important, otherwise __FUNCTION__ will evaluate to MeasureExecutionTime instead of the function you want to measure):
#ifndef MEASURE_FUNCTION_EXECUTION_TIME
#define MEASURE_FUNCTION_EXECUTION_TIME const MeasureExecutionTime measureExecutionTime(__FUNCTION__);
#endif
c) Write your macro at the beginning of the function you want to measure. Example:
void DecodeMJPEGtoANativeWindowBuffer(uvc_frame_t* frame_mjpeg,const ANativeWindow_Buffer& nativeWindowBuffer){
MEASURE_FUNCTION_EXECUTION_TIME
// Do some time-critical stuff
}
Which will result in the following output:
ExecutionTime: For DecodeMJPEGtoANativeWindowBuffer is 54ms
Note that this (like all the other suggested solutions) measures the time between when your function was called and when it returned, not necessarily the time your CPU spent executing the function. However, if you don't give the scheduler any chance to suspend your running code by calling sleep() or similar, there is no difference between the two.
This is very easy to do in C++11.
We can use std::chrono::high_resolution_clock from the <chrono> header.
We can write a function that prints the execution time in a more readable form.
For example, to find the all the prime numbers between 1 and 100 million, it takes approximately 1 minute and 40 seconds.
So the execution time get printed as:
Execution Time: 1 Minutes, 40 Seconds, 715 MicroSeconds, 715000 NanoSeconds
The code is here:
#include <iostream>
#include <chrono>
#include <string>
using namespace std;
using namespace std::chrono;
typedef high_resolution_clock Clock;
typedef Clock::time_point ClockTime;
void findPrime(long n, string file);
void printExecutionTime(ClockTime start_time, ClockTime end_time);
int main()
{
long n = long(1E+8); // N = 100 million
ClockTime start_time = Clock::now();
// Write all the prime numbers from 1 to N to the file "prime.txt"
findPrime(n, "C:\\prime.txt");
ClockTime end_time = Clock::now();
printExecutionTime(start_time, end_time);
}
void printExecutionTime(ClockTime start_time, ClockTime end_time)
{
auto execution_time_ns = duration_cast<nanoseconds>(end_time - start_time).count();
// Microsecond count (named _us to avoid confusion with milliseconds)
auto execution_time_us = duration_cast<microseconds>(end_time - start_time).count();
auto execution_time_sec = duration_cast<seconds>(end_time - start_time).count();
auto execution_time_min = duration_cast<minutes>(end_time - start_time).count();
auto execution_time_hour = duration_cast<hours>(end_time - start_time).count();
cout << "\nExecution Time: ";
if(execution_time_hour > 0)
cout << "" << execution_time_hour << " Hours, ";
if(execution_time_min > 0)
cout << "" << execution_time_min % 60 << " Minutes, ";
if(execution_time_sec > 0)
cout << "" << execution_time_sec % 60 << " Seconds, ";
if(execution_time_us > 0)
cout << "" << execution_time_us % long(1E+3) << " MicroSeconds, ";
if(execution_time_ns > 0)
cout << "" << execution_time_ns % long(1E+6) << " NanoSeconds, ";
}
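A different way to get a readable breakdown, sketched here as an alternative and not as the code above, is to peel off each unit in turn and subtract it, so that every field holds only the remainder left by the larger units. The function name format_duration and the unit labels are made up for this sketch.

```cpp
#include <chrono>
#include <sstream>
#include <string>

// Hypothetical helper: splits a duration into hours, minutes, seconds,
// milliseconds and microseconds. After each cast the extracted part is
// subtracted, so the next unit only sees what is left over.
std::string format_duration(std::chrono::microseconds d)
{
    using namespace std::chrono;
    const auto h = duration_cast<hours>(d);
    d -= h;
    const auto m = duration_cast<minutes>(d);
    d -= m;
    const auto s = duration_cast<seconds>(d);
    d -= s;
    const auto ms = duration_cast<milliseconds>(d);
    d -= ms;  // d now holds the leftover microseconds

    std::ostringstream out;
    out << h.count() << " h, " << m.count() << " min, " << s.count() << " s, "
        << ms.count() << " ms, " << d.count() << " us";
    return out.str();
}
```

For an input of 100 seconds plus 715 milliseconds this returns "0 h, 1 min, 40 s, 715 ms, 0 us".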
I recommend using steady_clock, which is guaranteed to be monotonic, unlike high_resolution_clock.
#include <iostream>
#include <chrono>
#include <string>
using namespace std;
unsigned int stopwatch()
{
static auto start_time = chrono::steady_clock::now();
auto end_time = chrono::steady_clock::now();
auto delta = chrono::duration_cast<chrono::microseconds>(end_time - start_time);
start_time = end_time;
return delta.count();
}
int main() {
stopwatch(); //Start stopwatch
std::cout << "Hello World!\n";
cout << stopwatch() << endl; //Time to execute last line
for (int i=0; i<1000000; i++)
string s = "ASDFAD";
cout << stopwatch() << endl; //Time to execute for loop
}
Output:
Hello World!
62
163514
Since none of the provided answers are very accurate or give reproducible results, I decided to add a link to my code, which has sub-nanosecond precision and scientific statistics.
Note that this only works for code that takes a (very) short time to run (a few clock cycles to a few thousand). If the code runs long enough that it is likely to be interrupted by some -heh- interrupt, then it is clearly not possible to give a reproducible and accurate result. The consequence is that the measurement never finishes: it keeps measuring until it is statistically 99.9% sure it has the right answer, which never happens on a machine that has other processes running when the code takes too long.
https://github.com/CarloWood/cwds/blob/master/benchmark.h#L40
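For readers who only want the general idea of repeating a measurement, here is a much cruder sketch. It is not the method used by the linked header: it simply runs the code a fixed number of times and reports the minimum and the median, which are less sensitive to interruptions than a single sample. The names Stats and benchmark and the default of 101 runs are arbitrary choices for this illustration.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical summary of repeated timings, in microseconds.
struct Stats {
    double min_us;
    double median_us;
};

// Hypothetical helper: times `work` `runs` times with steady_clock and
// returns the smallest and the middle sample.
template <class Work>
Stats benchmark(Work&& work, std::size_t runs = 101)
{
    using clock = std::chrono::steady_clock;
    if (runs == 0)
        runs = 1;  // at least one sample is needed

    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return { samples.front(), samples[samples.size() / 2] };
}
```

The resolution of this sketch is limited by steady_clock and by the cost of the now() calls, so it is not suitable for code that runs for only a few cycles.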
You can write a simple class for this kind of measurement.
class duration_printer {
public:
// Member is named start_ because identifiers containing a double underscore are reserved.
duration_printer() : start_(std::chrono::high_resolution_clock::now()) {}
~duration_printer() {
using namespace std::chrono;
high_resolution_clock::time_point end = high_resolution_clock::now();
duration<double> dur = duration_cast<duration<double>>(end - start_);
std::cout << dur.count() << " seconds" << std::endl;
}
private:
std::chrono::high_resolution_clock::time_point start_;
};
The only thing you need to do is create an object at the beginning of the function you want to time:
void veryLongExecutingFunction() {
duration_printer dp; // prints the elapsed time when it goes out of scope
for(int i = 0; i < 100000; ++i) std::cout << "Hello world" << std::endl;
}
int main() {
veryLongExecutingFunction();
return 0;
}
and that's it. The class can be modified to fit your requirements.
C++11 cleaned up version of Jahid's response:
#include <chrono>
#include <iostream>
#include <thread>
#include <utility>
void long_operation(int ms)
{
/* Simulating a long, heavy operation. */
std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
template<typename F, typename... Args>
double funcTime(F func, Args&&... args){
std::chrono::high_resolution_clock::time_point t1 =
std::chrono::high_resolution_clock::now();
func(std::forward<Args>(args)...);
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::high_resolution_clock::now()-t1).count();
}
int main()
{
std::cout<<"expect 150: "<<funcTime(long_operation,150)<<"\n";
return 0;
}
This is a very basic timer class which you can expand on depending on your needs. I wanted something straightforward which can be used cleanly in code. You can mess with it at coding ground with this link: http://tpcg.io/nd47hFqr.
class local_timer {
private:
// Use the public clock type; std::chrono::_V2 is a libstdc++ implementation detail.
std::chrono::high_resolution_clock::time_point start_time;
std::chrono::high_resolution_clock::time_point stop_time;
std::chrono::high_resolution_clock::time_point stop_time_temp;
std::chrono::microseconds most_recent_duration_usec_chrono;
double most_recent_duration_sec;
public:
local_timer() {
};
~local_timer() {
};
void start() {
this->start_time = std::chrono::high_resolution_clock::now();
};
void stop() {
this->stop_time = std::chrono::high_resolution_clock::now();
};
double get_time_now() {
this->stop_time_temp = std::chrono::high_resolution_clock::now();
this->most_recent_duration_usec_chrono = std::chrono::duration_cast<std::chrono::microseconds>(stop_time_temp-start_time);
this->most_recent_duration_sec = (long double)most_recent_duration_usec_chrono.count()/1000000;
return this->most_recent_duration_sec;
};
double get_duration() {
this->most_recent_duration_usec_chrono = std::chrono::duration_cast<std::chrono::microseconds>(stop_time-start_time);
this->most_recent_duration_sec = (long double)most_recent_duration_usec_chrono.count()/1000000;
return this->most_recent_duration_sec;
};
};
The use for this being
#include <iostream>
#include "timer.hpp" // if the class is kept in an .hpp file in the same folder; it can also be defined before your main function
int main() {
//create two timers
local_timer timer1 = local_timer();
local_timer timer2 = local_timer();
//set start time for timer1
timer1.start();
//wait 1 second
while(timer1.get_time_now() < 1.0) {
}
//save time
timer1.stop();
//print time
std::cout << timer1.get_duration() << " seconds, timer 1\n" << std::endl;
timer2.start();
for(long int i = 0; i < 100000000; i++) {
//do something
if(i%1000000 == 0) {
//return time since loop started
std::cout << timer2.get_time_now() << " seconds, timer 2\n"<< std::endl;
}
}
return 0;
}
I am trying to keep the execution time of each loop iteration at 10 ms using usleep, but sometimes it exceeds 10 ms.
I have no idea how to solve this problem. Is it appropriate to use usleep and gettimeofday in this case?
Please help me find out what I missed.
Result: 0.0127289
0.0136499
0.0151598
0.0114031
0.014801
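#include <iostream>
#include <sys/time.h> // gettimeofday
#include <unistd.h> // usleep
using namespace std;
double tvsecf(){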
struct timeval tv;
double asec;
gettimeofday(&tv,NULL);
asec = tv.tv_usec;
asec /= 1e6;
asec += tv.tv_sec;
return asec;
}
int main(){
double t1 ,t2;
t1 = tvsecf();
for(;;){
t2= tvsecf();
if(t2-t1 >= 0.01){
if(t2-t1 >= 0.011)
cout << t2-t1 <<endl;
t1 = tvsecf();
}
usleep(100);
}
}
To keep the loop overhead (which is generally unknown) from constantly accumulating error, you can sleep until a time point, instead of for a time duration. Using C++'s <chrono> and <thread> libraries, this is incredibly easy:
#include <chrono>
#include <iostream>
#include <thread>
int
main()
{
using namespace std;
using namespace std::chrono;
auto t0 = steady_clock::now() + 10ms;
for (;;)
{
this_thread::sleep_until(t0);
t0 += 10ms;
}
}
One can dress this up with more calls to steady_clock::now() in order to ascertain the time between iterations, and perhaps more importantly, the average iteration time:
#include <chrono>
#include <iostream>
#include <thread>
int
main()
{
using namespace std;
using namespace std::chrono;
using dsec = duration<double>;
auto t0 = steady_clock::now() + 10ms;
auto t1 = steady_clock::now();
auto t2 = t1;
constexpr auto N = 1000;
dsec avg{0};
for (auto i = 0; i < N; ++i)
{
this_thread::sleep_until(t0);
t0 += 10ms;
t2 = steady_clock::now();
dsec delta = t2-t1;
std::cout << delta.count() << "s\n";
avg += delta;
t1 = t2;
}
avg /= N;
cout << "avg = " << avg.count() << "s\n";
}
Above I've added to the loop overhead by doing more things within the loop. However the loop is still going to wake up about every 10ms. Sometimes the OS will wake the thread late, but next time the loop automatically adjusts itself to sleep for a shorter time. Thus the average iteration rate self-corrects to 10ms.
On my machine this just output:
...
0.0102046s
0.0128338s
0.00700504s
0.0116826s
0.00785826s
0.0107023s
0.00912614s
0.0104725s
0.010489s
0.0112545s
0.00906409s
avg = 0.0100014s
There is no way to guarantee a 10 ms loop time.
All sleep functions sleep for at least the requested time.
For a portable solution, use std::this_thread::sleep_for:
#include <iostream>
#include <chrono>
#include <thread>
int main()
{
for (;;) {
auto start = std::chrono::high_resolution_clock::now();
std::this_thread::sleep_for(std::chrono::milliseconds{10});
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> elapsed = end-start;
std::cout << "Waited " << elapsed.count() << " ms\n";
}
}
Depending on what you are trying to do, take a look at Howard Hinnant's date library.
From the usleep man page:
The sleep may be lengthened slightly by any system activity or by the time spent processing the call or by the granularity of system timers.
If you need high resolution: with C on Unix (or Linux) check out this answer that explains how to use high resolution timers using clock_gettime.
Edit: As mentioned by Tobias nanosleep may be a better option:
Compared to sleep(3) and usleep(3), nanosleep() has the following
advantages: it provides a higher resolution for specifying the sleep
interval; POSIX.1 explicitly specifies that it does not interact with
signals; and it makes the task of resuming a sleep that has been
interrupted by a signal handler easier.
I have a large codebase and I want to manually add some timers to profile some sections of the code.
Some of those sections are within a loop, so I would like to aggregate all the wall time spent there for each iteration.
What I'd like to do in a Pythonic pseudo-code:
time_step_1 = 0
time_step_2 = 0
for pair in pairs:
start_step_1 = time.now()
run_step_1(pair)
time_step_1 += time.now() - start_step_1
start_step_2 = time.now()
run_step_2(pair)
time_step_2 += time.now() - start_step_2
print("Time spent in step 1", time_step_1)
print("Time spent in step 2", time_step_2)
Is there a library in C++ to do this?
Otherwise would you recommend using boost::timer, create a map of timers and then resume and stop at each iteration?
Not very advanced, but for basic time measurement you can use the std::chrono library, specifically std::chrono::high_resolution_clock, the clock
with the smallest tick period (i.e. the finest resolution) provided by the implementation.
For some more trivial time measurement, I have used RAII classes similar to this:
#include <chrono>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <string>
class TimeMeasureGuard {
public:
using clock_type = std::chrono::high_resolution_clock;
private:
const std::string m_testName;
std::ostream& m_os;
clock_type::time_point started_at;
clock_type::time_point ended_at;
public:
TimeMeasureGuard(const std::string& testName, std::ostream& os = std::cerr)
: m_testName(testName), m_os(os)
{
started_at = clock_type::now();
}
~TimeMeasureGuard()
{
ended_at = clock_type::now();
// Get duration
const auto duration = ended_at - started_at;
// Get duration in nanoseconds
const auto durationNs = std::chrono::nanoseconds(duration).count();
// ...or in microseconds:
const auto durationUs
= std::chrono::duration_cast<std::chrono::microseconds>(duration).count();
// Report total run time into 'm_os' stream
m_os << "[Test " << std::quoted(m_testName) << "]: Total run time: "
<< durationNs << " ns, " << "or: " << durationUs << " us" << std::endl;
}
};
Of course this is a very simple class, which would deserve several improvements before being used for a real measurement.
You can use this class like:
std::uint64_t computeSquares()
{
std::uint64_t interestingNumbers = 0;
{
auto time_measurement = TimeMeasureGuard("Test1");
for (std::uint64_t x = 0; x < 1'000; ++x) {
for (std::uint64_t y = 0; y < 1'000; ++y) {
if ((x * y) % 42 == 0)
++interestingNumbers;
}
}
}
return interestingNumbers;
}
int main()
{
std::cout << "Computing all x * y, where 'x' and 'y' are from 1 to 1'000..."
<< std::endl;
const auto res = computeSquares();
std::cerr << "Interesting numbers found: " << res << std::endl;
return 0;
}
And the output is:
Computing all x * y, where 'x' and 'y' are from 1 to 1'000...
[Test "Test1"]: Total run time: 6311371 ns, or: 6311 us
Interesting numbers found: 111170
For simple time measurement cases, this might be easier than using
a whole timer library: it's just a few lines of code, and you don't
need to include lots of headers.