We are running some code on a project that uses OpenMP and I've run into something strange. I've included parts of some play code that demonstrates what I see.
The tests compare calling a function with a const char* argument with a std::string argument in a multi-threaded loop. The functions essentially do nothing and so have no overhead.
What I do see is a major difference in the time it takes to complete the loops. For the const char* version doing 100,000,000 iterations the code takes 0.075 seconds to complete compared with 5.08 seconds for the std::string version. These tests were done on Ubuntu-10.04-x64 with gcc-4.4.
My question is basically whether this is solely due to the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?
Code below and many thanks for your responses.
Compiled with: g++ -Wall -Wextra -O3 -fopenmp string_args.cpp -o string_args
#include <iostream>
#include <map>
#include <string>
#include <stdint.h>
// For wall time
#ifdef _WIN32
#include <time.h>
#else
#include <sys/time.h>
#endif
namespace
{
const int64_t g_max_iter = 100000000;
std::map<const char*, int> g_charIndex = std::map<const char*,int>();
std::map<std::string, int> g_strIndex = std::map<std::string,int>();
class Timer
{
public:
Timer()
{
#ifdef _WIN32
m_start = clock();
#else /* linux & mac */
gettimeofday(&m_start,0);
#endif
}
float elapsed()
{
#ifdef _WIN32
clock_t now = clock();
const float retval = float(now - m_start)/CLOCKS_PER_SEC;
m_start = now;
#else /* linux & mac */
timeval now;
gettimeofday(&now,0);
const float retval = float(now.tv_sec - m_start.tv_sec) + float((now.tv_usec - m_start.tv_usec)/1E6);
m_start = now;
#endif
return retval;
}
private:
// The type of this variable is different depending on the platform
#ifdef _WIN32
clock_t
#else
timeval
#endif
m_start; ///< The starting time (implementation dependent format)
};
}
bool contains_char(const char * id)
{
if( g_charIndex.empty() ) return false;
return (g_charIndex.find(id) != g_charIndex.end());
}
bool contains_str(const std::string & name)
{
if( g_strIndex.empty() ) return false;
return (g_strIndex.find(name) != g_strIndex.end());
}
void do_serial_char()
{
int found(0);
Timer clock;
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_char("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_parallel_char()
{
int found(0);
Timer clock;
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_char("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_serial_str()
{
int found(0);
Timer clock;
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_str("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
void do_parallel_str()
{
int found(0);
Timer clock;
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter ; ++i )
{
if( contains_str("pos") )
{
++found;
}
}
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
int main()
{
std::cout << "Starting single-threaded loop using std::string\n";
do_serial_str();
std::cout << "\nStarting multi-threaded loop using std::string\n";
do_parallel_str();
std::cout << "\nStarting single-threaded loop using char *\n";
do_serial_char();
std::cout << "\nStarting multi-threaded loop using const char*\n";
do_parallel_char();
}
My question is basically whether this is solely due to the dynamic allocation of std::string and why in this case that can't be optimized away since it is const and can't change?
Yes, it is due to the allocation and copying for std::string on every iteration.
A sufficiently smart compiler could potentially optimize this, but it is unlikely to happen with current optimizers. Instead, you can hoist the string yourself:
void do_parallel_str()
{
int found(0);
Timer clock;
std::string const str = "pos"; // you can even make it static, if desired
#pragma omp parallel for
for( int64_t i = 0; i < g_max_iter; ++i )
{
if( contains_str(str) )
{
++found;
}
}
//clock.stop(); // Or use something to that effect, so you don't include
// any of the expressions below (such as outputting "Loop time: ") in the timing.
std::cout << "Loop time: " << clock.elapsed() << "\n";
++found;
}
Does changing:
if( contains_str("pos") )
to:
static const std::string str = "pos";
if( contains_str(str) )
change things much? My current best guess is that the implicit std::string constructor call on every iteration introduces a fair amount of overhead, and that optimising it away, whilst possible, is still a sufficiently hard problem.
std::string (in your case a temporary) requires dynamic allocation, which is a very slow operation compared to everything else in your loop. Some old standard library implementations also used copy-on-write (COW) strings, which are slow in a multi-threaded environment as well. Having said that, there is no reason why a compiler cannot optimize away the temporary string creation and the whole contains_str function call, unless you have some side effects there. Since you didn't provide an implementation for that function, it's impossible to say if it could be completely optimized away.
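To make the "one temporary per call" point concrete, here is a minimal sketch. CountedName is a hypothetical stand-in for std::string (it is not part of the question's code) that counts how often it is implicitly built from a const char*. The names contains_name, conversions_with_literal and conversions_when_hoisted are likewise made up for illustration.

```cpp
#include <string>

// Hypothetical stand-in for std::string that counts implicit
// conversions from const char* (assumption: for illustration only).
struct CountedName {
    static int conversions;
    std::string value;
    CountedName(const char* s) : value(s) { ++conversions; }  // intentionally implicit
};
int CountedName::conversions = 0;

// Same shape as contains_str: the argument is taken by const reference.
bool contains_name(const CountedName& name) {
    return name.value.empty();
}

// Passing a literal: a temporary CountedName is built for every call.
int conversions_with_literal(int iterations) {
    CountedName::conversions = 0;
    for (int i = 0; i < iterations; ++i) {
        contains_name("pos");
    }
    return CountedName::conversions;
}

// Hoisting the object out of the loop: it is built exactly once.
int conversions_when_hoisted(int iterations) {
    CountedName::conversions = 0;
    const CountedName name = "pos";
    for (int i = 0; i < iterations; ++i) {
        contains_name(name);
    }
    return CountedName::conversions;
}
```

With 1000 iterations the literal version performs 1000 conversions and the hoisted version performs one. Whether each conversion also costs a heap allocation depends on the standard library's string implementation.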
I am testing std::counting_semaphore on C++20 with Windows 10 and MinGW x64.
As I learned from https://en.cppreference.com/w/cpp/thread/counting_semaphore, std::counting_semaphore is an atomic counter. We can use release() to increase the counter and acquire() to decrease it. If the counter equals 0, the acquiring thread waits.
I built the following simplified example to show my problem.
If I always call release() before acquire() in each thread, the internal counter value (v) of the std::counting_semaphore should always stay between v and v+1, and this code should never block.
When I run this example code, it deadlocks very often, but sometimes it finishes correctly.
I tried to use std::cout messages to understand the deadlock, but the deadlock disappeared when I used std::cout. On the other hand, the deadlock also disappeared when I used std::unique_lock.
The example is as follows:
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <mutex>
#include <semaphore>
using namespace std::literals;
std::mutex mtx;
const int numOfThr {2};
const int numOfForLoop {1000};
const int max_smph {numOfThr* numOfForLoop *2};
std::counting_semaphore<max_smph> smph {numOfThr+1};
void thrf_TestSmph ( const int iThr )
{
for ( int i = 0; i < numOfForLoop; ++i )
{
// std::unique_lock ul(mtx);
//unique_lock can stop deadlock.
smph.release(); //smph counter ++
smph.acquire(); //smph counter --
// if ( i % 1000 == 1 ) std::cout << iThr << " : " << i << "\n";
//print out message can stop deadlock.
}
}
int main()
{
std::cout << "Start testing semaphore ..." << "\n\n";
std::vector<std::thread> thrf_TestSmphVec ( numOfThr );
for ( int iThr = 0; iThr < numOfThr; ++iThr )
{
thrf_TestSmphVec[iThr] = std::thread ( thrf_TestSmph, iThr );
}
for ( auto& thr : thrf_TestSmphVec )
{
if ( thr.joinable() )
thr.join();
}
std::cout << "Test is done." << "\n";
return 0;
}
Update: Found this bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104928
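For comparison, here is a minimal sketch of the semantics I expect from a counting semaphore, built on std::mutex and std::condition_variable. This is an illustration under my own assumptions, not libstdc++'s implementation. SimpleSemaphore and release_then_acquire are hypothetical names.

```cpp
#include <condition_variable>
#include <mutex>

// Illustrative only: NOT how std::counting_semaphore is implemented.
class SimpleSemaphore {
public:
    explicit SimpleSemaphore(long initial) : count_(initial) {}

    // Increment the counter and wake one waiting thread, if any.
    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++count_;
        }
        cv_.notify_one();
    }

    // Block until the counter is positive, then decrement it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }

    // Decrement only if the counter is positive; never blocks on the counter.
    bool try_acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (count_ == 0) return false;
        --count_;
        return true;
    }

    long value() {
        std::lock_guard<std::mutex> lock(mutex_);
        return count_;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    long count_;
};

// Mirrors the loop body of thrf_TestSmph for a single thread.
long release_then_acquire(SimpleSemaphore& sem, int iterations) {
    for (int i = 0; i < iterations; ++i) {
        sem.release();  // counter ++
        sem.acquire();  // counter --
    }
    return sem.value();
}
```

With this model, a thread that always releases before acquiring never finds the counter at 0 on its own account, which is why I expected the original program to run to completion.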
This is not really an answer.
I can reproduce the infinite blocking on my M1 MacBook Air when the code is compiled with gcc or clang and libstdc++. Printing a message didn't prevent the blocking. When it is compiled with clang and libc++, the program finishes normally.
I noticed this piece of code and comment in my included header include/c++/11/bits/semaphore_base.h of libstdc++:
_GLIBCXX_ALWAYS_INLINE void
_M_release(ptrdiff_t __update) noexcept
{
if (0 < __atomic_impl::fetch_add(&_M_counter, __update, memory_order_release))
return;
if (__update > 1)
__atomic_notify_address_bare(&_M_counter, true);
else
__atomic_notify_address_bare(&_M_counter, true);
// FIXME - Figure out why this does not wake a waiting thread
// __atomic_notify_address_bare(&_M_counter, false);
}
Then I changed the first return to __atomic_notify_address_bare(&_M_counter, true);, and the problem seems to disappear.
That comment was added in this commit:
_GLIBCXX_ALWAYS_INLINE void
_M_release(ptrdiff_t __update) noexcept
{
if (0 < __atomic_impl::fetch_add(&_M_counter, __update, memory_order_release))
return;
if (__update > 1)
__atomic_notify_address_bare(&_M_counter, true);
else
- __atomic_notify_address_bare(&_M_counter, false);
+ __atomic_notify_address_bare(&_M_counter, true);
+ // FIXME - Figure out why this does not wake a waiting thread
+ // __atomic_notify_address_bare(&_M_counter, false);
It seems that the developers were aware of the problem, but their short-term solution didn't fix it.
After a lot of experimentation with std::counting_semaphore::acquire(), I noticed that it blocks when two threads call std::counting_semaphore::acquire() within a very short time interval. This seems to freeze the internal counter of the std::counting_semaphore, so std::counting_semaphore::release() cannot increase it correctly. In this situation, the next std::counting_semaphore::acquire() blocks because the internal counter is frozen. This happened in many of my heavily threaded experiments with std::counting_semaphore::acquire() on my system. The example code in my question is the simplest one that reproduces the problem.
I guess it is a kind of collision issue on my system. Based on this assumption, I tried to use back-off to bypass the problem.
I use while(!std::counting_semaphore::try_acquire_for(1ns)){} as a substitute for std::counting_semaphore::acquire(), because std::counting_semaphore::try_acquire_for() returns false when it cannot decrease the internal counter.
It works well so far, even when I increase const int numOfThr {2} to {100'000}.
The example code is as follows:
#include <iostream>
#include <thread>
#include <atomic>
#include <vector>
#include <mutex>
#include <semaphore>
using namespace std::literals;
std::mutex mtx;
const int numOfThr {2};
const int numOfForLoop {1000};
const int max_smph {numOfThr* numOfForLoop * 2};
std::counting_semaphore<max_smph> smph {numOfThr + 1};
void thrf_TestSmph ( const int iThr )
{
for ( int i = 0; i < numOfForLoop; ++i )
{
smph.release(); //smph counter ++
while ( !smph.try_acquire_for ( 1ns ) ) {} //smph counter --
//don't use smph.acquire() directly, it blocks easily.
}
}
int main()
{
std::cout << "Start testing semaphore ..." << "\n\n";
std::vector<std::thread> thrf_TestSmphVec ( numOfThr );
for ( int iThr = 0; iThr < numOfThr; ++iThr )
{
thrf_TestSmphVec[iThr] = std::thread ( thrf_TestSmph, iThr );
}
for ( auto& thr : thrf_TestSmphVec )
{
if ( thr.joinable() )
thr.join();
}
std::cout << "Test is done." << "\n";
return 0;
}
I want to find out how much time a certain function in my C++ program takes to execute on Linux. Afterwards, I want to make a speed comparison. I saw several time functions but ended up with this one from Boost.Chrono:
process_user_cpu_clock, captures user-CPU time spent by the current process
Now, I am not clear: if I use the above function, will I get only the time the CPU spent on that function?
Secondly, I could not find any example of using the above function. Can anyone please help me with how to use it?
P.S.: Right now, I am using std::chrono::system_clock::now() to get the time in seconds, but this gives me different results every time due to varying CPU load.
It is very easy to do since C++11: use std::chrono::high_resolution_clock from the <chrono> header. (The 150ms literal in the example additionally requires C++14.)
Use it like so:
#include <chrono>
/* Only needed for the sake of this example. */
#include <iostream>
#include <thread>
void long_operation()
{
/* Simulating a long, heavy operation. */
using namespace std::chrono_literals;
std::this_thread::sleep_for(150ms);
}
int main()
{
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
auto t1 = high_resolution_clock::now();
long_operation();
auto t2 = high_resolution_clock::now();
/* Getting number of milliseconds as an integer. */
auto ms_int = duration_cast<milliseconds>(t2 - t1);
/* Getting number of milliseconds as a double. */
duration<double, std::milli> ms_double = t2 - t1;
std::cout << ms_int.count() << "ms\n";
std::cout << ms_double.count() << "ms\n";
return 0;
}
This will measure the duration of the function long_operation.
Possible output:
150ms
150.068ms
Working example: https://godbolt.org/z/oe5cMd
Here's a function that will measure the execution time of any function passed as argument:
#include <chrono>
#include <utility>
typedef std::chrono::high_resolution_clock::time_point TimeVar;
#define duration(a) std::chrono::duration_cast<std::chrono::nanoseconds>(a).count()
#define timeNow() std::chrono::high_resolution_clock::now()
template<typename F, typename... Args>
double funcTime(F func, Args&&... args){
TimeVar t1=timeNow();
func(std::forward<Args>(args)...);
return duration(timeNow()-t1);
}
Example usage:
#include <iostream>
#include <algorithm>
#include <string>
typedef std::string String;
//first test function doing something
int countCharInString(String s, char delim){
int count=0;
String::size_type pos = s.find_first_of(delim);
while ((pos = s.find_first_of(delim, pos)) != String::npos){
count++;pos++;
}
return count;
}
//second test function doing the same thing in different way
int countWithAlgorithm(String s, char delim){
return std::count(s.begin(),s.end(),delim);
}
int main(){
std::cout<<"norm: "<<funcTime(countCharInString,"precision=10",'=')<<"\n";
std::cout<<"algo: "<<funcTime(countWithAlgorithm,"precision=10",'=');
return 0;
}
Output:
norm: 15555
algo: 2976
In Scott Meyers' book I found an example of a universal generic lambda expression that can be used to measure function execution time (C++14):
auto timeFuncInvocation =
[](auto&& func, auto&&... params) {
// get time before function invocation
const auto& start = std::chrono::high_resolution_clock::now();
// function invocation using perfect forwarding
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
// get time after function invocation
const auto& stop = std::chrono::high_resolution_clock::now();
return stop - start;
};
The problem is that you are measuring only one execution, so the results can vary widely. To get a reliable result you should measure a large number of executions.
According to Andrei Alexandrescu's lecture at the code::dive 2015 conference, Writing Fast Code I:
Measured time: tm = t + tq + tn + to
where:
tm - measured (observed) time
t - the actual time of interest
tq - time added by quantization noise
tn - time added by various sources of noise
to - overhead time (measuring, looping, calling functions)
According to what he said later in the lecture, you should take the minimum over this large number of executions as your result.
I encourage you to look at the lecture in which he explains why.
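As a rough sketch of the "take the minimum" advice (my own illustration, not code from the lecture), the helper below runs a callable several times and keeps the smallest observed time. min_time_of and sample_workload are hypothetical names, and I use steady_clock because it is monotonic.

```cpp
#include <chrono>
#include <thread>

// Run func `runs` times and return the shortest observed duration.
// If runs <= 0 nothing is measured and nanoseconds::max() is returned.
template <typename F>
std::chrono::nanoseconds min_time_of(F&& func, int runs) {
    using clock = std::chrono::steady_clock;
    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        func();
        const auto elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start);
        if (elapsed < best) best = elapsed;  // keep the least noisy sample
    }
    return best;
}

// Hypothetical workload used only to exercise the helper.
inline void sample_workload() {
    std::this_thread::sleep_for(std::chrono::milliseconds(2));
}
```

The idea is that the noise terms (tq, tn, to) can only add time, so the smallest sample is the one closest to t.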
Also, there is a very good library from Google: https://github.com/google/benchmark.
This library is very simple to use and powerful. You can check out some lectures by Chandler Carruth on YouTube where he uses this library in practice, for example CppCon 2017: Chandler Carruth “Going Nowhere Faster”.
Example usage:
#include <iostream>
#include <chrono>
#include <utility>
#include <vector>
auto timeFuncInvocation =
[](auto&& func, auto&&... params) {
// get time before function invocation
const auto start = std::chrono::high_resolution_clock::now();
// function invocation using perfect forwarding
for(auto i = 0; i < 100000/*largeNumber*/; ++i) {
std::forward<decltype(func)>(func)(std::forward<decltype(params)>(params)...);
}
// get time after function invocation
const auto stop = std::chrono::high_resolution_clock::now();
// average time per invocation
return (stop - start)/100000/*largeNumber*/;
};
void f(std::vector<int>& vec) {
vec.push_back(1);
}
void f2(std::vector<int>& vec) {
vec.emplace_back(1);
}
int main()
{
std::vector<int> vec;
std::vector<int> vec2;
std::cout << timeFuncInvocation(f, vec).count() << std::endl;
std::cout << timeFuncInvocation(f2, vec2).count() << std::endl;
std::vector<int> vec3;
vec3.reserve(100000);
std::vector<int> vec4;
vec4.reserve(100000);
std::cout << timeFuncInvocation(f, vec3).count() << std::endl;
std::cout << timeFuncInvocation(f2, vec4).count() << std::endl;
return 0;
}
EDIT:
Of course, you always need to remember that your compiler may or may not optimize something out. Tools like perf can be useful in such cases.
A simple program to find the execution time of a function:
#include <iostream>
#include <ctime> // time_t
#include <cstdio>
void function()
{
for(long int i=0;i<1000000000;i++)
{
// do nothing
}
}
int main()
{
time_t begin,end; // time_t is a datatype to store time values.
time (&begin); // note time before execution
function();
time (&end); // note time after execution
double difference = difftime (end,begin);
printf ("time taken for function() %.2lf seconds.\n", difference );
return 0;
}
Easy way for older C++, or C:
#include <time.h> // includes clock_t and CLOCKS_PER_SEC
int main() {
clock_t start, end;
start = clock();
// ...code to measure...
end = clock();
double duration_sec = double(end-start)/CLOCKS_PER_SEC;
return 0;
}
Timing precision in seconds is 1.0/CLOCKS_PER_SEC
#include <iostream>
#include <chrono>
void function()
{
// code here;
}
int main()
{
auto t1 = std::chrono::high_resolution_clock::now();
function();
auto t2 = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>( t2 - t1 ).count();
std::cout << duration << "\n";
return 0;
}
This worked for me.
Note:
The high_resolution_clock is not implemented consistently across different standard library implementations, and its use should be avoided. It is often just an alias for std::chrono::steady_clock or std::chrono::system_clock, but which one it is depends on the library or configuration. When it is a system_clock, it is not monotonic (e.g., the time can go backwards).
For example, for gcc's libstdc++ it is system_clock, for MSVC it is steady_clock, and for clang's libc++ it depends on configuration.
Generally one should just use std::chrono::steady_clock or std::chrono::system_clock directly instead of std::chrono::high_resolution_clock: use steady_clock for duration measurements, and system_clock for wall-clock time.
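If you want to check what high_resolution_clock is on your own toolchain, a small sketch like the following can tell you. high_resolution_clock_alias is a hypothetical helper name; the only library facilities used are std::is_same and the is_steady member of the standard clocks.

```cpp
#include <chrono>
#include <string>
#include <type_traits>

// Report which standard clock high_resolution_clock is an alias for,
// if any, on the current standard library.
inline std::string high_resolution_clock_alias() {
    using namespace std::chrono;
    if (std::is_same<high_resolution_clock, steady_clock>::value) {
        return "steady_clock";
    }
    if (std::is_same<high_resolution_clock, system_clock>::value) {
        return "system_clock";
    }
    return "distinct type";
}

// steady_clock is required to be monotonic, so this always holds.
static_assert(std::chrono::steady_clock::is_steady,
              "steady_clock must be monotonic");
```

The result differs between standard libraries, which is exactly why using steady_clock directly is the more predictable choice for measuring durations.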
Here is an excellent header only class template to measure the elapsed time of a function or any code block:
#ifndef EXECUTION_TIMER_H
#define EXECUTION_TIMER_H
#include <chrono>
#include <iostream>
#include <sstream>
#include <type_traits>
template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
std::chrono::high_resolution_clock,
std::chrono::steady_clock>;
private:
const Clock::time_point mStart = Clock::now();
public:
ExecutionTimer() = default;
~ExecutionTimer() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Destructor Elapsed: "
<< std::chrono::duration_cast<Resolution>( end - mStart ).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
inline void stop() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Stop Elapsed: "
<< std::chrono::duration_cast<Resolution>(end - mStart).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
}; // ExecutionTimer
#endif // EXECUTION_TIMER_H
Here are some uses of it:
int main() {
{ // empty scope to display ExecutionTimer's destructor's message
// displayed in milliseconds
ExecutionTimer<std::chrono::milliseconds> timer;
// function or code block here
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::microseconds> timer;
// code block here...
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::nanoseconds> timer;
// code block here...
timer.stop();
}
{ // same as above
ExecutionTimer<std::chrono::seconds> timer;
// code block here...
timer.stop();
}
return 0;
}
Since the class is a template, we can easily specify how we want the time to be measured and displayed. This is a very handy utility class template for benchmarking and is very easy to use.
If you want to save time and lines of code, you can make measuring the function execution time a one-line macro:
a) Implement a time measuring class as already suggested above (here is my implementation for Android):
class MeasureExecutionTime{
private:
const std::chrono::steady_clock::time_point begin;
const std::string caller;
public:
MeasureExecutionTime(const std::string& caller):caller(caller),begin(std::chrono::steady_clock::now()){}
~MeasureExecutionTime(){
const auto duration=std::chrono::steady_clock::now()-begin;
LOGD("ExecutionTime")<<"For "<<caller<<" is "<<std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()<<"ms";
}
};
b) Add a convenient macro that uses the current function name as TAG (using a macro here is important, otherwise __FUNCTION__ would evaluate to MeasureExecutionTime instead of the function you want to measure):
#ifndef MEASURE_FUNCTION_EXECUTION_TIME
#define MEASURE_FUNCTION_EXECUTION_TIME const MeasureExecutionTime measureExecutionTime(__FUNCTION__);
#endif
c) Write your macro at the beginning of the function you want to measure. Example:
void DecodeMJPEGtoANativeWindowBuffer(uvc_frame_t* frame_mjpeg,const ANativeWindow_Buffer& nativeWindowBuffer){
MEASURE_FUNCTION_EXECUTION_TIME
// Do some time-critical stuff
}
Which will result in the following output:
ExecutionTime: For DecodeMJPEGtoANativeWindowBuffer is 54ms
Note that this (like all the other suggested solutions) measures the time between when your function was called and when it returned, not necessarily the time your CPU spent executing the function. However, if you don't give the scheduler any chance to suspend your running code by calling sleep() or similar, there is usually little difference between the two.
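To illustrate the difference between elapsed time and CPU time, here is a small sketch that measures both around the same call. Timing, time_wall_and_cpu and sleepy_workload are hypothetical names. It assumes a POSIX-like platform where std::clock() reports processor time used by the process; some platforms (notably the Microsoft C runtime) report elapsed time from std::clock() instead, so treat the CPU figure as platform dependent.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Timing {
    double wall_ms;  // elapsed real time (steady_clock)
    double cpu_ms;   // processor time as reported by std::clock()
};

// Measure one call of func with both clocks.
template <typename F>
Timing time_wall_and_cpu(F&& func) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    func();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Timing t;
    t.wall_ms = std::chrono::duration<double, std::milli>(wall_end - wall_start).count();
    t.cpu_ms = 1000.0 * static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return t;
}

// Hypothetical workload: sleeps, so it uses almost no CPU time.
inline void sleepy_workload() {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
}
```

For a function that mostly sleeps or waits, the wall-clock figure is around the sleep duration while the CPU figure stays close to zero on such platforms.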
It is very easy to do in C++11.
We can use std::chrono::high_resolution_clock from the <chrono> header.
We can write a function that prints the execution time in a more readable form.
For example, finding all the prime numbers between 1 and 100 million takes approximately 1 minute and 40 seconds.
So the execution time gets printed as:
Execution Time: 1 Minutes, 40 Seconds, 715 MicroSeconds, 715000 NanoSeconds
The code is here:
#include <iostream>
#include <chrono>
#include <string>
using namespace std;
using namespace std::chrono;
typedef high_resolution_clock Clock;
typedef Clock::time_point ClockTime;
void findPrime(long n, string file);
void printExecutionTime(ClockTime start_time, ClockTime end_time);
int main()
{
long n = long(1E+8); // N = 100 million
ClockTime start_time = Clock::now();
// Write all the prime numbers from 1 to N to the file "prime.txt"
findPrime(n, "C:\\prime.txt");
ClockTime end_time = Clock::now();
printExecutionTime(start_time, end_time);
}
void printExecutionTime(ClockTime start_time, ClockTime end_time)
{
auto execution_time_ns = duration_cast<nanoseconds>(end_time - start_time).count();
auto execution_time_ms = duration_cast<microseconds>(end_time - start_time).count();
auto execution_time_sec = duration_cast<seconds>(end_time - start_time).count();
auto execution_time_min = duration_cast<minutes>(end_time - start_time).count();
auto execution_time_hour = duration_cast<hours>(end_time - start_time).count();
cout << "\nExecution Time: ";
if(execution_time_hour > 0)
cout << "" << execution_time_hour << " Hours, ";
if(execution_time_min > 0)
cout << "" << execution_time_min % 60 << " Minutes, ";
if(execution_time_sec > 0)
cout << "" << execution_time_sec % 60 << " Seconds, ";
if(execution_time_ms > 0)
cout << "" << execution_time_ms % long(1E+3) << " MicroSeconds, ";
if(execution_time_ns > 0)
cout << "" << execution_time_ns % long(1E+6) << " NanoSeconds, ";
}
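A variation on the same idea, as a sketch: instead of casting the whole interval to each unit and taking remainders, peel off one unit at a time so that the fields never overlap. format_duration is a hypothetical name and the output format is my own choice.

```cpp
#include <chrono>
#include <string>

// Split a duration into non-overlapping fields, largest unit first.
inline std::string format_duration(std::chrono::nanoseconds total) {
    using namespace std::chrono;
    const auto h = duration_cast<hours>(total);
    total -= h;  // remove whole hours before computing minutes
    const auto m = duration_cast<minutes>(total);
    total -= m;
    const auto s = duration_cast<seconds>(total);
    total -= s;
    const auto ms = duration_cast<milliseconds>(total);
    total -= ms;
    const auto us = duration_cast<microseconds>(total);
    total -= us;  // what remains is less than one microsecond
    return std::to_string(h.count()) + " h " +
           std::to_string(m.count()) + " min " +
           std::to_string(s.count()) + " s " +
           std::to_string(ms.count()) + " ms " +
           std::to_string(us.count()) + " us " +
           std::to_string(total.count()) + " ns";
}
```

For example, 100 seconds plus 715 microseconds comes out as "0 h 1 min 40 s 0 ms 715 us 0 ns".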
I recommend using steady_clock, which is guaranteed to be monotonic, unlike high_resolution_clock.
#include <iostream>
#include <chrono>
using namespace std;
unsigned int stopwatch()
{
static auto start_time = chrono::steady_clock::now();
auto end_time = chrono::steady_clock::now();
auto delta = chrono::duration_cast<chrono::microseconds>(end_time - start_time);
start_time = end_time;
return delta.count();
}
int main() {
stopwatch(); //Start stopwatch
std::cout << "Hello World!\n";
cout << stopwatch() << endl; //Time to execute last line
for (int i=0; i<1000000; i++)
string s = "ASDFAD";
cout << stopwatch() << endl; //Time to execute for loop
}
Output:
Hello World!
62
163514
Since none of the provided answers are very accurate or give reproducible results, I decided to add a link to my code, which has sub-nanosecond precision and scientific statistics.
Note that this only works for code that takes a (very) short time to run (i.e. a few clock cycles to a few thousand). If the code runs so long that it is likely to be interrupted by some (heh) interrupt, then it is clearly not possible to give a reproducible and accurate result. The consequence is that the measurement never finishes: it keeps measuring until it is statistically 99.9% sure it has the right answer, which never happens on a machine with other processes running when the code takes too long.
https://github.com/CarloWood/cwds/blob/master/benchmark.h#L40
You can have a simple class which can be used for this kind of measurement.
class duration_printer {
public:
duration_printer() : m_start(std::chrono::high_resolution_clock::now()) {}
~duration_printer() {
using namespace std::chrono;
high_resolution_clock::time_point end = high_resolution_clock::now();
duration<double> dur = duration_cast<duration<double>>(end - m_start);
std::cout << dur.count() << " seconds" << std::endl;
}
private:
std::chrono::high_resolution_clock::time_point m_start;
};
The only thing you need to do is create an object at the beginning of the function:
void veryLongExecutingFunction() {
duration_printer dp;
for(int i = 0; i < 100000; ++i) std::cout << "Hello world" << std::endl;
}
int main() {
veryLongExecutingFunction();
return 0;
}
and that's it. The class can be modified to fit your requirements.
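One possible modification, sketched under my own assumptions: instead of printing in the destructor, write the elapsed time into a variable supplied by the caller, so the result can be logged, compared or checked in code. ScopedTimer and timed_nap are hypothetical names, and I use steady_clock here because it is monotonic.

```cpp
#include <chrono>
#include <thread>

// Stores the elapsed seconds into a caller-provided variable on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(double& out_seconds)
        : out_(out_seconds), start_(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        const auto elapsed = std::chrono::steady_clock::now() - start_;
        out_ = std::chrono::duration<double>(elapsed).count();
    }

    // Non-copyable: two timers writing to the same variable would be confusing.
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    double& out_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical usage: time a short sleep and return the measured seconds.
inline double timed_nap() {
    double seconds = 0.0;
    {
        ScopedTimer timer(seconds);
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }  // timer is destroyed here and fills in `seconds`
    return seconds;
}
```

The referenced variable must outlive the timer, which the inner scope in timed_nap guarantees.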
C++11 cleaned up version of Jahid's response:
#include <chrono>
#include <iostream>
#include <thread>
#include <utility>
void long_operation(int ms)
{
/* Simulating a long, heavy operation. */
std::this_thread::sleep_for(std::chrono::milliseconds(ms));
}
template<typename F, typename... Args>
double funcTime(F func, Args&&... args){
std::chrono::high_resolution_clock::time_point t1 =
std::chrono::high_resolution_clock::now();
func(std::forward<Args>(args)...);
return std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::high_resolution_clock::now()-t1).count();
}
int main()
{
std::cout<<"expect 150: "<<funcTime(long_operation,150)<<"\n";
return 0;
}
This is a very basic timer class which you can expand on depending on your needs. I wanted something straightforward which can be used cleanly in code. You can mess with it at coding ground with this link: http://tpcg.io/nd47hFqr.
#include <chrono>
class local_timer {
private:
std::chrono::high_resolution_clock::time_point start_time;
std::chrono::high_resolution_clock::time_point stop_time;
std::chrono::high_resolution_clock::time_point stop_time_temp;
std::chrono::microseconds most_recent_duration_usec_chrono;
double most_recent_duration_sec;
public:
local_timer() {
};
~local_timer() {
};
void start() {
this->start_time = std::chrono::high_resolution_clock::now();
};
void stop() {
this->stop_time = std::chrono::high_resolution_clock::now();
};
double get_time_now() {
this->stop_time_temp = std::chrono::high_resolution_clock::now();
this->most_recent_duration_usec_chrono = std::chrono::duration_cast<std::chrono::microseconds>(stop_time_temp-start_time);
this->most_recent_duration_sec = (long double)most_recent_duration_usec_chrono.count()/1000000;
return this->most_recent_duration_sec;
};
double get_duration() {
this->most_recent_duration_usec_chrono = std::chrono::duration_cast<std::chrono::microseconds>(stop_time-start_time);
this->most_recent_duration_sec = (long double)most_recent_duration_usec_chrono.count()/1000000;
return this->most_recent_duration_sec;
};
};
It can be used like this:
#include <iostream>
#include "timer.hpp" // if the class is kept in an .hpp file in the same folder; it can also be defined before your main function
int main() {
//create two timers
local_timer timer1 = local_timer();
local_timer timer2 = local_timer();
//set start time for timer1
timer1.start();
//wait 1 second
while(timer1.get_time_now() < 1.0) {
}
//save time
timer1.stop();
//print time
std::cout << timer1.get_duration() << " seconds, timer 1\n" << std::endl;
timer2.start();
for(long int i = 0; i < 100000000; i++) {
//do something
if(i%1000000 == 0) {
//return time since loop started
std::cout << timer2.get_time_now() << " seconds, timer 2\n"<< std::endl;
}
}
return 0;
}
The Windows function QueryThreadCycleTime() gives the number of "CPU clock cycles" used by a given thread. The Windows manual boldly states
Do not attempt to convert the CPU clock cycles returned by QueryThreadCycleTime to elapsed time.
I would like to do exactly this, for most Intel and AMD x86_64 CPUs.
It doesn't need to be very accurate, because you can't expect perfection from cycle counters like RDTSC anyway.
I just need some kludgey way to get the time factor seconds / QueryThreadCycleTime for the CPUs.
First, I imagine that QueryThreadCycleTime uses RDTSC internally.
I imagine that on some CPUs, constant rate TSC is used, so changing the actual clock rate (e.g. with variable-frequency CPU power management) doesn't affect the time/TSC factor.
On other CPUs, that rate might change, so I'd have to query this factor periodically.
Why do I need this?
Before anyone cites the XY Problem, I should note that I'm not really interested in alternative solutions.
This is because I have two hard requirements for profiling that no other method meets.
It should only measure thread time, so sleep(1) should not return 1 second, but a busy loop lasting 1 second should. In other words, the profiler should not say that a task ran for 10ms when its thread was only active for 1ms. This is the reason I cannot use QueryPerformanceCounter().
It needs a precision better than 1/64 seconds, which is the precision given by GetThreadTimes(). The tasks I'm profiling might run for only a few microseconds.
Minimal reproducible example
As requested by @Ted Lyngmo, the goal is to implement computeFactor().
#include <stdio.h>
#include <windows.h>
double computeFactor();
int main() {
ULONG64 start, end;
QueryThreadCycleTime(GetCurrentThread(), &start);
// insert task here, such as an actual workload or sleep(1)
QueryThreadCycleTime(GetCurrentThread(), &end);
printf("%lf\n", (end - start) * computeFactor());
return 0;
}
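To show the kind of kludge I have in mind for computeFactor(), here is a sketch that calibrates against std::chrono::steady_clock with a busy loop. The cycle source is passed in as a callable so the idea can be tried anywhere. On Windows it would be a small wrapper around QueryThreadCycleTime(GetCurrentThread(), &cycles). seconds_per_cycle and fake_cycles_1ghz are hypothetical names, and the fake source simply pretends the counter ticks once per nanosecond. The result is only an estimate: if the thread is preempted during the window, or the counter rate changes, the factor is off.

```cpp
#include <chrono>
#include <cstdint>

// Estimate seconds per counter tick by spinning for `window` and
// comparing the counter delta with the steady_clock delta.
// Returns 0.0 if the counter did not advance.
template <typename CycleSource>
double seconds_per_cycle(CycleSource read_cycles, std::chrono::milliseconds window) {
    using clock = std::chrono::steady_clock;
    const std::uint64_t c0 = read_cycles();
    const auto t0 = clock::now();
    auto t1 = t0;
    while (t1 - t0 < window) {  // busy loop so the thread keeps running
        t1 = clock::now();
    }
    const std::uint64_t c1 = read_cycles();
    if (c1 <= c0) return 0.0;
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return seconds / static_cast<double>(c1 - c0);
}

// Stand-in counter for trying this out: one tick per nanosecond,
// i.e. a pretend 1 GHz cycle counter (assumption for illustration).
inline std::uint64_t fake_cycles_1ghz() {
    using namespace std::chrono;
    return static_cast<std::uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch()).count());
}
```

With the fake source the factor comes out close to 1e-9. With a real cycle counter the calibration would have to be repeated periodically on CPUs where the counter rate is not constant.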
Do not attempt to convert the CPU clock cycles returned by QueryThreadCycleTime to elapsed time.
I would like to do exactly this.
Your wish is obviously Denied!
A workaround, that will do something close to what you want, could be to create one thread with a steady_clock that samples QueryThreadCycleTime and/or GetThreadTimes at some specified frequency. Here's an example of how it could be done with a sampling thread taking a sample of both once every second.
#include <algorithm>
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <iomanip>
#include <thread>
#include <vector>
#include <Windows.h>
using namespace std::literals::chrono_literals;
struct FTs_t {
FILETIME CreationTime, ExitTime, KernelTime, UserTime;
ULONG64 CycleTime;
};
using Sample = std::vector<FTs_t>;
std::ostream& operator<<(std::ostream& os, const FILETIME& ft) {
std::uint64_t bft = (std::uint64_t(ft.dwHighDateTime) << 32) + ft.dwLowDateTime;
return os << bft;
}
std::ostream& operator<<(std::ostream& os, const Sample& smp) {
size_t tno = 0;
for (const auto& fts : smp) {
os << " tno:" << std::setw(3) << tno << std::setw(10) << fts.KernelTime
<< std::setw(10) << fts.UserTime << std::setw(16) << fts.CycleTime << "\n";
++tno;
}
return os;
}
// the sampling thread
void ft_sampler(std::atomic<bool>& quit, std::vector<std::thread>& threads, std::vector<Sample>& samples) {
auto tp = std::chrono::steady_clock::now(); // for steady sampling
FTs_t fts;
while (quit == false) {
Sample s;
s.reserve(threads.size());
for (auto& th : threads) {
if (QueryThreadCycleTime(th.native_handle(), &fts.CycleTime) &&
GetThreadTimes(th.native_handle(), &fts.CreationTime,
&fts.ExitTime, &fts.KernelTime, &fts.UserTime)) {
s.push_back(fts);
}
}
samples.emplace_back(std::move(s));
tp += 1s; // add a second since we last sampled and sleep until that time_point
std::this_thread::sleep_until(tp);
}
}
// a worker thread
void worker(std::atomic <bool>& quit, size_t payload) {
volatile std::uintmax_t x = 0;
while (quit == false) {
for (size_t i = 0; i < payload; ++i) ++x;
std::this_thread::sleep_for(1us);
}
}
int main() {
std::atomic<bool> quit_sampling{false}, quit_working{false};
std::vector<std::thread> threads;
std::vector<Sample> samples;
size_t max_threads = std::thread::hardware_concurrency() > 1 ? std::thread::hardware_concurrency() - 1 : 1;
// start some worker threads
for (size_t tno = 0; tno < max_threads; ++tno) {
threads.emplace_back(std::thread(&worker, std::ref(quit_working), (tno + 100) * 100000));
}
// start the sampling thread
auto smplr = std::thread(&ft_sampler, std::ref(quit_sampling), std::ref(threads), std::ref(samples));
// let the threads work for some time
std::this_thread::sleep_for(10s);
quit_sampling = true;
smplr.join();
quit_working = true;
for (auto& th : threads) th.join();
std::cout << "Took " << samples.size() << " samples\n";
size_t s = 0;
for (const auto& smp : samples) {
std::cout << "Sample " << s << ":\n" << smp << "\n";
++s;
}
}
I am wondering if one can use
char Flag;
instead of
std::atomic_flag Flag;
I know that C++ fundamental types, generally speaking, are not atomic/thread safe (that's why std::atomic exists), but I also know that the size of char is always 1 byte, and I cannot imagine a situation in which a read/write of a single byte is not thread safe.
Also, I cannot find anything about the thread safety of a char variable.
Consider following example (Win32, Visual Studio 2015, Release, optimisation disabled):
// Can be any integral type
using mytype_t = unsigned char;
#define VAL1 static_cast<mytype_t>(0x5555555555555555ULL)
#define VAL2 static_cast<mytype_t>(0xAAAAAAAAAAAAAAAAULL)
#define CYCLES (50 * 1000 * 1000)
void runtest_mytype()
{
// Just to stop checking thread
std::atomic_bool Stop = false;
const auto Started = ::GetTickCount64();
auto Val = VAL1;
std::thread threadCheck([&]()
{
// Checking values
while (!Stop)
{
const auto Val_ = Val;
if (VAL1 != Val_ && VAL2 != Val_)
std::cout << "Error! " << std::to_string(Val_) << std::endl;
}
});
std::thread thread1([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Val = VAL1;
});
std::thread thread2([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Val = VAL2;
});
thread1.join();
thread2.join();
std::cout << "mytype: finished in " << std::to_string(::GetTickCount64() - Started) << " ms" << std::endl;
Stop = true;
threadCheck.join();
}
void runtest_atomic_flag()
{
std::atomic_flag Flag = ATOMIC_FLAG_INIT; // default-constructed state is unspecified before C++20
const auto Started = ::GetTickCount64();
std::thread thread1([&]()
{
for (auto I = 0; I < CYCLES; ++I)
auto Val_ = Flag.test_and_set(std::memory_order_acquire);
});
std::thread thread2([&]()
{
for (auto I = 0; I < CYCLES; ++I)
Flag.clear(std::memory_order_release);
});
thread1.join();
thread2.join();
std::cout << "atomic_flag: finished in " << std::to_string(::GetTickCount64() - Started) << " ms" << std::endl;
}
int _tmain(int argc, _TCHAR* argv[])
{
runtest_mytype();
runtest_atomic_flag();
std::getchar();
return 0;
}
It outputs something like this (during several tests, the values did not change much):
mytype: finished in 312 ms
atomic_flag: finished in 1669 ms
So char works significantly faster than atomic_flag, which can play a role in some cases.
But I am far from thinking that std::atomic_flag was invented in vain.
Please, help me figure it out. At least, can I use char, when I use only Windows, only Visual Studio, and I don't have to think about compatibility?
Changes to atomic variables are also guaranteed to become visible to other threads.
When using a plain char, the modification might not be visible to other threads (that's why some people wrongly use volatile for synchronization).
Btw, modifying a char concurrently without synchronization is a data race, which is undefined behavior.
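As a minimal sketch of a well-defined alternative, the two-writer test from the question can be written with std::atomic<unsigned char> and relaxed ordering. On mainstream platforms this type is typically lock-free and a relaxed store is typically no more expensive than a plain store, but that is an assumption to verify on the target, not a guarantee. race_two_writers is a hypothetical name chosen for this example.

```cpp
#include <atomic>
#include <thread>

// Two threads repeatedly store different values into the same atomic byte.
// Unlike the plain-char version, this has no data race, so the behavior is
// defined: the final value is one of the two stored values.
unsigned char race_two_writers(int cycles) {
    std::atomic<unsigned char> val{0x55};

    std::thread writer1([&] {
        for (int i = 0; i < cycles; ++i)
            val.store(0x55, std::memory_order_relaxed);
    });
    std::thread writer2([&] {
        for (int i = 0; i < cycles; ++i)
            val.store(0xAA, std::memory_order_relaxed);
    });

    writer1.join();
    writer2.join();
    return val.load(std::memory_order_relaxed);
}
```

Relaxed ordering only guarantees atomicity of the byte itself. If the flag is meant to publish other data, release/acquire ordering is needed, as in the atomic_flag test above.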
I am interested in timing the execution time of a free function or a member function (template or not). Call TheFunc the function in question, its call being
TheFunc(/*parameters*/);
or
ReturnType ret = TheFunc(/*parameters*/);
Of course I could wrap these function calls as follows :
double duration = 0.0 ;
std::clock_t start = std::clock();
TheFunc(/*parameters*/);
duration = static_cast<double>(std::clock() - start) / static_cast<double>(CLOCKS_PER_SEC);
or
double duration = 0.0 ;
std::clock_t start = std::clock();
ReturnType ret = TheFunc(/*parameters*/);
duration = static_cast<double>(std::clock() - start) / static_cast<double>(CLOCKS_PER_SEC);
but I would like to do something more elegant than this, namely (and from now on I will stick to the void return type) as follows :
Timer thetimer ;
double duration = 0.0;
thetimer(*TheFunc)(/*parameters*/, duration);
where Timer is some timing class that I would like to design and that would allow me to write the previous code, in such a way that after the execution of the last line of the previous code the double duration will contain the execution time of
TheFunc(/*parameters*/);
but I don't see how to do this, nor if the syntax/solution I aim for is optimal...
With a variadic template, you can do:
#include <ctime>
#include <utility>
template <typename F, typename ... Ts>
double Time_function(F&& f, Ts&&...args)
{
std::clock_t start = std::clock();
std::forward<F>(f)(std::forward<Ts>(args)...);
return static_cast<double>(std::clock() - start) / static_cast<double>(CLOCKS_PER_SEC);
}
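The question also mentions functions that return a value. A possible variant, sketched here under a few assumptions, returns the result together with the elapsed time. It swaps std::clock (CPU time) for std::chrono::steady_clock (wall time), needs C++14 for return type deduction, and does not work for functions returning void. time_function_with_result is a hypothetical name.

```cpp
#include <chrono>
#include <utility>

// Calls f(args...) and returns a pair {result, elapsed seconds}.
// The result is stored by value, so a returned reference is copied.
template <typename F, typename... Ts>
auto time_function_with_result(F&& f, Ts&&... args)
{
    const auto start = std::chrono::steady_clock::now();
    auto result = std::forward<F>(f)(std::forward<Ts>(args)...);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return std::make_pair(std::move(result), elapsed.count());
}
```

A call such as time_function_with_result(TheFunc, a, b) then yields the return value in .first and the duration in seconds in .second.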
I really like boost::timer::auto_cpu_timer, and when I cannot use boost I simply hack my own:
#include <cmath>
#include <string>
#include <chrono>
#include <iostream>
class AutoProfiler {
public:
AutoProfiler(std::string name)
: m_name(std::move(name)),
m_beg(std::chrono::high_resolution_clock::now()) { }
~AutoProfiler() {
auto end = std::chrono::high_resolution_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::microseconds>(end - m_beg);
std::cout << m_name << " : " << dur.count() << " musec\n";
}
private:
std::string m_name;
std::chrono::time_point<std::chrono::high_resolution_clock> m_beg;
};
void foo(std::size_t N) {
long double x {1.234e5};
for(std::size_t k = 0; k < N; k++) {
x += std::sqrt(x);
}
}
int main() {
{
AutoProfiler p("N = 10");
foo(10);
}
{
AutoProfiler p("N = 1,000,000");
foo(1000000);
}
}
This timer works thanks to RAII. When you construct the object within a scope, it stores the current timepoint. When you leave the scope (that is, at the corresponding }), the destructor first takes a second timepoint, then calculates the elapsed duration (here converted to microseconds), and finally prints it to the screen.
Of course, boost::timer::auto_cpu_timer is much more elaborate than my simple implementation, but I often find my implementation more than sufficient for my purposes.
Sample run on my computer:
$ g++ -o example example.cpp -std=c++14 -Wall -Wextra
$ ./example
N = 10 : 0 musec
N = 1,000,000 : 10103 musec
EDIT
I really liked the implementation suggested by @Jarod42. I modified it a little bit to offer some flexibility on the desired "units" of the output.
It defaults to returning the number of elapsed microseconds (a signed integer of type std::chrono::microseconds::rep), but you can request the output to be in any duration of your choice.
I think it is a more flexible approach than the one I suggested earlier because now I can do other stuff like taking the measurements and storing them in a container (as I do in the example).
Thanks to @Jarod42 for the inspiration.
#include <cmath>
#include <string>
#include <chrono>
#include <numeric>  // std::accumulate
#include <vector>
#include <iostream>
template<typename Duration = std::chrono::microseconds,
typename F,
typename ... Args>
typename Duration::rep profile(F&& fun, Args&&... args) {
const auto beg = std::chrono::high_resolution_clock::now();
std::forward<F>(fun)(std::forward<Args>(args)...);
const auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<Duration>(end - beg).count();
}
void foo(std::size_t N) {
long double x {1.234e5};
for(std::size_t k = 0; k < N; k++) {
x += std::sqrt(x);
}
}
int main() {
std::size_t N { 1000000 };
// profile in default mode (microseconds)
std::cout << "foo(1E6) takes " << profile(foo, N) << " microseconds" << std::endl;
// profile in custom mode (e.g, milliseconds)
std::cout << "foo(1E6) takes " << profile<std::chrono::milliseconds>(foo, N) << " milliseconds" << std::endl;
// To create an average of `M` runs we can create a vector to hold
// `M` values of the type used by the clock representation, fill
// them with the samples, and take the average
std::size_t M {100};
std::vector<std::chrono::microseconds::rep> samples(M); // profile() defaults to microseconds
for(auto & sample : samples) {
sample = profile(foo, N);
}
// use a long double initial value so the sum is not accumulated in an int
auto avg = std::accumulate(samples.begin(), samples.end(), 0.0L) / static_cast<long double>(M);
std::cout << "average of " << M << " runs: " << avg << " microseconds" << std::endl;
}
Output (compiled with g++ example.cpp -std=c++14 -Wall -Wextra -O3):
foo(1E6) takes 10073 microseconds
foo(1E6) takes 10 milliseconds
average of 100 runs: 10068.6 microseconds
You can do it the MATLAB way. It's very old-school, but simple is often good:
tic();
a = f(c);
toc(); //print to stdout, or
auto elapsed = toc(); //store in variable
tic() and toc() can operate on a global variable. If that's not sufficient, you can create local variables with some macro magic:
tic(A);
a = f(c);
toc(A);
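A minimal sketch of the global-variable variant could look like the following. It assumes std::chrono::steady_clock as the time source and has toc() return the elapsed seconds rather than print them, so the caller decides what to do with the value. The variable name g_tic_start is made up for this example, and the global state makes it unsuitable for concurrent use from several threads.

```cpp
#include <chrono>

namespace {
// Start point shared by tic() and toc(); not thread-safe.
std::chrono::steady_clock::time_point g_tic_start;
}

// Remember the current time as the start of the measured interval.
inline void tic() {
    g_tic_start = std::chrono::steady_clock::now();
}

// Return the seconds elapsed since the most recent call to tic().
inline double toc() {
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - g_tic_start;
    return elapsed.count();
}
```

Usage then follows the pattern above: call tic(), run the code to be measured, and read auto elapsed = toc();.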
I'm a fan of using RAII wrappers for this type of stuff.
The following example is a little verbose but it's more flexible in that it works with arbitrary scopes instead of being limited to a single function call:
#include <ctime>
#include <map>
#include <string>
class timing_context {
public:
std::map<std::string, double> timings;
};
class timer {
public:
timer(timing_context& ctx, std::string name)
: ctx(ctx),
name(name),
start(std::clock()) {}
~timer() {
ctx.timings[name] = static_cast<double>(std::clock() - start) / static_cast<double>(CLOCKS_PER_SEC);
}
timing_context& ctx;
std::string name;
std::clock_t start;
};
timing_context ctx;
int main() {
timer total(ctx, "total");
{
timer t(ctx, "foo");
// Do foo
}
{
timer t(ctx, "bar");
// Do bar
}
// Access ctx.timings
}
The downside is that you might end up with a lot of scopes that only serve to destroy the timing object.
This might or might not satisfy your requirements as your request was a little vague but it illustrates how using RAII semantics can make for some really nice reusable and clean code. It can probably be modified to look a lot better too!