Profiling C++ threads with clock()

Profiling C++ threads with clock() - c++

I am trying to measure how gcc threads perform on my system. I've written some very simple measurement code which is something like this...
start = clock();
for(int i=0; i < thread_iters; i++) {
pthread_mutex_lock(dataMutex);
data++;
pthread_mutex_unlock(dataMutex);
}
end = clock();
I do the usual subtract and div by CLOCKS_PER_SEC to get an elapsed time of about 2 seconds for 100000000 iterations. I then change the profiling code slightly so I am measuring the individual time for each mutex_lock/unlock call.
for(int i=0; i < thread_iters; i++) {
start1 = clock();
pthread_mutex_lock(dataMutex);
end1 = clock();
lock_time+=(end1-start1);
data++;
start2 = clock();
pthread_mutex_unlock(dataMutex);
end2 = clock();
unlock_time+=(end2-start2)
}
The times I get for the same number of iterations are
lock: ~27 seconds
unlock: ~27 seconds
I get why the total time for the program increases, more timer calls in the loop. But the time for the system calls should still add up to less than 2 seconds. Can someone help me figure out where I went wrong? Thanks!

The clock calls also measure the time it takes to call clock and return from it. This introduces a bias into the measurement. I.e. somewhere deep inside the clock function it takes a sample. But then before running your code, it has to return from deep inside clock. And then when you take the end measurement, before that time sample can be taken, clock has to be called and control has to pass somewhere deep inside that function where it actually obtains the time. So you're including all that overhead as part of the measurement.
You must find out how much time elapses between consecutive clock calls (by taking some samples over many pairs of clock calls to get an accurate average). That gives you a baseline bias: how much time does it take to execute nothing at all between two clock samples. You then carefully subtract your bias from the measurements.
But calls to clock can disturb the performance so that you're not getting an accurate answer. Calls to the kernel to get the clock are disturbing your L1 cache and instruction cache. For fine grained measurements like this, it is better to drop down to inline assembly and read a cycle counting register from the CPU.
clock is best used as you have it in your first example: take samples around something that executes for many iterations, and then divide by the number of iterations to estimate the single-iteration time.

Related

Measuing CPU clock speed

I am trying to measure the speed of the CPU.I am not sure how much my method is accurate. Basicly, I tried an empty for loop with values like UINT_MAX but the program terminated quickly so I tried UINT_MAX * 3 and so on...
Then I realized that the compiler is optimizing away the loop, so I added a volatile variable to prevent optimization. The following program takes 1.5 seconds approximately to finish. I want to know how accurate is this algorithm for measuring the clock speed. Also,how do I know how many core's are being involved in the process?
#include <iostream>
#include <limits.h>
#include <time.h>
using namespace std;
int main(void)
{
volatile int v_obj = 0;
unsigned long A, B = 0, C = UINT32_MAX;
clock_t t1, t2;
t1 = clock();
for (A = 0; A < C; A++) {
(void)v_obj;
}
t2 = clock();
std::cout << (double)(t2 - t1) / CLOCKS_PER_SEC << std::endl;
double t = (double)(t2 - t1) / CLOCKS_PER_SEC;
unsigned long clock_speed = (unsigned long)(C / t);
std::cout << "Clock speed : " << clock_speed << std::endl;
return 0;
}

This doesn't measure clock speed at all, it measures how many loop iterations can be done per second. There's no rule that says one iteration will run per clock cycle. It may be the case, and you may have actually found it to be the case - certainly with optimized code and a reasonable CPU, a useless loop shouldn't run much slower than that. It could run at half speed though, some processors are not able to retire more than 1 taken branch every 2 cycles. And on esoteric targets, all bets are off.
So no, this doesn't measure clock cycles, except accidentally. In general it's extremely hard to get an empirical clock speed (you can ask your OS what it thinks the maximum clock speed and current clock speed are, see below), because
If you measure how much wall clock time a loop takes, you must know (at least approximately) the number of cycles per iteration. That's a bad enough problem in assembly, requiring fairly detailed knowledge of the expected microarchitectures (maybe a long chain of dependent instructions that each could only reasonably take 1 cycle, like add eax, 1? a long enough chain that differences in the test/branch throughput become small enough to ignore), so obviously anything you do there is not portable and will have assumptions built into it may become false (actually there is an other answer on SO that does this and assumes that addps has a latency of 3, which it doesn't anymore on Skylake, and didn't have on old AMDs). In C? Give up now. The compiler might be rolling some random code generator, and relying on it to be reasonable is like doing the same with a bear. Guessing the number of cycles per iteration of code you neither control nor even know is just folly. If it's just on your own machine you can check the code, but then you could just check the clock speed manually too so..
If you measure the number of clock cycles elapsed in a given amount of wall clock time.. but this is tricky. Because rdtsc doesn't measure clock cycles (not anymore), and nothing else gets any closer. You can measure something, but with frequency scaling and turbo, it generally won't be actual clock cycles. You can get actual clock cycles from a performance counter, but you can't do that from user mode. Obviously any way you try to do this is not portable, because you can't portably ask for the number of elapsed clock cycles.
So if you're doing this for actual information and not just to mess around, you should probably just ask the OS. For Windows, query WMI for CurrentClockSpeed or MaxClockSpeed, whichever one you want. On Linux there's stuff in /proc/cpuinfo. Still not portable, but then, no solution is.
As for
how do I know how many core's are being involved in the process?
1. Of course your thread may migrate between cores, but since you only have one thread, it's on only one core at any time.

A good optimizer may remove the loop, since
for (A = 0; A < C; A++) {
(void)v_obj;
}
has the same effect on the program state as;
A = C;
So the optimizer is entirely free to unwind your loop.
So you cannot measure CPU speed this way as it depends on the compiler as much as it does on the computer (not to mention the variable clock speed and multicore architecture already mentioned)

std::clock() and CLOCKS_PER_SEC

I want to use a timer function in my program. Following the example at How to use clock() in C++, my code is:
int main()
{
std::clock_t start = std::clock();
while (true)
{
double time = (std::clock() - start) / (double)CLOCKS_PER_SEC;
std::cout << time << std::endl;
}
return 0;
}
On running this, it begins to print out numbers. However, it takes about 15 seconds for that number to reach 1. Why does it not take 1 second for the printed number to reach 1?

Actually it is a combination of what has been posted. Basically as your program is running in a tight loop, CPU time should increase just as fast as wall clock time.
But since your program is writing to stdout and the terminal has a limited buffer space, your program will block whenever that buffer is full until the terminal had enough time to print more of the generated output.
This of course is much more expensive CPU wise than generating the strings from the clock values, therefore most of the CPU time will be spent in the terminal and graphics driver. It seems like your system takes about 14 times the CPU power to output that timestamps than generating the strings to write.

std::clock returns cpu time, not wall time. That means the number of cpu-seconds used, not the time elapsed. If your program uses only 20% of the CPU, then the cpu-seconds will only increase at 20% of the speed of wall seconds.

std::clock
Returns the approximate processor time used by the process since the beginning of an implementation-defined era related to the program's execution. To convert result value to seconds divide it by CLOCKS_PER_SEC.
So it will not return a second until the program uses an actual second of the cpu time.
If you want to deal with actual time I suggest you use the clocks provided by <chrono> like std::steady_clock or std::high_resolution_clock

Function execution time

I want to find out the execution time of my function written in C++ on Linux. I found lots of posts regarding that. I tried all the methods mentioned in this link Timer Methods for calculating time. Following are the results of execution time of my function:
time() : 0 seconds
clock() : 0.01 seconds
gettimeofday() : 0.002869 seconds
rdtsc() : 0.00262336 seconds
clock_gettime() : 0.00672151 seconds
chrono : 0.002841 seconds
Please help me which method is reliable in its readings as all the results differ in their readings. I read that your OS is switching between different tasks so the readings cannot be expected to be very accurate. Is there a way that I can just calculate the time CPU spends on my function. I heard about the use of profiling tool but have not found any example for just a function yet. Please guide me.

Read time(7).
For various reasons (and depending upon your actual hardware, i.e. your motherboard) the time is not as accurate as you want it to be.
So, add some loop repeating your function many times, or change its input so that it runs longer. Ensure that the execution time of the total program (given by time(1)...) is at least a second approximately (if possible, ensure that you have at least half a second of CPU time).
For profiling, compile and link with  g++ -Wall -pg -O1 then use gprof(1) (there are more sophisticated ways to profile, e.g. oprofile ...).
See also this answer to a very similar question (by the same Zara).

If you are doing simple test, trying to find out which implementation is better, then any method is okay. For example:
const int MAX = 10000; // times to execute the function
void benchmark0() {
auto begin = std::chrono::steady_clock::now();
for (int i = 0; i < MAX; ++i)
method0();
auto now = std::chrono::steady_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(now - begin);
std::cout << "Cost of method0() is " << elapsed .count() << " milliseconds" << std::endl;
}
void benchmark1() { /* almost the same as benchmark0, but calls method1 */ }
int main() {
benchmark0();
benchmark0();
benchmark1();
benchmark1();
}
You might have noticed benchmark0 and benchmark1 has been called consecutively twice: because there will be cache of CPU, I/O ..., you might want to get rid of the performance gain/lost due to caching, but measure the pure execution time.
Of course, you can also use g++ to profile the program.

Why is the CPU time different with every execution of this program?

I have a hard time understanding processor time. The result of this program:
#include <iostream>
#include <chrono>
// the function f() does some time-consuming work
void f()
{
volatile long double d;
int size = 10000;
for(int n=0; n<size; ++n)
for(int m=0; m<size; ++m)
d = n*m;
}
int main()
{
std::clock_t start = std::clock();
f();
std::clock_t end = std::clock();
std::cout << "CPU time used: "
<< (end - start)
<< "\n";
}
Seems to randomly fluctuate between 210 000, 220 000 and 230 000. At first I was amazed, why these discrete values. Then I found out that std::clock() returns only approximate processor time. So probably the value returned by std::clock() is rounded to a multiple of 10 000. This would also explain why the maximum difference between the CPU times is 20 000 (10 000 == rounding error by the first call to std::clock() and 10 000 by the second).
But if I change to int size = 40000; in the body of f(), I get fluctuations in the ranges of 3 400 000 to 3 500 000 which cannot be explained by rounding.
From what I read about the clock rate, on Wikipedia:
The CPU requires a fixed number of clock ticks (or clock cycles) to
execute each instruction. The faster the clock, the more instructions
the CPU can execute per second.
That is, if the program is deterministic (which I hope mine is), the CPU time needed to finish should be:
Always the same
Slightly higher than the number of instructions carried out
My experiments show neither, since my program needs to carry out at least 3 * size * size instructions. Could you please explain what I am doing wrong?

First, the statement you quote from Wikipedia is simply false.
It might have been true 20 years ago (but not always, even
then), but it is totally false today. There are many things
which can affect your timings:
The first: if you're running on Windows, clock is broken,
and totally unreliable. It returns the difference in elapsed
time, not CPU time. And elapsed time depends on all sorts of
other things the processor might be doing.
Beyond that: things like cache misses have a very significant
impact on time. And whether a particular piece of data is in
the cache or not can depend on whether your program was
interrupted between the last access and this one.
In general, anything less than 10% can easily be due to the
caching issues. And I've seen differences of a factor of 10
under Windows, depending on whether there was a build running or
not.

You don't state what hardware you're running the binary on.
Does it have an interrupt driven CPU ?
Is it a multitasking operating system ?
You're mistaking the cycle time of the CPU (the CPU clock as Wikipedia refers to) with the time it takes to execute a particular piece of code from start to end and all the other stuff the poor CPU has to do at the same time.
Also ... is all your executing code in level 1 cache, or is some in level 2 or in main memory, or on disk ... what about the next time you run it ?

Your program is not deterministic, because it uses library and system functions which are not deterministic.
As a particular example, when you allocate memory this is virtual memory, which must be mapped to physical memory. Although this is a system call, running kernel code, it takes place on your thread and will count against your clock time. How long it takes to do this will depend on what the overall memory allocation situation is.

The CPU time is indeed "fixed" for a given set of circumstances. However, in a modern computer, there are other things happening in the system, which interferes with the execution of your code. It may be that caches are being wiped out when your email software wakes up to check if there is any new emails for you, or when the HP printer software checks for updates, or when the antivirus software decides to run for a little bit checking if your memory contains any viruses, etc, etc, etc, etc.
Part of this is also caused by the problem that CPU time accounting in any system is not 100% accurate - it works on "clock-ticks" and similar things, so the time used by for example an interrupt to service a network packet coming in, or the hard disk servicing interrupt, or the timer interrupt to say "another millisecond ticked by" these all account into "the currently running process". Assuming this is Windows, there is a further "feature", and that is that for historical and other reasons, std::clock() simply returns the time now, not actually the time used by your process. So for exampple:
t = clock();
cin >> x;
t = clock() - t;
would leave t with a time of 10 seconds if it took ten seconds to input the value of x, even though 9.999 of those ten seconds were spent in the idle process, not your program.

How to calculate and print clock_t time roughly

I am timing how long it takes to do three different types of searches, sequential, recursive binary, and iterative binary. I have those in place, and it does iterate through and finish the search. My problem is that when I time them all, I get 0 for all of them every time, even if I make an array of 100,000, and I have it search for something not in the array. If I set a break point in the search it obviously makes the time longer, and it gives me a reasonable time that I can work with. But otherwise it is always 0. Here is my code, it is similar for all three search timers.
clock_t recStart = clock();
mySearch.recursiveSearch(SEARCH_INT);
clock_t recEnd = clock();
clock_t recDiff = recEnd - recStart;
double recClockTime = (double)recDiff/(double)CLOCKS_PER_SEC;
cout << recClockTime << endl;
cout << CLOCKS_PER_SEC << endl;
cout << recClockTime << endl;
For the last two I get 1000 and 0.
Am I doing something wrong here? Or is it in my search Object?

clock() is not an accurate timer, and it just don't work well for timing short intervals.
C says clock returns the implementation’s best approximation to the processor time used by the program since the beginning of an implementation-defined era related only to the program invocation.
If between two successive clock calls you program takes less time than one unity of the clock function, you could get 0. POSIX clock defines the unity with CLOCKS_PER_SEC as 1000000 (unity is then 1 microsecond).
(http://pubs.opengroup.org/onlinepubs/009604499/functions/clock.html)
To measure clock cycles in x86/x64 you can use assembly to retreive the clock count of the CPU Time Stamp Counter register rdtsc. (which can be achieved by inline assembling?) Note that it returns the time stamp, not the number of seconds elapsed. So you need to retrieve the cpu frequency as well.
However, the best way to get accurate time in seconds depends on your platform.
To sum up, it's virtually impossible to achieve calculating and printing clock_t time in seconds accurately. You might want to see this on Stackoverflow to find a better approach (if accuracy is top priority).

clock() just doesn't have enough resolution - here is one good discussion/blog on that topic
http://www.guyrutenberg.com/2007/09/10/resolution-problems-in-clock/
I think two options either use clock_gettime or even better have you considered using OProfile or CodeAnalyst?
I personally prefer to use tools - OProfile is good. I have not used CodeAnalyst before - and then there is Valgrind and gprof.
If you insist on using clock_gettime - please check this out
http://www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js