How to handle caching while timing an operating in C++ in linux - c++

I have to time the clock_gettime() function for estimating and profiling other operations, and it's for homework so I cant use a profiler and have to write my own code.
The way I'm doing it is like below:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID,&begin);
for(int i=0;i<=n;i++)
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
cout<<(end.tv_nsec-begin.tv_nsec)/n; //time per clock_gettime()
The problem is that when n=100, output is: 370.63 ns, when n=100000, output: 330 ns, when n=1000000, output: 260 ns, n=10000000, output: 55 ns, ....keeps reducing.
I understand that this is happening because of instruction caching, but I don't know how to handle this in profiling. Because for example when I estimate the time for a function call using gettime, how would I know how much time that gettime used for itself?
Would taking a weighted mean of all these values be a good idea? (I can run the operation I want the same number of times, take weighted mean of that, subtract weighted mean of gettime and get a good estimate of the operation irrespective of caching?)
Any suggestions are welcome.
Thank you in advance.

When you compute the time difference: (end.tv_nsec-begin.tv_nsec)/n
You are only taking into account the nanoseconds part of the elapsed time. You must also take the seconds into account since the tv_nsec field only reflects the fractional part of a second:
int64_t end_ns = ((int64_t)end.tv_sec * 1000000000) + end.tv_nsec;
int64_t begin_ns = ((int64_t)begin.tv_sec * 1000000000) + begin.tv_nsec;
int64_t elapsed_ns = end_ns - begin_ns;
Actually, with your current code you should sometimes get negative results when the nanoseconds part of end has wrapped around and is less than begin's nanoseconds part.
Fix that, and you'll be able to observe much more consistent results.
Edit: for the sake of completeness, here's the code I used for my tests, which gets me very consistent results (between 280 and 300ns per call, whatever number of iterations I use):
int main() {
const int loops = 100000000;
struct timespec begin;
struct timespec end;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &begin);
for(int i = 0; i < loops; i++)
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
int64_t end_ns = ((int64_t)end.tv_sec * 1000000000) + end.tv_nsec;
int64_t begin_ns = ((int64_t)begin.tv_sec * 1000000000) + begin.tv_nsec;
int64_t elapsed_ns = end_ns - begin_ns;
int64_t ns_per_call = elapsed_ns / loops;
std::cout << ns_per_call << std::endl;
}

Related

Code performance strick measurement

I’m creating performance framework tool for measuring individual message processing time in CentOS 7. I reserved one CPU for this task with isolcpus kernel option and I run it using taskset.
Ok, now the problem. I trying to measure the max processing time among several messages. The processing time is <= 1000ns, but when I run many iterations I get very high results (> 10000ns).
Here I created some simple code which does nothing interesting but shows the problem. Depending on the number of iterations i can get results like:
max: 84 min: 23 -> for 1000 iterations
max: 68540 min: 11 -> for 100000000 iterations
I'm trying to understand from where this difference came from? I tried to run this with real-time scheduling with highest priority. Is there some way to prevent that?
#include <iostream>
#include <limits>
#include <time.h>
const unsigned long long SEC = 1000L*1000L*1000L;
inline int64_t time_difference( const timespec &start,
const timespec &stop ) {
return ( (stop.tv_sec * SEC - start.tv_sec * SEC) +
(stop.tv_nsec - start.tv_nsec));
}
int main()
{
timespec start, stop;
int64_t max = 0, min = std::numeric_limits<int64_t>::max();
for(int i = 0; i < 100000000; ++i){
clock_gettime(CLOCK_REALTIME, &start);
clock_gettime(CLOCK_REALTIME, &stop);
int64_t time = time_difference(start, stop);
max = std::max(max, time);
min = std::min(min, time);
}
std::cout << "max: " << max << " min: " << min << std::endl;
}
You can't really reduce jitter to zero even with isolcpus, since you still have at least the following:
1) Interrupts delivered to your CPU (you may be able to reduce this my messing with irq affinity - but probably not to zero).
2) Clock timer interrupts are still scheduled for your process and may do a variable amount of work on the kernel side.
3) The CPU itself may pause briefly for P-state or C-state transitions, or other reasons (e.g., to let voltage levels settle after turning on AVX circuitry, etc).
Let us check the documentation...
Isolation will be effected for userspace processes - kernel threads may still get scheduled on the isolcpus isolated CPUs.
So it seems that there is no guarantee of perfect isolation, at least not from the kernel.

Performance difference between float and double in x86 and x86_64

A while ago I heard that some compilers use SSE2 extensions for floating point operations for x86_64 architecture, so I used this simple code to determine the performance difference between them.
I disabled Intel SpeedStep technology via BIOS and system load was approximately equal for my tests. I am using GCC 4.8 on OpenSuSE 64 bit.
I am writing a program with a lot of FPU operations and I would like to know if this test is valid or not?
And any information about the performance difference between float and double under each architecture is appreciated.
Code :
#include <iostream>
#include <sys/time.h>
#include <vector>
#include <cstdlib>
using namespace std;
int main()
{
timeval t1, t2;
double elapsedTime;
double TotalTime = 0;
for(int j=0 ; j < 100 ; j++)
{
// start timer
gettimeofday(&t1, NULL);
vector<float> RealVec;
float temp;
for (int i = 0; i < 1000000; i++)
{
temp = static_cast <float> (rand()) / (static_cast <float> (RAND_MAX));
RealVec.push_back(temp);
}
for (int i = 0; i < 1000000; i++)
{
RealVec[i] = (RealVec[i]*2-435.345345)/15.75;
}
// stop timer
gettimeofday(&t2, NULL);
elapsedTime = (t2.tv_sec - t1.tv_sec) * 1000.0; // sec to ms
elapsedTime += (t2.tv_usec - t1.tv_usec) / 1000.0; // us to ms
TotalTime = TotalTime + elapsedTime;
}
cout << TotalTime/100 << " ms.\n";
return 0;
}
and result :
32 Bit Double
157.781 ms.
151.994 ms.
152.244 ms.
32 Bit Float
149.896 ms.
148.489 ms.
161.086 ms.
64 Bit Double
110.125 ms.
111.612 ms.
113.818 ms.
64 Bit Float
110.393 ms.
106.778 ms.
107.833 ms.
You're really not measuring much; perhaps just the degree of compiler
optimization. In order for the measurements to be valid, you really
have to do something with the results, or the compiler can optimize out
all, or the major part of your tests. What I woule do is 1) initialize
the vector, 2) get the start time (probably using clock, since that
only takes CPU time into account), 3) execute the second loop a 100 (or
more... enough to last a couple of seconds, at least) times, 4) get the
end time, and finally, 5) output the sum of the elements in the vector.
With regards to the differences you may find: independently of the
floating point processors, the 64 bit machine has more general registers
for the compiler to play with. This could have an enormous impact.
Unless you look at the generated assembler, you just can't know.
Not really valid. You're basically testing the performance of the random number generator.
Also, you're not trying to enforce SSE2 SIMD operation, so you can't really claim this compares anything SSE-related.
Valid in what sense?
Measure actual usage, with your actual code.
Some artificial test suite probably won't help you assess the performance characteristics.
You can use a typedef, then change the actual underlying type with a flick of a switch.

Interesting processing time results

I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5 but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the amount of clock steps required for the application to run. When I first ran the script, it said that it took 3770000 clock steps, but every time that I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
double avg (int arr[], int beg, int end)
{
int nums = end - beg + 1;
double sum = 0.0;
for(int i = beg; i <= end; i++)
{
sum += arr[i];
}
//for(int p = 0; p < nums*10000; p ++){}
return sum/nums;
}
int main (int argc, char *argv[])
{
int nums = 1000000;//atoi(argv[0]);
int myarray[nums];
double timediff;
//printf("Arg is: %d\n",argv[0]);
printf("Nums is: %d\n",nums);
clock_t begin_time = clock();
for(int i = 0; i < nums; i++)
{
myarray[i] = i+1;
}
double average = avg(myarray, 0, nums - 1);
printf("%f\n",average);
clock_t end_time = clock();
timediff = (double) difftime(end_time, begin_time);
printf("Time to Average: %f\n", timediff);
return 0;
}
You are measuring the I/O operation too (printf), that depends on external factors and might be affecting the run time. Also, clock() might not be as precise as needed to measure such a small task - look into higher resolution functions such as clock_get_time(). Even then, other processes might affect the run time by generating page fault interrupts and occupying the memory BUS, etc. So this kind of fluctuation is not abnormal at all.
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in .13 seconds, so I find it suspicous that you claim it takes 2 seconds to run it once on your computer.
Your code incorrectly uses the difftime, which may display incorrect output as well if clock says time did pass.
I'd guess that the first timing you got was with different code than that posted in this question, becase I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string (4) make an OS call (slow) that prints said string in the right language/font/colo/etc, which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.

precise time measurement

I'm using time.h in C++ to measure the timing of a function.
clock_t t = clock();
someFunction();
printf("\nTime taken: %.4fs\n", (float)(clock() - t)/CLOCKS_PER_SEC);
however, I'm always getting the time taken as 0.0000. clock() and t when printed separately, have the same value. I would like to know if there is way to measure the time precisely (maybe in the order of nanoseconds) in C++ . I'm using VS2010.
C++11 introduced the chrono API, you can use to get nanoseconds :
auto begin = std::chrono::high_resolution_clock::now();
// code to benchmark
auto end = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count() << "ns" << std::endl;
For a more relevant value it is good to run the function several times and compute the average :
auto begin = std::chrono::high_resolution_clock::now();
uint32_t iterations = 10000;
for(uint32_t i = 0; i < iterations; ++i)
{
// code to benchmark
}
auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count();
std::cout << duration << "ns total, average : " << duration / iterations << "ns." << std::endl;
But remember the for loop and assigning begin and end var use some CPU time too.
I usually use the QueryPerformanceCounter function.
example:
LARGE_INTEGER frequency; // ticks per second
LARGE_INTEGER t1, t2; // ticks
double elapsedTime;
// get ticks per second
QueryPerformanceFrequency(&frequency);
// start timer
QueryPerformanceCounter(&t1);
// do something
...
// stop timer
QueryPerformanceCounter(&t2);
// compute and print the elapsed time in millisec
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
The following text, that i completely agree with, is quoted from Optimizing software in C++ (good reading for any C++ programmer) -
The time measurements may require a very high resolution if time
intervals are short. In Windows, you can use the
GetTickCount or
QueryPerformanceCounter functions for millisecond resolution. A much
higher resolution can be obtained with the time stamp counter in the
CPU, which counts at the CPU clock frequency.
There is a problem that "the clock frequency may vary dynamically and that
measurements are unstable due to interrupts and task switches."
In C or C++ I usually do like below. If it still fails you may consider using rtdsc functions
struct timeval time;
gettimeofday(&time, NULL); // Start Time
long totalTime = (time.tv_sec * 1000) + (time.tv_usec / 1000);
//........ call your functions here
gettimeofday(&time, NULL); //END-TIME
totalTime = (((time.tv_sec * 1000) + (time.tv_usec / 1000)) - totalTime);

Calculating time of execution with time() function

I was given the following HomeWork assignment,
Write a program to test on your computer how long it takes to do
nlogn, n2, n5, 2n, and n! additions for n=5, 10, 15, 20.
I have written a piece of code but all the time I am getting the time of execution 0. Can anyone help me out with it? Thanks
#include <iostream>
#include <cmath>
#include <ctime>
using namespace std;
int main()
{
float n=20;
time_t start, end, diff;
start = time (NULL);
cout<<(n*log(n))*(n*n)*(pow(n,5))*(pow(2,n))<<endl;
end= time(NULL);
diff = difftime (end,start);
cout <<diff<<endl;
return 0;
}
better than time() with second-precision is to use a milliseconds precision.
a portable way is e.g.
int main(){
clock_t start, end;
double msecs;
start = clock();
/* any stuff here ... */
end = clock();
msecs = ((double) (end - start)) * 1000 / CLOCKS_PER_SEC;
return 0;
}
Execute each calculation thousands of times, in a loop, so that you can overcome the low resolution of time and obtain meaningful results. Remember to divide by the number of iterations when reporting results.
This is not particularly accurate but that probably does not matter for this assignment.
At least on Unix-like systems, time() only gives you 1-second granularity, so it's not useful for timing things that take a very short amount of time (unless you execute them many times in a loop). Take a look at the gettimeofday() function, which gives you the current time with microsecond resolution. Or consider using clock(), which measure CPU time rather than wall-clock time.
Your code is executed too fast to be detected by time function returning the number of seconds elapsed since 00:00 hours, Jan 1, 1970 UTC.
Try to use this piece of code:
inline long getCurrentTime() {
timeb timebstr;
ftime( &timebstr );
return (long)(timebstr.time)*1000 + timebstr.millitm;
}
To use it you have to include sys/timeb.h.
Actually the better practice is to repeat your calculations in the loop to get more precise results.
You will probably have to find a more precise platform-specific timer such as the Windows High Performance Timer. You may also (very likely) find that your compiler optimizes or removes almost all of your code.