Possible cause and solution for randomness in time measurements [duplicate] - c++

This question already has answers here:
Does the CPU clock time returned have to be exactly the same among runs?
(3 answers)
Closed 5 years ago.
I got the time measurement results below for repeated computations of a simple summation, on my Windows machine with a 3.2 GHz quad-core CPU and 24 GB RAM.
The code follows.
From the results, the summation takes less than 3 ms most of the time, but occasionally it can take 20 times longer. I can understand the large maximum, because the distribution of the time measurements is exponential-like, with a very long right tail.
But what I am not sure about is:
What is the cause of the randomness (variation)? Note that I ran the application while CPU usage was 2-4% and memory usage was 10%.
A possible solution for the randomness. Is there any way to avoid the rare maximum duration?
Results
Time Statistics (ms)
N : 10000
Minimum: 2.31406
Maximum: 64.7171
Mean : 2.43556
Std : 0.676273
M+Std  : 3.11184
Code:
#include "stdafx.h"
#include <Windows.h>
#include <cmath>     // sqrt
#include <iostream>

int main()
{
    LARGE_INTEGER t_start, t_end, Frequency;
    double tdiff, minx = 1e+307, maxx = -1e+307, meanx = 0, stdx = 0;
    int niter = 10000;
    for (int j = 0; j < niter; j++)
    {
        QueryPerformanceFrequency(&Frequency);
        QueryPerformanceCounter(&t_start);
        double s = 0;
        for (int i = 0; i < 1000000; i++) s += i;
        QueryPerformanceCounter(&t_end);
        tdiff = (double)(t_end.QuadPart - t_start.QuadPart) / (double)Frequency.QuadPart * 1000;
        minx = min(minx, tdiff);   // min/max macros come from <Windows.h>
        maxx = max(maxx, tdiff);
        meanx += tdiff;
        stdx += tdiff * tdiff;
        //std::cout << "Iteration: " << j << " Time (ms): " << tdiff << std::endl;
    }
    meanx /= (double)niter;
    stdx = sqrt((stdx - (double)niter * meanx * meanx) / (double)(niter - 1));
    std::cout << "Time Statistics (ms)" << std::endl << std::endl;
    std::cout << "N      : " << niter << std::endl;
    std::cout << "Minimum: " << minx << std::endl;
    std::cout << "Maximum: " << maxx << std::endl;
    std::cout << "Mean   : " << meanx << std::endl;
    std::cout << "Std    : " << stdx << std::endl;
    std::cout << "M+Std  : " << meanx + stdx << std::endl;
    return 0;
}

A general-purpose computing system has many tasks going on. At any moment, the system may have to respond to I/O interrupts (disk-drive completion notices, timer interrupts, network activity, …) and run various housekeeping tasks (background backups, checks for scheduled events, indexing of user files, …).
The times at which these occur are effectively random. Measuring execution time repeatedly and discarding outliers is a common technique.

Related

Iomanip setprecision() Method Isn't Working as It Should Only on the First Line, Why?

So I'm writing a program to measure the execution time of a function using clock(), and I used iomanip to format the output with nine decimal places.
This is the code that I am using:
#include <time.h>
#include <iomanip>
#include <iostream>   // std::cout
using namespace std;

void linearFunction(int input)
{
    for (int i = 0; i < input; i++)
    {
    }
}

void execution_time(int input)
{
    clock_t start_time, end_time;
    start_time = clock();
    linearFunction(input);
    end_time = clock();
    double time_taken = double(end_time - start_time) / double(CLOCKS_PER_SEC);
    cout << "Time taken by function for input = " << input << " is : " << fixed
         << time_taken << setprecision(9);
    cout << " sec " << endl;
}

int main()
{
    execution_time(10000);
    execution_time(100000);
    execution_time(1000000);
    execution_time(10000000);
    execution_time(100000000);
    execution_time(1000000000);
    return 0;
}
And the output shows:
Time taken by function for input = 10000 is : 0.000000 sec
Time taken by function for input = 100000 is : 0.001000000 sec
Time taken by function for input = 1000000 is : 0.002000000 sec
Time taken by function for input = 10000000 is : 0.038000000 sec
Time taken by function for input = 100000000 is : 0.316000000 sec
Time taken by function for input = 1000000000 is : 3.288000000 sec
As you can see, the first time I call the function, the output doesn't follow the setprecision(9) that I wrote. Why is this, and how can I solve it? Thank you in advance.
Look at the following line carefully:
cout << "Time taken by function for input = " << input << " is : " << fixed << time_taken << setprecision(9);
See? You are setting the precision after printing time_taken, so the first time around you don't see the effect of setprecision(). From the second call onwards, setprecision() has already been executed, so you get the desired decimal places.
So to fix this issue, move setprecision() before time_taken, like so:
cout << "Time taken by function for input = " << input << " is : " << fixed << setprecision(9) << time_taken;
...or you can do something like this:
cout.precision(9);
cout << "Time taken by function for input = " << input << " is : " << fixed << time_taken;
Also, consider not using the following line in your code:
using namespace std;
...as it's considered bad practice. Instead, use the std:: prefix each time, like this:
std::cout.precision(9);
std::cout << "Time taken by function for input = " << input << " is : " << std::fixed << time_taken;
For more information, look up why "using namespace std" is considered bad practice.

Estimate total runtime of C++ function by measuring one iteration

I have implemented a C++ method that calculates the maximum ULP error between an approximation and a reference function on a given interval. The approximation as well as the reference are computed as single-precision floating-point values. The method starts at the low bound of the interval and iterates over every representable single-precision value in the range.
Since the number of such values depends on the chosen range and can be very large, I would like to estimate the total runtime of this method and print it for the user.
I tried executing the comparison several times to measure the runtime of one iteration. My approach was to multiply the duration of one iteration by the total number of floats in the range. But obviously the execution time of one iteration is not constant; it depends on the number of iterations, so my estimated duration is not accurate at all. Maybe the total-runtime calculation could be adapted inside the main loop?
My question is: is there any other way to estimate the total runtime in this particular case?
Here is my code:
void FloatEvaluateMaxUlp(float (*testFunction)(float), float (*referenceFunction)(float), float lowBound, float highBound)
{
    /* initialization */
    float x = lowBound, output, output_ref;
    int ulp = 0;
    long long duration = 0, numberOfFloats = 0;
    /* calculate number of floats between lowBound and highBound
       (note: this type-punning violates strict aliasing; memcpy or C++20 std::bit_cast is safer) */
    numberOfFloats = *(int*)&highBound - *(int*)&lowBound;
    /* measure execution time of a sample of iterations */
    int iterationsToEstimateTime = 1000;
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterationsToEstimateTime; i++)
    {
        printProgressInteger(i + 1, iterationsToEstimateTime);
        output = testFunction(x);
        output_ref = referenceFunction(x);
        int ulp_local = FloatCompareULP(output, output_ref);
        if (abs(ulp_local) > abs(ulp))
            ulp = ulp_local;
        x = std::nextafter(x, highBound + 0.001f);
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    duration /= iterationsToEstimateTime;
    x = lowBound;
    /* output of estimated time */
    std::cout << std::endl << std::endl << " Number of floats: " << numberOfFloats << " Time per iteration: " << duration << " Estimated total time: " << numberOfFloats * duration << std::endl;
    std::cout << " Starting test in range [" << lowBound << "," << highBound << "]." << std::endl;
    long long count = 0;
    /* record start time */
    t1 = std::chrono::high_resolution_clock::now();
    for (; x < highBound; count++)
    {
        printProgressInteger(count, numberOfFloats);
        output = testFunction(x);
        output_ref = referenceFunction(x);
        int ulp_local = FloatCompareULP(output, output_ref);
        if (abs(ulp_local) > abs(ulp))
            ulp = ulp_local;
        x = std::nextafter(x, highBound + 0.001f);
    }
    /* record stop time and compute duration */
    t2 = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    /* result output */
    std::cout << std::endl << std::endl << std::endl << std::endl << "*********************************************************" << std::endl;
    std::cout << "                         RESULT                          " << std::endl;
    std::cout << "*********************************************************" << std::endl;
    std::cout << " Iterations: " << count << " Total execution time: " << duration << std::endl;
    std::cout << " Max ulp: " << ulp << std::endl;
    std::cout << "*********************************************************" << std::endl;
}

Threads slowing each other down

I have an expensive computation that I want to divide and distribute over a set of threads.
I dumbed my code down to a minimal example where this still happens.
In short:
I have N tasks that I want to divide among "Threads" threads.
Each task is the following simple function running a bunch of simple mathematical operations.
(In practice I verify asymmetric signatures here, but I excluded that for the sake of simplicity.)
while (i++ < 100000)
{
    for (int y = 0; y < 1000; y++)
    {
        sqrt(y);
    }
}
Running the above code with one thread results in 0.36 seconds per operation (the outermost for loop), and thus around 36 seconds of overall execution time.
Parallelization therefore seemed like the obvious way to speed it up. However, with two threads the per-operation time rises to 0.72 seconds, completely destroying any speedup.
Adding more threads usually results in increasingly worse performance.
I have an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz with 6 physical cores.
So I'd expect a performance boost at least when going from one to two threads. But in fact each operation slows down as the number of threads increases.
Am I doing something wrong?
Full code:
#include <pthread.h>
#include <atomic>
#include <chrono>
#include <cmath>
#include <ctime>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;
using namespace std::chrono_literals;

const size_t N = 100;
const size_t Threads = 1;
atomic_int counter(0);

struct ThreadData
{
    int index;
    int count;
    ThreadData(const int index, const int count) : index(index), count(count) {};
};

void *executeSlave(void *threadarg)
{
    struct ThreadData *my_data;
    my_data = static_cast<ThreadData *>(threadarg);
    for (int x = my_data->index; x < my_data->index + my_data->count; x++)
    {
        cout << "Thread: " << my_data->index << ": " << x << endl;
        clock_t start, end;
        start = clock();
        int i = 0;
        while (i++ < 100000)
        {
            for (int y = 0; y < 1000; y++)
            {
                sqrt(y);
            }
        }
        counter.fetch_add(1);
        end = clock();
        cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float)end - start) / CLOCKS_PER_SEC) << endl;
    }
    pthread_exit(NULL);
}

int main()
{
    clock_t start, end;
    start = clock();
    pthread_t threads[Threads];
    vector<ThreadData> td;
    td.reserve(Threads);
    int each = N / Threads;
    cout << each << endl;
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        td[x] = ThreadData(x * each, each);
        int rc = pthread_create(&threads[x], NULL, executeSlave, (void *)&td[x]);
        if (rc) {
            cout << "Error: unable to create thread, " << rc << endl;
            exit(-1);
        }
    }
    while (counter < N) {
        std::this_thread::sleep_for(10ms);
    }
    end = clock();
    cout << "Final:" << endl;
    cout << end - start << ':' << CLOCKS_PER_SEC << ':' << (((float)end - start) / CLOCKS_PER_SEC) << endl;
}
clock() returns approximate CPU time for the entire process.
The outermost loop does a fixed amount of work per iteration
int i = 0;
while (i++ < 100000)
{
    for (int y = 0; y < 1000; y++)
    {
        sqrt(y);
    }
}
Therefore, the process CPU time reported around this loop will be proportional to the number of running threads (each thread still takes the same amount of time, so the total is multiplied by the thread count).
Use std::chrono::steady_clock to measure wall-clock time instead. Note also that I/O such as std::cout takes a lot of wall-clock time and is unstable, so the measured total elapsed time will be skewed by the I/O inside the loop.
Some additional remarks:
The return value of sqrt() is never used; the compiler may eliminate the call entirely. It would be prudent to use the value in some way to be sure.
void *executeSlave() is declared to return a void* but never actually returns a value (it exits via pthread_exit). With std::thread it can simply be declared void if it returns nothing.
td.reserve(Threads) reserves memory but does not allocate objects. td[x] then accesses nonexistent objects (UB). Use td.emplace_back(x * each, each) instead of td[x] = ....
Not technically an issue, but it is recommended to use the standard C++ std::thread instead of pthread, for better portability.
With the following I'm seeing correct speedup proportional to the # of threads:
#include <string>
#include <iostream>
#include <vector>
#include <atomic>
#include <chrono>
#include <cmath>
#include <thread>
using namespace std;
using namespace std::chrono_literals;

const size_t N = 12;
const size_t Threads = 2;
std::atomic<int> counter(0);
std::atomic<int> xx{ 0 };

void executeSlave(int index, int count, int n)
{
    double sum = 0;
    for (int x = index; x < index + count; x++)
    {
        cout << "Thread: " << index << ": " << x << endl;
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; i++)
        {
            for (int y = 0; y < n; y++)
            {
                sum += sqrt(y);
            }
        }
        counter++;
        auto end = std::chrono::steady_clock::now();
        cout << 1e-6 * (end - start) / 1us << " s" << endl;
    }
    xx += (int)sum; // prevent optimization
}

int main()
{
    std::thread threads[Threads];
    int each = N / Threads;
    cout << each << endl;
    auto start = std::chrono::steady_clock::now();
    for (int x = 0; x < Threads; x++) {
        cout << "main() : creating thread, " << x << endl;
        threads[x] = std::thread(executeSlave, x * each, each, 100);
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    cout << "Final:" << endl;
    cout << 1e-6 * (end - start) / 1us << " s" << endl;
}

(opencv rc1) What causes Mat multiplication to be 20x slower than per-pixel multiplication?

// 700 ms
cv::Mat in(height, width, CV_8UC1);
in /= 4;
Replaced with
// 40 ms
cv::Mat in(height, width, CV_8UC1);
for (int y = 0; y < in.rows; ++y)
{
    unsigned char* ptr = in.data + y * in.step1();
    for (int x = 0; x < in.cols; ++x)
    {
        ptr[x] /= 4;
    }
}
What can cause such behavior? Is it due to OpenCV "promoting" Mat-with-Scalar multiplication to Mat-with-Mat multiplication, or is it a specific failed optimization for ARM? (NEON is enabled.)
This is a very old issue (I reported it a couple of years ago): many basic operations take extra time. Not just division but also addition, abs, etc. I don't know the real reason for this behavior. What is even weirder is that the operations that are supposed to take more time, like addWeighted, are actually very efficient. Try this one:
addWeighted(in, 1.0/4, in, 0, 0, in);
It performs multiple operations per pixel, yet it runs a few times faster than either the add function or the loop implementation.
Here is my report on bug tracker.
I tried the same, measuring CPU time instead.
#include <opencv2/opencv.hpp>
#include <ctime>
#include <iostream>

int main()
{
    clock_t startTime;
    clock_t endTime;
    int height = 1024;
    int width = 1024;
    // 700 ms
    cv::Mat in(height, width, CV_8UC1, cv::Scalar(255));
    std::cout << "value: " << (int)in.at<unsigned char>(0, 0) << std::endl;
    cv::Mat out(height, width, CV_8UC1);
    startTime = clock();
    out = in / 4;
    endTime = clock();
    std::cout << "1: " << (float)(endTime - startTime) / (float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)out.at<unsigned char>(0, 0) << std::endl;
    startTime = clock();
    in /= 4;
    endTime = clock();
    std::cout << "2: " << (float)(endTime - startTime) / (float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)in.at<unsigned char>(0, 0) << std::endl;
    // 40 ms
    cv::Mat in2(height, width, CV_8UC1, cv::Scalar(255));
    startTime = clock();
    for (int y = 0; y < in2.rows; ++y)
    {
        //unsigned char* ptr = in2.data + y*in2.step1();
        unsigned char* ptr = in2.ptr(y);
        for (int x = 0; x < in2.cols; ++x)
        {
            ptr[x] /= 4;
        }
    }
    endTime = clock();   // stop the clock before the I/O below, so printing is not timed
    std::cout << "3: " << (float)(endTime - startTime) / (float)CLOCKS_PER_SEC << std::endl;
    std::cout << "value: " << (int)in2.at<unsigned char>(0, 0) << std::endl;
    cv::namedWindow("...");
    cv::waitKey(0);
}
with results:
value: 255
1: 0.016
value: 64
2: 0.016
value: 64
3: 0.003
value: 63
You can see that the results differ, probably because OpenCV's divide performs a floating-point division and rounds to the nearest integer, while your faster version uses integer division, which truncates. That is faster but gives a different result.
In addition, there is a saturate_cast in the OpenCV computation, but I would guess the bigger difference in computational load comes from the double-precision division.

C++ inline function for array multiplications of 10000

I am tasked with two programs, and this is the second one. The first program had no calculation() function; it simply timed the program from start to finish. My computer displays anything from 0.523 to 0.601 seconds.
The second task was to create an inline function for the calculation, and I believe I have done it wrong because it is not faster. I am not sure whether I wrote the calculation function correctly, since it includes the display code, or whether the inline function should contain only the multiplication. Either way, pulling the arrays out of main and into a function is not faster.
Is the compiler just ignoring it?
#include <ctime>
#include <iostream>
using namespace std;

inline void calculation()   // returns nothing, so declared void
{
    int i;
    double result[10000];
    double user[10000];
    for (i = 0; i < 10000; i++) {
        user[i] = i + 100;
    }
    double second[10000];
    for (i = 0; i < 10000; i++) {
        second[i] = 10099 - i;
    }
    for (i = 0; i < 10000; i++) {
        result[i] = user[i] * second[i];
    }
    for (i = 0; i < 10000; i++) {
        cout << user[i] << " * " << second[i] << " = " << result[i] << '\n';
    }
}

int main()
{
    time_t t1 = time(0); // get time now
    struct tm *now = localtime(&t1);
    cout << "The time now is: ";
    cout << now->tm_hour << ":" << now->tm_min << ":" << now->tm_sec << endl;
    clock_t t; // get ticks
    t = clock();
    cout << " Also calculating ticks...\n" << endl;
    calculation(); // inline function
    time_t t2 = time(0); // get time now
    struct tm *now2 = localtime(&t2);
    cout << "The time now is: ";
    cout << now2->tm_hour << ":" << now2->tm_min << ":" << now2->tm_sec << endl;
    time_t t3 = t2 - t1;
    cout << "This took me " << t3 << " second(s)" << endl;
    t = clock() - t;
    float p = (float)t / CLOCKS_PER_SEC;
    cout << "Or more accurately, this took " << t << " ticks"
         << " or " << p << " seconds" << endl;
}
Is the compiler just ignoring it?
Most probably, yes. It could be doing that for two reasons:
You're compiling in debug mode. In debug mode, inline requests are typically ignored to facilitate debugging.
It's ignoring it because the function is far too long to inline, uses far too much stack space to inline safely, and is only invoked once. The inline keyword is a compiler HINT, not a mandatory requirement. It's the programmer's way of recommending that the compiler inline the function, just as a compiler in release mode will frequently inline functions on its own to increase performance. If it sees only negative value, it won't comply.
Also, given the single invocation, you're highly unlikely to see a difference whether inlining happens or not. A single native function call is far cheaper for the CPU than, say, a single task switch at the OS level.
You should disable optimization to verify whether what you do has any effect, because chances are good that the compiler is already inlining the function by itself.
Also, if you want to know exactly what your code does, compile with the -S flag in g++ and look at the assembly the compiler generates for your program. That removes all uncertainty about what the compiler is doing to it.
I would not make the function inline, and I would define the arrays as static so they do not occupy stack space. For example:
void calculation()   // returns nothing, so declared void
{
    int i;
    static double result[10000];
    static double user[10000];
    for (i = 0; i < 10000; i++) {
        user[i] = i + 100;
    }
    static double second[10000];
    for (i = 0; i < 10000; i++) {
        second[i] = 10099 - i;
    }
    for (i = 0; i < 10000; i++) {
        result[i] = user[i] * second[i];
    }
    for (i = 0; i < 10000; i++) {
        cout << user[i] << " * " << second[i] << " = " << result[i] << '\n';
    }
}