This is my first time using multithreading to speed up a heavy calculation.
Background: The idea is to calculate a kernel covariance matrix by reading a list of 3D points x_test and calculating the corresponding matrix, which has dimensions x_test.size() x x_test.size().
I already sped up the calculation by computing only the lower triangular matrix. Since all the entries are independent of each other, I tried to speed up the process further (x_test.size() = 27000 in my case) by splitting the matrix row-wise and assigning a range of rows to each thread.
On a single core the calculation took about 280 seconds each time; on 4 cores it took 270-290 seconds.
main.cpp
int main(int argc, char *argv[]) {
double sigma0sq = 1;
double lengthScale [] = {0.7633, 0.6937, 3.3307e+07};
const std::vector<std::vector<double>> x_test = parse2DCsvFile(inputPath);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i=1; i<x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
/* Spreading calculations to multiple threads */
std::vector<std::thread> threads;
for(std::size_t i = 1; i < indices.size(); ++i){
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices.at(i-1), indices.at(i)));
}
for(auto & th: threads){
th.join();
}
return 0;
}
As you can see, each thread performs the following calculations on the data assigned to it:
void calculateKMatrixCpp(const std::vector<std::vector<double>> xtest, double lengthScale[], double sigma0sq, int threadCounter, int start, int stop){
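char buffer[8192];
std::ofstream out;
// Install the buffer before opening the file: some standard library implementations ignore pubsetbuf once the file is open
out.rdbuf()->pubsetbuf(buffer, sizeof(buffer)); // sizeof keeps the size in sync with the array (was 8196 for an 8192-byte buffer)
out.open("lower_half_matrix_" + std::to_string(threadCounter) +".csv");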
for(int i = start; i < stop; ++i){
for(int j = 0; j < i+1; ++j){
double kij = seKernel(xtest.at(i), xtest.at(j), lengthScale, sigma0sq);
if (j!=0)
out << ',';
out << kij;
}
if(i!=xtest.size()-1 )
out << '\n';
}
out.close();
}
and
double seKernel(const std::vector<double> x1,const std::vector<double> x2, double lengthScale[], double sigma0sq) {
double sum(0);
for(std::size_t i=0; i<x1.size();i++){
sum += pow((x1.at(i)-x2.at(i))/lengthScale[i],2);
}
return sigma0sq*exp(-0.5*sum);
}
Aspects I considered
Locking caused by simultaneous access to the data vector -> I don't pass a reference to the threads but a copy of the data. I know this is not optimal in terms of RAM usage, but as far as I know it should prevent simultaneous data access, since every thread has its own copy.
Output -> Every thread writes its part of the lower triangular matrix to its own file. My task manager shows the SSD nowhere near full utilization.
Compiler and machine
Windows 11
GNU GCC Compiler
Code::Blocks (although I don't think that should be of importance)
There are many details that can be improved in your code, but I think the two biggest issues are:
using vectors of vectors, which leads to fragmented data;
writing each piece of data to file as soon as its value is computed.
The first point is easy to fix: use something like std::vector<std::array<double, 3>>. In the code below I use an alias to make it more readable:
using Point3D = std::array<double, 3>;
std::vector<Point3D> x_test;
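To make the difference concrete, here is a small sketch that checks that the points of such a vector sit back to back in a single allocation, which a vector of vectors does not give you. The helper name isContiguous is made up for this illustration, and the check assumes std::array<double, 3> has no padding, which holds on the common implementations.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Point3D = std::array<double, 3>;

// Hypothetical helper: true if every point starts exactly three doubles
// after the previous one, i.e. all coordinates live in one contiguous block.
bool isContiguous(std::vector<Point3D> const& points) {
    for (std::size_t i = 1; i < points.size(); ++i) {
        if (points[i].data() != points[i - 1].data() + 3) {
            return false;
        }
    }
    return true;
}
```

With std::vector<std::vector<double>>, each inner vector owns a separate heap allocation, so no such guarantee exists and neighbouring points may end up far apart in memory.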
The second point is slightly harder to address. I assume you wanted to write to the disk inside each thread because you couldn't manage to write to a shared buffer that you could then write to a file.
Here is a way to do exactly that:
void calculateKMatrixCpp(
std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq,
int threadCounter, int start, int stop, std::vector<double>& kMatrix
) {
// ...
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
// ...
}
// ...
threads.push_back(std::thread(
calculateKMatrixCpp, x_test, lengthScale, sigma0sq,
i, indices[i-1], indices[i], std::ref(kMatrix)
));
Here, kMatrix is the shared buffer and represents the whole matrix you are trying to compute. You need to pass it to the thread via std::ref. Each thread will write to a different location in that buffer, so there is no need for any mutex or other synchronization.
Once you make these changes and try to write kMatrix to the disk, you will realize that this is the part that takes the most time, by far.
Below is the full code I tried on my machine, and the computation time was about 2 seconds whereas the writing-to-file part took 300 seconds! No amount of multithreading can speed that up.
If you truly want to write all that data to the disk, you may have some luck with file mapping. Computing the exact size needed should be easy enough if all values have the same number of digits, and it looks like you could write the values with multithreading. I have never done anything like that, so I can't really say much more about it, but it looks to me like the fastest way to write multiple gigabytes of memory to the disk.
#include <vector>
#include <thread>
#include <iostream>
#include <string>
#include <cmath>
#include <array>
#include <random>
#include <fstream>
#include <chrono>
using Point3D = std::array<double, 3>;
auto generateSampleData() -> std::vector<Point3D> {
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i) {
data.push_back({ d(g), d(g), d(g) });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance*distance;
}
return sigma0sq * std::exp(-0.5*sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D const& lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::vector<double>& kMatrix) {
std::cout << "start of thread " << threadCounter << "\n" << std::flush;
for(int i = start; i < stop; ++i) {
for(int j = 0; j < i+1; ++j) {
double& kij = kMatrix[i * xtest.size() + j];
kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
}
}
std::cout << "end of thread " << threadCounter << "\n" << std::flush;
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = {0.7633, 0.6937, 3.3307e+07};
const std::vector<Point3D> x_test = generateSampleData();
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size()*x_test.size()/2;
const int numThreads = 4;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for(std::size_t i = 1; i < x_test.size()+1; ++i){
int prod = i*(i+1)/2 - j*(j+1)/2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if(indices.size() == numThreads-1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<double> kMatrix(x_test.size() * x_test.size(), 0.0);
std::vector<std::thread> threads;
for (std::size_t i = 1; i < indices.size(); ++i) {
threads.push_back(std::thread(calculateKMatrixCpp, x_test, lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::ref(kMatrix)));
}
for (auto& t : threads) {
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "computation time: " << elapsed_seconds << "s" << std::endl;
start = std::chrono::system_clock::now();
constexpr int buffer_size = 131072;
char buffer[buffer_size];
std::ofstream out("matrix.csv");
out.rdbuf()->pubsetbuf(buffer, buffer_size);
for (int i = 0; i < x_test.size(); ++i) {
for (int j = 0; j < i + 1; ++j) {
if (j != 0) {
out << ',';
}
out << kMatrix[i * x_test.size() + j];
}
if (i != x_test.size() - 1) {
out << '\n';
}
}
end = std::chrono::system_clock::now();
elapsed_seconds = std::chrono::duration<double>(end - start).count();
std::cout << "writing time: " << elapsed_seconds << "s" << std::endl;
}
OK, I've written an implementation with optimized formatting.
Using @Nelfeal's code, a complete run took around 250 seconds on my system, with writing taking by far the most time. Or rather, the std::ofstream formatting took most of the time.
I've written a C++20 version using std::format_to/std::format. It is multithreaded and takes around 25-40 seconds to complete all the computation, formatting, and writing. Run in a single thread, it takes around 70 seconds on my system. The same performance should be achievable with the fmt library on C++11/14/17.
Here is the code:
import <vector>;
import <thread>;
import <iostream>;
import <string>;
import <cmath>;
import <array>;
import <random>;
import <fstream>;
import <chrono>;
import <format>;
import <filesystem>;
using Point3D = std::array<double, 3>;
auto generateSampleData(Point3D scale) -> std::vector<Point3D>
{
static std::minstd_rand g(std::random_device{}());
std::uniform_real_distribution<> d(-1.0, 1.0);
std::vector<Point3D> data;
data.reserve(27000);
for (auto i = 0; i < 27000; ++i)
{
data.push_back({ d(g)* scale[0], d(g)* scale[1], d(g)* scale[2] });
}
return data;
}
double seKernel(Point3D const& x1, Point3D const& x2, Point3D const& lengthScale, double sigma0sq) {
double sum = 0.0;
for (auto i = 0u; i < 3u; ++i) {
double distance = (x1[i] - x2[i]) / lengthScale[i];
sum += distance * distance;
}
return sigma0sq * std::exp(-0.5 * sum);
}
void calculateKMatrixCpp(std::vector<Point3D> const& xtest, Point3D lengthScale, double sigma0sq, int threadCounter, int start, int stop, std::filesystem::path localPath)
{
using namespace std::string_view_literals;
std::vector<char> buffer;
buffer.reserve(15'000);
std::ofstream out(localPath);
std::cout << std::format("starting thread {}: from {} to {}\n"sv, threadCounter, start, stop);
for (int i = start; i < stop; ++i)
{
for (int j = 0; j < i; ++j)
{
double kij = seKernel(xtest[i], xtest[j], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}, "sv, kij);
}
double kii = seKernel(xtest[i], xtest[i], lengthScale, sigma0sq);
std::format_to(std::back_inserter(buffer), "{:.6g}\n"sv, kii);
out.write(buffer.data(), buffer.size());
buffer.clear();
}
}
int main() {
double sigma0sq = 1;
Point3D lengthScale = { 0.7633, 0.6937, 3.3307e+07 };
const std::vector<Point3D> x_test = generateSampleData(lengthScale);
/* Finding data slices of similar size */
//This piece of code works, each thread is assigned roughly the same number of matrix entries
int numElements = x_test.size() * (x_test.size()+1) / 2;
const int numThreads = 3;
int elemsPerThread = numElements / numThreads;
std::vector<int> indices;
int j = 0;
for (std::size_t i = 1; i < x_test.size() + 1; ++i) {
int prod = i * (i + 1) / 2 - j * (j + 1) / 2;
if (prod > elemsPerThread) {
i--;
j = i;
indices.push_back(i);
if (indices.size() == numThreads - 1)
break;
}
}
indices.insert(indices.begin(), 0);
indices.push_back(x_test.size());
auto start = std::chrono::system_clock::now();
std::vector<std::thread> threads;
using namespace std::string_view_literals;
for (std::size_t i = 1; i < indices.size(); ++i)
{
threads.push_back(std::thread(calculateKMatrixCpp, std::ref(x_test), lengthScale, sigma0sq, i, indices[i - 1], indices[i], std::format("./matrix_{}.csv"sv, i-1)));
}
for (auto& t : threads)
{
t.join();
}
auto end = std::chrono::system_clock::now();
auto elapsed_seconds = std::chrono::duration<double>(end - start);
std::cout << std::format("total elapsed time: {}"sv, elapsed_seconds);
return 0;
}
Note: I used 6 digits of precision here as it is the default for std::ofstream. More digits means more writing time to disk and lower performance.
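For compilers without <format>, the same idea (format a whole row into a memory buffer, then hand it to the stream in one write) can be sketched with std::snprintf. This is a substitute I have not benchmarked, not the code measured above; appendRow is a made-up name, and the separator here is a plain comma.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: appends one CSV row to `buffer`, using 6 significant
// digits ("%.6g"), the same precision as the std::format version above.
void appendRow(std::string& buffer, std::vector<double> const& row) {
    char field[32]; // large enough for any double printed with "%.6g"
    for (std::size_t j = 0; j < row.size(); ++j) {
        int len = std::snprintf(field, sizeof(field), "%.6g", row[j]);
        if (j != 0) {
            buffer += ',';
        }
        buffer.append(field, static_cast<std::size_t>(len));
    }
    buffer += '\n';
}
```

Each thread would call something like appendRow once per matrix row, then pass buffer.data() and buffer.size() to out.write and clear the buffer, as in the loop above.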
I am trying to find the fastest way to compute the square root of any float in C++. I use this function in a large particle-movement simulation, for example to calculate the distance between two particles. Any suggestions would be very helpful.
Here is what I have tried:
#include <math.h>
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
#define CHECK_RANGE 100
inline float msqrt(float a)
{
int i;
for (i = 0;i * i <= a;i++);
float lb = i - 1; //lower bound
if (lb * lb == a)
return lb;
float ub = lb + 1; // upper bound
float pub = ub; // previous upper bound
for (int j = 0;j <= 20;j++)
{
float ub2 = ub * ub;
if (ub2 > a)
{
pub = ub;
ub = (lb + ub) / 2; // mid value of lower and upper bound
}
else
{
lb = ub;
ub = pub;
}
}
return ub;
}
void check_msqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
msqrt(i);
}
}
void check_sqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
sqrt(i);
}
}
int main()
{
auto start1 = high_resolution_clock::now();
check_msqrt();
auto stop1 = high_resolution_clock::now();
auto duration1 = duration_cast<microseconds>(stop1 - start1);
cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";
auto start2 = high_resolution_clock::now();
check_sqrt();
auto stop2 = high_resolution_clock::now();
auto duration2 = duration_cast<microseconds>(stop2 - start2);
cout << "Time for check_sqrt = " << duration2.count() << " micro secs";
//cout << msqrt(3);
return 0;
}
The output of the code above shows that my method is about 4 times slower than sqrt from math.h.
I need something faster than the math.h version.
In short, I do not think it is possible to implement something generally faster than the standard library version of sqrt.
Performance is a very important parameter when implementing standard library functions and it is fair to assume that such a commonly used function as sqrt is optimized as much as possible.
Beating the standard library function would require a special case, such as:
Availability of a suitable assembler instruction - or other specialized hardware support - on the particular system for which the standard library has not been specialized.
Knowledge of the needed range or precision. The standard library function must handle all cases specified by the standard. If the application only needs a subset of that or maybe only requires an approximate result then perhaps an optimization is possible.
Making a mathematical reduction of the calculations, or combining some calculation steps in a smart way, so that an efficient implementation can be made for that combination.
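As an illustration of the last point, and assuming the distance is only ever compared against a cutoff (the question does not say how it is used), the square root can be dropped entirely by comparing squared values. The names Particle and withinRadius are invented for this sketch.

```cpp
struct Particle {
    float x, y, z;
};

// Hypothetical helper: true if a and b are at most `radius` apart.
// Comparing squared values gives the same answer as comparing the
// distance itself when radius is non-negative, and needs no sqrt.
bool withinRadius(Particle const& a, Particle const& b, float radius) {
    float dx = a.x - b.x;
    float dy = a.y - b.y;
    float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz <= radius * radius;
}
```

If the actual distance value is needed later, the square root can still be taken only for the pairs that pass this test.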
Here's another alternative to binary search. It may not be as fast as std::sqrt; I haven't tested it. But it will definitely be faster than your binary search.
#include <cmath> // std::frexp, std::ldexp, std::isnan, INFINITY, NAN
auto
Sqrt(float x)
{
using F = decltype(x);
if (x == 0 || x == INFINITY || std::isnan(x))
return x;
if (x < 0)
return F{NAN};
int e;
x = std::frexp(x, &e);
if (e % 2 != 0)
{
++e;
x /= 2;
}
auto y = (F{-160}/567*x + F{2'848}/2'835)*x + F{155}/567;
y = (y + x/y)/2;
y = (y + x/y)/2;
return std::ldexp(y, e/2);
}
After getting +/-0, nan, inf, and negatives out of the way, it works by decomposing the float into a mantissa in the range [1/4, 1) times 2^e, where e is an even integer. The answer is then sqrt(mantissa) * 2^(e/2).
Finding the sqrt of the mantissa can be guessed at with a least squares quadratic curve fit in the range [1/4, 1]. Then that good guess is refined by two iterations of Newton–Raphson. This will get you within 1 ulp of the correctly rounded result. A good std::sqrt will typically get that last bit correct.
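The refinement step can be looked at in isolation. A minimal sketch (newtonStep and sqrtError are names made up here) that lets you watch the iteration y = (y + x/y)/2 close in on the square root from a rough guess:

```cpp
#include <cmath>

// One Newton-Raphson step for f(y) = y*y - x, i.e. y <- (y + x/y) / 2.
// Hypothetical helper; it is the same update as the two refinement
// lines in Sqrt above.
float newtonStep(float x, float y) {
    return (y + x / y) / 2;
}

// Absolute error of a guess y for sqrt(x), measured against std::sqrt.
float sqrtError(float x, float y) {
    return std::fabs(y - std::sqrt(x));
}
```

Starting from a guess of 0.7 for x = 0.5, the error shrinks roughly quadratically with each call to newtonStep, which is why two steps after a good initial fit are enough for float precision.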
I have also tried the algorithm described at https://en.wikipedia.org/wiki/Fast_inverse_square_root, but did not get the desired result. Please check:
#include <math.h>
#include <iostream>
#include <chrono>
#include <bit>
#include <limits>
#include <cstdint>
using namespace std;
using namespace std::chrono;
#define CHECK_RANGE 10000
inline float msqrt(float a)
{
int i;
for (i = 0;i * i <= a;i++);
float lb = i - 1; //lower bound
if (lb * lb == a)
return lb;
float ub = lb + 1; // upper bound
float pub = ub; // previous upper bound
for (int j = 0;j <= 20;j++)
{
float ub2 = ub * ub;
if (ub2 > a)
{
pub = ub;
ub = (lb + ub) / 2; // mid value of lower and upper bound
}
else
{
lb = ub;
ub = pub;
}
}
return ub;
}
/* mentioned here -> https://en.wikipedia.org/wiki/Fast_inverse_square_root */
inline float Q_sqrt(float number)
{
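// std::bit_cast (C++20, from <bit>) reinterprets the bits without the undefined behaviour of reading an inactive union member
uint32_t i = std::bit_cast<uint32_t>(number);
i = 0x5f3759df - (i >> 1); // initial guess for 1/sqrt(number)
float y = std::bit_cast<float>(i);
y *= 1.5F - (number * 0.5F * y * y); // one Newton-Raphson step on the inverse square root
return 1/y; // sqrt(number) is the reciprocal of 1/sqrt(number)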
}
void check_Qsqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
Q_sqrt(i);
}
}
void check_msqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
msqrt(i);
}
}
void check_sqrt()
{
for (size_t i = 0; i < CHECK_RANGE; i++)
{
sqrt(i);
}
}
int main()
{
auto start1 = high_resolution_clock::now();
check_msqrt();
auto stop1 = high_resolution_clock::now();
auto duration1 = duration_cast<microseconds>(stop1 - start1);
cout << "Time for check_msqrt = " << duration1.count() << " micro secs\n";
auto start2 = high_resolution_clock::now();
check_sqrt();
auto stop2 = high_resolution_clock::now();
auto duration2 = duration_cast<microseconds>(stop2 - start2);
cout << "Time for check_sqrt = " << duration2.count() << " micro secs\n";
auto start3 = high_resolution_clock::now();
check_Qsqrt();
auto stop3 = high_resolution_clock::now();
auto duration3 = duration_cast<microseconds>(stop3 - start3);
cout << "Time for check_Qsqrt = " << duration3.count() << " micro secs\n";
//cout << Q_sqrt(3);
//cout << sqrt(3);
//cout << msqrt(3);
return 0;
}
I have the Eigen C++ code below, which performs a squaredNorm calculation 10 million times.
Is there any way to make it more robust/faster?
#include <Eigen/Core>
#include <tbb/parallel_for.h>
#include "tbb/tbb.h"
#include <mutex>
#include <opencv2/opencv.hpp>
int main(){
int numberOFdata = 10000008;
Eigen::MatrixXf feat = Eigen::MatrixXf::Random(numberOFdata,512);
Eigen::MatrixXf b_cmp= Eigen::MatrixXf::Random(1,512);
int count_feature = feat.rows();
std::vector<int> found_number ;
std::mutex mutex1;
for (int loop = 0 ; loop<16 ; loop++){
double start_1 = static_cast<double>(cv::getTickCount());
tbb::affinity_partitioner ap;
tbb::parallel_for( tbb::blocked_range<int>(0,count_feature),
[&](tbb::blocked_range<int> r )
{
for (int i=r.begin(); i<r.end(); ++i)
{
auto distance = ( feat.row(i)- b_cmp ).squaredNorm();
if (distance < 0.5) {
mutex1.lock();
found_number.push_back(i);
mutex1.unlock();
}
}
},ap);
double timefin = ((double)cv::getTickCount() - start_1) / cv::getTickFrequency();
std::cout << count_feature << " TOTAL : " << timefin << std::endl;
}
}
Compile flags :
-Xpreprocessor -std=c++11 -fopenmp -pthread -O3 -mavx2 -march=native -funroll-loops -fpermissive
eigen version 3.3.7
tbb opencv and eigen linked.
You can remove opencv and use a different elapsed time calculation.
Thanks
You should be faster by a factor of about 4 if you store feat in the same order in which you access it (i.e., Eigen::RowMajor in your case).
Minimal example removing all non-Eigen related things:
int numberOFdata = 10000008;
Eigen::Matrix<float,Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> feat = Eigen::MatrixXf::Random(numberOFdata, 512);
Eigen::RowVectorXf b_cmp = Eigen::MatrixXf::Random(1, 512);
int count_feature = feat.rows();
std::vector<int> found_number;
for (int loop = 0; loop < 16; loop++) {
auto start = std::chrono::steady_clock::now();
{
for (int i = 0; i < feat.rows(); ++i) {
float distance = (feat.row(i) - b_cmp).squaredNorm();
if (distance < 0.5f) {
found_number.push_back(i);
}
}
};
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> diff = end-start;
std::cout << count_feature << " TOTAL : " <<
diff.count() << std::endl;
}
Godbolt-Demo (reduced dimension of feat due to memory-limitations): https://godbolt.org/z/b6r5K4Yxv
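The speedup from Eigen::RowMajor comes down to memory layout. Eigen matrices are column-major by default, so consecutive entries of one row are rows() elements apart in memory, while a row-major matrix keeps each row contiguous. A plain C++ sketch of the two index formulas (the function names are invented, and this does not use Eigen itself):

```cpp
#include <cstddef>

// Offset of element (row, col) in a column-major matrix with `rows` rows.
std::size_t colMajorOffset(std::size_t row, std::size_t col, std::size_t rows) {
    return col * rows + row;
}

// Offset of element (row, col) in a row-major matrix with `cols` columns.
std::size_t rowMajorOffset(std::size_t row, std::size_t col, std::size_t cols) {
    return row * cols + col;
}
```

With 10000008 rows, stepping from feat(i, 0) to feat(i, 1) in the default layout moves 10000008 floats ahead, so reading one row touches 512 widely separated memory locations; in row-major order the same step moves one float ahead.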
In Eigen, with
ArrayXXf a;
a = ArrayXXf::Random(1000, 10000);
doing
a = a.pow(4);
takes ~500ms on my pc, whereas doing
a = a.square().square();
takes only about 5ms. I'm compiling with a recent GCC in release.
Is this the expected behaviour or am I doing something wrong? I would expect that, at least for small integer exponents (say < 20, if not using a cost function), an overload exists that catches such cases.
After a whole day of debugging I realized this was the bottleneck in my code. The documentation says that there is no SIMD for a.pow(). For whatever reason, on my machine a * a * a * a actually appears to be faster for a 300 x 50 Eigen::ArrayXXd.
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
int main() {
Eigen::ArrayXXd m = Eigen::ArrayXd::LinSpaced(300 * 50, 0, 300 * 50 - 1).reshaped(300, 50);
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 100000; i++)
{
result = m.square().square();
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 100000.0 << std::endl;
}
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 100000; i++)
{
result = m * m * m * m;
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 100000.0 << std::endl;
}
{
decltype(m) result; // prevent loop from being eliminated
auto start = std::chrono::steady_clock::now();
for (size_t i = 0; i < 10000; i++)
{
result = m.pow(4);
}
auto end = std::chrono::steady_clock::now();
std::cout << "Per run(microsecond)=" << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / 10000.0 << std::endl;
}
return 0;
}
m.square().square();
Per run(microsecond)=17.9101
m * m * m * m;
Per run(microsecond)=10.3267
m.pow(4);
Per run(microsecond)=431.636
With C++17 if constexpr this could be possible, but otherwise it's not. So currently, a.pow(x) is equivalent to calling std::pow(a[i],x) for each i.
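For a small, known integer exponent you can sidestep pow yourself. A minimal sketch of exponentiation by squaring on plain doubles (powInt is a made-up name, not an Eigen function); on an Eigen array the same sequence of multiplications could be spelled with square() and operator*:

```cpp
// Hypothetical helper: computes base^exponent for a non-negative integer
// exponent using O(log exponent) multiplications instead of std::pow.
double powInt(double base, unsigned exponent) {
    double result = 1.0;
    while (exponent != 0) {
        if (exponent & 1u) {
            result *= base; // this bit of the exponent is set
        }
        base *= base;       // advance to base^(2^k) for the next bit
        exponent >>= 1;
    }
    return result;
}
```

For exponent 4 this performs the same two squarings as a.square().square(), plus one multiplication into the running result.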
I am trying to measure the run time of my code. The code is my attempt at Project Euler Problem 5. When I try to output the run time, it gives 0 ns.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#define MAX_DIVISOR 20
bool isDivisible(long, int);
int main() {
auto begin = std::chrono::high_resolution_clock::now();
int d = 2;
long inc = 1;
long i = 1;
while (d < (MAX_DIVISOR + 1)) {
if ((i % d) == 0) {
inc = i;
i = inc;
d++;
}
else {
i += inc;
}
}
auto end = std::chrono::high_resolution_clock::now();
printf("Run time: %lld ns\n", static_cast<long long>(std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count())); // Gives 0 here.
std::cout << "ANS: " << i << std::endl;
system("pause");
return 0;
}
The timing resolution of std::chrono::high_resolution_clock::now() is system dependent.
You can estimate its order of magnitude with the small piece of code below (edit: this is a more accurate version):
chrono::nanoseconds mn(1000000000); // assuming the resolution is higher
for (int i = 0; i < 5; i++) {
using namespace std::chrono;
nanoseconds dt;
long d = 1000 * pow(10, i);
for (long e = 0; e < 10; e++) {
long j = d + e*pow(10, i)*100;
cout << j << " ";
auto begin = high_resolution_clock::now();
while (j>0)
k = ((j-- << 2) + 1) % (rand() + 100);
auto end = high_resolution_clock::now();
dt = duration_cast<nanoseconds>(end - begin);
cout << dt.count() << "ns = "
<< duration_cast<milliseconds>(dt).count() << " ms" << endl;
if (dt > nanoseconds(0) && dt < mn)
mn = dt;
}
}
cout << "Minimum resolution observed: " << mn.count() << "ns\n";
where k is a global volatile long k; to keep the optimizer from interfering too much.
Under Windows, I obtain 15 ms here. Beyond that, there are platform-specific alternatives. For Windows, there is a high-performance clock that lets you measure times in the range below 10 µs (see http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx), but still not in the nanosecond range.
If you want to time your code very accurately, you could re-execute it in a big loop and divide the total time by the number of iterations.
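A minimal sketch of that loop-and-divide idea, using std::chrono::steady_clock; the helper name averageNanoseconds is made up for illustration:

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// average duration of one call in nanoseconds.
template <typename Work>
double averageNanoseconds(Work work, int iterations) {
    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> total = end - begin;
    return total.count() / iterations;
}
```

The work passed in should have a visible effect (for example, storing its result in a volatile variable); otherwise an optimizing compiler may remove it and you end up timing an empty loop.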
The estimate you are making is not precise. A better approach is to measure the CPU time consumed by your program: other processes run concurrently with yours, so the time you are trying to measure can be greatly affected if CPU-intensive tasks are running in parallel with your process.
So my advice is to use an existing profiler if you want to estimate your code's performance.
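If a profiler is more than you need, the standard library's std::clock gives a rough view of processor time. A sketch under that assumption (cpuSeconds is a made-up name); be aware that some runtimes, Microsoft's C runtime for one, report elapsed wall-clock time from clock instead, so check your platform's documentation:

```cpp
#include <ctime>

// Hypothetical helper: processor time used while running `work`, in
// seconds, as reported by std::clock.
template <typename Work>
double cpuSeconds(Work work) {
    std::clock_t begin = std::clock();
    work();
    std::clock_t end = std::clock();
    return static_cast<double>(end - begin) / CLOCKS_PER_SEC;
}
```

The granularity of std::clock is coarse, so this only makes sense for work that runs for a noticeable amount of time.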
As for your task: if the OS doesn't provide the needed timing precision, you need to increase the total time you are measuring. The easiest way is to run the program n times and calculate the average. Averaging has the added advantage that it reduces the errors caused by CPU-intensive tasks running concurrently with your process.
Here is a code snippet showing how I see a possible implementation:
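#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <iostream>
using namespace std;
#define MAX_DIVISOR 20
bool isDivisible(long, int);
// Returns the smallest number divisible by every integer from 2 to MAX_DIVISOR
long doRoutine()
{
int d = 2;
long inc = 1;
long i = 1;
while (d < (MAX_DIVISOR + 1))
{
if (isDivisible(i, d))
{
inc = i;
d++;
}
else
{
i += inc;
}
}
return i;
}
bool isDivisible(long n, int d)
{
return (n % d) == 0;
}
int main() {
auto begin = std::chrono::high_resolution_clock::now();
const int nOfTrials = 1000000;
long answer = 0; // keep the result so it can be printed below
for (int i = 0; i < nOfTrials; ++i)
answer = doRoutine();
auto end = std::chrono::high_resolution_clock::now();
// Average time per call: total elapsed time divided by the number of trials
printf("Run time: %lld ns\n", static_cast<long long>(std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / nOfTrials));
std::cout << "ANS: " << answer << std::endl;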
system("pause");
return 0;