Measuring time with chrono changes after printing - C++

I want to measure the execution time of a program in ns in C++. For that purpose I am using the chrono library.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;

    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();

    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;
    return 0;
}
I measured the time and it executed in 90 ns. However, when I add printing afterwards, the time changes.
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];
    int i, j;

    auto start = std::chrono::steady_clock::now();
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            input[i][j] = i + j;
    }
    auto end = std::chrono::steady_clock::now();

    auto res = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : "
              << res
              << " ns" << std::endl;

    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++)
            std::cout << input[i][j];
    }
    return 0;
}
The time changes to 89700 ns. What could be the problem? I only want to measure the execution time of the for loop.
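As a side note (an editor's sketch, not part of the original question): a 90 ns result for 40,000 stores suggests the optimizer removed the loop entirely, since nothing read input before the print was added. One way to keep the stores observable without printing the whole array is to fold the result into a checksum and publish it through a volatile; the sum and sink below are illustrative additions:
#include <chrono>
#include <iostream>

int main() {
    const int ROWS = 200;
    const int COLS = 200;
    double input[ROWS][COLS];

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            input[i][j] = i + j;
    auto end = std::chrono::steady_clock::now();

    // Fold the array into a checksum and publish it through a volatile,
    // so the compiler cannot treat the stores above as dead code.
    double sum = 0.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += input[i][j];
    volatile double sink = sum;
    (void)sink;

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::cout << "Elapsed time in nanoseconds : " << ns << " ns" << std::endl;
    return 0;
}
Compiler-specific barriers such as GCC/Clang's asm volatile("" ::: "memory") or benchmark::DoNotOptimize from Google Benchmark are more robust tools for the same purpose.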

SSE slower than standard logic

Add two arrays element by element, using standard logic and SSE.
#include <immintrin.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <iostream>

int main() {
    int count = std::pow(2, 20);
    alignas(16) float* fm1 = new float[count];
    alignas(16) float* fm2 = new float[count];
    alignas(16) float* res = new float[count];
    for (int i = 0; i < count; ++i) {
        fm1[i] = static_cast<float>(i);
        fm2[i] = static_cast<float>(i);
    }
    // Standard logic
    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < 1000; ++j) {
            for (int i = 0; i < count; ++i) {
                res[i] = fm1[i] + fm2[i];
            }
        }
        auto diff = std::chrono::high_resolution_clock::now() - start;
        std::cout << "execute time duration = "
                  << std::chrono::duration<double, std::milli>(diff).count()
                  << " milliseconds" << std::endl;
    }
    // SSE
    {
        assert(count % 4 == 0);
        auto start = std::chrono::high_resolution_clock::now();
        for (int j = 0; j < 1000; ++j) {
            for (int i = 0; i < count; i += 4) {
                __m128 a = _mm_load_ps(&fm1[i]);
                __m128 b = _mm_load_ps(&fm2[i]);
                __m128 r = _mm_add_ps(a, b);
                _mm_store_ps(&res[i], r);
            }
        }
        auto diff = std::chrono::high_resolution_clock::now() - start;
        std::cout << "execute time duration = "
                  << std::chrono::duration<double, std::milli>(diff).count()
                  << " milliseconds" << std::endl;
    }
}
Result:
execute time duration = 1692.19 milliseconds   (standard loop)
execute time duration = 2339.49 milliseconds   (SSE loop)
Laptop configuration:
11th Gen Intel(R) Core(TM) i7-11370H @ 3.30 GHz
16.0 GB RAM
I expected the SSE version to be at least 3 times faster, but it is slower.
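One detail worth flagging in the snippet above (an editor's note, not from the original post): alignas(16) on a pointer declaration aligns the pointer variable itself, not the buffer it points to, so _mm_load_ps is not formally guaranteed a 16-byte-aligned address here (in practice, large allocations from new are usually 16-byte aligned anyway). A minimal sketch of an explicitly aligned allocation using _mm_malloc, reusing count from the question:
#include <immintrin.h>

// _mm_malloc guarantees the requested alignment for the buffer itself;
// pair it with _mm_free, not delete[].
float* fm1 = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
float* fm2 = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
float* res = static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
// ... use as before ...
_mm_free(res);
_mm_free(fm2);
_mm_free(fm1);
Independently of alignment, a common culprit for intrinsics appearing slower than a plain loop is benchmarking a build without optimization: at -O0 every __m128 round-trips through memory, while at -O2 the plain loop is typically auto-vectorized anyway, leaving little for hand-written SSE to gain.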

How can this for loop be optimized to run faster without parallelizing or SSE?

I am trying to optimize a piece of code without resorting to parallelization / SSE.
The critical code currently runs in about 20 ms on my PC with O2. That seems like quite a lot, even for ~17 million iterations.
The particular piece that is too slow is as follows:
for (int d = 0; d < numDims; d++)
{
    for (int i = 0; i < numNodes; i++)
    {
        bins[d][(int) (floodVals[d][i] * binSteps)]++;
    }
}
Update: Changing to iterators reduced the run-time to 17 ms.
for (int d = 0; d < numDims; d++)
{
    std::vector<float>::iterator floodIt;
    for (floodIt = floodVals[d].begin(); floodIt < floodVals[d].end(); floodIt++)
    {
        bins[d][(int) (*floodIt * binSteps)]++;
    }
}
The full dummy code is here:
#include <vector>
#include <random>
#include <iostream>
#include <chrono>

int main()
{
    // Initialize random normalized input [0, 1)
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<float> dist(0, 0.99999);

    // Initialize dimensions
    const int numDims = 130;
    const int numNodes = 130000;
    const int binSteps = 30;

    // Make dummy data
    std::vector<std::vector<float>> floodVals(numDims, std::vector<float>(numNodes));
    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            floodVals[d][i] = dist(gen);
        }
    }

    // Initialize binning
    std::vector<std::vector<int>> bins(numDims, std::vector<int>(binSteps, 0));

    // Time critical section of code
    auto start = std::chrono::high_resolution_clock::now();
    for (int d = 0; d < numDims; d++)
    {
        for (int i = 0; i < numNodes; i++)
        {
            bins[d][(int) (floodVals[d][i] * binSteps)]++;
        }
    }
    auto finish = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> elapsed = finish - start;
    std::cout << "Elapsed: " << elapsed.count() * 1000 << " ms" << std::endl;
    return 0;
}
Try eliminating the indexing on d in the inner loop, since it is constant there anyway. This was roughly 2x faster for me, presumably because the compiler can then keep the two row pointers in registers; in the original form it has to re-load bins[d] through the outer vector on every iteration, since it cannot easily prove that the histogram increment does not alias the vectors' internal pointers.
for (int d = 0; d < numDims; d++)
{
    int* const bins_d = &bins[d][0];
    float* const floodVals_d = &floodVals[d][0];
    for (int i = 0; i < numNodes; i++)
    {
        bins_d[(int) (floodVals_d[i] * binSteps)]++;
    }
}

How to solve a weighted completion time minimization problem with CPLEX/C++?

I have a scheduling problem that minimizes a weighted completion time. I would like to translate the problem below (originally given as an image) into C++ and solve it with the CPLEX solver.
Problem:
Min ∑_i w_i C_i
x_i ∈ {0, 1}
t = Δ + ∑_i x_i p_i1
C_i1 = ∑_{j=1}^{i} x_j p_j1
C_i2 = t + ∑_{j=1}^{i} (1 − x_j) p_j2
C_i ≥ C_i1
C_i ≥ C_i2 − Ω x_i
Ω = Δ + ∑_i max(p_i1, p_i2)
The goal is to minimize the weighted completion time on a single machine with at most one machine reconfiguration. The boolean variable x_i tells me whether job i is done before or after the reconfiguration. The variable t is the instant after the reconfiguration (Δ is the time needed to reconfigure the machine). p_i1 is the processing time of job i in configuration 1 and p_i2 is its processing time in configuration 2. In addition, I have constraints on C_i, the completion time of job i.
The example code I wrote is not working. I know I have a problem with how I write the C variable (but I don't know how to fix it).
#include <iostream>
#include <ilcplex/ilocplex.h>
#include <ilcp/cp.h>
#include <vector>
#include <algorithm>

using namespace std;

int main()
{
    int nbJob = 4;
    int delta = 2;
    int nbConf = 2;

    vector<int> w_job;
    w_job.push_back(1); w_job.push_back(1); w_job.push_back(1); w_job.push_back(1);

    vector<vector<int>> p_job;
    p_job.resize(nbJob);
    p_job[0].push_back(6);  p_job[0].push_back(3);
    p_job[1].push_back(1);  p_job[1].push_back(2);
    p_job[2].push_back(10); p_job[2].push_back(2);
    p_job[3].push_back(1);  p_job[3].push_back(9);

    int max_p = 0;
    int aux;
    for (size_t i = 0; i < nbJob - 1; i++) {
        aux = max(p_job[i][0], p_job[i][1]);
        max_p = max(max_p, aux);
    }
    cout << max_p << endl;

    int max_w = 0;
    aux = 0;
    for (size_t i = 0; i < nbJob - 1; i++) {
        aux = max(w_job[i], w_job[i + 1]);
        max_w = max(aux, max_w);
    }
    cout << max_w << endl;

    int omega = 0;
    for (size_t i = 0; i < nbJob; i++) {
        omega += max(p_job[i][0], p_job[i][1]);
    }
    omega += delta;
    cout << omega << endl;

    try {
        IloEnv env;
        IloModel model(env);

        IloBoolVarArray x(env, nbJob);
        for (size_t i = 0; i < nbJob; i++) {
            IloBoolVar xi(env, 0, 1, "xi");
            x[i] = xi;
        }

        IloArray<IloExprArray> C(env, nbJob);
        for (size_t i = 0; i < nbJob; i++) {
            IloExprArray Ci(env);
            for (size_t j = 0; j < nbConf; j++) {
                IloExpr Cij(env);
                Ci.add(Cij);
            }
            C[i] = Ci;
        }

        IloNumVarArray C_final(env, nbJob);
        for (size_t i = 0; i < nbJob; i++) {
            IloNumVar C_finali(env, 0, IloInfinity, ILOINT);
            C_final[i] = C_finali;
        }

        IloExpr t(env);
        for (int i = 0; i < nbJob; i++) {
            t += x[i] * p_job[i][0];
        }
        t += delta;

        for (size_t i = 0; i < nbJob; i++) {
            for (size_t j = 0; j < nbJob; j++) {
                if (j <= i) {
                    C[i][0] += x[j] * p_job[j][0];
                }
            }
        }

        for (size_t i = 0; i < nbJob; i++) {
            for (size_t j = 0; j < nbJob; j++) {
                if (j <= i) {
                    C[i][1] += ((1 - x[j]) * p_job[j][1]);
                }
            }
            C[i][1] += t;
        }

        for (size_t i = 0; i < nbJob; i++) {
            model.add(C_final[i] >= C[i][0]);
        }
        for (size_t i = 0; i < nbJob; i++) {
            model.add(C_final[i] >= C[i][1] - (omega * x[i]));
        }

        IloExpr wiCi(env);
        for (size_t i = 0; i < nbJob; i++) {
            wiCi += w_job[i] * C_final[i];
        }
        model.add(IloMinimize(env, wiCi));

        IloCplex solver(model);
        solver.solve();

        for (int i = 0; i < nbJob; i++) {
            cout << "C " << i + 1 << " = " << solver.getValue(C_final[i])
                 << " x" << i + 1 << " = " << solver.getValue(x[i]) << endl;
        }
        cout << endl;
        cout << "t = " << solver.getValue(t) << endl;
        cout << "wiCi = " << solver.getObjValue() << endl << endl;
    }
    catch (IloException& e) {
        cout << e.getMessage();
    }
}
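One way to narrow down what "not working" means (an editor's sketch, not from the original post): the code never checks the return value of solver.solve(), and solver.getValue will throw if no solution was found. A small sketch using standard Concert calls, replacing the bare solve inside the try block:
IloCplex solver(model);
solver.exportModel("model.lp");   // dump the generated model to verify the C expressions
if (!solver.solve()) {
    // Prints Infeasible / Unbounded / Unknown, etc.
    env.error() << "No solution, status = " << solver.getStatus() << endl;
    return 1;
}
Inspecting the exported model.lp is usually the quickest way to see whether the C[i][j] expressions were built as intended.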

Why is C++ forEach slower than a naive single-thread loop?

My code is like this:
auto t1 = std::chrono::steady_clock::now();
for (int t{0}; t < 100; ++t) {
    vector<int> table(256, 0);
    Mat im2 = cv::imread(impth, cv::ImreadModes::IMREAD_COLOR);
    im2.forEach<cv::Vec3b>([&table](cv::Vec3b &pix, const int* pos) {
        for (int i{0}; i < 3; ++i) ++table[pix[i]];
    });
}
auto t2 = std::chrono::steady_clock::now();
cout << "time is: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << endl;

auto t3 = std::chrono::steady_clock::now();
for (int t{0}; t < 100; ++t) {
    vector<int> table(256, 0);
    Mat im2 = cv::imread(impth, cv::ImreadModes::IMREAD_COLOR);
    for (int r{0}; r < im2.rows; ++r) {
        auto ptr = im2.ptr<uint8_t>(r);
        for (int c{0}; c < im2.cols; ++c) {
            for (int i{0}; i < 3; ++i) ++table[ptr[i]];
            ptr += 3;
        }
    }
}
auto t4 = std::chrono::steady_clock::now();
cout << "time is: " << std::chrono::duration_cast<std::chrono::milliseconds>(t4 - t3).count() << endl;
Intuitively, I felt that forEach should be faster, since it uses a multi-threading mechanism to do the work, but it turns out that the forEach method took 14759 ms while the naive loop took only 6791 ms. What is the cause of the slowdown, and how could I make it faster?
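A note for context (an editor's sketch, not part of the original question): cv::Mat::forEach runs the functor in parallel, and here every worker increments the same table through the captured reference, which is both a data race on plain ints and a constant fight over the same cache lines. A sketch of the usual remedy, assuming OpenCV >= 3.2 for the lambda overload of cv::parallel_for_ (the histogram function below is illustrative):
#include <opencv2/opencv.hpp>
#include <cstdint>
#include <mutex>
#include <vector>

// Histogram with one local table per chunk of rows, merged under a mutex.
std::vector<int> histogram(const cv::Mat& img) {
    std::vector<int> table(256, 0);
    std::mutex m;
    cv::parallel_for_(cv::Range(0, img.rows), [&](const cv::Range& range) {
        std::vector<int> local(256, 0);            // private to this chunk
        for (int r = range.start; r < range.end; ++r) {
            const uint8_t* ptr = img.ptr<uint8_t>(r);
            for (int c = 0; c < img.cols * img.channels(); ++c)
                ++local[ptr[c]];
        }
        std::lock_guard<std::mutex> lock(m);       // merge once per chunk
        for (int v = 0; v < 256; ++v) table[v] += local[v];
    });
    return table;
}
Each thread writes only to its own local table, so there is no contention until the single merge at the end.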

Dynamic 2D array C++98 vs C++11

Following the question "What is “cache-friendly” code?", I created a dynamic 2D array to check how long it takes to access elements column-wise and row-wise.
When I create an array in the following way:
const int len = 10000;
int **mass = new int*[len];
for (int i = 0; i < len; ++i)
{
    mass[i] = new int[len];
}
it takes 0.239 sec to traverse this array row-wise and 1.851 sec column-wise (in Release).
But when I create an array in this way:
auto mass = new int[len][len];
I get the opposite result: 0.204 sec to traverse this array row-wise and 0.088 sec column-wise.
My code:
#include <ctime>
#include <iostream>

int main()
{
    const int len = 10000;
    int **mass = new int*[len];
    for (int i = 0; i < len; ++i)
    {
        mass[i] = new int[len];
    }
    // auto mass = new int[len][len]; // C++11 style

    std::clock_t begin, end;

    begin = std::clock();
    for (int i = 0; i < len; ++i)
    {
        for (int j = 0; j < len; ++j)
        {
            mass[i][j] = i + j;
        }
    }
    end = std::clock();
    std::cout << "[i][j] " << static_cast<float>(end - begin) / 1000 << std::endl;

    begin = std::clock();
    for (int i = 0; i < len; ++i)
    {
        for (int j = 0; j < len; ++j)
        {
            mass[j][i] = i + j;
        }
    }
    end = std::clock();
    std::cout << "[j][i] " << static_cast<float>(end - begin) / 1000 << std::endl;
}
Please can you explain what the difference is between these two ways of allocating memory for a two-dimensional dynamic array? Why is it faster to traverse the array row-wise with the first method and column-wise with the second?
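For reference (an editor's sketch, not from the original post), the two allocations differ in memory layout, which is what cache behaviour hinges on:
int main() {
    const int len = 10000;

    // Pointer-to-pointer (C++98 style): len + 1 separate allocations.
    // Rows may land anywhere on the heap, so mass1[i][j] costs two
    // dependent loads, and neighbouring rows are generally not contiguous.
    int **mass1 = new int*[len];
    for (int i = 0; i < len; ++i)
        mass1[i] = new int[len];

    // Single contiguous block: mass2 has type int (*)[len], and
    // mass2[i][j] is one load at offset i*len + j, so iterating j in the
    // inner loop walks memory linearly.
    auto mass2 = new int[len][len];

    // ... timing loops as in the question ...

    delete[] mass2;
    for (int i = 0; i < len; ++i) delete[] mass1[i];
    delete[] mass1;
}
The surprising column-wise win in the second case is also commonly attributed to first-touch cost: whichever timed loop runs first on the freshly allocated block pays the page faults for committing its memory.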