I want to measure the time it takes to call a function.
Here is the code:
for (int i = 0; i < 5; ++i) {
    std::cout << "Pass : " << i << "\n";
    const auto t0 = std::chrono::high_resolution_clock::now();
    system1.euler_intregration(0.0166667);
    const auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << "\n";
}
but the compiler keeps optimising the loop away, so no time is actually measured and the result is zero.
I have tried using asm("") and __asm__("") as advised here, but nothing works for me.
I must admit that I don't really know how these asm() constructs work, so I might be using them in the wrong way.
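From what I gather, the trick the linked advice is driving at (on GCC and Clang) is not a bare asm("") but an empty asm statement that takes the result as an operand and clobbers memory, along the lines of Google Benchmark's DoNotOptimize. A minimal sketch, assuming a GCC-compatible compiler; do_not_optimize is an illustrative helper, not a standard function, and the multiply loop stands in for the euler_intregration call:

#include <chrono>
#include <iostream>

// GCC/Clang only: the empty asm "reads" the object's address and clobbers
// memory, so the compiler must keep the computation that produced the value.
template <typename T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(&value) : "memory");
}

int main() {
    double x = 1.0;
    const auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 5; ++i) {
        x *= 1.0000001;       // stand-in for system1.euler_intregration(0.0166667)
        do_not_optimize(x);   // the loop body can no longer be optimised away
    }
    const auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << "\n";
}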
Related
I'm just comparing the speed of a couple of Fibonacci functions. One produces output almost immediately and reports that it finished in 500 nanoseconds, while the other, depending on the depth, may sit there for many seconds; yet when it is done, it reports that it took only 100 nanoseconds, after I just sat there and waited about 20 seconds for it.
It's not a big deal, since I can show the other one is slower just by raw human perception, but why would chrono not be working? Something to do with recursion?
PS: I know that fibonacci2() doesn't give the correct output at odd-numbered depths; I'm just testing some things, and the output is only there so the compiler doesn't optimize the call away. Go ahead and copy this code and you'll see fibonacci2() output immediately, but you'll have to wait about 5 seconds for fibonacci(). Thank you.
#include <iostream>
#include <chrono>

int fibonacci2(int depth) {
    static int a = 0;
    static int b = 1;
    if (b > a) {
        a += b; //std::cout << a << '\n';
    } else {
        b += a; //std::cout << b << '\n';
    }
    if (depth > 1) {
        fibonacci2(depth - 1);
    }
    return a;
}

int fibonacci(int n) {
    if (n <= 1) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
    int f = 0;

    auto start2 = std::chrono::steady_clock::now();
    f = fibonacci2(44);
    auto stop2 = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2);
    std::cout << "faster function time: " << duration2.count() << '\n';

    auto start = std::chrono::steady_clock::now();
    f = fibonacci(44);
    auto stop = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    std::cout << "way slower function with incorrect time: " << duration.count() << '\n';
}
I don't know which compiler and options you are using, but I tested x64 MSVC v19.28 with /O2 on godbolt. There, the compiled instructions are reordered so that the performance counter is queried twice before the fibonacci(int) function is even invoked, which in source code would look like
auto start = ...;
auto stop = ...;
f = fibonacci(44);
A solution that disallows this reordering might be to use a std::atomic_thread_fence just before and just after the fibonacci call.
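A minimal sketch of that placement, reusing the fibonacci function from the question. Strictly speaking, the standard only defines fences in relation to atomic operations, but in practice compilers also treat a seq_cst fence as a full compile-time barrier:

#include <atomic>
#include <chrono>
#include <iostream>

int fibonacci(int n) {
    if (n <= 1) return n;
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
    auto start = std::chrono::steady_clock::now();
    std::atomic_thread_fence(std::memory_order_seq_cst); // keep the call after 'start'
    int f = fibonacci(40);
    std::atomic_thread_fence(std::memory_order_seq_cst); // keep the call before 'stop'
    auto stop = std::chrono::steady_clock::now();
    std::cout << f << '\n'
              << std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count()
              << " ns\n";
}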
As Mestkon answered, the compiler can reorder your code.
For examples of how to prevent the compiler from reordering, see Memory Ordering - Compile Time Memory Barrier.
It would be helpful in the future if you provided information on which compiler you are using.
gcc 7.5 with -O2, for example, does not reorder the timer instructions in this scenario.
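For reference, the compile-time barrier described at that link is, on gcc, just an empty asm statement with a "memory" clobber: it emits no instructions but forbids the compiler from moving memory accesses across it. A sketch of how it could replace the fences in the previous example:

// GCC/Clang: compiles to nothing, but acts as a compile-time reordering barrier.
#define COMPILER_BARRIER() asm volatile("" ::: "memory")

// Usage, in place of the atomic fences:
//   auto start = std::chrono::steady_clock::now();
//   COMPILER_BARRIER();
//   int f = fibonacci(40);
//   COMPILER_BARRIER();
//   auto stop = std::chrono::steady_clock::now();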
I'm writing a class as follows:
struct TimeIt {
    using TimePoint = std::chrono::time_point<std::chrono::high_resolution_clock>;

    TimeIt(const std::string& functName) :
        t1{std::chrono::high_resolution_clock::now()},
        functName{functName} {}

    ~TimeIt() {
        TimePoint t2 = std::chrono::high_resolution_clock::now();
        std::cout << "Exiting from " << functName << "...\n Elapsed: ";
        std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms" << "\n";
    }

    TimePoint t1;
    std::string functName;
};
The whole point of it is to measure the time it takes for a function to complete, by creating one of these at the start of the function. However, the only value I'm getting is 0 ms. This is obviously wrong, because some of the functions take up to a minute, but I can't see why it's wrong.
I did the same thing directly at the start and end of the function, creating the TimePoints (with auto) and doing a duration_cast. Any clue what I'm missing here?
Edit:
I'm going to try to make it reproducible. A little bit of context: I'm working with big matrices (12000 dimensions) and doing a lot of input/output operations.
template <typename InputType>
InputMat<InputType>
readInp(const std::string& filepath = "data.inp", const size_t& reserveSize = 15000) {
    TimeIt("readInp");
    std::ifstream F(filepath);
    assert(F.is_open());
    InputMat<InputType> res;
    res.reserve(reserveSize);
    std::string line;
    while (F >> line) {
        InputType lineBitset{line};
        res.push_back(lineBitset);
    }
    return res;
}
This function reads a matrix, and the TimeIt call here reports a really different time from what I measure in the wrapper function:
void test1() {
    //Testing for 0-1 values
    auto t1 = std::chrono::high_resolution_clock::now();
    auto inpMat = readInp<std::bitset<32>>();
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << "\n";
    //More code...
}
This outputs:
Exiting from readInp...
Elapsed: 0 milliseconds
4
and data.inp
NOW you have made the problem clear! By writing this:
TimeIt("Reading");
you are creating a temporary object, which is destroyed immediately. You need to give the object a name so that it lives until the end of the block:
TimeIt timer("Reading");
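A minimal sketch of the difference, using a slightly trimmed version of the TimeIt class from the question (output format shortened here for brevity):

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

struct TimeIt {
    using Clock = std::chrono::high_resolution_clock;
    explicit TimeIt(const std::string& name) : t1{Clock::now()}, functName{name} {}
    ~TimeIt() {
        auto t2 = Clock::now();
        std::cout << functName << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
                  << " ms\n";
    }
    Clock::time_point t1;
    std::string functName;
};

int main() {
    TimeIt("temporary");   // temporary object, destroyed right here: prints ~0 ms
    TimeIt timer("named"); // named object, destroyed at the end of main
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
}                          // "named" prints ~100 ms here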
In the following code, I have two nested for loops. The second version swaps the order of the loops and runs significantly faster.
Is this purely a cache locality issue (the first code loops over a vector many times, whereas the second code loops over the vector once), or is there something else that I'm not understanding?
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    using namespace std::chrono;
    auto n = 1 << 12;
    std::vector<int> v(n);

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    for (int i = 0; i < (1 << 16); ++i)
    {
        for (const auto val : v) i & val; // result deliberately unused
    }
    high_resolution_clock::time_point t2 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;

    t1 = high_resolution_clock::now();
    for (const auto val : v)
    {
        for (int i = 0; i < (1 << 16); ++i) i & val; // result deliberately unused
    }
    t2 = high_resolution_clock::now();
    time_span = duration_cast<duration<double>>(t2 - t1);
    std::cout << "It took me " << time_span.count() << " seconds.";
    std::cout << std::endl;
}
As written, the second version needs to read each val from vector v only once. The first version reads each val from vector v once in the inner loop for every i, so 65536 times in total.
So without any optimisation, the second loop will be several times faster. With optimisation turned up high enough, the compiler will figure out that all these calculations achieve nothing and are unnecessary, and throw them all away; your execution times will then go down to zero.
If you change the code to do something with the results (like adding up all the values of i & val and then printing the total), a really good compiler may figure out that both pieces of code produce the same result and use the faster method for both cases.
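A sketch of that variant for the first (slow) loop ordering; the vector is filled with an arbitrary non-zero pattern here, since the question's default-constructed vector contains only zeros and would make every i & val trivially zero:

#include <chrono>
#include <iostream>
#include <vector>

int main() {
    using namespace std::chrono;
    const int n = 1 << 12;
    std::vector<int> v(n, 0x55555555); // arbitrary non-zero bit pattern

    long long total = 0;
    auto t1 = high_resolution_clock::now();
    for (int i = 0; i < (1 << 16); ++i)
        for (const auto val : v)
            total += i & val; // the result is now used
    auto t2 = high_resolution_clock::now();

    // Printing total forces the compiler to keep the computation.
    std::cout << "total = " << total << " in "
              << duration_cast<duration<double>>(t2 - t1).count() << " seconds\n";
}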
I created a little performance test comparing the setup and access times of three popular techniques for dynamic allocation: a raw pointer, std::unique_ptr, and std::deque.
EDIT: per #NathanOliver's comment, added std::vector.
EDIT 2: per latedeveloper's answer, allocated with the std::vector(n) and std::deque(n) constructors.
EDIT 3: per #BaummitAugen, moved the allocation inside the timing loop, and compiled an optimized version.
EDIT 4: per #PaulMcKenzie's comments, set the number of runs to 2000.
Results: These changes have tightened things up a lot. Deque and vector are still slower on allocation and assignment, while deque is much slower on access:
pickledEgg$ g++ -std=c++11 -o sp2 -O2 sp2.cpp
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000085643 0.0000000724
Smart: 0.0000085281 0.0000000732
Deque: 0.0000205775 0.0000076908
Vector: 0.0000163492 0.0000000760
Just for fun, here are -Ofast results:
pickledEgg$ g++ -std=c++11 -o sp2 -Ofast sp2.cpp
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000045316 0.0000000893
Smart: 0.0000038308 0.0000000730
Deque: 0.0000165620 0.0000076475
Vector: 0.0000063442 0.0000000699
ORIGINAL: For posterity; note the lack of the -O2 optimizer flag:
pickledEgg$ g++ -std=c++11 -o sp2 sp2.cpp
Average of 100 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000466522 0.0000468586
Smart: 0.0004391623 0.0004406758
Deque: 0.0003144142 0.0021758729
Vector: 0.0004715145 0.0003829193
Updated Code:
#include <iostream>
#include <iomanip>
#include <vector>
#include <deque>
#include <chrono>
#include <memory>

const int NUM_RUNS(2000);

int main() {
    std::chrono::high_resolution_clock::time_point b, e;
    std::chrono::duration<double> t, raw_assign(0), raw_access(0), smart_assign(0), smart_access(0), deque_assign(0), deque_access(0), vector_assign(0), vector_access(0);
    int k, tmp, n(32768);

    std::cout << "Average of " << NUM_RUNS << " runs:" << std::endl;
    std::cout << "Method " << '\t' << "Assign" << "\t\t" << "Access" << std::endl;
    std::cout << "====== " << '\t' << "======" << "\t\t" << "======" << std::endl;

    // Raw
    for (k = 0; k < NUM_RUNS; ++k) {
        b = std::chrono::high_resolution_clock::now();
        int* raw_p = new int[n]; // run-time allocation
        for (int i = 0; i < n; ++i) { //assign
            raw_p[i] = i;
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        raw_assign += t;

        b = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) { //access
            tmp = raw_p[i];
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        raw_access += t;

        delete [] raw_p; // :^)
    }
    raw_assign /= NUM_RUNS;
    raw_access /= NUM_RUNS;
    std::cout << "Raw: " << '\t' << std::setprecision(10) << std::fixed << raw_assign.count() << '\t' << raw_access.count() << std::endl;

    // Smart
    for (k = 0; k < NUM_RUNS; ++k) {
        b = std::chrono::high_resolution_clock::now();
        std::unique_ptr<int []> smart_p(new int[n]); // run-time allocation
        for (int i = 0; i < n; ++i) { //assign
            smart_p[i] = i;
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        smart_assign += t;

        b = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) { //access
            tmp = smart_p[i];
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        smart_access += t;
    }
    smart_assign /= NUM_RUNS;
    smart_access /= NUM_RUNS;
    std::cout << "Smart: " << '\t' << std::setprecision(10) << std::fixed << smart_assign.count() << '\t' << smart_access.count() << std::endl;

    // Deque
    for (k = 0; k < NUM_RUNS; ++k) {
        b = std::chrono::high_resolution_clock::now();
        std::deque<int> myDeque(n);
        for (int i = 0; i < n; ++i) { //assign
            myDeque[i] = i;
            // myDeque.push_back(i);
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        deque_assign += t;

        b = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) { //access
            tmp = myDeque[i];
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        deque_access += t;
    }
    deque_assign /= NUM_RUNS;
    deque_access /= NUM_RUNS;
    std::cout << "Deque: " << '\t' << std::setprecision(10) << std::fixed << deque_assign.count() << '\t' << deque_access.count() << std::endl;

    // Vector
    for (k = 0; k < NUM_RUNS; ++k) {
        b = std::chrono::high_resolution_clock::now();
        std::vector<int> myVector(n);
        for (int i = 0; i < n; ++i) { //assign
            myVector[i] = i;
            // myVector.push_back(i);
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        vector_assign += t;

        b = std::chrono::high_resolution_clock::now();
        for (int i = 0; i < n; ++i) { //access
            tmp = myVector[i];
            // tmp = *(myVector.begin() + i);
        }
        e = std::chrono::high_resolution_clock::now();
        t = std::chrono::duration_cast<std::chrono::duration<double> >(e - b);
        vector_access += t;
    }
    vector_assign /= NUM_RUNS;
    vector_access /= NUM_RUNS;
    std::cout << "Vector:" << '\t' << std::setprecision(10) << std::fixed << vector_assign.count() << '\t' << vector_access.count() << std::endl;

    std::cout << std::endl;
    return 0;
}
As you can see from the results, raw pointers are the clear winner in both categories. Why is this?
Because ...
g++ -std=c++11 -o sp2 sp2.cpp
... you didn't enable optimization. Calling an operator overloaded for a non-fundamental type such as std::vector or std::unique_ptr involves a function call. Using the operators of a fundamental type like a raw pointer does not involve function calls.
A function call is typically slower than no function call, and over many iterations the small overhead of each call adds up. However, an optimizer can expand function calls inline, which eliminates the disadvantage of the non-fundamental types. But only if the optimization is performed.
std::deque has an additional reason for being slower: the algorithm for accessing an arbitrary element of a double-ended queue is more complicated than indexing an array. While std::deque has decent random-access performance, it is not as good as an array's. A more appropriate use case for std::deque is linear iteration (using an iterator).
Furthermore, std::deque::at does bounds checking, while the subscript operator does not. Bounds checking adds runtime overhead.
The slight edge that the raw array appears to have in allocation speed over std::vector may be because std::vector zero-initializes its data.
A std::deque is not stored contiguously: it is typically implemented as a sequence of fixed-size blocks addressed through a map of block pointers, so myDeque[i] needs an extra level of indirection compared to indexing a plain array. That is why access to the deque is so slow.
The initialisation of std::vector is slow when you don't preallocate enough memory: std::vector then starts with a small number of elements and usually doubles its capacity as soon as you try to insert more, and every reallocation moves all existing elements. Try constructing the vector with its element count:
std::vector<int> myVector(n);
(Note the parentheses: std::vector<int> myVector{n} would instead create a vector containing the single element n.)
In the vector access loop, the commented-out alternative tmp = *(myVector.begin() + i) is much worse than tmp = myVector[i]: instead of calling the index operator once, it instantiates an iterator, calls its + operator, and then calls the dereference operator on the result. Since you are not optimizing, those function calls will probably not be inlined, which is why std::vector access is slower than the raw pointer.
For std::unique_ptr I suppose the reasons are similar to std::vector's: you always call the index operator on the unique pointer, which is a function call as well. As an experiment, immediately after allocating the memory for smart_p, call smart_p.get() and use the raw pointer for the rest of the operations. I assume it will be just as fast as the raw pointer, which would support my assumption that the function calls are to blame. The simple advice, then, is: enable optimizations and try again.
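A sketch of that experiment; the std::unique_ptr keeps ownership, while the timed loop goes through the raw pointer returned by get():

#include <chrono>
#include <iostream>
#include <memory>

int main() {
    const int n = 32768;
    std::unique_ptr<int[]> smart_p(new int[n]);
    int* p = smart_p.get(); // raw view; no operator[] call on the unique_ptr

    auto b = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) {
        p[i] = i; // plain pointer arithmetic, no overloaded operator
    }
    auto e = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration<double>(e - b).count() << " s\n";
}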
kmiklas edit:
Average of 2000 runs:
Method Assign Access
====== ====== ======
Raw: 0.0000086415 0.0000000681
Smart: 0.0000081824 0.0000000670
Deque: 0.0000204542 0.0000076554
Vector: 0.0000164252 0.0000000678
This question already has answers here:
What are the uses of std::chrono::high_resolution_clock?
(2 answers)
Closed 6 years ago.
So I was trying to use std::chrono::high_resolution_clock to time how long something takes to execute. I figured that you can just find the difference between the start time and the end time...
To check that my approach works, I made the following program:
#include <iostream>
#include <chrono>
#include <vector>

void long_function();

int main()
{
    std::chrono::high_resolution_clock timer;
    auto start_time = timer.now();
    long_function();
    auto end_time = timer.now();
    auto diff_millis = std::chrono::duration_cast<std::chrono::duration<int, std::milli>>(end_time - start_time);
    std::cout << "It took " << diff_millis.count() << "ms" << std::endl;
    return 0;
}

void long_function()
{
    //Should take a while to execute.
    //This is calculating the first 100 million
    //fib numbers and storing them in a vector.
    //Well, it doesn't actually, because it
    //overflows very quickly, but the point is it
    //should take a few seconds to execute.
    std::vector<unsigned long> numbers;
    numbers.push_back(1);
    numbers.push_back(1);
    for(int i = 2; i < 100000000; i++)
    {
        numbers.push_back(numbers[i-2] + numbers[i-1]);
    }
}
The problem is, it outputs exactly 3000 ms, when it clearly didn't take that long.
On shorter problems it just outputs 0 ms... What am I doing wrong?
EDIT: If it's of any use, I'm using the GNU GCC compiler with the -std=c++0x flag.
The resolution of the high_resolution_clock depends on the platform.
Printing the following will give you an idea of the resolution of the implementation you are using:
std::cout << "It took " << std::chrono::nanoseconds(end_time - start_time).count() << std::endl;
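You can also query the clock's declared tick period and whether it is steady; a small sketch:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;
    // period is a std::ratio: seconds per tick of the clock.
    std::cout << "tick: " << clock::period::num << "/" << clock::period::den
              << " s, steady: " << std::boolalpha << clock::is_steady << "\n";
}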
I ran into a similar problem with g++ (rev5, Built by MinGW-W64 project) 4.8.1 under Windows 7.
#include <chrono>
#include <iostream>

int main()
{
    auto start_time = std::chrono::high_resolution_clock::now();
    int temp(1);
    const int n(1e7);
    for (int i = 0; i < n; i++)
        temp += temp;
    auto end_time = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end_time - start_time).count() << " ns.";
    return 0;
}
With n = 1e7 it displays 19999800 ns, but with n = 1e6 it displays 0 ns.
The precision seems weak.
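The declared period can be finer than what the clock actually delivers, so a more telling check is to spin until now() returns a new value; a minimal sketch:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::high_resolution_clock;
    auto t0 = clock::now();
    auto t1 = t0;
    while (t1 == t0) t1 = clock::now(); // busy-wait for the next tick
    std::cout << "smallest observable step: "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
              << " ns\n";
}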