I have the following fragment of code. It contains three sections where I measure memory access runtime. The first is plain iteration over the array. The second is almost the same, except that the array address is obtained from a function call. The third is the same as the second but manually optimized.
#include <map>
#include <cstdlib>
#include <chrono>
#include <iostream>
std::map<void*, void*> cache;
constexpr int elems = 1000000;
double x[elems] = {};
template <typename T>
T& find_in_cache(T& var) {
void* key = &var;
void* value = nullptr;
if (cache.count(key)) {
value = cache[key];
} else {
value = malloc(sizeof(T));
cache[key] = value;
}
return *(T*)value;
}
int main() {
std::chrono::duration<double> elapsed_seconds1, elapsed_seconds2, elapsed_seconds3;
for (int k = 0; k < 100; k++) { // account for cache effects
// first section
auto start = std::chrono::steady_clock::now();
for (int i = 1; i < elems; i++) {
x[i] = (x[i-1] + 1.0) * 1.001;
}
auto end = std::chrono::steady_clock::now();
elapsed_seconds1 = end-start;
// second section
start = std::chrono::steady_clock::now();
for (int i = 1; i < elems; i++) {
find_in_cache(x)[i] = (find_in_cache(x)[i-1] + 1.0) * 1.001;
}
end = std::chrono::steady_clock::now();
elapsed_seconds2 = end-start;
// third section
start = std::chrono::steady_clock::now();
double* y = find_in_cache(x);
for (int i = 1; i < elems; i++) {
y[i] = (y[i-1] + 1.0) * 1.001;
}
end = std::chrono::steady_clock::now();
elapsed_seconds3 = end-start;
}
std::cout << "elapsed time 1: " << elapsed_seconds1.count() << "s\n";
std::cout << "elapsed time 2: " << elapsed_seconds2.count() << "s\n";
std::cout << "elapsed time 3: " << elapsed_seconds3.count() << "s\n";
return x[elems - 1]; // prevent optimizing away
}
The timings of these sections are as follows:
elapsed time 1: 0.0018678s
elapsed time 2: 0.00423903s
elapsed time 3: 0.00189678s
Is it possible to change the interface of find_in_cache() without changing the body of the second iteration section to make its performance the same as section 3?
template <typename T>
[[gnu::const]]
T& find_in_cache(T& var) { ... }
lets clang optimize the code the way you want, but gcc fails to treat the call as a loop invariant, even with gnu::noinline to make sure the attribute is not lost (maybe worth a bug report?).
How safe such code is may depend on the rest of your code. The attribute is a lie, since the function does access memory, but that may be acceptable if the memory is private enough to the function. Preventing inlining of find_in_cache may help reduce the risks.
You can also convince gcc to optimize with
template <typename T>
[[gnu::const,gnu::noinline]]
T& find_in_cache(T& var) noexcept { ... }
which would cause your program to terminate if there isn't enough memory to add an element to the cache.
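Putting the pieces together, the full definition would look like this (the body is unchanged from the question; only the declaration changes):
template <typename T>
[[gnu::const, gnu::noinline]]
T& find_in_cache(T& var) noexcept {
    void* key = &var;
    void* value = nullptr;
    if (cache.count(key)) {
        value = cache[key];
    } else {
        value = malloc(sizeof(T));
        cache[key] = value;
    }
    return *(T*)value;
}
With this declaration the calls inside the second loop can be treated as loop-invariant, which is what makes section 2 match section 3.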
I have written two functions to compare the time cost of std::vector and a dynamically allocated array.
#include <iostream>
#include <vector>
#include <chrono>
void A() {
auto t1 = std::chrono::high_resolution_clock::now();
std::vector<float> data(5000000);
auto t2 = std::chrono::high_resolution_clock::now();
float *p = data.data();
for (int i = 0; i < 5000000; ++i) {
p[i] = 0.0f;
}
auto t3 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << " us\n";
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count() << " us\n";
}
void B() {
auto t1 = std::chrono::high_resolution_clock::now();
auto* data = new float [5000000];
auto t2 = std::chrono::high_resolution_clock::now();
float *ptr = data;
for (int i = 0; i < 5000000; ++i) {
ptr[i] = 0.0f;
}
auto t3 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() << " us\n";
std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count() << " us\n";
}
int main(int argc, char** argv) {
A();
B();
return 0;
}
A() costs about 6000 us to initialize the vector, then 1400 us to fill it with zeros.
B() costs less than 10 us to allocate the memory, then 5800 us to fill it with zeros.
Why do their time costs differ so much?
compiler: g++=9.3.0
flags: -O3 -DNDEBUG
First, note that the std::vector<float> constructor already zeros the vector.
There are many plausible system-level explanations for the behavior you observe:
One very plausible explanation is caching: when you allocate the array using new, the memory referenced by the returned pointer is not in the cache. When you create the vector, its constructor zeros the allocated memory area under the hood, thereby bringing that memory into the cache. The subsequent zeroing therefore hits in the cache.
Other reasons might include compiler optimizations. A compiler might realize that your zeroing is unnecessary with std::vector. Given the figures you obtained, I would discount this here, though.
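One way to test the caching/first-touch explanation above (a sketch of my own, not part of the original answer; B_warm is a hypothetical name): touch the new[] memory once before starting the timer, so the timed loop runs over memory that is already paged in and warm. If caching is the cause, the timed fill should then be about as fast as in A().
void B_warm() {
    auto* data = new float[5000000];
    for (int i = 0; i < 5000000; ++i)
        data[i] = 1.0f;                   // first touch: page faults and cache fills happen here, untimed
    auto t2 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 5000000; ++i)
        data[i] = 0.0f;                   // timed fill now runs over warm memory
    auto t3 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(t3 - t2).count() << " us\n";
    delete[] data;
}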
QuickBench is a nice tool for comparing different ways of doing the same thing.
https://quick-bench.com/q/p4ThYlVCa7VbO6vy6LEVVZ_0CVs
Your array example leaves a huge memory leak and QuickBench gives an error because of that.
The code I used (added two more variants):
static void Vector(benchmark::State& state) {
// Code inside this loop is measured repeatedly
for (auto _ : state) {
std::vector<float> data(500000);
float *p = data.data();
for (int i = 0; i < 500000; ++i) {
p[i] = 0.0f;
}
// Make sure the variable is not optimized away by compiler
benchmark::DoNotOptimize(data);
}
}
// Register the function as a benchmark
BENCHMARK(Vector);
static void VectorPushBack(benchmark::State& state) {
for (auto _ : state) {
std::vector<float> data;
for (int i = 0; i < 500000; ++i) {
data.push_back(0.0f);
}
benchmark::DoNotOptimize(data);
}
}
BENCHMARK(VectorPushBack);
static void VectorInit(benchmark::State& state) {
for (auto _ : state) {
std::vector<float> data(500000, 0.0f);
benchmark::DoNotOptimize(data);
}
}
BENCHMARK(VectorInit);
static void Array(benchmark::State& state) {
for (auto _ : state) {
auto* data = new float [500000];
float *ptr = data;
for (int i = 0; i < 500000; ++i) {
ptr[i] = 0.0f;
}
benchmark::DoNotOptimize(data);
delete[] data;
}
}
BENCHMARK(Array);
static void ArrayInit(benchmark::State& state) {
for (auto _ : state) {
auto* data = new float [500000]();
benchmark::DoNotOptimize(data);
delete[] data;
}
}
BENCHMARK(ArrayInit);
static void ArrayMemoryLeak(benchmark::State& state) {
for (auto _ : state) {
auto* data = new float [500000];
float *ptr = data;
for (int i = 0; i < 500000; ++i) {
ptr[i] = 0.0f;
}
benchmark::DoNotOptimize(data);
}
}
//BENCHMARK(ArrayMemoryLeak);
Results:
All variants but the push_back one are almost the same in runtime. But the vector is much safer. It's very easy to forget to free the memory (as you demonstrated yourself).
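If you do want a raw heap buffer without the leak risk, a std::unique_ptr<float[]> variant (a sketch of my own; ArrayUnique is a hypothetical name and it needs <memory>) frees the memory automatically and should benchmark in the same ballpark:
static void ArrayUnique(benchmark::State& state) {
    for (auto _ : state) {
        auto data = std::make_unique<float[]>(500000);   // value-initializes, i.e. zeros, the elements
        benchmark::DoNotOptimize(data);
    }   // memory is released here automatically
}
BENCHMARK(ArrayUnique);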
EDIT: Fixed the mistake in the push_back variant. Thanks to t.niese and Scheff's Cat for pointing it out and fixing it.
Why does the combination of find + insert work faster than the single insert statements?
#include <chrono>
#include <iostream>
#include <unordered_set>
int main()
{
{
auto t1 = std::chrono::high_resolution_clock::now();
auto elements = 100000000;
std::unordered_set<int> s;
s.reserve(elements);
for (int i = 0; i < elements; ++i)
{
auto it = s.find(i % 2);
if (it == s.end())
{
s.insert(i % 2);
}
}
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
}
{
auto t1 = std::chrono::high_resolution_clock::now();
auto elements = 100000000;
std::unordered_set<int> s;
s.reserve(elements);
for (int i = 0; i < elements; ++i)
{
s.insert(i % 2);
}
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << std::endl;
}
}
This code gives me the following results in MSVC-14.0 (Release configuration ofc):
716
1005
Since the elements you're adding are, most of the time, already in the set, insert has more work to do than find: it needs to construct a pair holding the iterator to the existing element and a boolean indicating that the element was already there, whereas find only has to return the iterator. You can look at the library code to see this.
A more accurate title and question would almost give you the answer. Since you're only inserting two elements and then checking for them around 100,000,000 times, a better title would end with "when the elements are already in the set", and a better question would be "why does find work faster than insert?".
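A simplified sketch of the idea (pseudocode in the shape of the library, not the actual implementation; lookup and insert_unique are hypothetical helpers): insert has to do the same hashing and lookup that find does, and on top of that it constructs its pair<iterator, bool> return value even when nothing is inserted.
std::pair<iterator, bool> insert(const value_type& value) {
    iterator it = lookup(value);             // same hashing walk that find() performs
    if (it != end())
        return {it, false};                  // extra work versus find(): build the pair anyway
    return {insert_unique(value), true};     // only reached twice in your benchmark
}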
I want to use several functions that declare the same array but in different ways (statically, on the stack, and on the heap), and to display the execution time of each function. Finally, I want to call those functions several times.
I think I've managed to do everything, but for the execution time of the functions I'm constantly getting 0, and I don't know whether that is to be expected. Could somebody confirm it for me? Thanks
Here's my code
#include "stdafx.h"
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>
#define size 100000
using namespace std;
void prem(){
auto start = std::chrono::high_resolution_clock::now();
static int array[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed timefor static: " << elapsed.count() << " s\n";
}
void first(){
auto start = std::chrono::high_resolution_clock::now();
int array[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time on the stack: " << elapsed.count() << " s\n";
}
void secon(){
auto start = std::chrono::high_resolution_clock::now();
int *array = new int[size];
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time dynamic: " << elapsed.count() << " s\n";
delete[] array;
}
int main()
{
for (int i = 0; i <= 1000; i++){
prem();
first();
secon();
}
return 0;
}
prem() - the array is allocated outside of the function
first() - the array is allocated before your code gets to it
You are looping over all 3 functions in a single loop. Why? Didn't you mean to loop 1000 times over each one separately, so that they (hopefully) don't affect each other? In practice that last assumption doesn't hold, though.
My suggestions:
Loop over each function separately
Make the now() calls around the entire 1000-iteration loop: take one timestamp before you enter the loop and one after you exit it, then take the difference and divide it by the number of iterations (1000)
Dynamic allocation can be (trivially) reduced to just grabbing a block of memory in the vast available address space (I assume you are running on a 64-bit platform), and unless you actually use that memory the OS doesn't even need to make sure it is in RAM. That would certainly skew your results significantly
Write a "driver" function that takes a pointer to the function to "test"
Possible implementation of that driver() function:
void driver( void(*_f)(), int _iter, std::string _name){
auto start = std::chrono::high_resolution_clock::now();
for(int i = 0; i < _iter; ++i){
(*_f)();
}
auto finish = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = finish - start;
std::cout << "Elapsed time " << _name << ": " << elapsed.count() / _iter << " s" << std::endl;
}
That way your main() looks like this:
int main(){
const int iterations = 1000;
driver(prem, iterations, "static allocation");
driver(first, iterations, "stack allocation");
driver(secon, iterations, "dynamic allocation");
}
Do not do such synthetic tests because the compiler will optimize out everything that is not used.
As another answer suggests, you need to measure the time for the entire 1000 loops. And even then, I do not think you will get reasonable results.
Let's do not 1000 iterations, but 1,000,000. And let's add another case, where we just make two subsequent calls to chrono::high_resolution_clock::now(), as a baseline:
#include <iostream>
#include <time.h>
#include <stdio.h>
#include <chrono>
#include <string>
#include <functional>
#define size 100000
using namespace std;
void prem() {
static int array[size];
}
void first() {
int array[size];
}
void second() {
int *array = new int[size];
delete[] array;
}
void PrintTime(std::chrono::duration<double> elapsed, int count, std::string msg)
{
std::cout << msg << elapsed.count() / count << " s\n";
}
int main()
{
int iterations = 1000000;
{
auto start = std::chrono::high_resolution_clock::now();
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time for nothing: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i <= iterations; i++)
{
prem();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed timefor static: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i <= iterations; i++)
{
first();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time on the stack: ");
}
{
auto start = std::chrono::high_resolution_clock::now();
for (int i = 0; i <= iterations; i++)
{
second();
}
auto finish = std::chrono::high_resolution_clock::now();
PrintTime(finish - start, iterations, "Elapsed time dynamic: ");
}
return 0;
}
With all optimisations on, I get this result:
Elapsed time for nothing: 3.11e-13 s
Elapsed timefor static: 3.11e-13 s
Elapsed time on the stack: 3.11e-13 s
Elapsed time dynamic: 1.88703e-07 s
That basically means that the compiler actually optimized out prem() and first(): not just the calls, but the entire loops, because they have no side effects.
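One way to keep such loops alive (a sketch of my own, not part of the original answer): give each function an observable side effect, for example through a volatile sink. Note that a clever optimizer may still drop the array itself, so benchmark::DoNotOptimize from Google Benchmark is the more robust tool for this.
volatile int g_sink = 0;   // hypothetical global; volatile accesses count as observable behaviour

void first() {
    int array[size];
    array[size - 1] = g_sink;   // read through the volatile so the stored value is not a known constant
    g_sink = array[size - 1];   // write through the volatile so the array counts as "used"
}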
Here is the C++ code; I use VS2013, release mode:
#include <ctime>
#include <iostream>
void Tempfunction(double& a, int N)
{
a = 0;
for (double i = 0; i < N; ++i)
{
a += i;
}
}
int main()
{
int N = 1000; // from 1000 to 8000
double Value = 0;
auto t0 = std::time(0);
for (int i = 0; i < 1000000; ++i)
{
Tempfunction(Value, N);
}
auto t1 = std::time(0);
auto Tempfunction_time = t1-t0;
std::cout << "Tempfunction_time = " << Tempfunction_time << '\n';
auto TempfunctionPtr = &Tempfunction;
Value = 0;
t0 = std::time(0);
for (int i = 0; i < 1000000; ++i)
{
(*TempfunctionPtr)(Value, N);
}
t1 = std::time(0);
auto TempfunctionPtr_time = t1-t0;
std::cout << "TempfunctionPtr_time = " << TempfunctionPtr_time << '\n';
std::system("pause");
}
I change the value of N from 1000 to 8000, and record Tempfunction_time and TempfunctionPtr_time.
The results are weird:
N=1000 , Tempfunction_time=1, TempfunctionPtr_time=2;
N=2000 , Tempfunction_time=2, TempfunctionPtr_time=6;
N=4000 , Tempfunction_time=4, TempfunctionPtr_time=11;
N=8000 , Tempfunction_time=8, TempfunctionPtr_time=21;
TempfunctionPtr_time - Tempfunction_time is not constant,
and TempfunctionPtr_time is roughly 2-3 times Tempfunction_time.
The difference should be a constant, namely the overhead of the function pointer call.
What is wrong?
EDIT:
Assume VS2013 inlines Tempfunction when it is called directly as Tempfunction(), and does not inline it when it is called through (*TempfunctionPtr); then we can explain the difference. So, if that is true, why can the compiler not inline the call through (*TempfunctionPtr)?
I compiled the existing code with g++ on my Linux machine, and I found that the time was too short to be measured accurately in seconds, so I rewrote it to use std::chrono to measure the time more precisely. I also had to "use" the variable Value (hence the "499500" being printed below), otherwise the compiler would completely optimise away the first loop. Then I get the following result:
Tempfunction_time = 1.47983
499500
TempfunctionPtr_time = 1.69183
499500
Now, the results I have are for GCC (version 4.6.3 - other versions are available and may give other results!), which is not the same compiler as Microsoft's, so the results may differ - different compilers optimise code quite differently at times. I'm actually quite surprised that the compiler doesn't figure out that the result of Tempfunction only needs calculating once. But hey, that made it easier to write the benchmark without trickery.
My second observation is that, with my compiler, if I replace int N = 1000; with a loop for (int N = 1000; N <= 8000; N *= 2) around the main code, there is no or very little difference between the two cases. I'm not entirely sure why, because the code looks identical (there is no call via a function pointer, because the compiler knows that the function pointer is a constant), and Tempfunction gets inlined in both cases. (The same "equality" happens when N is other values than 1000, so I'm far from sure what is going on here....)
To actually measure the difference between a function pointer and a direct function call, you would need to move Tempfunction into a separate file, and "hide" the actual value stored in TempfunctionPtr such that the compiler doesn't figure out exactly what you are doing.
In the end, I ended up with something like this:
typedef void (*FunPtr)(double &a, int N);
void Tempfunction(double& a, int N)
{
a = 0;
for (double i = 0; i < N; ++i)
{
a += i;
}
}
FunPtr GetFunPtr()
{
return &Tempfunction;
}
And the "main" code like this:
#include <iostream>
#include <chrono>
typedef void (*FunPtr)(double &a, int N);
extern void Tempfunction(double& a, int N);
extern FunPtr GetFunPtr();
int main()
{
for(int N = 1000; N <= 8000; N *= 2)
{
std::cout << "N=" << N << std::endl;
double Value = 0;
auto t0 = std::chrono::system_clock::now();
for (int i = 0; i < 1000000; ++i)
{
Tempfunction(Value, N);
}
auto t1 = std::chrono::system_clock::now();
std::chrono::duration<double> Tempfunction_time = t1-t0;
std::cout << "Tempfunction_time = " << Tempfunction_time.count() << '\n';
std::cout << Value << std::endl;
auto TempfunctionPtr = GetFunPtr();
Value = 0;
t0 = std::chrono::system_clock::now();
for (int i = 0; i < 1000000; ++i)
{
(*TempfunctionPtr)(Value, N);
}
t1 = std::chrono::system_clock::now();
std::chrono::duration<double> TempfunctionPtr_time = t1-t0;
std::cout << "TempfunctionPtr_time = " << TempfunctionPtr_time.count() << '\n';
std::cout << Value << std::endl;
}
}
However, the difference is thousandths of a second and neither variant is a clear winner, so the only conclusion is the obvious one, that "calling a function is slower than inlining it".
N=1000
Tempfunction_time = 1.78323
499500
TempfunctionPtr_time = 1.77822
499500
N=2000
Tempfunction_time = 3.54664
1.999e+06
TempfunctionPtr_time = 3.54687
1.999e+06
N=4000
Tempfunction_time = 7.0854
7.998e+06
TempfunctionPtr_time = 7.08706
7.998e+06
N=8000
Tempfunction_time = 14.1597
3.1996e+07
TempfunctionPtr_time = 14.1577
3.1996e+07
Of course, if we do "only half the hiding trick", so that the function is known and inlineable in the first case and called through an opaque function pointer in the second, we can perhaps expect a difference. But calling a function through a pointer is in itself not expensive. The real difference comes when the compiler decides to inline the function.
Obviously, these are the results of GCC 4.6.3, which is not the same compiler as MSVS2013. You should make the "chrono" modifications that are in the above code, and see what difference it makes.
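A lighter-weight alternative to the separate-file trick (my assumption, not something the answer used) is to forbid inlining explicitly, so the direct call stays a real call and can be compared with the call through the pointer on equal footing. __attribute__((noinline)) is a GCC/Clang extension; MSVC spells it __declspec(noinline).
__attribute__((noinline))
void Tempfunction(double& a, int N)
{
    a = 0;
    for (double i = 0; i < N; ++i)
    {
        a += i;
    }
}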
I'm reading about STL algorithms and the book pointed out that algorithms like find use a while loop rather than a for loop because it is minimal, efficient, and uses one less variable. I decided to do some testing and the results didn't really match up.
The forfind consistently performed better than the whilefind. At first I simply tested by pushing 10000 ints into a vector and then using find to look up a single value and assign the result to an iterator. I timed it and output that time.
Then I decided to change it so that the forfind and whilefind functions were used multiple times (in this case 10000 times). However, the for loop find still came up with better performance than the while find. Can anyone explain this? Here is the code.
#include "std_lib_facilities.h"
#include<ctime>
template<class ln, class T>
ln whilefind(ln first, ln last, const T& val)
{
while (first!=last && *first!=val) ++first;
return first;
}
template<class ln, class T>
ln forfind(ln first, ln last, const T& val)
{
for (ln p = first; p!=last; ++p)
if(*p = val) return p;
return last;
}
int main()
{
vector<int> numbers;
vector<int>::iterator whiletest;
vector<int>::iterator fortest;
for (int n = 0; n < 10000; ++n)
numbers.push_back(n);
clock_t while1 = clock(); // start
for (int i = 0; i < 10000; ++i)
whiletest = whilefind(numbers.begin(), numbers.end(), i);
clock_t while2 = clock(); // stop
clock_t for1 = clock(); // start
for (int i = 0; i < 10000; ++i)
fortest = forfind(numbers.begin(), numbers.end(), i);
clock_t for2 = clock(); // stop
cout << "While loop: " << double(while2-while1)/CLOCKS_PER_SEC << " seconds.\n";
cout << "For loop: " << double(for2-for1)/CLOCKS_PER_SEC << " seconds.\n";
}
The while loop consistently reports taking around .78 seconds and the for loop reports .67 seconds.
if(*p = val) return p;
That should be a ==. Because of the accidental assignment, forfind only goes through the entire vector for the first value, 0 (the assignment yields 0, which is false), and it returns immediately at the first element for the values 1-9999, which is why it appears faster than whilefind.
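For reference, the corrected forfind (a sketch with the comparison fixed); with == both functions do the same work per call and should report essentially the same times:
template<class ln, class T>
ln forfind(ln first, ln last, const T& val)
{
    for (ln p = first; p != last; ++p)
        if (*p == val) return p;   // compare, don't assign
    return last;
}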