C++11 tuple performance

I am just about to make my code more general by using std::tuple in a lot of places, including the single-element case; I mean, for example, tuple<double> instead of double. But I decided to check the performance of this particular case first.
Here is a simple performance benchmark:
#include <tuple>
#include <iostream>
using std::cout;
using std::endl;
using std::get;
using std::tuple;
int main(void)
{
#ifdef TUPLE
    using double_t = std::tuple<double>;
#else
    using double_t = double;
#endif
    constexpr int count = 1e9;
    auto array = new double_t[count];
    long long sum = 0;
    for (int idx = 0; idx < count; ++idx) {
#ifdef TUPLE
        sum += get<0>(array[idx]);
#else
        sum += array[idx];
#endif
    }
    delete[] array;
    cout << sum << endl; // just "external" side effect for variable sum.
}
And run results:
$ g++ -DTUPLE -O2 -std=c++11 test.cpp && time ./a.out
0
real 0m3.347s
user 0m2.839s
sys 0m0.485s
$ g++ -O2 -std=c++11 test.cpp && time ./a.out
0
real 0m2.963s
user 0m2.424s
sys 0m0.519s
I thought that tuple was a strict, statically compiled template, and that in this case get<> would boil down to an ordinary variable access. By the way, the memory allocation sizes in this test are the same.
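One quick way to check that size claim (a sketch of my own; the standard does not guarantee a zero-overhead layout, but it holds for libstdc++ and libc++):
#include <tuple>

// If this assertion holds, tuple<double> occupies exactly as much storage
// as a plain double on this implementation.
static_assert(sizeof(std::tuple<double>) == sizeof(double),
              "tuple<double> has no storage overhead here");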
Why does this execution-time difference happen?
EDIT: The problem was the initialization of the tuple<> objects. To make the test accurate, one line must be changed:
constexpr int count = 1e9;
- auto array = new double_t[count];
+ auto array = new double_t[count]();
long long sum = 0;
After that one can observe similar results:
$ g++ -DTUPLE -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real 0m3.342s
real 0m3.339s
real 0m3.343s
$ g++ -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real 0m3.349s
real 0m3.339s
real 0m3.334s

The tuple default-constructs all of its values (so everything is 0), while the plain doubles do not get initialized at all.
In the generated assembly, the following initialization loop is present only when using tuples. Otherwise the two versions are equivalent.
.L2:
movq $0, (%rdx)
addq $8, %rdx
cmpq %rcx, %rdx
jne .L2
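To make the initialization difference explicit, here is a minimal sketch of my own (not from the answer) showing which forms of new[] leave the elements uninitialized and which zero them:
#include <tuple>

int main()
{
    auto a = new double[4];              // default-initialized: values are indeterminate
    auto b = new std::tuple<double>[4];  // tuple's default constructor value-initializes
                                         // the contained double to 0.0
    auto c = new double[4]();            // value-initialized: every element is 0.0
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}
With the () added in the EDIT above, both variants pay for the same zeroing pass, which is why the timings converge.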

Related

Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have an AMD Ryzen 9 5950X, clang++ 10 and Linux; I copy-pasted the code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the unsorted version turning out to be faster by 3 ms is just noise, but it seems it is not slower anymore.
So, what has changed in CPU architecture such that it is no longer an order of magnitude slower?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;
    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);
    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
Update
With a larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant: there is almost no difference.
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.
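For readers who want a concrete picture of that masking trick, here is a hand-written SSE2 sketch of my own (a simplified illustration, not the compiler's actual output; the real code keeps the partial sums in vector registers and sign-extends to 64 bits as described above):
#include <emmintrin.h> // SSE2 intrinsics

long long masked_sum(const int* data, unsigned n)
{
    const __m128i threshold = _mm_set1_epi32(127);    // x >= 128  <=>  x > 127
    long long sum = 0;
    unsigned c = 0;
    for (; c + 4 <= n; c += 4) {
        __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + c));
        __m128i mask = _mm_cmpgt_epi32(v, threshold); // pcmpgtd: all-ones where data[c] >= 128
        __m128i kept = _mm_and_si128(v, mask);        // pand: non-matching elements become 0
        alignas(16) int tmp[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(tmp), kept);
        sum += tmp[0] + tmp[1] + tmp[2] + tmp[3];     // accumulate without a data-dependent branch
    }
    for (; c < n; ++c)                                // scalar tail for leftover elements
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}
Every element is loaded, compared and added on every iteration, so the running time no longer depends on whether the data is sorted.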

GCC not performing loop invariant code motion

I decided to check the result of the loop-invariant code motion optimization using g++. However, when I compiled the following code with -fmove-loop-invariants and analysed its assembly, I saw that the k + 17 calculation is still performed in the loop body.
What could prevent the compiler from optimizing it?
Maybe the compiler concludes that it is more efficient to recalculate k + 17?
#include <cstdio>
#include <iostream>

int main()
{
    int k = 0;
    std::cin >> k;
    for (int i = 0; i < 10000; ++i)
    {
        int n = k + 17; // not moved out of the loop
        printf("%d\n", n);
    }
    return 0;
}
I tried g++ -O0 -fmove-loop-invariants, g++ -O3, and g++ -O3 -fmove-loop-invariants, using both g++ 4.6.3 and g++ 4.8.3.
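For reference, the hand-hoisted form I expected the optimizer to produce (my own sketch; it prints exactly the same output, since k does not change inside the loop) would be:
#include <cstdio>
#include <iostream>

int main()
{
    int k = 0;
    std::cin >> k;
    const int n = k + 17; // invariant computed once, before the loop
    for (int i = 0; i < 10000; ++i)
        printf("%d\n", n);
    return 0;
}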
EDIT: Ignore my previous answer. You can see that the calculation has been folded into a constant, so the compiler is in fact performing the loop-invariant optimization.
Because of the as-if rule. Simply put, the compiler is not allowed to make any optimizations that may affect the observable behavior of the program, in this case the printf. You can see what happens if you make n volatile and remove the printf:
for (int i = 0; i < 10000; ++i)
{
    volatile int n = k + 17; // the addition can now be hoisted; only the store to n must stay in the loop
}
// Example assembly output for GCC 4.6.4
// ...
movl $10000, %eax
addl $17, %edx
.L2:
subl $1, %eax
movl %edx, 12(%rsp)
// ...
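For completeness, here is a self-contained version of that volatile experiment (my own reconstruction, not the answerer's exact test program):
#include <iostream>

int main()
{
    int k = 0;
    std::cin >> k;
    for (int i = 0; i < 10000; ++i)
    {
        // The addition k + 17 is loop-invariant and may be hoisted;
        // only the store to the volatile object has to stay inside the loop.
        volatile int n = k + 17;
    }
    return 0;
}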

Why is dividing slower than bitshifting in C++?

I wrote two pieces of code, one that divides a random number by two, and one that bitshifts the same random number right once. As I understand it, this should produce the same result. However, when I time both pieces of code, I consistently get data saying that the shifting is faster. Why is that?
Shifting code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for (int i = 0; i < iterations; i++) {
    if (i % 2 == 0) {
        result = result + (rand() >> 1);
    } else {
        result = result - (rand() >> 1);
    }
}
Dividing code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for (int i = 0; i < iterations; i++) {
    if (i % 2 == 0) {
        result = result + (rand() / 2);
    } else {
        result = result - (rand() / 2);
    }
}
Timing and results:
$ time ./divide 1000000; time ./shift 1000000
Doing 1e+09 iterations.
real 0m12.291s
user 0m12.260s
sys 0m0.021s
Doing 1e+09 iterations.
real 0m12.091s
user 0m12.056s
sys 0m0.019s
$ time ./shift 1000000; time ./divide 1000000
Doing 1e+09 iterations.
real 0m12.083s
user 0m12.028s
sys 0m0.035s
Doing 1e+09 iterations.
real 0m12.198s
user 0m12.158s
sys 0m0.028s
Additional information:
I am not using any optimizations when compiling.
I am running this on a virtualized install of Fedora 20, kernel: 3.12.10-300.fc20.x86_64
It's not; it's slower on the architecture you're running on. It's almost always slower, because the hardware behind bit shifting is trivial while division is a bit of a nightmare. In base 10, what's easier for you: 78358582354 >> 3 or 78358582354 / 85? Instructions generally take the same time to execute regardless of input, and in your case it's the compiler's job to convert /2 to >>1; the CPU just does as it's told.
It isn't actually slower. I've run your benchmark using nonius like so:
#define NONIUS_RUNNER
#include "Nonius.h++"
#include <type_traits>
#include <random>
#include <vector>

NONIUS_BENCHMARK("Divide", [](nonius::chronometer meter)
{
    std::random_device rd;
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> storage(meter.runs());
    meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) >> 1) : storage[i] + (dist(rd) >> 1); });
})

NONIUS_BENCHMARK("std::string destruction", [](nonius::chronometer meter)
{
    std::random_device rd;
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> storage(meter.runs());
    meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) / 2) : storage[i] + (dist(rd) / 2); });
})
And these are the results:
As you can see, both of them are neck and neck.
(You can find the html output here)
P.S: It seems I forgot to rename the second test. My bad.
It seems that the difference in results is below the spread of the results, so you can't really tell whether there is any difference. But in general, division can't be done in a single operation while a bit shift can, so a bit shift usually should be faster.
But as you have the literal 2 in your code, I would guess that the compiler, even without optimizations, produces identical code.
Note that rand returns int, and dividing an int (signed by default) by 2 is not the same as shifting it right by 1. You can easily check the generated asm and see the difference, or simply check the resulting binary sizes:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 4016 ... boo # divide
... 3984 ... foo # shift
Now add the static_cast patch:
if (i % 2 == 0) {
    result = result + (static_cast<unsigned>(rand()) / 2);
}
else {
    result = result - (static_cast<unsigned>(rand()) / 2);
}
and check the size again:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 3984 ... boo # divide
... 3984 ... foo # shift
To be sure, you can verify that the generated asm in both binaries is the same.
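To see why the signed case cannot be a plain shift, consider negative inputs (a small example of my own; right-shifting a negative signed value is implementation-defined in C++11, though it is an arithmetic shift on typical two's-complement targets):
#include <iostream>

int main()
{
    int x = -7;
    std::cout << (x / 2)  << '\n'; // -3: integer division rounds toward zero
    std::cout << (x >> 1) << '\n'; // typically -4: arithmetic shift rounds toward negative infinity
    return 0;
}
That rounding adjustment is the extra code the compiler has to emit for signed division by 2, and it disappears once the operand is cast to unsigned.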

g++ c++11 constexpr evaluation performance

g++ (4.7.2) and similar versions seem to evaluate constexpr expressions surprisingly fast at compile time. On my machines it is, in fact, much faster than the compiled program at runtime.
Is there a reasonable explanation for that behavior?
Are there optimization techniques involved that are only applicable at compile time and can be executed faster than the actual compiled code? If so, which?
Here's my test program and the observed results.
#include <iostream>

constexpr int mc91(int n)
{
    return (n > 100) ? n - 10 : mc91(mc91(n + 11));
}

constexpr double foo(double n)
{
    return (n > 2) ? (0.9999) * ((unsigned int)(foo(n - 1) + foo(n - 2)) % 100) : 1;
}

constexpr unsigned ack(unsigned m, unsigned n)
{
    return m == 0
        ? n + 1
        : n == 0
            ? ack(m - 1, 1)
            : ack(m - 1, ack(m, n - 1));
}

constexpr unsigned slow91(int n)
{
    return mc91(mc91(foo(n)) % 100);
}

int main(void)
{
    constexpr unsigned int compiletime_ack = ack(3, 14);
    constexpr int compiletime_91 = slow91(49);
    static_assert(compiletime_ack == 131069, "Must be evaluated at compile-time");
    static_assert(compiletime_91 == 91, "Must be evaluated at compile-time");
    std::cout << compiletime_ack << std::endl;
    std::cout << compiletime_91 << std::endl;
    std::cout << ack(3, 14) << std::endl;
    std::cout << slow91(49) << std::endl;
    return 0;
}
compiletime:
time g++ constexpr.cpp -std=c++11 -fconstexpr-depth=10000000 -O3
real 0m0.645s
user 0m0.600s
sys 0m0.032s
runtime:
time ./a.out
131069
91
131069
91
real 0m43.708s
user 0m43.567s
sys 0m0.008s
Here mc91 is the usual McCarthy 91 function (as can be found on Wikipedia) and foo is just a useless function returning real values between about 1 and 100, with Fibonacci-like runtime complexity.
Both the slow calculation of 91 and the Ackermann function get evaluated with the same arguments by the compiler and by the compiled program.
Surprisingly, it would even be faster to generate the code and run it through the compiler than to execute the compiled code itself.
At compile-time, redundant (identical) constexpr calls can be memoized, while run-time recursive behavior does not provide this.
If you change every recursive function such as...
constexpr unsigned slow91(int n) {
    return mc91(mc91(foo(n)) % 100);
}
... to a form that isn't constexpr, but does remember past calculations at runtime:
#include <unordered_map>
#include <boost/optional.hpp>

// key: the parameter(s), value: the memoized result
std::unordered_map<int, boost::optional<unsigned>> results4;

unsigned slow91(int n) {
    boost::optional<unsigned> &ret = results4[n];
    if (!ret)
    {
        ret = mc91(mc91(foo(n)) % 100);
    }
    return *ret;
}
You will get less surprising results.
compiletime:
time g++ test.cpp -std=c++11 -O3
real 0m1.708s
user 0m1.496s
sys 0m0.176s
runtime:
time ./a.out
131069
91
131069
91
real 0m0.097s
user 0m0.064s
sys 0m0.032s
Memoization
This is a very interesting "discovery", but the answer is probably simpler than you think.
Something can be evaluated at compile time when declared constexpr if all values involved are known at compile time (and if the variable where the value is supposed to end up is declared constexpr as well). With that said, imagine the following pseudo-code:
f(x) = g(x)
g(x) = x + h(x,x)
h(x,y) = x + y
Since every value is known at compile time, the compiler can rewrite the above into the equivalent form below:
f(x) = x + x + x
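A concrete C++11 rendering of that folding (a small sketch of my own, not from the original answer):
constexpr int h(int x, int y) { return x + y; }
constexpr int g(int x) { return x + h(x, x); }
constexpr int f(int x) { return g(x); }

// The whole call chain collapses to a single constant at compile time.
static_assert(f(5) == 15, "f(5) folds to 15");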
To put it in words, every function call has been removed and replaced with the expression itself. Also applicable is a method called memoization, where the results of previously calculated expressions are stored away so that the hard work only needs to be done once.
If you know that g(5) = 15, why calculate it again? Instead, just replace g(5) with 15 every time it is needed. This is possible because a function declared constexpr isn't allowed to have side effects.
Runtime
At runtime this is not happening (since we didn't tell the code to behave this way). The little guy running through your code needs to jump from f to g to h, and then jump back from h to g before jumping from g back to f, all while storing the return value of each function and passing it along to the next one.
Even if this guy is very, very tiny, and doesn't need to jump very far, he still doesn't like jumping back and forth all the time; it takes a lot of effort, and with that, it takes time.
But in the OP's example, is it really calculated at compile time?
Yes, and for those who don't believe that the compiler actually calculates this and puts the results as constants in the finished binary, I will supply the relevant assembly instructions from the OP's code below (the output of g++ -S -Wall -pedantic -fconstexpr-depth=1000000 -std=c++11).
main:
.LFB1200:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $131069, -4(%rbp)
movl $91, -8(%rbp)
movl $131069, %esi # one of the values from constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEj
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movl $91, %esi # the other value from our constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEi
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
# ...
# a lot of jumping is taking place down here
# see the full output at http://codepad.org/Q8D7c41y