I discovered this popular ~9-year-old SO question and decided to double-check its outcomes.
So, I have an AMD Ryzen 9 5950X, clang++ 10, and Linux. I copy-pasted the code from the question and here is what I got:
Sorted - 0.549702s:
~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
0.549702
sum = 314931600000
Unsorted - 0.546554s:
~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
0.546554
sum = 314931600000
I am pretty sure that the unsorted version turning out to be 3 ms faster is just noise, but it seems it is not slower anymore.
So, what has changed in CPU architecture (so that it is no longer an order of magnitude slower)?
Here are results from multiple runs:
Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted: 0.542587 0.559719 0.53938 0.557909
Just in case, here is my main.cpp:
#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
Update
With a larger number of elements (627680):
Unsorted
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
// std::sort(data, data + arraySize);
10.3814
Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
std::sort(data, data + arraySize);
10.6885
I think the question is still relevant - there is almost no difference.
Several of the answers in the question you link talk about rewriting the code to be branchless and thus avoiding any branch prediction issues. That's what your updated compiler is doing.
Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on godbolt, lines 36-67 of the assembly. The code is a little bit complicated, but one thing you definitely don't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching. The subsequent pand with this mask replaces the non-matching elements by 0, so that they do not contribute anything when unconditionally added to the sum.
The rough C++ equivalent would be
sum += data[c] & -(data[c] >= 128);
The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
Some of the extra complexity is to take care of sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.
The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.
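To make the masking idea concrete, here is a rough SSE2 sketch of what the vectorized loop does (my own illustration, not the compiler's actual output; it assumes arraySize is a multiple of 4 and exploits the fact that the data values are 0..255, so zero-extension to 64 bits is enough where the compiler's general-purpose code sign-extends):

#include <emmintrin.h>   // SSE2 intrinsics

long long masked_sum(const int* data, unsigned arraySize)
{
    const __m128i threshold = _mm_set1_epi32(127);    // "x >= 128" becomes "x > 127"
    const __m128i zero      = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();                // two running 64-bit partial sums
    for (unsigned c = 0; c < arraySize; c += 4)
    {
        __m128i vals = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + c));
        __m128i mask = _mm_cmpgt_epi32(vals, threshold);  // all-ones where data[c] >= 128 (pcmpgtd)
        __m128i kept = _mm_and_si128(vals, mask);         // non-matching elements become 0 (pand)
        // Widen the four 32-bit values to 64 bits; interleaving with zero is
        // a zero-extension, which is valid here because the kept values are 0..255.
        acc = _mm_add_epi64(acc, _mm_unpacklo_epi32(kept, zero));
        acc = _mm_add_epi64(acc, _mm_unpackhi_epi32(kept, zero));
    }
    alignas(16) long long lanes[2];
    _mm_store_si128(reinterpret_cast<__m128i*>(lanes), acc);
    return lanes[0] + lanes[1];                       // combine the two partial sums at the end
}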
Related
A few days ago, I encountered what I believe to be a bug in g++ 5.3 concerning the nesting of for loops at higher -OX optimization levels (I have been experiencing it specifically with -O2 and -O3). The issue is that if you have two nested for loops with some internal sum that keeps track of the total iterations, then once this sum exceeds its maximum value it prevents the outer loop from terminating. The smallest code I have been able to replicate this with is:
#include <iostream>

int main(){
    int sum = 0;
    // Value of 100 million. (2047483648 less than int32 max.)
    int maxInner = 100000000;
    int maxOuter = 30;
    // 100million * 30 = 3 billion. (Larger than int32 max)
    for(int i = 0; i < maxOuter; ++i)
    {
        for(int j = 0; j < maxInner; ++j)
        {
            ++sum;
        }
        std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
    }
}
When this is compiled using g++ -o run.me main.cpp, it runs just as expected, outputting:
i = 0 sum = 100000000
i = 1 sum = 200000000
i = 2 sum = 300000000
i = 3 sum = 400000000
i = 4 sum = 500000000
i = 5 sum = 600000000
i = 6 sum = 700000000
i = 7 sum = 800000000
i = 8 sum = 900000000
i = 9 sum = 1000000000
i = 10 sum = 1100000000
i = 11 sum = 1200000000
i = 12 sum = 1300000000
i = 13 sum = 1400000000
i = 14 sum = 1500000000
i = 15 sum = 1600000000
i = 16 sum = 1700000000
i = 17 sum = 1800000000
i = 18 sum = 1900000000
i = 19 sum = 2000000000
i = 20 sum = 2100000000
i = 21 sum = -2094967296
i = 22 sum = -1994967296
i = 23 sum = -1894967296
i = 24 sum = -1794967296
i = 25 sum = -1694967296
i = 26 sum = -1594967296
i = 27 sum = -1494967296
i = 28 sum = -1394967296
i = 29 sum = -1294967296
However, when this is compiled using g++ -O2 -o run.me main.cpp, the outer loop fails to terminate. (This only occurs when maxInner * maxOuter > 2^31.) While sum continually overflows, that shouldn't in any way affect the other variables. I have also tested this on Ideone.com with the test case demonstrated here: https://ideone.com/5MI5Jb
My question is thus twofold.
How is it possible for the value of sum to in some way affect the system? No decisions are based upon its value; it is merely used as a counter and in the std::cout statement.
What could possibly be causing the dramatically different outcomes at different optimization levels?
Thank you greatly in advance for taking the time to read and consider my question.
Note: This question differs from existing questions such as Why does integer overflow on x86 with GCC cause an infinite loop?, because the issue in that problem was an overflow of the sentinel variable. However, both sentinel variables in this question, i and j, never exceed the value of 100 million, let alone 2^31.
This is an optimisation that's perfectly valid for correct code. Your code isn't correct.
What GCC sees is that the only way the loop exit condition i >= maxOuter could ever be reached is if you have signed integer overflow during earlier loop iterations in your calculation of sum. The compiler assumes there isn't signed integer overflow, because signed integer overflow isn't allowed in standard C++. Therefore, i < maxOuter can be optimised to just true.
This is controlled by the -faggressive-loop-optimizations flag. You should be able to get the behaviour you expect by adding -fno-aggressive-loop-optimizations to your command line arguments. But better would be making sure your code is valid. Use unsigned integer types to get guaranteed valid wraparound behaviour.
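As a concrete illustration of that suggestion (my own sketch, not code from the answer), making the accumulator unsigned keeps the program's behaviour defined, so the outer loop terminates even under -O2:

#include <iostream>

int main() {
    unsigned int sum = 0;              // unsigned: wraparound is well-defined, no UB
    const int maxInner = 100000000;
    const int maxOuter = 30;
    for (int i = 0; i < maxOuter; ++i) {
        for (int j = 0; j < maxInner; ++j)
            ++sum;
        // 3 billion fits in a 32-bit unsigned, so sum doesn't even wrap here;
        // even if it did, unsigned wraparound is defined and the loop would
        // still stop after maxOuter iterations.
        std::cout << "i = " << i << " sum = " << sum << std::endl;
    }
}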
Your code invokes undefined behaviour, since the int sum overflows. You say it "shouldn't in any way affect the other variables". Wrong. Once you have undefined behaviour, all bets are off. Anything can happen.
GCC is (in)famous for optimisations that assume there is no undefined behaviour and that do, let's say, interesting things when undefined behaviour happens.
Solution: Don't do it.
As #hvd pointed out, the problem is in your invalid code, not in the compiler.
During your program's execution, the value of sum overflows the int range. Since int is signed by default and overflow of signed values causes undefined behavior* in C, the compiler is free to do anything. As someone noted somewhere, dragons could be flying out of your nose. The result is simply undefined.
The difference -O2 causes is in testing the end condition. When the compiler optimizes your loop, it realizes that it can optimize away the inner loop, making it
int sum = 0;
for(int i = 0; i < maxOuter; i++) {
    sum += maxInner;
    std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
and it may go further, transforming it to
int i = 0;
for(int sum = 0; sum < (maxInner * maxOuter); sum += maxInner) {
    i++;
    std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
To be honest, I don't really know what it does; the point is, it can do just this, or anything else. Remember the dragons: your program causes undefined behavior.
Suddenly, your sum variable is used in the loop end condition. Note that for defined behavior, these optimizations are perfectly valid. If your sum was unsigned (and your maxInner and maxOuter), the (maxInner * maxOuter) value (which would also be unsigned) would be reached after maxOuter loops, because unsigned operations are defined** to overflow as expected.
Now since we're in the signed domain, the compiler is free to assume that at all times sum < (maxInner * maxOuter), simply because the latter overflows and therefore is not defined. So the optimizing compiler can end up with something like
int i = 0;
for(int sum = 0;/* nothing here evaluates to true */; sum += maxInner) {
    i++;
    std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
}
which looks like observed behavior.
*: According to the C11 standard draft, section 6.5 Expressions:
If an exceptional condition occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined.
**: According to the C11 standard draft, Annex H, H.2.2:
C’s unsigned integer types are ‘‘modulo’’ in the LIA−1 sense in that overflows or out-of-bounds results silently wrap.
I did some research on the topic. I compiled the code above with gcc and g++ (version 5.3.0 on Manjaro) and got some pretty interesting things out of it.
Description
To successfully compile it with gcc (C compiler, that is), I have replaced
#include <iostream>
...
std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
with
#include <stdio.h>
...
printf("i = %d sum = %d\n", i, sum);
and wrapped this replacement with #ifndef ORIG, so I could have both versions. Then I ran 8 compilations: {gcc,g++} x {-O2, ""} x {-DORIG=1,""}. This yields the following results:
Results
gcc, -O2, -DORIG=1: Won't compile, missing <iostream>. Not surprising.
gcc, -O2, "": Produces a compiler warning and behaves "normally". A look at the assembly shows that the inner loop is optimized out (j being incremented by 100000000) and the outer loop variable is compared with the hardcoded value -1294967296. So GCC can detect this and do some clever things while the program still works as expected. More importantly, a warning is emitted to warn the user about the undefined behavior.
gcc, "", -DORIG=1: Won't compile, missing <iostream>. Not surprising.
gcc, "", "": Compiles without warning. No optimizations, program runs as expected.
g++, -O2, -DORIG=1: Compiles without warning, runs in an endless loop. This is the OP's original code running. The C++ assembly is tough for me to follow. The addition of 100000000 is there, though.
g++, -O2, "": Compiles with a warning. It is enough to change how the output is printed to change whether the compiler emits the warning. Runs "normally". Judging by the assembly, AFAIK the inner loop gets optimized out. At least there is again a comparison against -1294967296 and incrementation by 100000000.
g++, "", -DORIG=1: Compiles without warning. No optimization, runs "normally".
g++, "", "": ditto
The most interesting part for me was finding out the difference caused by the change of printing. Actually, of all the combinations, only the one used by the OP produces an endless-loop program; the others either fail to compile, do not optimize, or optimize with a warning and preserve sanity.
Code
An example build command and my full code follow.
$ gcc -x c -Wall -Wextra -O2 -DORIG=1 -o gcc_opt_orig main.cpp
main.cpp:
#ifdef ORIG
#include <iostream>
#else
#include <stdio.h>
#endif
int main(){
    int sum = 0;
    // Value of 100 million. (2047483648 less than int32 max.)
    int maxInner = 100000000;
    int maxOuter = 30;
    // 100million * 30 = 3 billion. (Larger than int32 max)
    for(int i = 0; i < maxOuter; ++i)
    {
        for(int j = 0; j < maxInner; ++j)
        {
            ++sum;
        }
#ifdef ORIG
        std::cout<<"i = "<<i<<" sum = "<<sum<<std::endl;
#else
        printf("i = %d sum = %d\n", i, sum);
#endif
    }
}
In this code snippet, I'm comparing performance of two functionally identical loops:
for (int i = 1; i < v.size()-1; ++i) {
    int a = v[i-1];
    int b = v[i];
    int c = v[i+1];
    if (a < b && b < c)
        ++n;
}
and
for (int i = 1; i < v.size()-1; ++i)
    if (v[i-1] < v[i] && v[i] < v[i+1])
        ++n;
The second one runs slower than the first one across a number of different C++ compilers with the optimization flag set to -O2:
second loop is about 330% slower now with Clang 3.7.0
second loop is about 2% slower with gcc 4.9.3
second loop is about 2% slower with Visual C++ 2015
I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code using temporary variables in order to get the best performance?
Using temporary variables makes the code faster, sometimes dramatically. What is going on?
The full code I'm using is provided below:
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];
        if (a < b && b < c)
            ++n;
    }
    return n;
}

int f1()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i)
        if (v[i-1] < v[i] && v[i] < v[i+1])
            ++n;
    return n;
}

int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;
        volatile long long result = 0;
        vector<long long> timings(N);
        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();
            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }
        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
    };
    mt19937 generator (31415); // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);
    for (auto& e: v)
        e = distribution(generator);
    benchmark(f0);
    benchmark(f1);
    cout << "\ndone\n";
    return 0;
}
It seems the compiler lacks knowledge about the relationship between std::vector<>::size() and the internal vector buffer size. Consider std::vector being our custom bugged_vector, a vector-like object with a slight bug: its ::size() can sometimes be one more than the internal buffer size n, but only when v[n-2] >= v[n-1].
Then the two snippets have different semantics again: the first one has undefined behavior, as we access the element v[v.size() - 1]. The second one, however, doesn't: due to the short-circuit nature of &&, we never read v[v.size() - 1] on the last iteration.
So, if the compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduces an additional jump in the machine code.
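A rough sketch of that hypothetical container (my own illustration of the thought experiment above, not a real class):

#include <cstddef>

// size() may report one more element than the buffer really holds; by the
// (hypothetical) contract this only happens when the last two real elements
// are non-increasing, so on the last iteration the first comparison
// v[i-1] < v[i] is false and a short-circuiting && never reads the
// out-of-bounds slot v[n].
struct bugged_vector {
    int*        buf;    // points at n real elements
    std::size_t n;      // actual buffer length
    bool        lies;   // may only be true if buf[n-2] >= buf[n-1]

    std::size_t size() const { return lies ? n + 1 : n; }
    int operator[](std::size_t i) const { return buf[i]; }  // i >= n is out of bounds
};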
By looking at assembly output from clang, we can see that it actually happens.
From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:
### f0: just the loop
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov edi, ecx
cmp edx, edi
setl r10b
mov ecx, dword ptr [r8 + 4*rsi + 4]
lea rsi, [rsi + 1]
cmp edi, ecx
setl dl
and dl, r10b
movzx edx, dl
add eax, edx
cmp rsi, r9
mov edx, edi
jb .LBB1_2
And for f1:
### f1: just the loop
.LBB2_2: # =>This Inner Loop Header: Depth=1
mov esi, r10d
mov r10d, dword ptr [r9 + 4*rdi]
lea rcx, [rdi + 1]
cmp esi, r10d
jge .LBB2_4 # <== This is Extra Jump
cmp r10d, dword ptr [r9 + 4*rdi + 4]
setl dl
movzx edx, dl
add eax, edx
.LBB2_4: # %._crit_edge.3
cmp rcx, r8
mov rdi, rcx
jb .LBB2_2
I've pointed out the extra jump in f1. And as we (hopefully) know, conditional jumps in tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)
GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets.
Edit. It turns out clang does a better job of optimizing the code. None of the three compilers can prove that it is safe to read v[i + 1] prior to the comparison in the second example (or they choose not to), but only clang manages to optimize the first example using the additional information that reading v[i + 1] is either valid or UB.
A performance difference of 2% is negligible and can be explained by a different order or choice of some instructions.
Here's additional insight to expand on #deniss' answer, which correctly diagnosed the issue.
Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".
The main issue is the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the Logical AND operator as well as the tightened memory model semantics introduced with C++11, the relevant clauses in the draft of the standard are
5.14 Logical AND operator [expr.log.and]
Unlike &, && guarantees left-to-right evaluation: the second
operand is not evaluated if the first operand is false. (ISO C++14 Standard, draft N3797)
and for speculative reads
1.10 Multi-threaded executions and data races [intro.multithread]
23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ] (ISO C++14 Standard, draft N3797)
My guess is optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory rather than special case for each target processor whether the speculative load could introduce a detectable data race for that target.
In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable, because modern processors have very sophisticated branch prediction and the misprediction rate is typically very low. However, the data here is random, and this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles; considering that the CPU typically retires 2 instructions per cycle, this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random), then every iteration carries a mispredict penalty equivalent to 10 to 20 instructions - HUGE.
Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop and only if the first condition is true.
The compiler could therefore generate code for the loop without the short-circuit branch up until the last iteration, and then use different code with the short-circuit branch for the last iteration - but that would require the compiler to know that the data is random and branch prediction is useless, and therefore that it is worth bothering with this - compilers aren't that sophisticated yet.
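For illustration, such a manually peeled loop might look like this (my own sketch, not code from the answer; it plugs into the question's benchmark, which guarantees v.size() is large, so small-size edge cases are ignored):

int f_peeled()
{
    int n = 0;
    size_t last = v.size() - 2;                      // index of the final iteration
    for (size_t i = 1; i < last; ++i)                // every iteration except the last:
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);      // both loads stay within v[0]..v[v.size()-2]
    if (v[last-1] < v[last] && v[last] < v[last+1])  // last iteration keeps the short circuit,
        ++n;                                         // so v[v.size()-1] is only read when needed
    return n;
}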
To avoid the conditional branch generated by the logical AND (&&) and avoid loading the memory locations into local variables, we can change the logical AND operator into a bitwise AND (code snippet here); the result is almost 4x faster when the data is random.
int f2()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i)
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]); // Bitwise AND
    return n;
}
Output
3.642443ms min
3.779982ms median
Result: 166634
3.725968ms min
3.870808ms median
Result: 166634
1.052786ms min
1.081085ms median
Result: 166634
done
The result on gcc 5.3 is 8x faster (live in Coliru here)
g++ --version
g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
3.761290ms min
4.025739ms median
Result: 166634
3.823133ms min
4.050742ms median
Result: 166634
0.459393ms min
0.505011ms median
Result: 166634
done
You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target; for x86 this is possible because of the SETcc instruction, which produces a one-byte result, 0 or 1, depending on a condition in the EFLAGS register - the same condition that could be used in a conditional branch, but without branching. In the generated code given by #deniss you can see setl, which sets the result to 1 if the "less than" condition, evaluated by the preceding compare instruction, is met:
cmp edx, edi ; a < b ?
setl r10b ; r10b = a < b ? 1 : 0
mov ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea rsi, [rsi + 1] ; ++i
cmp edi, ecx ; b < c ?
setl dl ; dl = b < c ? 1 : 0
and dl, r10b ; dl &= r10b
movzx edx, dl ; edx = zero extended dl
add eax, edx ; n += edx
f0 and f1 are semantically different.
x() && y() involves a short circuit when x() is false, as we know. This means that if x() is false, then y() must not be evaluated.
This prevents prefetching of the data needed to evaluate y() and (at least on clang) causes the insertion of a conditional jump, which results in branch-predictor misses.
Adding another 2 tests proves the point.
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>

using namespace std;
using namespace std::chrono;

vector<int> v(1'000'000);

int f0()
{
    int n = 0;
    for (int i = 1; i < v.size()-1; ++i) {
        int a = v[i-1];
        int b = v[i];
        int c = v[i+1];
        if (a < b && b < c)
            ++n;
    }
    return n;
}

int f1()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
        if (v[i-1] < v[i] && v[i] < v[i+1])
            ++n;
    return n;
}

int f2()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        auto t1 = v[i-1] < v[i];
        auto t2 = v[i] < v[i+1];
        if (t1 && t2)
            ++n;
    }
    return n;
}

int f3()
{
    int n = 0;
    auto s = v.size() - 1;
    for (size_t i = 1; i < s; ++i)
    {
        n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
    }
    return n;
}

int main()
{
    auto benchmark = [](int (*f)()) {
        const int N = 100;
        volatile long long result = 0;
        vector<long long> timings(N);
        for (int i = 0; i < N; ++i) {
            auto t0 = high_resolution_clock::now();
            result += f();
            auto t1 = high_resolution_clock::now();
            timings[i] = duration_cast<nanoseconds>(t1-t0).count();
        }
        sort(timings.begin(), timings.end());
        cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
        cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
    };
    mt19937 generator (31415); // deterministic seed
    uniform_int_distribution<> distribution(0, 1023);
    for (auto& e: v)
        e = distribution(generator);
    benchmark(f0);
    benchmark(f1);
    benchmark(f2);
    benchmark(f3);
    cout << "\ndone\n";
    return 0;
}
results (apple clang, -O2):
1.233948ms min
1.320545ms median
Result: 166850
3.366751ms min
3.493069ms median
Result: 166850
1.261948ms min
1.361748ms median
Result: 166850
1.251434ms min
1.353653ms median
Result: 166850
None of the answers so far have given a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)
v[i-1] < v[i] is work we already did last iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that's a huge problem.
Manually hoisting the v.size() - 1 out of the loop's upper bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers in v, but somehow it still optimizes less well than when computing a size_t upper = v.size() - 1 outside the loop (f2() and f4()).
A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.
Results on 64bit Ubuntu 15.10, on Core2 E6600 (Merom/Conroe microarchitecture).
clang++-3.8 -O3 -march=core2 | g++ 5.2 -O3 -march=core2 | gcc 5.2 -O2 (default -mtune=generic)
f0 1.825ms min(1.858 med) | 5.008ms min(5.048 med) | 5.000 min(5.028 med)
f1 4.637ms min(4.673 med) | 4.899ms min(4.952 med) | 4.894 min(4.931 med)
f2 1.292ms min(1.323 med) | 1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3 1.082ms min(1.117 med) | 2.426ms min(2.458 med) | 2.420 min(2.465 med)
f4 1.291ms min(1.341 med) | 1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)
Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.
f0 and f1:
gcc 5.2: The OP's f0 and f1 both make branchy loops, and won't auto-vectorize. f0 only uses one branch, though, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit compare. So it's still doing both comparisons on every iteration.
clang 3.8: f0: only one load per iteration, but does both compares and ands them together. f1: both compares each iteration, one with a branch to preserve the C semantics. Two loads per iteration.
int f2() {
    int n = 0;
    size_t upper = v.size()-1;  // difference from f0: hoist upper bound and use size_t loop counter
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        if (a < b && b < c)
            ++n;
    }
    return n;
}
gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 compare results. Also, after combining the results from two pcmpgtd instructions, it compares them against a vector of all-zero and then masks that. Zero is already the identity element for addition, so that's really silly.
clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.
int f4() {
    int n = 0;
    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int a = v[i-1], b = v[i], c = v[i+1];
        bool ab_lt = a < b;
        bool bc_lt = b < c;
        n += (ab_lt & bc_lt);  // some really minor code-gen differences from f2: auto-vectorizes to better code that runs slightly faster even for this large problem size
    }
    return n;
}
gcc 5.2 -O3: autovectorizes like f2, but without the extra pcmpeqd.
gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
clang -O3: about the same code as f2.
Attempt at compiler hand-holding
int f3() {
    int n = 0;
    int a = v[0], b = v[1];  // These happen before checking v.size, defeating the loop vectorizer or something
    bool ab_lt = a < b;
    size_t upper = v.size()-1;
    for (size_t i = 1; i < upper; ++i) {
        int c = v[i+1];      // only one load and compare inside the loop
        bool bc_lt = b < c;
        n += (ab_lt & bc_lt);
        ab_lt = bc_lt;
        a = b;               // unused inside the loop, only the compare result is needed
        b = c;
    }
    return n;
}
clang 3.8 -O3: Unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies).
4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on core2 where movups unaligned loads are slow) is as fast as gcc's vectorized versions.
gcc 5.2 -O3: Fails to auto-vectorize. My theory on that is that accessing the array outside the loop confuses the auto-vectorizer. Maybe because we do it before checking v.size(), or maybe just in general.
Compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where it's a huge problem (2 to 3 cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) It's the main reason why gcc's code runs 2.5x slower than clang's.
## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
add rcx, 1 # i,
mov edi, DWORD PTR [r10+rcx*4] # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
cmp edi, r8d # a, a # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from the last iteration, so this is really comparing c, b
mov r8d, edi # a, a
setg sil #, tmp124
and edx, esi # D.111089, tmp124 # PARTIAL-REG STALL: reading esi after writing sil
movzx edx, dl # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
add eax, edx # n, D.111085 # n += ...
cmp r9, rcx # upper, i
mov edx, esi # ab_lt, tmp124
jne .L31 #,
ret
I wrote two pieces of code, one that divides a random number by two, and one that bitshifts the same random number right once. As I understand it, this should produce the same result. However, when I time both pieces of code, I consistently get data saying that the shifting is faster. Why is that?
Shifting code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for(int i=0;i<iterations;i++){
    if(i % 2 == 0){
        result = result + (rand()>>1);
    }else{
        result = result - (rand()>>1);
    }
}
Dividing code:
double iterations = atoi(argv[1]) * 1000;
int result = 0;
cout << "Doing " << iterations << " iterations." << endl;
srand(31459);
for(int i=0;i<iterations;i++){
    if(i % 2 == 0){
        result = result + (rand() / 2);
    }else{
        result = result - (rand() / 2);
    }
}
Timing and results:
$ time ./divide 1000000; time ./shift 1000000
Doing 1e+09 iterations.
real 0m12.291s
user 0m12.260s
sys 0m0.021s
Doing 1e+09 iterations.
real 0m12.091s
user 0m12.056s
sys 0m0.019s
$ time ./shift 1000000; time ./divide 1000000
Doing 1e+09 iterations.
real 0m12.083s
user 0m12.028s
sys 0m0.035s
Doing 1e+09 iterations.
real 0m12.198s
user 0m12.158s
sys 0m0.028s
Additional information:
I am not using any optimizations when compiling
I am running this on a virtualized install of Fedora 20, kernel: 3.12.10-300.fc20.x86_64
Division isn't as fast; it's slower on the architecture you're running on, and it's almost always slower, because the hardware behind bit shifting is trivial while division is a bit of a nightmare. In base 10, what's easier for you, 78358582354 >> 3 or 78358582354 / 85? Instructions generally take the same time to execute regardless of input, and in your case it's the compiler's job to convert /2 to >>1; the CPU just does as it's told.
Division isn't actually slower here. I've run your benchmark using nonius, like so:
#define NONIUS_RUNNER
#include "Nonius.h++"
#include <type_traits>
#include <random>
#include <vector>
NONIUS_BENCHMARK("Divide", [](nonius::chronometer meter)
{
    std::random_device rd;
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> storage(meter.runs());
    meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) >> 1) : storage[i] + (dist(rd) >> 1); });
})
NONIUS_BENCHMARK("std::string destruction", [](nonius::chronometer meter)
{
    std::random_device rd;
    std::uniform_int_distribution<int> dist(0, 9);
    std::vector<int> storage(meter.runs());
    meter.measure([&](int i) { storage[i] = storage[i] % 2 == 0 ? storage[i] - (dist(rd) / 2) : storage[i] + (dist(rd) / 2); });
})
And these are the results:
As you can see both of them are neck and neck.
(You can find the html output here)
P.S: It seems I forgot to rename the second test. My bad.
It seems that the difference in results is below the spread of the results, so you can't really tell whether there is a difference. But in general, division can't be done in a single operation while a bit shift can, so a bit shift should usually be faster.
But as you have a literal 2 in your code, I would guess that the compiler, even without optimizations, produces identical code.
Note that rand returns int, and dividing a (signed by default) int by 2 is not the same as shifting it right by 1, so the compiler has to emit extra code for the division to round negative values toward zero. You can easily check the generated asm and see the difference, or simply check the resulting binary size:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 4016 ... boo # divide
... 3984 ... foo # shift
Now add static_cast patch:
if (i % 2 == 0) {
    result = result + (static_cast<unsigned>(rand())/2);
}
else {
    result = result - (static_cast<unsigned>(rand())/2);
}
and check the size again:
> g++ -O3 boo.cpp -c -o boo # divide
> g++ -O3 foo.cpp -c -o foo # shift
> ls -la foo boo
... 3984 ... boo # divide
... 3984 ... foo # shift
To be sure, you can verify that the generated asm in both binaries is the same.
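To see concretely why a signed division by 2 cannot be lowered to a bare arithmetic shift, here is a small standalone check (my own illustration, not part of the answer above):

#include <iostream>

int main() {
    int x = -7;
    std::cout << (x / 2)  << '\n';  // -3: signed division truncates toward zero
    std::cout << (x >> 1) << '\n';  // -4 on typical two's-complement targets:
                                    // the arithmetic shift rounds toward negative infinity
}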
I was just about to make my code more general by using std::tuple in a lot of cases, including the single-element case; I mean, for example, tuple<double> instead of double. But I decided to check the performance of this particular case first.
Here is simple performance benchmark test:
#include <tuple>
#include <iostream>

using std::cout;
using std::endl;
using std::get;
using std::tuple;

int main(void)
{
#ifdef TUPLE
    using double_t = std::tuple<double>;
#else
    using double_t = double;
#endif
    constexpr int count = 1e9;
    auto array = new double_t[count];
    long long sum = 0;
    for (int idx = 0; idx < count; ++idx) {
#ifdef TUPLE
        sum += get<0>(array[idx]);
#else
        sum += array[idx];
#endif
    }
    delete[] array;
    cout << sum << endl; // just "external" side effect for variable sum.
}
And run results:
$ g++ -DTUPLE -O2 -std=c++11 test.cpp && time ./a.out
0
real 0m3.347s
user 0m2.839s
sys 0m0.485s
$ g++ -O2 -std=c++11 test.cpp && time ./a.out
0
real 0m2.963s
user 0m2.424s
sys 0m0.519s
I thought that tuple is a strictly compile-time template and that all of the get<> functions boil down to ordinary variable access in that case. BTW, the memory allocation sizes in this test are the same.
Why does this execution time difference happen?
EDIT: The problem was in the initialization of the tuple<> objects. To make the test more accurate, one line must be changed:
constexpr int count = 1e9;
- auto array = new double_t[count];
+ auto array = new double_t[count]();
long long sum = 0;
After that one can observe similar results:
$ g++ -DTUPLE -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real 0m3.342s
real 0m3.339s
real 0m3.343s
$ g++ -g -O2 -std=c++11 test.cpp && (for i in $(seq 3); do time ./a.out; done) 2>&1 | grep real
real 0m3.349s
real 0m3.339s
real 0m3.334s
The tuples default-construct their values (so everything is 0); the plain doubles do not get default-initialized.
In the generated assembly, the following initialization loop is only present when using tuples. Otherwise the two versions are equivalent.
.L2:
movq $0, (%rdx)
addq $8, %rdx
cmpq %rcx, %rdx
jne .L2
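For what it's worth, here is a minimal standalone sketch (my own illustration) of the initialization difference described above: a default-constructed std::tuple<double> value-initializes its member to 0, while elements from a plain new double[...] are left indeterminate unless you add ():

#include <iostream>
#include <tuple>

int main()
{
    std::tuple<double> t;                 // default construction value-initializes the member
    std::cout << std::get<0>(t) << '\n';  // prints 0

    auto a = new double[4];               // elements are indeterminate (reading them is UB)
    auto b = new double[4]();             // value-initialized: all elements are 0.0
    std::cout << b[0] << '\n';            // prints 0
    delete[] a;
    delete[] b;
}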