Maintain x*x in C++

I have the following while-loop
uint32_t x = 0;
while(x*x < STOP_CONDITION) {
if(CHECK_CONDITION) x++;
// Do other stuff that modifies CHECK_CONDITION
}
The STOP_CONDITION is constant at run time, but not at compile time. Is there a more efficient way to maintain x*x, or do I really need to recompute it every time?

Note: According to the benchmark below, this code runs about 1-2% slower than Tamas Ionut's option. Please read the disclaimer included at the bottom!
In addition to Tamas Ionut's answer, if you want to maintain STOP_CONDITION as the actual stop condition and avoid the square root calculation, you could update the square using the mathematical identity
(x + 1)² = x² + 2x + 1
whenever you change x:
uint32_t x = 0;
uint32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
if(CHECK_CONDITION) {
xSquare += 2 * x + 1;
x++;
}
// Do other stuff that modifies CHECK_CONDITION
}
Since 2*x + 1 is just a bit shift and an increment, the compiler should be able to optimize this fairly well.
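If you prefer to spell it out, the update can be written with the shift explicit (equivalent, and roughly what the optimizer emits anyway; a sketch, not part of the original answer):
xSquare += (x << 1) + 1; // 2*x + 1: one shift plus one increment
x++;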
Disclaimer: Since you asked "how can I optimize this code", I answered with one particular way to possibly make it faster. Whether the add-and-increment is actually faster than a single integer multiplication should be tested in practice. Whether you should optimize the code at all is a different question. I assume you have already benchmarked the loop and found it to be a bottleneck, or that you have a theoretical interest in the question. If you are writing production code that you wish to optimize, first measure the performance and then optimize where needed (which is probably not the x*x in this loop).

What about:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while(x < bound) {
if(CHECK_CONDITION) x++;
// Do other stuff that modifies CHECK_CONDITION
}
This way, you're getting rid of that extra computation.

I made a small benchmark of Tamas Ionut's and CompuChip's answers, and here are the results:
Tamas Ionut: 19.7068
The code of this method:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while(x < bound) {
if(CHECK_CONDITION) x++;
// Do other stuff that modifies CHECK_CONDITION
}
CompuChip: 20.2056
The code of this method:
uint32_t x = 0;
uint32_t xSquare = 0;
while(xSquare < STOP_CONDITION) {
if(CHECK_CONDITION) {
xSquare += 2 * x + 1;
x++;
}
// Do other stuff that modifies CHECK_CONDITION
}
with STOP_CONDITION = 1000000 and repeating the process 1000000 times
Environment:
Compiler : MSVC 2013
OS : Windows 8.1 - X64
Processor: Core i7-4510U @ 2.00 GHz
Release Mode - Maximize Speed (/O2)

I would say that optimizing for readability is better than optimizing for performance in your case, since we are talking about a very small performance gain.
The compiler can optimize a lot for you regarding performance, but readability is the responsibility of the programmer.

I believe Tamas Ionut's solution is better than CompuChip's because we only have x++ inside the loop. However, the comparison between uint32_t and double will kill the deal. It would be more efficient to use uint32_t for bound instead of double. This approach also has fewer problems with numerical overflow, because x cannot be greater than 2^16 = 65536 if we want a correct x^2 value.
If we also do heavy work in the loop, then the results obtained from both approaches should be very similar; however, Tamas Ionut's approach is simpler and easier to read.
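As a sketch of that suggestion (my own code, not from the answer, assuming STOP_CONDITION fits in a uint32_t): compute an integer bound once, correcting for floating-point rounding, so the loop condition becomes a plain integer comparison:
#include <cmath>
#include <cstdint>

// Smallest bound such that x < bound  <=>  x*x < stopCondition.
// Products are computed in 64 bits to avoid overflow near bound = 65536.
uint32_t squareRootBound(uint32_t stopCondition) {
    uint32_t bound = static_cast<uint32_t>(std::sqrt(static_cast<double>(stopCondition)));
    while (static_cast<uint64_t>(bound) * bound < stopCondition) ++bound;
    while (bound > 0 && static_cast<uint64_t>(bound - 1) * (bound - 1) >= stopCondition) --bound;
    return bound;
}
// Usage: uint32_t bound = squareRootBound(STOP_CONDITION); then loop with while (x < bound).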
Below is my code and the corresponding assembly code obtained using clang 3.8.0 with the -O3 flag. It is very clear from the assembly that the first approach is more efficient.
using T = size_t;
void test1(const T stopCondition, bool checkCondition) {
T x = 0;
while (x < stopCondition) {
if (checkCondition) {
x++;
}
// Do something heavy here
}
}
void test2(const T stopCondition, bool checkCondition) {
T x = 0;
T xSquare = 0;
const T threshold = stopCondition * stopCondition;
while (xSquare < threshold) {
if (checkCondition) {
xSquare += 2 * x + 1;
x++;
}
// Do something heavy here
}
}
(gdb) disassemble test1
Dump of assembler code for function _Z5test1mb:
0x0000000000400be0 <+0>: movzbl %sil,%eax
0x0000000000400be4 <+4>: mov %rax,%rcx
0x0000000000400be7 <+7>: neg %rcx
0x0000000000400bea <+10>: nopw 0x0(%rax,%rax,1)
0x0000000000400bf0 <+16>: add %rax,%rcx
0x0000000000400bf3 <+19>: cmp %rdi,%rcx
0x0000000000400bf6 <+22>: jb 0x400bf0 <_Z5test1mb+16>
0x0000000000400bf8 <+24>: retq
End of assembler dump.
(gdb) disassemble test2
Dump of assembler code for function _Z5test2mb:
0x0000000000400c00 <+0>: imul %rdi,%rdi
0x0000000000400c04 <+4>: test %sil,%sil
0x0000000000400c07 <+7>: je 0x400c2e <_Z5test2mb+46>
0x0000000000400c09 <+9>: xor %eax,%eax
0x0000000000400c0b <+11>: mov $0x1,%ecx
0x0000000000400c10 <+16>: test %rdi,%rdi
0x0000000000400c13 <+19>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c15 <+21>: data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c20 <+32>: add %rcx,%rax
0x0000000000400c23 <+35>: add $0x2,%rcx
0x0000000000400c27 <+39>: cmp %rdi,%rax
0x0000000000400c2a <+42>: jb 0x400c20 <_Z5test2mb+32>
0x0000000000400c2c <+44>: jmp 0x400c42 <_Z5test2mb+66>
0x0000000000400c2e <+46>: test %rdi,%rdi
0x0000000000400c31 <+49>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c33 <+51>: data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c40 <+64>: jmp 0x400c40 <_Z5test2mb+64>
0x0000000000400c42 <+66>: retq
End of assembler dump.

Related

In C++ is there any way to specialise a function template for specific values of arguments?

I have a broadly used function foo(int a, int b) and I want to provide a special version of foo that performs differently if a is, say, 1.
a) I don't want to go through the whole code base and change all occurrences of foo(1, b) to foo1(b), because the rules on arguments may change and I don't want to keep going through the code base whenever they do.
b) I don't want to burden function foo with an "if (a == 1)" test because of performance issues.
It seems to me to be a fundamental skill of the compiler to call the right code based on what it can see in front of it. Or is this a missing feature of C++ that currently requires macros or something similar to handle?
Simply write
inline void foo(int a, int b)
{
if (a==1) {
// skip complex code and call easy code
call_easy(b);
} else {
// complex code here
do_complex(a, b);
}
}
When you call
foo(1, 10);
the optimizer will/should simply insert a call_easy(b).
Any decent optimizer will inline the function and detect whether it has been called with a==1. I also think the constexpr approach mentioned in other posts is nice, but not really necessary in your case. constexpr is very useful if you want to resolve values at compile time; you simply asked to switch code paths based on a value at runtime. The optimizer should be able to detect that.
In order to detect that, the optimizer needs to see your function definition at all places where your function is called. Hence the inline requirement - although compilers such as Visual Studio have a "generate code at link time" feature, that reduces this requirement somewhat.
Finally, you might want to look at the C++ attribute [[likely]] (I think). I haven't worked with it yet, but it is supposed to tell the compiler which execution path is likely and give a hint to the optimizer.
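A minimal sketch of such a hint (my own example; [[likely]]/[[unlikely]] require C++20, and call_easy/do_complex are the helpers from the code above):
inline void foo(int a, int b)
{
    if (a == 1) [[likely]] {   // hint: this branch is expected to be taken
        call_easy(b);
    } else [[unlikely]] {
        do_complex(a, b);
    }
}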
And why not experiment a little and look at the generated code in the debugger/disassembler? That will give you a feel for the optimizer. Don't forget that the optimizer is likely only active in Release builds :)
Templates work at compile time, but you want to decide at runtime, which is never possible with them. If, and only if, you really can call your function with constexpr values, then you can change it to a template, but the call becomes foo<1,2>() instead of foo(1,2). "Performance issues"... that's really funny! If that single compare instruction is your performance problem, then you have done everything else super perfectly :-)
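For completeness, a rough sketch (mine, not from the answer) of that template variant with a non-type parameter, usable only when a is a compile-time constant; easy_code and complex_code are hypothetical helpers:
template <int A>
int foo(int b)
{
    if constexpr (A == 1) {        // resolved at compile time, no runtime compare
        return easy_code(b);       // hypothetical helper
    } else {
        return complex_code(A, b); // hypothetical helper
    }
}
// Call site: foo<1>(10) instead of foo(1, 10).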
BTW: If you already call with constexpr values and the function is visible in the compilation unit, you can be sure the compiler already knows how to optimize it away...
But there is another way to handle such things, if you really do have constexpr values sometimes and the algorithm inside the function can be evaluated as constexpr. In that case, you can decide inside the function whether it was called in a constexpr context. If so, you can run a full compile-time algorithm, which can also contain your if (a == 1) and will be fully evaluated at compile time. If the function is not called in a constexpr context, it runs as before, without any additional overhead.
To make this decision at compile time we need the current C++ standard (C++20)!
#include <iostream>
#include <type_traits>
constexpr int foo( int a, int)
{
if (std::is_constant_evaluated() )
{ // this part is fully evaluated in compile time!
if ( a == 1 )
{
return 1;
}
else
{
return 2;
}
}
else
{ // and the rest runs as before in runtime
if ( a == 0 )
{
return 3;
}
else
{
return 4;
}
}
}
int main()
{
constexpr int res1 = foo( 1,0 ); // fully evaluated during compile time
constexpr int res2 = foo( 2,0 ); // also full compile time
std::cout << res1 << std::endl;
std::cout << res2 << std::endl;
std::cout << foo( 5, 0) << std::endl; // here we go in runtime
std::cout << foo( 0, 0) << std::endl; // here we go in runtime
}
That code will return:
1
2
4
3
So we do not need to go with classic templates, and there is no need to change the rest of the code, but we get full compile-time optimization where possible.
@Sebastian's suggestion works, at least in this simple case, at all optimisation levels except -O0 with g++ 9.3.0 on Ubuntu 20.04 in C++20 mode. Thanks again.
See the disassembly below, which always calls the correct subfunction func1 or func2 directly instead of the top-level func(). A similar disassembly at -O0 shows only the top-level func() being called, leaving the decision to run time, which is not desired.
I hope this will work in production code, and perhaps with multiple hard-coded arguments.
Breakpoint 1, main () at p1.cpp:24
24 int main() {
(gdb) disass /m
Dump of assembler code for function main():
6 inline void func(int a, int b) {
7
8 if (a == 1)
9 func1(b);
10 else
11 func2(a,b);
12 }
13
14 void func1(int b) {
15 std::cout << "func1 " << " " << " " << b << std::endl;
16 }
17
18 void func2(int a, int b) {
19 std::cout << "func2 " << a << " " << b << std::endl;
20 }
21
22 };
23
24 int main() {
=> 0x0000555555555286 <+0>: endbr64
0x000055555555528a <+4>: push %rbp
0x000055555555528b <+5>: push %rbx
0x000055555555528c <+6>: sub $0x18,%rsp
0x0000555555555290 <+10>: mov $0x28,%ebp
0x0000555555555295 <+15>: mov %fs:0x0(%rbp),%rax
0x000055555555529a <+20>: mov %rax,0x8(%rsp)
0x000055555555529f <+25>: xor %eax,%eax
25
26 X x1;
27
28 int b=1;
29 x1.func(1,b);
0x00005555555552a1 <+27>: lea 0x7(%rsp),%rbx
0x00005555555552a6 <+32>: mov $0x1,%esi
0x00005555555552ab <+37>: mov %rbx,%rdi
0x00005555555552ae <+40>: callq 0x55555555531e <X::func1(int)>
30
31 b=2;
32 x1.func(2,b);
0x00005555555552b3 <+45>: mov $0x2,%edx
0x00005555555552b8 <+50>: mov $0x2,%esi
0x00005555555552bd <+55>: mov %rbx,%rdi
0x00005555555552c0 <+58>: callq 0x5555555553de <X::func2(int, int)>
33
34 b=3;
35 x1.func(1,b);
0x00005555555552c5 <+63>: mov $0x3,%esi
0x00005555555552ca <+68>: mov %rbx,%rdi
0x00005555555552cd <+71>: callq 0x55555555531e <X::func1(int)>
36
37 b=4;
38 x1.func(2,b);
0x00005555555552d2 <+76>: mov $0x4,%edx
0x00005555555552d7 <+81>: mov $0x2,%esi
0x00005555555552dc <+86>: mov %rbx,%rdi
0x00005555555552df <+89>: callq 0x5555555553de <X::func2(int, int)>
39
40 return 0;
0x00005555555552e4 <+94>: mov 0x8(%rsp),%rax
0x00005555555552e9 <+99>: xor %fs:0x0(%rbp),%rax
0x00005555555552ee <+104>: jne 0x5555555552fc <main()+118>
0x00005555555552f0 <+106>: mov $0x0,%eax
0x00005555555552f5 <+111>: add $0x18,%rsp
0x00005555555552f9 <+115>: pop %rbx
0x00005555555552fa <+116>: pop %rbp
0x00005555555552fb <+117>: retq
0x00005555555552fc <+118>: callq 0x555555555100 <__stack_chk_fail@plt>
End of assembler dump.

Iterating over linked list in C++ is slower than in Go

EDIT: After getting some feedback, I created a new example which should be more reproducible.
I've been writing a project in C++ that involves lots of linked list iteration. To get a benchmark, I rewrote the code in Go. Surprisingly I've found that the Go implementation runs consistently faster by ~10%, even after passing the -O flag to clang++. Probably I'm just missing some obvious optimization in C++ but I've been banging my head against a wall for a while with various tweaks.
Here's a simplified version, with identical implementations in C++ and Go where the Go program runs faster. All it does is create a linked list with 3000 nodes, and then time how long it takes to iterate over this list 1,000,000 times (7.5 secs in C++, 6.8 in Go).
C++:
#include <iostream>
#include <chrono>
using namespace std;
using ms = chrono::milliseconds;
struct Node {
Node *next;
double age;
};
// Global linked list of nodes
Node *nodes = nullptr;
void iterateAndPlace(double age) {
Node *node = nodes;
Node *prev = nullptr;
while (node != nullptr) {
// Just to make sure that age field is accessed
if (node->age > 99999) {
break;
}
prev = node;
node = node->next;
}
// Arbitrary action to make sure the compiler
// doesn't optimize away this function
prev->age = age;
}
int main() {
Node x = {};
std::cout << "Size of struct: " << sizeof(x) << "\n"; // 16 bytes
// Fill in global linked list with 3000 dummy nodes
for (int i=0; i<3000; i++) {
Node* newNode = new Node;
newNode->age = 0.0;
newNode->next = nodes;
nodes = newNode;
}
auto start = chrono::steady_clock::now();
for (int i=0; i<1000000; i++) {
iterateAndPlace(100.1);
}
auto end = chrono::steady_clock::now();
auto diff = end - start;
std::cout << "Elapsed time is : "<< chrono::duration_cast<ms>(diff).count()<<" ms "<<endl;
}
Go:
package main
import (
"time"
"fmt"
"unsafe"
)
type Node struct {
next *Node
age float64
}
var nodes *Node = nil
func iterateAndPlace(age float64) {
node := nodes
var prev *Node = nil
for node != nil {
if node.age > 99999 {
break
}
prev = node
node = node.next
}
prev.age = age
}
func main() {
x := Node{}
fmt.Printf("Size of struct: %d\n", unsafe.Sizeof(x)) // 16 bytes
for i := 0; i < 3000; i++ {
newNode := new(Node)
newNode.next = nodes
nodes = newNode
}
start := time.Now()
for i := 0; i < 1000000; i++ {
iterateAndPlace(100.1)
}
fmt.Printf("Time elapsed: %s\n", time.Since(start))
}
Output from my Mac:
$ go run minimal.go
Size of struct: 16
Time elapsed: 6.865176895s
$ clang++ -std=c++11 -stdlib=libc++ minimal.cpp -O3; ./a.out
Size of struct: 16
Elapsed time is : 7524 ms
Clang version:
$ clang++ --version
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
EDIT:
UKMonkey brought up the fact that the nodes may be contiguously allocated in Go but not C++. To test this, I allocated contiguously in C++ with a vector, and this did not change the runtime:
// Fill in global linked list with 3000 contiguous dummy nodes
vector<Node> vec;
vec.reserve(3000);
for (int i=0; i<3000; i++) {
vec.push_back(Node());
}
nodes = &vec[0];
Node *curr = &vec[0];
for (int i=1; i<3000; i++) {
curr->next = &vec[i];
curr = curr->next;
curr->age = 0.0;
}
I checked that the resulting linked list is indeed contiguous:
std::cout << &nodes << " " << &nodes->next << " " << &nodes->next->next << " " << &nodes->next->next->next << "\n";
0x1032de0e0 0x7fb934001000 0x7fb934001010 0x7fb934001020
Preface: I am not a C++ expert or assembly expert. But I know a little bit of them, enough to be dangerous, perhaps.
So I was piqued and decided to take a look at the assembler generated for the Go, and followed it up with checking it against the output for clang++.
High Level Summary
Later on here, I go through the assembler output for both languages in x86-64 assembler. The fundamental "critical section" of code in this example is a very tight loop. For that reason, it's the largest contributor to the time spent in the program.
Why tight loops matter is that modern CPU's can execute instructions usually faster than relevant values for the code to reference (like for comparisons) can be loaded from memory. In order to achieve the blazing fast speeds they achieve, CPU's perform a number of tricks including pipelining, branch prediction, and more. Tight loops are often the bane of pipelining and realistically branch prediction could be only marginally helpful if there's a dependency relationship between values.
Fundamentally, the traversal loop has four main chunks:
1. If `node` is null, exit the loop.
2. If `node.age` > 99999, exit the loop.
3a. set prev = node
3b. set node = node.next
Each of these is represented by several assembler instructions, but the chunks as output by Go and C++ are ordered differently. The C++ effectively does it in the order 3a, 1, 2, 3b. The Go version does it in the order 3, 2, 1 (it starts the first loop at segment 2 to avoid the assignment happening before the null checks).
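Here is a rough C++ rendition (my paraphrase of the control flow, not compiler output, reusing Node/nodes from the C++ code above) of the rotated ordering the Go compiler emits, where the loop entry jumps past the advance:
void iterateAndPlaceRotated(double age) {
    Node* node = nodes;
    Node* prev = nullptr;
    goto check;                           // skip the advance on the first pass
advance:
    prev = node;                          // chunk 3a
    node = node->next;                    // chunk 3b
check:
    if (node == nullptr) goto done;       // chunk 1
    if (node->age <= 99999) goto advance; // chunk 2: otherwise fall out
done:
    prev->age = age;
}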
In actuality, the clang++ output has a couple fewer instructions than the Go output and should do fewer RAM accesses (at the cost of one more floating-point register). One might imagine that executing almost the same instructions, just in different orders, would end up taking the same amount of time, but that doesn't take into account pipelining and branch prediction.
Takeaways
One might be tempted to hand-optimize this code and write assembly if it were a critical but small loop. Ignoring the obvious reasons (it's riskier/more complex/more prone to bugs), one should also take into account that while the Go-generated code was faster on the two Intel x86-64 processors I tested it with, it's possible that with an AMD processor you'd get the opposite result. It's also possible that with the N+1th-gen Intel you'd get different results.
My full investigation follows below:
The investigation
NOTE: I've snipped the examples as short as possible, including truncating filenames and removing excess fluff from the assembly listings, so your outputs may look slightly different from mine. But anyway, I continue.
So I ran go build -gcflags -S main.go to get this assembly listing, and I'm only really looking at iterateAndPlace.
"".iterateAndPlace STEXT nosplit size=56 args=0x8 locals=0x0
00000 (main.go:16) TEXT "".iterateAndPlace(SB), NOSPLIT, $0-8
00000 (main.go:16) FUNCDATA $0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
00000 (main.go:16) FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
00000 (main.go:17) MOVQ "".nodes(SB), AX
00007 (main.go:17) MOVL $0, CX
00009 (main.go:20) JMP 20
00011 (main.go:25) MOVQ (AX), DX
00014 (main.go:25) MOVQ AX, CX
00017 (main.go:25) MOVQ DX, AX
00020 (main.go:20) TESTQ AX, AX
00023 (main.go:20) JEQ 44
00025 (main.go:21) MOVSD 8(AX), X0
00030 (main.go:21) MOVSD $f64.40f869f000000000(SB), X1
00038 (main.go:21) UCOMISD X1, X0
00042 (main.go:21) JLS 11
00044 (main.go:21) MOVSD "".age+8(SP), X0
00050 (main.go:28) MOVSD X0, 8(CX)
00055 (main.go:29) RET
In case you lost context, I'll paste the original listing with the line numbers here:
16 func iterateAndPlace(age float64) {
17 node := nodes
18 var prev *Node = nil
19
20 for node != nil {
21 if node.age > 99999 {
22 break
23 }
24 prev = node
25 node = node.next
26 }
27
28 prev.age = age
29 }
A few interesting things I noticed immediately:
It's not generating any code for line 24, prev = node. That is because the compiler realized the assignment can be elided: to traverse to node.next it uses the CX register, which holds the value of prev. This is probably a nice case where the SSA compiler can realize the assignment is redundant.
The if statement checking node.age is re-ordered to be after the node = node.next stuff, which is skipped on the first iteration. You can think of it as more like a do..while loop in that case. Overall this is minor, since it only really changes the first iteration. But maybe that's enough?
So let's jump over to the C++ assembly, which you get from clang++ -S -mllvm --x86-asm-syntax=intel -O3 minimal.cpp.
.quad 4681608292164698112 ## double 99999
# note I snipped some stuff here
__Z15iterateAndPlaced: ## #_Z15iterateAndPlaced
## BB#0:
push rbp
Lcfi0:
.cfi_def_cfa_offset 16
Lcfi1:
.cfi_offset rbp, -16
mov rbp, rsp
Lcfi2:
.cfi_def_cfa_register rbp
mov rcx, qword ptr [rip + _nodes]
xor eax, eax
movsd xmm1, qword ptr [rip + LCPI0_0] ## xmm1 = mem[0],zero
.p2align 4, 0x90
LBB0_2: ## =>This Inner Loop Header: Depth=1
mov rdx, rax
mov rax, rcx
movsd xmm2, qword ptr [rax + 8] ## xmm2 = mem[0],zero
ucomisd xmm2, xmm1
ja LBB0_3
## BB#1: ## in Loop: Header=BB0_2 Depth=1
mov rcx, qword ptr [rax]
test rcx, rcx
mov rdx, rax
jne LBB0_2
LBB0_3:
movsd qword ptr [rdx + 8], xmm0
pop rbp
ret
This is really interesting. The assembly generated is overall fairly similar (ignoring the minor differences in how assemblers list the syntax) - It made a similar optimization about not assigning prev. Furthermore, the C++ seems to have eliminated the need to load 99999 every time the comparison is done (the Go version loads it right before comparison each time).
For replication purposes, versions of things I used (on an x86-64 darwin mac on OSX High Sierra)
$ go version
go version go1.9.3 darwin/amd64
$ clang++ --version
Apple LLVM version 9.0.0 (clang-900.0.39.2)
Target: x86_64-apple-darwin17.4.0
I think the problem is the code generated by clang.
My results are:
6097 ms with clang
5106 ms with gcc
5219 ms with go
So I disassembled, and I see that the code generated without accessing the age field is the same for clang and gcc, but when you access the age field the code generated by clang is a little bit worse than the code generated by gcc.
[The disassembly listings for clang, gcc, and the Go version were posted as images and are omitted here.]
As you can see, the code is pretty much the same in all three, but in the clang version the two extra mov instructions at the beginning make the instruction count higher than in the gcc version, which slows things down a little. I think the biggest impact on performance comes from the ucomisd instruction in the clang version, because it performs a memory indirection.
Sorry for my bad English, I hope it's understandable.

Why do C++ optimizers have problems with these temporary variables or rather why `v[]` should be avoided in tight loops?

In this code snippet, I'm comparing performance of two functionally identical loops:
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
and
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
The second one runs significantly slower than the first one across a number of different C++ compilers with the optimization flag set to O2:
second loop is about 330% slower now with Clang 3.7.0
second loop is about 2% slower with gcc 4.9.3
second loop is about 2% slower with Visual C++ 2015
I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code with extra temporary variables in order to get the best performance?
Using temporary variables makes the code faster, sometimes dramatically, now. What is going on?
The full code I'm using is provided below:
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
cout << "\ndone\n";
return 0;
}
It seems the compiler lacks knowledge about the relationship between std::vector<>::size() and the internal vector buffer size. Consider std::vector being our custom bugged_vector, a vector-like object with a slight bug: its ::size() can sometimes be one more than the internal buffer size n, but only when v[n-2] >= v[n-1].
Then the two snippets have different semantics: the first one has undefined behavior, as we access the element v[v.size() - 1]. The second one, however, doesn't: due to the short-circuit nature of &&, we never read v[v.size() - 1] on the last iteration.
So, if the compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduces an additional jump in the machine code.
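To make the hypothetical concrete, here is a sketch (mine, not from the answer) of such a bugged_vector:
#include <cstddef>

// Hypothetical container: size() reports one element more than the buffer
// holds, but only when the last two real elements satisfy data[n-2] >= data[n-1].
// Reading v[v.size()-1] is then out of bounds, so a compiler that cannot rule
// this out must keep the short-circuit in the second snippet.
struct bugged_vector {
    int* data;
    std::size_t n; // real buffer size
    std::size_t size() const {
        return (n >= 2 && data[n-2] >= data[n-1]) ? n + 1 : n;
    }
    int operator[](std::size_t i) const { return data[i]; } // i == n would be UB
};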
By looking at the assembly output from clang, we can see that this actually happens.
From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:
### f0: just the loop
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov edi, ecx
cmp edx, edi
setl r10b
mov ecx, dword ptr [r8 + 4*rsi + 4]
lea rsi, [rsi + 1]
cmp edi, ecx
setl dl
and dl, r10b
movzx edx, dl
add eax, edx
cmp rsi, r9
mov edx, edi
jb .LBB1_2
And for f1:
### f1: just the loop
.LBB2_2: # =>This Inner Loop Header: Depth=1
mov esi, r10d
mov r10d, dword ptr [r9 + 4*rdi]
lea rcx, [rdi + 1]
cmp esi, r10d
jge .LBB2_4 # <== This is Extra Jump
cmp r10d, dword ptr [r9 + 4*rdi + 4]
setl dl
movzx edx, dl
add eax, edx
.LBB2_4: # %._crit_edge.3
cmp rcx, r8
mov rdi, rcx
jb .LBB2_2
I've pointed out the extra jump in f1. And as we (hopefully) know, conditional jumps in tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)
GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets.
Edit: It turns out clang does a better job optimizing the code. None of the three compilers can prove that it is safe to read v[i + 1] prior to the comparison in the second example (or they choose not to), but only clang manages to optimize the first example using the additional information that reading v[i + 1] is either valid or UB.
A performance difference of 2% is negligible and can be explained by a different ordering or choice of instructions.
Here's additional insight to expand on @deniss's answer, which correctly diagnosed the issue.
Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".
The main issue is the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the Logical AND operator as well as the tightened memory model semantics introduced with C++11, the relevant clauses in the draft of the standard are
5.14 Logical AND operator [expr.log.and]
Unlike &, && guarantees left-to-right evaluation: the second
operand is not evaluated if the first operand is false. (ISO C++14 Standard, draft N3797)
and for speculative reads
1.10 Multi-threaded executions and data races [intro.multithread]
23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ] (ISO C++14 Standard, draft N3797)
My guess is that optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory, rather than special-casing, for each target processor, whether a speculative load could introduce a detectable data race on that target.
In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable, because modern processors have very sophisticated branch prediction and the misprediction rate is typically very low. However, the data here is random - this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles; considering that the CPU typically retires 2 instructions per cycle, this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random), then every iteration has a misprediction penalty equivalent to 10 to 20 instructions - HUGE.
Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop and only if the first condition is true.
The compiler could therefore generate code for the loop without the short-circuit branch up until the last iteration, then use different code with the short-circuit branch for the last iteration. That would require the compiler to know that the data is random and that branch prediction is useless, and therefore that this is worth bothering with - compilers aren't that sophisticated, yet.
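For illustration, a manual sketch of that peeling (my own code, assuming v.size() >= 3 and reusing the global v from the question):
int f_peeled()
{
    int n = 0;
    size_t last = v.size() - 2;                     // final value of i in the original loop
    for (size_t i = 1; i < last; ++i)
        n += (v[i-1] < v[i]) & (v[i] < v[i+1]);     // unconditional loads: all indices stay below v.size()-1
    if (v[last-1] < v[last] && v[last] < v[last+1]) // short-circuit kept only for the last iteration
        ++n;
    return n;
}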
To avoid the conditional branch generated by the logical AND (&&) without loading the memory locations into local variables, we can change the logical AND operator into a bitwise AND; the result is almost 4x faster when the data is random:
int f2()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
n += (v[i-1] < v[i]) & (v[i] < v[i+1]); // Bitwise AND
return n;
}
Output
3.642443ms min
3.779982ms median
Result: 166634
3.725968ms min
3.870808ms median
Result: 166634
1.052786ms min
1.081085ms median
Result: 166634
done
The result on gcc 5.3 is 8x faster:
g++ --version
g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
3.761290ms min
4.025739ms median
Result: 166634
3.823133ms min
4.050742ms median
Result: 166634
0.459393ms min
0.505011ms median
Result: 166634
done
You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target; for x86 it is possible because of the SETcc instruction, which generates a one-byte result, 0 or 1, from a condition in the EFLAGS register, the same condition that could be used in a conditional branch, but without branching. In the generated code given by @deniss you can see setl generated; it sets the result to 1 if the condition "less than" is met, as evaluated by the preceding compare instruction:
cmp edx, edi ; a < b ?
setl r10b ; r10b = a < b ? 1 : 0
mov ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea rsi, [rsi + 1] ; ++i
cmp edi, ecx ; b < c ?
setl dl ; dl = b < c ? 1 : 0
and dl, r10b ; dl &= r10b
movzx edx, dl ; edx = zero extended dl
add eax, edx ; n += edx
f0 and f1 are semantically different.
x() && y() involves a short-circuit in the case of x() being false, as we know. This means that if x() is false, then y() must not be evaluated.
This prevents prefetching of the data needed to evaluate y() and (at least on clang) causes the insertion of a conditional jump, which results in branch-predictor misses.
Adding two more tests proves the point.
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int f2()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
auto t1 = v[i-1] < v[i];
auto t2 = v[i] < v[i+1];
if (t1 && t2)
++n;
}
return n;
}
int f3()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
}
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
benchmark(f2);
benchmark(f3);
cout << "\ndone\n";
return 0;
}
results (apple clang, -O2):
1.233948ms min
1.320545ms median
Result: 166850
3.366751ms min
3.493069ms median
Result: 166850
1.261948ms min
1.361748ms median
Result: 166850
1.251434ms min
1.353653ms median
Result: 166850
None of the answers so far have given a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)
v[i-1] < v[i] is work we already did last iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that's a huge problem.
Manually hoisting the v.size() - 1 out of the loop's upper bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers in v, but somehow it still optimizes less well than when computing a size_t upper = v.size() - 1 outside the loop (f2() and f4()).
A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.
Results on 64bit Ubuntu 15.10, on Core2 E6600 (Merom/Conroe microarchitecture).
clang++-3.8 -O3 -march=core2 | g++ 5.2 -O3 -march=core2 | gcc 5.2 -O2 (default -mtune=generic)
f0 1.825ms min(1.858 med) | 5.008ms min(5.048 med) | 5.000 min(5.028 med)
f1 4.637ms min(4.673 med) | 4.899ms min(4.952 med) | 4.894 min(4.931 med)
f2 1.292ms min(1.323 med) | 1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3 1.082ms min(1.117 med) | 2.426ms min(2.458 med) | 2.420 min(2.465 med)
f4 1.291ms min(1.341 med) | 1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)
Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.
f0 and f1:
gcc 5.2: The OP's f0 and f1 both make branchy loops, and won't auto-vectorize. f0 only uses one branch, though, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit compare. So it's still doing both comparisons on every iteration.
clang 3.8: f0: only one load per iteration, but does both compares and ands them together. f1: both compares each iteration, one with a branch to preserve the C semantics. Two loads per iteration.
int f2() {
int n = 0;
size_t upper = v.size()-1; // difference from f0: hoist upper bound and use size_t loop counter
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 compare results. Also, after combining the results from two pcmpgtd instructions, it compares them against an all-zero vector and then masks that. Zero is already the identity element for addition, so that's really silly.
clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.
int f4() {
int n = 0;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
bool ab_lt = a < b;
bool bc_lt = b < c;
n += (ab_lt & bc_lt); // some really minor code-gen differences from f2: auto-vectorizes to better code that runs slightly faster even for this large problem size
}
return n;
}
gcc 5.2 -O3: autovectorizes like f2, but without the extra pcmpeqd.
gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
clang -O3: about the same code as f2.
Attempt at compiler hand-holding
int f3() {
int n = 0;
int a = v[0], b = v[1]; // These happen before checking v.size, defeating the loop vectorizer or something
bool ab_lt = a < b;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int c = v[i+1]; // only one load and compare inside the loop
bool bc_lt = b < c;
n += (ab_lt & bc_lt);
ab_lt = bc_lt;
a = b; // unused inside the loop, only the compare result is needed
b = c;
}
return n;
}
clang 3.8 -O3: Unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies).
4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on core2 where movups unaligned loads are slow) is as fast as gcc's vectorized versions.
gcc 5.2 -O3: Fails to auto-vectorize. My theory on that is that accessing the array outside the loop confuses the auto-vectorizer. Maybe because we do it before checking v.size(), or maybe just in general.
Compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where it's a huge problem (2 to 3 cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) It's the main reason why gcc's code runs 2.5x slower than clang's.
## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
add rcx, 1 # i,
mov edi, DWORD PTR [r10+rcx*4] # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
cmp edi, r8d # a, a # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from the last iteration, so this is really comparing c, b
mov r8d, edi # a, a
setg sil #, tmp124
and edx, esi # D.111089, tmp124 # PARTIAL-REG STALL: reading esi after writing sil
movzx edx, dl # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
add eax, edx # n, D.111085 # n += ...
cmp r9, rcx # upper, i
mov edx, esi # ab_lt, tmp124
jne .L31 #,
ret

Do C++ compilers perform compile-time optimizations on lambda closures?

Suppose we have the following (nonsensical) code:
const int a = 0;
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
Variable 'a' equals zero, so the compiler can deduce at compile time that the instruction 'if(a) c++;' will never be executed and will optimize it away.
My question: Does the same happen with lambda closures?
Check out another piece of code:
const int a = 0;
function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
Will the compiler know that 'a' is 0 and will it optimize the lambda?
Even more sophisticated example:
function<int()> generate_lambda(const int a)
{
return [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
}
function<int()> a_is_zero = generate_lambda(0);
function<int()> a_is_one = generate_lambda(1);
Will the compiler be smart enough to optimize the first lambda when it knows that 'a' is 0 at generation time?
Does gcc or llvm have this kind of optimizations?
I'm asking because I wonder if I should make such optimizations manually when I know that certain assumptions are satisfied on lambda generation time or the compiler will do that for me.
Looking at the assembly generated by gcc5.2 -O2 shows that the optimization does not happen when using std::function:
#include <functional>
int main()
{
const int a = 0;
std::function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
return lambda();
}
compiles to some boilerplate and
movl (%rdi), %ecx
movl $10000000, %edx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
cmpl $1, %ecx
sbbl $-1, %eax
addl $7, %eax
subl $1, %edx
jne .L3
rep; ret
which is the loop you wanted to see optimized away. But if you actually use a lambda (and not an std::function), the optimization does happen:
int main()
{
const int a = 0;
auto lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
return lambda();
}
compiles to
movl $70000000, %eax
ret
i.e. the loop was removed completely.
Afaik, you can expect a lambda to have zero overhead, but std::function is different and comes with a cost (at least in the current state of the optimizers, although people are apparently working on this), even if the code "inside the std::function" would have been optimized. (Take that with a grain of salt and test if in doubt, since this will probably vary between compilers and versions; std::function's overhead can certainly be optimized away in some cases.)
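As a sketch of the usual workaround (mine, not from the answer): keep the callable's exact type as a template parameter instead of erasing it behind std::function, so the optimizer can see through the call:
#include <functional>

// Type-erased: the optimizer may not look through the std::function wrapper.
int run_erased(const std::function<int()>& f) { return f(); }

// Generic: F is the lambda's exact type, so the call can be inlined and folded.
template <typename F>
int run_inline(F&& f) { return f(); }

int main()
{
    const int a = 0;
    auto lambda = [a] { return a ? 1 : 70000000; };
    // run_inline(lambda) typically folds to a constant; run_erased(lambda) may not.
    return run_inline(lambda) == run_erased(lambda) ? 0 : 1;
}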
As @MarcGlisse correctly pointed out, clang 3.6 performs the desired optimization (equivalent to the second case above) even with std::function.
Bonus edit, thanks to @MarcGlisse again: If the function that contains the std::function is not called main, the optimization gcc 5.2 performs is somewhere between the gcc+main and clang cases, i.e. the function gets reduced to return 70000000; plus some extra code.
Bonus edit 2, this time mine: If you use -O3, gcc will (for some reason, as explained in Marco's answer) optimize the std::function to
cmpl $1, (%rdi)
sbbl %eax, %eax
andl $-10000000, %eax
addl $80000000, %eax
ret
and keep the rest as in the not_main case. So I guess, bottom line, one will just have to measure when using std::function.
Neither gcc at -O3 nor MSVC 2015 in Release mode will optimize it away with this simple code; the lambda actually gets called:
#include <functional>
#include <iostream>
int main()
{
int a = 0;
std::function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10; b++)
{
if(a) c++;
c += 7;
}
return c;
};
std::cout << lambda();
return 0;
}
At -O3 this is what gcc generates for the lambda (code from godbolt)
lambda:
cmp DWORD PTR [rdi], 1
sbb eax, eax
and eax, -10
add eax, 80
ret
This is a contrived and optimized way to express the following:
If a is 0, the first comparison sets the carry flag CF. eax is then set to all 32 one-bits by the sbb, ANDed with -10 (which yields -10 in eax), and then 80 is added -> the result is 70.
If a is anything other than 0, the first comparison does not set the carry flag CF, eax is set to zero, the AND leaves it at zero, and adding 80 -> the result is 80.
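Written back out in C++ (my paraphrase of the sequence above, not from the answer):
// What gcc's sbb/and/add sequence computes, branch-free:
int branchless(int a)
{
    int mask = (a == 0) ? -1 : 0; // sbb eax, eax: all ones iff the cmp borrowed
    return (mask & -10) + 80;     // and eax, -10 ; add eax, 80
}
// a == 0: (-10) + 80 = 70 (the folded loop result); a != 0: 0 + 80 = 80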
It has to be noted (thanks Marc Glisse) that if the function is marked as cold (i.e. unlikely to be called) gcc performs the right thing and optimizes the call away.
MSVC generates more verbose code but the comparison isn't skipped.
Clang is the only one which gets it right: the lambda's code isn't optimized any further than gcc's, but the lambda is never called:
mov edi, std::cout
mov esi, 70
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Moral: Clang seems to get it right, but the optimization challenge is still open.

C++ Conditional Operator versus if-else

I have always wondered about this. Let's say we have a variable, string weight, and an input variable, int mode, which can be 1 or 0.
Is there a clear benefit to using:
weight = (mode == 1) ? "mode:1" : "mode:0";
over
if(mode == 1)
weight = "mode:1";
else
weight = "mode:0";
beyond code readability? Are speeds at all affected, is this handled differently by the compiler (such as the ability of certain switch statements to be converted to jump tables)?
The key difference between the conditional operator and an if/else block is that the conditional operator is an expression, rather than a statement. Thus, there are a few places where you can use the conditional operator but cannot use an if/else. For example, initialization of constant objects, like so:
const double biasFactor = (x < 5) ? 2.5 : 6.432;
If you used if/else in this case, biasFactor would have to be non-const.
Additionally, constructor initializer lists also call for expressions rather than statements:
X::X()
: myData(x > 5 ? 0xCAFEBABE : 0xDEADBEEF)
{
}
In this case, myData may not have any assignment operator or non-const member functions defined; its constructor may be the only way to pass any parameters to it.
Also, note that any expression can be turned into a statement by adding a semicolon at the end; the reverse is not true.
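A tiny illustration of that asymmetry (my own example; f and g are hypothetical):
int f() { return 1; } // hypothetical
int g() { return 2; } // hypothetical
int x = 1;
int y = (x > 0) ? f() : g(); // an expression: usable as an initializer
// With a trailing ';' the same expression becomes a statement:
// (x > 0) ? f() : g();
// The reverse is impossible; if/else is only a statement:
// int z = if (x > 0) f(); else g(); // ill-formed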
No, this is purely about presenting the code to a human reader. I'd expect any compiler to generate identical code for these.
With mingw, the assembly code generated with
const char * testFunc()
{
int mode=1;
const char * weight = (mode == 1) ? "mode:1" : "mode:0";
return weight;
}
is:
testFunc():
0040138c: push %ebp
0040138d: mov %esp,%ebp
0040138f: sub $0x10,%esp
10 int mode=1;
00401392: movl $0x1,-0x4(%ebp)
11 const char * weight = (mode == 1)? "mode:1" : "mode:0";
00401399: cmpl $0x1,-0x4(%ebp)
0040139d: jne 0x4013a6 <testFunc()+26>
0040139f: mov $0x403064,%eax
004013a4: jmp 0x4013ab <testFunc()+31>
004013a6: mov $0x40306b,%eax
004013ab: mov %eax,-0x8(%ebp)
12 return weight;
004013ae: mov -0x8(%ebp),%eax
13 }
And with
const char * testFunc()
{
const char * weight;
int mode=1;
if(mode == 1)
weight = "mode:1";
else
weight = "mode:0";
return weight;
}
is:
testFunc():
0040138c: push %ebp
0040138d: mov %esp,%ebp
0040138f: sub $0x10,%esp
11 int mode=1;
00401392: movl $0x1,-0x8(%ebp)
12 if(mode == 1)
00401399: cmpl $0x1,-0x8(%ebp)
0040139d: jne 0x4013a8 <testFunc()+28>
13 weight = "mode:1";
0040139f: movl $0x403064,-0x4(%ebp)
004013a6: jmp 0x4013af <testFunc()+35>
15 weight = "mode:0";
004013a8: movl $0x40306b,-0x4(%ebp)
17 return weight;
004013af: mov -0x4(%ebp),%eax
18 }
Pretty much the same code is generated. The performance of your application shouldn't depend on small details like this one.
So, no, it doesn't make a difference.