I have 2 2D points which are jammed together into an array: int square[4]. These four numbers are interpreted as the definition of a rectangle with horizontal lines parallel to the X-axis and vertical lines parallel to the Y-axis. The elements of the array then respectively define:
Left edge's X coordinate
Bottom edge's Y coordinate
Right edge's X coordinate
Top edge's Y coordinate
I have defined the a winding order in this enum:
enum WindingOrder {
BOTTOM = 0,
RIGHT,
TOP,
LEFT
};
The minimal, complete, verifiable example of my code, is that I am given an output second array: int output[4] and an input WindingOrder edge. I need to populate output as follows:
switch(edge) {
case BOTTOM:
output[0] = square[0]; output[1] = square[1]; output[2] = square[2]; output[3] = square[1];
break;
case RIGHT:
output[0] = square[2]; output[1] = square[1]; output[2] = square[2]; output[3] = square[3];
break;
case TOP:
output[0] = square[2]; output[1] = square[3]; output[2] = square[0]; output[3] = square[3];
break;
case LEFT:
output[0] = square[0]; output[1] = square[3]; output[2] = square[0]; output[3] = square[1];
break;
}
I'm not married to a particular WindingOrder arrangement, nor do I care about the order of the points in ouptut, so if changing those makes this solvable I'm down. What I want to know is can I construct the square indexes to assign to output in a for loop, without an if/case/ternary statement (in other words using bit-wise operations)?
So I'd want, given int i = 0 and WindingOrder edge to do bit-wise operations on them to find:
do {
output[i] = array[???];
} while(++i <= LEFT);
EDIT:
I've received a lot of static array answers (which I believe are the best way to solve this so I've given a +1). But as a logic problem I'm curious how few bit-wise operations could be taken to find an element of a given edge dynamically. So for example, how should this function's body be writen given an arbitrary edge and i: int getIndex(int i, int edge)
Here is a different solution. It is a variation on the static array approach, but without an actual array: the indexing matrix is inlined as a 32 bit unsigned integer computed as constant expression. The column for the edge parameter is selected with a single shift, finally, individual indices for each array element are selected with via simple bit-shifting and masking.
This solution has some advantages:
it is simple to understand
it does not use tests
it does not use a static array, nor any other memory location
it is independent on the winding order and can be easily customized for any array component order
it does not use C99 specific syntax, which may not be available in C++.
This is as close as I could get to a bitwise solution.
#include <iostream>
enum WindingOrder { BOTTOM = 0, RIGHT, TOP, LEFT };
void BitwiseWind(int const *input, int *output, enum WindingOrder edge)
{
unsigned bits = ((0x00010201 << BOTTOM * 2) |
(0x02010203 << RIGHT * 2) |
(0x02030003 << TOP * 2) |
(0x00030001 << LEFT * 2))
>> (edge * 2);
output[0] = input[(bits >> 24) & 3];
output[1] = input[(bits >> 16) & 3];
output[2] = input[(bits >> 8) & 3];
output[3] = input[(bits >> 0) & 3];
}
int main() {
enum WindingOrder edges[4] = { BOTTOM, RIGHT, TOP, LEFT };
int rect[4] = { 1, 3, 4, 5 };
int output[4];
for (int i = 0; i < 4; i++) {
BitwiseWind(rect, output, edges[i]);
std::cout << output[0] << output[1] << output[2] << output[3] << std::endl;
}
return 0;
}
Compiling BitwiseWind for x86-64 with clang -O3 generates 21 instructions, 6 more than the static array version, but without any memory reference. That's a little disappointing, but I hope it could generate fewer instructions for an ARM target, taking advantage of bit-field extraction opcodes. Incidentally, the inlined version using output[i] = array[(i+(i==winding)*2)&3]; produces 25 instructions without any jumps, and gcc -O3 does much worse: it generates a lot more code with 4 tests and jumps.
The generic getIndex function below compiles to just 6 x86 instructions:
int getIndex(int i, int edge) {
return (((0x00010201 << BOTTOM * 2) |
(0x02010203 << RIGHT * 2) |
(0x02030003 << TOP * 2) |
(0x00030001 << LEFT * 2))
>> (edge * 2 + 24 - i * 8)) & 3;
}
Is there a particular reason that this needs to use lots of bitwise operations? It seems quite a complex way to solve the problem?
You seem to be quite worried about speed, for example, you don't want to use modulo because it is expensive. This being the case, why not just use a really simple lookup and unroll the loops? Example on ideone as well.
EDIT: Thanks to chqrlie for input. Have updated answer accordingly.
#include <iostream>
using namespace std;
enum WindingOrder {
BOTTOM = 0,
RIGHT,
TOP,
LEFT
};
void DoWinding1(unsigned int const *const in, unsigned int *const out, const enum WindingOrder ord)
{
static const unsigned int order[4][4] = { [BOTTOM] = {0,1,2,1},
[RIGHT] = {2,1,2,3},
[TOP] = {2,3,0,3},
[LEFT] = {0,3,0,1} };
out[0] = in[order[ord][0]];
out[1] = in[order[ord][1]];
out[2] = in[order[ord][2]];
out[3] = in[order[ord][3]];
}
int main() {
unsigned int idx;
unsigned int rect[4] = {1, 3, 4, 5};
unsigned int out[4] = {0};
DoWinding1(rect, out, BOTTOM);
std::cout << out[0] << out[1] << out[2] << out[3] << std::endl;
return 0;
}
Is that possible to redefine WindingOrder's value set? If it could be , here's my solution , which tried encoding selection indexes in WindingOrder's value set , then simply decoding out select index for input[] by shifting and masking as long the output[] index iterating.
[Thanks to chqrlie for offering code base]:
#include <iostream>
enum WindingOrder {
// the RIGHT most 4-bits indicate the selection index from input[] to output[0]
// the LEFT most 4-bits indicate the selection index from input[] to output[3]
BOTTOM = 0x1210,
RIGHT = 0x3212,
TOP = 0x3230,
LEFT = 0x3010
};
void BitwiseWind(int const *input, int *output, unsigned short edge)
{
for (size_t i = 0; i < 4; i++)
output[i] = input[(edge >> (i*4)) & 0x000F]; // decode
}
int main() {
enum WindingOrder edges[4] = { BOTTOM, RIGHT, TOP, LEFT };
int rect[4] = { 1, 3, 4, 5 };
int output[4];
for (int i = 0; i < 4; i++) {
BitwiseWind(rect, output, edges[i]);
std::cout << output[0] << output[1] << output[2] << output[3] << std::endl;
}
return 0;
}
The generic getIndex(int i,enum WindingOrder edge) would be:
int getIndex(int i,enum WindingOrder edge)
{
return ((edge >> (i*4)) & 0x000F);
}
I did not count how many instruction it used , but i believe it would be quiet few. And really easy to image how it worked. :)
This is untested and there might be a small mistake in some details but the general idea should work.
Copying the array to the output would use the indices {0,1,2,3}. To get a specific edge you have to do some transformations to the indices:
changed_pos changed_to
RIGHT : {2,1,2,3} 0 2
TOP : {0,3,2,3} 1 3
LEFT : {0,1,0,3} 2 0
BOTTOM: {0,1,2,1} 3 1
So basically you have to add 2 mod 4 for the specific position of your winding.
So the (like I said untested) snipped could look like this
for (size_t i=0; i<4; ++i) {
output[i] = array[(i+(i==edge)*2)%4];
}
If the comparison is true you add 1*2=2, else 0*2=0 to the index and do mod 4 to stay in the range.
Your enum have to look like this (but I guess you figured this out by yourself):
enum WindingOrder {
RIGHT,
TOP,
LEFT,
BOTTOM
};
MWE:
#include <iostream>
#include <string>
#include <vector>
enum WindingOrder {
RIGHT=0,
TOP,
LEFT,
BOTTOM
};
int main()
{
std::vector<int> array = {2,4,8,9};
std::vector<int> output(4);
std::vector<WindingOrder> test = {LEFT,RIGHT,BOTTOM,TOP};
for (auto winding : test) {
for (size_t i=0; i<4; ++i) {
output[i] = array[(i+(i==winding)*2)%4];
}
std::cout << "winding " << winding << ": " << output[0] << output[1] << output[2] << output[3] << std::endl;
}
}
From the answer of yourself, you're close to the solution. I think what you need here is Karnaugh map, which is a universal method for most Boolean algebra problems.
Suppose
The elements of the array then respectively define:
input[0]: Left edge's X coordinate
input[0]: Bottom edge's Y coordinate
input[0]: Right edge's X coordinate
input[0]: Top edge's Y coordinate
I have defined the a winding order in this enum:
enum WindingOrder {
BOTTOM = 0,
RIGHT,
TOP,
LEFT
};
Since the for-loop may looks like
for (int k = 0; k != 4; ++k) {
int i = getIndex(k, edge); // calculate i from k and edge
output[k] = square[i];
}
Then the input is k(output[k]) and edge, the output is i(square[i]). And because i has 2 bits, then two logic functions are needed.
Here we use P = F1(A, B, C, D) and Q = F2(A, B, C, D) to represent the logic functions, in which A, B, C, D, P and Q are all single bit, and
k = (A << 1) + B;
edge = (C << 1) + D;
i = (P << 1) + Q;
Then what we need to do is just deduce the two logic functions F1 and F2 from the given conditions.
From the switch case statements you gave, we can easily get the truth table.
k\edge 0 1 3 2
0 0 2 0 2
1 1 1 3 3
3 1 3 1 3
2 2 2 0 0
Then separate this into two truth table for two bits P and Q.
P edge 0 1 3 2
k AB\CD 00 01 11 10
0 00 0 1 0 1
1 01 0 0 1 1
3 11 0 1 0 1
2 10 1 1 0 0
Q edge 0 1 3 2
k AB\CD 00 01 11 10
0 00 0 0 0 0
1 01 1 1 1 1
3 11 1 1 1 1
2 10 0 0 0 0
These are the Karnaugh maps that I mentioned at the beginning. We can easily get the functions.
F1(A, B, C, D) = A~B~C + A~CD + ~B~CD + ~ABC + ~AC~D + BC~D
F2(A, B, C, D) = B
Then the program will be
int getIndex(int k, int edge) {
int A = (k >> 1) & 1;
int B = k & 1;
int C = (edge >> 1) & 1;
int D = edge & 1;
int P = A&~B&~C | A&~C&D | ~B&~C&D | ~A&B&C | ~A&C&~D | B&C&~D;
int Q = B;
return (P << 1) + Q;
}
Passed the examine here. Of course, you can simplify the function even more with the XOR.
EDIT
Using XOR to simplify the expression can be achieved most of time, since A^B == A~B + ~AB. But this may not the thing you want. First, I think the performance varies only a little between the Sum of Products(SoP) expression and the even more simplified version with XOR. Second, there is not a universal method (as far as I know) to simplify an expression with XOR, so you have to rely on your own experience to do this work.
There are sixteen possible logic functions of two variables, but in digital logic hardware, the simplest gate circuits implement only four of them: AND, OR, and the complements of those (NAND and NOR). And the Karnaugh map are used to simplify real-world logic requirements so that they can be implemented using a minimum number of physical logic gates.
There are two common expressions used here, Sum of Products and Product of Sums expressions. These two expressions can be implemented directly using only AND and OR logic operators. And they can be deduced directly with Karnaugh map.
If you define the coordinates and directions in clockwise order starting at left,
#define LEFT 0
#define TOP 1
#define RIGHT 2
#define BOTTOM 3
you can use
void edge_line(int line[4], const int rect[4], const int edge)
{
line[0] = rect[ edge & 2 ];
line[1] = rect[ ((edge + 3) & 2) + 1 ];
line[2] = rect[ ((edge + 1) & 2) ];
line[3] = rect[ (edge & 2) + 1 ];
}
to copy the edge line coordinates (each line segment in clockwise winding order). It looks suboptimal, but using -O2, GCC-4.8, you get essentially
edge_line:
pushl %esi
pushl %ebx
movl 20(%esp), %ecx
movl 16(%esp), %edx
movl 12(%esp), %eax
movl %ecx, %esi
andl $2, %esi
movl (%edx,%esi,4), %ebx
movl %ebx, (%eax)
leal 3(%ecx), %ebx
addl $1, %ecx
andl $2, %ebx
andl $2, %ecx
addl $1, %ebx
movl (%edx,%ebx,4), %ebx
movl %ebx, 4(%eax)
movl (%edx,%ecx,4), %ecx
movl %ecx, 8(%eax)
movl 4(%edx,%esi,4), %edx
movl %edx, 12(%eax)
popl %ebx
popl %esi
ret
but on 64-bit, even better
edge_line:
movl %edx, %ecx
andl $2, %ecx
movslq %ecx, %rcx
movl (%rsi,%rcx,4), %eax
movl %eax, (%rdi)
leal 3(%rdx), %eax
addl $1, %edx
andl $2, %edx
andl $2, %eax
movslq %edx, %rdx
cltq
movl 4(%rsi,%rax,4), %eax
movl %eax, 4(%rdi)
movl (%rsi,%rdx,4), %eax
movl %eax, 8(%rdi)
movl 4(%rsi,%rcx,4), %eax
movl %eax, 12(%rdi)
ret
As you can see, there are no conditionals, and the binary operators combine and optimize to very few instructions.
Edited to add:
If we define a getIndex(i, edge) function, using three binary ANDs, one bit shift (right by 1), three additions, and one subtraction,
int getIndex(const int i, const int edge)
{
return (i & 1) + ((edge + 4 - (i & 1) + (i >> 1)) & 2);
}
with which edge_line() can be implemented as
void edge_line(int line[4], const int rect[4], const int edge)
{
line[0] = rect[ getIndex(0, edge) ];
line[1] = rect[ getIndex(1, edge) ];
line[2] = rect[ getIndex(2, edge) ];
line[3] = rect[ getIndex(3, edge) ];
}
we get the exact same results as before. Using GCC-4.8.4 and -O2 on AMD64/x86-64 compiles to
getIndex:
movl %edi, %edx
sarl %edi
andl $1, %edx
subl %edx, %esi
leal 4(%rsi,%rdi), %eax
andl $2, %eax
addl %edx, %eax
ret
and to
getIndex:
movl 4(%esp), %eax
movl 8(%esp), %edx
movl %eax, %ecx
andl $1, %ecx
subl %ecx, %edx
sarl %eax
leal 4(%edx,%eax), %eax
andl $2, %eax
addl %ecx, %eax
ret
on i686. Note that I arrived at the above form using the four-by-four result table; there are other, more rigorous ways to construct it, and there might even be a more optimal form. Because of this, I seriously recommend adding a big huge comment above the function, explaining the intent, and preferably also showing the result table. Something like
/* This function returns an array index:
* 0 for left
* 1 for top
* 2 for right
* 3 for bottom
* given edge:
* 0 for left
* 1 for top
* 2 for right
* 3 for bottom
* and i:
* 0 for initial x
* 1 for initial y
* 2 for final x
* 3 for final y
*
* The result table is
* | edge
* | 0 1 2 3
* ----+-------
* i=0 | 0 0 2 2
* i=1 | 3 1 1 3
* i=2 | 0 2 2 0
* i=3 | 1 1 3 3
*
* Apologies for the write-only code.
*/
Or something similar.
Lets call our goal variable to be used to index squared: int index.
Now we'll create a table of the desired index for edge versus i, with edge across the row and i down the column:
║0│1│2│3
═╬═╪═╪═╪═
0║0│1│2│1
─╫─┼─┼─┼─
1║2│1│2│3
─╫─┼─┼─┼─
2║2│3│0│3
─╫─┼─┼─┼─
3║0│3│0│1
It is obvious from this that the least significant bit of index is always odd for odd is and even for even is. So if we could find the most significant bit of index we'd just to or that with i & 1 and we'd have our index. So lets make another table of just the most significant bit of index for the same edge versus i table:
║0│1│2│3
═╬═╪═╪═╪═
0║0│0│1│0
─╫─┼─┼─┼─
1║1│0│1│1
─╫─┼─┼─┼─
2║1│1│0│1
─╫─┼─┼─┼─
3║0│1│0│0
We can see several things here:
When i is 0 or 3 the columns are identical depending only on edge
These columns are set when edge is 1 or 2
When i is 1 or 2 the columns are inverse of each other
These columns are set when only edge's most significant bit or only i's most significant bit is set
So let's start by breaking edge and i into least significant and most significant bits:
const int ib0 = i & 1;
const int ib1 = (i & 2) >> 1;
const int eb0 = edge & 1;
const int eb1 = (edge & 2) >> 1;
From here we can easily find whether i is 0 or 3:
const int iXor = ib0 ^ ib1;
For the 0/3 condition:
const int iXorCondition = ib1 ^ eb1;
And the 1/2 condition:
const int iNXorCondition = eb0 ^ eb1;
Now we'll just need to combine those with their respective iXor and put back index's least significant bit:
const int index = ((iNXorCondition & ~iXor | iXorCondition & iXor) << 1) | ib0;
Putting this all together into a convenient function we get:
int getIndex(int i, int edge) {
const int ib0 = i & 1;
const int ib1 = (i & 2) >> 1;
const int eb0 = edge & 1;
const int eb1 = (edge & 2) >> 1;
const int iXor = ib0 ^ ib1;
const int iNXorCondition = eb0 ^ eb1;
const int iXorCondition = ib1 ^ eb1;
return ((iNXorCondition & ~iXor | iXorCondition & iXor) << 1) | ib0;
}
I've written a checking live example here.
What I want to know is can I construct the square indexes to assign to output in a for loop, without an if/case/ternary statement (in other words using bit-wise operations) ?
I would ask you what you expect to achieve in doing that ?
My view is that the switch-case construct will, typically, be completely reorganized by a compiler's optimization code. It's best, IMO, to leave that code alone and let the compiler do that.
There are only two conditions where Id change that view ;
You were writing in OpenCL ( rather than C ) and wanted to optimize the code where decision-branch logic can be problematic for performance.
You wanted to use explicit coding for SIMD vectorization. There are some special operations that might help there, but it's a coding option that locks you into things that might not work well on hardware without SIMD instruction sets ( or perform quite differently on different hardware ). It's also worth noting that some compilers can auto-vectorize with the right coding.
I just see little or no advantage to coding these operations any other way than switch-case for C.
This is a way to achieve that:
do {
output[i] = square[
(edge & 1) * (
!(i & 1) * ((edge + 1) & 2) +
(i & 1) * (
(!((edge - 1)/2)&1) * i +
(((edge - 1)/2)&1) * (4-i)
)
) +
!(edge & 1) * (
(i & 1) * (edge + 1) +
!(i & 1) * ((edge & 2) - ((edge & 2)-1) * i)
)
];
} while(++i <= LEFT);
To help you understand I indented the code, you can obviously erase all the white spaces. I have put a tab where ever I wanted to separate two cases. By the way as you see the calculation is in two sections for two different cases which are symmetrical but I solved each case with a different algorithm so you can see various ways to achieve things.
Related
In this code snippet, I'm comparing performance of two functionally identical loops:
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
and
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
The first one runs significantly slower than the second one across a number of different C++ compilers with optimization flag set to O2:
second loop is about 330% slower now with Clang 3.7.0
second loop is about 2% slower with gcc 4.9.3
second loop is about 2% slower with Visual C++ 2015
I'm puzzled that modern C++ optimizers have problems handling this case. Any clues why? Do I have to write ugly code without using temporary variables in order to get the best performance?
Using temporary variables makes the code faster, sometimes dramatically, now. What is going on?
The full code I'm using is provided below:
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
cout << "\ndone\n";
return 0;
}
It seems like the compiler lacks knowledge about the relationship between std::vector<>::size() and internal vector buffer size. Consider std::vector being our custom bugged_vector vector-like object with slight bug - its ::size() can sometimes be one more than internal buffer size n, but only then v[n-2] >= v[n-1].
Then two snippets have different semantics again: first one has undefined behavior, as we access element v[v.size() - 1]. The second one, however, doesn't have: due to short-circuit nature of &&, we don't ever read v[v.size() - 1] on the last iteration.
So, if compiler can't prove that our v is not a bugged_vector, it must short-circuit, which introduce additional jump in a machine code.
By looking at assembly output from clang, we can see that it actually happens.
From the Godbolt Compiler Explorer, with clang 3.7.0 -O2, the loop in f0 is:
### f0: just the loop
.LBB1_2: # =>This Inner Loop Header: Depth=1
mov edi, ecx
cmp edx, edi
setl r10b
mov ecx, dword ptr [r8 + 4*rsi + 4]
lea rsi, [rsi + 1]
cmp edi, ecx
setl dl
and dl, r10b
movzx edx, dl
add eax, edx
cmp rsi, r9
mov edx, edi
jb .LBB1_2
And for f1:
### f1: just the loop
.LBB2_2: # =>This Inner Loop Header: Depth=1
mov esi, r10d
mov r10d, dword ptr [r9 + 4*rdi]
lea rcx, [rdi + 1]
cmp esi, r10d
jge .LBB2_4 # <== This is Extra Jump
cmp r10d, dword ptr [r9 + 4*rdi + 4]
setl dl
movzx edx, dl
add eax, edx
.LBB2_4: # %._crit_edge.3
cmp rcx, r8
mov rdi, rcx
jb .LBB2_2
I've pointed out the extra jump in f1. And as we (hopefuly) know, conditional jumps in a tight loops are bad for performance. (See the performance guides in the x86 tag wiki for details.)
GCC and Visual Studio are aware that std::vector is well-behaved, and produce almost identical assembly for both snippets.
Edit. It turns out clang does better job optimizing the code. All three compilers can't prove that it is safe to read v[i + 1] prior to comparison in the second example (or choose not to), but only clang manages to optimize the first example with the additional information that reading v[i + 1] is either valid or UB.
A performance difference of 2% is negligible can be explained by different order or choice of some instructions.
Here's additional insight to expand on #deniss' answer, which correctly diagnosed the issue.
Incidentally, this is related to the most popular C++ Q&A of all time "Why is processing a sorted array faster than an unsorted array?".
The main issue is the compiler must honor the logical AND operator (&&) and not load from v[i+1] unless the first condition is true. This is a consequence of the semantics of the Logical AND operator as well as the tightened memory model semantics introduced with C++11, the relevant clauses in the draft of the standard are
5.14 Logical AND operator [expr.log.and]
Unlike &, && guarantees left-to-right evaluation: the second
operand is not evaluated if the first operand is false.ISO C++14 Standard (draft N3797)
and for speculative reads
1.10 Multi-threaded executions and data races [intro.multithread]
23 [ Note: Transformations that introduce a speculative read of a potentially shared memory location may not preserve the semantics of the C++ program as defined in this standard, since they potentially introduce a data race. However, they are typically valid in the context of an optimizing compiler that targets a specific machine with well-defined semantics for data races. They would be invalid for a hypothetical machine that is not tolerant of races or provides hardware race detection. — end note ]ISO C++14 Standard (draft N3797)
My guess is optimizers play it safe and currently choose not to issue speculative loads to potentially shared memory rather than special case for each target processor whether the speculative load could introduce a detectable data race for that target.
In order to implement this, the compiler generates a conditional branch. Usually this isn't noticeable because modern processors have very sophisticated branch prediction, and the misprediction rate is typically very low. However the data here is random - this kills branch prediction. The cost of a misprediction is 10 to 20 CPU cycles, considering that the CPU is typically retiring 2 instructions per cycle this is equivalent to 20 to 40 instructions. If the prediction rate is 50% (random) then every iteration has a mispredict penalty equivalent to 10 to 20 instructions - HUGE.
Note: The compiler could prove that elements v[0] to v[v.size()-2] will be referenced, in that order, regardless of the values they contain. This would allow the compiler in this case to generate code that unconditionally loads all but the last element of the vector. The last element of the vector, at v[v.size()-1], may only be loaded in the last iteration of the loop and only if the first condition is true.
The compiler could therefore generate code for the loop without the short circuit branch up until the last iteration, then use different code with the short circuit branch for the last iteration - that would require the compiler knowing that the data is random and branch prediction is useless and therefore that it is worth bothering with that - compilers aren't that sophisticated - yet.
To avoid the conditional branch generated by the Logical AND (&&) and avoid loading the memory locations into local variables we can change the Logical AND operator into a Bitwise AND, code snippet here, the result is almost 4x faster when the data is random
int f2()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i)
n += (v[i-1] < v[i]) & (v[i] < v[i+1]); // Bitwise AND
return n;
}
Output
3.642443ms min
3.779982ms median
Result: 166634
3.725968ms min
3.870808ms median
Result: 166634
1.052786ms min
1.081085ms median
Result: 166634
done
The result on gcc 5.3 is 8x faster (live in Coliru here)
g++ --version
g++ -std=c++14 -O3 -Wall -Wextra -pedantic -pthread -pedantic-errors main.cpp -lm && ./a.out
g++ (GCC) 5.3.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
3.761290ms min
4.025739ms median
Result: 166634
3.823133ms min
4.050742ms median
Result: 166634
0.459393ms min
0.505011ms median
Result: 166634
done
You might wonder how the compiler can evaluate the comparison v[i-1] < v[i] without generating a conditional branch. The answer depends on the target, for x86 this is possible because of the SETcc instruction, which generates a one byte result, 0 or 1, depending on a condition in the EFLAGS register, the same condition that could be used in a conditional branch, but without branching. In the generated code given by #deniss you can see setl generated, that sets the result to 1 if the condition "less than" is met, which is evaluated by the previous compare instruction:
cmp edx, edi ; a < b ?
setl r10b ; r10b = a < b ? 1 : 0
mov ecx, dword ptr [r8 + 4*rsi + 4] ; c = v[i+1]
lea rsi, [rsi + 1] ; ++i
cmp edi, ecx ; b < c ?
setl dl ; dl = b < c ? 1 : 0
and dl, r10b ; dl &= r10b
movzx edx, dl ; edx = zero extended dl
add eax, edx ; n += edx
f0 and f1 are semantically different.
x() && y() involves a short-circuit in the case of x() being false as we know. This means that if x() is false , then y() must not be evaluated.
This prevents prefetching of the data in order to evaluate y() and (at least on clang) is causing the insertion of a conditional jump, which is resulting in branch-predictor misses.
Adding another 2 tests proves the point.
#include <algorithm>
#include <chrono>
#include <random>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
using namespace std::chrono;
vector<int> v(1'000'000);
int f0()
{
int n = 0;
for (int i = 1; i < v.size()-1; ++i) {
int a = v[i-1];
int b = v[i];
int c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
int f1()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
if (v[i-1] < v[i] && v[i] < v[i+1])
++n;
return n;
}
int f2()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
auto t1 = v[i-1] < v[i];
auto t2 = v[i] < v[i+1];
if (t1 && t2)
++n;
}
return n;
}
int f3()
{
int n = 0;
auto s = v.size() - 1;
for (size_t i = 1; i < s; ++i)
{
n += 1 * (v[i-1] < v[i]) * (v[i] < v[i+1]);
}
return n;
}
int main()
{
auto benchmark = [](int (*f)()) {
const int N = 100;
volatile long long result = 0;
vector<long long> timings(N);
for (int i = 0; i < N; ++i) {
auto t0 = high_resolution_clock::now();
result += f();
auto t1 = high_resolution_clock::now();
timings[i] = duration_cast<nanoseconds>(t1-t0).count();
}
sort(timings.begin(), timings.end());
cout << fixed << setprecision(6) << timings.front()/1'000'000.0 << "ms min\n";
cout << timings[timings.size()/2]/1'000'000.0 << "ms median\n" << "Result: " << result/N << "\n\n";
};
mt19937 generator (31415); // deterministic seed
uniform_int_distribution<> distribution(0, 1023);
for (auto& e: v)
e = distribution(generator);
benchmark(f0);
benchmark(f1);
benchmark(f2);
benchmark(f3);
cout << "\ndone\n";
return 0;
}
results (apple clang, -O2):
1.233948ms min
1.320545ms median
Result: 166850
3.366751ms min
3.493069ms median
Result: 166850
1.261948ms min
1.361748ms median
Result: 166850
1.251434ms min
1.353653ms median
Result: 166850
None of the answers so far have given a version of f() that gcc or clang can fully optimize. They all generate asm that does both compares each iteration. See the code with asm output on the Godbolt Compiler Explorer. (Important background knowledge for predicting performance from asm output: Agner Fog's microarchitecture guide, and other links on the x86 tag wiki. As always, it usually works best to profile with performance counters to find stalls.)
v[i-1] < v[i] is work we already did last iteration, when we evaluated v[i] < v[i+1]. In theory, helping the compiler grok that would let it optimize better (see f3()). In practice, that ends up defeating auto-vectorization in some cases, and gcc emits code with partial-register stalls, even with -mtune=core2 where that's a huge problem.
Manually hoisting the v.size() - 1 out of the loop's upper bound check seems to help. The OP's f0 and f1 don't actually re-compute v.size() from the start/end pointers in v, but somehow it still optimizes less well than when computing a size_t upper = v.size() - 1 outside the loop (f2() and f4()).
A separate issue is that using an int loop counter with a size_t upper bound means the loop is potentially infinite. I'm not sure how much impact this has on other optimizations.
Bottom line: compilers are complex beasts. Predicting which version will optimize well is not at all obvious or straightforward.
Results on 64bit Ubuntu 15.10, on Core2 E6600 (Merom/Conroe microarchitecture).
clang++-3.8 -O3 -march=core2 | g++ 5.2 -O3 -march=core2 | gcc 5.2 -O2 (default -mtune=generic)
f0 1.825ms min(1.858 med) | 5.008ms min(5.048 med) | 5.000 min(5.028 med)
f1 4.637ms min(4.673 med) | 4.899ms min(4.952 med) | 4.894 min(4.931 med)
f2 1.292ms min(1.323 med) | 1.058ms min(1.088 med) (autovec) | 4.888 min(4.912 med)
f3 1.082ms min(1.117 med) | 2.426ms min(2.458 med) | 2.420 min(2.465 med)
f4 1.291ms min(1.341 med) | 1.022ms min(1.052 med) (autovec) | 2.529 min(2.560 med)
Results would be different on Intel SnB-family hardware, esp. IvyBridge and later where there would be no partial register slowdowns at all. Core2 is limited by slow unaligned loads, and only one load per cycle. The loops may be small enough that decode isn't an issue, though.
f0 and f1:
gcc 5.2: The OP's f0 and f1 both make branchy loops, and won't auto-vectorize. f0 only uses one branch, though, and uses a weird setl sil / cmp sil, 1 / sbb eax, -1 to do the second half of the short-circuit compare. So it's still doing both comparisons on every iteration.
clang 3.8: f0: only one load per iteration, but does both compares and ands them together. f1: both compares each iteration, one with a branch to preserve the C semantics. Two loads per iteration.
int f2() {
int n = 0;
size_t upper = v.size()-1; // difference from f0: hoist upper bound and use size_t loop counter
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
if (a < b && b < c)
++n;
}
return n;
}
gcc 5.2 -O3: auto-vectorizes, with three loads to get the three offset vectors needed to produce one vector of 4 compare results. Also, after combining the results from two pcmpgtd instructions, compares them against a vector of all-zero and then masks that. Zero is already the identity element for addition, so that's really silly.
clang 3.8 -O3: unrolls: every iteration does two loads, three cmp/setcc, two ands, and two adds.
int f4() {
int n = 0;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int a = v[i-1], b = v[i], c = v[i+1];
bool ab_lt = a < b;
bool bc_lt = b < c;
n += (ab_lt & bc_lt); // some really minor code-gen differences from f2: auto-vectorizes to better code that runs slightly faster even for this large problem size
}
return n;
}
gcc 5.2 -O3: autovectorizes like f2, but without the extra pcmpeqd.
gcc 5.2 -O2: didn't investigate why this is twice as fast as f2.
clang -O3: about the same code as f2.
Attempt at compiler hand-holding
int f3() {
int n = 0;
int a = v[0], b = v[1]; // These happen before checking v.size, defeating the loop vectorizer or something
bool ab_lt = a < b;
size_t upper = v.size()-1;
for (size_t i = 1; i < upper; ++i) {
int c = v[i+1]; // only one load and compare inside the loop
bool bc_lt = b < c;
n += (ab_lt & bc_lt);
ab_lt = bc_lt;
a = b; // unused inside the loop, only the compare result is needed
b = c;
}
return n;
}
clang 3.8 -O3: Unrolls with 4 loads inside the loop (clang typically likes to unroll by 4 when there aren't complex loop-carried dependencies).
4 cmp/setcc, 4x and/movzx, 4x add. So clang did exactly what I was hoping, and made near-optimal scalar code. This was the fastest non-vectorized version, and (on core2 where movups unaligned loads are slow) is as fast as gcc's vectorized versions.
gcc 5.2 -O3: Fails to auto-vectorize. My theory on that is that accessing the array outside the loop confuses the auto-vectorizer. Maybe because we do it before checking v.size(), or maybe just in general.
Compiles to the scalar code we'd hope for, with one load, one cmp/setcc, and one and per iteration. But gcc creates a partial-register stall, even with -mtune=core2 where it's a huge problem (2 to 3 cycle stall to insert a merging uop when reading a wide reg after writing only part of it). (setcc is only available with an 8-bit operand size, which IMO is something AMD should have changed when they designed the AMD64 ISA.) It's the main reason why gcc's code runs 2.5x slower than clang's.
## the loop in f3(), from gcc 5.2 -O3 (same code with -O2)
.L31:
add rcx, 1 # i,
mov edi, DWORD PTR [r10+rcx*4] # a, MEM[base: _19, index: i_13, step: 4, offset: 0]
cmp edi, r8d # a, a # gcc's verbose-asm comments are a bit bogus here: one of these `a`s is from the last iteration, so this is really comparing c, b
mov r8d, edi # a, a
setg sil #, tmp124
and edx, esi # D.111089, tmp124 # PARTIAL-REG STALL: reading esi after writing sil
movzx edx, dl # using movzx to widen sil to esi would have solved the problem, instead of doing it after the and
add eax, edx # n, D.111085 # n += ...
cmp r9, rcx # upper, i
mov edx, esi # ab_lt, tmp124
jne .L31 #,
ret
Suppose we have the following (nonsensical) code:
const int a = 0;
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
Variable 'a' equals zero, so the compiler can deduce on compile time, that the instruction 'if(a) c++;' will never be executed and will optimize it away.
My question: Does the same happen with lambda closures?
Check out another piece of code:
const int a = 0;
function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
}
Will the compiler know that 'a' is 0 and will it optimize the lambda?
Even more sophisticated example:
function<int()> generate_lambda(const int a)
{
return [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
}
function<int()> a_is_zero = generate_lambda(0);
function<int()> a_is_one = generate_lambda(1);
Will the compiler be smart enough to optimize the first lambda when it knows that 'a' is 0 at generation time?
Does gcc or llvm have this kind of optimizations?
I'm asking because I wonder if I should make such optimizations manually when I know that certain assumptions are satisfied on lambda generation time or the compiler will do that for me.
Looking at the assembly generated by gcc5.2 -O2 shows that the optimization does not happen when using std::function:
#include <functional>
int main()
{
const int a = 0;
std::function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
return lambda();
}
compiles to some boilerplate and
movl (%rdi), %ecx
movl $10000000, %edx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
cmpl $1, %ecx
sbbl $-1, %eax
addl $7, %eax
subl $1, %edx
jne .L3
rep; ret
which is the loop you wanted to see optimized away. (Live) But if you actually use a lambda (and not an std::function), the optimization does happen:
int main()
{
const int a = 0;
auto lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10000000; b++)
{
if(a) c++;
c += 7;
}
return c;
};
return lambda();
}
compiles to
movl $70000000, %eax
ret
i.e. the loop was removed completely. (Live)
Afaik, you can expect a lambda to have zero overhead, but std::function is different and comes with a cost (at least at the current state of the optimizers, although people apparently work on this), even if the code "inside the std::function" would have been optimized. (Take that with a grain of salt and try if in doubt, since this will probably vary between compilers and versions. std::functions overhead can certainly be optimized away.)
As #MarcGlisse correctly pointed out, clang3.6 performs the desired optimization (equivalent to the second case above) even with std::function. (Live)
Bonus edit, thanks to #MarkGlisse again: If the function that contains the std::function is not called main, the optimization happening with gcc5.2 is somewhere between gcc+main and clang, i.e. the function gets reduced to return 70000000; plus some extra code. (Live)
Bonus edit 2, this time mine: If you use -O3, gcc will, (for some reason) as explained in Marco's answer, optimize the std::function to
cmpl $1, (%rdi)
sbbl %eax, %eax
andl $-10000000, %eax
addl $80000000, %eax
ret
and keep the rest as in the not_main case. So I guess at the bottom of the line, one will just have to measure when using std::function.
Both gcc at -O3 and MSVC2015 Release won't optimize it away with this simple code and the lambda would actually be called
#include <functional>
#include <iostream>
int main()
{
int a = 0;
std::function<int()> lambda = [a]()
{
int c = 0;
for(int b = 0; b < 10; b++)
{
if(a) c++;
c += 7;
}
return c;
};
std::cout << lambda();
return 0;
}
At -O3 this is what gcc generates for the lambda (code from godbolt)
lambda:
cmp DWORD PTR [rdi], 1
sbb eax, eax
and eax, -10
add eax, 80
ret
This is a contrived and optimized way to express the following:
If a was a 0, the first comparison would set the carry flag CR. eax would actually be set to 32 1 values, and'ed with -10 (and that would yield -10 in eax) and then added 80 -> result is 70.
If a was something different from 0, the first comparison would not set the carry flag CR, eax would be set to zero, the and would have no effect and it would be added 80 -> result is 80.
It has to be noted (thanks Marc Glisse) that if the function is marked as cold (i.e. unlikely to be called) gcc performs the right thing and optimizes the call away.
MSVC generates more verbose code but the comparison isn't skipped.
Clang is the only one which gets it right: the lambda hasn't its code optimized more than gcc did but it is not called
mov edi, std::cout
mov esi, 70
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Morale: Clang seems to get it right but the optimization challenge is still open.
g++ (4.7.2) and similar versions seem to evaluate constexpr surprisingly fast during compile-time. On my machines in fact much faster than the compiled program during runtime.
Is there a reasonable explanation for that behavior?
Are there optimization techniques involved which are only
applicable at compile-time, that can be executed quicker than actual compiled code?
If so, which?
Here`s my test program and the observed results.
#include <iostream>
constexpr int mc91(int n)
{
return (n > 100)? n-10 : mc91(mc91(n+11));
}
constexpr double foo(double n)
{
return (n>2)? (0.9999)*((unsigned int)(foo(n-1)+foo(n-2))%100):1;
}
constexpr unsigned ack( unsigned m, unsigned n )
{
return m == 0
? n + 1
: n == 0
? ack( m - 1, 1 )
: ack( m - 1, ack( m, n - 1 ) );
}
constexpr unsigned slow91(int n) {
return mc91(mc91(foo(n))%100);
}
int main(void)
{
constexpr unsigned int compiletime_ack=ack(3,14);
constexpr int compiletime_91=slow91(49);
static_assert( compiletime_ack == 131069, "Must be evaluated at compile-time" );
static_assert( compiletime_91 == 91, "Must be evaluated at compile-time" );
std::cout << compiletime_ack << std::endl;
std::cout << compiletime_91 << std::endl;
std::cout << ack(3,14) << std::endl;
std::cout << slow91(49) << std::endl;
return 0;
}
compiletime:
time g++ constexpr.cpp -std=c++11 -fconstexpr-depth=10000000 -O3
real 0m0.645s
user 0m0.600s
sys 0m0.032s
runtime:
time ./a.out
131069
91
131069
91
real 0m43.708s
user 0m43.567s
sys 0m0.008s
Here mc91 is the usual mac carthy f91 (as can be found on wikipedia) and foo is just a useless function returning real values between about 1 and 100, with a fib runtime complexity.
Both the slow calculation of 91 and the ackermann functions get evaluated with the same arguments by the compiler and the compiled program.
Surprisingly the program would even run faster, just generating code and running it through the compiler than executing the code itself.
At compile-time, redundant (identical) constexpr calls can be memoized, while run-time recursive behavior does not provide this.
If you change every recursive function such as...
constexpr unsigned slow91(int n) {
return mc91(mc91(foo(n))%100);
}
... to a form that isn't constexpr, but does remember past calculations at runtime:
std::unordered_map< int, boost::optional<unsigned> > results4;
// parameter(s) ^^^ result ^^^^^^^^
unsigned slow91(int n) {
boost::optional<unsigned> &ret = results4[n];
if ( !ret )
{
ret = mc91(mc91(foo(n))%100);
}
return *ret;
}
You will get less surprising results.
compiletime:
time g++ test.cpp -std=c++11 -O3
real 0m1.708s
user 0m1.496s
sys 0m0.176s
runtime:
time ./a.out
131069
91
131069
91
real 0m0.097s
user 0m0.064s
sys 0m0.032s
Memoization
This is a very interesting "discovery" but the answer is probably more simple than you think it is.
Something can be evaluated compile-time when declared constexpr if all values involved are known at compile time (and if the variable where the value is supposed to end up is declared constexpr as well) with that said imagine the following pseudo-code:
f(x) = g(x)
g(x) = x + h(x,x)
h(x,y) = x + y
since every value is known at compile time the compiler can rewrite the above into the, equivalent, below:
f(x) = x + x + x
To put it in words every function call has been removed and replaced with that of the expression itself. What is also applicable is a method called memoization where results of passed calculated expresions are stored away so you only need to do the hard work once.
If you know that g(5) = 15 why calculate it again? instead just replace g(5) with 15 everytime it is needed, This is possible since a function declared as constexpr isn't allowed to have side-effects .
Runtime
In runtime this is not happening (since we didn't tell the code to behave this way). The little guy running through your code will need to jump from f to g to h and then jump back to g from h before it jumps from g to f all while he stores the return value of each function and passing it along to the next one.
Even if this guy is very very tiny and that he doesn't need to jump very very far he still doesn't like jumping back and forth all the time, it takes a lot for him to do this and with that; it takes time.
But in the OPs example, is it really calculated compile-time?
Yes, and to those not believing that the compiler actually calculates this and put it as constants in the finished binary I will supply the relevant assembly instructions from OPs code below (output of g++ -S -Wall -pedantic -fconstexpr-depth=1000000 -std=c++11)
main:
.LFB1200:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl $131069, -4(%rbp)
movl $91, -8(%rbp)
movl $131069, %esi # one of the values from constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEj
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
call _ZNSolsEPFRSoS_E
movl $91, %esi # the other value from our constexpr
movl $_ZSt4cout, %edi
call _ZNSolsEi
movl $_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_, %esi
movq %rax, %rdi
# ...
# a lot of jumping is taking place down here
# see the full output at http://codepad.org/Q8D7c41y
I'm trying to manipulate a special struct and I need some sort of a swizzle operator. For this it makes sense to have an overloaded array [] operator, but I don't want to have any branching since the particular specification of the struct allows for a theoretical workaround.
Currently, the struct looks like this:
struct f32x4
{
float fLow[2];
float fHigh[2];
f32x4(float a, float b, float c, float d)
{
fLow[0] = a;
fLow[1] = b;
fHigh[0] = c;
fHigh[1] = d;
}
// template with an int here?
inline float& operator[] (int x) {
if (x < 2)
return fLow[x];
else
return fHigh[x - 2];
}
};
What could I/should I do to avoid the branch? My idea is to use a template with an integer parameter and define specializations, but it's not clear whether it does make sense and what the syntax of that monster could look like.
I explicitly, under no circumstances, can make use of a float[4] array to merge the two (also, no union tricks). If you need a good reason for that, it's because the float[2] are actually resembling a platform specific PowerPC paired singles. A normal windows compiler won't work with paired singles, that's why I replaced the code with float[2]s.
Using the GreenHills compiler I get this assembly output (which suggests branching does occur):
.LDW31:
00000050 80040000 89 lwz r0, 0(r4)
00000054 2c000000 90 cmpwi r0, 0
00000058 41820000 91 beq .L69
92 #line32
93
94 .LDWlin1:
0000005c 2c000001 95 cmpwi r0, 1
00000060 40820000 96 bne .L74
97 #line32
98
99 .LDWlin2:
00000064 38630004 100 addi r3, r3, 4
00000068 38210018 101 addi sp, sp, 24
0000006c 4e800020 102 blr
103 .L74:
00000070 2c000002 104 cmpwi r0, 2
00000074 40820000 105 bne .L77
106 #line33
107
108 .LDWlin3:
00000078 38630008 109 addi r3, r3, 8
0000007c 38210018 110 addi sp, sp, 24
00000080 4e800020 111 blr
112 .L77:
00000084 2c000003 113 cmpwi r0, 3
00000088 40820000 114 bne .L80
115 #line34
116
117 .LDWlin4:
0000008c 3863000c 118 addi r3, r3, 12
00000090 38210018 119 addi sp, sp, 24
00000094 4e800020 120 blr
121 .L80:
00000098 38610008 122 addi r3, sp, 8
123 .L69:
124 # .ef
The corresponding C++ code to that snippet should be this one:
inline const float& operator[](const unsigned& idx) const
{
if (idx == 0) return xy[0];
if (idx == 1) return xy[1];
if (idx == 2) return zw[0];
if (idx == 3) return zw[1];
return 0.f;
}
Either the index x is a runtime variable, or a compile-time constant.
if it is a compile-time constant, there's a good chance the optimizer will be able to prune the dead branch when inlining operator[] anyway.
if it is a runtime variable, like
for (int i=0; i<4; ++i) { dosomething(f[i]); }
you need the branch anyway. Unless, of course, your optimizer unrolls the loop, in which case it can replace the variable with four constants, inline & prune as above.
Did you profile this to show there's a real problem, and compile it to show the branch really happens where it could be avoided?
Example code:
float foo(f32x4 &f)
{
return f[0]+f[1]+f[2]+f[3];
}
example output from g++ -O3 -S
.globl _Z3fooR5f32x4
.type _Z3fooR5f32x4, #function
_Z3fooR5f32x4:
.LFB4:
.cfi_startproc
movss (%rdi), %xmm0
addss 4(%rdi), %xmm0
addss 8(%rdi), %xmm0
addss 12(%rdi), %xmm0
ret
.cfi_endproc
Seriously, don't do this!! Just combine the arrays. But since you asked the question, here's an answer:
#include <iostream>
float fLow [2] = {1.0,2.0};
float fHigh [2] = {50.0,51.0};
float * fArrays[2] = {fLow, fHigh};
float getFloat (int i)
{
return fArrays[i>=2][i%2];
}
int main()
{
for (int i = 0; i < 4; ++i)
std::cout << getFloat(i) << '\n';
return 0;
}
Output:
1
2
50
51
Since you said in a comment that your index is always a template parameter, then you can indeed make the branching at compile-time instead of runtime. Here is a possible solution using std::enable_if:
#include <iostream>
#include <type_traits>
struct f32x4
{
float fLow[2];
float fHigh[2];
f32x4(float a, float b, float c, float d)
{
fLow[0] = a;
fLow[1] = b;
fHigh[0] = c;
fHigh[1] = d;
}
template <int x>
float& get(typename std::enable_if<(x >= 0 && x < 2)>::type* = 0)
{
return fLow[x];
}
template <int x>
float& get(typename std::enable_if<(x >= 2 && x < 4)>::type* = 0)
{
return fHigh[x-2];
}
};
int main()
{
f32x4 f(0.f, 1.f, 2.f, 3.f);
std::cout << f.get<0>() << " " << f.get<1>() << " "
<< f.get<2>() << " " << f.get<3>(); // prints 0 1 2 3
}
Regarding performance, I don't think there will be any difference since the optimizer should be able to easily propagate the constants and remove dead code subsequently, thereby removing the branch altogether. However, with this approach, you get the benefit that any attempts to invoke the function with an invalid index will result in a compiler error.
Create one array (or vector) with all 4 elements in it, the fLow values occupying the first two positions, then high in the second 2. Then just index into it.
inline float& operator[] (int x) {
return newFancyArray[x]; //But do some bounds checking above.
}
Based on Luc Touraille's answer, without using type traits due to their lack of compiler support, I found the following to achieve the purpose of the question. Since the operator[] can not be templatized with an int parameter and work syntactically, I introduced an at method. This is the result:
struct f32x4
{
float fLow[2];
float fHigh[2];
f32x4(float a, float b, float c, float d)
{
fLow[0] = a;
fLow[1] = b;
fHigh[0] = c;
fHigh[1] = d;
}
template <unsigned T>
const float& at() const;
};
template<>
const float& f32x4::at<0>() const { return fLow[0]; }
template<>
const float& f32x4::at<1>() const { return fLow[1]; }
template<>
const float& f32x4::at<2>() const { return fHigh[0]; }
template<>
const float& f32x4::at<3>() const { return fHigh[1]; }
I have always wondered about this. Let's say we have a variable, string weight, and an input variable, int mode, which can be 1 or 0.
Is there a clear benefit to using:
weight = (mode == 1) ? "mode:1" : "mode:0";
over
if(mode == 1)
weight = "mode:1";
else
weight = "mode:0";
beyond code readability? Are speeds at all affected, is this handled differently by the compiler (such as the ability of certain switch statements to be converted to jump tables)?
The key difference between the conditional operator and an if/else block is that the conditional operator is an expression, rather than a statement. Thus, there are few places you can use the conditional operator where you can't use an if/else. For example, initialization of constant objects, like so:
const double biasFactor = (x < 5) ? 2.5 : 6.432;
If you used if/else in this case, biasFactor would have to be non-const.
Additonally, constructor initializer lists call for expressions rather than statements as well:
X::X()
: myData(x > 5 ? 0xCAFEBABE : OxDEADBEEF)
{
}
In this case, myData may not have any assignment operator or non-const member functions defined--its constructor may be the only way to pass any parameters to it.
Also, note that any expression can be turned into a statement by adding a semicolon at the end--the reverse is not true.
No, this is purely about presenting the code to a human reader. I'd expect any compiler to generate identical code for these.
With mingw, the assembly code generated with
const char * testFunc()
{
int mode=1;
const char * weight = (mode == 1)? "mode:1" : "mode:0";
return weight;
}
is:
testFunc():
0040138c: push %ebp
0040138d: mov %esp,%ebp
0040138f: sub $0x10,%esp
10 int mode=1;
00401392: movl $0x1,-0x4(%ebp)
11 const char * weight = (mode == 1)? "mode:1" : "mode:0";
00401399: cmpl $0x1,-0x4(%ebp)
0040139d: jne 0x4013a6 <testFunc()+26>
0040139f: mov $0x403064,%eax
004013a4: jmp 0x4013ab <testFunc()+31>
004013a6: mov $0x40306b,%eax
004013ab: mov %eax,-0x8(%ebp)
12 return weight;
004013ae: mov -0x8(%ebp),%eax
13 }
And with
const char * testFunc()
{
const char * weight;
int mode=1;
if(mode == 1)
weight = "mode:1";
else
weight = "mode:0";
return weight;
}
is:
testFunc():
0040138c: push %ebp
0040138d: mov %esp,%ebp
0040138f: sub $0x10,%esp
11 int mode=1;
00401392: movl $0x1,-0x8(%ebp)
12 if(mode == 1)
00401399: cmpl $0x1,-0x8(%ebp)
0040139d: jne 0x4013a8 <testFunc()+28>
13 weight = "mode:1";
0040139f: movl $0x403064,-0x4(%ebp)
004013a6: jmp 0x4013af <testFunc()+35>
15 weight = "mode:0";
004013a8: movl $0x40306b,-0x4(%ebp)
17 return weight;
004013af: mov -0x4(%ebp),%eax
18 }
Pretty much the same code is generated. The performance of your application shouldn't depend on small details like this one.
So, no it doesn't make a difference.