I have a broadly used function foo(int a, int b) and I want to provide a special version of foo that performs differently if a is, say, 1.
a) I don't want to go through the whole code base and change all occurrences of foo(1, b) to foo1(b), because the rules on arguments may change and I don't want to revisit the code base every time they do.
b) I don't want to burden function foo with an "if (a == 1)" test, because of performance issues.
It seems to me a fundamental skill of the compiler to call the right code based on what it can see in front of it. Or is this a possible missing feature of C++ that currently requires macros or something similar to handle?
Simply write
inline void foo(int a, int b)
{
    if (a == 1) {
        // skip complex code and call easy code
        call_easy(b);
    } else {
        // complex code here
        do_complex(a, b);
    }
}
When you call
foo(1, 10);
the optimizer will/should simply insert a direct call to call_easy(10).
Any decent optimizer will inline the function and detect whether it has been called with a == 1. Also, I think the constexpr mentioned in other posts is nice, but not really necessary in your case. constexpr is very useful if you want to resolve values at compile time, but you simply asked to switch code paths based on a value at runtime. The optimizer should be able to detect that.
In order to detect that, the optimizer needs to see your function definition at all places where your function is called. Hence the inline requirement - although compilers such as Visual Studio have a "generate code at link time" feature that reduces this requirement somewhat.
Finally you might want to look at the C++ attribute [[likely]] (I think). I haven't worked with it yet, but it is supposed to tell the compiler which execution path is likely, as a hint to the optimizer.
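A minimal sketch of what that could look like (my illustration; [[likely]] is C++20, and call_easy/do_complex stand in for the hypothetical easy and complex paths from above):
void call_easy(int b);        // hypothetical helpers from the discussion above
void do_complex(int a, int b);

inline void foo(int a, int b)
{
    if (a == 1) [[likely]] {  // C++20: hint that this branch is the common path
        call_easy(b);
    } else {
        do_complex(a, b);
    }
}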
And why don't you experiment a little and look at the generated code in the debugger/disassembler? That will give you a feel for the optimizer. Don't forget that the optimizer is likely only active in Release builds :)
Templates work at compile time, and you want to decide at run time, which templates alone can never do. If and only if you really can call your function with constexpr values can you change to a template, but the call becomes foo<1,2>() instead of foo(1,2). As for the "performance issues"... that's really funny! If that single compare assembler instruction is the performance problem... yes, then you have done everything else super perfectly :-)
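For illustration, the template form mentioned above could look roughly like this (a sketch, assuming the hypothetical helpers from the thread exist):
template <int A>
void foo(int b)
{
    if (A == 1)        // A is a compile-time constant, so this branch
        call_easy(b);  // folds away when foo<1> is instantiated
    else
        do_complex(A, b);
}

// call site: foo<1>(10); instead of foo(1, 10);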
BTW: If you already call with constexpr values and the function definition is visible in the compilation unit, you can be sure the compiler already knows how to optimize it away...
But there is another way to handle such things if you really do have constexpr values sometimes and the algorithm inside the function can be evaluated at compile time. In that case, you can check inside the function whether it was called in a constexpr context. If so, you can run a fully compile-time algorithm, which can also contain your if (a == 1) and will be fully evaluated at compile time. If the function is not called in a constexpr context, it runs as before without any additional overhead.
To make that decision at compile time we need the current C++ standard (C++20)!
#include <iostream>
#include <type_traits>

constexpr int foo(int a, int)
{
    if (std::is_constant_evaluated())
    {   // this part is fully evaluated at compile time!
        if (a == 1)
        {
            return 1;
        }
        else
        {
            return 2;
        }
    }
    else
    {   // and the rest runs as before at runtime
        if (a == 0)
        {
            return 3;
        }
        else
        {
            return 4;
        }
    }
}
int main()
{
    constexpr int res1 = foo(1, 0); // fully evaluated during compile time
    constexpr int res2 = foo(2, 0); // also fully compile time
    std::cout << res1 << std::endl;
    std::cout << res2 << std::endl;
    std::cout << foo(5, 0) << std::endl; // here we go at runtime
    std::cout << foo(0, 0) << std::endl; // here we go at runtime
}
That code will print:
1
2
4
3
So we do not need to go with classic templates, and there is no need to change the rest of the code, but we get full compile-time optimization where possible.
@Sebastian's suggestion works, at least in this simple case, with all optimization levels except -O0 in g++ 9.3.0 on Ubuntu 20.04 in C++20 mode. Thanks again.
See the disassembly below: the correct subfunction func1 or func2 is always called directly instead of the top-level func(). A similar disassembly at -O0 shows only the top-level func() being called, leaving the decision to run time, which is not desired.
I hope this will work in production code and perhaps with multiple hard-coded arguments.
Breakpoint 1, main () at p1.cpp:24
24 int main() {
(gdb) disass /m
Dump of assembler code for function main():
6 inline void func(int a, int b) {
7
8 if (a == 1)
9 func1(b);
10 else
11 func2(a,b);
12 }
13
14 void func1(int b) {
15 std::cout << "func1 " << " " << " " << b << std::endl;
16 }
17
18 void func2(int a, int b) {
19 std::cout << "func2 " << a << " " << b << std::endl;
20 }
21
22 };
23
24 int main() {
=> 0x0000555555555286 <+0>: endbr64
0x000055555555528a <+4>: push %rbp
0x000055555555528b <+5>: push %rbx
0x000055555555528c <+6>: sub $0x18,%rsp
0x0000555555555290 <+10>: mov $0x28,%ebp
0x0000555555555295 <+15>: mov %fs:0x0(%rbp),%rax
0x000055555555529a <+20>: mov %rax,0x8(%rsp)
0x000055555555529f <+25>: xor %eax,%eax
25
26 X x1;
27
28 int b=1;
29 x1.func(1,b);
0x00005555555552a1 <+27>: lea 0x7(%rsp),%rbx
0x00005555555552a6 <+32>: mov $0x1,%esi
0x00005555555552ab <+37>: mov %rbx,%rdi
0x00005555555552ae <+40>: callq 0x55555555531e <X::func1(int)>
30
31 b=2;
32 x1.func(2,b);
0x00005555555552b3 <+45>: mov $0x2,%edx
0x00005555555552b8 <+50>: mov $0x2,%esi
0x00005555555552bd <+55>: mov %rbx,%rdi
0x00005555555552c0 <+58>: callq 0x5555555553de <X::func2(int, int)>
33
34 b=3;
35 x1.func(1,b);
0x00005555555552c5 <+63>: mov $0x3,%esi
0x00005555555552ca <+68>: mov %rbx,%rdi
0x00005555555552cd <+71>: callq 0x55555555531e <X::func1(int)>
36
37 b=4;
38 x1.func(2,b);
0x00005555555552d2 <+76>: mov $0x4,%edx
0x00005555555552d7 <+81>: mov $0x2,%esi
0x00005555555552dc <+86>: mov %rbx,%rdi
0x00005555555552df <+89>: callq 0x5555555553de <X::func2(int, int)>
39
40 return 0;
0x00005555555552e4 <+94>: mov 0x8(%rsp),%rax
0x00005555555552e9 <+99>: xor %fs:0x0(%rbp),%rax
0x00005555555552ee <+104>: jne 0x5555555552fc <main()+118>
0x00005555555552f0 <+106>: mov $0x0,%eax
0x00005555555552f5 <+111>: add $0x18,%rsp
0x00005555555552f9 <+115>: pop %rbx
0x00005555555552fa <+116>: pop %rbp
0x00005555555552fb <+117>: retq
0x00005555555552fc <+118>: callq 0x555555555100 <__stack_chk_fail@plt>
End of assembler dump.
I have the following while-loop
uint32_t x = 0;
while (x*x < STOP_CONDITION) {
    if (CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
The STOP_CONDITION is constant at run time, but not at compile time. Is there a more efficient way to maintain x*x, or do I really need to recompute it every time?
Note: According to the benchmark below, this code runs about 1 -- 2% slower than the sqrt-based option in the other answer. Please read the disclaimer included at the bottom!
In addition to Tamas Ionut's answer, if you want to maintain STOP_CONDITION as the actual stop condition and avoid the square root calculation, you could update the square using the mathematical identity
(x + 1)² = x² + 2x + 1
whenever you change x:
uint32_t x = 0;
uint32_t xSquare = 0;
while (xSquare < STOP_CONDITION) {
    if (CHECK_CONDITION) {
        xSquare += 2 * x + 1;
        x++;
    }
    // Do other stuff that modifies CHECK_CONDITION
}
Since 2*x + 1 is just a bit shift and an increment, the compiler should be able to optimize this fairly well.
Disclaimer: Since you asked "how can I optimize this code" I answered with one particular way to possibly make it faster. Whether the shift-and-add update is actually faster than a single integer multiplication should be tested in practice. Whether you should optimize the code is a different question. I assume you have already benchmarked the loop and found it to be a bottleneck, or that you have a theoretical interest in the question. If you are writing production code that you wish to optimize, first measure the performance and then optimize where needed (which is probably not the x*x in this loop).
What about:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while (x < bound) {
    if (CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
This way, you're getting rid of that extra computation.
I made a small benchmark of Tamas Ionut's and CompuChip's answers and here are the results:
Tamas Ionut: 19.7068
The code of this method:
uint32_t x = 0;
double bound = sqrt(STOP_CONDITION);
while (x < bound) {
    if (CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}
CompuChip: 20.2056
The code of this method:
uint32_t x = 0;
uint32_t xSquare = 0;
while (xSquare < STOP_CONDITION) {
    if (CHECK_CONDITION) {
        xSquare += 2 * x + 1;
        x++;
    }
    // Do other stuff that modifies CHECK_CONDITION
}
with STOP_CONDITION = 1000000 and repeating the process 1000000 times.
Environment:
Compiler : MSVC 2013
OS : Windows 8.1 - x64
Processor: Core i7-4510U @ 2.00 GHz
Release Mode - Maximize Speed (/O2)
I would say that optimization for readability is better than optimization for performance in your case, since we are talking about a very small performance gain.
The compiler can optimize a lot for you regarding performance, but readability lies in the responsibility of the programmer.
I believe Tamas Ionut's solution is better than CompuChip's because we only have x++ inside the loop. However, a comparison between uint32_t and double will kill the deal. It would be more efficient to use uint32_t for bound instead of double. This approach also has less of a problem with numerical overflow, because x must be less than 2^16 = 65536 if we want x^2 to fit in a uint32_t.
If we also do heavy work inside the loop, the results obtained from both approaches should be very similar; however, Tamas Ionut's approach is simpler and easier to read.
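A sketch of that integer-bound variant (my addition; it assumes STOP_CONDITION fits in a uint32_t, and the two adjustment loops guard against floating-point rounding at perfect squares):
#include <cmath>
#include <cstdint>

// smallest bound with bound*bound >= STOP_CONDITION, so that
// x*x < STOP_CONDITION is equivalent to x < bound
uint32_t bound = static_cast<uint32_t>(std::sqrt(static_cast<double>(STOP_CONDITION)));
while (static_cast<uint64_t>(bound) * bound < STOP_CONDITION) ++bound;
while (bound > 0 && static_cast<uint64_t>(bound - 1) * (bound - 1) >= STOP_CONDITION) --bound;

uint32_t x = 0;
while (x < bound) {           // pure uint32_t comparison, no double involved
    if (CHECK_CONDITION) x++;
    // Do other stuff that modifies CHECK_CONDITION
}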
Below is my code and the corresponding assembly code obtained using clang 3.8.0 with the -O3 flag. It is very clear from the assembly code that the first approach is more efficient.
using T = size_t;

void test1(const T stopCondition, bool checkCondition) {
    T x = 0;
    while (x < stopCondition) {
        if (checkCondition) {
            x++;
        }
        // Do something heavy here
    }
}

void test2(const T stopCondition, bool checkCondition) {
    T x = 0;
    T xSquare = 0;
    const T threshold = stopCondition * stopCondition;
    while (xSquare < threshold) {
        if (checkCondition) {
            xSquare += 2 * x + 1;
            x++;
        }
        // Do something heavy here
    }
}
(gdb) disassemble test1
Dump of assembler code for function _Z5test1mb:
0x0000000000400be0 <+0>: movzbl %sil,%eax
0x0000000000400be4 <+4>: mov %rax,%rcx
0x0000000000400be7 <+7>: neg %rcx
0x0000000000400bea <+10>: nopw 0x0(%rax,%rax,1)
0x0000000000400bf0 <+16>: add %rax,%rcx
0x0000000000400bf3 <+19>: cmp %rdi,%rcx
0x0000000000400bf6 <+22>: jb 0x400bf0 <_Z5test1mb+16>
0x0000000000400bf8 <+24>: retq
End of assembler dump.
(gdb) disassemble test2
Dump of assembler code for function _Z5test2mb:
0x0000000000400c00 <+0>: imul %rdi,%rdi
0x0000000000400c04 <+4>: test %sil,%sil
0x0000000000400c07 <+7>: je 0x400c2e <_Z5test2mb+46>
0x0000000000400c09 <+9>: xor %eax,%eax
0x0000000000400c0b <+11>: mov $0x1,%ecx
0x0000000000400c10 <+16>: test %rdi,%rdi
0x0000000000400c13 <+19>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c15 <+21>: data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c20 <+32>: add %rcx,%rax
0x0000000000400c23 <+35>: add $0x2,%rcx
0x0000000000400c27 <+39>: cmp %rdi,%rax
0x0000000000400c2a <+42>: jb 0x400c20 <_Z5test2mb+32>
0x0000000000400c2c <+44>: jmp 0x400c42 <_Z5test2mb+66>
0x0000000000400c2e <+46>: test %rdi,%rdi
0x0000000000400c31 <+49>: je 0x400c42 <_Z5test2mb+66>
0x0000000000400c33 <+51>: data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000400c40 <+64>: jmp 0x400c40 <_Z5test2mb+64>
0x0000000000400c42 <+66>: retq
End of assembler dump.
I have 2 2D points which are jammed together into an array: int square[4]. These four numbers are interpreted as the definition of a rectangle with horizontal lines parallel to the X-axis and vertical lines parallel to the Y-axis. The elements of the array then respectively define:
Left edge's X coordinate
Bottom edge's Y coordinate
Right edge's X coordinate
Top edge's Y coordinate
I have defined a winding order in this enum:
enum WindingOrder {
    BOTTOM = 0,
    RIGHT,
    TOP,
    LEFT
};
The minimal, complete, verifiable example of my code is that I am given a second, output array int output[4] and an input WindingOrder edge. I need to populate output as follows:
switch (edge) {
case BOTTOM:
    output[0] = square[0]; output[1] = square[1]; output[2] = square[2]; output[3] = square[1];
    break;
case RIGHT:
    output[0] = square[2]; output[1] = square[1]; output[2] = square[2]; output[3] = square[3];
    break;
case TOP:
    output[0] = square[2]; output[1] = square[3]; output[2] = square[0]; output[3] = square[3];
    break;
case LEFT:
    output[0] = square[0]; output[1] = square[3]; output[2] = square[0]; output[3] = square[1];
    break;
}
I'm not married to a particular WindingOrder arrangement, nor do I care about the order of the points in output, so if changing those makes this solvable I'm down. What I want to know is: can I construct the square indexes to assign to output in a for loop, without an if/case/ternary statement (in other words, using bit-wise operations)?
So I'd want, given int i = 0 and WindingOrder edge, to do bit-wise operations on them to find:
do {
    output[i] = array[???];
} while (++i <= LEFT);
EDIT:
I've received a lot of static array answers (which I believe are the best way to solve this, so I've given a +1). But as a logic problem I'm curious how few bit-wise operations it would take to find an element of a given edge dynamically. So, for example, how should this function's body be written, given an arbitrary edge and i: int getIndex(int i, int edge)
Here is a different solution. It is a variation on the static array approach, but without an actual array: the indexing matrix is inlined as a 32-bit unsigned integer computed as a constant expression. The column for the edge parameter is selected with a single shift; finally, the individual indices for each array element are selected via simple bit-shifting and masking.
This solution has some advantages:
it is simple to understand
it does not use tests
it does not use a static array, nor any other memory location
it is independent on the winding order and can be easily customized for any array component order
it does not use C99 specific syntax, which may not be available in C++.
This is as close as I could get to a bitwise solution.
#include <iostream>
enum WindingOrder { BOTTOM = 0, RIGHT, TOP, LEFT };
void BitwiseWind(int const *input, int *output, enum WindingOrder edge)
{
    unsigned bits = ((0x00010201 << BOTTOM * 2) |
                     (0x02010203 << RIGHT * 2) |
                     (0x02030003 << TOP * 2) |
                     (0x00030001 << LEFT * 2))
                    >> (edge * 2);

    output[0] = input[(bits >> 24) & 3];
    output[1] = input[(bits >> 16) & 3];
    output[2] = input[(bits >> 8) & 3];
    output[3] = input[(bits >> 0) & 3];
}
int main() {
    enum WindingOrder edges[4] = { BOTTOM, RIGHT, TOP, LEFT };
    int rect[4] = { 1, 3, 4, 5 };
    int output[4];

    for (int i = 0; i < 4; i++) {
        BitwiseWind(rect, output, edges[i]);
        std::cout << output[0] << output[1] << output[2] << output[3] << std::endl;
    }
    return 0;
}
Compiling BitwiseWind for x86-64 with clang -O3 generates 21 instructions, 6 more than the static array version, but without any memory reference. That's a little disappointing, but I hope it could generate fewer instructions for an ARM target, taking advantage of bit-field extraction opcodes. Incidentally, the inlined version using output[i] = array[(i+(i==winding)*2)&3]; produces 25 instructions without any jumps, and gcc -O3 does much worse: it generates a lot more code with 4 tests and jumps.
The generic getIndex function below compiles to just 6 x86 instructions:
int getIndex(int i, int edge) {
    return (((0x00010201 << BOTTOM * 2) |
             (0x02010203 << RIGHT * 2) |
             (0x02030003 << TOP * 2) |
             (0x00030001 << LEFT * 2))
            >> (edge * 2 + 24 - i * 8)) & 3;
}
Is there a particular reason this needs to use lots of bitwise operations? It seems a quite complex way to solve the problem.
You seem to be quite worried about speed; for example, you don't want to use modulo because it is expensive. This being the case, why not just use a really simple lookup table and unroll the loops? Example on ideone as well.
EDIT: Thanks to chqrlie for input. Have updated answer accordingly.
#include <iostream>
using namespace std;
enum WindingOrder {
    BOTTOM = 0,
    RIGHT,
    TOP,
    LEFT
};
void DoWinding1(unsigned int const *const in, unsigned int *const out, const enum WindingOrder ord)
{
    // designated array initializers are C99 syntax; GCC accepts them in C++ as an extension
    static const unsigned int order[4][4] = { [BOTTOM] = {0,1,2,1},
                                              [RIGHT]  = {2,1,2,3},
                                              [TOP]    = {2,3,0,3},
                                              [LEFT]   = {0,3,0,1} };
    out[0] = in[order[ord][0]];
    out[1] = in[order[ord][1]];
    out[2] = in[order[ord][2]];
    out[3] = in[order[ord][3]];
}
int main() {
    unsigned int rect[4] = {1, 3, 4, 5};
    unsigned int out[4] = {0};

    DoWinding1(rect, out, BOTTOM);
    std::cout << out[0] << out[1] << out[2] << out[3] << std::endl;
    return 0;
}
Is it possible to redefine WindingOrder's value set? If so, here's my solution, which encodes the selection indexes in WindingOrder's values, then simply decodes the selection index for input[] by shifting and masking while iterating over the output[] indexes.
[Thanks to chqrlie for offering the code base]:
#include <iostream>
enum WindingOrder {
    // the right-most 4 bits hold the selection index from input[] to output[0]
    // the left-most 4 bits hold the selection index from input[] to output[3]
    BOTTOM = 0x1210,
    RIGHT  = 0x3212,
    TOP    = 0x3230,
    LEFT   = 0x3010
};

void BitwiseWind(int const *input, int *output, unsigned short edge)
{
    for (size_t i = 0; i < 4; i++)
        output[i] = input[(edge >> (i*4)) & 0x000F]; // decode
}
int main() {
    enum WindingOrder edges[4] = { BOTTOM, RIGHT, TOP, LEFT };
    int rect[4] = { 1, 3, 4, 5 };
    int output[4];

    for (int i = 0; i < 4; i++) {
        BitwiseWind(rect, output, edges[i]);
        std::cout << output[0] << output[1] << output[2] << output[3] << std::endl;
    }
    return 0;
}
The generic getIndex(int i,enum WindingOrder edge) would be:
int getIndex(int i, enum WindingOrder edge)
{
    return (edge >> (i*4)) & 0x000F;
}
I did not count how many instructions it uses, but I believe it would be quite few. And it's really easy to imagine how it works. :)
This is untested and there might be a small mistake in some details, but the general idea should work.
Copying the array to the output would use the indices {0,1,2,3}. To get a specific edge you have to apply some transformations to the indices:
                    changed_pos  changed_to
RIGHT : {2,1,2,3}        0           2
TOP :   {0,3,2,3}        1           3
LEFT :  {0,1,0,3}        2           0
BOTTOM: {0,1,2,1}        3           1
So basically you have to add 2, mod 4, at the specific position for your winding.
So the (like I said, untested) snippet could look like this:
for (size_t i = 0; i < 4; ++i) {
    output[i] = array[(i + (i == edge)*2) % 4];
}
If the comparison is true you add 1*2 = 2, else 0*2 = 0, to the index, and take mod 4 to stay in range.
Your enum has to look like this (but I guess you figured that out by yourself):
enum WindingOrder {
    RIGHT,
    TOP,
    LEFT,
    BOTTOM
};
MWE:
#include <iostream>
#include <string>
#include <vector>

enum WindingOrder {
    RIGHT = 0,
    TOP,
    LEFT,
    BOTTOM
};

int main()
{
    std::vector<int> array = {2,4,8,9};
    std::vector<int> output(4);
    std::vector<WindingOrder> test = {LEFT, RIGHT, BOTTOM, TOP};

    for (auto winding : test) {
        for (size_t i = 0; i < 4; ++i) {
            output[i] = array[(i + (i == winding)*2) % 4];
        }
        std::cout << "winding " << winding << ": " << output[0] << output[1] << output[2] << output[3] << std::endl;
    }
}
From your own answer, you're close to the solution. I think what you need here is a Karnaugh map, which is a universal method for most Boolean algebra problems.
Suppose
The elements of the array then respectively define:
input[0]: Left edge's X coordinate
input[1]: Bottom edge's Y coordinate
input[2]: Right edge's X coordinate
input[3]: Top edge's Y coordinate
I have defined a winding order in this enum:
enum WindingOrder {
    BOTTOM = 0,
    RIGHT,
    TOP,
    LEFT
};
Since the for-loop may look like
for (int k = 0; k != 4; ++k) {
    int i = getIndex(k, edge); // calculate i from k and edge
    output[k] = square[i];
}
Then the inputs are k (for output[k]) and edge, and the output is i (for square[i]). And because i has 2 bits, two logic functions are needed.
Here we use P = F1(A, B, C, D) and Q = F2(A, B, C, D) to represent the logic functions, in which A, B, C, D, P and Q are all single bit, and
k = (A << 1) + B;
edge = (C << 1) + D;
i = (P << 1) + Q;
Then what we need to do is just deduce the two logic functions F1 and F2 from the given conditions.
From the switch case statements you gave, we can easily get the truth table.
k\edge   0  1  3  2
  0      0  2  0  2
  1      1  1  3  3
  3      1  3  1  3
  2      2  2  0  0
Then separate this into two truth tables, one for each of the two bits P and Q.
P        edge:   0   1   3   2
k   AB\CD       00  01  11  10
0    00          0   1   0   1
1    01          0   0   1   1
3    11          0   1   0   1
2    10          1   1   0   0

Q        edge:   0   1   3   2
k   AB\CD       00  01  11  10
0    00          0   0   0   0
1    01          1   1   1   1
3    11          1   1   1   1
2    10          0   0   0   0
These are the Karnaugh maps that I mentioned at the beginning. We can easily get the functions.
F1(A, B, C, D) = A~B~C + A~CD + ~B~CD + ~ABC + ~AC~D + BC~D
F2(A, B, C, D) = B
Then the program will be
int getIndex(int k, int edge) {
    int A = (k >> 1) & 1;
    int B = k & 1;
    int C = (edge >> 1) & 1;
    int D = edge & 1;
    int P = A&~B&~C | A&~C&D | ~B&~C&D | ~A&B&C | ~A&C&~D | B&C&~D;
    int Q = B;
    return (P << 1) + Q;
}
It passes the check here. Of course, you can simplify the function even more with XOR.
EDIT
Using XOR to simplify the expression can be achieved most of the time, since A^B == A~B + ~AB. But this may not be what you want. First, I think the performance varies only a little between the Sum of Products (SoP) expression and the further simplified version with XOR. Second, there is no universal method (as far as I know) to simplify an expression with XOR, so you have to rely on your own experience to do this work.
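That said, here is one XOR form I derived from the same maps (an illustrative sketch; re-verify it against the truth table before relying on it):
int getIndexXor(int k, int edge) {
    int A = (k >> 1) & 1;
    int B = k & 1;
    int C = (edge >> 1) & 1;
    int D = edge & 1;
    // P collapses to: C XOR ((A AND NOT B) OR (D AND NOT (A XOR B)))
    int P = C ^ ((A & ~B) | (D & ~(A ^ B)));
    int Q = B;
    return (P << 1) + Q;
}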
There are sixteen possible logic functions of two variables, but in digital logic hardware the simplest gate circuits implement only four of them: AND, OR, and their complements (NAND and NOR). Karnaugh maps are used to simplify real-world logic requirements so that they can be implemented using a minimum number of physical logic gates.
There are two common expressions used here: the Sum of Products and Product of Sums expressions. These two can be implemented directly using only AND and OR logic operators, and they can be deduced directly from a Karnaugh map.
If you define the coordinates and directions in clockwise order starting at left,
#define LEFT 0
#define TOP 1
#define RIGHT 2
#define BOTTOM 3
you can use
void edge_line(int line[4], const int rect[4], const int edge)
{
    line[0] = rect[ edge & 2 ];
    line[1] = rect[ ((edge + 3) & 2) + 1 ];
    line[2] = rect[ ((edge + 1) & 2) ];
    line[3] = rect[ (edge & 2) + 1 ];
}
to copy the edge line coordinates (each line segment in clockwise winding order). It looks suboptimal, but using -O2, GCC-4.8, you get essentially
edge_line:
pushl %esi
pushl %ebx
movl 20(%esp), %ecx
movl 16(%esp), %edx
movl 12(%esp), %eax
movl %ecx, %esi
andl $2, %esi
movl (%edx,%esi,4), %ebx
movl %ebx, (%eax)
leal 3(%ecx), %ebx
addl $1, %ecx
andl $2, %ebx
andl $2, %ecx
addl $1, %ebx
movl (%edx,%ebx,4), %ebx
movl %ebx, 4(%eax)
movl (%edx,%ecx,4), %ecx
movl %ecx, 8(%eax)
movl 4(%edx,%esi,4), %edx
movl %edx, 12(%eax)
popl %ebx
popl %esi
ret
but on 64-bit, even better
edge_line:
movl %edx, %ecx
andl $2, %ecx
movslq %ecx, %rcx
movl (%rsi,%rcx,4), %eax
movl %eax, (%rdi)
leal 3(%rdx), %eax
addl $1, %edx
andl $2, %edx
andl $2, %eax
movslq %edx, %rdx
cltq
movl 4(%rsi,%rax,4), %eax
movl %eax, 4(%rdi)
movl (%rsi,%rdx,4), %eax
movl %eax, 8(%rdi)
movl 4(%rsi,%rcx,4), %eax
movl %eax, 12(%rdi)
ret
As you can see, there are no conditionals, and the binary operators combine and optimize to very few instructions.
Edited to add:
If we define a getIndex(i, edge) function, using three binary ANDs, one bit shift (right by 1), three additions, and one subtraction,
int getIndex(const int i, const int edge)
{
    return (i & 1) + ((edge + 4 - (i & 1) + (i >> 1)) & 2);
}
with which edge_line() can be implemented as
void edge_line(int line[4], const int rect[4], const int edge)
{
    line[0] = rect[ getIndex(0, edge) ];
    line[1] = rect[ getIndex(1, edge) ];
    line[2] = rect[ getIndex(2, edge) ];
    line[3] = rect[ getIndex(3, edge) ];
}
we get the exact same results as before. Using GCC-4.8.4 and -O2 on AMD64/x86-64 compiles to
getIndex:
movl %edi, %edx
sarl %edi
andl $1, %edx
subl %edx, %esi
leal 4(%rsi,%rdi), %eax
andl $2, %eax
addl %edx, %eax
ret
and to
getIndex:
movl 4(%esp), %eax
movl 8(%esp), %edx
movl %eax, %ecx
andl $1, %ecx
subl %ecx, %edx
sarl %eax
leal 4(%edx,%eax), %eax
andl $2, %eax
addl %ecx, %eax
ret
on i686. Note that I arrived at the above form using the four-by-four result table; there are other, more rigorous ways to construct it, and there might even be a more optimal form. Because of this, I seriously recommend adding a big huge comment above the function, explaining the intent, and preferably also showing the result table. Something like
/* This function returns an array index:
* 0 for left
* 1 for top
* 2 for right
* 3 for bottom
* given edge:
* 0 for left
* 1 for top
* 2 for right
* 3 for bottom
* and i:
* 0 for initial x
* 1 for initial y
* 2 for final x
* 3 for final y
*
* The result table is
* | edge
* | 0 1 2 3
* ----+-------
* i=0 | 0 0 2 2
* i=1 | 3 1 1 3
* i=2 | 0 2 2 0
* i=3 | 1 1 3 3
*
* Apologies for the write-only code.
*/
Or something similar.
Let's call our goal variable, to be used to index square, int index.
Now we'll create a table of the desired index for edge versus i, with i across the columns and edge down the rows:
║0│1│2│3
═╬═╪═╪═╪═
0║0│1│2│1
─╫─┼─┼─┼─
1║2│1│2│3
─╫─┼─┼─┼─
2║2│3│0│3
─╫─┼─┼─┼─
3║0│3│0│1
It is obvious from this that index is always odd for odd i and even for even i; that is, the least significant bit of index equals i & 1. So if we could find the most significant bit of index, we'd just need to OR that with i & 1 and we'd have our index. So let's make another table of just the most significant bit of index, for the same edge versus i layout:
║0│1│2│3
═╬═╪═╪═╪═
0║0│0│1│0
─╫─┼─┼─┼─
1║1│0│1│1
─╫─┼─┼─┼─
2║1│1│0│1
─╫─┼─┼─┼─
3║0│1│0│0
We can see several things here:
When i is 0 or 3 the columns are identical depending only on edge
These columns are set when edge is 1 or 2
When i is 1 or 2 the columns are inverse of each other
These columns are set when only edge's most significant bit or only i's most significant bit is set
So let's start by breaking edge and i into least significant and most significant bits:
const int ib0 = i & 1;
const int ib1 = (i & 2) >> 1;
const int eb0 = edge & 1;
const int eb1 = (edge & 2) >> 1;
From here we can easily find whether i is 0 or 3:
const int iXor = ib0 ^ ib1;
For the 0/3 condition:
const int iXorCondition = ib1 ^ eb1;
And the 1/2 condition:
const int iNXorCondition = eb0 ^ eb1;
Now we'll just need to combine those with their respective iXor and put back index's least significant bit:
const int index = ((iNXorCondition & ~iXor | iXorCondition & iXor) << 1) | ib0;
Putting this all together into a convenient function we get:
int getIndex(int i, int edge) {
    const int ib0 = i & 1;
    const int ib1 = (i & 2) >> 1;
    const int eb0 = edge & 1;
    const int eb1 = (edge & 2) >> 1;
    const int iXor = ib0 ^ ib1;
    const int iNXorCondition = eb0 ^ eb1;
    const int iXorCondition = ib1 ^ eb1;
    return ((iNXorCondition & ~iXor | iXorCondition & iXor) << 1) | ib0;
}
I've written a checking live example here.
What I want to know is can I construct the square indexes to assign to output in a for loop, without an if/case/ternary statement (in other words using bit-wise operations)?
I would ask what you expect to achieve by doing that.
My view is that the switch-case construct will, typically, be completely reorganized by a compiler's optimization code. It's best, IMO, to leave that code alone and let the compiler do that.
There are only two conditions where I'd change that view:
You were writing in OpenCL ( rather than C ) and wanted to optimize the code where decision-branch logic can be problematic for performance.
You wanted to use explicit coding for SIMD vectorization. There are some special operations that might help there, but it's a coding option that locks you into things that might not work well on hardware without SIMD instruction sets ( or perform quite differently on different hardware ). It's also worth noting that some compilers can auto-vectorize with the right coding.
I just see little or no advantage to coding these operations any other way than switch-case for C.
This is a way to achieve that:
do {
output[i] = square[
(edge & 1) * (
!(i & 1) * ((edge + 1) & 2) +
(i & 1) * (
(!((edge - 1)/2)&1) * i +
(((edge - 1)/2)&1) * (4-i)
)
) +
!(edge & 1) * (
(i & 1) * (edge + 1) +
!(i & 1) * ((edge & 2) - ((edge & 2)-1) * i)
)
];
} while(++i <= LEFT);
To help you understand it, I indented the code; you can obviously remove all the whitespace. I put a tab wherever I wanted to separate two cases. By the way, as you can see, the calculation is in two sections for two symmetrical cases, but I solved each case with a different algorithm so you can see various ways of achieving things.
Suppose we have the following (nonsensical) code:
const int a = 0;
int c = 0;
for (int b = 0; b < 10000000; b++)
{
    if (a) c++;
    c += 7;
}
Variable 'a' equals zero, so the compiler can deduce at compile time that the instruction 'if(a) c++;' will never be executed, and will optimize it away.
My question: Does the same happen with lambda closures?
Check out another piece of code:
const int a = 0;
function<int()> lambda = [a]()
{
    int c = 0;
    for (int b = 0; b < 10000000; b++)
    {
        if (a) c++;
        c += 7;
    }
    return c;
};
Will the compiler know that 'a' is 0 and will it optimize the lambda?
Even more sophisticated example:
function<int()> generate_lambda(const int a)
{
    return [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
}

function<int()> a_is_zero = generate_lambda(0);
function<int()> a_is_one = generate_lambda(1);
Will the compiler be smart enough to optimize the first lambda when it knows that 'a' is 0 at generation time?
Does gcc or llvm have this kind of optimizations?
I'm asking because I wonder whether I should make such optimizations manually when I know that certain assumptions are satisfied at lambda generation time, or whether the compiler will do that for me.
Looking at the assembly generated by gcc5.2 -O2 shows that the optimization does not happen when using std::function:
#include <functional>

int main()
{
    const int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to some boilerplate and
movl (%rdi), %ecx
movl $10000000, %edx
xorl %eax, %eax
.p2align 4,,10
.p2align 3
.L3:
cmpl $1, %ecx
sbbl $-1, %eax
addl $7, %eax
subl $1, %edx
jne .L3
rep; ret
which is the loop you wanted to see optimized away. (Live) But if you actually use a lambda (and not an std::function), the optimization does happen:
int main()
{
    const int a = 0;
    auto lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    return lambda();
}
compiles to
movl $70000000, %eax
ret
i.e. the loop was removed completely. (Live)
AFAIK, you can expect a lambda to have zero overhead, but std::function is different and comes with a cost (at least in the current state of the optimizers, although people apparently work on this), even if the code "inside the std::function" would have been optimized. (Take that with a grain of salt and try it if in doubt, since this will probably vary between compilers and versions. std::function's overhead can certainly be optimized away.)
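One common way to keep the zero-overhead behavior (my sketch, not from the answer above) is to pass the callable as a template parameter instead of type-erasing it behind std::function:
template <typename F>
int run(F&& f) { return f(); }  // concrete lambda type preserved, no type erasure

int use()
{
    const int a = 0;
    return run([a]()
    {
        int c = 0;
        for (int b = 0; b < 10000000; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    });  // should optimize like the plain lambda case above
}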
As @MarcGlisse correctly pointed out, clang3.6 performs the desired optimization (equivalent to the second case above) even with std::function. (Live)
Bonus edit, thanks to @MarcGlisse again: If the function that contains the std::function is not called main, the optimization happening with gcc5.2 is somewhere between gcc+main and clang, i.e. the function gets reduced to return 70000000; plus some extra code. (Live)
Bonus edit 2, this time mine: If you use -O3, gcc will (for some reason), as explained in Marco's answer, optimize the std::function to
cmpl $1, (%rdi)
sbbl %eax, %eax
andl $-10000000, %eax
addl $80000000, %eax
ret
and keep the rest as in the not_main case. So I guess the bottom line is that one will just have to measure when using std::function.
Neither gcc at -O3 nor MSVC2015 in Release mode will optimize it away with this simple code, and the lambda is actually called:
#include <functional>
#include <iostream>

int main()
{
    int a = 0;
    std::function<int()> lambda = [a]()
    {
        int c = 0;
        for (int b = 0; b < 10; b++)
        {
            if (a) c++;
            c += 7;
        }
        return c;
    };
    std::cout << lambda();
    return 0;
}
At -O3 this is what gcc generates for the lambda (code from godbolt)
lambda:
cmp DWORD PTR [rdi], 1
sbb eax, eax
and eax, -10
add eax, 80
ret
This is a contrived and optimized way to express the following:
If a was 0, the first comparison would set the carry flag. eax would then be set to all 32 bits 1 (i.e. -1), ANDed with -10 (which yields -10 in eax), and then 80 would be added -> result is 70.
If a was anything other than 0, the first comparison would not set the carry flag, eax would be set to zero, the and would leave it at zero, and 80 would be added -> result is 80.
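For reference, a C++ rendition of what that cmp/sbb/and/add sequence computes (my sketch, not from the answer):
int lambda_equivalent(int a)
{
    int mask = (a == 0) ? -1 : 0; // cmp + sbb: all ones iff a == 0
    return (mask & -10) + 80;     // and + add: 70 if a == 0, 80 otherwise
}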
It has to be noted (thanks Marc Glisse) that if the function is marked as cold (i.e. unlikely to be called), gcc performs the right thing and optimizes the call away.
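A sketch of what that annotation could look like (a GCC-specific attribute; the function name is made up for illustration):
#include <functional>

// GCC extension: mark the enclosing function as unlikely to be called
__attribute__((cold)) int rarely_called()
{
    std::function<int()> lambda = [] { return 70; };
    return lambda();
}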
MSVC generates more verbose code but the comparison isn't skipped.
Clang is the only one which gets it right: the lambda's code isn't optimized any further than gcc's, but it is not called:
mov edi, std::cout
mov esi, 70
call std::basic_ostream<char, std::char_traits<char> >::operator<<(int)
Moral: Clang seems to get it right, but the optimization challenge is still open.
I'm trying to manipulate a special struct and I need some sort of swizzle operator. For this it makes sense to have an overloaded array [] operator, but I don't want any branching, since the particular specification of the struct allows for a theoretical workaround.
Currently, the struct looks like this:
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    // template with an int here?
    inline float& operator[] (int x) {
        if (x < 2)
            return fLow[x];
        else
            return fHigh[x - 2];
    }
};
What could I/should I do to avoid the branch? My idea is to use a template with an integer parameter and define specializations, but it's not clear whether it does make sense and what the syntax of that monster could look like.
I explicitly, under no circumstances, can make use of a float[4] array to merge the two (and no union tricks either). If you need a good reason for that, it's because the float[2]s actually represent platform-specific PowerPC paired singles. A normal Windows compiler won't work with paired singles, which is why I replaced them with float[2]s here.
Using the GreenHills compiler I get this assembly output (which suggests branching does occur):
.LDW31:
00000050 80040000 89 lwz r0, 0(r4)
00000054 2c000000 90 cmpwi r0, 0
00000058 41820000 91 beq .L69
92 #line32
93
94 .LDWlin1:
0000005c 2c000001 95 cmpwi r0, 1
00000060 40820000 96 bne .L74
97 #line32
98
99 .LDWlin2:
00000064 38630004 100 addi r3, r3, 4
00000068 38210018 101 addi sp, sp, 24
0000006c 4e800020 102 blr
103 .L74:
00000070 2c000002 104 cmpwi r0, 2
00000074 40820000 105 bne .L77
106 #line33
107
108 .LDWlin3:
00000078 38630008 109 addi r3, r3, 8
0000007c 38210018 110 addi sp, sp, 24
00000080 4e800020 111 blr
112 .L77:
00000084 2c000003 113 cmpwi r0, 3
00000088 40820000 114 bne .L80
115 #line34
116
117 .LDWlin4:
0000008c 3863000c 118 addi r3, r3, 12
00000090 38210018 119 addi sp, sp, 24
00000094 4e800020 120 blr
121 .L80:
00000098 38610008 122 addi r3, sp, 8
123 .L69:
124 # .ef
The corresponding C++ code to that snippet should be this one:
inline const float& operator[](const unsigned& idx) const
{
    if (idx == 0) return xy[0];
    if (idx == 1) return xy[1];
    if (idx == 2) return zw[0];
    if (idx == 3) return zw[1];
    return 0.f; // note: this binds a const reference to a temporary - dangling once used
}
Either the index x is a runtime variable, or a compile-time constant.
if it is a compile-time constant, there's a good chance the optimizer will be able to prune the dead branch when inlining operator[] anyway.
if it is a runtime variable, like
for (int i=0; i<4; ++i) { dosomething(f[i]); }
you need the branch anyway. Unless, of course, your optimizer unrolls the loop, in which case it can replace the variable with four constants, inline & prune as above.
Did you profile this to show there's a real problem, and compile it to show the branch really happens where it could be avoided?
Example code:
float foo(f32x4 &f)
{
    return f[0] + f[1] + f[2] + f[3];
}
example output from g++ -O3 -S
.globl _Z3fooR5f32x4
.type _Z3fooR5f32x4, @function
_Z3fooR5f32x4:
.LFB4:
.cfi_startproc
movss (%rdi), %xmm0
addss 4(%rdi), %xmm0
addss 8(%rdi), %xmm0
addss 12(%rdi), %xmm0
ret
.cfi_endproc
Seriously, don't do this!! Just combine the arrays. But since you asked the question, here's an answer:
#include <iostream>

float fLow [2] = {1.0, 2.0};
float fHigh [2] = {50.0, 51.0};
float * fArrays[2] = {fLow, fHigh};

float getFloat (int i)
{
    return fArrays[i >= 2][i % 2];
}

int main()
{
    for (int i = 0; i < 4; ++i)
        std::cout << getFloat(i) << '\n';
    return 0;
}
Output:
1
2
50
51
Since you said in a comment that your index is always a template parameter, you can indeed do the branching at compile time instead of at runtime. Here is a possible solution using std::enable_if:
#include <iostream>
#include <type_traits>

struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 0 && x < 2)>::type* = 0)
    {
        return fLow[x];
    }

    template <int x>
    float& get(typename std::enable_if<(x >= 2 && x < 4)>::type* = 0)
    {
        return fHigh[x-2];
    }
};

int main()
{
    f32x4 f(0.f, 1.f, 2.f, 3.f);
    std::cout << f.get<0>() << " " << f.get<1>() << " "
              << f.get<2>() << " " << f.get<3>(); // prints 0 1 2 3
}
Regarding performance, I don't think there will be any difference since the optimizer should be able to easily propagate the constants and remove dead code subsequently, thereby removing the branch altogether. However, with this approach, you get the benefit that any attempts to invoke the function with an invalid index will result in a compiler error.
Create one array (or vector) with all 4 elements in it, the fLow values occupying the first two positions, then high in the second 2. Then just index into it.
inline float& operator[] (int x) {
    return newFancyArray[x]; // But do some bounds checking above.
}
Based on Luc Touraille's answer, but without using type traits due to their lack of compiler support, I found that the following achieves the purpose of the question. Since operator[] cannot be templatized with an int parameter and still work syntactically, I introduced an at method. This is the result:
struct f32x4
{
    float fLow[2];
    float fHigh[2];

    f32x4(float a, float b, float c, float d)
    {
        fLow[0] = a;
        fLow[1] = b;
        fHigh[0] = c;
        fHigh[1] = d;
    }

    template <unsigned T>
    const float& at() const;
};

template<>
const float& f32x4::at<0>() const { return fLow[0]; }

template<>
const float& f32x4::at<1>() const { return fLow[1]; }

template<>
const float& f32x4::at<2>() const { return fHigh[0]; }

template<>
const float& f32x4::at<3>() const { return fHigh[1]; }