I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies: if you are writing optimised code, avoid anything that can cause branches, since a failed branch prediction forces an unwanted CPU pipeline flush.
Bear in mind also that there are no bool and int types as such in assembler, just registers, so you will probably find that all the conversions are optimised out. Therefore
b += a;
wins for me; it's also clearer.
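To make the advice concrete, here is a minimal sketch (function names are mine, not the OP's) of the kind of hot loop where a data-dependent branch on unpredictable input hurts, and the branch-free formulation you'd hand the compiler instead:

int count_if_version(const bool* flags, int n) {
    int b = 0;
    for (int i = 0; i < n; ++i)
        if (flags[i]) { b++; }   // may mispredict on random data
    return b;
}

int count_add_version(const bool* flags, int n) {
    int b = 0;
    for (int i = 0; i < n; ++i)
        b += flags[i];           // bool converts to 0 or 1, no branch needed
    return b;
}

With optimizations on, a modern compiler will often de-branch or vectorize both loops anyway; the point is that the second form doesn't rely on that.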
Compilers are allowed to assume that the underlying value of a bool isn't messed up, so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
    if (a) { b++; }
    return b;
}

int with_if_char(unsigned char a, int b) {
    if (a) { b++; }
    return b;
}

int without_if(bool a, int b) {
    b += a;
    return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, while generating an actual comparison with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
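# note: cmp sets CF exactly when a == 0; sbb esi, -1 then computes esi + 1 - CF, i.e. b+1 if a != 0, b unchanged if a == 0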
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc instead treats bool just as if it were an unsigned char, without exploiting its properties, generating code similar to clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ treats the bool and the unsigned char versions alike, just as gcc does, although with more naive codegen: it uses a conditional move instead of arithmetic with the flags register, which IIRC traditionally used to be less efficient (I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler and the results point to this specific code (or to some tight loop that involves it); in general you should write sensible, semantically correct code and focus on using the correct algorithms and data structures. Micro-optimization comes later.
In my program this wouldn't work as-is, since a is actually the result of a comparison; the operation is of the type b += (a == c).
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
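As an illustrative sketch of that case (a hypothetical loop, not the OP's actual code), the comparison result feeds the addition directly, so the compiler can materialize it straight from the flags register (e.g. with a sete or an adc) instead of branching:

int count_equal(const int* v, int n, int c) {
    int b = 0;
    for (int i = 0; i < n; ++i)
        b += (v[i] == c);   // comparison yields 0 or 1, added without a branch
    return b;
}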
Related
When I run the following program, it always prints "yes". However, when I change SOME_CONSTANT to -2, it always prints "no". Why is that? I am using the Visual Studio 2019 compiler with optimizations disabled.
#include <stdio.h>

#define SOME_CONSTANT -3

void func() {
    static int i = 2;
    int j = SOME_CONSTANT;
    i += j;
}

void main() {
    if (((bool(*)())func)()) {
        printf("yes\n");
    }
    else {
        printf("no\n");
    }
}
EDIT: Here is the output assembly of func (IDA Pro 7.2):
sub rsp, 18h
mov [rsp+18h+var_18], 0FFFFFFFEh
mov eax, [rsp+18h+var_18]
mov ecx, cs:i
add ecx, eax
mov eax, ecx
mov cs:i, eax
add rsp, 18h
retn
Here is the first part of main:
sub rsp, 628h
mov rax, cs:__security_cookie
xor rax, rsp
mov [rsp+628h+var_18], rax
call ?func##YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
Here is main decompiled:
int __cdecl main(int argc, const char **argv, const char **envp)
{
    int v3; // eax

    func();
    if ( v3 )
        printf("yes\n");
    else
        printf("no\n");
    return 0;
}
((bool(*)())func)()
This expression takes a pointer to func, casts the pointer to a different type of function, then invokes it. Invoking a function through a pointer-to-function whose function signature does not match the original function is undefined behavior which means that anything at all might happen. From the moment this function call happens, the behavior of the program cannot be reasoned about. You cannot predict what will happen with any certainty. Behavior might be different on different optimization levels, different compilers, different versions of the same compiler, or when targeting different architectures.
This is simply because the compiler is allowed to assume that you won't do this. When the compiler's assumptions and reality come into conflict, the result is a vacuum into which the compiler can insert whatever it likes.
The simple answer to your question "why is that?" is, quite simply: because it can. But tomorrow it might do something else.
What apparently happened is:
mov ecx, cs:i
add ecx, eax
mov eax, ecx ; <- final value of i is stored in eax
mov cs:i, eax ; and then also stored in i itself
Different registers could have been used. There is nothing about the code that forces eax to be chosen, and that mov eax, ecx is really redundant; ecx could have been stored straight into i. But it happened to work this way.
And in main:
call ?func##YAXXZ ; func(void)
test eax, eax
jz short loc_1400012B0
rax (or part of it, like eax or al) is used for the return value for integer-ish types (such as booleans) in the WIN64 ABI, so that makes sense. That means the final value of i happens to be used as the return value, by accident.
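For contrast, here is a sketch of the well-defined way to do this kind of thing (hypothetical code, not the OP's): a function pointer may be converted to another function-pointer type and back, but it must be called through its original type.

#include <cstdio>

static bool get_flag() { return true; }   // hypothetical helper

int main() {
    // Converting between function pointer types is allowed...
    void (*erased)() = reinterpret_cast<void (*)()>(&get_flag);
    // ...but a call is only defined after converting back to the real type:
    bool (*restored)() = reinterpret_cast<bool (*)()>(erased);
    if (restored())
        std::printf("yes\n");
    else
        std::printf("no\n");
}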
I always get "no" printed out, so it must vary from compiler to compiler; hence the best answer is UB (undefined behavior).
How can I achieve the following with the minimum number of Intel instructions and without a branch or conditional move:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}
This is on hot code path and I need to squeeze out the most.
GCC solves this nicely, and it knows the negation trick when compiling with -O2 and up:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}

unsigned compare2(unsigned x, unsigned y) {
    return -static_cast<unsigned>(x == y);
}
compare(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
compare2(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
Visual Studio generates the following code:
compare2, COMDAT PROC
xor eax, eax
or r8d, -1 ; ffffffffH
cmp ecx, edx
cmove eax, r8d
ret 0
compare2 ENDP
compare, COMDAT PROC
xor eax, eax
cmp ecx, edx
setne al
dec eax
ret 0
compare ENDP
Here it seems the first version avoids the conditional move (note that the order of the functions was changed).
To view other compilers' solutions, try pasting the code into
https://gcc.godbolt.org/ (add optimization flags).
Interestingly, the first version produces shorter code on ICC. Basically, you have to measure actual performance with your compiler for each version and choose the best.
Also I would not be so sure a conditional register move is slower than other operations.
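If you do decide to measure, a crude harness along these lines is a starting point (a sketch only; serious measurement has to control for inlining, frequency scaling, and noise):

#include <chrono>
#include <cstdio>

unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0u : 0u;
}

int main() {
    volatile unsigned sink = 0;   // defeat dead-code elimination
    auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < 100000000u; ++i)
        sink = compare(i, i ^ 1u);
    auto t1 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("%lld ms\n", static_cast<long long>(ms));
}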
I assume you wrote the function just to show us the relevant part of the code, but a function like this would be an ideal candidate for inlining, potentially allowing the compiler to perform much more useful optimizations that involve the code where this is actually used. This may allow the compiler/CPU to parallelize this computation with other code, or merge some operations.
So, assuming this is indeed a function in your code, write it with the inline keyword and put it in a header.
return -int(x==y) is pretty terse C++. It's of course still up to the compiler to turn that into efficient assembly.
It works because int(true) == 1 and unsigned(-1) == ~0u.
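One common reason to want all-ones/all-zeros in the first place (an illustrative sketch, not from the question) is to use the result as a mask for a branchless select:

// Returns a if x == y, otherwise b, with no branch in the source.
unsigned select_if_equal(unsigned x, unsigned y, unsigned a, unsigned b) {
    unsigned mask = -static_cast<unsigned>(x == y);   // ~0u if equal, 0u if not
    return (a & mask) | (b & ~mask);
}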
I was trying a question on arrays on InterviewBit. In this question I made an inline function returning the absolute value of an integer. But on submitting it I was told that my algorithm was not efficient; when I changed to using abs() from the C++ library, it got a correct answer verdict.
Here is my function that got an inefficient verdict -
inline int abs(int x) { return x > 0 ? x : -x; }

int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i < l - 1) {
        ans = ans + max(abs(X[i] - X[i+1]), abs(Y[i] - Y[i+1]));
        i++;
    }
    return ans;
}
Here's the one that got the correct answer -
int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
    int l = X.size();
    int i = 0;
    int ans = 0;
    while (i < l - 1) {
        ans = ans + max(abs(X[i] - X[i+1]), abs(Y[i] - Y[i+1]));
        i++;
    }
    return ans;
}
Why did this happen? I thought that inline functions are the fastest, as no call is made. Or does the site have an error? And if the site is correct, what does the C++ abs() use that is faster than an inline abs()?
I don't agree with their verdict. They are clearly wrong.
On current optimizing compilers, both solutions produce the exact same output. And even if they didn't, they would produce code as efficient as the library one (it could be a little surprising that everything matches: the algorithm, even the registers used; maybe the actual library implementation is the same as the OP's?).
No sane optimizing compiler will create a branch in your abs() code (if it can be done without one), as the other answer suggests. If the compiler is not optimizing, then it may not inline the library abs() either, so that won't be fast either.
Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and done).
Furthermore, I've seen library implementations in the past where abs() was implemented as a non-inline library function (that was a long time ago, though).
Proof that both implementations are the same:
GCC:
myabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
libabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
Clang:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
Visual Studio (MSVC):
libabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
myabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
ICC:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)
Your abs performs branching based on a condition, while the built-in variant just removes the sign bit from the integer, most likely using just a couple of instructions. A possible assembly example (taken from here):
cdq
xor eax, edx
sub eax, edx
The cdq instruction copies the sign of register eax into register edx: for a positive number edx will be zero, otherwise edx will be 0xFFFFFFFF, which denotes -1. The xor with the original number changes nothing if it is a positive number (any number xor 0 is unchanged); however, when eax is negative, eax xor 0xFFFFFFFF yields ~eax. The final step is to subtract edx from eax: if eax is positive, edx is zero and the final value is still the same; for negative values, (~eax) - (-1) = ~eax + 1 = -eax, which is the value wanted.
As you can see this approach uses only three simple arithmetic instructions and no conditional branching at all.
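The same trick can be written out in C++ (a sketch: right-shifting a negative int is implementation-defined before C++20, and no int abs can represent -INT_MIN):

inline int abs_branchless(int x) {
    int sign = x >> 31;          // like cdq: 0 if x >= 0, -1 (all ones) if x < 0
    return (x ^ sign) - sign;    // identity for x >= 0; ~x + 1 == -x for x < 0
}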
Edit: After some research it turned out that many built-in implementations of abs use the same ternary approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.
However, that does not justify the use of a custom abs implementation, as it violates the DRY principle and no one can guarantee that your implementation is going to be just as good for more sophisticated scenarios and/or unusual platforms. Typically one should think about rewriting library functions only when there is a definite performance problem or some other defect detected in an existing implementation.
Edit2: Just switching from int to float shows considerable performance degradation:
float libfoo(float x)
{
    return ::std::fabs(x);
}
andps xmm0, xmmword ptr [rip + .LCPI0_0]
And a custom version:
inline float my_fabs(float x)
{
    return x > 0.0f ? x : -x;
}

float myfoo(float x)
{
    return my_fabs(x);
}
movaps xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
xorps xmm1, xmm0
xorps xmm2, xmm2
cmpltss xmm2, xmm0
andps xmm0, xmm2
andnps xmm2, xmm1
orps xmm0, xmm2
online compiler
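That single andps just clears the IEEE-754 sign bit. A hand-written equivalent (a sketch using C++20 std::bit_cast; std::fabs already compiles to this) would be:

#include <bit>
#include <cstdint>

inline float fabs_bits(float x) {
    // Clear the sign bit, which is exactly what the lone andps above does.
    auto u = std::bit_cast<std::uint32_t>(x);
    return std::bit_cast<float>(u & 0x7FFFFFFFu);
}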
Your solution might arguably be "cleaner" by the textbook if you used the standard library version, but I think the evaluation is wrong. There is no truly good, justifiable reason for your code being rejected.
This is one of those cases where someone is formally correct (by the textbook), but insists on knowing the only correct solution in a sheer stupid way rather than accepting an alternate solution and saying "...but this here would be best practice, you know".
Technically, it's a correct, practical approach to say "use the standard library, that's what it is for, and it's likely optimized as much as can be". Even though the "optimized as much as can be" part can, in some situations, very well be wrong due to some constraints that the standard puts onto certain algorithms and/or containers.
Now, opinions, best practice, and religion aside. Factually, if you compare the two approaches...
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return std::abs(argc);
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return (argc > 0) ? argc : -argc;
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
... they result in exactly the same, identical instructions.
But even if the compiler did use a compare followed by a conditional move (which it may do in more complicated "branching assignments" and which it will do for example in the case of min/max), that's maybe one CPU cycle or so slower than the bit hacks, so unless you do this several million times, the statement "not efficient" is kinda doubtful anyway.
One cache miss, and you have a hundred times the penalty of a conditional move.
There are valid arguments for and against either approach, which I won't discuss in length. My point is, turning down the OP's solution as "totally wrong" because of such a petty, unimportant detail is rather narrow-minded.
EDIT:
(Fun trivia)
I just tried, for fun and no profit, on my Linux Mint box which uses a somewhat older version of GCC (5.4 as compared to 7.1 above).
Due to me including <cmath> without much thought (hey, a function like abs very clearly belongs to math, doesn't it!) rather than <cstdlib>, which hosts the integer overload, the result was, well... surprising. Calling the library function was much inferior to the single-expression wrapper.
Now, in defense of the standard library, if you include <cstdlib>, then, again, the produced output is exactly identical in either case.
For reference, the test code looked like:
#ifdef DRY
#include <cmath>
int main(int argc, char**)
{
    return std::abs(argc);
}
#else
int abs(int v) noexcept { return (v >= 0) ? v : -v; }

int main(int argc, char**)
{
    return abs(argc);
}
#endif
...resulting in
4004f0: 89 fa mov %edi,%edx
4004f2: 89 f8 mov %edi,%eax
4004f4: c1 fa 1f sar $0x1f,%edx
4004f7: 31 d0 xor %edx,%eax
4004f9: 29 d0 sub %edx,%eax
4004fb: c3 retq
Now, it is apparently quite easy to fall into the trap of unwittingly using the wrong standard library function (I demonstrated how easy it is myself!). And all that without the slightest warning from the compiler, such as "hey, you know, you're using a double overload on an integer value" (well, obviously there's no warning; it's a valid conversion).
With that in mind, there may be yet another "justification" why the OP providing his own one-liner wasn't all that terribly bad and wrong. After all, he could have made that same mistake.
I have a question about performance. I think this can also apply to other languages (not only C++).
Imagine that I have this function:
int addNumber(int a, int b) {
    int result = a + b;
    return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b) {
    return a + b;
}
I have this question because the second function doesn't declare a third variable. But would the compiler detect this in the first code?
To answer this question you can look at the generated assembler code. With -O2, x86-64 gcc 6.2 generates exactly the same code for both methods:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only without optimization is there a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, a performance comparison without optimization is meaningless.
In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
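As a small sketch of the inlining point (names are illustrative):

inline int addNumber(int a, int b) { return a + b; }

int caller(int x) {
    // With the definition visible and optimization on, the call disappears;
    // this typically compiles to a single lea or add instruction.
    return addNumber(x, 1);
}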
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so having a function like that offers no benefit to either the programmer (effort to understand) or the program (performance, etc.). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application:
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
    /* Many useful ways of implementing this operation.
       One of those ways happens to be addition, but we need to
       go through 25 pages of obscure mathematical proof to
       realise that.
    */
    return a + b;
}
I'm using Visual Studio and calling assembly from C++. I know that when you pass arguments to assembly, the first argument is in ECX and the second is in EDX. Why can't I compare the two registers directly without first copying ECX to EAX?
C++:
#include <iostream>

extern "C" int PassingParameters(int a, int b);

int main()
{
    std::cout << "The function returned: " << PassingParameters(5, 10) << std::endl;
    std::cin.get();
    return 0;
}
ASM: This gives the wrong value when comparing the two registers directly.
.code
PassingParameters proc
cmp edx, ecx
jg ReturnEAX
mov eax, edx
ReturnEAX:
ret
PassingParameters endp
end
But if I write it like this, I get the correct value and can compare the two registers directly. Why is this?
.code
PassingParameters proc
mov eax, ecx ; copy ecx to eax.
cmp edx, ecx ; compare ecx and edx directly like above, but this gives the correct value.
jg ReturnEAX
mov eax, edx
ReturnEAX:
ret
PassingParameters endp
end
In your first version, if the jg is taken, you're leaving eax exactly as it was upon entry to the function (i.e., we pretty much have no clue what it contains). Since the return value will normally be in eax, that's going to give an undefined return whenever the jg is taken. In other words, what you've written is roughly like:
int PassingParameters(int a, int b) {
    if (a >= b)
        return b;
    // when a < b, nothing is returned at all
}
In this case, if a < b, your return value is garbage.
In the second code sequence, you're loading one value into eax. Then, if the jg is not taken, you're loading the other value into eax. Either way, the return value will be one input parameter or the other (depending on the result of the comparison). In other words, what you have is roughly equivalent to:
int PassingParameters(int a, int b) {
    if (a < b)
        return a;
    return b;
}
P.S. I would also note that your code looks like x86, not 64-bit code at all. For 64-bit code, you should be using RAX, RCX, etc., rather than EAX, ECX, and such.