I have some C++ code that the MSVC compiler v14.24 compiles to the following assembly:
00007FF798252D4C vmulsd xmm1,xmm1,xmm7
00007FF798252D50 vcvttsd2si rcx,xmm1
00007FF798252D55 vmulsd xmm1,xmm7,mmword ptr [rbx+28h]
00007FF798252D5A mov ecx,ecx
00007FF798252D5C imul rdx,rcx,0BB8h
00007FF798252D63 vcvttsd2si rcx,xmm1
00007FF798252D68 mov ecx,ecx
00007FF798252D6A add rdx,rcx
00007FF798252D6D add rdx,rdx
00007FF798252D70 cmp byte ptr [r14+rdx*8+8],0
00007FF798252D76 je applyActionMovements+15Dh (07FF798252D8Dh)
As you can see, the compiler added two
mov ecx,ecx
instructions that make no sense to me, because they move a register onto itself.
Is there something I'm missing?
Here is a small Godbolt reproducer: https://godbolt.org/z/UFo2qe
int arr[4000][3000];
inline int foo(double a, double b) {
return arr[static_cast<unsigned int>(a * 100)][static_cast<unsigned int>(b * 100)];
}
int bar(double a, double b) {
if (foo(a, b)) {
return 0;
}
return 1;
}
That's an inefficient way to zero-extend ECX into RCX. It would be more efficient to mov into a different register, so that mov-elimination could work.
Duplicates of:
Why did GCC generate mov %eax,%eax and what does it mean?
Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?
But your specific test-case needs zero-extension for a slightly non-obvious reason:
x86 only has conversions between FP and signed integers (until AVX-512). FP -> unsigned int is efficiently possible on x86-64 by doing FP -> int64_t and then taking the low 32 bits as unsigned int.
This is what this sequence is doing:
vcvttsd2si rcx,xmm1 ; double -> int64_t, unsigned int result in ECX
mov ecx,ecx ; zero-extend to promote unsigned to ptrdiff_t for indexing
add rdx,rcx ; 64-bit integer math on the zero-extended result
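For reference, here is a minimal C++ sketch of the same conversion pattern (the helper name to_u32 is made up for illustration):
#include <cstdint>

// double -> unsigned int without AVX-512: convert to int64_t first
// (vcvttsd2si), then keep only the low 32 bits; the mov ecx,ecx in the
// listing above is the zero-extension of those low 32 bits.
unsigned int to_u32(double d) {
    int64_t wide = static_cast<int64_t>(d);  // vcvttsd2si
    return static_cast<unsigned int>(wide);  // low 32 bits
}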
Clang 8.0.0 introduces support for the char8_t type from C++20. I would expect the following functions to have the same compiler output:
#include <algorithm>
bool compare4(char const* pchA, char const* pchB, int n) {
return std::equal(pchA, pchA+4, pchB);
}
bool compare4(char8_t const* pchA, char8_t const* pchB, int n) {
return std::equal(pchA, pchA+4, pchB);
}
However, they compile under -std=c++2a -O2 to
compare4(char const*, char const*, int): # #compare4(char const*, char const*, int)
mov eax, dword ptr [rdi]
cmp eax, dword ptr [rsi]
sete al
ret
_Z8compare4PKDuS0_i: # #_Z8compare4PKDuS0_i
mov al, byte ptr [rdi]
cmp al, byte ptr [rsi]
jne .LBB1_4
mov al, byte ptr [rdi + 1]
cmp al, byte ptr [rsi + 1]
jne .LBB1_4
mov al, byte ptr [rdi + 2]
cmp al, byte ptr [rsi + 2]
jne .LBB1_4
mov al, byte ptr [rdi + 3]
cmp al, byte ptr [rsi + 3]
sete al
ret
.LBB1_4:
xor eax, eax
ret
where the latter is clearly less optimized.
Is there a reason for this (I couldn't find one in the standard), or is this a bug in Clang?
In libstdc++, std::equal calls __builtin_memcmp when it detects that the arguments are "simple"; otherwise it uses a naive for loop. "Simple" here means pointers (or certain iterator wrappers around pointers) to the same integer or pointer type. (relevant source code)
Whether a type is an integer type is detected by the internal __is_integer trait, but libstdc++ 8.2.0 (the version used on godbolt.org) does not specialize this trait for char8_t, so the latter is not detected as an integer type. (relevant source code)
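For illustration, here is a simplified model of that dispatch; this is not the actual libstdc++ source, and the trait name is made up:
// std::equal takes the memcmp fast path only when an internal trait
// recognizes the value type as a built-in integer type.
template <typename T>
struct is_integer_like { static constexpr bool value = false; };

template <> struct is_integer_like<char>          { static constexpr bool value = true; };
template <> struct is_integer_like<unsigned char> { static constexpr bool value = true; };
// ... the other built-in integer types ...
// libstdc++ 8.2 has no such specialization for char8_t, so the fast
// path is skipped and the element-by-element loop is used instead.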
Clang (with this particular configuration) generates more verbose assembly in the for-loop case than in the __builtin_memcmp case. (But the former is not necessarily less optimized in terms of performance; see loop unrolling.)
So there's a reason for this difference, and it's not a bug in clang IMO.
This is not a "bug" in Clang; merely a missed opportunity for optimization.
You can replicate the Clang output by writing the same function taking an enum class whose underlying type is unsigned char. GCC, by contrast, does distinguish the two: it emits the same code for unsigned char and char8_t, but more complex code for the enum class case.
So something about Clang's implementation of char8_t seems to think of it more as a user-defined enumeration than as a fundamental type. It's best to just consider it an early implementation of the standard.
It should be noted that one of the most important differences between unsigned char and char8_t is aliasing requirements. unsigned char pointers may alias with pretty much anything else. By contrast, char8_t pointers cannot. As such, it is reasonable to expect (on a mature implementation, not something that beats the standard it implements to market) different code to be emitted in different cases. The trick is that char8_t code ought to be more efficient if it's different, since the compiler no longer has to emit code that performs additional work to deal with potential aliasing from stores.
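To make the aliasing point concrete, here is a sketch (function names made up; it assumes each pointer refers to an object of its pointee type):
int via_uchar(int* x, unsigned char* p) {
    *x = 1;
    *p = 2;      // unsigned char may alias *x, so *x must be reloaded
    return *x;
}

int via_char8(int* x, char8_t* p) {
    *x = 1;
    *p = 2;      // char8_t cannot alias *x
    return *x;   // the compiler may fold this to return 1
}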
I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies: if you are writing optimised code, avoid anything that can cause branches, since a mispredicted branch forces a costly CPU pipeline flush.
Bear in mind also that assembler has no bool or int types as such, just registers, so you will probably find that all conversions are optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool isn't messed up, so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
if(a){b++;}
return b;
}
int with_if_char(unsigned char a, int b) {
if(a){b++;}
return b;
}
int without_if(bool a, int b) {
b += a;
return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it were an unsigned char, without exploiting its properties, generating code similar to clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ treats the bool and unsigned char versions equally, just as gcc does, although with more naive codegen: it uses a conditional move instead of arithmetic on the flags register, which traditionally used to be less efficient (I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler, and the results point to this specific code (or to some tight loop that involve it); in general you should write sensible, semantically correct code and focus on using the correct algorithms/data structures. Micro-optimization comes later.
In my program, this wouldn't work, as a is actually the result of a comparison: b += (a == c)
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
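For example, a minimal sketch of that comparison case (function names made up):
int with_if(int a, int c, int b) {
    if (a == c) { b++; }  // compilers typically emit cmp + setcc/sbb here,
    return b;             // no branch needed
}

int without_if(int a, int c, int b) {
    b += (a == c);        // same cmp + setcc pattern
    return b;
}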
I was trying a question on arrays on InterviewBit. In this question I made an inline function returning the absolute value of an integer. But on submitting it, I was told that my algorithm was not efficient. When I changed to abs() from the C++ library, it got a correct-answer verdict.
Here is my function that got an inefficient verdict:
inline int abs(int x){return x>0 ? x : -x;}
int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
int l = X.size();
int i = 0;
int ans = 0;
while (i<l-1){
ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
i++;
}
return ans;
}
Here's the one that got the correct answer:
int Solution::coverPoints(vector<int> &X, vector<int> &Y) {
int l = X.size();
int i = 0;
int ans = 0;
while (i<l-1){
ans = ans + max(abs(X[i]-X[i+1]), abs(Y[i]-Y[i+1]));
i++;
}
return ans;
}
Why did this happen? I thought inline functions were fastest, since no call is made. Or is the site in error? And if the site is correct, what does the C++ library abs() use that is faster than my inline abs()?
I don't agree with their verdict. They are clearly wrong.
On current optimizing compilers, both solutions produce the exact same output. And even if they didn't, the code would be just as efficient as the library's (it may be a little surprising that everything matches: the algorithm, even the registers used. Perhaps the actual library implementation is the same as the OP's?).
No sane optimizing compiler will create a branch in your abs() code (if it can be done without one), as the other answer suggests. If the compiler isn't optimizing, it may not inline the library abs(), so that won't be fast either.
Optimizing abs() is one of the easiest things for a compiler to do (just add an entry for it in the peephole optimizer, and done).
Furthermore, I've seen library implementations in the past where abs() was implemented as a non-inline library function (a long time ago, though).
Proof that both implementations are the same:
GCC:
myabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
libabs:
mov edx, edi ; argument passed in EDI by System V AMD64 calling convention
mov eax, edi
sar edx, 31
xor eax, edx
sub eax, edx
ret
Clang:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
neg eax
cmovl eax, edi
ret
Visual Studio (MSVC):
libabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
myabs:
mov eax, ecx ; argument passed in ECX by Windows 64-bit calling convention
cdq
xor eax, edx
sub eax, edx
ret 0
ICC:
myabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
libabs:
mov eax, edi ; argument passed in EDI by System V AMD64 calling convention
cdq
xor edi, edx
sub edi, edx
mov eax, edi
ret
See for yourself on Godbolt Compiler Explorer, where you can inspect the machine code generated by various compilers. (Link kindly provided by Peter Cordes.)
Your abs performs branching based on a condition, while the built-in variant most likely clears the sign without a branch, using just a couple of instructions. A possible assembly sequence (taken from here):
cdq
xor eax, edx
sub eax, edx
The cdq instruction copies the sign of the register eax into register edx: if eax is non-negative, edx becomes zero; otherwise edx becomes 0xFFFFFFFF, which denotes -1. The xor with the original number changes nothing for a non-negative number (any number xor 0 is unchanged), but when eax is negative, eax xor 0xFFFFFFFF yields (not eax). The final step subtracts edx from eax: if eax was non-negative, edx is zero and the value is unchanged; for negative values, (~eax) - (-1) = ~eax + 1 = -eax, which is the value wanted.
As you can see, this approach uses only three simple arithmetic instructions and no conditional branching at all.
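The same trick written out in C++ (a sketch; note that x >> 31 on a negative int is technically implementation-defined, though mainstream compilers implement it as an arithmetic shift):
#include <cassert>

int abs_branchless(int x) {
    int mask = x >> 31;        // 0 for x >= 0, -1 (all ones) for x < 0
    return (x ^ mask) - mask;  // x >= 0: unchanged; x < 0: ~x + 1 == -x
}

int main() {
    assert(abs_branchless(5) == 5);
    assert(abs_branchless(-5) == 5);
    assert(abs_branchless(0) == 0);
}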
Edit: After some research it turned out that many built-in implementations of abs use the same approach, return __x >= 0 ? __x : -__x;, and such a pattern is an obvious target for compiler optimization to avoid unnecessary branching.
However, that does not justify the use of a custom abs implementation, as it violates the DRY principle, and no one can guarantee that your implementation will be just as good in more sophisticated scenarios and/or on unusual platforms. Typically one should think about rewriting library functions only when there is a definite performance problem or some other defect in the existing implementation.
Edit2: Just switching from int to float shows considerable performance degradation:
float libfoo(float x)
{
return ::std::fabs(x);
}
andps xmm0, xmmword ptr [rip + .LCPI0_0]
And a custom version:
inline float my_fabs(float x)
{
return x>0.0f?x:-x;
}
float myfoo(float x)
{
return my_fabs(x);
}
movaps xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [-0.000000e+00,-0.000000e+00,-0.000000e+00,-0.000000e+00]
xorps xmm1, xmm0
xorps xmm2, xmm2
cmpltss xmm2, xmm0
andps xmm0, xmm2
andnps xmm2, xmm1
orps xmm0, xmm2
(Note that the branchy form is not equivalent to fabs: for x == +0.0f the condition x > 0.0f is false, so it returns -0.0f, which a plain sign-bit mask would never produce. The compiler therefore has to emit the full compare-and-blend sequence.)
online compiler
Using the standard library version might arguably be "cleaner" by the textbook, but I think the evaluation is wrong. There is no truly good, justifiable reason for your code being rejected.
This is one of those cases where someone is formally correct (by the textbook), but insists on the one true solution in a sheer stupid way rather than accepting an alternate solution and saying "...but this here would be best practice, you know".
Technically, it's a correct, practical approach to say "use the standard library, that's what it is for, and it's likely optimized as much as can be". Even though the "optimized as much as can be" part can, in some situations, very well be wrong due to constraints that the standard puts on certain algorithms and/or containers.
Now, opinions, best practice, and religion aside. Factually, if you compare the two approaches...
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return std::abs(argc);
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
int main(int argc, char**)
{
40f360: 53 push %rbx
40f361: 48 83 ec 20 sub $0x20,%rsp
40f365: 89 cb mov %ecx,%ebx
40f367: e8 a4 be ff ff callq 40b210 <__main>
return (argc > 0) ? argc : -argc;
40f36c: 89 da mov %ebx,%edx
40f36e: 89 d8 mov %ebx,%eax
40f370: c1 fa 1f sar $0x1f,%edx
40f373: 31 d0 xor %edx,%eax
40f375: 29 d0 sub %edx,%eax
//}
... they result in exactly the same, identical instructions.
But even if the compiler did use a compare followed by a conditional move (which it may do in more complicated "branching assignments" and which it will do for example in the case of min/max), that's maybe one CPU cycle or so slower than the bit hacks, so unless you do this several million times, the statement "not efficient" is kinda doubtful anyway.
One cache miss, and you have a hundred times the penalty of a conditional move.
There are valid arguments for and against either approach, which I won't discuss in length. My point is, turning down the OP's solution as "totally wrong" because of such a petty, unimportant detail is rather narrow-minded.
EDIT:
(Fun trivia)
I just tried, for fun and no profit, on my Linux Mint box, which uses a somewhat older version of GCC (5.4, as compared to 7.1 above).
Because I included <cmath> without much thought (hey, a function like abs clearly belongs to math, doesn't it!) rather than <cstdlib>, which hosts the integer overload, the result was, well... surprising: calling the library function was much inferior to the single-expression wrapper.
Now, in defense of the standard library, if you include <cstdlib>, then, again, the produced output is exactly identical in either case.
For reference, the test code looked like:
#ifdef DRY
#include <cmath>
int main(int argc, char**)
{
return std::abs(argc);
}
#else
int abs(int v) noexcept { return (v >= 0) ? v : -v; }
int main(int argc, char**)
{
return abs(argc);
}
#endif
...resulting in
4004f0: 89 fa mov %edi,%edx
4004f2: 89 f8 mov %edi,%eax
4004f4: c1 fa 1f sar $0x1f,%edx
4004f7: 31 d0 xor %edx,%eax
4004f9: 29 d0 sub %edx,%eax
4004fb: c3 retq
Now, it is apparently quite easy to fall into the trap of unwittingly using the wrong standard library function (I demonstrated how easy it is myself!). And all that without the slightest warning from the compiler, such as "hey, you know, you're using a double overload on an integer value" (well, obviously there's no warning; it's a valid conversion).
With that in mind, there may be yet another "justification" why the OP providing his own one-liner wasn't all that terribly bad and wrong. After all, had he relied on the library, he could have made that same mistake.
I have this simple C++ code:
int testFunction(int* input, long length) {
int sum = 0;
for (long i = 0; i < length; ++i) {
sum += input[i];
}
return sum;
}
#include <stdlib.h>
#include <iostream>
using namespace std;
int main()
{
union{
int* input;
char* cinput;
};
size_t length = 1024;
input = new int[length];
//cinput++;
cout<<testFunction(input, length-1);
}
If I compile it with g++ 4.9.2 with -O3, it runs fine. I expected that uncommenting the cinput++; line would make it run slower; instead, it outright crashes with SIGSEGV.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000400754 in main ()
(gdb) disassemble
Dump of assembler code for function main:
0x00000000004006e0 <+0>: sub $0x8,%rsp
0x00000000004006e4 <+4>: movabs $0x100000000,%rdi
0x00000000004006ee <+14>: callq 0x400690 <_Znam#plt>
0x00000000004006f3 <+19>: lea 0x1(%rax),%rdx
0x00000000004006f7 <+23>: and $0xf,%edx
0x00000000004006fa <+26>: shr $0x2,%rdx
0x00000000004006fe <+30>: neg %rdx
0x0000000000400701 <+33>: and $0x3,%edx
0x0000000000400704 <+36>: je 0x4007cc <main+236>
0x000000000040070a <+42>: cmp $0x1,%rdx
0x000000000040070e <+46>: mov 0x1(%rax),%esi
0x0000000000400711 <+49>: je 0x4007f1 <main+273>
0x0000000000400717 <+55>: add 0x5(%rax),%esi
0x000000000040071a <+58>: cmp $0x3,%rdx
0x000000000040071e <+62>: jne 0x4007e1 <main+257>
0x0000000000400724 <+68>: add 0x9(%rax),%esi
0x0000000000400727 <+71>: mov $0x3ffffffc,%r9d
0x000000000040072d <+77>: mov $0x3,%edi
0x0000000000400732 <+82>: mov $0x3fffffff,%r8d
0x0000000000400738 <+88>: sub %rdx,%r8
0x000000000040073b <+91>: pxor %xmm0,%xmm0
0x000000000040073f <+95>: lea 0x1(%rax,%rdx,4),%rcx
0x0000000000400744 <+100>: xor %edx,%edx
0x0000000000400746 <+102>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000400750 <+112>: add $0x1,%rdx
=> 0x0000000000400754 <+116>: paddd (%rcx),%xmm0
0x0000000000400758 <+120>: add $0x10,%rcx
0x000000000040075c <+124>: cmp $0xffffffe,%rdx
0x0000000000400763 <+131>: jbe 0x400750 <main+112>
0x0000000000400765 <+133>: movdqa %xmm0,%xmm1
0x0000000000400769 <+137>: lea -0x3ffffffc(%r9),%rcx
---Type <return> to continue, or q <return> to quit---
Why does it crash? Is it a compiler bug? Am I causing some undefined behavior? Does the compiler expect that ints are always 4-byte-aligned?
I also tested it on clang and there's no crash.
Here's g++'s assembly output: http://pastebin.com/CJdCDCs4
The code input = new int[length]; cinput++; causes undefined behaviour because the second statement is reading from a union member that is not active.
Even ignoring that, testFunction(input, length-1) would again have undefined behaviour for the same reason.
Even ignoring that, the sum loop accesses an object through a glvalue of the wrong type, which has undefined behaviour.
Even ignoring that, reading from an uninitialized object, as your sum loop does, would again have undefined behaviour.
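For what it's worth, here is a sketch of a union-free setup that avoids the inactive-member and uninitialized reads (the function name is made up, and the misaligned int accesses in testFunction would still be undefined):
#include <cstddef>

int safer_setup(std::size_t length) {
    int* input = new int[length]();                 // value-initialized:
                                                    // no uninitialized reads
    char* cinput = reinterpret_cast<char*>(input);  // viewing an object's
                                                    // bytes through char*
                                                    // is allowed
    (void)cinput;  // but cinput + 1 must never be used as an int*
    int first = input[0];
    delete[] input;
    return first;
}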
gcc has vectorized the loop with SSE instructions. paddd (like most SSE instructions) requires 16-byte alignment. I haven't looked at the code preceding paddd in detail, but I expect that it assumes 4-byte alignment initially, iterates with scalar code (where misalignment only incurs a performance penalty, not a crash) until it can assume 16-byte alignment, then enters the SIMD loop, processing 4 ints at a time. By adding an offset of 1 byte you are breaking the precondition of 4-byte alignment for the array of ints, and after that all bets are off. If you're going to do nasty stuff with misaligned data (and I highly recommend you don't), you should disable automatic vectorization (gcc -fno-tree-vectorize).
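If you really do need to read ints from potentially misaligned addresses, the portable way is std::memcpy, which lets the compiler emit safe unaligned loads; a minimal sketch (the function name is made up):
#include <cstring>

// Reads an int from an address with no alignment guarantee. memcpy has
// no alignment precondition, so the compiler emits an unaligned load
// (or byte copies) instead of assuming alignof(int).
int load_unaligned_int(const char* p) {
    int v;
    std::memcpy(&v, p, sizeof v);
    return v;
}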
The instruction that crashed is paddd (you highlighted it). The name is short for "packed add doubleword" (see e.g. here); it is part of the SSE instruction set. These instructions require aligned pointers; for example, the link above describes the exceptions that paddd may cause:
GP(0)
...(128-bit operations only)
If a memory operand is not aligned on a 16-byte boundary, regardless of segment.
This is exactly your case. The compiler arranged the code in such a way that it could use these fast 128-bit operations like paddd, and you subverted it with your union trick.
I would guess that the code generated by clang doesn't use SSE, so it's not sensitive to alignment. If so, it's also probably much slower (but you won't notice that with just 1024 iterations).