I have a class that takes 64 bits in memory. To implement equality, I used reinterpret_cast<uint64_t*>, but that results in this warning on gcc 7.2 (but not clang 5.0):
$ g++ -O3 -Wall -std=c++17 -g -c example.cpp
example.cpp: In member function ‘bool X::eq_via_cast(X)’:
example.cpp:27:85: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
return *reinterpret_cast<uint64_t*>(this) == *reinterpret_cast<uint64_t*>(&x);
From my understanding, accessing the object through such a cast is undefined behavior unless you cast to the actual type or to char*. For instance, there could be architecture-specific layout restrictions when loading values. That is why I tried alternative approaches.
Here is the source code of a simplified version (link to godbolt):
#include <cstdint>
#include <cstring>
struct Y
{
    uint32_t x;
    bool operator==(Y y) { return x == y.x; }
};

struct X
{
    Y a;
    int16_t b;
    int16_t c;

    uint64_t to_uint64() {
        uint64_t result;
        std::memcpy(&result, this, sizeof(uint64_t));
        return result;
    }

    bool eq_via_memcpy(X x) {
        return to_uint64() == x.to_uint64();
    }

    bool eq_via_cast(X x) {
        return *reinterpret_cast<uint64_t*>(this) == *reinterpret_cast<uint64_t*>(&x);
    }

    bool eq_via_comparisons(X x) {
        return a == x.a && b == x.b && c == x.c;
    }
};

static_assert(sizeof(X) == sizeof(uint64_t));

bool via_memcpy(X x1, X x2) {
    return x1.eq_via_memcpy(x2);
}

bool via_cast(X x1, X x2) {
    return x1.eq_via_cast(x2);
}

bool via_comparisons(X x1, X x2) {
    return x1.eq_via_comparisons(x2);
}
Avoiding the cast by explicitly copying the data via memcpy prevents the warning. As far as I understand it, it should also be portable.
Looking at the assembler (gcc 7.2 with -std=c++17 -O3), memcpy is optimized perfectly while the straightforward comparisons lead to less efficient code:
via_memcpy(X, X):
cmp rdi, rsi
sete al
ret
via_cast(X, X):
cmp rdi, rsi
sete al
ret
via_comparisons(X, X):
xor eax, eax
cmp esi, edi
je .L7
rep ret
.L7:
sar rdi, 32
sar rsi, 32
cmp edi, esi
sete al
ret
Very similar with clang 5.0 (-std=c++17 -O3):
via_memcpy(X, X): # #via_memcpy(X, X)
cmp rdi, rsi
sete al
ret
via_cast(X, X): # #via_cast(X, X)
cmp rdi, rsi
sete al
ret
via_comparisons(X, X): # #via_comparisons(X, X)
cmp edi, esi
jne .LBB2_1
mov rax, rdi
shr rax, 32
mov rcx, rsi
shr rcx, 32
shl eax, 16
shl ecx, 16
cmp ecx, eax
jne .LBB2_3
shr rdi, 48
shr rsi, 48
shl edi, 16
shl esi, 16
cmp esi, edi
sete al
ret
.LBB2_1:
xor eax, eax
ret
.LBB2_3:
xor eax, eax
ret
From this experiment, it looks like the memcpy version is the best approach in performance critical parts of the code.
Questions:
Is my understanding correct that the memcpy version is portable C++ code?
Is it reasonable to assume that the compilers are able to optimize away the memcpy call like in this example?
Are there better approaches that I have overlooked?
Update:
As UKMonkey pointed out, memcmp is more natural when doing bitwise comparisons. It also compiles down to the same optimized version:
bool eq_via_memcmp(X x) {
    return std::memcmp(this, &x, sizeof(*this)) == 0;
}
Here is the updated godbolt link. It should also be portable (sizeof(*this) covers the whole 64-bit object), so I assume it is the best solution so far.
In C++17, memcmp in combination with has_unique_object_representations can be used:
bool eq_via_memcmp(X x) {
    static_assert(std::has_unique_object_representations_v<X>);
    return std::memcmp(this, &x, sizeof(*this)) == 0;
}
Compilers should be able to optimize it to one comparison (godbolt link):
via_memcmp(X, X):
cmp rdi, rsi
sete al
ret
The static assertion makes sure that the class X does not contain padding bits. Otherwise, comparing two logically equivalent objects could return false because the content of the padding bits may differ. In that case, it is safer to reject that code at compile time.
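For illustration (this type is my own example, not from the original code), a struct like the following has padding bytes after b on typical ABIs, so the assertion would fire and the bitwise comparison would be rejected at compile time:
#include <cstdint>
#include <type_traits>

struct Padded {
    std::uint32_t a;
    std::uint8_t b;  // typically followed by 3 bytes of padding
};

// Holds on typical implementations, where Padded is padded to 8 bytes:
static_assert(!std::has_unique_object_representations_v<Padded>);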
(Note: Presumably, C++20 will add std::bit_cast, which could be used as an alternative to memcmp. But you still have to make sure that no padding is involved, for the same reason.)
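As a rough sketch of that alternative (assuming a C++20 compiler and reusing the struct X from above; this is not part of the original code), std::bit_cast removes the memcpy boilerplate while the same static assertion still guards against padding:
#include <bit>          // std::bit_cast (C++20)
#include <cstdint>
#include <type_traits>

bool eq_via_bit_cast(X lhs, X rhs) {
    // Reject types with padding bits, as in the memcmp version above.
    static_assert(std::has_unique_object_representations_v<X>);
    // Compare the object representations as plain 64-bit integers.
    return std::bit_cast<std::uint64_t>(lhs) == std::bit_cast<std::uint64_t>(rhs);
}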
Related
While benchmarking code involving std::optional<double>, I noticed that the code MSVC generates runs at roughly half the speed of the code produced by clang or gcc. After spending some time reducing the code, I noticed that MSVC apparently has issues generating code for std::optional::operator=. Using std::optional::emplace() does not exhibit the slowdown.
The following function
void test_assign(std::optional<double>& f) {
    f = std::optional{42.0};
}
produces
sub rsp, 24
vmovsd xmm0, QWORD PTR __real#4045000000000000
mov BYTE PTR $T1[rsp+8], 1
vmovups xmm1, XMMWORD PTR $T1[rsp]
vmovsd xmm1, xmm1, xmm0
vmovups XMMWORD PTR [rcx], xmm1
add rsp, 24
ret 0
Notice the unaligned mov operations.
On the contrary, the function
void test_emplace(std::optional<double>& f) {
    f.emplace(42.0);
}
compiles to
mov rax, 4631107791820423168 ; 4045000000000000H
mov BYTE PTR [rcx+8], 1
mov QWORD PTR [rcx], rax
ret 0
This version is much simpler and faster.
These were generated using MSVC 19.32 with /O2 /std:c++17 /DNDEBUG /arch:AVX.
clang 14 with -O3 -std=c++17 -DNDEBUG -mavx produces
movabs rax, 4631107791820423168
mov qword ptr [rdi], rax
mov byte ptr [rdi + 8], 1
ret
in both cases.
Replacing std::optional<double> with
struct MyOptional {
    double d;
    bool hasValue; // Required to reproduce the problem

    MyOptional(double v) {
        d = v;
    }

    void emplace(double v) {
        d = v;
    }
};
exhibits the same issue. Apparently MSVC has some trouble with the additional bool member.
See godbolt for a live example.
Why is MSVC producing these unaligned moves? To be clear, the question is not why the moves are unaligned rather than aligned (which wouldn't improve things, according to this post), but why MSVC produces a considerably more expensive set of instructions in the assignment case.
Is this simply a bug (or missed optimization opportunity) by MSVC? Or am I missing something?
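In the meantime, one possible workaround, sketched here purely as an illustration (it is not from the original measurements, and whether it helps depends on the MSVC version, so it should be re-checked on godbolt), is to route hot assignments through emplace(), since that produced the cheaper code sequence above:
#include <optional>

// Hypothetical helper: replaces the contained value via emplace() instead of
// optional::operator=, which MSVC compiled to the slower sequence above.
void assign_via_emplace(std::optional<double>& f, double v) {
    f.emplace(v);
}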
I have been performing performance optimisations on some code at work, and stumbled upon some strange behaviour, which I've boiled down to the simple snippet of C++ code below:
#include <stdint.h>
void Foo(uint8_t*& out)
{
    out[0] = 1;
    out[1] = 2;
    out[2] = 3;
    out[3] = 4;
}
I then compile it with clang (on Windows) with the following: clang -S -O3 -masm=intel test.cpp. This results in the following assembly:
mov rax, qword ptr [rcx]
mov byte ptr [rax], 1
mov rax, qword ptr [rcx]
mov byte ptr [rax + 1], 2
mov rax, qword ptr [rcx]
mov byte ptr [rax + 2], 3
mov rax, qword ptr [rcx]
mov byte ptr [rax + 3], 4
ret
Why has clang generated code that repeatedly dereferences the out parameter into the rax register? This seems like a really obvious optimization that it is deliberately not making, so the question is why?
Interestingly, I've tried changing uint8_t to uint16_t and this much better machine code is generated as a result:
mov rax, qword ptr [rcx]
movabs rcx, 1125912791875585
mov qword ptr [rax], rcx
ret
The compiler cannot do this optimization because of strict aliasing: uint8_t is always* defined as unsigned char, which may alias any object. A uint8_t* can therefore point to any memory location, including out itself, and because out is passed by reference, every write through it could change out, so the pointer has to be reloaded before each store.
Here is an obscure, yet correct, usage that depends on out not being cached:
#include <cassert>
#include <stdint.h>

void Foo(uint8_t*& out)
{
    uint8_t local;
    // CANNOT be used as a cached value further down in the code.
    uint8_t* tmp = out;
    // Recover the stored pointer.
    uint8_t** orig = reinterpret_cast<uint8_t**>(out);
    // CHANGES `out` itself.
    *orig = &local;
    **orig = 5;
    assert(local == 5);
    // IS NOT EQUAL even though we did not touch `out` at all.
    assert(tmp != out);
    assert(out == &local);
    assert(*out == 5);
}

int main() {
    // True type of the stored ptr is uint8_t**.
    uint8_t* ptr = reinterpret_cast<uint8_t*>(&ptr);
    Foo(ptr);
}
This also explains why uint16_t generates "optimized" code: uint16_t can never* be (unsigned) char, so the compiler is free to assume that it does not alias other pointer types, such as the pointer itself.
*Maybe some irrelevant obscure platforms with differently-sized bytes. That is beside the point.
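A common way to get the combined stores back, sketched here as my own addition rather than part of the answer above, is to copy the pointer into a local first: writes through the local copy cannot modify the copy itself, so the compiler no longer has to reload it before every store (the exact codegen is worth verifying on godbolt):
#include <stdint.h>

void Foo(uint8_t*& out)
{
    uint8_t* p = out;  // local copy; stores through p cannot change p itself
    p[0] = 1;
    p[1] = 2;
    p[2] = 3;
    p[3] = 4;
    // Note: if the caller relies on `out` being re-read between the stores
    // (as in the obscure example above), this changes observable behaviour.
}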
Suppose we have 2 POD structs composed of only integral data types (including enums and raw pointers).
struct A
{
    std::int64_t x = 0;
    std::int64_t y = 1;
};

struct B
{
    std::int64_t x = 0;
    std::int32_t y = 1;
    std::int32_t z = 2;
};
Note that A and B are both 128 bits in size.
Let's also assume we're on 64-bit architecture.
Of course, if x, y, and z weren't integral types, the cost of copy, move, construction and destruction may be different between A and B depending on the implementation details of the members.
But if we assume that x, y, and z are only integral types, is there any cost difference between A and B in terms of:
Construction
Copy Construction/Assignment
Member Access (does alignment play any role here?)
Specifically, is the copy and initialization of two side-by-side 32-bit integers universally more expensive than a single 64-bit integer?
Or is this something specific to compiler and optimization flags?
But if we assume that x, y, and z are only integral types, is there any cost difference between A and B...
Provided that both A and B are trivial types of the same size, there shouldn't be any difference in cost of construction and copying. That's because modern compilers implement store merging:
-fstore-merging
Perform merging of narrow stores to consecutive memory addresses. This pass merges contiguous stores of immediate values narrower than a word into fewer wider stores to reduce the number of instructions. This is enabled by default at -O2 and higher as well as -Os.
Example code:
#include <cstdint>

struct A {
    std::int64_t x = 0;
    std::int64_t y = 1;
};

struct B {
    std::int64_t x = 0;
    std::int32_t y = 1;
    std::int32_t z = 2;
};

A f0(std::int64_t x, std::int64_t y) {
    return {x, y};
}

B f1(std::int64_t x, std::int32_t y, std::int32_t z) {
    return {x, y, z};
}

void g0(A);
void g1(B);

void h0(int, A a) { g0(a); }
void h1(int, B b) { g1(b); }
Here is the generated assembly for construction and copy:
gcc-9.2 -O3 -std=gnu++17 -march=skylake:
f0(long, long):
mov rax, rdi
mov rdx, rsi
ret
f1(long, int, int):
mov QWORD PTR [rsp-16], 0
mov QWORD PTR [rsp-24], rdi
vmovdqa xmm1, XMMWORD PTR [rsp-24]
vpinsrd xmm0, xmm1, esi, 2
vpinsrd xmm2, xmm0, edx, 3
vmovaps XMMWORD PTR [rsp-24], xmm2
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]
ret
h0(int, A):
mov rdi, rsi
mov rsi, rdx
jmp g0(A)
h1(int, B):
mov rdi, rsi
mov rsi, rdx
jmp g1(B)
clang-9.0 -O3 -std=gnu++17 -march=skylake:
f0(long, long): # #f0(long, long)
mov rdx, rsi
mov rax, rdi
ret
f1(long, int, int): # #f1(long, int, int)
mov rax, rdi
shl rdx, 32
mov ecx, esi
or rdx, rcx
ret
h0(int, A): # #h0(int, A)
mov rdi, rsi
mov rsi, rdx
jmp g0(A) # TAILCALL
h1(int, B): # #h1(int, B)
mov rdi, rsi
mov rsi, rdx
jmp g1(B) # TAILCALL
Note how both structures are passed in registers in h0 and h1.
However, gcc botches code for construction of B by generating unnecessary AVX instructions. Filed a bug report.
I am currently trying to improve the speed of my program.
I was wondering whether it would help to replace all if-statements of the type:
bool a=1;
int b=0;
if(a){b++;}
with this:
bool a=1;
int b=0;
b+=a;
I am unsure whether the conversion from bool to int could be a problem time-wise.
One rule of thumb when programming is to not micro-optimise.
Another rule is to write clear code.
But in this case, another rule applies: if you are writing optimised code, avoid anything that can cause branches, since failed branch prediction can cause costly CPU pipeline flushes.
Bear in mind also that there are no bool and int types as such in assembler: just registers, so you will probably find that all the conversions will be optimised out. Therefore
b += a;
wins for me; it's also clearer.
Compilers are allowed to assume that the underlying value of a bool isn't messed up, so optimizing compilers can avoid the branch.
If we look at the generated code for this artificial test
int with_if_bool(bool a, int b) {
    if (a) { b++; }
    return b;
}

int with_if_char(unsigned char a, int b) {
    if (a) { b++; }
    return b;
}

int without_if(bool a, int b) {
    b += a;
    return b;
}
clang will exploit this fact and generate the exact same branchless code that sums a and b for the bool version, and instead generate actual comparisons with zero in the unsigned char case (although it's still branchless code):
with_if_bool(bool, int): # #with_if_bool(bool, int)
lea eax, [rdi + rsi]
ret
with_if_char(unsigned char, int): # #with_if_char(unsigned char, int)
cmp dil, 1
sbb esi, -1
mov eax, esi
ret
without_if(bool, int): # #without_if(bool, int)
lea eax, [rdi + rsi]
ret
gcc will instead treat bool just as if it was an unsigned char, without exploiting its properties, generating similar code as clang's unsigned char case.
with_if_bool(bool, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
with_if_char(unsigned char, int):
mov eax, esi
cmp dil, 1
sbb eax, -1
ret
without_if(bool, int):
movzx edi, dil
lea eax, [rdi+rsi]
ret
Finally, Visual C++ treats the bool and the unsigned char versions equally, just as gcc does, although with more naive codegen: it uses a conditional move instead of doing arithmetic with the flags register, which, IIRC, traditionally used to be less efficient (I don't know about current machines).
a$ = 8
b$ = 16
int with_if_bool(bool,int) PROC ; with_if_bool, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_bool(bool,int) ENDP ; with_if_bool
a$ = 8
b$ = 16
int with_if_char(unsigned char,int) PROC ; with_if_char, COMDAT
test cl, cl
lea eax, DWORD PTR [rdx+1]
cmove eax, edx
ret 0
int with_if_char(unsigned char,int) ENDP ; with_if_char
a$ = 8
b$ = 16
int without_if(bool,int) PROC ; without_if, COMDAT
movzx eax, cl
add eax, edx
ret 0
int without_if(bool,int) ENDP ; without_if
In all cases, no branches are generated; the only difference is that, on most compilers, some more complex code is generated that depends on a cmp or a test, creating a longer dependency chain.
That being said, I would worry about this kind of micro-optimization only if you actually run your code under a profiler and the results point to this specific code (or to some tight loop that involves it); in general, you should write sensible, semantically correct code and focus on using the correct algorithms and data structures. Micro-optimization comes later.
In my program, this wouldn't work, as a is actually the result of a comparison, i.e. the code has the form b += (a == c).
This should be even better for the optimizer, as it doesn't even have any doubt about where the bool is coming from - it can just decide straight from the flags register. As you can see, here gcc produces quite similar code for the two cases, clang exactly the same, while VC++ as usual produces something that is more conditional-ish (a cmov) in the if case.
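To make that concrete, here is a small illustration (function and parameter names are mine, not from the thread); the expectation is that both forms compile to a branchless cmp followed by a flag-based add, which is easy to confirm on godbolt:
int count_equal_if(int a, int c, int b) {
    if (a == c) { b++; }  // branchy formulation
    return b;
}

int count_equal_add(int a, int c, int b) {
    b += (a == c);        // the comparison yields a bool that converts to 0 or 1
    return b;
}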
How can I achieve the following with the minimum number of Intel instructions and without a branch or conditional move:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}
This is on a hot code path and I need to squeeze out the most performance.
GCC solves this nicely, and it knows the negation trick when compiling with -O2 and up:
unsigned compare(unsigned x, unsigned y) {
    return (x == y) ? ~0 : 0;
}

unsigned compare2(unsigned x, unsigned y) {
    return -static_cast<unsigned>(x == y);
}
compare(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
compare2(unsigned int, unsigned int):
xor eax, eax
cmp edi, esi
sete al
neg eax
ret
Visual Studio generates the following code:
compare2, COMDAT PROC
xor eax, eax
or r8d, -1 ; ffffffffH
cmp ecx, edx
cmove eax, r8d
ret 0
compare2 ENDP
compare, COMDAT PROC
xor eax, eax
cmp ecx, edx
setne al
dec eax
ret 0
compare ENDP
Here it seems the first version avoids the conditional move (note that the order of the functions was changed).
To view other compilers' solutions, try pasting the code into https://gcc.godbolt.org/ (add optimization flags).
Interestingly, the first version produces shorter code on icc. Basically, you have to measure the actual performance with your compiler for each version and choose the best.
Also I would not be so sure a conditional register move is slower than other operations.
I assume you wrote the function just to show us the relevant part of the code, but a function like this would be an ideal candidate for inlining, potentially allowing the compiler to perform much more useful optimizations that involve the code where this is actually used. This may allow the compiler/CPU to parallelize this computation with other code, or merge some operations.
So, assuming this is indeed a function in your code, write it with the inline keyword and put it in a header.
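For example, such a header-only version might look like this (the file name is illustrative):
// compare.h
#pragma once

// Defined inline in a header so every translation unit can inline the call
// and fold the comparison into the surrounding code.
inline unsigned compare(unsigned x, unsigned y) {
    return -static_cast<unsigned>(x == y);  // ~0u if equal, 0 otherwise
}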
return -int(x==y) is pretty terse C++. It's of course still up to the compiler to turn that into efficient assembly.
It works because int(true) == 1 and unsigned(-1) == ~0U.