I have been doing some performance optimisation on code at work, and stumbled upon some strange behaviour, which I've boiled down to the simple snippet of C++ code below:
#include <stdint.h>
void Foo(uint8_t*& out)
{
out[0] = 1;
out[1] = 2;
out[2] = 3;
out[3] = 4;
}
I then compile it with clang (on Windows) with the following: clang -S -O3 -masm=intel test.cpp. This results in the following assembly:
mov rax, qword ptr [rcx]
mov byte ptr [rax], 1
mov rax, qword ptr [rcx]
mov byte ptr [rax + 1], 2
mov rax, qword ptr [rcx]
mov byte ptr [rax + 2], 3
mov rax, qword ptr [rcx]
mov byte ptr [rax + 3], 4
ret
Why has clang generated code that repeatedly dereferences the out parameter into the rax register? This seems like a really obvious optimization that it is deliberately not making, so the question is why?
Interestingly, I've tried changing uint8_t to uint16_t and this much better machine code is generated as a result:
mov rax, qword ptr [rcx]
movabs rcx, 1125912791875585
mov qword ptr [rax], rcx
ret
The compiler cannot make this optimization simply because of strict aliasing: uint8_t is always* defined as unsigned char, which is allowed to alias any object. Therefore out can point to any memory location, which means it can also point to itself, and because you pass it by reference, each write can have the side effect of changing out inside the function.
Here is an obscure, yet correct, usage that depends on those non-cached reads:
#include <cassert>
#include <stdint.h>
void Foo(uint8_t*& out)
{
uint8_t local;
// CANNOT be used as a cached value further down in the code.
uint8_t* tmp = out;
// Recover the stored pointer.
uint8_t** orig = reinterpret_cast<uint8_t**>(out);
// CHANGES `out` itself.
*orig = &local;
**orig = 5;
assert(local == 5);
// NOT EQUAL, even though we never assigned to `out` by name.
assert(tmp != out);
assert(out == &local);
assert(*out == 5);
}
int main(){
// True type of the stored ptr is uint8_t**
uint8_t* ptr = reinterpret_cast<uint8_t*>(&ptr);
Foo(ptr);
}
This also explains why uint16_t generates "optimized" code: uint16_t can never* be (unsigned) char, so the compiler is free to assume that it does not alias other pointer types, such as itself.
*Except maybe on some irrelevant, obscure platforms with differently-sized bytes. That is beside the point.
Related
I was playing with the Compiler Explorer and I stumbled upon an interesting behavior with the ternary operator when using something like this:
std::string get_string(bool b)
{
return b ? "Hello" : "Stack-Overflow";
}
The compiler generated code for this (clang trunk, with -O3) is this:
get_string[abi:cxx11](bool): # #get_string[abi:cxx11](bool)
push r15
push r14
push rbx
mov rbx, rdi
mov ecx, offset .L.str
mov eax, offset .L.str.1
test esi, esi
cmovne rax, rcx
add rdi, 16 #< Why is the compiler storing the length of the string
mov qword ptr [rbx], rdi
xor sil, 1
movzx ecx, sil
lea r15, [rcx + 8*rcx]
lea r14, [rcx + 8*rcx]
add r14, 5 #< I also think this is the length of "Hello" (but not sure)
mov rsi, rax
mov rdx, r14
call memcpy #< Why is there a call to memcpy
mov qword ptr [rbx + 8], r14
mov byte ptr [rbx + r15 + 21], 0
mov rax, rbx
pop rbx
pop r14
pop r15
ret
.L.str:
.asciz "Hello"
.L.str.1:
.asciz "Stack-Overflow"
However, the compiler generates considerably smaller code for the following snippet, with no call to memcpy, and it does not need to know the lengths of both strings at the same time; instead there are two different labels that it jumps to:
std::string better_string(bool b)
{
if (b)
{
return "Hello";
}
else
{
return "Stack-Overflow";
}
}
The compiler generated code for the above snippet (clang trunk with -O3) is this:
better_string[abi:cxx11](bool): # #better_string[abi:cxx11](bool)
mov rax, rdi
lea rcx, [rdi + 16]
mov qword ptr [rdi], rcx
test sil, sil
je .LBB0_2
mov dword ptr [rcx], 1819043144
mov word ptr [rcx + 4], 111
mov ecx, 5
mov qword ptr [rax + 8], rcx
ret
.LBB0_2:
movabs rdx, 8606216600190023247
mov qword ptr [rcx + 6], rdx
movabs rdx, 8525082558887720019
mov qword ptr [rcx], rdx
mov byte ptr [rax + 30], 0
mov ecx, 14
mov qword ptr [rax + 8], rcx
ret
I get the same result when I use the ternary operator with:
std::string get_string(bool b)
{
return b ? std::string("Hello") : std::string("Stack-Overflow");
}
I would like to know why the ternary operator in the first example generates that compiler code. I believe that the culprit lies within the const char[].
P.S.: GCC emits calls to strlen in the first example, but Clang doesn't.
Link to the Compiler Explorer example: https://godbolt.org/z/Exqs6G
Thank you for your time! Sorry for the wall of code.
The overarching difference here is that the first version is branchless.
16 isn't the length of any string here (the longer one, with its NUL, is only 15 bytes long); it's an offset into the return object (whose address is passed in RDI to support RVO), used to indicate that the small-string optimization is in use (note the lack of any allocation). The length, either 5 or 5+1+8 = 14, is computed into R14, which is stored in the std::string as well as passed to memcpy (along with a pointer chosen by CMOVNE) to copy the actual string bytes.
The other version has an obvious branch (although part of the std::string construction has been hoisted above it) and actually does have 5 and 14 explicitly, but is obfuscated by the fact that the string bytes have been included as immediate values (expressed as integers) of various sizes.
As for why these three equivalent functions produce two different versions of the generated code, all I can offer is that optimizers are iterative and heuristic algorithms; they don’t reliably find the same “best” assembly independently of their starting point.
The first version returns a string object which is initialized with a not-constant expression yielding one of the string literals, so the constructor is run as for any other variable string object, thus the memcpy to do the initialization.
The other variants return either one string object initialized with a string literal or another string object initialized with another string literal, both of which can be optimized to a string object constructed from a constant expression where no memcpy is needed.
So the real answer is: the first version applies the ?: operator to char[] expressions before initializing the objects, while the other versions apply it to string objects already being initialized.
It does not matter whether one of the versions is branchless.
I am curious why the following piece of code:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNO";
}
when compiled with -O3 yields the following code:
main: # #main
xor eax, eax
ret
(I perfectly understand that there is no need for the unused a so the compiler can entirely omit it from the generated code)
However the following program:
#include <string>
int main()
{
std::string a = "ABCDEFGHIJKLMNOP"; // <-- !!! One Extra P
}
yields:
main: # #main
push rbx
sub rsp, 48
lea rbx, [rsp + 32]
mov qword ptr [rsp + 16], rbx
mov qword ptr [rsp + 8], 16
lea rdi, [rsp + 16]
lea rsi, [rsp + 8]
xor edx, edx
call std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long)
mov qword ptr [rsp + 16], rax
mov rcx, qword ptr [rsp + 8]
mov qword ptr [rsp + 32], rcx
movups xmm0, xmmword ptr [rip + .L.str]
movups xmmword ptr [rax], xmm0
mov qword ptr [rsp + 24], rcx
mov rax, qword ptr [rsp + 16]
mov byte ptr [rax + rcx], 0
mov rdi, qword ptr [rsp + 16]
cmp rdi, rbx
je .LBB0_3
call operator delete(void*)
.LBB0_3:
xor eax, eax
add rsp, 48
pop rbx
ret
mov rdi, rax
call _Unwind_Resume
.L.str:
.asciz "ABCDEFGHIJKLMNOP"
when compiled with the same -O3. I don't understand why it does not recognize that a is still unused, regardless of the string being one byte longer.
This question concerns gcc 9.1 and clang 8.0 (online: https://gcc.godbolt.org/z/p1Z8Ns), because other compilers in my observation either entirely drop the unused variable (ellcc) or generate code for it regardless of the length of the string.
This is due to the small string optimization. When the string data occupies 16 characters or fewer, including the null terminator, it is stored in a buffer local to the std::string object itself. Otherwise, the string allocates memory on the heap and stores the data there.
The first string "ABCDEFGHIJKLMNO" plus the null terminator is exactly of size 16. Adding "P" makes it exceed the buffer, hence operator new is called internally, which may ultimately result in a system call. The compiler can only optimize something away if it can prove that doing so has no observable side effects. An allocation makes that impossible; by contrast, changing a buffer local to the object under construction allows for such a side-effect analysis.
Tracing the local buffer in libstdc++, version 9.1, reveals these parts of bits/basic_string.h:
template<typename _CharT, typename _Traits, typename _Alloc>
class basic_string
{
// ...
enum { _S_local_capacity = 15 / sizeof(_CharT) };
union
{
_CharT _M_local_buf[_S_local_capacity + 1];
size_type _M_allocated_capacity;
};
// ...
};
which lets you spot the local buffer size _S_local_capacity and the local buffer itself (_M_local_buf). When the constructor triggers basic_string::_M_construct being called, you have in bits/basic_string.tcc:
void _M_construct(_InIterator __beg, _InIterator __end, ...)
{
size_type __len = 0;
size_type __capacity = size_type(_S_local_capacity);
while (__beg != __end && __len < __capacity)
{
_M_data()[__len++] = *__beg;
++__beg;
}
where the local buffer is filled with its content. Right after this part, we get to the branch where the local capacity is exhausted: new storage is allocated (through the allocator, in _M_create), the local buffer is copied into the new storage, which is then filled with the rest of the initializing argument:
while (__beg != __end)
{
if (__len == __capacity)
{
// Allocate more space.
__capacity = __len + 1;
pointer __another = _M_create(__capacity, __len);
this->_S_copy(__another, _M_data(), __len);
_M_dispose();
_M_data(__another);
_M_capacity(__capacity);
}
_M_data()[__len++] = *__beg;
++__beg;
}
As a side note, small string optimization is quite a topic on its own. To get a feeling for how tweaking individual bits can make a difference at large scale, I'd recommend this talk. It also mentions how the std::string implementation that ships with gcc (libstdc++) works and changed during the past to match newer versions of the standard.
I was surprised the compiler saw through a std::string constructor/destructor pair until I saw your second example. It didn't. What you're seeing here is small string optimization and corresponding optimizations from the compiler around that.
Small string optimization is when the std::string object itself contains a buffer big enough to hold the string's contents, along with its size and possibly a discriminating bit that indicates whether the string is operating in small or big string mode. In that case, no dynamic allocation occurs and the string is stored in the std::string object itself.
Compilers are really bad at eliding unneeded allocations and deallocations; they are treated almost as if they had side effects and are thus impossible to elide. When you go over the small string optimization threshold, a dynamic allocation occurs, and the result is what you see.
As an example
void foo() {
delete new int;
}
is the simplest, dumbest allocation/deallocation pair possible, yet gcc emits this assembly even under -O3:
sub rsp, 8
mov edi, 4
call operator new(unsigned long)
mov esi, 4
add rsp, 8
mov rdi, rax
jmp operator delete(void*, unsigned long)
While the accepted answer is valid, since C++14 it's actually the case that new and delete calls can be optimized away. See this arcane wording on cppreference:
New-expressions are allowed to elide ... allocations made through replaceable allocation functions. In case of elision, the storage may be provided by the compiler without making the call to an allocation function (this also permits optimizing out unused new-expression).
...
Note that this optimization is only permitted when new-expressions are
used, not any other methods to call a replaceable allocation function:
delete[] new int[10]; can be optimized out, but operator
delete(operator new(10)); cannot.
This actually allows compilers to completely drop your local std::string even if it's very long. In fact - clang++ with libc++ already does this (GodBolt), since libc++ uses built-ins __new and __delete in its implementation of std::string - that's "storage provided by the compiler". Thus, we get:
main():
xor eax, eax
ret
with basically any-length unused string.
GCC doesn't do this yet, but I've recently opened bug reports about it; see this SO answer for links.
I have a question about performance. I think it also applies to other languages, not only C++.
Imagine that I have this function:
int addNumber(int a, int b){
int result = a + b;
return result;
}
Is there any performance improvement if I write the code above like this?
int addNumber(int a, int b){
return a + b;
}
I ask because the second function doesn't declare a third variable. But would the compiler detect this in the first version anyway?
To answer this question you can look at the generated assembler code. With -O2, x86-64 gcc 6.2 generates exactly the same code for both methods:
addNumber(int, int):
lea eax, [rdi+rsi]
ret
addNumber2(int, int):
lea eax, [rdi+rsi]
ret
Only without optimization turned on, there is a difference:
addNumber(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-20], edi
mov DWORD PTR [rbp-24], esi
mov edx, DWORD PTR [rbp-20]
mov eax, DWORD PTR [rbp-24]
add eax, edx
mov DWORD PTR [rbp-4], eax
mov eax, DWORD PTR [rbp-4]
pop rbp
ret
addNumber2(int, int):
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], edi
mov DWORD PTR [rbp-8], esi
mov edx, DWORD PTR [rbp-4]
mov eax, DWORD PTR [rbp-8]
add eax, edx
pop rbp
ret
However, a performance comparison without optimization is meaningless.
In principle there is no difference between the two approaches. The majority of compilers have handled this type of optimisation for some decades.
Additionally, if the function can be inlined (e.g. its definition is visible to the compiler when compiling code that uses such a function) the majority of compilers will eliminate the function altogether, and simply emit code to add the two variables passed and store the result as required by the caller.
Obviously, the comments above assume compiling with a relevant optimisation setting (e.g. not doing a debug build without optimisation).
Personally, I would not write such a function anyway. It is easier, in the caller, to write c = a + b instead of c = addNumber(a, b), so having a function like that offers no benefit to either programmer (effort to understand) or program (performance, etc). You might as well write comments that give no useful information.
c = a + b; // add a and b and store into c
Any self-respecting code reviewer would complain bitterly about uninformative functions or uninformative comments.
I'd only use such a function if its name conveyed some special meaning (i.e. more than just adding two values) for the application:
c = FunkyOperation(a,b);
int FunkyOperation(int a, int b)
{
/* Many useful ways of implementing this operation.
One of those ways happens to be addition, but we need to
go through 25 pages of obscure mathematical proof to
realise that
*/
return a + b;
}
There is a class SomeClass which holds some data and methods that operates on this data. And it must be created with some arguments like:
SomeClass(int some_val, float another_val);
There is another class, say Manager, which includes SomeClass, and heavily uses its methods.
So, which would be better in terms of performance (data locality, cache hits, etc.): declaring the SomeClass object as a member of Manager and using member initialization in Manager's constructor, or holding it through a unique_ptr?
class Manager
{
public:
Manager() : some(5, 3.0f) {}
private:
SomeClass some;
};
or
class Manager
{
public:
Manager();
private:
std::unique_ptr<SomeClass> some;
};
Short answer
Most likely, there is no difference in the runtime efficiency of accessing your subobject, but using a pointer can be slower for several reasons (see details below).
Moreover, there are several other things you should remember:
When using a pointer, you usually have to allocate/deallocate memory for the subobject separately, which takes some time (quite a lot if you do it often).
When using a pointer, you can cheaply move your subobject without copying it.
Speaking of compile times, a pointer is better than a plain member. With a plain member, you cannot remove the dependency of the Manager declaration on the SomeClass declaration. With a pointer, you can, via a forward declaration. Fewer dependencies may result in shorter build times.
Details
I'd like to provide more details about the performance of subobject accesses. I think that using a pointer can be slower than using a plain member for several reasons:
Data locality (and cache performance) is likely to be better with a plain member. You usually access the data of Manager and SomeClass together, and a plain member is guaranteed to be near the other data, while heap allocations may place the object and subobject far from each other.
Using a pointer means one more level of indirection. To get the address of a plain member, you simply add a compile-time constant offset to the object address (which is often merged into another assembly instruction). When using a pointer, you have to additionally read a word from the member pointer to get the actual pointer to the subobject. See Q1 and Q2 for more details.
Aliasing is perhaps the most important issue. If you are using a plain member, then the compiler can assume that your subobject lies fully within your object in memory and that it does not overlap with the other members of your object. When using a pointer, the compiler often cannot assume anything like this: your subobject may overlap with your object and its members. As a result, the compiler has to generate more (useless) load/store operations, because it thinks that some values may have changed.
Here is an example for the last issue (full code is here):
struct IntValue {
int x;
IntValue(int x) : x(x) {}
};
class MyClass_Ptr {
unique_ptr<IntValue> a, b, c;
public:
void Compute() {
a->x += b->x + c->x;
b->x += a->x + c->x;
c->x += a->x + b->x;
}
};
Clearly, it is a bad idea to store the subobjects a, b, c behind pointers here. I've measured the time spent in one billion calls of the Compute method on a single object. Here are the results for different configurations:
2.3 sec: plain member (MinGW 5.1.0)
2.0 sec: plain member (MSVC 2013)
4.3 sec: unique_ptr (MinGW 5.1.0)
9.3 sec: unique_ptr (MSVC 2013)
When looking at the generated assembly for innermost loop in each case, it is easy to understand why the times are so different:
;;; plain member (GCC)
lea edx, [rcx+rax] ; well-optimized code: only additions on registers
add r8d, edx ; all 6 additions present (no CSE optimization)
lea edx, [r8+rax] ; ('lea' instruction is also addition BTW)
add ecx, edx
lea edx, [r8+rcx]
add eax, edx
sub r9d, 1
jne .L3
;;; plain member (MSVC)
add ecx, r8d ; well-optimized code: only additions on registers
add edx, ecx ; 5 additions instead of 6 due to a common subexpression eliminated
add ecx, edx
add r8d, edx
add r8d, ecx
dec r9
jne SHORT $LL6#main
;;; unique_ptr (GCC)
add eax, DWORD PTR [rcx] ; slow code: a lot of memory accesses
add eax, DWORD PTR [rdx] ; each addition loads value from memory
mov DWORD PTR [rdx], eax ; each sum is stored to memory
add eax, DWORD PTR [r8] ; compiler is afraid that some values may be at same address
add eax, DWORD PTR [rcx]
mov DWORD PTR [rcx], eax
add eax, DWORD PTR [rdx]
add eax, DWORD PTR [r8]
sub r9d, 1
mov DWORD PTR [r8], eax
jne .L4
;;; unique_ptr (MSVC)
mov r9, QWORD PTR [rbx] ; awful code: 15 loads, 3 stores
mov rcx, QWORD PTR [rbx+8] ; compiler thinks that values may share
mov rdx, QWORD PTR [rbx+16] ; same address with pointers to values!
mov r8d, DWORD PTR [rcx]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
mov r8, QWORD PTR [rbx+8]
mov rcx, QWORD PTR [rbx] ; load value of 'a' pointer from memory
mov rax, QWORD PTR [rbx+16]
mov edx, DWORD PTR [rcx] ; load value of 'a->x' from memory
add edx, DWORD PTR [rax] ; add the 'c->x' value
add DWORD PTR [r8], edx ; add sum 'a->x + c->x' to 'b->x'
mov r9, QWORD PTR [rbx+16]
mov rax, QWORD PTR [rbx] ; load value of 'a' pointer again =)
mov rdx, QWORD PTR [rbx+8]
mov r8d, DWORD PTR [rax]
add r8d, DWORD PTR [rdx]
add DWORD PTR [r9], r8d
dec rsi
jne SHORT $LL3#main
I was curious to see what the cost is of accessing a data member through a pointer compared with not through a pointer, so I came up with this test:
#include <iostream>
struct X{
int a;
};
int main(){
X* xheap = new X();
std::cin >> xheap->a;
volatile int x = xheap->a;
X xstack;
std::cin >> xstack.a;
volatile int y = xstack.a;
}
the generated x86 is:
int main(){
push rbx
sub rsp,20h
X* xheap = new X();
mov ecx,4
call qword ptr [__imp_operator new (013FCD3158h)]
mov rbx,rax
test rax,rax
je main+1Fh (013FCD125Fh)
xor eax,eax
mov dword ptr [rbx],eax
jmp main+21h (013FCD1261h)
xor ebx,ebx
std::cin >> xheap->a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov rdx,rbx
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int x = xheap->a;
mov eax,dword ptr [rbx]
X xstack;
std::cin >> xstack.a;
mov rcx,qword ptr [__imp_std::cin (013FCD3060h)]
mov dword ptr [x],eax
lea rdx,[xstack]
call qword ptr [__imp_std::basic_istream<char,std::char_traits<char> >::operator>> (013FCD3070h)]
volatile int y = xstack.a;
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
It looks like the non-pointer access takes two instructions, compared to one instruction for the access through a pointer. Could somebody please tell me why this is, and which would take fewer CPU cycles to retrieve the value?
I am trying to understand if pointers do incur more CPU instructions/cycles when accessing data members through them as opposed to non-pointer-access.
That's a terrible test.
The complete assignment to x is this:
mov eax,dword ptr [rbx]
mov dword ptr [x],eax
(the compiler is allowed to re-order the instructions somewhat, and has).
The assignment to y (which the compiler has given the same address as x) is
mov eax,dword ptr [xstack]
mov dword ptr [x],eax
which is almost the same (read memory pointed to by register, write to the stack).
The first one would be more complicated except that the compiler kept xheap in register rbx after the call to new, so it doesn't need to re-load it.
In either case I would be more worried about whether any of those accesses misses the L1 or L2 caches than about the precise instructions. (The processor doesn't even directly execute those instructions, they get converted internally to a different instruction set, and it may execute them in a different order.)
Accessing via a pointer instead of directly accessing from the stack costs you one extra indirection in the worst case (fetching the pointer). This is almost always irrelevant in itself; you need to look at your whole algorithm and how it works with the processor's caches and branch prediction logic.