Does a function local static variable automatically incur a branch? - c++

For example:
int foo()
{
static int i = 0;
return i++;
}
The variable i will only be initialized to 0 the first time foo is called. Does this automatically mean there's a hidden branch in there to keep the initialization from happening more than once? Or are there more clever tricks to avoid this?

Yes, it must incur a branch, and it must also incur at least an atomic operation for safe concurrent initialization. The Standard requires that they are initialized on function entry, in a concurrency-safe way.
The implementation can only dodge this requirement if it can prove that the difference between lazy init and some earlier initialization like before main() is entered is equivalent. For example, simple PODs initialized from constants, the compiler may choose to initialize it earlier like a file-scope global since it's non-observable and saving the lazy initialization code, but that's a non-observable optimization.

Yes, there is a branch. Each time the function is entered, the code must check if the variable has already been initialized. But as will be explained below, you usually do not have to care about this branch.
Example
Check out this code:
#include <iostream>
struct Foo { Foo(){ std::cout << "FOO" << std::endl;} };
void foo(){ static Foo foo; }
int main(){ foo();}
Now, here is the first part of assembly code that gcc4.8 generates for the foo function:
_Z3foov:
.LFB974:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA974
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %r12
pushq %rbx
.cfi_offset 12, -24
.cfi_offset 3, -32
movl $_ZGVZ3foovE3foo, %eax
movzbl (%rax), %eax
testb %al, %al
jne .L7 <------------------- FIRST CHECK
movl $_ZGVZ3foovE3foo, %edi
call __cxa_guard_acquire <------------------- LOCK
testl %eax, %eax
setne %al
testb %al, %al
je .L7 <------------------- SECOND CHECK
movl $0, %r12d
movl $_ZZ3foovE3foo, %edi
A you see, there is a jne! Then, a guard is aquired using __cxa_guard_acquire, followed by a je. Thus, it seems that the compiler is generating the famous double checked locking pattern here.
Will every compiler generate a branch?
I am pretty sure the spec does NOT mandate that a branch or double checked locking must be used. It just mandates that the initialization must be thread safe. However, I do not see a way to perform a thread safe initialization without a branch. Thus, even though the spec does not mandate it, it is simply not possible with current CPU architectures to omit the branch here.
Is the branch expensive?
Considering whether you should care about this branch:
You should definitly NOT care about this branch, since it will be correctly predicted (as it once the object is initialized the branch always takes the same route). Thus, the branch is almost free. Trying to avoid a static local variable for optimization purposes should never yield any observable performance benefit.
Is there really no way around the branch?
If the constructor is not observable, like simply initialization with constant values, then it may be performed eagerly at program startup and the branch is omitted. If, however, it is observable, then things get pretty tricky:
The only possibility I see is stated in the answer of R. Martinho Fernandes (which has been deleted): The code could modify itself. I.e., simply remove the initialization code once the initialization is done. However, this is idea is impractical for the following reasons:
Self-modifying code is very hard to get thread-safe.
Usually, memory flagged executable is write protected so code is not allowed to rewrite itself.
It is just not worth it, as the branch is not expensive (see above).

Related

do constant calculations in #define consume resources?

as far as I'm concerned, constants and definitions in c/c++ do not consume memory and other resources, but that is a different story when we use definitions as macros and put some calculations inside it. take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code, does it spend time for CPU to calculate the SQRT and add operations or this values are calculated in compiler and just replaced in the code?
If it is calculated by CPU, is there anyway to be calculated beforehand in preprocessor and assigned as a constant here?
as far as I'm concerned, constants and definitions in c/c++ do not consume memory and other resources,
It depends on what you mean by that. Constants appearing in expressions that potentially are evaluated at runtime have to be represented in the program somehow. Under most circumstances, that will take some space.
Even (in C++) a constexpr can be evaluated at runtime, even though implementations can and probably do evaluate them at compile time.
but that is a different story when we use definitions as macros and put some calculations inside it.
Yes, it is different, because macros are not constants in any applicable sense of that term.
take a look at the code:
#include "math.h"
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
When we use the variable 'd' in our code,
d is not a variable. It is a macro. Wherever it appears within the scope of those macro definitions, it is exactly equivalent to the expression sqrt(12.2*5.8) appearing at that point.
does it spend time for CPU
to calculate the SQRT and add operations or this values are calculated
in compiler and just replaced in the code?
Either could happen. It depends on your compiler, probably on compilation options, and possibly on other code in the translation unit. Pre-computation is more likely at higher optimization levels.
If it is calculated by CPU,
is there anyway to be calculated beforehand in preprocessor and
assigned as a constant here?
Such a calculation is not part of the semantics of the preprocessor per se. To the somewhat artificial extent that we draw a distinction between preprocessor and compiler in modern C and C++ implementations, if a precomputation is performed then it will be performed by the compiler, not the preprocessor.
The C and C++ languages do not define a mechanism to force such evaluations to be performed at compile time, but you can make it more likely by increasing the compiler's optimization level. Or in C++, there is probably a way to use a template to wrap the expression in a constexpr function computing its value, which would make it very likely to be computed at compile time.
But if a constant is what you want, then you always have the option of pre-computing it manually, and defining the macro to expand to an actual constant.
The C standard doesn't specify whether the calculations required when using d (i.e. sqrt(12.2 * 5.8) is done at compile time or at run time. It's left to the individual compiler to decide.
Example:
#define a 12.2
#define b 5.8
#define c a*b
#define d sqrt(c)
float foo() {
return d;
}
int main(void)
{
printf("%f\n", foo());
}
may result in (using https://godbolt.org/ and gcc 10.2 -O0)
foo:
pushq %rbp
movq %rsp, %rbp
movss .LC0(%rip), %xmm0
popq %rbp
ret
.LC1:
.string "%f\n"
main:
pushq %rbp
movq %rsp, %rbp
movl $0, %eax
call foo
pxor %xmm1, %xmm1
cvtss2sd %xmm0, %xmm1
movq %xmm1, %rax
movq %rax, %xmm0
movl $.LC1, %edi
movl $1, %eax
call printf
movl $0, %eax
popq %rbp
ret
.LC0:
.long 1090950945
and for -O2
foo:
movss .LC0(%rip), %xmm0
ret
.LC2:
.string "%f\n"
main:
subq $8, %rsp
movl $.LC2, %edi
movl $1, %eax
movsd .LC1(%rip), %xmm0
call printf
xorl %eax, %eax
addq $8, %rsp
ret
.LC0:
.long 1090950945
.LC1:
.long 536870912
.long 1075892964
so in this example we see a compile time calculation. In my experience all the major compilers will do that but it's not a requirement made by the C standard.
To understand this, you need to understand how #define works! #define is not a statement which the compiler even receives. It is a preprocessor directive, which means, it is processed by the preprocessor. #define a b literally replaces all occurrences of a with b. So, as many times as you call sqrt(c), that many times, it will be replaced by sqrt(a*b). Now, the standards do not mention whether something like this will be calculated at runtime or at compile time, and it has been left to the individual compiler to decide. Although, usage of optimisation flags will definitely impact the end result. Not to mention, neither of those variables will be type safe!

What's the difference between T, volatile T, and std::atomic<T>?

Given the following sample that intends to wait until another thread stores 42 in a shared variable shared without locks and without waiting for thread termination, why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
int64_t shared = 0;
std::thread thread([&shared]() {
shared = 42;
});
while (shared != 42) {
}
assert(shared == 42);
thread.join();
return 0;
}
With GCC 4.8.5 and default options, the sample works as expected.
The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.
We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0, no optimization / debug-consistency, which is somewhat similar to making every variable volatile, which is the reason the test didn't reveal the code as unsafe.)
To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 based x86_64 assembly corresponding to the un-optimized working binary:
// Thread B:
// shared = 42;
movq -8(%rbp), %rax
movq (%rax), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L11:
movq -32(%rbp), %rax # Check shared every iteration
cmpq $42, %rax
jne .L11
Thread B executes a simple store of the value 42 in shared.
Thread A reads shared for each loop iteration until the comparison indicates equality.
Now, we compare that to the -O3 outcome:
// Thread B:
// shared = 42;
movq 8(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
cmpq $42, (%rsp) # check shared once
je .L87 # and skip the infinite loop or not
.L88:
jmp .L88 # infinite loop
.L87:
Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out. (Unlike C, infinite loops with no side-effects or volatile accesses are undefined behaviour in C++.)
The problem is that the compiler and its optimizer are not aware of the implementation's concurrency implications. Consequently, the conclusion needs to be that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)
The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from optimizing volatile accesses in terms of register-based caching. This is essential for low-level code that interacts with hardware and has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
With volatile int64_t shared, the generated instructions change as follows:
// Thread B:
// shared = 42;
movq 24(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L87:
movq 8(%rsp), %rax
cmpq $42, %rax
jne .L87
The loop cannot be eliminated anymore as it must be assumed that shared changed even though there's no evidence of that in the form of code. As a result, the sample now works with -O3.
If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant for lock-free code are what makes std::atomic essential: memory operation atomicity and memory order.
To build the case for load/store atomicity, we review the generated assembly compiled with GCC4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:
// Thread B:
// shared = 42;
movl 4(%esp), %eax
movl 12(%eax), %eax
movl $42, (%eax)
movl $0, 4(%eax)
// Thread A:
// while (shared != 42) {
// }
.L88: # do {
movl 40(%esp), %eax
movl 44(%esp), %edx
xorl $42, %eax
movl %eax, %ecx
orl %edx, %ecx
jne .L88 # } while(shared ^ 42 != 0);
For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, this means that another thread can see a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems 100 percent of the time, but can occur at random and the probability of occurrence is heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
std::atomic<int64_t> shared;
std::thread thread([&shared]() {
shared.store(42, std::memory_order_relaxed);
});
while (shared.load(std::memory_order_relaxed) != 42) {
}
assert(shared.load(std::memory_order_relaxed) == 42);
thread.join();
return 0;
}
The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):
// Thread B:
// shared.store(42, std::memory_order_relaxed);
movl $42, %ecx
xorl %ebx, %ebx
subl $8, %esp
movl 16(%esp), %eax
movl 4(%eax), %eax # function arg: pointer to shared
movl %ecx, (%esp)
movl %ebx, 4(%esp)
movq (%esp), %xmm0 # 8-byte reload
movq %xmm0, (%eax) # 8-byte store to shared
addl $8, %esp
// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9: # do {
movq -16(%ebp), %xmm1 # 8-byte load from shared
movq %xmm1, -32(%ebp) # copy to a dummy temporary
movl -32(%ebp), %edx
movl -28(%ebp), %ecx # and scalar reload
movl %edx, %eax
movl %ecx, %edx
xorl $42, %eax
orl %eax, %edx
jne .L9 # } while(shared.load() ^ 42 != 0);
To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.
By using std::atomic in the sample, it is guaranteed that
std::atomic loads and stores are not subject to register-based caching
std::atomic loads and stores do not allow partial values to be observed
The C++ standard doesn't talk about registers at all, but it does say:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
While that leaves room for interpretation, caching std::atomic loads across iterations, like triggered in our sample (without volatile or atomic) would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one block, like 2 accesses in the same iteration.
On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.
atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.
A large T (e.g. the size of 2 registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking), which are sometimes provided with larger size than the largest efficient pure-load / pure-store that's guaranteed atomic. (e.g. on x86-64, lock cmpxchg16, or ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for 8-byte pure-load or pure-store.
What's the second aspect mentioned above and what does std::memory_order_relaxed mean?
Both, the compiler and CPU can reorder memory operations to optimize efficiency. The primary constraint of reordering is that all loads and stores must appear to have been executed in the order given by the code (program order). Therefore, in case of inter-thread communication, the memory order must be take into account to establish the required order despite reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay in the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, using locks, all the aspects raised here are addressed simply by using the locking facility. As soon as you break out of the comfort locks provide, you have to be mindful of the consequences and the factors that affect concurrency correctness.
Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.
If you do not use explicit sharing construct, like those you mention, it is undefined when main() will see shared having a value of 42: please see "Optimisations and reordering" below. Even if your test does not see a problem: please check out "About your test" below!
In multi-threading, a test that gives the "right" answer is (almost) never a proof of correctness.
A "successful" test is at most anecdotal evidence There is simply too much to take into account, like:
The memory model: what is guaranteed and, more likely: what not!
Optimisations by compiler and CPU
Scheduling. For example, thread can terminate anywhere between just before the while loop and inside the thread.join() function.
Run-time stuff like how many other threads and programs are running, how heavily the memory is used etc. This is both hardware and operating system dependent.
More things I forgot …
The only things you CAN trust, are the guarantees that your language's memory model gives.
Fortunately, C++ has a memory model since C++11!
Unfortunately, that model does not give much guarantees. The compiler can generate code that is allowed to do anything, for as long as the semantics of the program do not change as seen from a single-treaded perspective. That includes omitting code, postponing code or changing the order in which things happen. The only exceptions are when you make guaranteed progress, or when you use use explicit sharing constructs, like those you mentioned.
Debugging a multi-threaded situation is also extremely hard. Adding "debug code" to debug your program often changes its behaviour. For example, writing something to the standard output does I/O, which ensures progress. That can cause values to be visible by other threads, where that normally would not be the case!
Make sure you find out what the constructs you mention like atomics, volatile and mutexes do. That way, you can build programs that behave perfectly predictable in multi-threaded circumstances.
About your test
For the fun of it, let's explore some interesting cases surrounding your test program.
Thread scheduling
The operating system decides when threads run and terminate.
It is perfectly acceptable that thread is already terminated even before the while loop in main() is executed. Because thread termination is progress, shared might end up where main() can see it, before the while loop. In that case, the test seems successful. But if the scheduling is any different, the test might fail. You should never rely on scheduling.
Hence, even if your test does not see a problem, that is at most anecdotal evidence.
Optimisations and reordering
As #horsts excellent answer already indicates, the compiler and the CPU can optimise you code. Anything is allowed, as long as the program semantics do not change from the perspective of a single thread.
Imagine that you assign to a variable that you never read again in that thread (like you do in thread). The compiler can postpone the actual assignment for as much as it wants, as there is nothing that depends on the value of shared in that thread for as far as the compiler can see. You must have guaranteed progress in your thread to ensure that actual assignment. In your example, this progress is only guaranteed when thread is terminated: likely at the end of the thread-function. Then again: you have no idea when the thread is scheduled to call your function.
Using constructs like atomic<> and volatile force the compiler to generate code that does ensure predictable behaviour. If you know how to use them, you can make programs that can be shown to behave correctly in multi-threaded circumstances.

Performance of a string operation with strlen vs stop on zero

While I was writing a class for strings in C ++, I found a strange behavior regarding the speed of execution.
I'll take as an example the following two implementations of the upper method:
class String {
char* str;
...
forceinline void upperStrlen();
forceinline void upperPtr();
};
void String::upperStrlen()
{
INDEX length = strlen(str);
for (INDEX i = 0; i < length; i++) {
str[i] = toupper(str[i]);
}
}
void String::upperPtr()
{
char* ptr_char = str;
for (; *ptr_char != '\0'; ptr_char++) {
*ptr_char = toupper(*ptr_char);
}
}
INDEX is simple a typedef of uint_fast32_t.
Now I can test the speed of those methods in my main.cpp:
#define TEST_RECURSIVE(_function) \
{ \
bool ok = true; \
clock_t before = clock(); \
for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) { \
if (!(_function()) && ok) \
ok = false; \
} \
char output[TEST_RECURSIVE_OUTPUT_STR]; \
sprintf(output, "[%s] Test %s %s: %ld ms\n", \
ok ? "OK" : "Failed", \
TEST_RECURSIVE_BUILD_TYPE, \
#_function, \
(clock() - before) * 1000 / CLOCKS_PER_SEC); \
fprintf(stdout, output); \
fprintf(file_log, output); \
}
String a;
String b;
bool stringUpperStrlen()
{
a.upperStrlen();
return true;
}
bool stringUpperPtr()
{
b.upperPtr();
return true;
}
int main(int argc, char** argv) {
...
a = "Hello World!";
b = "Hello World!";
TEST_RECURSIVE(stringUpperPtr);
TEST_RECURSIVE(stringUpperStrlen);
...
return 0;
}
Then I can compile and test with cmake in Debug or Release with the following results.
[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms
[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms
So in Debug the behavior is what I expected, the pointer is faster than strlen, but in Release strlen is faster.
So I took the GCC assembly and the number of instructions is much less in the stringUpperPtr than in stringUpperStrlen.
The stringUpperStrlen assembly:
_Z17stringUpperStrlenv:
.LFB72:
.cfi_startproc
pushq %r13
.cfi_def_cfa_offset 16
.cfi_offset 13, -16
xorl %eax, %eax
pushq %r12
.cfi_def_cfa_offset 24
.cfi_offset 12, -24
pushq %rbp
.cfi_def_cfa_offset 32
.cfi_offset 6, -32
xorl %ebp, %ebp
pushq %rbx
.cfi_def_cfa_offset 40
.cfi_offset 3, -40
pushq %rcx
.cfi_def_cfa_offset 48
orq $-1, %rcx
movq a#GOTPCREL(%rip), %r13
movq 0(%r13), %rdi
repnz scasb
movq %rcx, %rdx
notq %rdx
leaq -1(%rdx), %rbx
.L4:
cmpq %rbp, %rbx
je .L3
movq 0(%r13), %r12
addq %rbp, %r12
movsbl (%r12), %edi
incq %rbp
call toupper#PLT
movb %al, (%r12)
jmp .L4
.L3:
popq %rdx
.cfi_def_cfa_offset 40
popq %rbx
.cfi_def_cfa_offset 32
popq %rbp
.cfi_def_cfa_offset 24
popq %r12
.cfi_def_cfa_offset 16
movb $1, %al
popq %r13
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE72:
.size _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
.globl _Z14stringUpperPtrv
.type _Z14stringUpperPtrv, #function
The stringUpperPtr assembly:
_Z14stringUpperPtrv:
.LFB73:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq b#GOTPCREL(%rip), %rax
movq (%rax), %rbx
.L9:
movsbl (%rbx), %edi
testb %dil, %dil
je .L8
call toupper#PLT
movb %al, (%rbx)
incq %rbx
jmp .L9
.L8:
movb $1, %al
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE73:
.size _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
.section .rodata.str1.1,"aMS",#progbits,1
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
So how do you explain this difference in performance?
Thanks in advance.
EDIT:
CMake generate something like this command to compile:
/bin/g++-8 -Os -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o xpp-tests libxpp.so
/bin/g++-8 -O3 -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o Release/xpp-tests Release/libxpp.so
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16
# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG -Wall -pipe -fPIC -march=native -fno-strict-aliasing
CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1
The define TEST_RECURSIVE will call _function 1000000 times in my examples.
You have several misconceptions about performance. You need to dispel these misconceptions.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
No, absolutely not.
First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x; take on a PC?
If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0.
If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software- triggered).
It's even possible that x is currently in swap, so that reading it requires reading from a disk.
And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
je .L3: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi: Read the current character from the string in memory.
incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper#PLT
movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
jmp .L4: Unconditional jump to the beginning of the loop.
For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:
movsbl (%rbx), %edi: read from the address containing the current.
testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
je .L8: conditional jump, to exit the loop if the character is zero.
call toupper#PLT
movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, the stringUpperStrlen does need to have a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters in memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string was longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

Visual C++ optimization options - how to improve the code output?

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard.
Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
std::vector<int> v = {1, 2, 3, 4};
int sum = 0;
for(int i = 0; i < v.size(); i++) {
sum += v[i];
}
return sum;
}
Clang's output with libc++ is quite compact:
main: # #main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess.
Am I missing something here or is VS really this bad?
Compiler explorer link:
https://godbolt.org/g/GJYHjE
Unfortunately, it's difficult to greatly improve Visual C++ output in this case, even by using more aggressive optimization flags. There are several factors contributing to VS inefficiency, including lack of certain compiler optimizations, and the structure of Microsoft's implementation of <vector>.
Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, when compared to VS, Clang is able to perform a very effective Constant propagation, Function Inlining (and consequently, Dead Code Elimination), and New/delete optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in the data memory, and in the for loop, will load one value at one at a time, starting from the low address in which 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm#00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4#main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4#main
Clang, however, is very clever to realize that the sum could be calculated in advance. My best guess is that it replaces the loading of the constants from memory to constant mov operations into registers (propagates the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this - neither VS or GCC were able to precalculate the vector accumulation result in advance.
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete on certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (N3664 standard paper).
This has already generated much discussion on SO:
clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
Now, when inspecting the main code generated by VS, we can find there two function calls (with the rest of vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector, and the second for deallocating it, and practically all other functions in the VS output are pulled in by these functions calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that indeed, this optimization is not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation which is very differently structured than libc++ and stdlibc++, and that seems to have a large contribution to VS inferior code generation. While VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and to perform less efficiently than stdlibc++.
For isolating the impact of the vector STL implementation, an interesting experiment is to use Clang for compilation, combined with the VS header files. Indeed, using Clang 5.0.0 with Visual Studio 2015 headers, results in the following code generation - clearly, the STL implementation has a huge impact!
main: # #main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, #unwind, #except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2#YAPEAX_K#Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy#?$vector#HV?$allocator#H#std###std##IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector#HV?$allocator#H#std###std##QEAA#XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)#IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi page assembly listing, with many function calls, so even if code generation improved, it is difficult to notice.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # #main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2#YAPEAX_K#Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3#YAXPEAX#Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code, plus, removing of function calls is beneficial in reducing call overhead and removing of optimization barriers.
When inspecting the generated assembly for VS, and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which are the result of inlining of some higher level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function, which has the same functionality as memmove, but as opposed to the plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code could be further optimized inside the calling function and eventually removed as dead code.
Summary
To conclude, Clang is clearly superior than VS in this example, both thanks to high quality optimizations, and more efficient vector STL implementation. When using the same header files for Visual C++ and clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help not to think, what would we do without Compiler Explorer? Thanks Matt Godbolt for this amazing tool!

Writing non-class types onto raw memory

Given a void pointer to a "blob" of raw memory, there are two ways of writing something onto it.
The first way is to use placement new. This method has the advantage of calling the ctor automagically when we are dealing with class-types. However, when I deal with non-class types, would it be better to do a cast instead? I imagine it could possibly be faster.
(pLocation is a void pointer to a blob of memory
// ----- Is this better -----
*reinterpret_cast<char*>(pLocation) = pattern;
// ----- Or is this better -----
::new(pLocation) char(pattern);
I had a look at the generated assembly for each of these techniques, using the following program:
#include <new>
char blob[128];
int main() {
void *pLocation = blob;
char pattern = 'x';
#ifdef CAST
*reinterpret_cast<char*>(pLocation) = pattern;
#else
::new(pLocation) char(pattern);
#endif
}
I'm using g++ 4.4.3 on Linux 64-bits with default compiler flags.
The relevant part of the asm for placement new:
movb $120, -1(%rbp)
movq -16(%rbp), %rax
movq %rax, %rsi
movl $1, %edi
call _ZnwmPv
movq %rax, %rdx
testq %rdx, %rdx
je .L5
movzbl -1(%rbp), %edx
movb %dl, (%rax)
.L5:
From what I gather, this actually calls the placement new operator, and checks its return value, even though it always succeeds. It then proceeds to write the value of x into the returned memory.
And for the reinterpret_cast:
movb $120, -1(%rbp)
movq -16(%rbp), %rax
movzbl -1(%rbp), %edx
movb %dl, (%rax)
Note that these instructions are identical to the first two and the last two of the placement new version.
Using -O1, both pieces of code generate identical assembly:
movb $120, blob(%rip)
So, if you're worried about performance, don't be. Any other sane compiler will probably reduce both to the same code as well.
While casting raw memory into objects might work in practice, officially it invokes undefined behavior and as a result of that, according to the C++ standard, your code might do anything.
Placement new, OTOH, is a technique to invoke a constructor at a particular address, and construction is what officially turns raw memory into valid objects. That's why I would prefer placement new.
Just to make sure, I would also have the destructor for such objects is called. While you say that you only need this for PODs and PODs' destruction is a no-op, many bugs I have seen in my carrier were in code that was written with a set of restrictions in mind, but had later some of the restrictions lifted and suddenly found itself in an environment with which it was unable to cope.
Also note that there might be platforms out there for which not all possible bit patterns a valid values even for a built-in type. Such platforms might also trap access to values of such pattern. For example, it could be that an all-zero bit pattern is not a valid value for a floating type, so even zeroing the memory before-hand could not prevent a hardware exception.