Is std::atomic_compare_exchange_weak thread-unsafe by design? - c++

It was brought up on the cppreference atomic_compare_exchange Talk page that the existing implementations of std::atomic_compare_exchange_weak compute the boolean result of the CAS with a non-atomic compare instruction, e.g.
lock
cmpxchgq %rcx, (%rsp)
cmpq %rdx, %rax
which (Edit: apologies for the red herring) would
break CAS loops such as Concurrency in Action's listing 7.2:
while (!head.compare_exchange_weak(new_node->next, new_node));
The specification (29.6.5[atomics.types.operations.req]/21-22) seems to imply that the result of the comparison must be a part of the atomic operation:
Effects: atomically compares ...
Returns: the result of the comparison
but is it actually implementable? Should we file bug reports to the vendors or to the LWG?

TL;DR: atomic_compare_exchange_weak is safe by design, but actual implementations are buggy.
Here's the code that Clang actually generates for this little snippet:
struct node {
    int data;
    node* next;
};
std::atomic<node*> head;
void push(int data) {
    node* new_node = new node{data};
    new_node->next = head.load(std::memory_order_relaxed);
    while (!head.compare_exchange_weak(new_node->next, new_node,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {}
}
Result:
movl %edi, %ebx
# Allocate memory
movl $16, %edi
callq _Znwm
movq %rax, %rcx
# Initialize with data and 0
movl %ebx, (%rcx)
movq $0, 8(%rcx) ; dead store, should have been optimized away
# Overwrite next with head.load
movq head(%rip), %rdx
movq %rdx, 8(%rcx)
.align 16, 0x90
.LBB0_1: # %while.cond
# =>This Inner Loop Header: Depth=1
# put value of head into comparand/result position
movq %rdx, %rax
# atomic operation here, compares second argument to %rax, stores first argument
# in second if same, and second in %rax otherwise
lock
cmpxchgq %rcx, head(%rip)
# unconditionally write old value back to next - wait, what?
movq %rax, 8(%rcx)
# check if cmpxchg modified the result position
cmpq %rdx, %rax
movq %rax, %rdx
jne .LBB0_1
The comparison is perfectly safe: it's just comparing registers. However, the whole operation is not safe.
The critical point is this: the description of compare_exchange_(weak|strong) says:
Atomically [...] if true, replaces the contents of the memory pointed to by this with that in desired, and if false, updates the contents of the memory in expected with the contents of the memory pointed to by this
Or in pseudo-code:
if (*this == expected)
    *this = desired;
else
    expected = *this;
Note that expected is only written to if the comparison is false, and *this is only written to if the comparison is true. The abstract model of C++ does not allow an execution where both are written to. This is important for the correctness of push above: if the write to head happens, the node that new_node points to suddenly becomes visible to other threads, which means other threads can start reading next (by accessing head->next), and if the write to expected (which aliases new_node->next) also happens, that's a race.
And Clang writes to new_node->next unconditionally. In the case where the comparison is true, that's an invented write.
This is a bug in Clang. I don't know whether GCC does the same thing.
In addition, the wording of the standard is suboptimal. It claims that the entire operation must happen atomically, but this is impossible, because expected is not an atomic object; writes to it cannot be atomic. What the standard should say is that the comparison and the write to *this happen atomically, but the write to expected does not. That isn't so bad, because no one really expects that write to be atomic anyway.
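In pseudo-code, the implementable guarantee would look like this (a sketch of the intent, not normative wording; "atomically" marks the atomic region):
bool compare_exchange_weak(T& expected, T desired) {
    T old;
    bool success;
    atomically {                       // the comparison and the store to *this
        old = *this;
        success = (old == expected);   // the weak form may also fail spuriously
        if (success)
            *this = desired;
    }
    if (!success)
        expected = old;                // non-atomic, and only on failure
    return success;
}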
So there should be a bug report for Clang (and possibly GCC), and a defect report for the standard.

I was the one who originally found this bug. For the last few days I have been e-mailing Anthony Williams regarding this issue and vendor implementations. I didn't realize Cubbi had raised a Stack Overflow question. It's not just Clang or GCC; every vendor that matters is broken. Anthony Williams, also the author of Just::Thread (a C++11 thread and atomics library), confirmed his library is implemented correctly (the only known correct implementation).
Anthony has raised a GCC bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272
Simple example:
#include <atomic>
struct Node { Node* next; };
void Push(std::atomic<Node*>& head, Node* node)
{
    node->next = head.load();
    while (!head.compare_exchange_weak(node->next, node))
        ;
}
g++ 4.8 [assembler]
mov rdx, rdi
mov rax, QWORD PTR [rdi]
mov QWORD PTR [rsi], rax
.L3:
mov rax, QWORD PTR [rsi]
lock cmpxchg QWORD PTR [rdx], rsi
mov QWORD PTR [rsi], rax !!!!!!!!!!!!!!!!!!!!!!!
jne .L3
rep; ret
clang 3.3 [assembler]
movq (%rdi), %rcx
movq %rcx, (%rsi)
.LBB0_1:
movq %rcx, %rax
lock
cmpxchgq %rsi, (%rdi)
movq %rax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpq %rcx, %rax !!!!!!!!!!!!!!!!!!!!!!!
movq %rax, %rcx
jne .LBB0_1
ret
icc 13.0.1 [assembler]
movl %edx, %ecx
movl (%rsi), %r8d
movl %r8d, %eax
lock
cmpxchg %ecx, (%rdi)
movl %eax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpl %eax, %r8d !!!!!!!!!!!!!!!!!!!!!!!
je ..B1.7
..B1.4:
movl %edx, %ecx
movl %eax, %r8d
lock
cmpxchg %ecx, (%rdi)
movl %eax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpl %eax, %r8d !!!!!!!!!!!!!!!!!!!!!!!
jne ..B1.4
..B1.7:
ret
Visual Studio 2012 [No need to check assembler, MS uses _InterlockedCompareExchange !!!]
inline int _Compare_exchange_seq_cst_4(volatile _Uint4_t *_Tgt, _Uint4_t *_Exp, _Uint4_t _Value)
{   /* compare and exchange values atomically with
       sequentially consistent memory order */
    int _Res;
    _Uint4_t _Prev = _InterlockedCompareExchange((volatile long *)_Tgt, _Value, *_Exp);
    if (_Prev == *_Exp)   !!!!!!!!!!!!!!!!!!!!!!!
        _Res = 1;
    else
    {   /* copy old value */
        _Res = 0;
        *_Exp = _Prev;
    }
    return (_Res);
}

[...]
break CAS loops such as Concurrency in Action's listing 7.2:
while (!head.compare_exchange_weak(new_node->next, new_node));
The specification (29.6.5 [atomics.types.operations.req]/21-22) seems to imply that the result of the comparison must be a part of the atomic operation:
[...]
The issue with this code and the specification is not whether the atomicity of compare_exchange needs to extend beyond just the comparison and exchange itself to returning the result of the comparison or assigning to the expected parameter. That is, the code may still be correct without the store to expected being atomic.
What causes the above code to be potentially racy is that implementations write to the expected parameter even after a successful exchange, whose effects may already have been observed by other threads. The code is written with the expectation that when the exchange is successful there is no write to expected that could produce a race.
The spec, as written, does appear to guarantee this expected behavior. (And indeed can be read as making the much stronger guarantee you describe, that the entire operation is atomic.) According to the spec, compare_exchange_weak:
Atomically, compares the contents of the memory pointed to by object or by this for equality with that in expected, and if true, replaces the contents of the memory pointed to by object or by this with that in desired, and if false, updates the contents of the memory in expected with the contents of the memory pointed to by object or by this. [n4140 § 29.6.5 / 21] (N.B. The wording is unchanged between C++11 and C++14)
The problem is that it seems as though the actual language of the standard is stronger than the original intent of the proposal. Herb Sutter is saying that Concurrency in Action's usage was never really intended to be supported, and that updating expected was only intended to be done on local variables.
I don't see any current defect report on this. [See second update below] If in fact this language is stronger than intended then presumably one will get filed. Either C++11's wording will be updated to guarantee the above code's expected behavior, thus making current implementations non-conformant, or the new wording will not guarantee this behavior, making the above code potentially result in undefined behavior. In that case I guess Anthony's book will need updating. What the committee will do about this, and whether or not actual implementations conform to the original intent (rather than the actual wording of the spec) is still an open question. [See update below]
For the purposes of writing code in the meantime, you'll have to take into account the actual behavior of your implementation, whether it's conformant or not. Existing implementations may be 'buggy' in the sense that they don't implement the exact wording of the ISO spec, but they do operate as their implementers intended and they can be used to write thread safe code. [See update below]
So to answer your questions directly:
but is it actually implementable?
I believe that the actual wording of the spec is not reasonably implementable (and that the actual wording makes guarantees even stronger than Anthony's just::thread library provides; for example, the actual wording appears to require atomic operations on a non-atomic object). Anthony's slightly weaker interpretation, that the assignment to expected need not be atomic but must be conditioned on the failure of the exchange, is obviously implementable. Herb's even weaker interpretation is also obviously implementable, as that's what most libraries actually implement. [See update below]
Is std::atomic_compare_exchange_weak thread-unsafe by design?
The operation is not thread unsafe no matter whether the operation makes guarantees as strong as the actual wording of the spec or as weak as Herb Sutter indicates. It's simply that correct, thread safe usage of the operation depends on what is guaranteed. The example code from Concurrency in Action is an unsafe usage of a compare_exchange that only offers Herb's weak guarantee, but it could be written to work correctly with Herb's implementation. That could be done like so:
node* expected_head = head.load();
new_node->next = expected_head;
while (!head.compare_exchange_weak(expected_head, new_node)) {
    new_node->next = expected_head;
}
With this change the 'spurious' writes to expected are simply made to a local variable, and no longer produce any races. The writes to new_node->next now happen only while new_node is not yet visible to any other thread (before the first attempt, or after a failed exchange), so it may be safely updated. This code sample is safe both under current implementations and under stronger guarantees, so it should be future proof to any updates to C++11's atomics that resolve this issue.
Update:
Actual implementations (MSVC, gcc, and clang at least) have been updated to offer the guarantees under Anthony Williams' interpretation; that is, they have stopped inventing writes to expected in the case that the exchange succeeds.
https://llvm.org/bugs/show_bug.cgi?id=18899
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272
https://connect.microsoft.com/VisualStudio/feedback/details/819819/std-atomic-compare-exchange-weak-has-spurious-write-which-can-cause-race-conditions
Update 2:
A defect report on this issue has been filed with the C++ committee. From the currently proposed resolution, the committee does want to make stronger guarantees than provided by the implementations you checked (but not as strong as the current wording, which appears to guarantee atomic operations on non-atomic objects). The draft for the next C++ standard (C++1z, or 'C++17') has not yet adopted the improved wording.
Update 3: C++17 adopted the proposed resolution.

Those people don't seem to understand either the standard or the instructions.
First of all, std::atomic_compare_exchange_weak is not thread-unsafe by design. That is complete nonsense.
The design very clearly defines what the function does and which guarantees (including atomicity and memory ordering) it must provide.
Whether your program that uses this function is thread-safe as a whole is a different matter, but the function's semantics per se are certainly correct in the sense of an atomic compare-exchange (you can still write thread-unsafe code using any available thread-safe primitive, but that is a totally different story).
This particular function implements the "weak" version of a thread-safe compare-exchange operation, which differs from the "non-weak" version in that the implementation is allowed to generate code that may fail spuriously if that gives a performance benefit (irrelevant on x86). Weak does not mean it's worse; it only means that it is allowed to fail more often on some platforms if that gives an overall performance benefit.
The implementation is of course still required to work correctly. That is, if the compare-exchange fails -- whether by concurrency or spuriously -- it must be correctly reported back as having failed.
Second, the code generated by existing implementations has no bearing on the correctness or thread-safety of std::atomic_compare_exchange_weak. At best, if the generated instructions do not work correctly, this is an implementation issue, but it has nothing to do with the language construct. The language standard defines what behavior an implementation must provide; it is not responsible for implementations actually doing it correctly.
Third, there is no problem in the generated code. The x86 CMPXCHG instruction has a well-defined mode of operation. It compares the actual value with the expected value, and if the comparison is successful, it performs the swap. You know whether or not the operation was successful either by looking at EAX (or RAX in x64) or by the state of ZF.
What matters is that the atomic compare-exchange is atomic, and that's the case. Whatever you do with the result afterwards needs not be atomic (in your case, the CMP), since the state does not change any more. Either the swap was successful at that point, or it has failed. In either case, it's already "history".
std::atomic_compare_exchange_weak has different semantics than the underlying instruction, it returns a bool value. Therefore, you cannot always expect a 1:1 mapping to instructions. The compiler may have to generate additional instructions (and different ones depending on how you consume the result) to implement these semantics, but it really makes no difference for correctness.
The only thing one could arguably complain about is the fact that instead of directly using the already present state of ZF (with a Jcc or CMOVcc), it performs another comparison. But this is a performance issue (1 cycle wasted), not a correctness issue.
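For completeness: gcc 6 and later expose flag outputs in inline asm, so a hand-written version can consume ZF directly and skip that extra comparison. A hedged sketch, illustrative only, not what any shipping library emits:
bool cas_weak(long* mem, long* expected, long desired) {
    long exp = *expected;   // local copy, safe to overwrite unconditionally
    bool ok;
    asm volatile("lock cmpxchgq %3, %0"
                 : "+m"(*mem), "+a"(exp), "=@ccz"(ok)   // ok = ZF straight from cmpxchg
                 : "r"(desired)
                 : "memory");
    if (!ok)
        *expected = exp;    // write back only on failure
    return ok;
}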

Quoting Duncan Forster from the linked page:
The important thing to remember is that the hardware implementation of CAS only returns 1 value (the old value) not two (old plus boolean)
So there's one instruction - the (atomic) CAS - which actually operates on memory, and then another instruction to convert the (atomically-assigned) result into the expected boolean.
Since the value in %rax was set atomically and can't then be affected by another thread, there is no race here.
The quote is false anyway, since ZF is also set depending on the CAS result (i.e., it does return both the old value and the boolean). The fact that the flag isn't used might be a missed optimisation, or the cmpq might be faster, but it doesn't affect correctness.
For reference, consider decomposing compare_exchange_weak like this pseudocode:
T compare_exchange_weak_value(atomic<T> *obj, T *expected, T desired) {
    // setup ...
    lock cmpxchgq %rcx, (%rsp)   // actual CAS
    return %rax;                 // actual destination value
}
bool compare_exchange_weak_bool(atomic<T> *obj, T *expected, T desired) {
    // CAS is atomic
    T actual = compare_exchange_weak_value(obj, expected, desired);
    // now we figure out if it worked
    return actual == *expected;
}
Do you agree the CAS is properly atomic?
If the unconditional store to expected is really what you wanted to ask about (instead of the perfectly safe comparison), I agree with Sebastian that it's a bug.
For reference, you can work around it by forcing the unconditional store into a local, and making the potentially-visible store conditional again:
struct node {
    int data;
    node* next;
};
std::atomic<node*> head;
void push(int data) {
    node* new_node = new node{data};
    node* cur_head = head.load(std::memory_order_relaxed);
    do {
        new_node->next = cur_head;
    } while (!head.compare_exchange_weak(cur_head, new_node,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}


What's the difference between T, volatile T, and std::atomic<T>?

Given the following sample that intends to wait until another thread stores 42 in a shared variable shared without locks and without waiting for thread termination, why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
    int64_t shared = 0;
    std::thread thread([&shared]() {
        shared = 42;
    });
    while (shared != 42) {
    }
    assert(shared == 42);
    thread.join();
    return 0;
}
With GCC 4.8.5 and default options, the sample works as expected.
The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.
We can start by compiling the sample with -O3. Now the sample hangs indefinitely. (The default is -O0: no optimization and debug consistency, which is somewhat similar to making every variable volatile; that is why the test didn't reveal the code as unsafe.)
To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 based x86_64 assembly corresponding to the un-optimized working binary:
// Thread B:
// shared = 42;
movq -8(%rbp), %rax
movq (%rax), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L11:
movq -32(%rbp), %rax # Check shared every iteration
cmpq $42, %rax
jne .L11
Thread B executes a simple store of the value 42 in shared.
Thread A reads shared for each loop iteration until the comparison indicates equality.
Now, we compare that to the -O3 outcome:
// Thread B:
// shared = 42;
movq 8(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
cmpq $42, (%rsp) # check shared once
je .L87 # and skip the infinite loop or not
.L88:
jmp .L88 # infinite loop
.L87:
Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out. (Unlike C, infinite loops with no side-effects or volatile accesses are undefined behaviour in C++.)
The problem is that the compiler and its optimizer are not aware of the implementation's concurrency implications. Consequently, the conclusion needs to be that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)
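Roughly, the -O3 transformation amounts to the following (a sketch for illustration, not the compiler's literal output):
// original loop:
// while (shared != 42) { }
// after hoisting the non-atomic load out of the loop:
int64_t tmp = shared;   // single read; no concurrent writer is assumed to exist
while (tmp != 42) {     // condition is now loop-invariant:
}                       // the loop runs zero times or forever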
The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers, and its guarantees, if any, are compiler-specific, the general consensus is that volatile prevents the compiler from caching volatile accesses in registers. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
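The only change to the sample is the declaration; the rest of the code stays as before:
volatile int64_t shared = 0;   // every access must now go to memory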
With volatile int64_t shared, the generated instructions change as follows:
// Thread B:
// shared = 42;
movq 24(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L87:
movq 8(%rsp), %rax
cmpq $42, %rax
jne .L87
The loop cannot be eliminated anymore as it must be assumed that shared changed even though there's no evidence of that in the form of code. As a result, the sample now works with -O3.
If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant to lock-free code make std::atomic essential: atomicity of memory operations and memory order.
To build the case for load/store atomicity, we review the generated assembly compiled with GCC4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:
// Thread B:
// shared = 42;
movl 4(%esp), %eax
movl 12(%eax), %eax
movl $42, (%eax)
movl $0, 4(%eax)
// Thread A:
// while (shared != 42) {
// }
.L88: # do {
movl 40(%esp), %eax
movl 44(%esp), %edx
xorl $42, %eax
movl %eax, %ecx
orl %edx, %ecx
jne .L88 # } while(shared ^ 42 != 0);
For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can see a partial result of the 64-bit memory operation, leaving room for unexpected inconsistencies that might not cause problems all the time, but can occur at random, with a probability of occurrence heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
    std::atomic<int64_t> shared;
    std::thread thread([&shared]() {
        shared.store(42, std::memory_order_relaxed);
    });
    while (shared.load(std::memory_order_relaxed) != 42) {
    }
    assert(shared.load(std::memory_order_relaxed) == 42);
    thread.join();
    return 0;
}
The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):
// Thread B:
// shared.store(42, std::memory_order_relaxed);
movl $42, %ecx
xorl %ebx, %ebx
subl $8, %esp
movl 16(%esp), %eax
movl 4(%eax), %eax # function arg: pointer to shared
movl %ecx, (%esp)
movl %ebx, 4(%esp)
movq (%esp), %xmm0 # 8-byte reload
movq %xmm0, (%eax) # 8-byte store to shared
addl $8, %esp
// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9: # do {
movq -16(%ebp), %xmm1 # 8-byte load from shared
movq %xmm1, -32(%ebp) # copy to a dummy temporary
movl -32(%ebp), %edx
movl -28(%ebp), %ecx # and scalar reload
movl %edx, %eax
movl %ecx, %edx
xorl $42, %eax
orl %eax, %edx
jne .L9 # } while(shared.load() ^ 42 != 0);
To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.
By using std::atomic in the sample, it is guaranteed that
std::atomic loads and stores are not subject to register-based caching
std::atomic loads and stores do not allow partial values to be observed
The C++ standard doesn't talk about registers at all, but it does say:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
While that leaves room for interpretation, caching std::atomic loads across iterations, as triggered in our sample (without volatile or atomic), would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one block, such as two accesses in the same iteration.
On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.
atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.
A large T (e.g. the size of 2 registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking), which is sometimes provided at a larger size than the largest efficient pure-load / pure-store that's guaranteed atomic (e.g. on x86-64, lock cmpxchg16b, or an ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for 8-byte pure-load or pure-store.
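Whether a given specialization takes the hardware path or the lock fallback can be queried; a small sketch (the Pair type is illustrative):
#include <atomic>
#include <cstdint>
struct Pair { int64_t a, b; };             // 16 bytes on x86-64
std::atomic<int64_t> small_obj;
std::atomic<Pair> big_obj;
bool small_lf = small_obj.is_lock_free();  // true on mainstream 64-bit targets
bool big_lf = big_obj.is_lock_free();      // may be false (mutex fallback), or true
                                           // via e.g. lock cmpxchg16b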
What's the second aspect mentioned above and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear to have been executed in the order given by the code (program order), as observed from within the executing thread itself. Therefore, in the case of inter-thread communication, the memory order must be taken into account to establish the required ordering despite reordering attempts. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
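For instance, a release store paired with an acquire load is the usual way to publish data to another thread; a minimal sketch, assuming a flag/payload pair:
std::atomic<bool> ready{false};
int data = 0;                                    // plain, non-atomic payload
// producer thread:
data = 42;                                       // 1: write payload
ready.store(true, std::memory_order_release);    // 2: publish; 1 cannot be reordered after 2
// consumer thread:
while (!ready.load(std::memory_order_acquire))   // 3: pairs with the release store
    ;
assert(data == 42);                              // 1 happens-before this read; no data race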
Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay inside the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, all the aspects raised here are addressed simply by using the locking facility. As soon as you step outside the comfort locks provide, you have to be mindful of the consequences and of the factors that affect concurrency correctness.
Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.
If you do not use explicit sharing constructs, like those you mention, it is undefined when main() will see shared having the value 42: please see "Optimisations and reordering" below. Even if your test does not see a problem, please check out "About your test" below!
In multi-threading, a test that gives the "right" answer is (almost) never a proof of correctness.
A "successful" test is at most anecdotal evidence. There is simply too much to take into account, like:
The memory model: what is guaranteed and, more likely: what not!
Optimisations by compiler and CPU
Scheduling. For example, the thread can terminate anywhere between just before the while loop and inside the thread.join() function.
Run-time stuff like how many other threads and programs are running, how heavily the memory is used etc. This is both hardware and operating system dependent.
More things I forgot …
The only things you CAN trust are the guarantees that your language's memory model gives.
Fortunately, C++ has a memory model since C++11!
Unfortunately, that model does not give many guarantees. The compiler can generate code that is allowed to do anything, as long as the semantics of the program do not change as seen from a single-threaded perspective. That includes omitting code, postponing code, or changing the order in which things happen. The only exceptions are when you make guaranteed progress, or when you use explicit sharing constructs, like those you mentioned.
Debugging a multi-threaded situation is also extremely hard. Adding "debug code" to debug your program often changes its behaviour. For example, writing something to the standard output does I/O, which ensures progress. That can cause values to be visible by other threads, where that normally would not be the case!
Make sure you find out what the constructs you mention like atomics, volatile and mutexes do. That way, you can build programs that behave perfectly predictable in multi-threaded circumstances.
About your test
For the fun of it, let's explore some interesting cases surrounding your test program.
Thread scheduling
The operating system decides when threads run and terminate.
It is perfectly acceptable for thread to have terminated even before the while loop in main() is executed. Because thread termination is progress, the store to shared might become visible to main() before the while loop. In that case, the test seems successful. But if the scheduling is any different, the test might fail. You should never rely on scheduling.
Hence, even if your test does not see a problem, that is at most anecdotal evidence.
Optimisations and reordering
As horst's excellent answer already indicates, the compiler and the CPU can optimise your code. Anything is allowed, as long as the program semantics do not change from the perspective of a single thread.
Imagine that you assign to a variable that you never read again in that thread (like you do in thread). The compiler can postpone the actual assignment as much as it wants, as nothing depends on the value of shared in that thread as far as the compiler can see. You must have guaranteed progress in your thread to ensure the actual assignment happens. In your example, this progress is only guaranteed when thread terminates: likely at the end of the thread function. Then again, you have no idea when the thread is scheduled to call your function.
Using constructs like atomic<> and volatile forces the compiler to generate code that does ensure predictable behaviour. If you know how to use them, you can build programs that can be shown to behave correctly in multi-threaded circumstances.

Visual C++ optimization options - how to improve the code output?

Are there any options (other than /O2) to improve the Visual C++ code output? The MSDN documentation is quite bad in this regard.
Note that I'm not asking about project-wide settings (link-time optimization, etc...). I'm only interested in this particular example.
The fairly simple C++11 code looks like this:
#include <vector>
int main() {
    std::vector<int> v = {1, 2, 3, 4};
    int sum = 0;
    for (int i = 0; i < v.size(); i++) {
        sum += v[i];
    }
    return sum;
}
Clang's output with libc++ is quite compact:
main: # #main
mov eax, 10
ret
Visual C++ output, on the other hand, is a multi-page mess.
Am I missing something here or is VS really this bad?
Compiler explorer link:
https://godbolt.org/g/GJYHjE
Unfortunately, it's difficult to greatly improve the Visual C++ output in this case, even with more aggressive optimization flags. Several factors contribute to the VS inefficiency, including the lack of certain compiler optimizations and the structure of Microsoft's implementation of <vector>.
Inspecting the generated assembly, Clang does an outstanding job optimizing this code. Specifically, compared to VS, Clang performs very effective Constant Propagation, Function Inlining (and consequently Dead Code Elimination), and New/Delete Optimization.
Constant Propagation
In the example, the vector is statically initialized:
std::vector<int> v = {1, 2, 3, 4};
Normally, the compiler will store the constants 1, 2, 3, 4 in data memory, and in the for loop will load one value at a time, starting from the low address where 1 is stored, and add each value to the sum.
Here's the abbreviated VS code for doing this:
movdqa xmm0, XMMWORD PTR __xmm#00000004000000030000000200000001
...
movdqu XMMWORD PTR $T1[rsp], xmm0 ; Store integers 1, 2, 3, 4 in memory
...
$LL4#main:
add ebx, DWORD PTR [rdx] ; loop and sum the values
lea rdx, QWORD PTR [rdx+4]
inc r8d
movsxd rax, r8d
cmp rax, r9
jb SHORT $LL4#main
Clang, however, is clever enough to realize that the sum can be calculated in advance. My best guess is that it replaces the loads of the constants from memory with constant mov operations into registers (propagating the constants), and then combines them into the result of 10. This has the useful side effect of breaking dependencies, and since the addresses are no longer loaded from, the compiler is free to remove everything else as dead code.
Clang seems to be unique in doing this - neither VS nor GCC was able to precalculate the vector accumulation result in advance.
New/Delete Optimization
Compilers conforming to C++14 are allowed to omit calls to new and delete on certain conditions, specifically when the number of allocation calls is not part of the observable behavior of the program (N3664 standard paper).
This has already generated much discussion on SO:
clang vs gcc - optimization including operator new
Is the compiler allowed to optimize out heap memory allocations?
Optimization of raw new[]/delete[] vs std::vector
Clang invoked with -std=c++14 -stdlib=libc++ indeed performs this optimization and eliminates the calls to new and delete, which do carry side effects, but supposedly do not affect the observable behaviour of the program. With -stdlib=libstdc++, Clang is stricter and keeps the calls to new and delete - although, by looking at the assembly, it's clear they are not really needed.
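The effect is easy to reproduce in isolation; a small sketch (the clang output described here is the observed behavior of recent versions, not something the standard guarantees):
int boxed() {
    int* p = new int(42);   // the allocation is not observable behavior (N3664)
    int v = *p;
    delete p;
    return v;               // clang -O2 can reduce the whole function to: mov eax, 42 / ret
}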
Now, when inspecting the main code generated by VS, we can find two function calls there (with the rest of the vector construction and iteration code inlined into main):
call std::vector<int,std::allocator<int> >::_Range_construct_or_tidy<int const * __ptr64>
and
call void __cdecl operator delete(void * __ptr64)
The first is used for allocating the vector, and the second for deallocating it, and practically all other functions in the VS output are pulled in by these function calls. This hints that Visual C++ will not optimize away calls to allocation functions (for C++14 conformance we should add the /std:c++14 flag, but the results are the same).
This blog post (May 10, 2017) from the Visual C++ team confirms that indeed, this optimization is not implemented. Searching the page for N3664 shows that "Avoiding/fusing allocations" is at status N/A, and the linked comment says:
[E] Avoiding/fusing allocations is permitted but not required. For the time being, we’ve chosen not to implement this.
Combining new/delete optimization and constant propagation, it's easy to see the impact of these two optimizations in this Compiler Explorer 3-way comparison of Clang with -stdlib=libc++, Clang with -stdlib=libstdc++, and GCC.
STL Implementation
VS has its own STL implementation which is structured very differently from libc++ and libstdc++, and that seems to make a large contribution to the inferior VS code generation. While the VS STL has some very useful features, such as checked iterators and iterator debugging hooks (_ITERATOR_DEBUG_LEVEL), it gives the general impression of being heavier and less efficient than libstdc++.
To isolate the impact of the vector STL implementation, an interesting experiment is to compile with Clang combined with the VS header files. Indeed, using Clang 5.0.0 with the Visual Studio 2015 headers results in the following code generation - clearly, the STL implementation has a huge impact!
main: # #main
.Lfunc_begin0:
.Lcfi0:
.seh_proc main
.seh_handler __CxxFrameHandler3, #unwind, #except
# BB#0: # %.lr.ph
pushq %rbp
.Lcfi1:
.seh_pushreg 5
pushq %rsi
.Lcfi2:
.seh_pushreg 6
pushq %rdi
.Lcfi3:
.seh_pushreg 7
pushq %rbx
.Lcfi4:
.seh_pushreg 3
subq $72, %rsp
.Lcfi5:
.seh_stackalloc 72
leaq 64(%rsp), %rbp
.Lcfi6:
.seh_setframe 5, 64
.Lcfi7:
.seh_endprologue
movq $-2, (%rbp)
movl $16, %ecx
callq "??2#YAPEAX_K#Z"
movq %rax, -24(%rbp)
leaq 16(%rax), %rcx
movq %rcx, -8(%rbp)
movups .L.ref.tmp(%rip), %xmm0
movups %xmm0, (%rax)
movq %rcx, -16(%rbp)
movl 4(%rax), %ebx
movl 8(%rax), %esi
movl 12(%rax), %edi
.Ltmp0:
leaq -24(%rbp), %rcx
callq "?_Tidy#?$vector#HV?$allocator#H#std###std##IEAAXXZ"
.Ltmp1:
# BB#1: # %"\01??1?$vector#HV?$allocator#H#std###std##QEAA#XZ.exit"
addl %ebx, %esi
leal 1(%rdi,%rsi), %eax
addq $72, %rsp
popq %rbx
popq %rdi
popq %rsi
popq %rbp
retq
.seh_handlerdata
.long ($cppxdata$main)#IMGREL
.text
Update - Visual Studio 2017
In Visual Studio 2017, <vector> has seen a major overhaul, as announced on this blog post from the Visual C++ team. Specifically, it mentions the following optimizations:
Eliminated unnecessary EH logic. For example, vector’s copy assignment operator had an unnecessary try-catch block. It just has to provide the basic guarantee, which we can achieve through proper action sequencing.
Improved performance by avoiding unnecessary rotate() calls. For example, emplace(where, val) was calling emplace_back() followed by rotate(). Now, vector calls rotate() in only one scenario (range insertion with input-only iterators, as previously described).
Improved performance with stateful allocators. For example, move construction with non-equal allocators now attempts to activate our memmove() optimization. (Previously, we used make_move_iterator(), which had the side effect of inhibiting the memmove() optimization.) Note that a further improvement is coming in VS 2017 Update 1, where move assignment will attempt to reuse the buffer in the non-POCMA non-equal case.
Curious, I went back to test this. When building the example in Visual Studio 2017, the result is still a multi-page assembly listing with many function calls, so even if code generation improved, it is difficult to notice.
However, when building with clang 5.0.0 and Visual Studio 2017 headers, we get the following assembly:
main: # #main
.Lcfi0:
.seh_proc main
# BB#0:
subq $40, %rsp
.Lcfi1:
.seh_stackalloc 40
.Lcfi2:
.seh_endprologue
movl $16, %ecx
callq "??2#YAPEAX_K#Z" ; void * __ptr64 __cdecl operator new(unsigned __int64)
movq %rax, %rcx
callq "??3#YAXPEAX#Z" ; void __cdecl operator delete(void * __ptr64)
movl $10, %eax
addq $40, %rsp
retq
.seh_handlerdata
.text
Note the movl $10, %eax instruction - that is, with VS 2017's <vector>, clang was able to collapse everything, precalculate the result of 10, and keep only the calls to new and delete.
I'd say that is pretty amazing!
Function Inlining
Function inlining is probably the single most vital optimization in this example. By collapsing the code of called functions into their call sites, the compiler is able to perform further optimizations on the merged code; plus, removing function calls reduces call overhead and removes optimization barriers.
When inspecting the generated assembly for VS and comparing the code before and after inlining (Compiler Explorer), we can see that most vector functions were indeed inlined, except for the allocation and deallocation functions. In particular, there are calls to memmove, which result from inlining some higher-level functions, such as _Uninitialized_copy_al_unchecked.
memmove is a library function, and therefore cannot be inlined. However, clang has a clever way around this - it replaces the call to memmove with a call to __builtin_memmove. __builtin_memmove is a builtin/intrinsic function, which has the same functionality as memmove, but as opposed to the plain function call, the compiler generates code for it and embeds it into the calling function. Consequently, the code could be further optimized inside the calling function and eventually removed as dead code.
Summary
To conclude, Clang is clearly superior to VS in this example, thanks both to high-quality optimizations and to a more efficient vector STL implementation. When using the same header files for Visual C++ and Clang (the Visual Studio 2017 headers), Clang beats Visual C++ hands down.
While writing this answer, I couldn't help but think: what would we do without Compiler Explorer? Thanks, Matt Godbolt, for this amazing tool!

Using base pointer register in C++ inline asm

I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
    asm volatile ("pushq %%rbp;"          // 'prologue'
                  "movq %%rsp, %%rbp;"    // 'prologue'
                  "subq $12, %%rsp;"      // make room
                  "movl $5, -12(%%rbp);"  // some asm instruction
                  "movq %%rbp, %%rsp;"    // 'epilogue'
                  "popq %%rbp;"           // 'epilogue'
                  : : : );
    x = 5;
}
int main()
{
    int x;
    Foo(x);
    return 0;
}
int main()
{
int x;
Foo(x);
return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
# INLINEASM
pushq %rbp; // prologue
movq %rsp, %rbp; // prologue
subq $12, %rsp; // make room
movl $5, -12(%rbp); // some asm instruction
movq %rbp, %rsp; // epilogue
popq %rbp; // epilogue
# /INLINEASM
movq -8(%rbp), %rax
movl $5, (%rax) // x=5;
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Foo
movl $0, %eax
leave
ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.
See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives [1].)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done:
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
main runs leave: %rsp = (upper32)|5
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
Here's what you should have done:
void Bar(int &x)
{
    int tmp;
    long tmplong;
    asm ("lea -16 + %[mem1], %%rbp\n\t"
         "imul $10, %%rbp, %q[reg1]\n\t"  // q modifier: 64-bit register name
         "add %k[reg1], %k[reg1]\n\t"     // k modifier: 32-bit register name
         "movl $5, %[mem1]\n\t"           // some asm instruction writing to mem
         : [mem1] "=m" (tmp), [reg1] "=r" (tmplong)  // tmp vars -> tmp regs / mem for use inside asm
         :
         : "%rbp"  // tell compiler it needs to save/restore %rbp.
                   // gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0).
                   // clang lets you, but memory operands still use an offset from %rbp, which will crash!
                   // gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing.
    );
    x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm start or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
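For example, instead of moving an input into a fixed register yourself, let a constraint place it (a sketch; bsr is just a convenient single instruction):
unsigned long highest_set_bit(unsigned long x) {
    unsigned long idx;
    // "r" lets the compiler pick registers for input and output;
    // no leading/trailing mov is needed inside the template.
    asm("bsrq %1, %0" : "=r"(idx) : "r"(x));
    return idx;   // like the raw instruction, the result is undefined for x == 0
}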
Footnotes:
1. You can use GAS's intel-syntax in inline asm by building with -masm=intel (in which case your code will only work with that option), or by using dialect alternatives so the code works whether the compiler is in Intel or AT&T asm output mode. But that doesn't change the directives, and GAS's Intel syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsettable memory operands (on x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
In x86-64, the stack pointer needs to be aligned to 8 bytes.
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room

Does a function local static variable automatically incur a branch?

For example:
int foo()
{
    static int i = 0;
    return i++;
}
The variable i will only be initialized to 0 the first time foo is called. Does this automatically mean there's a hidden branch in there to keep the initialization from happening more than once? Or are there more clever tricks to avoid this?
Yes, it must incur a branch, and it must also incur at least an atomic operation for safe concurrent initialization. The Standard requires that such variables be initialized in a concurrency-safe way the first time control passes through their declaration.
The implementation can only dodge this requirement if it can prove that the difference between lazy initialization and some earlier initialization (for example, before main() is entered) is not observable. For simple PODs initialized from constants, for instance, the compiler may choose to initialize the variable earlier, like a file-scope global, saving the lazy-initialization code; but that's a non-observable optimization.
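A sketch of the distinction (the hypothetical compute() stands in for anything the compiler can't evaluate at compile time):
int counter()
{
    static int i = 0;            // constant initialization: no guard, no branch
    return i++;
}
int compute();                   // defined elsewhere; opaque to the compiler
int lazy_counter()
{
    static int j = compute();    // dynamic initialization: guard variable,
    return j++;                  // branch, and __cxa_guard_* calls
}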
Yes, there is a branch. Each time the function is entered, the code must check if the variable has already been initialized. But as will be explained below, you usually do not have to care about this branch.
Example
Check out this code:
#include <iostream>
struct Foo { Foo() { std::cout << "FOO" << std::endl; } };
void foo() { static Foo foo; }
int main() { foo(); }
Now, here is the first part of assembly code that gcc4.8 generates for the foo function:
_Z3foov:
.LFB974:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA974
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %r12
pushq %rbx
.cfi_offset 12, -24
.cfi_offset 3, -32
movl $_ZGVZ3foovE3foo, %eax
movzbl (%rax), %eax
testb %al, %al
jne .L7 <------------------- FIRST CHECK
movl $_ZGVZ3foovE3foo, %edi
call __cxa_guard_acquire <------------------- LOCK
testl %eax, %eax
setne %al
testb %al, %al
je .L7 <------------------- SECOND CHECK
movl $0, %r12d
movl $_ZZ3foovE3foo, %edi
As you see, there is a jne! Then a guard is acquired using __cxa_guard_acquire, followed by a je. Thus, it seems that the compiler is generating the famous double-checked locking pattern here.
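In pseudo-code, the generated pattern corresponds roughly to this (a sketch of the Itanium C++ ABI scheme, not the exact library semantics):
if (!guard.initialized) {               // first check, without taking the lock
    if (__cxa_guard_acquire(&guard)) {  // lock; returns nonzero if init is still needed
        construct foo;                  // run the constructor exactly once
        __cxa_guard_release(&guard);    // mark initialized and release the lock
    }
}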
Will every compiler generate a branch?
I am pretty sure the spec does NOT mandate that a branch or double-checked locking must be used. It just mandates that the initialization be thread safe. However, I do not see a way to perform a thread-safe initialization without a branch. Thus, even though the spec does not mandate it, it is simply not possible on current CPU architectures to omit the branch here.
Is the branch expensive?
Considering whether you should care about this branch:
You should definitely NOT care about this branch, since it will be correctly predicted (once the object is initialized, the branch always takes the same route). Thus, the branch is almost free. Trying to avoid a static local variable for optimization purposes should never yield any observable performance benefit.
Is there really no way around the branch?
If the constructor is not observable, like simple initialization with constant values, then it may be performed eagerly at program startup and the branch omitted. If, however, it is observable, then things get pretty tricky:
The only possibility I see is stated in the (now deleted) answer of R. Martinho Fernandes: the code could modify itself, i.e., simply remove the initialization code once the initialization is done. However, this idea is impractical for the following reasons:
Self-modifying code is very hard to get thread-safe.
Usually, memory flagged executable is write-protected, so code is not allowed to rewrite itself.
It is just not worth it, as the branch is not expensive (see above).

Writing non-class types onto raw memory

Given a void pointer to a "blob" of raw memory, there are two ways of writing something onto it.
The first way is to use placement new. This method has the advantage of calling the ctor automagically when we are dealing with class-types. However, when I deal with non-class types, would it be better to do a cast instead? I imagine it could possibly be faster.
(pLocation is a void pointer to a blob of memory)
// ----- Is this better -----
*reinterpret_cast<char*>(pLocation) = pattern;
// ----- Or is this better -----
::new(pLocation) char(pattern);
I had a look at the generated assembly for each of these techniques, using the following program:
#include <new>
char blob[128];
int main() {
    void* pLocation = blob;
    char pattern = 'x';
#ifdef CAST
    *reinterpret_cast<char*>(pLocation) = pattern;
#else
    ::new (pLocation) char(pattern);
#endif
}
I'm using g++ 4.4.3 on Linux 64-bits with default compiler flags.
The relevant part of the asm for placement new:
movb $120, -1(%rbp)
movq -16(%rbp), %rax
movq %rax, %rsi
movl $1, %edi
call _ZnwmPv
movq %rax, %rdx
testq %rdx, %rdx
je .L5
movzbl -1(%rbp), %edx
movb %dl, (%rax)
.L5:
From what I gather, this actually calls the placement new operator and checks its return value, even though it always succeeds. It then proceeds to write the value of pattern ('x') into the returned memory.
And for the reinterpret_cast:
movb $120, -1(%rbp)
movq -16(%rbp), %rax
movzbl -1(%rbp), %edx
movb %dl, (%rax)
Note that these instructions are identical to the first two and the last two of the placement new version.
Using -O1, both pieces of code generate identical assembly:
movb $120, blob(%rip)
So, if you're worried about performance, don't be. Any other sane compiler will probably reduce both to the same code as well.
While casting raw memory into objects might work in practice, officially it invokes undefined behavior, and as a result, according to the C++ standard, your code might do anything.
Placement new, OTOH, is a technique to invoke a constructor at a particular address, and construction is what officially turns raw memory into valid objects. That's why I would prefer placement new.
Just to make sure, I would also have the destructor for such objects called. While you say that you only need this for PODs, and POD destruction is a no-op, many bugs I have seen in my career were in code that was written with a set of restrictions in mind, but later had some of the restrictions lifted and suddenly found itself in an environment with which it was unable to cope.
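A minimal sketch of the symmetric pattern (the helper names are illustrative):
#include <new>       // placement new
#include <utility>   // std::forward
template <typename T, typename... Args>
T* construct_at_blob(void* pLocation, Args&&... args) {
    return ::new (pLocation) T(std::forward<Args>(args)...);   // begins the object's lifetime
}
template <typename T>
void destroy_at_blob(T* obj) {
    obj->~T();   // ends the lifetime; a no-op for PODs, but future-proof
}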
Also note that there might be platforms out there on which not all possible bit patterns are valid values, even for a built-in type. Such platforms might also trap access to values with such patterns. For example, it could be that an all-zero bit pattern is not a valid value for a floating-point type, so even zeroing the memory beforehand could not prevent a hardware exception.