What's the difference between T, volatile T, and std::atomic<T>? - c++

Given the following sample that intends to wait until another thread stores 42 in a shared variable shared without locks and without waiting for thread termination, why would volatile T or std::atomic<T> be required or recommended to guarantee concurrency correctness?
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
int64_t shared = 0;
std::thread thread([&shared]() {
shared = 42;
});
while (shared != 42) {
}
assert(shared == 42);
thread.join();
return 0;
}
With GCC 4.8.5 and default options, the sample works as expected.

The test seems to indicate that the sample is correct but it is not. Similar code could easily end up in production and might even run flawlessly for years.
We can start off by compiling the sample with -O3. Now, the sample hangs indefinitely. (The default is -O0: no optimization and debug consistency, which is somewhat similar to making every variable volatile; that is why the test didn't reveal the code as unsafe.)
To get to the root cause, we have to inspect the generated assembly. First, the GCC 4.8.5 -O0 based x86_64 assembly corresponding to the un-optimized working binary:
// Thread B:
// shared = 42;
movq -8(%rbp), %rax
movq (%rax), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L11:
movq -32(%rbp), %rax # Check shared every iteration
cmpq $42, %rax
jne .L11
Thread B executes a simple store of the value 42 in shared.
Thread A reads shared for each loop iteration until the comparison indicates equality.
Now, we compare that to the -O3 outcome:
// Thread B:
// shared = 42;
movq 8(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
cmpq $42, (%rsp) # check shared once
je .L87 # and skip the infinite loop or not
.L88:
jmp .L88 # infinite loop
.L87:
Optimizations associated with -O3 replaced the loop with a single comparison and, if not equal, an infinite loop to match the expected behavior. With GCC 10.2, the loop is optimized out. (Unlike C, infinite loops with no side-effects or volatile accesses are undefined behaviour in C++.)
The problem is that the compiler and its optimizer are not aware of the implementation's concurrency implications. Consequently, the conclusion needs to be that shared cannot change in thread A - the loop is equivalent to dead code. (Or to put it another way, data races are UB, and the optimizer is allowed to assume that the program doesn't encounter UB. If you're reading a non-atomic variable, that must mean nobody else is writing it. This is what allows compilers to hoist loads out of loops, and similarly sink stores, which are very valuable optimizations for the normal case of non-shared variables.)
The solution requires us to communicate to the compiler that shared is involved in inter-thread communication. One way to accomplish that may be volatile. While the actual meaning of volatile varies across compilers and any guarantees are compiler-specific, the general consensus is that volatile prevents the compiler from caching volatile accesses in registers and from eliding them. This is essential for low-level code that interacts with hardware, and it has its place in concurrent programming, albeit with a downward trend due to the introduction of std::atomic.
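In the sample, the change is a single line; only the declaration differs, the rest of main stays exactly as shown above:
volatile int64_t shared = 0;  // volatile: every access must actually load/store memory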
With volatile int64_t shared, the generated instructions change as follows:
// Thread B:
// shared = 42;
movq 24(%rdi), %rax
movq $42, (%rax)
// Thread A:
// while (shared != 42) {
// }
.L87:
movq 8(%rsp), %rax
cmpq $42, %rax
jne .L87
The loop cannot be eliminated anymore as it must be assumed that shared changed even though there's no evidence of that in the form of code. As a result, the sample now works with -O3.
If volatile fixes the issue, why would you ever need std::atomic? Two aspects relevant to lock-free code make std::atomic essential: memory operation atomicity and memory order.
To build the case for load/store atomicity, we review the generated assembly compiled with GCC4.8.5 -O3 -m32 (the 32-bit version) for volatile int64_t shared:
// Thread B:
// shared = 42;
movl 4(%esp), %eax
movl 12(%eax), %eax
movl $42, (%eax)
movl $0, 4(%eax)
// Thread A:
// while (shared != 42) {
// }
.L88: # do {
movl 40(%esp), %eax
movl 44(%esp), %edx
xorl $42, %eax
movl %eax, %ecx
orl %edx, %ecx
jne .L88 # } while(shared ^ 42 != 0);
For 32-bit x86 code generation, 64-bit loads and stores are usually split into two instructions. For single-threaded code, this is not an issue. For multi-threaded code, it means that another thread can observe a partial result of the 64-bit memory operation - a torn read or write that may never show up in testing, occurs at random, and whose probability is heavily influenced by the surrounding code and software usage patterns. Even if GCC chose to generate instructions that guarantee atomicity by default, that still wouldn't affect other compilers and might not hold true for all supported platforms.
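To make the hazard concrete, here is a minimal sketch of how a torn value could be observed. Note that this sketch is itself a data race (and therefore UB), is meant purely as an illustration, and needs a 32-bit build (e.g. -m32) for the 64-bit accesses to be split; the writer alternates between two values whose 32-bit halves differ, so a mixed value can only come from a partial load or store:
#include <cstdint>
#include <cstdio>
#include <thread>
volatile int64_t shared = 0; // volatile keeps the accesses, but does not make them atomic
int main()
{
    std::thread writer([] {
        for (;;) {
            shared = 0;  // halves: 0x00000000 / 0x00000000
            shared = -1; // halves: 0xFFFFFFFF / 0xFFFFFFFF
        }
    });
    writer.detach();
    for (;;) {
        int64_t v = shared; // with -m32 this is two 32-bit loads
        if (v != 0 && v != -1) {
            std::printf("torn read: 0x%016llx\n", (unsigned long long)v);
            return 1;
        }
    }
}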
To guard against partial loads/stores in all circumstances and across all compilers and supported platforms, std::atomic can be employed. Let's review how std::atomic affects the generated assembly. The updated sample:
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
int main()
{
std::atomic<int64_t> shared;
std::thread thread([&shared]() {
shared.store(42, std::memory_order_relaxed);
});
while (shared.load(std::memory_order_relaxed) != 42) {
}
assert(shared.load(std::memory_order_relaxed) == 42);
thread.join();
return 0;
}
The generated 32-bit assembly based on GCC 10.2 (-O3: https://godbolt.org/z/8sPs55nzT):
// Thread B:
// shared.store(42, std::memory_order_relaxed);
movl $42, %ecx
xorl %ebx, %ebx
subl $8, %esp
movl 16(%esp), %eax
movl 4(%eax), %eax # function arg: pointer to shared
movl %ecx, (%esp)
movl %ebx, 4(%esp)
movq (%esp), %xmm0 # 8-byte reload
movq %xmm0, (%eax) # 8-byte store to shared
addl $8, %esp
// Thread A:
// while (shared.load(std::memory_order_relaxed) != 42) {
// }
.L9: # do {
movq -16(%ebp), %xmm1 # 8-byte load from shared
movq %xmm1, -32(%ebp) # copy to a dummy temporary
movl -32(%ebp), %edx
movl -28(%ebp), %ecx # and scalar reload
movl %edx, %eax
movl %ecx, %edx
xorl $42, %eax
orl %eax, %edx
jne .L9 # } while(shared.load() ^ 42 != 0);
To guarantee atomicity for loads and stores, the compiler emits an 8-byte SSE2 movq instruction (to/from the bottom half of a 128-bit SSE register). Additionally, the assembly shows that the loop remains intact even though volatile was removed.
By using std::atomic in the sample, it is guaranteed that
std::atomic loads and stores are not subject to register-based caching
std::atomic loads and stores do not allow partial values to be observed
The C++ standard doesn't talk about registers at all, but it does say:
Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.
While that leaves room for interpretation, caching std::atomic loads across iterations, as was triggered in our sample (without volatile or atomic), would clearly be a violation - the store might never become visible. Current compilers don't even optimize atomics within one basic block, e.g. two accesses in the same iteration.
On x86, naturally-aligned loads/stores (where the address is a multiple of the load/store size) are atomic up to 8 bytes without special instructions. That's why GCC is able to use movq.
atomic<T> with a large T may not be supported directly by hardware, in which case the compiler can fall back to using a mutex.
A large T (e.g. the size of two registers) on some platforms might require an atomic RMW operation (if the compiler doesn't simply fall back to locking); such RMWs are sometimes provided at a larger size than the largest efficient pure-load / pure-store that's guaranteed atomic (e.g. on x86-64, lock cmpxchg16b, or an ARM ldrexd/strexd retry loop). Single-instruction atomic RMWs (like x86 uses) internally involve a cache-line lock or a bus lock. For example, older versions of clang -m32 for x86 will use lock cmpxchg8b instead of movq for an 8-byte pure-load or pure-store.
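Whether a given std::atomic<T> is lock-free can be queried at compile time (is_always_lock_free, C++17) or at run time (is_lock_free()). A minimal sketch, where Big is a made-up type that is typically too large for direct hardware support:
#include <atomic>
#include <cstdint>
#include <cstdio>
struct Big { long a, b, c; }; // 24 bytes on typical 64-bit targets
int main()
{
    std::printf("atomic<int64_t> lock-free: %d\n", (int)std::atomic<int64_t>::is_always_lock_free);
    std::printf("atomic<Big>     lock-free: %d\n", (int)std::atomic<Big>::is_always_lock_free);
    std::atomic<Big> big(Big{1, 2, 3}); // still usable, but may be implemented with an internal lock
    Big copy = big.load();
    return copy.a == 1 ? 0 : 1;
}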
What's the second aspect mentioned above and what does std::memory_order_relaxed mean?
Both the compiler and the CPU can reorder memory operations to optimize efficiency. The primary constraint on reordering is that all loads and stores must appear - to the thread executing them - to have been executed in the order given by the code (program order). Therefore, for inter-thread communication, the memory order must be taken into account to establish the required ordering despite such reordering. The required memory order can be specified for std::atomic loads and stores. std::memory_order_relaxed does not impose any particular order.
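In the sample above, relaxed is sufficient because the flag value itself is the entire payload. As soon as an atomic flag is used to publish other, non-atomic data, a stronger pairing is needed; a minimal sketch of the usual release/acquire pattern (variable names are mine):
#include <atomic>
#include <cassert>
#include <thread>
int payload = 0;                  // ordinary, non-atomic data to publish
std::atomic<bool> ready{false};
int main()
{
    std::thread producer([] {
        payload = 42;                                 // (1) write the data
        ready.store(true, std::memory_order_release); // (2) publish: (1) cannot be reordered after (2)
    });
    while (!ready.load(std::memory_order_acquire)) {  // (3) acquire pairs with the release in (2)
    }
    assert(payload == 42);                            // guaranteed once the acquire load observes true
    producer.join();
    return 0;
}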
Mutual exclusion primitives enforce a specific memory order (acquire-release order) so that memory operations stay in the lock scope and stores executed by previous lock owners are guaranteed to be visible to subsequent lock owners. Thus, using locks, all the aspects raised here are addressed simply by using the locking facility. As soon as you break out of the comfort locks provide, you have to be mindful of the consequences and the factors that affect concurrency correctness.
Being as explicit as possible about inter-thread communication is a good starting point so that the compiler is aware of the load/store context and can generate code accordingly. Whenever possible, prefer std::atomic<T> with std::memory_order_relaxed (unless the scenario calls for a specific memory order) to volatile T (and, of course, T). Also, whenever possible, prefer not to roll your own lock-free code to reduce code complexity and maximize the probability of correctness.

If you do not use explicit sharing constructs, like those you mention, it is undefined when main() will see shared having a value of 42: please see "Optimisations and reordering" below. Even if your test does not see a problem: please check out "About your test" below!
In multi-threading, a test that gives the "right" answer is (almost) never a proof of correctness.
A "successful" test is at most anecdotal evidence There is simply too much to take into account, like:
The memory model: what is guaranteed and, more likely: what not!
Optimisations by compiler and CPU
Scheduling. For example, the thread can terminate anywhere between just before the while loop and inside the thread.join() call.
Run-time stuff like how many other threads and programs are running, how heavily the memory is used etc. This is both hardware and operating system dependent.
More things I forgot …
The only things you CAN trust, are the guarantees that your language's memory model gives.
Fortunately, C++ has a memory model since C++11!
Unfortunately, that model does not give many guarantees. The compiler can generate code that is allowed to do anything, as long as the semantics of the program do not change as seen from a single-threaded perspective. That includes omitting code, postponing code or changing the order in which things happen. The only exceptions are when you make guaranteed progress, or when you use explicit sharing constructs, like those you mentioned.
Debugging a multi-threaded situation is also extremely hard. Adding "debug code" to debug your program often changes its behaviour. For example, writing something to the standard output does I/O, which ensures progress. That can cause values to be visible by other threads, where that normally would not be the case!
Make sure you find out what the constructs you mention like atomics, volatile and mutexes do. That way, you can build programs that behave perfectly predictable in multi-threaded circumstances.
About your test
For the fun of it, let's explore some interesting cases surrounding your test program.
Thread scheduling
The operating system decides when threads run and terminate.
It is perfectly acceptable for the thread to have already terminated even before the while loop in main() is executed. Because thread termination is progress, shared might end up where main() can see it before the while loop. In that case, the test seems successful. But if the scheduling is any different, the test might fail. You should never rely on scheduling.
Hence, even if your test does not see a problem, that is at most anecdotal evidence.
Optimisations and reordering
As horst's excellent answer already indicates, the compiler and the CPU can optimise your code. Anything is allowed, as long as the program semantics do not change from the perspective of a single thread.
Imagine that you assign to a variable that you never read again in that thread (like you do in thread). The compiler can postpone the actual assignment for as long as it wants, as nothing depends on the value of shared in that thread as far as the compiler can see. You need guaranteed progress in your thread to ensure that the assignment actually happens. In your example, this progress is only guaranteed when thread terminates: likely at the end of the thread function. Then again: you have no idea when the thread is scheduled to call your function.
Using constructs like atomic<> and volatile forces the compiler to generate code that does ensure predictable behaviour. If you know how to use them, you can build programs that can be shown to behave correctly in multi-threaded circumstances.

Related

Performance of a string operation with strlen vs stop on zero

While I was writing a class for strings in C++, I found a strange behavior regarding the speed of execution.
I'll take as an example the following two implementations of the upper method:
class String {
char* str;
...
forceinline void upperStrlen();
forceinline void upperPtr();
};
void String::upperStrlen()
{
INDEX length = strlen(str);
for (INDEX i = 0; i < length; i++) {
str[i] = toupper(str[i]);
}
}
void String::upperPtr()
{
char* ptr_char = str;
for (; *ptr_char != '\0'; ptr_char++) {
*ptr_char = toupper(*ptr_char);
}
}
INDEX is simply a typedef of uint_fast32_t.
Now I can test the speed of those methods in my main.cpp:
#define TEST_RECURSIVE(_function) \
{ \
bool ok = true; \
clock_t before = clock(); \
for (int i = 0; i < TEST_RECURSIVE_TIMES; i++) { \
if (!(_function()) && ok) \
ok = false; \
} \
char output[TEST_RECURSIVE_OUTPUT_STR]; \
sprintf(output, "[%s] Test %s %s: %ld ms\n", \
ok ? "OK" : "Failed", \
TEST_RECURSIVE_BUILD_TYPE, \
#_function, \
(clock() - before) * 1000 / CLOCKS_PER_SEC); \
fprintf(stdout, output); \
fprintf(file_log, output); \
}
String a;
String b;
bool stringUpperStrlen()
{
a.upperStrlen();
return true;
}
bool stringUpperPtr()
{
b.upperPtr();
return true;
}
int main(int argc, char** argv) {
...
a = "Hello World!";
b = "Hello World!";
TEST_RECURSIVE(stringUpperPtr);
TEST_RECURSIVE(stringUpperStrlen);
...
return 0;
}
Then I can compile and test with cmake in Debug or Release with the following results.
[OK] Test RELEASE stringUpperPtr: 21 ms
[OK] Test RELEASE stringUpperStrlen: 12 ms
[OK] Test DEBUG stringUpperPtr: 27 ms
[OK] Test DEBUG stringUpperStrlen: 33 ms
So in Debug the behavior is what I expected, the pointer is faster than strlen, but in Release strlen is faster.
So I took the GCC assembly and the number of instructions is much less in the stringUpperPtr than in stringUpperStrlen.
The stringUpperStrlen assembly:
_Z17stringUpperStrlenv:
.LFB72:
.cfi_startproc
pushq %r13
.cfi_def_cfa_offset 16
.cfi_offset 13, -16
xorl %eax, %eax
pushq %r12
.cfi_def_cfa_offset 24
.cfi_offset 12, -24
pushq %rbp
.cfi_def_cfa_offset 32
.cfi_offset 6, -32
xorl %ebp, %ebp
pushq %rbx
.cfi_def_cfa_offset 40
.cfi_offset 3, -40
pushq %rcx
.cfi_def_cfa_offset 48
orq $-1, %rcx
movq a#GOTPCREL(%rip), %r13
movq 0(%r13), %rdi
repnz scasb
movq %rcx, %rdx
notq %rdx
leaq -1(%rdx), %rbx
.L4:
cmpq %rbp, %rbx
je .L3
movq 0(%r13), %r12
addq %rbp, %r12
movsbl (%r12), %edi
incq %rbp
call toupper#PLT
movb %al, (%r12)
jmp .L4
.L3:
popq %rdx
.cfi_def_cfa_offset 40
popq %rbx
.cfi_def_cfa_offset 32
popq %rbp
.cfi_def_cfa_offset 24
popq %r12
.cfi_def_cfa_offset 16
movb $1, %al
popq %r13
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE72:
.size _Z17stringUpperStrlenv, .-_Z17stringUpperStrlenv
.globl _Z14stringUpperPtrv
.type _Z14stringUpperPtrv, #function
The stringUpperPtr assembly:
_Z14stringUpperPtrv:
.LFB73:
.cfi_startproc
pushq %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq b#GOTPCREL(%rip), %rax
movq (%rax), %rbx
.L9:
movsbl (%rbx), %edi
testb %dil, %dil
je .L8
call toupper#PLT
movb %al, (%rbx)
incq %rbx
jmp .L9
.L8:
movb $1, %al
popq %rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE73:
.size _Z14stringUpperPtrv, .-_Z14stringUpperPtrv
.section .rodata.str1.1,"aMS",#progbits,1
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
So how do you explain this difference in performance?
Thanks in advance.
EDIT:
CMake generate something like this command to compile:
/bin/g++-8 -Os -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o xpp-tests libxpp.so
/bin/g++-8 -O3 -DNDEBUG -Wl,-rpath,$ORIGIN CMakeFiles/xpp-tests.dir/tests/main.cpp.o -o Release/xpp-tests Release/libxpp.so
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.16
# compile CXX with /bin/g++-8
CXX_FLAGS = -O3 -DNDEBUG -Wall -pipe -fPIC -march=native -fno-strict-aliasing
CXX_DEFINES = -DPLATFORM_UNIX=1 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1
The define TEST_RECURSIVE will call _function 1000000 times in my examples.
You have several misconceptions about performance. You need to dispel these misconceptions.
Now I can test the speed of those methods in my main.cpp: (…)
Your benchmarking code calls the benchmarked functions directly. So you're measuring the benchmarked functions as optimized for the specific case of how they're used by the benchmarking code: to call them repeatedly on the same input. This is unlikely to have any relevance to how they behave in a realistic environment.
I think the compiler didn't do anything earth-shattering because it doesn't know what toupper does. If the compiler had known that toupper doesn't transform a nonzero character into zero, it might well have hoisted the strlen call outside the benchmarked loop. And if it had known that toupper(toupper(x)) == toupper(x), it might well have decided to run the loop only once.
To make a somewhat realistic benchmark, put the benchmarked code and the benchmarking code in separate source files, compile them separately, and disable any kind of cross-module or link-time optimization.
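One possible layout, sketched with made-up file names and a simplified stand-alone function instead of the String class; the point is only that the benchmarked code and its caller live in different translation units and are linked without LTO:
// upper.cpp - compiled on its own: g++ -O3 -c upper.cpp
#include <cctype>
void upper_ptr(char* s)
{
    for (; *s != '\0'; ++s)
        *s = (char)std::toupper((unsigned char)*s);
}
// bench.cpp - compiled separately and linked without LTO: g++ -O3 bench.cpp upper.o
#include <cstdio>
#include <cstring>
#include <ctime>
void upper_ptr(char* s); // declaration only; the definition is opaque to this TU
int main()
{
    char buf[32];
    std::clock_t before = std::clock();
    for (int i = 0; i < 1000000; ++i) {
        std::strcpy(buf, "Hello World!"); // refresh the input so the work cannot be skipped
        upper_ptr(buf);
    }
    std::printf("%ld ms\n", (long)((std::clock() - before) * 1000 / CLOCKS_PER_SEC));
    return 0;
}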
Then I can compile and test with cmake in Debug or Release
Compiling in debug mode rarely has any relevance to microbenchmarks (benchmarking the speed of an implementation of a small fragment of code, as opposed to benchmarking the relative speed of algorithms in terms of how many elementary functions they call). Compiler optimizations have a significant effect on microbenchmarks.
So rationally, fewer instructions should mean more speed (excluding cache, scheduler, etc ...).
No, absolutely not.
First of all, fewer instructions total is completely irrelevant to the speed of the program. Even on a platform where executing one instruction takes the same amount of time regardless of what the instruction is, which is unusual, what matters is how many instructions are executed, not how many instructions there are in the program. For example, a loop with 100 instructions that is executed 10 times is 10 times faster than a loop with 10 instructions that is executed 1000 times, even though it's 10 times larger. Inlining is a common program transformation that usually makes the code larger and makes it faster often enough that it's considered a common optimization.
Second, on many platforms, such as any PC or server made in the 21st century, any smartphone, and even many lower-end devices, the time it takes to execute an instruction can vary so widely that it's a poor indication of performance. Cache is a major factor: a read from memory can be more than 1000 times slower than a read from cache on a PC. Other factors with less impact include pipelining, which causes the speed of an instruction to depend on the surrounding instructions, and branch prediction, which causes the speed of a conditional instruction to depend on the outcome of previous conditional instructions.
Third, that's just considering processor instructions — what you see in assembly code. Compilers for C, C++ and most other languages optimize programs in such a way that it can be hard to predict what the processor will be doing exactly.
For example, how long does the instruction ++x; take on a PC?
If the compiler has figured out that the addition is unnecessary, for example because nothing uses x afterwards, or because the value of x is known at compile time and therefore so is the value of x+1, it'll optimize it away. So the answer is 0.
If the value of x is already in a register at this point and the value is only needed in a register afterwards, the compiler just needs to generate an addition or increment instruction. So the simplistic, but not quite correct answer is 1 clock cycle. One reason this is not quite correct is that merely decoding the instruction takes many cycles on a high-end processor such as what you find in a 21st century PC or smartphone. However “one cycle” is kind of correct in that while it takes multiple clock cycles from starting the instruction to finishing it, the instruction only takes one cycle in each pipeline stage. Furthermore, even taking this into account, another reason this is not quite correct is that ++x; ++y; might not take 2 clock cycles: modern processors are sophisticated enough that they may be able to decode and execute multiple instructions in parallel (for example, a processor with 4 arithmetic units can perform 4 additions at the same time). Yet another reason this might not be correct is if the type of x is larger or smaller than a register, which might require more than one assembly instruction to perform the addition.
If the value of x needs to be loaded from memory, this takes a lot more than one clock cycle. Anything other than the innermost cache level dwarfs the time it takes to decode the instruction and perform the addition. The amount of time is very different depending on whether x is found in the L3 cache, in the L2 cache, in the L1 cache, or in the “real” RAM. And even that gets more complicated when you consider that x might be part of a cache prefetch (hardware- or software- triggered).
It's even possible that x is currently in swap, so that reading it requires reading from a disk.
And writing the result exhibits somewhat similar variations to reading the input. However the performance characteristics are different for reads and for writes because when you need a value, you need to wait for the read to be complete, whereas when you write a value, you don't need to wait for the write to be complete: a write to memory writes to a buffer in cache, and the time when the buffer is flushed to a higher-level cache or to RAM depends on what else is happening on the system (what else is competing for space in the cache).
Ok, now let's turn to your specific example and look at what happens in their inner loop. I'm not very familiar with x86 assembly but I think I get the gist.
For stringUpperStrlen, the inner loop starts at .L4. Just before entering the inner loop, %rbx is set to the length of the string. Here's what the inner loop contains:
cmpq %rbp, %rbx: Compare the current index to the length, both obtained from registers.
je .L3: conditional jump, to exit the loop if the index is equal to the length.
movq 0(%r13), %r12: Read from memory to get the address of the beginning of the string. (I'm surprised that the address isn't in a register at this point.)
addq %rbp, %r12: an arithmetic operation that depends on the value that was just read from memory.
movsbl (%r12), %edi: Read the current character from the string in memory.
incq %rbp: Increment the index. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
call toupper#PLT
movb %al, (%r12): Write the value returned by the function to the current character of the string in memory.
jmp .L4: Unconditional jump to the beginning of the loop.
For stringUpperPtr, the inner loop starts at .L9. Here's what the inner loop contains:
movsbl (%rbx), %edi: Read the current character from the string in memory (the address is in %rbx).
testb %dil, %dil: test if %dil is zero. %dil is the least significant byte of %edi which was just read from memory.
je .L8: conditional jump, to exit the loop if the character is zero.
call toupper#PLT
movb %al, (%rbx): Write the value returned by the function to the current character of the string in memory.
incq %rbx: Increment the pointer. This is an arithmetic instruction on a register value that doesn't depend on a recent memory read, so it's very likely to be free: it only takes pipeline stages and an arithmetic unit that wouldn't be busy anyway.
jmp .L9: Unconditional jump to the beginning of the loop.
The differences between the two loops are:
The loops have slightly different lengths, but both are small enough that they fit in a single cache line (or two, if the code happens to straddle a line boundary). So after the first iteration of the loop, the code will be in the innermost instruction cache. Not only that, but if I understand correctly, on modern Intel processors, there is a cache of decoded instructions, which the loop is small enough to fit in, and so no decoding needs to take place.
The stringUpperStrlen loop has one more read. The extra read is from a constant address which is likely to remain in the innermost cache after the first iteration.
The conditional instruction in the stringUpperStrlen loop depends only on values that are in registers. On the other hand, the conditional instruction in the stringUpperPtr loop depends on a value which was just read from memory.
So the difference boils down to an extra data read from the innermost cache, vs having a conditional instruction whose outcome depends on a memory read. An instruction whose outcome depends on the result of another instruction leads to a hazard: the second instruction is blocked until the first instruction is fully executed, which prevents taking advantage from pipelining, and can render speculative execution less effective. In the stringUpperStrlen loop, the processor essentially runs two things in parallel: the load-call-store cycle, which doesn't have any conditional instructions (apart from what happens inside toupper), and the increment-test cycle, which doesn't access memory. This lets the processor work on the conditional instruction while it's waiting for memory. In the stringUpperPtr loop, the conditional instruction depends on a memory read, so the processor can't start working on it until the read is complete. I'd typically expect this to be slower than the extra read from the innermost cache, although it might depend on the processor.
Of course, the stringUpperStrlen does need to have a load-test hazard to determine the end of the string: no matter how it does it, it needs to fetch characters in memory. This is hidden inside repnz scasb. I don't know the internal architecture of an x86 processor, but I suspect that this case (which is extremely common since it's the meat of strlen) is heavily optimized inside the processor, probably to an extent that is impossible to reach with generic instructions.
You may see different results if the string was longer and the two memory accesses in stringUpperStrlen weren't in the same cache line, although possibly not because this only costs one more cache line and there are several. The details would depend on how the caches work and how toupper uses them.

Using base pointer register in C++ inline asm

I want to be able to use the base pointer register (%rbp) within inline asm. A toy example of this is like so:
void Foo(int &x)
{
asm volatile ("pushq %%rbp;" // 'prologue'
"movq %%rsp, %%rbp;" // 'prologue'
"subq $12, %%rsp;" // make room
"movl $5, -12(%%rbp);" // some asm instruction
"movq %%rbp, %%rsp;" // 'epilogue'
"popq %%rbp;" // 'epilogue'
: : : );
x = 5;
}
int main()
{
int x;
Foo(x);
return 0;
}
I hoped that, since I am using the usual prologue/epilogue function-calling method of pushing and popping the old %rbp, this would be ok. However, it seg faults when I try to access x after the inline asm.
The GCC-generated assembly code (slightly stripped-down) is:
_Foo:
pushq %rbp
movq %rsp, %rbp
movq %rdi, -8(%rbp)
# INLINEASM
pushq %rbp; // prologue
movq %rsp, %rbp; // prologue
subq $12, %rsp; // make room
movl $5, -12(%rbp); // some asm instruction
movq %rbp, %rsp; // epilogue
popq %rbp; // epilogue
# /INLINEASM
movq -8(%rbp), %rax
movl $5, (%rax) // x=5;
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq -4(%rbp), %rax
movq %rax, %rdi
call _Foo
movl $0, %eax
leave
ret
Can anyone tell me why this seg faults? It seems that I somehow corrupt %rbp but I don't see how. Thanks in advance.
I'm running GCC 4.8.4 on 64-bit Ubuntu 14.04.
See the bottom of this answer for a collection of links to other inline-asm Q&As.
Your code is broken because you step on the red-zone below RSP (with push) where GCC was keeping a value.
What are you hoping to accomplish with inline asm? If you want to learn inline asm, learn to use it to make efficient code, rather than horrible stuff like this. If you want to write function prologues and push/pop to save/restore registers, you should write whole functions in asm. (Then you can easily use nasm or yasm, rather than the less-preferred-by-most AT&T syntax with GNU assembler directives; see footnote 1.)
GNU inline asm is hard to use, but allows you to mix custom asm fragments into C and C++ while letting the compiler handle register allocation and any saving/restoring if necessary. Sometimes the compiler will be able to avoid the save and restore by giving you a register that's allowed to be clobbered. Without volatile, it can even hoist asm statements out of loops when the input would be the same. (i.e. unless you use volatile, the outputs are assumed to be a "pure" function of the inputs.)
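A small sketch of that difference (hypothetical helpers; the exact generated code depends on the compiler): an asm statement whose outputs depend only on its inputs may be merged or hoisted, while asm volatile is re-executed every time.
#include <cstdint>
// No volatile: treated as a pure function of x, so repeated calls with the same x may be CSE'd.
static inline unsigned add_one(unsigned x)
{
    asm ("addl $1, %0" : "+r"(x));
    return x;
}
// volatile: the result changes even with identical inputs, so it must not be merged or hoisted.
static inline std::uint64_t rdtsc_now()
{
    unsigned lo, hi;
    asm volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((std::uint64_t)hi << 32) | lo;
}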
If you're just trying to learn asm in the first place, GNU inline asm is a terrible choice. You have to fully understand almost everything that's going on with the asm, and understand what the compiler needs to know, to write correct input/output constraints and get everything right. Mistakes will lead to clobbering things and hard-to-debug breakage. The function-call ABI is a much simpler and easier to keep track of boundary between your code and the compiler's code.
Why this breaks
You compiled with -O0, so gcc's code spills the function parameter from %rdi to a location on the stack. (This could happen in a non-trivial function even with -O3).
Since the target ABI is the x86-64 SysV ABI, it uses the "Red Zone" (128 bytes below %rsp that even asynchronous signal handlers aren't allowed to clobber), instead of wasting an instruction decrementing the stack pointer to reserve space.
It stores the 8B pointer function arg at -8(rsp_at_function_entry). Then your inline asm pushes %rbp, which decrements %rsp by 8 and then writes there, clobbering the low 32b of &x (the pointer).
When your inline asm is done,
gcc reloads -8(%rbp) (which has been overwritten with %rbp) and uses it as the address for a 4B store.
Foo returns to main with %rbp = (upper32)|5 (orig value with the low 32 set to 5).
main runs leave: %rsp = (upper32)|5
main runs ret with %rsp = (upper32)|5, reading the return address from virtual address (void*)(upper32|5), which from your comment is 0x7fff0000000d.
I didn't check with a debugger; one of those steps might be slightly off, but the problem is definitely that you clobber the red zone, leading to gcc's code trashing the stack.
Even adding a "memory" clobber doesn't get gcc to avoid using the red zone, so it looks like allocating your own stack memory from inline asm is just a bad idea. (A memory clobber means you might have written some memory you're allowed to write to, e.g. a global variable or something pointed-to by a global, not that you might have overwritten something you're not supposed to.)
If you want to use scratch space from inline asm, you should probably declare an array as a local variable and use it as an output-only operand (which you never read from).
AFAIK, there's no syntax for declaring that you modify the red-zone, so your only options are:
use an "=m" output operand (possibly an array) for scratch space; the compiler will probably fill in that operand with an addressing mode relative to RBP or RSP. You can index into it with constants like 4 + %[tmp] or whatever. You might get an assembler warning from 4 + (%rsp) but not an error.
skip over the red-zone with add $-128, %rsp / sub $-128, %rsp around your code. (Necessary if you want to use an unknown amount of extra stack space, e.g. push in a loop, or making a function call. Yet another reason to deref a function pointer in pure C, not inline asm.)
compile with -mno-red-zone (I don't think you can enable that on a per-function basis, only per-file)
Don't use scratch space in the first place. Tell the compiler what registers you clobber and let it save them.
Here's what you should have done:
void Bar(int &x)
{
int tmp;
long tmplong;
asm ("lea -16 + %[mem1], %%rbp\n\t"
"imul $10, %%rbp, %q[reg1]\n\t" // q modifier: 64bit name.
"add %k[reg1], %k[reg1]\n\t" // k modifier: 32bit name
"movl $5, %[mem1]\n\t" // some asm instruction writing to mem
: [mem1] "=m" (tmp), [reg1] "=r" (tmplong) // tmp vars -> tmp regs / mem for use inside asm
:
: "%rbp" // tell compiler it needs to save/restore %rbp.
// gcc refuses to let you clobber %rbp with -fno-omit-frame-pointer (the default at -O0)
// clang lets you, but memory operands still use an offset from %rbp, which will crash!
// gcc memory operands still reference %rsp, so don't modify it. Declaring a clobber on %rsp does nothing
);
x = 5;
}
Note the push/pop of %rbp in the code outside the #APP / #NO_APP section, emitted by gcc. Also note that the scratch memory it gives you is in the red zone. If you compile with -O0, you'll see that it's at a different position from where it spills &x.
To get more scratch regs, it's better to just declare more output operands that are never used by the surrounding non-asm code. That leaves register allocation to the compiler, so it can be different when inlined into different places. Choosing ahead of time and declaring a clobber only makes sense if you need to use a specific register (e.g. shift count in %cl). Of course, an input constraint like "c" (count) gets gcc to put the count in rcx/ecx/cx/cl, so you don't emit a potentially redundant mov %[count], %%ecx.
If this looks too complicated, don't use inline asm. Either lead the compiler to the asm you want with C that's like the optimal asm, or write a whole function in asm.
When using inline asm, keep it as small as possible: ideally just the one or two instructions that gcc isn't emitting on its own, with input/output constraints to tell it how to get data into / out of the asm statement. This is what it's designed for.
Rule of thumb: if your GNU C inline asm starts or ends with a mov, you're usually doing it wrong and should have used a constraint instead.
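For example, a variable shift count has to be in %cl; rather than mov'ing it there inside the asm, let a "c" constraint do it (a sketch, function names are mine):
// Redundant: the leading mov is work the compiler could have done via a constraint,
// and %ecx must be listed as clobbered so operands are kept out of it.
unsigned shift_left_bad(unsigned x, unsigned count)
{
    asm ("movl %1, %%ecx\n\t"
         "shll %%cl, %0"
         : "+r"(x) : "r"(count) : "ecx");
    return x;
}
// Better: "c" tells the compiler to place count in ecx/cl itself.
unsigned shift_left_good(unsigned x, unsigned count)
{
    asm ("shll %%cl, %0" : "+r"(x) : "c"(count));
    return x;
}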
Footnotes:
1. You can use GAS's intel-syntax in inline-asm by building with -masm=intel (in which case your code will only work with that option), or using dialect alternatives so it works with the compiler in Intel or AT&T asm output syntax. But that doesn't change the directives, and GAS's Intel-syntax is not well documented. (It's like MASM, not NASM, though.) I don't really recommend it unless you really hate AT&T syntax.
Inline asm links:
x86 wiki. (The tag wiki also links to this question, for this collection of links)
The inline-assembly tag wiki
The manual. Read this. Note that inline asm was designed to wrap single instructions that the compiler doesn't normally emit. That's why it's worded to say things like "the instruction", not "the block of code".
A tutorial
Looping over arrays with inline assembly Using r constraints for pointers/indices and using your choice of addressing mode, vs. using m constraints to let gcc choose between incrementing pointers vs. indexing arrays.
How can I indicate that the memory *pointed* to by an inline ASM argument may be used? (pointer inputs in registers do not imply that the pointed-to memory is read and/or written, so it might not be in sync if you don't tell the compiler).
In GNU C inline asm, what're the modifiers for xmm/ymm/zmm for a single operand?. Using %q0 to get %rax vs. %w0 to get %ax. Using %g[scalar] to get %zmm0 instead of %xmm0.
Efficient 128-bit addition using carry flag Stephen Canon's answer explains a case where an early-clobber declaration is needed on a read+write operand. Also note that x86/x86-64 inline asm doesn't need to declare a "cc" clobber (the condition codes, aka flags); it's implicit. (gcc6 introduces syntax for using flag conditions as input/output operands. Before that you have to setcc a register that gcc will emit code to test, which is obviously worse.)
Questions about the performance of different implementations of strlen: my answer on a question with some badly-used inline asm, with an answer similar to this one.
llvm reports: unsupported inline asm: input with type 'void *' matching output with type 'int': Using offsetable memory operands (in x86, all effective addresses are offsettable: you can always add a displacement).
When not to use inline asm, with an example of 32b/32b => 32b division and remainder that the compiler can already do with a single div. (The code in the question is an example of how not to use inline asm: many instructions for setup and save/restore that should be left to the compiler by writing proper in/out constraints.)
MSVC inline asm vs. GNU C inline asm for wrapping a single instruction, with a correct example of inline asm for 64b/32b=>32bit division. MSVC's design and syntax require a round trip through memory for inputs and outputs, making it terrible for short functions. It's also "never very reliable" according to Ross Ridge's comment on that answer.
Using x87 floating point, and commutative operands. Not a great example, because I didn't find a way to get gcc to emit ideal code.
Some of those re-iterate some of the same stuff I explained here. I didn't re-read them to try to avoid redundancy, sorry.
In x86-64, the stack pointer needs to be aligned to 8 bytes.
This:
subq $12, %rsp; // make room
should be:
subq $16, %rsp; // make room

The cost of atomic counters and spinlocks on x86(_64)

Preface
I recently came across some synchronization problems, which led me to spinlocks and atomic counters. Then I was searching a bit more, how these work and found std::memory_order and memory barriers (mfence, lfence and sfence).
So now, it seems that I should use acquire/release for the spinlocks and relaxed for the counters.
Some reference
x86 MFENCE - Memory Fence
x86 LOCK - Assert LOCK# Signal
Question
What is the machine code (edit: see below) for those three operations (lock = test_and_set, unlock = clear, increment = operator++ = fetch_add) with default (seq_cst) memory order and with acquire/release/relaxed (in that order for those three operations). What is the difference (which memory barriers where) and the cost (how many CPU cycles)?
Purpose
I was just wondering how bad my old code (not specifying memory order = seq_cst used) really is and if I should create some class atomic_counter derived from std::atomic but using relaxed memory ordering (as well as good spinlock with acquire/release instead of mutexes on some places ...or to use something from boost library - I have avoided boost so far).
My Knowledge
So far I understand that spinlocks protect more than just themselves (they also protect some shared resource/memory), so there must be something that makes some memory view coherent for multiple threads/cores (that would be those acquire/release orders and memory fences). An atomic counter just lives for itself and only needs the atomic increment (no other memory is involved, and I do not really care about the value when I read it; it is informative and can be a few cycles old, no problem). There is some LOCK prefix, and some instructions like xchg implicitly have it. Here my knowledge ends; I don't know how the cache and buses really work and what is behind them (but I know that modern CPUs can reorder instructions, execute them in parallel and use memory caches and some synchronization). Thank you for the explanation.
P.S.: I have an old 32-bit PC now and can only see lock addl and simple xchg, nothing else - all versions look the same (except unlock); memory_order makes no difference on my old PC (except for unlock, where release uses mov instead of xchg). Will that be true for a 64-bit PC? (edit: see below) Do I have to care about memory order? (answer: no, not much; release on unlock saves a few cycles, that's all.)
The Code:
#include <atomic>
using namespace std;
atomic_flag spinlock;
atomic<int> counter;
void inc1() {
counter++;
}
void inc2() {
counter.fetch_add(1, memory_order_relaxed);
}
void lock1() {
while(spinlock.test_and_set()) ;
}
void lock2() {
while(spinlock.test_and_set(memory_order_acquire)) ;
}
void unlock1() {
spinlock.clear();
}
void unlock2() {
spinlock.clear(memory_order_release);
}
int main() {
inc1();
inc2();
lock1();
unlock1();
lock2();
unlock2();
}
g++ -std=c++11 -O1 -S (32bit Cygwin, shortened output)
__Z4inc1v:
__Z4inc2v:
lock addl $1, _counter ; both seq_cst and relaxed
ret
__Z5lock1v:
__Z5lock2v:
movl $1, %edx
L5:
movl %edx, %eax
xchgb _spinlock, %al ; both seq_cst and acquire
testb %al, %al
jne L5
rep ret
__Z7unlock1v:
movl $0, %eax
xchgb _spinlock, %al ; seq_cst
ret
__Z7unlock2v:
movb $0, _spinlock ; release
ret
UPDATE for x86_64bit: (see mfence in unlock1)
_Z4inc1v:
_Z4inc2v:
lock addl $1, counter(%rip) ; both seq_cst and relaxed
ret
_Z5lock1v:
_Z5lock2v:
movl $1, %edx
.L5:
movl %edx, %eax
xchgb spinlock(%rip), %al ; both seq_cst and acquire
testb %al, %al
jne .L5
ret
_Z7unlock1v:
movb $0, spinlock(%rip)
mfence ; seq_cst
ret
_Z7unlock2v:
movb $0, spinlock(%rip) ; release
ret
x86 has a mostly strong memory model: all the usual stores/loads implicitly have release/acquire semantics. The only exception is SSE non-temporal store operations, which require sfence to be ordered as usual. All read-modify-write (RMW) instructions with the LOCK prefix imply a full memory barrier, i.e. seq_cst.
Thus on x86, we have
test_and_set can be coded with lock bts (for bit-wise operations), lock cmpxchg, or lock xchg (or just xchg which implies the lock). Other spin-lock implementations can use instructions like lock inc (or dec) if they need e.g. fairness. It is not possible to implement try_lock with release/acquire fence (at least you'd need standalone memory barrier mfence anyway).
clear is coded with lock and (for bit-wise) or lock xchg, though, more efficient implementations would use plain write (mov) instead of locked instruction.
fetch_add is coded with lock add.
Removing the lock prefix does not guarantee atomicity for RMW operations, so such operations cannot be interpreted as merely having memory_order_relaxed in the C++ view. However, in practice you might want to access an atomic variable via a faster non-atomic operation when it is safe (e.g. in a constructor, or under a lock).
In our experience, it does not really matter which RMW atomic operation is performed; they take almost the same number of cycles to execute (and mfence costs about half of a locked operation). You can estimate the performance of synchronization algorithms by counting the number of atomic operations (and mfences), and the number of memory indirections (cache misses).
I recommend: x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors.
Your x86 and x86_64 are indeed pretty "well behaved". In particular, they do not re-order write operations (and any speculative writes are discarded while they are in the cpu/core's write-queue), and they do not re-order read operations. However, they will start read operations as early as they can, which means that reads and writes can be re-ordered. (A read of something sitting in the write-queue reads the queued value, so reads/writes of the same location are not re-ordered.) So:
read-modify-write operations require LOCKs which makes them, implicitly, memory_order_seq_cst.
So for these operations you gain nothing by weakening the memory ordering (on the x86/x86_64). The general advice is to "keep it simple" and stick with memory_order_seq_cst, which happily is not costing anything extra for the x86 and x86_64.
For anything newer than a Pentium, if the cpu/core already has "exclusive" access to the affected memory, the LOCK does not affect other cpus/cores, and may be a relatively simple operation.
memory_order_acquire/_release do not require an mfence or any other overhead.
So, for atomic load/store, if acquire/release is sufficient, then for the x86/x86_64 those operations are "tax free".
memory_order_seq_cst does require mfence...
...which is worth understanding.
(NB: we are here talking about what the processor does with the instructions generated by the compiler. The compiler's re-ordering of operations is a very similar issue, but not addressed here.)
An mfence stalls the cpu/core until all pending writes are cleared out of the write-queue. In particular, any read operations which follow the mfence will not start until the write-queue is empty. Consider two threads:
initial state: wa = wb = 0
thread 'A' thread 'B'
wa = 1 ; (mov [wa] ← 1) wb = 1 ; (mov [wb] ← 1)
a = wb ; (mov ebx ← [wb]) b = wa ; (mov ebx ← [wa])
Left to their own devices, the x86/x86_64 can produce any of (a = 1, b = 1), (a = 0, b = 1), (a = 1, b = 0) and (a = 0, b = 0). The last is invalid if you expect memory_order_seq_cst -- since you cannot get that by any interleaving of the operations. The reason this can happen is that the writes of wa and wb are queued in the respective cpu's/core's queue, and the reads of wa and wb can both be scheduled and can both complete before either write. To achieve memory_order_seq_cst you need an mfence:
thread 'A' thread 'B'
wa = 1 ; (mov [wa] ← 1) wb = 1 ; (mov [wb] ← 1)
mfence ; mfence
a = wb ; (mov ebx ← [wb]) b = wa ; (mov ebx ← [wa])
Since there is no synchronization between the threads, the result may be anything except (a = 0, b = 0). Interestingly, the mfence is for the benefit of the thread itself, because it prevents the read operation starting before the write completes. The only thing that other threads care about is the order in which writes occur, and the x86/x86_64 does not re-order those in any case.
So, to implement memory_order_seq_cst atomic_load() and atomic_store(), it is necessary to insert an mfence after one or more stores and before a load. Where these operations are implemented as library functions, the common convention is to add the mfence to all stores, leaving the load "naked". (The logic being that loads are more common than stores, and it seems better to add the overhead to the store.)
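For reference, a compilable C++ sketch of the litmus test above (seq_cst so that the a = 0, b = 0 outcome is forbidden; with plain stores/loads or relaxed ordering it would be allowed):
#include <atomic>
#include <cstdio>
#include <thread>
std::atomic<int> wa{0}, wb{0};
int a = 0, b = 0;
int main()
{
    std::thread A([] {
        wa.store(1, std::memory_order_seq_cst); // typically mov + mfence (or xchg)
        a = wb.load(std::memory_order_seq_cst);
    });
    std::thread B([] {
        wb.store(1, std::memory_order_seq_cst);
        b = wa.load(std::memory_order_seq_cst);
    });
    A.join();
    B.join();
    std::printf("a=%d b=%d\n", a, b); // (0,0) cannot be observed; (0,1), (1,0), (1,1) all can
    return 0;
}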
For spin-locks, at least, your question seems to boil down to whether a spin-unlock operation requires an mfence, or not, and what difference it makes.
The C11 atomic_flag_clear() is, implicitly, memory_order_seq_cst, for which an mfence is required. The C11 atomic_flag_test_and_set() is not only a read-modify-write operation but is also implicitly memory_order_seq_cst -- and LOCK does that.
C11 does not offer a spin-lock in the threads.h library. But you can use an atomic_flag -- though for your x86/x86_64 you have the PAUSE instruction issue to deal with. The question is, do you need memory_order_seq_cst for this, in particular for the unlock? I think the answer is no, and that the trick is to do: atomic_flag_test_and_set_explicit(xxx, memory_order_acquire) and atomic_flag_clear_explicit(xxx, memory_order_release).
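A C++ equivalent of that trick might look like the following minimal sketch (a production lock would also add a PAUSE hint and back-off):
#include <atomic>
struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin; on x86 this is a locked xchg in a loop, no mfence needed
        }
    }
    void unlock()
    {
        flag.clear(std::memory_order_release); // plain mov on x86, no mfence
    }
};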
FWIW, the glibc pthread_spin_unlock() does not have an mfence. Nor does the gcc __sync_lock_release() (which is explicitly a "release" operation). But the gcc __atomic_clear() is aligned with the C11 atomic_flag_clear(), and takes a memory-order parameter.
What difference does the mfence make to the unlock ? Clearly it's very disruptive to the pipe-line, and since it's not necessary, there's not much to be gained working out the exact scale of its impact, which will depend on the circumstances.
Spinlocks do not use mfence; mfence only enforces serialization/flushing of the current core's data. The fence itself does not in any way relate to atomic operations.
For a spinlock you need some kind of atomic action to exchange data with a memory location. There are many different implementations, targeted at different requirements: for example, does it work in kernel or user space? Is it a fair lock?
A very simple and dumb spinlock for x86 looks like this (my kernel uses this):
typedef volatile uint32_t _SPINLOCK __attribute__ ((aligned(16)));
static inline void _SPIN_LOCK(_SPINLOCK* lock) {
__asm volatile (
"cli\n"                // kernel context: disable interrupts while taking the lock
"lock btsl $0, %0\n"   // atomically test-and-set bit 0
"jnc 1f\n"             // bit was clear: we own the lock
"0:\n"
"pause\n"              // spin-wait hint to the CPU
"testl $1, %0\n"       // plain read: still held?
"jne 0b\n"             // yes, keep spinning without a locked operation
"lock btsl $0, %0\n"   // looks free: try to take it
"jc 0b\n"              // lost the race, spin again
"1:\n"
: "+m"(*lock)
:
: "memory", "cc"
);
}
The logic is simple
Test and set a bit: if it was zero, the lock was not taken and we got it.
If the bit is not zero, the lock is taken by someone else; pause is a hint recommended by CPU manufacturers so we don't burn the CPU in a tight loop.
Loop until you get the lock.
Note 1. You may also implement a spinlock with intrinsics and extensions; it should be fairly similar.
Note 2. A spinlock is not judged by cycles; a sane implementation should be quite fast. For instance, with the above implementation you should grab the lock on the first try in well-designed usage; if not, fix the algorithm or split the lock to prevent/reduce lock contention.
Note 3. You should also consider other things like fairness.
Re
and the cost (how many CPU cycles)?
On x86 at least, instructions that perform memory synchronization (atomic ops, fences) have a very variable CPU cycle latency. They wait for the processor store buffers to be flushed to memory, and this varies dramatically depending on the store buffer content.
E.g., if an atomic op is straight after a memcpy() that pushes multiple cache lines out to main memory, the delay may be in the 100's of nanoseconds. The same atomic op, but after a series of register-only arithmetic instructions, may take only a few clock cycles.

Does a function local static variable automatically incur a branch?

For example:
int foo()
{
static int i = 0;
return i++;
}
The variable i will only be initialized to 0 the first time foo is called. Does this automatically mean there's a hidden branch in there to keep the initialization from happening more than once? Or are there more clever tricks to avoid this?
Yes, it must incur a branch, and it must also incur at least an atomic operation for safe concurrent initialization. The Standard requires that such variables are initialized the first time control passes through their declaration, in a concurrency-safe way.
The implementation can only dodge this requirement if it can prove that lazy initialization and some earlier initialization (e.g. before main() is entered) are equivalent. For example, for simple PODs initialized from constants, the compiler may choose to initialize the variable earlier, like a file-scope global, since the difference is non-observable and it saves the lazy-initialization code; but that is a non-observable optimization.
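To make that distinction concrete, a small sketch (bar() is a hypothetical function whose result is not a constant expression):
#include <cstdlib>
int bar() { return std::rand(); } // not a constant expression
int constant_init()
{
    static int i = 42;    // constant initialization: no guard variable, no branch
    return i++;
}
int dynamic_init()
{
    static int j = bar(); // dynamic initialization: guard variable + branch (+ __cxa_guard_* on the first call)
    return j++;
}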
Yes, there is a branch. Each time the function is entered, the code must check if the variable has already been initialized. But as will be explained below, you usually do not have to care about this branch.
Example
Check out this code:
#include <iostream>
struct Foo { Foo(){ std::cout << "FOO" << std::endl;} };
void foo(){ static Foo foo; }
int main(){ foo();}
Now, here is the first part of assembly code that gcc4.8 generates for the foo function:
_Z3foov:
.LFB974:
.cfi_startproc
.cfi_personality 0x3,__gxx_personality_v0
.cfi_lsda 0x3,.LLSDA974
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
pushq %r12
pushq %rbx
.cfi_offset 12, -24
.cfi_offset 3, -32
movl $_ZGVZ3foovE3foo, %eax
movzbl (%rax), %eax
testb %al, %al
jne .L7 <------------------- FIRST CHECK
movl $_ZGVZ3foovE3foo, %edi
call __cxa_guard_acquire <------------------- LOCK
testl %eax, %eax
setne %al
testb %al, %al
je .L7 <------------------- SECOND CHECK
movl $0, %r12d
movl $_ZZ3foovE3foo, %edi
As you see, there is a jne! Then a guard is acquired using __cxa_guard_acquire, followed by a je. Thus, it seems that the compiler is generating the famous double-checked locking pattern here.
Will every compiler generate a branch?
I am pretty sure the spec does NOT mandate that a branch or double checked locking must be used. It just mandates that the initialization must be thread safe. However, I do not see a way to perform a thread safe initialization without a branch. Thus, even though the spec does not mandate it, it is simply not possible with current CPU architectures to omit the branch here.
Is the branch expensive?
Considering whether you should care about this branch:
You should definitely NOT care about this branch, since it will be correctly predicted (once the object is initialized, the branch always takes the same route). Thus, the branch is almost free. Trying to avoid a static local variable for optimization purposes should never yield any observable performance benefit.
Is there really no way around the branch?
If the constructor is not observable, like simple initialization with constant values, then it may be performed eagerly at program startup and the branch is omitted. If, however, it is observable, then things get pretty tricky:
The only possibility I see is stated in the answer of R. Martinho Fernandes (which has been deleted): the code could modify itself, i.e., simply remove the initialization code once the initialization is done. However, this idea is impractical for the following reasons:
Self-modifying code is very hard to get thread-safe.
Usually, memory flagged executable is write protected so code is not allowed to rewrite itself.
It is just not worth it, as the branch is not expensive (see above).

Is std::atomic_compare_exchange_weak thread-unsafe by design?

It was brought up on cppreference atomic_compare_exchange Talk page that the existing implementations of std::atomic_compare_exchange_weak compute the boolean result of the CAS with a non-atomic compare instruction, e.g.
lock
cmpxchgq %rcx, (%rsp)
cmpq %rdx, %rax
which (Edit: apologies for the red herring)
break CAS loops such as Concurrency in Action's listing 7.2:
while(!head.compare_exchange_weak(new_node->next, new_node));
The specification (29.6.5[atomics.types.operations.req]/21-22) seems to imply that the result of the comparison must be a part of the atomic operation:
Effects: atomically compares ...
Returns: the result of the comparison
but is it actually implementable? Should we file bug reports to the vendors or to the LWG?
TL;DR: atomic_compare_exchange_weak is safe by design, but actual implementations are buggy.
Here's the code that Clang actually generates for this little snippet:
struct node {
int data;
node* next;
};
std::atomic<node*> head;
void push(int data) {
node* new_node = new node{data};
new_node->next = head.load(std::memory_order_relaxed);
while (!head.compare_exchange_weak(new_node->next, new_node,
std::memory_order_release, std::memory_order_relaxed)) {}
}
Result:
movl %edi, %ebx
# Allocate memory
movl $16, %edi
callq _Znwm
movq %rax, %rcx
# Initialize with data and 0
movl %ebx, (%rcx)
movq $0, 8(%rcx) ; dead store, should have been optimized away
# Overwrite next with head.load
movq head(%rip), %rdx
movq %rdx, 8(%rcx)
.align 16, 0x90
.LBB0_1: # %while.cond
# =>This Inner Loop Header: Depth=1
# put value of head into comparand/result position
movq %rdx, %rax
# atomic operation here, compares second argument to %rax, stores first argument
# in second if same, and second in %rax otherwise
lock
cmpxchgq %rcx, head(%rip)
# unconditionally write old value back to next - wait, what?
movq %rax, 8(%rcx)
# check if cmpxchg modified the result position
cmpq %rdx, %rax
movq %rax, %rdx
jne .LBB0_1
The comparison is perfectly safe: it's just comparing registers. However, the whole operation is not safe.
The critical point is this: the description of compare_exchange_(weak|strong) says:
Atomically [...] if true, replaces the contents of the memory pointed to by this with that in desired, and if false, updates the contents of the memory in expected with the contents of the memory pointed to by this
Or in pseudo-code:
if (*this == expected)
*this = desired;
else
expected = *this;
Note that expected is only written to if the comparison is false, and *this is only written to if comparison is true. The abstract model of C++ does not allow an execution where both are written to. This is important for the correctness of push above, because if the write to head happens, suddenly new_node points to a location that is visible to other threads, which means other threads can start reading next (by accessing head->next), and if the write to expected (which aliases new_node->next) also happens, that's a race.
And Clang writes to new_node->next unconditionally. In the case where the comparison is true, that's an invented write.
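To make the race concrete, consider a reader running alongside push, using the node and head definitions from the snippet above (peek_next_data is a hypothetical reader, not part of the listing): once the CAS has published new_node, its plain read of next races with the invented write to expected, which aliases new_node->next.
// Hypothetical reader running concurrently with push() above.
int peek_next_data() {
    node* n = head.load(std::memory_order_acquire);
    // Non-atomic read of next: a data race if push() still writes
    // new_node->next after the node has been published via head.
    node* m = n ? n->next : nullptr;
    return m ? m->data : -1;
}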
This is a bug in Clang. I don't know whether GCC does the same thing.
In addition, the wording of the standard is suboptimal. It claims that the entire operation must happen atomically, but this is impossible, because expected is not an atomic object; writes to there cannot happen atomically. What the standard should say is that the comparison and the write to *this happen atomically, but the write to expected does not. But this isn't that bad, because no one really expects that write to be atomic anyway.
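To see why the write to expected cannot literally be atomic, note that expected refers to an ordinary, non-atomic object; a minimal sketch (variable names are mine):
#include <atomic>

std::atomic<int> a{0};

void attempt() {
    int expected = 0;   // plain int, not std::atomic<int>
    // The library has no way to update 'expected' atomically: it is a
    // non-atomic object, accessed with ordinary loads and stores.
    a.compare_exchange_weak(expected, 1);
}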
So there should be a bug report for Clang (and possibly GCC), and a defect report for the standard.
I was the one who originally found this bug. For the last few days I have been e-mailing Anthony Williams regarding this issue and vendor implementations. I didn't realize Cubbi had raised a Stack Overflow question. It's not just Clang or GCC: every vendor is broken (all that matter, anyway). Anthony Williams, also the author of Just::Thread (a C++11 thread and atomic library), confirmed his library is implemented correctly (the only known correct implementation).
Anthony has raised a GCC bug report http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272
Simple example:
#include <atomic>
struct Node { Node* next; };
void Push(std::atomic<Node*>& head, Node* node)
{
node->next = head.load();
while(!head.compare_exchange_weak(node->next, node))
;
}
g++ 4.8 [assembler]
mov rdx, rdi
mov rax, QWORD PTR [rdi]
mov QWORD PTR [rsi], rax
.L3:
mov rax, QWORD PTR [rsi]
lock cmpxchg QWORD PTR [rdx], rsi
mov QWORD PTR [rsi], rax !!!!!!!!!!!!!!!!!!!!!!!
jne .L3
rep; ret
clang 3.3 [assembler]
movq (%rdi), %rcx
movq %rcx, (%rsi)
.LBB0_1:
movq %rcx, %rax
lock
cmpxchgq %rsi, (%rdi)
movq %rax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpq %rcx, %rax !!!!!!!!!!!!!!!!!!!!!!!
movq %rax, %rcx
jne .LBB0_1
ret
icc 13.0.1 [assembler]
movl %edx, %ecx
movl (%rsi), %r8d
movl %r8d, %eax
lock
cmpxchg %ecx, (%rdi)
movl %eax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpl %eax, %r8d !!!!!!!!!!!!!!!!!!!!!!!
je ..B1.7
..B1.4:
movl %edx, %ecx
movl %eax, %r8d
lock
cmpxchg %ecx, (%rdi)
movl %eax, (%rsi) !!!!!!!!!!!!!!!!!!!!!!!
cmpl %eax, %r8d !!!!!!!!!!!!!!!!!!!!!!!
jne ..B1.4
..B1.7:
ret
Visual Studio 2012 [No need to check assembler, MS uses _InterlockedCompareExchange !!!]
inline int _Compare_exchange_seq_cst_4(volatile _Uint4_t *_Tgt, _Uint4_t *_Exp, _Uint4_t _Value)
{ /* compare and exchange values atomically with
sequentially consistent memory order */
int _Res;
_Uint4_t _Prev = _InterlockedCompareExchange((volatile long
*)_Tgt, _Value, *_Exp);
if (_Prev == *_Exp) !!!!!!!!!!!!!!!!!!!!!!!
_Res = 1;
else
{ /* copy old value */
_Res = 0;
*_Exp = _Prev;
}
return (_Res);
}
[...]
break CAS loops such as Concurrency in Action's listing 7.2:
while(!head.compare_exchange_weak(new_node->next, new_node));
The specification (29.6.5[atomics.types.operations.req]/21-22) seems to
imply that the result of the comparison must be a part of the atomic
operation:
[...]
The issue with this code and the specification is not whether the atomicity of compare_exchange needs to extend beyond just the comparison and exchange itself to returning the result of the comparison or assigning to the expected parameter. That is, the code may still be correct without the store to expected being atomic.
What makes the above code potentially racy is that implementations write to the expected parameter after a successful exchange, by which point the exchange may already have been observed by other threads. The code is written with the expectation that, when the exchange is successful, there is no write to expected that could produce a race.
The spec, as written, does appear to guarantee this expected behavior. (And indeed can be read as making the much stronger guarantee you describe, that the entire operation is atomic.) According to the spec, compare_exchange_weak:
Atomically, compares the contents of the memory pointed to by object
or by this for equality with that in expected, and if true, replaces
the contents of the memory pointed to by object or by this with that
in desired, and if false, updates the contents of the memory in
expected with the contents of the memory pointed to by object or by
this. [n4140 § 29.6.5 / 21] (N.B. The wording is unchanged between C++11 and C++14)
The problem is that it seems as though the actual language of the standard is stronger than the original intent of the proposal. Herb Sutter is saying that Concurrency in Action's usage was never really intended to be supported, and that updating expected was only intended to be done on local variables.
I don't see any current defect report on this. [See second update below] If in fact this language is stronger than intended then presumably one will get filed. Either C++11's wording will be updated to guarantee the above code's expected behavior, thus making current implementations non-conformant, or the new wording will not guarantee this behavior, making the above code potentially result in undefined behavior. In that case I guess Anthony's book will need updating. What the committee will do about this, and whether or not actual implementations conform to the original intent (rather than the actual wording of the spec) is still an open question. [See update below]
For the purposes of writing code in the meantime, you'll have to take into account the actual behavior of implementations, whether it's conformant or not. Existing implementations may be 'buggy' in the sense that they don't implement the exact wording of the ISO spec, but they do operate as their implementers intended and they can be used to write thread safe code. [See update below]
So to answer your questions directly:
but is it actually implementable?
I believe that the actual wording of the spec is not reasonably implementable (and that the actual wording makes guarantees even stronger than Anthony's just::thread library provides; for example, the actual wording appears to require atomic operations on a non-atomic object). Anthony's slightly weaker interpretation, that the assignment to expected need not be atomic but must be conditioned on the failure of the exchange, is obviously implementable. Herb's even weaker interpretation is also obviously implementable, as that's what most libraries actually implement. [See update below]
Is std::atomic_compare_exchange_weak thread-unsafe by design?
The operation is not thread unsafe no matter whether the operation makes guarantees as strong as the actual wording of the spec or as weak as Herb Sutter indicates. It's simply that correct, thread safe usage of the operation depends on what is guaranteed. The example code from Concurrency in Action is an unsafe usage of a compare_exchange that only offers Herb's weak guarantee, but it could be written to work correctly with Herb's implementation. That could be done like so:
node *expected_head = head.load();
while(!head.compare_exchange_weak(expected_head, new_node)) {
new_node->next = expected_head;
}
With this change the 'spurious' writes to expected are simply made to a local variable, and no longer produce any races. The write to new_node->next is now conditional upon the exchange having failed, and thus new_node->next is not visible to any other thread and may be safely updated. This code sample is safe both under current implementations and under stronger guarantees, so it should be future proof to any updates to C++11's atomics that resolve this issue.
Update:
Actual implementations (MSVC, gcc, and clang at least) have been updated to offer the guarantees under Anthony Williams' interpretation; that is, they have stopped inventing writes to expected in the case that the exchange succeeds.
https://llvm.org/bugs/show_bug.cgi?id=18899
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60272
https://connect.microsoft.com/VisualStudio/feedback/details/819819/std-atomic-compare-exchange-weak-has-spurious-write-which-can-cause-race-conditions
Update 2:
A defect report on this issue has been filed with the C++ committee. From the currently proposed resolution, the committee does want to make stronger guarantees than provided by the implementations you checked (but not as strong as the current wording, which appears to guarantee atomic operations on non-atomic objects). The draft for the next C++ standard (C++1z or 'C++17') has not yet adopted the improved wording.
Update 3: C++17 adopted the proposed resolution.
Those people don't seem to understand either the standard or the instructions.
First of all, std::atomic_compare_exchange_weak is not thread-unsafe by design. That is complete nonsense.
The design very clearly defines what the function does and which guarantees (including atomicity and memory ordering) it must provide.
Whether your program that uses this function is thread-safe as a whole is a different matter, but the function's semantics per se are certainly correct in the sense of an atomic compare-exchange (you can still write thread-unsafe code using any available thread-safe primitive, but that is a totally different story).
This particular function implements the "weak" version of a thread-safe compare-exchange operation which differs from the "non weak" version in that the implementation is allowed to generate code which may spuriously fail, if that gives a performance benefit (irrelevant on x86). Weak does not mean it's worse, it only means that it is allowable to fail more often on some platforms, if that gives an overall performance benefit.
The implementation is of course still required to work correctly. That is, if the compare-exchange fails -- whether by concurrency or spuriously -- it must be correctly reported back as having failed.
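As a sketch of why spurious failure is acceptable in practice: the weak form is meant to be used in a retry loop, where a spurious failure just costs one extra iteration (fetch_multiply below is a made-up helper, not a standard function).
#include <atomic>

// Atomically multiply 'counter' by 'factor' using a CAS loop.
// compare_exchange_weak may fail spuriously, but the loop simply retries;
// on failure, 'old' is reloaded with the current value automatically.
int fetch_multiply(std::atomic<int>& counter, int factor) {
    int old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old * factor)) {
        // retry with the updated 'old'
    }
    return old;
}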
Second, the code generated by existing implementations has no bearing on the correctness or thread-safety of std::atomic_compare_exchange_weak. At best, if the generated instructions do not work correctly, this is an implementation issue, but it has nothing to do with the language construct. The language standard defines what behavior an implementation must provide; it is not responsible for implementations actually doing it correctly.
Third, there is no problem in the generated code. The x86 CMPXCHG instruction has a well-defined mode of operation. It compares the actual value with the expected value, and if the comparison is successful, it performs the swap. You know whether or not the operation was successful either by looking at EAX (or RAX in x64) or by the state of ZF.
What matters is that the atomic compare-exchange is atomic, and that's the case. Whatever you do with the result afterwards needs not be atomic (in your case, the CMP), since the state does not change any more. Either the swap was successful at that point, or it has failed. In either case, it's already "history".
std::atomic_compare_exchange_weak has different semantics than the underlying instruction, it returns a bool value. Therefore, you cannot always expect a 1:1 mapping to instructions. The compiler may have to generate additional instructions (and different ones depending on how you consume the result) to implement these semantics, but it really makes no difference for correctness.
The only thing one could arguably complain about is the fact that instead of directly using the already present state of ZF (with a Jcc or CMOVcc), it performs another comparison. But this is a performance issue (1 cycle wasted), not a correctness issue.
Quoting Duncan Forster from the linked page:
The important thing to remember is that the hardware implementation of CAS only returns 1 value (the old value) not two (old plus boolean)
So there's one instruction - the (atomic) CAS - which actually operates on memory, and then another instruction to convert the (atomically-assigned) result into the expected boolean.
Since the value in %rax was set atomically and can't then be affected by another thread, there is no race here.
The quote is false anyway, since ZF is also set depending on the CAS result (i.e., it does return both the old value and the boolean). The fact that the flag isn't used might be a missed optimisation, or the cmpq might be faster, but it doesn't affect correctness.
For reference, consider decomposing compare_exchange_weak like this pseudocode:
T compare_exchange_weak_value(atomic<T> *obj, T *expected, T desired) {
// setup ...
lock cmpxchgq %rcx, (%rsp) // actual CAS
return %rax; // actual destination value
}
bool compare_exchange_weak_bool(atomic<T> *obj, T *expected, T desired) {
// CAS is atomic
T actual = compare_exchange_weak_value(obj, expected, desired);
// now we figure out if it worked
return actual == *expected;
}
Do you agree the CAS is properly atomic?
If the unconditional store to expected is really what you wanted to ask about (instead of the perfectly safe comparison), I agree with Sebastian that it's a bug.
For reference, you can work around it by forcing the unconditional store into a local, and making the potentially-visible store conditional again:
struct node {
int data;
node* next;
};
std::atomic<node*> head;
void push(int data) {
node* new_node = new node{data};
node* cur_head = head.load(std::memory_order_relaxed);
do {
new_node->next = cur_head;
} while (!head.compare_exchange_weak(cur_head, new_node,
std::memory_order_release, std::memory_order_relaxed));
}
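For completeness, here is a minimal sketch exercising this corrected push from several threads (thread count and values are arbitrary; nodes are deliberately leaked to keep the example short):
#include <atomic>
#include <thread>
#include <vector>

struct node {
    int data;
    node* next;
};

std::atomic<node*> head{nullptr};

void push(int data) {
    node* new_node = new node{data};
    node* cur_head = head.load(std::memory_order_relaxed);
    do {
        new_node->next = cur_head;   // not yet visible to other threads
    } while (!head.compare_exchange_weak(cur_head, new_node,
                                         std::memory_order_release,
                                         std::memory_order_relaxed));
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([t] {
            for (int i = 0; i < 1000; ++i)
                push(t * 1000 + i);
        });
    for (auto& th : threads)
        th.join();
    // head now points at a stack of 4000 nodes.
}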