Why does MSVC generate nop instructions for atomic loads on x64? - c++

If you compile code such as
#include <atomic>
int load(std::atomic<int> *p) {
return p->load(std::memory_order_acquire) + p->load(std::memory_order_acquire);
}
you see that MSVC generates NOP padding after each memory load:
int load(std::atomic<int> *) PROC
mov edx, DWORD PTR [rcx]
npad 1
mov eax, DWORD PTR [rcx]
npad 1
add eax, edx
ret 0
Why is this? Is there any way to avoid it without relaxing the memory order (which would affect the correctness of the code)?

p->load() may eventually use the _ReadWriteBarrier compiler intrinsic.
According to this: https://developercommunity.visualstudio.com/t/-readwritebarrier-intrinsic-emits-unnecessary-code/1538997
the nops get inserted because of the flag /volatileMetadata which is now on by default. You can return to the old behavior by adding /volatileMetadata-, but doing so will result in worse performance if your code is ever run emulated. It’ll still be emulated correctly, but the emulator will have to pessimistically assume every load/store needs a barrier.
And compiling with /volatileMetadata- does indeed remove the npad.

Related

improve atomic read from InterlockedCompareExchange()

Assuming architecture is ARM64 or x86-64.
I want to make sure if these two are equivalent:
a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
MyBarrier(); a = *(volatile __int64*)p; MyBarrier();
Where MyBarrier() is a memory barrier (hint) of compiler level, like __asm__ __volatile__ ("" ::: "memory").
So method 2 is supposed to be faster than method 1.
I heard that _Interlocked() functions would also imply memory barrier of both compiler and hardware level.
I heard that read (proper-aligned) intrinsic data is atomic on these architectures, but I am not sure if method 2 could be widely used?
(ps. because I think CPU will handle data dependency automatically so hardware barrier is not much considered here.)
Thank you for any advise/correction on this.
Here is some benchmarks on Ivy Bridge (i5 laptop).
(1E+006 loops: 27ms):
; __int64 a = _InterlockedCompareExchange64((__int64*)p, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR val$[rsp], rbx
(1E+006 loops: 27ms):
; __faststorefence(); __int64 a = *(volatile __int64*)p;
lock or DWORD PTR [rsp], 0
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 7ms):
; _mm_sfence(); __int64 a = *(volatile __int64*)p;
sfence
mov rcx, QWORD PTR val$[rsp]
(1E+006 loops: 1.26ms, not synchronized?):
; __int64 a = *(volatile __int64*)p;
mov rcx, QWORD PTR val$[rsp]
For the second version to be functionally equivalent, you obviously need atomic 64-bit reads, which is true on your platform.
However, _MemoryBarrier() is not a "hint to the compiler". _MemoryBarrier() on x86 prevents compiler and CPU reordering, and also ensures global visibility after the write. You also probably only need the first _MemoryBarrier(), the second one could be replaced with a _ReadWriteBarrier() unless a is also a shared variable - but you don't even need that since you are reading through a volatile pointer, which will prevent any compiler reordering in MSVC.
When you create this replacement, you basically end up with pretty much the same result:
// a = _InterlockedCompareExchange64((__int64*)&val, 0, 0);
xor eax, eax
lock cmpxchg QWORD PTR __int64 val, r8 ; val
// _MemoryBarrier(); a = *(volatile __int64*)&val;
lock or DWORD PTR [rsp], r8d
mov rax, QWORD PTR __int64 val ; val
Running these two in a loop, on my i7 Ivy Bridge laptop, gives equal results, within 2-3%.
However, with two memory barriers, the "optimized version" is actually around 2x slower.
So the better question is: Why are you using _InterlockedCompareExchange64 at all? If you need atomic access to a variable, use std::atomic, and an optimizing compiler should compile it to the most optimized version for your architecture, and add all the necessary barriers to prevent reordering and ensure cache coherency.

Understanding volatile asm vs volatile variable

We consider the following program, that is just timing a loop:
#include <cstdlib>
std::size_t count(std::size_t n)
{
#ifdef VOLATILEVAR
volatile std::size_t i = 0;
#else
std::size_t i = 0;
#endif
while (i < n) {
#ifdef VOLATILEASM
asm volatile("": : :"memory");
#endif
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
For readability, the version with both volatile variable and volatile asm reads as follow:
#include <cstdlib>
std::size_t count(std::size_t n)
{
volatile std::size_t i = 0;
while (i < n) {
asm volatile("": : :"memory");
++i;
}
return i;
}
int main(int argc, char* argv[])
{
return count(argc > 1 ? std::atoll(argv[1]) : 1);
}
Compilation under g++ 8 with g++ -Wall -Wextra -g -std=c++11 -O3 loop.cpp -o loop gives roughly the following timings:
default: 0m0.001s
-DVOLATILEASM: 0m1.171s
-DVOLATILEVAR: 0m5.954s
-DVOLATILEVAR -DVOLATILEASM: 0m5.965s
The question I have is: why is that? The default version is normal since the loop is optimized away by the compiler. But I have harder time understanding why -DVOLATILEVAR is way longer than -DVOLATILEASM since both should force the loop to run.
Compiler explorer gives the following count function for -DVOLATILEASM:
count(unsigned long):
mov rax, rdi
test rdi, rdi
je .L2
xor edx, edx
.L3:
add rdx, 1
cmp rax, rdx
jne .L3
.L2:
ret
and for -DVOLATILEVAR (and the combined -DVOLATILEASM -DVOLATILEVAR):
count(unsigned long):
mov QWORD PTR [rsp-8], 0
mov rax, QWORD PTR [rsp-8]
cmp rdi, rax
jbe .L2
.L3:
mov rax, QWORD PTR [rsp-8]
add rax, 1
mov QWORD PTR [rsp-8], rax
mov rax, QWORD PTR [rsp-8]
cmp rax, rdi
jb .L3
.L2:
mov rax, QWORD PTR [rsp-8]
ret
Why is the exact reason of that? Why does the volatile qualification of the variable prevents the compiler from doing the same loop as the one with asm volatile?
When you make i volatile you tell the compiler that something that it doesn't know about can change its value. That means it is forced to load it's value every time you use it and it has to store it every time you write to it. When i is not volatile the compiler can optimize that synchronization away.
-DVOLATILEVAR forces the compiler to keep the loop counter in memory, so the loop bottlenecks on the latency of store/reload (store forwarding), ~5 cycles + the latency of an add 1 cycle.
Every assignment to and read from volatile int i is considered an observable side-effect of the program that the optimizer has to make happen in memory, not just a register. This is what volatile means.
There's also a reload for the compare, but that's only a throughput issue, not latency. The ~6 cycle loop carried data dependency means your CPU doesn't bottleneck on any throughput limits.
This is similar to what you'd get from -O0 compiler output, so have a look at my answer on Adding a redundant assignment speeds up code when compiled without optimization for more about loops like that, and x86 store-forwarding.
With only VOLATILEASM, the empty asm template (""), has to run the right number of times. Being empty, it doesn't add any instructions to the loop, so you're left with a 2-uop add / cmp+jne loop that can run at 1 iteration per clock on modern x86 CPUs.
Critically, the loop counter can stay in a register, despite the compiler memory barrier. A "memory" clobber is treated like a call to a non-inline function: it might read or modify any object that it might possibly have a reference to, but that does not include local variables that have never had their address escape the function. (i.e. we never called sscanf("0", "%d", &i) or posix_memalign(&i, 64, 1234). But if we did, then the "memory" barrier would have to spill / reload it, because an external function could have saved a pointer to the object.
i.e. a "memory" clobber is only a full compiler barrier for objects that could possibly be visible outside the current function. This is really only an issue when messing around and looking at compiler output to see what barriers do what, because a barrier can only matter for multi-threading correctness for variables that other threads could possible have a pointer to.
And BTW, your asm statement is already implicitly volatile because it has no output operands. (See Extended-Asm#Volatile in the gcc manual).
You can add a dummy output to make a non-volatile asm statement the compiler can optimize away, but unfortunately gcc still keep the empty loop after eliminating a non-volatile asm statement from it. If i's address has escaped the function, removing the asm statement entirely turns the loop into a single compare jump over a store, right before the function returns. I think it would be legal to simply return without ever storing to that local, because there's no a correct program can know that it managed to read i from another thread before i went out of scope.
But anyway, here's the source I used. As I said, note that there's always an asm statement here, and I'm controlling whether it's volatile or not.
#include <stdlib.h>
#include <stdio.h>
#ifndef VOLATILEVAR // compile with -DVOLATILEVAR=volatile to apply that
#define VOLATILEVAR
#endif
#ifndef VOLATILEASM // Different from your def; yours drops the whole asm statement
#define VOLATILEASM
#endif
// note I ported this to also be valid C, but I didn't try -xc to compile as C.
size_t count(size_t n)
{
int dummy; // asm with no outputs is implicitly volatile
VOLATILEVAR size_t i = 0;
sscanf("0", "%zd", &i);
while (i < n) {
asm VOLATILEASM ("nop # operand = %0": "=r"(dummy) : :"memory");
++i;
}
return i;
}
compiles (with gcc4.9 and newer -O3, neither VOLATILE enabled) to this weird asm.
(Godbolt compiler explorer with gcc and clang):
# gcc8.1 -O3 with sscanf(.., &i) but non-volatile asm
# the asm nop doesn't appear anywhere, but gcc is making clunky code.
.L8:
mov rdx, rax # i, <retval>
.L3: # first iter entry point
lea rax, [rdx+1] # <retval>,
cmp rax, rbx # <retval>, n
jb .L8 #,
Nice job, gcc.... gcc4.8 -O3 avoids pulling an extra mov inside the loop:
# gcc4.8 -O3 with sscanf(.., &i) but non-volatile asm
.L3:
add rdx, 1 # i,
cmp rbx, rdx # n, i
ja .L3 #,
mov rax, rdx # i.0, i # outside the loop
Anyway, without the dummy output operand, or with volatile, gcc8.1 gives us:
# gcc8.1 with sscanf(&i) and asm volatile("nop" ::: "memory")
.L3:
nop # operand = eax # dummy
mov rax, QWORD PTR [rsp+8] # tmp96, i
add rax, 1 # <retval>,
mov QWORD PTR [rsp+8], rax # i, <retval>
cmp rax, rbx # <retval>, n
jb .L3 #,
So we see the same store/reload of the loop counter, only difference from volatile i being the cmp doesn't need to reload it.
I used nop instead of just a comment because Godbolt hides comment-only lines by default, and I wanted to see it. For gcc, it's purely a text substitution: we're looking at the compiler's asm output with operands substituted into the template before it's sent to the assembler. For clang, there might be some effect because the asm has to be valid (i.e. actually assemble correctly).
If we comment out the scanf and remove the dummy output operand, we get a register-only loop with the nop in it. But keep the dummy output operand and the nop doesn't appear anywhere.

How does reference translates to asm in gcc?

Is reference is compiled as usual pointer or it has other stuff behind?
And how does it differ in clang?
You can think of a reference as an immutable pointer that is automatically de-referenced on usage. This isn't what the C++ standard says, so you cannot rely on that being an actual implementation.
Practically speaking though, it likely to be what you see in many cases.
Take the following example in the case of parameter passing:
#include <stdio.h>
void function (int *const n){
printf("%d",*n);
}
void function (int & n){
printf("%d",n);
}
int main(){
int n = 123;
function(&n);
function(n);
}
Both gcc and clang produce identical code for the functions without any optimizations enabled:
function(int*):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
function(int&):
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov rax, QWORD PTR [rbp-8]
mov eax, DWORD PTR [rax]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
nop
leave
ret
How does reference translates to asm in gcc?
In general: It depends.
To find that out in a specific case, you can test by reading the generated assembly.
Is reference is compiled as usual pointer or it has other stuff behind?
The implementation of code using a reference is practically identical to one using a pointer to achieve identical indirection. How each are implemented, is not guaranteed by the standard, but there is no need for implementing them differently.
References only differ from pointers in the way that they are allowed to use by the rules of C++. Of course, because the rules are different, pointers can be used in a way that references can not. And in such case you cannot compare whether pointers generate the same assembly.
The limitations of a reference might make some optimizations easier, so there might be a difference, but such optimization would also have been possible with pointers, so there is no guarantee of different assembly output when using references instead of pointers.
And how does it differ in clang?
In general: It depends.
Both compilers are bound by the same rules of the standard. They might generate identical assembly or different. How the generated assembly of particular version of one compiler differs (if at all) from the assembly generated by a particular version of another compiler, with particular compilation options for each, on particular processor architecture on particular operating system, in a particular use case of a reference, can be found by inspecting and comparing the generated assembly in each particular case.

MOVAPS accesses unaligned address

For some reason one of my functions is calling an SSE instruction movaps with unaligned parameter, which causes a crash. It happens on the first line of the function, the rest is needed to be there just for crash to happen, but is ommited for clarity.
Vec3f CrashFoo(
const Vec3f &aVec3,
const float aFloat,
const Vec2f &aVec2)
{
const Vec3f vecNew =
Normalize(Vec3f(aVec3.x, aVec3.x, std::max(aVec3.x, 0.0f)));
// ...
}
This is how I call it from the debugging main:
int32_t main(int32_t argc, const char *argv[])
{
Vec3f vec3{ 0.00628005248f, -0.999814332f, 0.0182171166f };
Vec2f vec2{ 0.947231591f, 0.0522233732f };
float floatVal{ 0.010f };
Vec3f vecResult = CrashFoo(vec3, floatVal, vec2);
return (int32_t)vecResult.x;
}
This is the disassembly from the beginning of the CrashFoo function to the line where it crashes:
00007FF7A7DC34F0 mov rax,rsp
00007FF7A7DC34F3 mov qword ptr [rax+10h],rbx
00007FF7A7DC34F7 push rdi
00007FF7A7DC34F8 sub rsp,80h
00007FF7A7DC34FF movaps xmmword ptr [rax-18h],xmm6
00007FF7A7DC3503 movss xmm6,dword ptr [rdx]
00007FF7A7DC3507 movaps xmmword ptr [rax-28h],xmm7
00007FF7A7DC350B mov dword ptr [rax+18h],0
00007FF7A7DC3512 mov rdi,r9
00007FF7A7DC3515 mov rbx,rcx
00007FF7A7DC3518 movaps xmmword ptr [rax-38h],xmm8
00007FF7A7DC351D movaps xmmword ptr [rax-48h],xmm9
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
00007FF7A7DC352B xorps xmm8,xmm8
00007FF7A7DC352F comiss xmm8,xmm6
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
My understanding is that it first does the usual function call stuff and then it starts preparing the playground by saving the current content of some SSE registers (xmm6-xmm11) onto the stack so that they are free to be used by the subsequent code. The xmm* registers are stored one after another to addresses from [rax-18h] to [rax-68h], which are nicely aligned to 16 bytes since rax=0xe4d987f788, but before the xmm11 register gets stored, the rax is increased by 18h which breaks the alignment causing crash. The xorps and comiss lines is where the actual code starts (std::max's comparison with 0). When I remove std::max it works nicely.
Do you see any reason for this behaviour?
Additional info
I uploaded a small compilable example that crashes for me in my Visual Studio, but not in the IDEone.
The code is compiled in Visual Studio 2013 Update 5 (x64 release, v120). I've set the "Struct Member Alignment" setting of the project to 16 bytes, but with little improvement and there are no packing pragma in the structures that I use. The error message is:
First-chance exception at 0x00007ff7a7dc3533 in PG3Render.exe: 0xC0000005: Access violation reading location 0xffffffffffffffff.
gcc and clang are both fine, and make non-crashing non-vectorized code for your example. (Of course, I'm compiling for the Linux SysV ABI where none of the vector regs are caller-saved, so they weren't generating code to save xmm{6..15} on the stack in the first place.)
Your IDEone link doesn't demonstrate a crash either, so IDK. I there are online compile & run sites that have MSVC as an option. You can even get asm out of them if your program uses system to run a disassembler on itself. :P
The asm output you posted is guaranteed to crash, for any possible value of rax:
00007FF7A7DC3522 movaps xmmword ptr [rax-58h],xmm10
00007FF7A7DC3527 lea rax,[rax+18h]
...
00007FF7A7DC3533 movaps xmmword ptr [rax-68h],xmm11
Accounting for the LEA, the second store address is [init_rax-50h], which is only 8B offset from the earlier stores. One or the other will fault. This appears to be a compiler bug that you should report.
I have no idea why your compiler would use lea instead of add rax, 18h. It does it right before clobbering the flags with a comiss

Register keyword in C++

What is difference between
int x=7;
and
register int x=7;
?
I am using C++.
register is a hint to the compiler, advising it to store that variable in a processor register instead of memory (for example, instead of the stack).
The compiler may or may not follow that hint.
According to Herb Sutter in "Keywords That Aren't (or, Comments by Another Name)":
A register specifier has the same
semantics as an auto specifier...
According to Herb Sutter, register is "exactly as meaningful as whitespace" and has no effect on the semantics of a C++ program.
In C++ as it existed in 2010, any program which is valid that uses the keywords "auto" or "register" will be semantically identical to one with those keywords removed (unless they appear in stringized macros or other similar contexts). In that sense the keywords are useless for properly-compiling programs. On the other hand, the keywords might be useful in certain macro contexts to ensure that improper usage of a macro will cause a compile-time error rather than producing bogus code.
In C++11 and later versions of the language, the auto keyword was re-purposed to act as a pseudo-type for objects which are initialized, which a compiler will automatically replace with the type of the initializing expression. Thus, in C++03, the declaration: auto int i=(unsigned char)5; was equivalent to int i=5; when used within a block context, and auto i=(unsigned char)5; was a constraint violation. In C++11, auto int i=(unsigned char)5; became a constraint violation while auto i=(unsigned char)5; became equivalent to auto unsigned char i=5;.
With today's compilers, probably nothing. Is was orginally a hint to place a variable in a register for faster access, but most compilers today ignore that hint and decide for themselves.
register is deprecated in C++11. It is unused and reserved in C++17.
Source: http://en.cppreference.com/w/cpp/keyword/register
Almost certainly nothing.
register is a hint to the compiler that you plan on using x a lot, and that you think it should be placed in a register.
However, compilers are now far better at determining what values should be placed in registers than the average (or even expert) programmer is, so compilers just ignore the keyword, and do what they wants.
The register keyword was useful for:
Inline assembly.
Expert C/C++ programming.
Cacheable variables declaration.
An example of a productive system, where the register keyword was required:
typedef unsigned long long Out;
volatile Out out,tmp;
Out register rax asm("rax");
asm volatile("rdtsc":"=A"(rax));
out=out*tmp+rax;
It has been deprecated since C++11 and is unused and reserved in C++17.
As of gcc 9.3, compiling using -std=c++2a, register produces a compiler warning, but it still has the desired effect and behaves identically to C's register when compiling without -O1–-Ofast optimisation flags in the respect of this answer. Using clang++-7 causes a compiler error however. So yes, register optimisations only make a difference on standard compilation with no optimisation -O flags, but they're basic optimisations that the compiler would figure out even with -O1.
The only difference is that in C++, you are allowed to take the address of the register variable which means that the optimisation only occurs if you don't take the address of the variable or its aliases (to create a pointer) or take a reference of it in the code (only on - O0, because a reference also has an address, because it's a const pointer on the stack, which, like a pointer can be optimised off the stack if compiling using -Ofast, except they will never appear on the stack using -Ofast, because unlike a pointer, they cannot be made volatile and their addresses cannot be taken), otherwise it will behave like you hadn't used register, and the value will be stored on the stack.
On -O0, another difference is that const register on gcc C and gcc C++ do not behave the same. On gcc C, const register behaves like register, because block-scope consts are not optimised on gcc. On clang C, register does nothing and only const block-scope optimisations apply. On gcc C, register optimisations apply but const at block-scope has no optimisation. On gcc C++, both register and const block-scope optimisations combine.
#include <stdio.h> //yes it's C code on C++
int main(void) {
const register int i = 3;
printf("%d", i);
return 0;
}
int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3
mov eax, DWORD PTR [rbp-4]
mov esi, eax
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
register int i = 3;:
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
push rbx
sub rsp, 8
mov ebx, 3
mov esi, ebx
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
mov rbx, QWORD PTR [rbp-8] //callee restoration
leave
ret
const int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
sub rsp, 16
mov DWORD PTR [rbp-4], 3 //still saves to stack
mov esi, 3 //immediate substitution
mov edi, OFFSET FLAT:.LC0
mov eax, 0
call printf
mov eax, 0
leave
ret
const register int i = 3;
.LC0:
.string "%d"
main:
push rbp
mov rbp, rsp
mov esi, 3 //loads straight into esi saving rbx push/pop and extra indirection (because C++ block-scope const is always substituted immediately into the instruction)
mov edi, OFFSET FLAT:.LC0 // can't optimise away because printf only takes const char*
mov eax, 0 //zeroed: https://stackoverflow.com/a/6212755/7194773
call printf
mov eax, 0 //default return value of main is 0
pop rbp //nothing else pushed to stack -- more efficient than leave (rsp == rbp already)
ret
register tells the compiler to 1)store a local variable in a callee saved register, in this case rbx, and 2)optimise out stack writes if address of variable is never taken. const tells the compiler to substitute the value immediately (instead of assigning it a register or loading it from memory) and write the local variable to the stack as default behaviour. const register is the combination of these emboldened optimisations. This is as slimline as it gets.
Also, on gcc C and C++, register on its own seems to create a random 16 byte gap on the stack for the first local on the stack, which doesn't happen with const register.
Compiling using -Ofast however; register has 0 optimisation effect because if it can be put in a register or made immediate, it always will be and if it can't it won't be; const still optimises out the load on C and C++ but at file scope only; volatile still forces the values to be stored and loaded from the stack.
.LC0:
.string "%d"
main:
//optimises out push and change of rbp
sub rsp, 8 //https://stackoverflow.com/a/40344912/7194773
mov esi, 3
mov edi, OFFSET FLAT:.LC0
xor eax, eax //xor 2 bytes vs 5 for mov eax, 0
call printf
xor eax, eax
add rsp, 8
ret
Consider a case when compiler's optimizer has two variables and is forced to spill one onto stack. It so happened that both variables have the same weight to the compiler. Given there is no difference, the compiler will arbitrarily spill one of the variables. On the other hand, the register keyword gives compiler a hint which variable will be accessed more frequently. It is similar to x86 prefetch instruction, but for compiler optimizer.
Obviously register hints are similar to user-provided branch probability hints, and can be inferred from these probability hints. If compiler knows that some branch is taken often, it will keep branch related variables in registers. So I suggest caring more about branch hints, and forgetting about register. Ideally your profiler should communicate somehow with the compiler and spare you from even thinking about such nuances.