I've got some code that employs __sync_val_compare_and_swap in my cross-platform shared_mutex implementation. The Windows version (#ifdef'd) uses the _InterlockedCompareExchange intrinsic, which has full physical (affecting the ordering of loads and stores by the CPU) and logical (affecting the compiler) acquire and release behaviour. I can't find any documentation saying that the __sync_val_compare_and_swap intrinsic also has a logical effect on the ordering of loads and stores by the compiler. So is there any "intrinsic" that forces the compiler to apply a logical acquire or release barrier? I know there's asm volatile("" ::: "memory"); with gcc, but that has both acquire and release behaviour, i.e. it is a full compiler barrier.
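(For illustration, a minimal sketch of what a one-sided, compiler-only barrier looks like on gcc/clang; the function and variable names are my own, not from the original code. std::atomic_signal_fence, or the underlying __atomic_signal_fence builtin, emits no instructions and only constrains compile-time reordering, and unlike the asm statement it accepts one-sided memory orders.)

#include <atomic>

int payload;                 // hypothetical data published by another thread
std::atomic<int> flag;       // hypothetical ready flag

int consume_with_compiler_acquire() {
    int f = flag.load(std::memory_order_relaxed);
    // Compiler-only acquire: the compiler may not hoist later loads/stores
    // above this point (on gcc/clang at least); it does NOT order the CPU.
    std::atomic_signal_fence(std::memory_order_acquire);
    return f ? payload : 0;
}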
Related
I am looking at the generated assembly for my code (using Visual Studio 2017) and noticed that _mm_load_ps is often (always?) compiled to movups.
The data I'm using _mm_load_ps on is defined like this:
struct alignas(16) Vector {
    float v[4];
};
// often embedded in other structs like this
struct AABB {
Vector min;
Vector max;
bool intersection(/* parameters */) const;
}
Now when I use this construct, the following happens:
// this code
__m128 bb_min = _mm_load_ps(min.v);
// generates this
movups xmm4, XMMWORD PTR [r8]
I'm expecting movaps because of alignas(16). Do I need something else to convince the compiler to use movaps in this case?
EDIT: My question is different from this question because I'm not getting any crashes. The struct is specifically aligned and I'm also using aligned allocation. Rather, I'm curious why the compiler is turning _mm_load_ps (the intrinsic for aligned memory) into movups. If I know the struct was allocated at an aligned address and I'm accessing it via this, it would be safe to use movaps, right?
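(For context, a minimal sketch of what the aligned allocation could look like; these helpers are illustrative, not the original code. With C++17, operator new honours the alignas(16) on an over-aligned type such as AABB; on older toolchains _mm_malloc is an explicit alternative.)

#include <xmmintrin.h>   // _mm_malloc / _mm_free
#include <new>           // placement new

AABB* make_aabb_cpp17() {
    return new AABB{};   // C++17 aligned new: storage is 16-byte aligned
}

AABB* make_aabb_legacy() {
    void* mem = _mm_malloc(sizeof(AABB), 16);   // explicitly 16-byte aligned
    return new (mem) AABB{};                    // must later be freed with _mm_free
}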
On recent versions of Visual Studio and the Intel Compiler (recent as in post-2013?), the compiler rarely generates aligned SIMD loads/stores anymore.
When compiling for AVX or higher:
The Microsoft compiler (>VS2013?) doesn't generate aligned loads. But it still generates aligned stores.
The Intel compiler (> Parallel Studio 2012?) doesn't do it at all anymore. But you'll still see them in ICC-compiled binaries inside their hand-optimized libraries like memset().
As of GCC 6.1, it still generates aligned load/stores when you use the aligned intrinsics.
The compiler is allowed to do this because it's not a loss of functionality when the code is written correctly. All processors starting from Nehalem have no penalty for unaligned load/stores when the address is aligned.
Microsoft's stance on this issue is that it "helps the programmer by not crashing". Unfortunately, I can't find the original source for this statement from Microsoft anymore. In my opinion, this achieves the exact opposite of that because it hides misalignment penalties. From the correctness standpoint, it also hides incorrect code.
Whatever the case is, unconditionally using unaligned load/stores does simplify the compiler a bit.
New Revelations:
Starting with Parallel Studio 2018, the Intel Compiler no longer generates aligned moves at all - even for pre-Nehalem targets.
Starting from Visual Studio 2017, the Microsoft Compiler also no longer generates aligned moves at all - even when targeting pre-AVX hardware.
Both cases result in inevitable performance degradation on older processors. But it seems that this is intentional as both Intel and Microsoft no longer care about old processors.
The only load/store intrinsics that are immune to this are the non-temporal load/stores. There is no unaligned equivalent of them, so the compiler has no choice.
So if you just want to test your code for correctness, you can swap the regular load/store intrinsics for the non-temporal ones. But be careful not to let something like this slip into production code, since NT load/stores (NT stores in particular) are a double-edged sword that can hurt you if you don't know what you're doing.
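(As a sketch of that debugging trick - my own example, not from the original answer: NT loads/stores require a 16-byte-aligned address and fault on misalignment, unlike the movups the compiler now emits, so swapping them in surfaces alignment bugs immediately.)

#include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128

#ifdef CHECK_ALIGNMENT
static inline __m128 load_checked(const float* p) {
    // movntdqa: faults if p is not 16-byte aligned
    __m128i raw = _mm_stream_load_si128(
        reinterpret_cast<__m128i*>(const_cast<float*>(p)));
    return _mm_castsi128_ps(raw);
}
#else
static inline __m128 load_checked(const float* p) {
    return _mm_load_ps(p);   // may be emitted as movups anyway
}
#endif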
I'm trying to use the swp instruction to implement an atomic swap.
asm volatile ("swp %[newval], %[newval], [%[oldval]]"
: [newval] "+r" (newval), [oldval] "+p" (oldval)
:
: "memory");
When I compile the code (using g++ main.cpp -o main -march=armv8-a), I get the following error message:
/tmp/cc0MHTHA.s: Assembler messages:
/tmp/cc0MHTHA.s:20: Error: selected processor does not support `swp x1,x1,[x0]'
The ARM machine I'm using is ARMv8; /proc/cpuinfo looks like this (it's an SMP machine with 16 cores; the information for the other processors is the same apart from the first line):
processor : 0
model name : phytium FT1500a
flags : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x70
CPU architecture: 8
CPU variant : 0x1
CPU part : 0x660
bogomips : 3590.55
CPU revision : 1
g++ --version outputs
g++ (Ubuntu/Linaro 4.9.1-16kord6) 4.9.1
Copyright (C) 2014 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I'm getting the following errors when using the ldrex/strex instructions:
/tmp/ccXxJgQH.s: Assembler messages:
/tmp/ccXxJgQH.s:19: Error: unknown mnemonic `ldrex' -- `ldrex x0,[x0]'
Can anyone explain why this error occurs and how to deal with it? Does the machine not support swp, or should I add some option (maybe -march) to the compile command to indicate the CPU architecture?
You don't need (and shouldn't use) inline assembly for this.
Use a gcc builtin: type __atomic_exchange_n (type *ptr, type val, int memorder) or C++11 std::atomic for this, so the compiler can use the best instruction sequence for the target CPU, based on your -mcpu= command line option and whether you're building for 64-bit or 32-bit ARM (or x86), etc. etc. Also, the compiler understands what you're doing and can optimize accordingly.
// static inline
int xchg_gcc(int *p, int newval) {
    int oldval = __atomic_exchange_n(p, newval, __ATOMIC_SEQ_CST);
    //__atomic_signal_fence ( __ATOMIC_SEQ_CST);
    return oldval;
}
For ARM64 and ARM (32-bit with -mcpu=cortex-a72) with gcc5.4, this compiles to what you want (Godbolt compiler explorer):
.L2: ## This is the ARM64 version.
ldaxr w2, [x0]
stlxr w3, w1, [x0]
cbnz w3, .L2
mov w0, w2 # This insn will optimize away after inlining, leaving just the retry loop
ret
Or if you just want atomicity but don't need ordering wrt other operations, then use __ATOMIC_RELAXED instead of __ATOMIC_SEQ_CST. Then it compiles to ldxr / stxr, instead of the acquire/release version of the LL/SC instructions.
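(For example, a relaxed sketch mirroring the function above - my illustration:)

// Atomicity only, no ordering: on AArch64 this should compile to a plain
// ldxr/stxr retry loop instead of ldaxr/stlxr.
int xchg_gcc_relaxed(int *p, int newval) {
    return __atomic_exchange_n(p, newval, __ATOMIC_RELAXED);
}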
For the 32-bit version, if you don't specify a -mcpu or -march, it calls library functions because it doesn't know what to use for exchange.
I'm not sure if SEQ_CST for the __atomic_exchange builtin orders with respect to non-atomic things the way asm volatile("":::"memory") does; if not you might need fences as described below for C++11 atomic_signal_fence.
Or use this portable C++11 version, which compiles to the same asm:
#include <atomic>
// static inline
int xchg_stdatomic(std::atomic<int> *p, int newval) {
    std::atomic_signal_fence(std::memory_order_seq_cst);
    int oldval = p->exchange(newval, std::memory_order_seq_cst);
    std::atomic_signal_fence(std::memory_order_seq_cst); // order WRT non-atomic variables (on gcc/clang at least)
    return oldval;
}
atomic_signal_fence is used as an equivalent of asm("":::"memory"), to block compile-time reordering with non-atomic loads/stores (but without emitting any instructions). This is how gcc implements it, but IDK if that's required by the standard or just an implementation detail in gcc.
In gcc at least, atomic_signal_fence orders operations on "normal" variables, but atomic_thread_fence only orders operations on atomic variables. (Shared access to non-atomic variables from multiple threads would be an undefined-behaviour data race, so gcc assumes it doesn't happen. The question here is whether the standard requires signal_fence to order non-atomic operations along with atomic and volatile accesses, because the guarantees about what you can safely access in signal handlers are quite weak.)
Anyway, since signal_fence compiles to no instruction, and is only blocking reordering that we want exchange() to block anyway, there's no harm. (Unless you don't want exchange() to order your non-shared variables, in which case you shouldn't use signal_fence).
swp is supported but deprecated in ARMv6 and ARMv7. ARM's docs say that it increases interrupt latency (because swp itself is not interruptible). Also,
In a multi-core system, preventing access to main memory for all processors for the duration of a swap instruction can reduce overall system performance.
I've been trying to search on Google but couldn't find anything useful.
#include <cstdint>

typedef int64_t v4si __attribute__ ((vector_size(32)));
// warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
// So isn't AVX already automatically enabled?
// What does "without AVX enabled" mean?
// What does "changes the ABI" mean?
inline v4si v4si_gt0(v4si x_);

// warning: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
// So why is there a warning, and what does it mean?
// Why does only this parameter get a warning,
// while all the other v4si parameters/arguments get none?
void set_quota(v4si quota);
That's not legacy code. __attribute__ ((vector_size(32))) means a 32-byte vector, i.e. 256 bits, which (on x86) means AVX. (GNU C Vector Extensions)
AVX isn't enabled unless you use -mavx (or a -march setting that includes it). Without that, the compiler isn't allowed to generate code that uses AVX instructions, because those would trigger an illegal-instruction fault on older CPUs that don't support AVX.
So the compiler can't pass or return 256b vectors in registers, like the normal calling convention specifies. Probably it treats them the same as structs of that size passed by value.
See the ABI links in the x86 tag wiki, or the x86 Calling Conventions page on Wikipedia (mostly doesn't mention vector registers).
Since the GNU C Vector Extensions syntax isn't tied to any particular hardware, using a 32 byte vector will still compile to correct code. It will perform badly, but it will still work even if the compiler can only use SSE instructions. (Last I saw, gcc was known to do a very bad job of generating code to deal with vectors wider than the target machine supports. You'd get significantly better code for a machine with 16B vectors from using vector_size(16) manually.)
Anyway, the point is that you get a warning instead of a compiler error because __attribute__ ((vector_size(32))) doesn't imply AVX specifically, but AVX or some other 256b vector instruction set is required for it to compile to good code.
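(A sketch of the two usual fixes - my example, not from the original answer: either enable AVX for the whole translation unit with -mavx / a suitable -march, or enable it per function with GCC's target attribute; the typedef and function names below are illustrative.)

// Build with e.g.:  g++ -O2 -mavx file.cpp   (or -march=native on an AVX machine)
#include <cstdint>

typedef int64_t v4di __attribute__ ((vector_size(32)));   // 4 x int64_t = 256 bits

__attribute__((target("avx")))          // alternative: enable AVX for just this function
v4di add_v4di(v4di a, v4di b) {
    // With AVX enabled, the 256-bit arguments and return value can travel in
    // ymm registers, which is the ABI the -Wpsabi warning refers to.
    return a + b;
}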
The CUDA PTX Guide describes the instructions 'atom' and 'red', which perform atomic and non-atomic reductions. This is news to me (at least with respect to non-atomic reductions)... I remember learning how to do reductions with SHFL a while back. Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Are these instructions reflected or wrapped somehow in CUDA runtime APIs? Or some other way accessible with C++ code without actually writing PTX code?
Most of these instructions are reflected in atomic operations (built-in intrinsics) described in the programming guide. If you compile any of those atomic intrinsics, you will find atom or red instructions emitted by the compiler at the PTX or SASS level in your generated code.
The red instruction type will generally be used when you don't explicitly use the return value from one of the atomic intrinsics. If you use the return value explicitly, then the compiler usually emits the atom instruction.
Thus, it should be clear that this instruction by itself does not perform a complete classical parallel reduction, but certainly could be used to implement one if you wanted to depend on atomic hardware (and associated limitations) for your reduction operations. This is generally not the fastest possible implementation for parallel reductions.
If you want direct access to these instructions, the usual advice would be to use inline PTX where desired.
As requested, to elaborate using atomicAdd() as an example:
If I perform the following:
atomicAdd(&x, data);
perhaps because I am using it for a typical atomic-based reduction into the device variable x, then the compiler would emit a red (PTX) or RED (SASS) instruction taking the necessary arguments (the pointer to x and the variable data, i.e. 2 logical registers).
If I perform the following:
int offset = atomicAdd(&buffer_ptr, buffer_size);
perhaps because I am using it not for a typical reduction but instead to reserve a space (buffer_size) in a buffer shared amongst various threads in the grid, which has an offset index (buffer_ptr) to the next available space in the shared buffer, then the compiler would emit an atom (PTX) or ATOM (SASS) instruction, including 3 arguments (offset, &buffer_ptr, and buffer_size, in registers).
The red form can be issued by the thread/warp, which can then continue without normally stalling, because the instruction usually creates no dependencies for subsequent instructions. The atom form, OTOH, implies modification of one of its 3 arguments (one of the 3 logical registers). Subsequent use of the data in that register (i.e. the return value of the intrinsic, offset in this case) can therefore stall the thread/warp until the return value is actually delivered by the atomic hardware.
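(To make the two cases concrete, a small sketch - my own illustration with hypothetical kernel and variable names; inspecting the PTX with nvcc -ptx should show red for the first atomicAdd and atom for the second.)

__device__ int x;            // reduction target
__device__ int buffer_ptr;   // index of the next free slot in the shared buffer
__device__ int buffer[1 << 20];

__global__ void demo(const int* data, int buffer_size) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Return value ignored -> the compiler can emit red.global.add
    atomicAdd(&x, data[tid]);

    // Return value used -> the compiler emits atom.global.add
    int offset = atomicAdd(&buffer_ptr, buffer_size);
    buffer[offset] = data[tid];   // this thread now owns [offset, offset + buffer_size)
}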