(Note: Although this question is about "store", the "load" case has the same issues and is perfectly symmetric.)
The SSE intrinsics provide an _mm_storeu_pd function with the following signature:
void _mm_storeu_pd (double *p, __m128d a);
So if I have a vector of two doubles, and I want to store it to an array of two doubles, I can just use this intrinsic.
However, my vector is not two doubles; it is two 64-bit integers, and I want to store it to an array of two 64-bit integers. That is, I want a function with the following signature:
void _mm_storeu_epi64 (int64_t *p, __m128i a);
But the intrinsics provide no such function. The closest they have is _mm_storeu_si128:
void _mm_storeu_si128 (__m128i *p, __m128i a);
The problem is that this function takes a pointer to __m128i, while my array is an array of int64_t. Writing to an object via the wrong type of pointer is a violation of strict aliasing and is definitely undefined behavior. I am concerned that my compiler, now or in the future, will reorder or otherwise optimize away the store thus breaking my program in strange ways.
To be clear, what I want is a function I can invoke like this:
__m128i v = _mm_set_epi64x(2,1);
int64_t ra[2];
_mm_storeu_epi64(&ra[0], v); // does not exist, so I want to implement it
Here are six attempts to create such a function.
Attempt #1
void _mm_storeu_epi64(int64_t *p, __m128i a) {
_mm_storeu_si128(reinterpret_cast<__m128i *>(p), a);
}
This appears to have the strict aliasing problem I am worried about.
Attempt #2
void _mm_storeu_epi64(int64_t *p, __m128i a) {
_mm_storeu_si128(static_cast<__m128i *>(static_cast<void *>(p)), a);
}
Possibly better in general, but I do not think it makes any difference in this case.
Attempt #3
void _mm_storeu_epi64(int64_t *p, __m128i a) {
union TypePun {
int64_t a[2];
__m128i v;
};
TypePun *p_u = reinterpret_cast<TypePun *>(p);
p_u->v = a;
}
This generates incorrect code on my compiler (GCC 4.9.0), which emits an aligned movaps instruction instead of an unaligned movups. (The union is aligned, so the reinterpret_cast tricks GCC into assuming p_u is aligned, too.)
Attempt #4
void _mm_storeu_epi64(int64_t *p, __m128i a) {
union TypePun {
int64_t a[2];
__m128i v;
};
TypePun *p_u = reinterpret_cast<TypePun *>(p);
_mm_storeu_si128(&p_u->v, a);
}
This appears to emit the code I want. The "type-punning via union" trick, although technically undefined in C++, is widely supported. But is this example -- where I pass a pointer to a member of the union rather than access via the union itself -- really a valid way to use the union for type-punning?
Attempt #5
void _mm_storeu_epi64(int64_t *p, __m128i a) {
p[0] = _mm_extract_epi64(a, 0);
p[1] = _mm_extract_epi64(a, 1);
}
This works and is perfectly valid, but it emits two instructions instead of one.
Attempt #6
void _mm_storeu_epi64(int64_t *p, __m128i a) {
std::memcpy(p, &a, sizeof(a));
}
This works and is perfectly valid... I think. But it emits frankly terrible code on my system. GCC spills a to an aligned stack slot via an aligned store, then manually moves the component words to the destination. (Actually it spills it twice, once for each component. Very strange.)
...
Is there any way to write this function that will (a) generate optimal code on a typical modern compiler and (b) have minimal risk of running afoul of strict aliasing?
SSE intrinsics is one of those niche corner cases where you have to push the rules a bit.
Since these intrinsics are compiler extensions (somewhat standardized by Intel), they are already outside the specification of the C and C++ language standards. So it's somewhat self-defeating to try to be "standard compliant" while using a feature that clearly is not.
Despite the fact that the SSE intrinsic libraries try to act like normal 3rd party libraries, underneath, they are all specially handled by the compiler.
The Intent:
The SSE intrinsics were likely designed from the beginning to allow aliasing between the vector and scalar types - since a vector really is just an aggregate of the scalar type.
But whoever designed the SSE intrinsics probably wasn't a language pedant. (That's not too surprising: hard-core low-level performance programmers and language-lawyering enthusiasts tend to be very different groups of people who don't always get along.)
We can see evidence of this in the load/store intrinsics:
__m128i _mm_stream_load_si128(__m128i* mem_addr) - A load intrinsic that takes a non-const pointer?
void _mm_storeu_pd(double* mem_addr, __m128d a) - What if I want to store to __m128i*?
The strict aliasing problems are a direct result of these poor prototypes.
Starting from AVX512, the intrinsics have all been converted to void* to address this problem:
__m512d _mm512_load_pd(void const* mem_addr)
void _mm512_store_epi64 (void* mem_addr, __m512i a)
Compiler Specifics:
Visual Studio defines each of the SSE/AVX types as a union of the scalar types. This by itself allows aliasing. Furthermore, Visual Studio doesn't enforce strict aliasing, so the point is moot.
The Intel Compiler has never failed me with all sorts of aliasing. It probably doesn't do strict-aliasing either - though I've never found any reliable source for this.
GCC does do strict-aliasing, but from my experience, not across function boundaries. It has never failed me to cast pointers which are passed in (of any type). GCC also declares the SSE types with the __may_alias__ attribute, thereby explicitly allowing them to alias other types.
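The same attribute can be applied to a scalar typedef; a minimal sketch (this is a GCC/Clang extension, and the typedef and function names here are my own):

```cpp
#include <cstdint>

// GCC/Clang extension: an lvalue of a __may_alias__ type may legally
// alias an object of any other type, just like char.
typedef std::int64_t may_alias_i64 __attribute__((__may_alias__));

// Read 64 bits out of any object without a strict-aliasing violation
// (under GCC's rules; this is not portable ISO C++).
static std::int64_t load_low64(const void* p) {
    return *static_cast<const may_alias_i64*>(p);
}
```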
My Recommendation:
For function parameters that are of the wrong pointer type, just cast it.
For variables declared and aliased on the stack, use a union. That union will already be aligned so you can read/write to them directly without intrinsics. (But be aware of store-forwarding issues that come with interleaving vector/scalar accesses.)
If you need to access a vector both as a whole and by its scalar components, consider using insert/extract intrinsics instead of aliasing.
When using GCC, turn on -Wall or -Wstrict-aliasing. It will warn you about many strict-aliasing violations.
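Following the first recommendation, the function from the question can be written as a plain cast; a sketch (the wrapper name is mine, and the "just cast it" justification rests on GCC's __may_alias__ declaration of __m128i rather than on ISO C++):

```cpp
#include <immintrin.h>
#include <cstdint>

// The parameter has the "wrong" pointer type for _mm_storeu_si128,
// so just cast it. GCC tags __m128i with __may_alias__, so this
// unaligned store is allowed to touch the int64_t array.
static inline void storeu_epi64(std::int64_t* p, __m128i a) {
    _mm_storeu_si128(reinterpret_cast<__m128i*>(p), a);
}
```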
Related
This is related to Why can't GCC generate an optimal operator== for a struct of two int32s?. I was playing around with the code from that question at godbolt.org and noticed this odd behavior.
struct Point {
int x, y;
};
bool nonzero_ptr(Point const* a) {
return a->x || a->y;
}
bool nonzero_ref(Point const& a) {
return a.x || a.y;
}
https://godbolt.org/z/e49h6d
For nonzero_ptr, clang -O3 (all versions) produces this or similar code:
mov al, 1
cmp dword ptr [rdi], 0
je .LBB0_1
ret
.LBB0_1:
cmp dword ptr [rdi + 4], 0
setne al
ret
This strictly implements the short-circuiting behavior of the C++ function, loading the y field only if the x field is zero.
For nonzero_ref, clang 3.6 and earlier generate the same code as they do for nonzero_ptr, but clang 3.7 through 11.0.1 produce
mov eax, dword ptr [rdi + 4]
or eax, dword ptr [rdi]
setne al
ret
which loads y unconditionally. No version of clang is willing to do that when the parameter is a pointer. Why?
The only situation I can think of (on the x64 platform) where the behavior of the branching code would be observably different is when there's no memory mapped at [rdi+4], but I'm still unsure why clang would consider that case important for pointers and not references. My best guess is that there is some language-lawyery argument that references must be to "full objects" and pointers needn't be:
char* p = alloc_4k_page_surrounded_by_guard_pages();
int* pi = reinterpret_cast<int*>(p + 4096 - sizeof(int));
Point* ppt = reinterpret_cast<Point*>(pi); // ok???
ppt->x = 42; // ok???
Point& rpt = *ppt; // UB???
But if the spec implies that, I'm not seeing how.
This is a missed optimization; the branchless code is safe for both C++ source versions.
In Why is gcc allowed to speculatively load from a struct? GCC actually is speculatively loading both struct members through a pointer even though the C source only references one or the other. So at least GCC developers have decided that this optimization is 100% safe, in their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which int to load, so clang is still just as reluctant as in your case to invent a load. (C vs C++: same asm with or without -xc, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
The obvious difference in your asm is that the pointer version avoids access to a->y if a->x != 0, and that this only matters for correctness (footnote 1) if a->y was in an unmapped page; you're right about that being the relevant corner case.
But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is I'm pretty sure undefined behaviour. In a path of execution that reads a->x, the compiler can assume it's safe to also read a->y.
This would of course not be the case for int *p; and p[0] || p[1], because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
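For contrast, a sketch of that bare-pointer case, where the second element legitimately might not exist and the compiler therefore must preserve the short-circuit (the function name is made up):

```cpp
// With a plain int*, p[1] may lie past the end of a 1-element
// zero-terminated array at the very end of a page, so a conforming
// compiler cannot invent the second load here.
bool nonzero_arr(const int* p) {
    return p[0] || p[1];  // p[1] is only evaluated when p[0] == 0
}
```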
As #Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it internally transforms the access into something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers.
It can always do it for reference args because references are guaranteed non-NULL. It would be "even more" UB for the caller to do nonzero_ref(*ppt), like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.
An experiment: deref the pointer to get a full tmp object
bool nonzero_ptr_full_deref(Point const* pa) {
Point a = *pa;
return a.x || a.y;
}
https://godbolt.org/z/ejrn9h - compiles branchlessly, same as nonzero_ref. Not sure what / how much this tells us. This is what I expected, given that it makes access to a->y effectively unconditional in the C++ source.
Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.
Making asm like this doesn't "introduce data-race UB" because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from [rdi+4] so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe, unlike writes, and is allowed because it's not volatile so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid Point object.
Part of data races (on non-atomic objects) being Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another is to allow compilers to assume that it's safe to reload something they accessed once, and expect the same value unless there's an acquire or seq_cst load between the two points. Even making code that would crash if the 2nd load differed from the first. That's irrelevant in this case because we're not talking about turning 1 access into 2 (instead 0 into 1 whose value may not matter), but is why roll-your-own atomics (e.g. in the Linux kernel) need to use volatile* casts for ACCESS_ONCE (https://lwn.net/Articles/793253/#Invented%20Loads).
I believe that from the point of view of standard C++, the compiler could emit the same code for both, since there is no provision in the standard for "partial objects" like the one you've constructed. The fact that it doesn't could simply be a missed optimization.
One could compare code like a->x || b->y where the compiler really does have to emit a branch, since the caller could legally pass a null or invalid pointer for b so long as a->x is nonzero. On the other hand, if a,b are references, then a.x || b.y should not need a branch according to the standard, since they must always be references to valid objects. So the "missed optimization" in your nonzero_ptr could just be the compiler not noticing that it can take advantage of the fact that the pointers in a->x and a->y are the same pointer.
Alternatively, it's possible that clang is, as an extension, trying to produce code that will still work when you use non-standard features to create objects in which only some members can be accessed. The fact that this works for pointers but not for references could be a bug or limitation of that extension, but I don't think it's any sort of conformance violation.
I am trying to write C++ code that converts the assembly dq 3FA999999999999Ah into a C++ double. What should I type inside the asm block? I don't know how to get the value out.
int main()
{
double x;
asm
{
dq 3FA999999999999Ah
mov x,?????
}
std::cout<<x<<std::endl;
return 0;
}
From the comments it sounds a lot like you want to use a reinterpret cast here. Essentially what this does is to tell the compiler to treat the sequence of bits as if it were of the type that it was casted to, but it doesn't make any attempt to convert the value.
uint64_t raw = 0x3FA999999999999A;
double x = reinterpret_cast<double&>(raw);
See this in action here: http://coliru.stacked-crooked.com/a/37aec366eabf1da7
Note that I've used the specific 64-bit integer type here to make sure the bit representation matches that of the 64-bit double. Also, the cast has to be to double& because the C++ rules forbid a plain value cast to double: reinterpret_cast deals with memory, not type conversions. For more details see this question: Why doesn't this reinterpret_cast compile?. Additionally, you need to be sure that the representation of the 64-bit unsigned integer matches up with the bit reinterpretation of the double for this to work properly.
EDIT: Something worth noting is that the compiler warns about this breaking strict aliasing rules. The quick summary is that more than one value refers to the same place in memory now and the compiler might not be able to tell which variables are changed if the change occurs via the other way it can be accessed. In general you don't want to ignore this, I'd highly recommend reading the following article on strict aliasing to get to know why this is an issue. So while the intent of the code might be a little less clear you might find a better solution is to use memcpy to avoid the aliasing problems:
#include <cstdint>
#include <cstring>
#include <iostream>
int main()
{
double x;
const uint64_t raw = 0x3FA999999999999A;
std::memcpy(&x, &raw, sizeof raw);
std::cout<<x<<std::endl;
return 0;
}
See this in action here: http://coliru.stacked-crooked.com/a/5b738874e83e896a
This avoids the aliasing issue because x is now a double with the correct constituent bits, but thanks to the memcpy it is not at the same memory location as the 64-bit integer that was used to represent the bit pattern. Because memcpy treats the variables as if they were arrays of char, you still need to make sure you get any endianness considerations correct.
I learned that pointer aliasing may hurt performance, and that a __restrict__ attribute (in GCC, or equivalent attributes in other implementations) may help the compiler keep track of which pointers can alias. Meanwhile, I also learned that GCC's implementation of valarray stores a __restrict__'ed pointer (line 517 in https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.1/valarray-source.html), which I think hints to the compiler (and to responsible users) that the private pointer can be assumed not to be aliased anywhere in valarray methods.
But if we alias a pointer to a valarray object, for example:
#include <valarray>
int main() {
std::valarray<double> *a = new std::valarray<double>(10);
std::valarray<double> *b = a;
return 0;
}
is it valid to say that the member pointer of a is aliased too? And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise? (Is it bad practice to point to optimized pointer containers?)
Let's first understand how aliasing hurts optimization.
Consider this code,
void
process_data(float *in, float *out, float gain, int nsamps)
{
int i;
for (i = 0; i < nsamps; i++) {
out[i] = in[i] * gain;
}
}
In C or C++, it is legal for the parameters in and out to point to overlapping regions in memory.... When the compiler optimizes the function, it does not in general know whether in and out are aliases. It must therefore assume that any store through out can affect the memory pointed to by in, which severely limits its ability to reorder or parallelize the code (For some simple cases, the compiler could analyze the entire program to determine that two pointers cannot be aliases. But in general, it is impossible for the compiler to determine whether or not two pointers are aliases, so to be safe, it must assume that they are).
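For comparison, a sketch of the same loop with the restrict promise added (the function name is mine; __restrict__ is the GCC spelling): by asserting that in and out never overlap, the programmer frees the compiler to reorder and vectorize without runtime alias checks.

```cpp
// Promise to the compiler: in and out do not overlap, so stores to
// out[i] cannot change what in[j] refers to.
void process_data_restrict(const float* __restrict__ in,
                           float* __restrict__ out,
                           float gain, int nsamps) {
    for (int i = 0; i < nsamps; ++i)
        out[i] = in[i] * gain;
}
```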
Coming to your code,
#include <valarray>
int main() {
std::valarray<double> *a = new std::valarray<double>(10);
std::valarray<double> *b = a;
return 0;
}
Since a and b are aliases, the underlying storage used by valarray will also be aliased (I think it uses an array, though I'm not certain). So any part of your code that uses a and b in a fashion similar to that shown above will not benefit from compiler optimizations like parallelization and reordering. Note that JUST the existence of b will not hurt optimization; what matters is how you use it.
Credits:
The quoted part and the code are taken from here. This should serve as a good source for more information about the topic as well.
is it valid to say that the member pointer of a is aliased too?
Yes. For example, (*a)[0] and (*b)[0] reference the same object. That's aliasing.
And would the very existence of b hurt any optimizations that valarray methods could benefit otherwise?
No.
You haven't done anything with b in your sample code. Suppose you have a function much larger than this sample code that starts with the same construct. There's usually no problem if the first several lines of that function use a but never b, and the remaining lines use b but never a. Usually. (Optimizing compilers do rearrange lines of code, however.)
If, on the other hand, you intermingle uses of a and b, you aren't merely hurting the optimizations. You are doing something much worse: you are invoking undefined behavior. "Don't do it" is the best solution to the undefined behavior problem.
Addendum
The C restrict and gcc __restrict__ keywords are not constraints on the developers of the compiler or the standard library. Those keywords are promises to the compiler/library that restricted data do not overlap other data. The compiler/library doesn't check whether the programmer violated this promise. If this promise enables certain optimizations that might otherwise be invalid with overlapping data, the compiler/library is free to apply those optimizations.
What this means is that restrict (or __restrict__) is a restriction on you, not the compiler. You can violate those restrictions even without your b pointer. For example, consider
*a = (*a)[std::slice(a->size() - 1, a->size(), -1)];
This is undefined behavior.
The code below performs a fast inverse square root operation by some bit hacks.
The algorithm was probably developed by Silicon Graphics in the early 1990s, and it appeared in Quake 3 too.
more info
However I get the following warning from GCC C++ compiler: dereferencing type-punned pointer will break strict-aliasing rules
Should I use static_cast, reinterpret_cast or dynamic_cast instead in such situations?
float InverseSquareRoot(float x)
{
float xhalf = 0.5f*x;
int32_t i = *(int32_t*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
x = x*(1.5f - xhalf*x*x);
return x;
}
Forget casts. Use memcpy.
float xhalf = 0.5f*x;
uint32_t i;
assert(sizeof(x) == sizeof(i));
std::memcpy(&i, &x, sizeof(i));
i = 0x5f375a86 - (i>>1);
std::memcpy(&x, &i, sizeof(i));
x = x*(1.5f - xhalf*x*x);
return x;
The original code tries to initialize the int32_t by first accessing the float object through an int32_t pointer, which is where the rules are broken. The C-style cast is equivalent to a reinterpret_cast, so changing it to reinterpret_cast would not make much difference.
The important difference when using memcpy is that the bytes are copied from the float into the int32_t, but the float object is never accessed through an int32_t lvalue, because memcpy takes pointers to void and its insides are "magical" and don't break the aliasing rules.
There are a few good answers here that address the type-punning issue.
I want to address the "fast inverse square-root" part. Don't use this "trick" on modern processors. Every mainstream vector ISA has a dedicated hardware instruction to give you a fast inverse square-root. Every one of them is both faster and more accurate than this oft-copied little hack.
These instructions are all available via intrinsics, so they are relatively easy to use. In SSE, you want to use rsqrtss (intrinsic: _mm_rsqrt_ss( )); in NEON you want to use vrsqrte (intrinsic: vrsqrte_f32( )); and in AltiVec you want to use frsqrte. Most GPU ISAs have similar instructions. These estimates can be refined using the same Newton iteration, and NEON even has the vrsqrts instruction to do part of the refinement in a single instruction without needing to load constants.
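A sketch of the SSE route described above (the function name is mine; the refinement is the standard Newton-Raphson step, the same one the bit hack uses):

```cpp
#include <immintrin.h>

// Hardware reciprocal square-root estimate (~12 bits of precision)
// refined by one Newton-Raphson iteration.
float rsqrt_sse(float x) {
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    return y * (1.5f - 0.5f * x * y * y);  // one refinement step
}
```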
Update
I no longer believe this answer is correct, due to feedback I've gotten from the committee. But I want to leave it up for informational purposes. And I am purposefully hopeful that this answer can be made correct by the committee (if it chooses to do so). I.e. there's nothing about the underlying hardware that makes this answer incorrect, it is just the judgement of a committee that makes it so, or not so.
I'm adding an answer not to refute the accepted answer, but to augment it. I believe the accepted answer is both correct and efficient (and I've just upvoted it). However I wanted to demonstrate another technique that is just as correct and efficient:
float InverseSquareRoot(float x)
{
union
{
float as_float;
int32_t as_int;
};
float xhalf = 0.5f*x;
as_float = x;
as_int = 0x5f3759df - (as_int>>1);
as_float = as_float*(1.5f - xhalf*as_float*as_float);
return as_float;
}
Using clang++ with optimization at -O3, I compiled plasmacel's code, R. Martinho Fernandes' code, and this code, and compared the assembly line by line. All three were identical. This is due to the compiler's choice to compile it like this. It would have been equally valid for the compiler to produce different, broken code.
If you have access to C++20 or later then you can use std::bit_cast
float InverseSquareRoot(float x)
{
float xhalf = 0.5f*x;
int32_t i = std::bit_cast<int32_t>(x);
i = 0x5f3759df - (i>>1);
x = std::bit_cast<float>(i);
x = x*(1.5f - xhalf*x*x);
return x;
}
At the moment std::bit_cast is only supported by MSVC. See demo on Godbolt
While waiting for the implementation, if you're using Clang you can try __builtin_bit_cast. Just change the casts like this
int32_t i = __builtin_bit_cast(std::int32_t, x);
x = __builtin_bit_cast(float, i);
Demo
Take a look at this for more information on type punning and strict aliasing.
The only safe way to reinterpret an object as an array is as a char array. If you want one memory address to be accessible as different types, you will need to use a union.
The cast invokes undefined behaviour, and it stays undefined no matter which form of cast you use.
Most compilers will do what you expect, but gcc likes being mean: it may well assume the pointers don't alias, despite all indications that they do, and reorder the operations so they give some strange result.
Casting a pointer to an incompatible type and dereferencing it is undefined behaviour. The only exception is casting it to or from char, so the only workaround is using std::memcpy (as per R. Martinho Fernandes' answer). (I am not sure to what extent type punning through unions is defined; it does stand a better chance of working, though.)
That said, you should not use C-style cast in C++. In this case, static_cast would not compile, nor would dynamic_cast, forcing you to use reinterpret_cast and reinterpret_cast is a strong suggestion you might be violating strict aliasing rules.
Based on the answers here I made a modern "pseudo-cast" function for ease of application.
C99 version
(union type punning like this is valid in C99/C11, but technically undefined behaviour in C++, even though most compilers support it)
template <typename T, typename U>
inline T pseudo_cast(const U &x)
{
static_assert(std::is_trivially_copyable<T>::value && std::is_trivially_copyable<U>::value, "pseudo_cast can't handle types which are not trivially copyable");
union { U from; T to; } u = {x};
return u.to;
}
Universal versions
(based on the accepted answer)
Cast types with the same size:
#include <cstring>
template <typename T, typename U>
inline T pseudo_cast(const U &x)
{
static_assert(std::is_trivially_copyable<T>::value && std::is_trivially_copyable<U>::value, "pseudo_cast can't handle types which are not trivially copyable");
static_assert(sizeof(T) == sizeof(U), "pseudo_cast can't handle types with different size");
T to;
std::memcpy(&to, &x, sizeof(T));
return to;
}
Cast types with any sizes:
#include <cstring>
template <typename T, typename U>
inline T pseudo_cast(const U &x)
{
static_assert(std::is_trivially_copyable<T>::value && std::is_trivially_copyable<U>::value, "pseudo_cast can't handle types which are not trivially copyable");
T to = T(0);
std::memcpy(&to, &x, (sizeof(T) < sizeof(U)) ? sizeof(T) : sizeof(U));
return to;
}
Use it like:
float f = 3.14f;
uint32_t u = pseudo_cast<uint32_t>(f);
Update for C++20
C++20 introduces constexpr std::bit_cast in header <bit> which is functionally equivalent for types with the same size. Nevertheless, the above versions are still useful if you want to implement this functionality yourself (supposed that constexpr is not required), or if you want to support types with different sizes.
The only cast that will work here is reinterpret_cast. (And even then, at least one compiler will go out of its way to ensure that it won't work.)
But what are you actually trying to do? There's certainly a better solution that doesn't involve type punning. There are very, very few cases where type punning is appropriate, and they are all in very, very low-level code: things like serialization, or implementing the C standard library (e.g. functions like modf). Otherwise (and maybe even in serialization), functions like ldexp and modf will probably work better, and certainly be more readable.
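To illustrate that last point, std::frexp decomposes a double into mantissa and exponent with no punning at all (the struct and function names here are just for the example):

```cpp
#include <cmath>

// frexp splits x into a mantissa in [0.5, 1) and a power-of-two
// exponent, so x == mantissa * 2^exponent -- no bit tricks needed.
struct Split { double mantissa; int exponent; };

Split split_double(double x) {
    Split s;
    s.mantissa = std::frexp(x, &s.exponent);
    return s;
}
```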
Is there any advantage to specifying the MSVC/GCC non-standard __restrict qualifier on a function pointer parameter if it is the only pointer parameter? For example,
int longCalculation(int a, int* __restrict b)
My guess is it should allow better optimization since it implies b does not point to a, but all examples I've seen __restrict two pointers to indicate no aliasing between them.
As mentioned in the comments, b can't point to a anyway, so there is no aliasing potential there. So if the function is pure in the sense that it works only on its parameters, there shouldn't be any real benefit.
However if the function uses global variables internally then __restrict might offer benefits once again, since it makes clear that b doesn't point to any of those global variables.
An interesting case might be the situation where you allocate and deallocate memory inside the function. The compiler could theoretically be sure that b doesn't point to that memory, however whether or not it realizes that I'm not sure and might depend how the allocation is called.
Personally however I prefer to keep __restrict out of the signature and do something like this
int longCalculation(int a, int* b){
assert(...);//ensure that b doesn't point to anything used
int* __restrict bx = b;
...
}
IMO this has the following advantages:
The function signature doesn't expose the non standard __restrict used
The ability to ensure, via assert, that the variables actually conform to __restrict, since passing aliasing pointers to a function that expects them to be non-aliasing can lead to hard-to-track-down bugs.
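A filled-in sketch of that pattern (the function name and loop body are hypothetical; only the local __restrict__ alias is the point):

```cpp
// The signature stays standard; the restrict promise is made privately,
// on a local alias, right where the no-aliasing assumption is used.
int sum_scaled(int a, const int* b, int n) {
    const int* __restrict__ bx = b;  // promise: nothing else aliases *bx below
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += a * bx[i];
    return total;
}
```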