Get member of __m128 by index? - c++

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    return V.m128_f32[i];
}
The error I get is as follows:
Member reference base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) doesn't treat __m128 as a struct or union, so member access doesn't work. However, I haven't managed to find a straight answer on how to get these values back out. I've tried the subscript operator without success, and I've glanced through the huge list of SSE intrinsics without finding an appropriate one.

A union is probably the most portable way to do this:
union U {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
};

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;
    assert(i <= 3);
    u.v = V;
    return u.a[i];
}

As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
    // shuffle V so that the element that you want is moved to the least-
    // significant element of the vector (V[0])
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // return the value in V[0]
    return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13) so no need for an if (i). https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0 which is 1 byte shorter than shufps and produces the same low element. You could manually do this with switch/case for other compilers since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
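(If you wanted that movhlps special case on compilers other than clang, a hand-rolled dispatch could replace the shuffle version above. This is a sketch, not part of the original answer:)

#include <immintrin.h>

template<unsigned i>
float vectorGetByIndex(__m128 V)
{
    static_assert(i <= 3, "index out of range");
    switch (i) {
    case 0:  break;                                             // already in the low element
    case 2:  V = _mm_movehl_ps(V, V); break;                    // movhlps: high half -> low half
    default: V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i)); break;
    }
    return _mm_cvtss_f32(V);
}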
Note that SSE4.1 _mm_extract_ps(V, i) is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)

Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    union {
        __m128 v;
        float a[4];
    } converter;
    converter.v = V;
    return converter.a[i];
}
which will work regardless of the available instruction set.
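For example, a minimal usage sketch:
__m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
float third = vectorGetByIndex<2>(v); // 3.0f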
Note: Even if SSE4.1 is available and i is a compile-time constant, you can't use pextrd / _mm_extract_epi32 this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
return _mm_extract_epi32(V, i);
}
// broken code ends here
I'm not deleting it because it is a useful reminder of how not to do things.

The way I do it is:
union vec { __m128 sse; float f[4]; };

float accessmember(__m128 v, int index)
{
    vec u;
    u.sse = v;
    return u.f[index];
}
Seems to work out pretty well for me.

Late to this party, but I found that this works for me in MSVC, where z is a variable of type __m128:
#define _mm_extract_f32(v, i) _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))
__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
float f = _mm_extract_f32(z, 2);
Or even simpler (note that the m128_f32 member is MSVC-specific, which is exactly what doesn't compile on Clang):
__m128 z;
float f = z.m128_f32[2]; // get the 3rd float value in the vector

Related

Clang: autovectorize conversion of bool[64] array to uint64_t bit mask

I want to convert a bool[64] into a uint64_t where each bit represents the value of an element in the input array.
On modern x86 processors, this can be done quite efficiently, e.g. using vptestmd with AVX512 or vpmovmskb with AVX2. When I use clang's bool vector extension in combination with __builtin_convertvector, I'm happy with the results:
uint64_t handvectorized(const bool* input_) noexcept {
    const bool* __restrict input = std::assume_aligned<64>(input_);
    using VecBool64 __attribute__((vector_size(64))) = char;
    using VecBitmaskT __attribute__((ext_vector_type(64))) = bool;
    auto input_vec = *reinterpret_cast<const VecBool64*>(input);
    auto result_vec = __builtin_convertvector(input_vec, VecBitmaskT);
    return reinterpret_cast<uint64_t&>(result_vec);
}
produces (godbolt):
vmovdqa64 zmm0, zmmword ptr [rdi]
vptestmb k0, zmm0, zmm0
kmovq rax, k0
vzeroupper
ret
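(For reference, the same operation written with Intel intrinsics instead of clang's vector extensions would be roughly the following sketch, assuming AVX512BW and 64-byte-aligned input; the function name is illustrative:)

#include <immintrin.h>
#include <cstdint>

uint64_t handvectorized_intrinsics(const bool* input) noexcept {
    __m512i v = _mm512_load_si512(input);        // 64 bool bytes (0 or 1), 64-byte aligned
    __mmask64 m = _mm512_test_epi8_mask(v, v);   // vptestmb: mask bit i = (byte i != 0)
    return _cvtmask64_u64(m);                    // kmovq the mask into a scalar register
}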
However, I can not get clang (or GCC, or ICX) to produce anything that uses vector mask extraction like this with (portable) scalar code.
For this implementation:
uint64_t loop(const bool* input_) noexcept {
const bool* __restrict input = std::assume_aligned<64>(input_);
uint64_t result = 0;
for(size_t i = 0; i < 64; ++i) {
if(input[i]) {
result |= 1ull << i;
}
}
return result;
}
clang produces a 64*8B = 512B lookup table and 39 instructions.
This implementation, and some other scalar implementations (branchless, inverse bit order, using std::bitset) that I've tried, can all be found on godbolt. None of them results in code close to the handwritten vector instructions.
Is there anything I'm missing or any reason the optimization doesn't work well here? Is there a scalar version I can write that produces reasonably vectorized code?
I'm wondering especially since the "handvectorized" version doesn't use any platform-specific intrinsics, and doesn't really have much programming to it. All it does is "load as vector" and "convert to bitmask". Perhaps clang simply doesn't detect the loop pattern? It just feels strange to me, a simple bitwise OR reduction loop feels like a common pattern, and the documentation of the loop vectorizer explicitly lists reductions using OR as a supported feature.
Edit: Updated godbolt link with suggestions from the comments

Storing two integer values in 32 bits while avoiding UB with negative numbers

I have two int16_t integers that I want to store in one 32-bit int.
I can use the lower and upper bits to do this as answered in this question.
The top answer demonstrates this:
int int1 = 345;
int int2 = 2342;
take2Integers( int1 | (int2 << 16) );
But my values could also be negative, so will << 16 cause undefined behavior?
If so, what is the solution to storing and retrieving those two numbers which can also be negative in a safe way?
Further context
The whole 32 bit number will be stored in an array with other numbers similar to it.
These numbers will never be utilized as a whole, I'll only want to work with one of the integers packed inside at once.
So what's required are functions that safely store and retrieve two int16_t values in one 4-byte integer:
int pack(int16_t lower, int16_t upper);
int16_t get_lower(int);
int16_t get_upper(int);
Here’s the solution with a single std::memcpy from the comments:
std::int32_t pack(std::int16_t l, std::int16_t h) {
    std::int16_t arr[2] = {l, h};
    std::int32_t result;
    std::memcpy(&result, arr, sizeof result);
    return result;
}
Any compiler worth its salt won’t emit code for the temporary array. Case in point, the resulting assembly code on GCC with -O2 looks like this:
pack(short, short):
sal esi, 16
movzx eax, di
or eax, esi
ret
Without checking, I’m confident that clang/ICC/MSVC produce something similar.
C++ (pre-20) makes it notoriously hard to alias values. If you want to remain within defined behavior pre-20, you'd have to use memcpy.
Like the following:
int32_t pack(int16_t l, int16_t h) {
    int32_t r;
    memcpy(&r, &l, 2);
    memcpy(reinterpret_cast<char*>(&r) + 2, &h, 2);
    return r;
}
Unpack is similar, with the memcpy arguments reversed (see the sketch after the assembly below).
Note that this code is very well optimized by both Clang and GCC in my experiments. Here is what the latest version of Clang produced:
pack(short, short): # #pack(short, short)
movzwl %di, %eax
shll $16, %esi
orl %esi, %eax
retq
It all ends up as the same combination of shifts and bitwise ORs.
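For completeness, the unpack half under the same approach might look like this (a sketch; get_lower/get_upper are the names from the question):

int16_t get_lower(int32_t r) {
    int16_t l;
    memcpy(&l, &r, 2);            // first two bytes hold the low half written by pack()
    return l;
}

int16_t get_upper(int32_t r) {
    int16_t h;
    memcpy(&h, reinterpret_cast<char*>(&r) + 2, 2);  // last two bytes hold the high half
    return h;
}

Because pack() and these helpers agree on the byte layout, the round trip is correct regardless of endianness.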

C++ SSE Intrinsics: Storing results in variables [closed]

I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:
void _mm_store_ps (float* mem_addr, __m128 a)
Store 128-bits (composed of 4 packed single-precision (32-bit)
floating-point elements) from a into memory. mem_addr must be aligned
on a 16-byte boundary or a general-protection exception may be
generated.
The first argument is a pointer to a float, which is 32 bits wide. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.
Does mem_addr need to be an array of 4 floats?
How can I access only a specific 32bit element in a and store it in a single float?
What am I missing conceptually?
Are there better options than the _mm_store_ps intrinsic?
Here is a simple struct where doSomething() adds 1 to the struct's x/y. What's missing is how to store the result back into x/y, given that only two of the four vector elements are actually used and the other two are unused.
struct vec2 {
    union {
        struct {
            float data[2];
        };
        struct {
            float x, y;
        };
    };

    void doSomething() {
        __m128 v1 = _mm_setr_ps(x, y, 0, 0);
        __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
        __m128 result = _mm_add_ps(v1, v2);
        // ?? How to store results in x,y ??
    }
};
It's a 128-bit load or store, so yes the arg is like float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer.
Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least the integer ones (e.g. _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict-aliasing even if it's a pointer to long). I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, and usually you'd be keeping those in C objects of type float, though.
How can I access only a specific 32bit element in a and store it in a single float?
Use a vector shuffle and then _mm_cvtss_f32 to cast the vector to scalar.
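For example, a small sketch that pulls element 2 out of a vector a:
float elem2 = _mm_cvtss_f32(_mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 2, 2, 2)));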
loading / storing 64 bits
Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).
But you can express what you're trying to do efficiently like this:
struct vec2 {
    float x,y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd( (const double*)in );  // 64-bit zero-extending load with MOVSD
    __m128 inv = _mm_castpd_ps(tmp);                 // keep the compiler happy
    __m128 result = _mm_add_ps(inv, _mm_setr_ps(1, 1, 0, 0) );
    _mm_storel_pi( (__m64*)out, result );            // 64-bit store of the low half
}
GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. gcc 6.3 uses movsd.
foo(vec2 const*, vec2*):
movq xmm0, QWORD PTR [rdi] # 64-bit integer load
addps xmm0, XMMWORD PTR .LC0[rip] # packed 128-bit float add
movlps QWORD PTR [rsi], xmm0 # 64-bit store
ret
For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics. They often require type casting.
Notice that instead of _mm_setr, I used _mm_load_sd((const double*)&(in->x)) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector. That would create a false dependency on whatever value was there before, and costs an extra ALU uop.
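Applied back to the original struct, a sketch of doSomething() along the same lines might look like this, reusing the union's data[2] member so x and y are handled as one 64-bit chunk (an illustration, not part of the original answer):

void doSomething() {
    // load x,y into the low 64 bits, zero-extending (MOVSD / MOVQ)
    __m128 v1 = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<const double*>(data)));
    __m128 result = _mm_add_ps(v1, _mm_setr_ps(1, 1, 0, 0));
    _mm_storel_pi(reinterpret_cast<__m64*>(data), result);  // store the low 64 bits back to x,y
}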

Why don't C++ compilers do better constant folding?

I'm investigating ways to speed up a large section of C++ code, which has automatic derivatives for computing jacobians. This involves doing some amount of work in the actual residuals, but the majority of the work (based on profiled execution time) is in calculating the jacobians.
This surprised me, since most of the jacobians are propagated forward from 0s and 1s, so the amount of work should be 2-4x the function, not 10-12x. In order to model what a large amount of the jacobian work is like, I made a super minimal example with just a dot product (instead of sin, cos, sqrt and more that would be in a real situation) that the compiler should be able to optimize to a single return value:
#include <Eigen/Core>
#include <Eigen/Geometry>
using Array12d = Eigen::Matrix<double,12,1>;
double testReturnFirstDot(const Array12d& b)
{
    Array12d a;
    a.array() = 0.;
    a(0) = 1.;
    return a.dot(b);
}
Which should be the same as
double testReturnFirst(const Array12d& b)
{
    return b(0);
}
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s. Even with fast-math (https://godbolt.org/z/GvPXFy) the optimizations are very poor in GCC and Clang (still involve multiplications and additions), and MSVC doesn't do any optimizations at all.
I don't have a background in compilers, but is there a reason for this? I'm fairly sure that in a large proportion of scientific computations being able to do better constant propagation/folding would make more optimizations apparent, even if the constant-fold itself didn't result in a speedup.
While I'm interested in explanations for why this isn't done on the compiler side, I'm also interested for what I can do on a practical side to make my own code faster when facing these kinds of patterns.
This is because Eigen explicitly vectorizes your code as 3 vmulpd, 2 vaddpd and 1 horizontal reduction within the remaining 4-component registers (this assumes AVX; with SSE only you'll get 6 mulpd and 5 addpd). With -ffast-math, GCC and Clang are allowed to remove the last 2 vmulpd and vaddpd (and this is what they do), but they cannot really replace the remaining vmulpd and horizontal reduction that have been explicitly generated by Eigen.
So what if you disable Eigen's explicit vectorization by defining EIGEN_DONT_VECTORIZE? Then you get what you expected (https://godbolt.org/z/UQsoeH) but other pieces of code might become much slower.
If you want to locally disable explicit vectorization and are not afraid of messing with Eigen's internal, you can introduce a DontVectorize option to Matrix and disable vectorization by specializing traits<> for this Matrix type:
static const int DontVectorize = 0x80000000;

namespace Eigen {
namespace internal {

template<typename _Scalar, int _Rows, int _Cols, int _MaxRows, int _MaxCols>
struct traits<Matrix<_Scalar, _Rows, _Cols, DontVectorize, _MaxRows, _MaxCols> >
  : traits<Matrix<_Scalar, _Rows, _Cols> >
{
    typedef traits<Matrix<_Scalar, _Rows, _Cols> > Base;
    enum {
        EvaluatorFlags = Base::EvaluatorFlags & ~PacketAccessBit
    };
};

}
}
using ArrayS12d = Eigen::Matrix<double,12,1,DontVectorize>;
Full example there: https://godbolt.org/z/bOEyzv
I was disappointed to find that, without fast-math enabled, neither GCC 8.2, Clang 6 or MSVC 19 were able to make any optimizations at all over the naive dot-product with a matrix full of 0s.
They have no other choice unfortunately. Since IEEE floats have signed zeros, adding 0.0 is not an identity operation:
-0.0 + 0.0 = 0.0 // Not -0.0!
Similarly, multiplying by zero does not always yield zero:
0.0 * Infinity = NaN // Not 0.0!
So the compilers simply cannot perform these constant folds in the dot product while retaining IEEE float compliance - for all they know, your input might contain signed zeros and/or infinities.
You will have to use -ffast-math to get these folds, but that may have undesired consequences. You can get more fine-grained control with specific flags (from http://gcc.gnu.org/wiki/FloatingPointMath). According to the above explanation, adding the following two flags should allow the constant folding:
-ffinite-math-only, -fno-signed-zeros
Indeed, you get the same assembly as with -ffast-math this way: https://godbolt.org/z/vGULLA. You only give up the signed zeros (probably irrelevant), NaNs and the infinities. Presumably, if you were to still produce them in your code, you would get undefined behavior, so weigh your options.
As for why your example is not optimized better even with -ffast-math: That is on Eigen. Presumably they have vectorization on their matrix operations, which are much harder for compilers to see through. A simple loop is properly optimized with these options: https://godbolt.org/z/OppEhY
One way to force a compiler to optimize away multiplications by 0s and 1s is to manually unroll the loop. For simplicity, let's use
#include <array>
#include <cstddef>
constexpr std::size_t n = 12;
using Array = std::array<double, n>;
Then we can implement a simple dot function using fold expressions (or recursion if they are not available):
#include <utility>

template<std::size_t... is>
double dot(const Array& x, const Array& y, std::index_sequence<is...>)
{
    return ((x[is] * y[is]) + ...);
}

double dot(const Array& x, const Array& y)
{
    return dot(x, y, std::make_index_sequence<n>{});
}
Now let's take a look at your function
double test(const Array& b)
{
    const Array a{1}; // = {1, 0, ...}
    return dot(a, b);
}
With -ffast-math gcc 8.2 produces:
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
ret
clang 6.0.0 goes along the same lines:
test(std::array<double, 12ul> const&): # #test(std::array<double, 12ul> const&)
movsd xmm0, qword ptr [rdi] # xmm0 = mem[0],zero
ret
For example, for
double test(const Array& b)
{
    const Array a{1, 1}; // = {1, 1, 0...}
    return dot(a, b);
}
we get
test(std::array<double, 12ul> const&):
movsd xmm0, QWORD PTR [rdi]
addsd xmm0, QWORD PTR [rdi+8]
ret
Addendum: Clang unrolls a plain for (std::size_t i = 0; i < n; ++i) loop without all these fold-expression tricks; GCC doesn't and needs some help (see the sketch below).
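(For reference, a sketch of that plain-loop version, using the Array and n defined above; not part of the original answer:)

double dot_loop(const Array& x, const Array& y)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];  // an OR-free reduction the vectorizer can unroll
    return sum;
}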

Best way to convert floating point array to integer. [Replacing my asm code for x64]

I have a function that converts a floating-point array to an unsigned char array, using asm code to do it. The code was written many years ago. Now I am trying to build the solution as 64-bit, and I understand that _asm is not supported on x64.
What is the best way to remove the asm dependency?
Will the latest MSVC compiler optimize this well if I write plain C code? Does anyone know if there is anything in Boost or in the intrinsic functions to accomplish this?
Thanks
--Hari
I solved it with the following code, and it is faster than the asm:
#include <algorithm>  // for std::copy

inline static void floatTOuchar(float * pInbuf, unsigned char * pOutbuf, long len)
{
    std::copy(pInbuf, pInbuf + len, pOutbuf);
}
With SSE2, you can use intrinsics to pack from float down to unsigned char, with saturation to the unsigned 0..255 range.
Convert four vectors of floats to vectors of ints, with CVTPS2DQ (_mm_cvtps_epi32) to round to nearest, or convert with truncation (_mm_cvttps_epi32) if you want the default C truncation-toward-zero behaviour.
Then pack those vectors together, first to two vectors of signed 16bit int with two PACKSSDW (_mm_packs_epi32), then to one vector of unsigned 8bit int with PACKUSWB (_mm_packus_epi16). Note that PACKUSWB takes signed input, so using SSE4.1 PACKUSDW as the first step just makes things more difficult (extra masking step). int16_t can represent all possible values of uint8_t, so there's no problem.
Store the resulting vector of uint8_t and repeat for the next four vectors of floats.
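A minimal sketch of that approach, assuming len is a multiple of 16 and both buffers are 16-byte aligned (the function name is illustrative, not from the original post):

#include <immintrin.h>

inline void floatTOuchar_sse2(const float* pInbuf, unsigned char* pOutbuf, long len)
{
    for (long i = 0; i < len; i += 16) {
        __m128i a = _mm_cvtps_epi32(_mm_load_ps(pInbuf + i));      // CVTPS2DQ: round to nearest
        __m128i b = _mm_cvtps_epi32(_mm_load_ps(pInbuf + i + 4));
        __m128i c = _mm_cvtps_epi32(_mm_load_ps(pInbuf + i + 8));
        __m128i d = _mm_cvtps_epi32(_mm_load_ps(pInbuf + i + 12));
        __m128i ab = _mm_packs_epi32(a, b);                        // PACKSSDW: 32 -> 16 bit, signed saturation
        __m128i cd = _mm_packs_epi32(c, d);
        __m128i abcd = _mm_packus_epi16(ab, cd);                   // PACKUSWB: 16 -> 8 bit, unsigned saturation
        _mm_store_si128(reinterpret_cast<__m128i*>(pOutbuf + i), abcd);
    }
}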
Without manual vectorization, normal compiler output is good for scalar code like this:
int ftoi_truncate(float f) { return f; }
cvttss2si eax, xmm0
ret
int dtoi(double d) { return nearbyint(d); }
cvtsd2si eax, xmm0 # only with -ffast-math, though. Without, you get a function call :(
ret
You can try the following and let me know:
inline int float2int( double d )
{
    union Cast
    {
        double d;
        long l;
    };
    volatile Cast c;
    c.d = d + 6755399441055744.0;
    return c.l;
}

// Same thing but it's not always optimizer safe
inline int float2int( double d )
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}

for(int i = 0; i < HUGE_NUMBER; i++)
    int_array[i] = float2int(float_array[i]);
So the trick is the magic double constant. In the current code, the function rounds the float to the nearest whole number. If you want truncation, use 6755399441055743.5 (0.5 less).
Very informative article available at: http://stereopsis.com/sree/fpu2006.html