C++ SSE Intrinsics: Storing results in variables

I have trouble understanding the usage of SSE intrinsics to store results of some SIMD calculation back into "normal variables". For example the _mm_store_ps intrinsic is described in the "Intel Intrinsics Guide" as follows:
void _mm_store_ps (float* mem_addr, __m128 a)
Store 128-bits (composed of 4 packed single-precision (32-bit)
floating-point elements) from a into memory. mem_addr must be aligned
on a 16-byte boundary or a general-protection exception may be
generated.
The first argument is a pointer to a float, which is only 32 bits wide. But the description states that the intrinsic will copy 128 bits from a into the target mem_addr.
Does mem_addr need to be an array of 4 floats?
How can I access only a specific 32-bit element in a and store it in a single float?
What am I missing conceptually?
Are there better options than the _mm_store_ps intrinsic?
Here is a simple struct where doSomething() adds 1 to x/y of the struct. What's missing is the part on how to store the result back into x/y: only the lower two 32-bit elements (0 and 1) are used, while elements 2 and 3 are unused.
struct vec2 {
    union {
        struct { float data[2]; };
        struct { float x, y; };
    };

    void doSomething() {
        __m128 v1 = _mm_setr_ps(x, y, 0, 0);
        __m128 v2 = _mm_setr_ps(1, 1, 0, 0);
        __m128 result = _mm_add_ps(v1, v2);
        // ?? How to store results in x,y ??
    }
};

It's a 128-bit load or store, so yes the arg is like float mem[4]. Remember that in C, passing an array to a function / intrinsic is the same as passing a pointer.
Intel's intrinsics are somewhat special because they don't follow the normal strict-aliasing rules, at least for the integer versions. (e.g. _mm_loadu_si128((const __m128i*)some_pointer) doesn't violate strict-aliasing even if it's a pointer to long.) I think the same applies to the float/double load/store intrinsics, so you can safely use them to load/store from/to whatever you want. Usually you'd use _mm_load_ps to load single-precision FP bit patterns, though, and usually you'd keep those in C objects of type float.
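For example, a minimal sketch of a full 128-bit store into a plain float array (C++11 alignas; names are just for illustration):
#include <xmmintrin.h>

float store_demo()
{
    alignas(16) float out[4];   // _mm_store_ps requires 16-byte alignment
    __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    _mm_store_ps(out, v);       // out[0..3] now hold 1, 2, 3, 4
    // _mm_storeu_ps(out, v) would work without the alignment requirement
    return out[2];              // ordinary scalar access afterwards: 3.0f
}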
How can I access only a specific 32-bit element in a and store it in a single float?
Use a vector shuffle and then _mm_cvtss_f32 to cast the vector to scalar.
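For instance, a minimal sketch that extracts element 2 (the shuffle immediate must be a compile-time constant):
#include <xmmintrin.h>

float get_element2(__m128 a)
{
    // broadcast element 2 into the low position (and everywhere else)
    __m128 shuf = _mm_shuffle_ps(a, a, _MM_SHUFFLE(2, 2, 2, 2));
    return _mm_cvtss_f32(shuf);  // the vector-to-scalar "cast" costs zero instructions
}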
loading / storing 64 bits
Ideally you could operate on 2 vectors at once packed together, or an array of X values and an array of Y values, so with a pair of vectors you'd have the X and Y values for 4 pairs of XY coordinates. (struct-of-arrays instead of array-of-structs).
But you can express what you're trying to do efficiently like this:
struct vec2 {
    float x, y;
};

void foo(const struct vec2 *in, struct vec2 *out) {
    __m128d tmp = _mm_load_sd((const double*)in);  // 64-bit zero-extending load with MOVSD
    __m128 inv = _mm_castpd_ps(tmp);               // keep the compiler happy
    __m128 result = _mm_add_ps(inv, _mm_setr_ps(1, 1, 0, 0));
    _mm_storel_pi((__m64*)out, result);            // 64-bit store of the low half (MOVLPS)
}
GCC 8.2 compiles it like this (on Godbolt), for x86-64 System V, strangely using movq instead of movsd for the load. gcc 6.3 uses movsd.
foo(vec2 const*, vec2*):
    movq    xmm0, QWORD PTR [rdi]         # 64-bit integer load
    addps   xmm0, XMMWORD PTR .LC0[rip]   # packed 128-bit float add
    movlps  QWORD PTR [rsi], xmm0         # 64-bit store
    ret
For a 64-bit store of the low half of a vector (2 floats or 1 double), you can use _mm_store_sd. Or better _mm_storel_pi (movlps). Unfortunately the intrinsic for it wants a __m64* arg instead of float*, but that's just a design quirk of Intel's intrinsics. They often require type casting.
Notice that instead of _mm_setr, I used _mm_load_sd((const double*)&(in->x)) to do a 64-bit load that zero-extends to a 128-bit vector. You don't want a movlps load because that merges into an existing vector. That would create a false dependency on whatever value was there before, and costs an extra ALU uop.

Related

Is there a better way to detect any bits that are set in a 16-byte array of flags?

ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };

// Code detects noise and sets noise_frame_flags omitted

__m128i xmm0 = _mm_load_si128((__m128i*)noise_frame_flags);
bool isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);

if (isNoiseToCancel)
    cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);
This is a code snippet from my AV Capture tool on Linux. noise_frame_flags here is an array of flags for 16-channel audio. For each channel, the corresponding byte can be either 0 or 1, with 1 indicating that the channel has some noise to cancel. For example, if noise_frame_flags[0] == 1, that means the first channel's noise flag is set (by the omitted code).
Even if a single "flag" is set I need to call cancelNoises, and this code seems to work fine for that. As you can see, I used _mm_load_si128 to load the whole array of flags (which is correctly aligned) and then two _mm_extract_epi64 to extract the "flags". My question: is there a better way to do this (using popcount, maybe)?
Note: ALIGNTO(16) is a macro that expands to the corresponding GCC attribute, but is nicer looking.
Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.
In asm you just want a mov load plus an or or add with a memory source operand, which will set ZF just like your current code does. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.
In C++, use memcpy to do strict-aliasing safe loads of uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).
add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not AMD). or can't fuse with branch instructions on any CPUs. Since your values are 0 or 1, we can't have a case of two non-zero values adding to produce a zero, which is why you'd normally use or for the general case.
(Some addressing modes may defeat micro or macro-fusion on Intel. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load + add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)
uint64_t a, b;
memcpy(&a, noise_frame_flags+0, sizeof(a)); // strict-aliasing-safe loads
memcpy(&b, noise_frame_flags+8, sizeof(b)); // which optimize to MOV qword
bool isNoiseToCancel = a + b; // equivalent to a | b for bool inputs
This should compile to 3 asm instructions which will decode to 2 uops total, or 3 on AMD CPUs where JCC can only fuse with cmp or test.
union { alignas(16) uint8_t flags[16]; uint64_t chunks[2]; }; would be safe in C99, but not in ISO C++. Most but not all C++ compilers that support Intel intrinsics define the behaviour of union type-punning. (I think @jww has said SunCC doesn't.)
In C++11, you don't need a custom macro for ALIGNTO(16); just use alignas(16). It's also supported in C11 if you #include <stdalign.h>.
Alternatives:
movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones or all-zeros constant in a register, ptest xmm0, [mem] would work to save a uop on Intel (depending on addressing mode), but that's still 3 total.
PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Surprisingly, vptest ymm only requires AVX1). Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz. See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM depending on how good their horizontal OR support is.
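As a sketch, the SSE4.1 version of that ptest alternative via intrinsics could look like this (_mm_testz_si128 returns 1 when the AND of its operands is all-zero, i.e. it sets ZF like ptest):
__m128i v = _mm_load_si128((const __m128i*)noise_frame_flags);
if (!_mm_testz_si128(v, v))   // any non-zero byte anywhere in the vector?
    cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);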
popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.
__m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
vflags = _mm_slli_epi32(vflags, 7);
unsigned flagmask = _mm_movemask_epi8(vflags);
if (flagmask) {
    unsigned flagcount = __builtin_popcount(flagmask); // popcnt with -march=nehalem or higher
    unsigned first_setflag = __builtin_ctz(flagmask);  // tzcnt if available, else BSF
    flagmask &= flagmask - 1; // clear lowest set bit: BLSR if compiled with -march=haswell or bdver2 or newer
    ...
}
(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but generally good to enable all ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
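To make the ... above concrete, here is a sketch of a scan loop over the bitmap using that clear-lowest-set-bit idiom (handle_channel is a hypothetical per-channel handler):
while (flagmask) {
    unsigned channel = __builtin_ctz(flagmask); // index of the lowest set flag
    handle_channel(channel);                    // hypothetical: do per-channel work
    flagmask &= flagmask - 1;                   // clear that flag (BLSR if available)
}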
Here's what I came up with for doing this:
#define VLEN 8
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
// Constants for 128 or 256 bit registers
#if VLEN == 8
#define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
#else
#define V(a,b,c,d,e,f,g,h) a,b,c,d
#endif
#define SWAP128 V(4,5,6,7, 0,1,2,3)
#define SWAP64 V(2,3, 0,1, 6,7, 4,5)
#define SWAP32 V(1, 0, 3, 2, 5, 4, 7, 6)
static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x, x, SWAP128);
    x |= __builtin_shufflevector(x, x, SWAP64);
    x |= __builtin_shufflevector(x, x, SWAP32);
    return x[0];
}
With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers.
This should compile to a single vptest instruction.
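A hedged usage sketch for the 16-byte flag array from the question, with VLEN changed to 4 so vNb is exactly 16 bytes (memcpy, from <string.h>, keeps the load strict-aliasing safe):
vNb x;                                    // 4 ints == 16 bytes when VLEN == 4
memcpy(&x, noise_frame_flags, sizeof(x));
if (any(x))
    cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);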

Extract the low bit of each bool byte in a __m128i? (bool array to packed bitmap)

(Editor's note: this question was originally: How should one access the m128i_i8 member, or members in general, of the __m128i object?, trying to use an MSVC-specific method on GCC's definition of __m128i. But this was an XY problem and the accepted answer is about the XY problem here. Another answer does answer this question.)
I realize that Microsoft suggests against directly accessing the members of these objects, but I need to set them and the documentation is sorely lacking.
I continue getting the error "request for member 'm128i_i8' in '(my var name)', which is of non-class type 'wirelabel {aka __vector(2) long long int}'", which I don't understand because I've included all the correct headers and it does recognize __m128i variables.
Note1: wirelabel is a typedef for __m128i, i.e. a header contains:
typedef __m128i wirelabel;
Note2: The reason Note1 was used is explained in the following other question:
tbb::cache_aligned_allocator: Getting "request for member...which is of non-class type" with __m128i. User error or bug?
Note3: I'm using the compiler g++
Note4: The following question doesn't answer mine but does discuss related information: Why should you not access the __m128i fields directly?
I also know that there is a _mm_set_epi8 function, but it requires you to set all sixteen 8-bit elements at once, and that is not an option for me currently.
The question the accepted answer answers:
Edit: I was asked for more specifics as to why I think I need to access each of the 16 8-bit parts of the __m128i object, and here is why: I have a bool array with size 'n*128' (n is a size_t) and I need to store these within an array of 'wirelabel' with size 'n'.
Now because wirelabel is just an alias/typedef (correct me if there is a difference) for __m128i, each of the 'n' indices of 128 bools can be stored in the 'wirelabel' array.
However, in order to do this I believe I need to convert every 8 bits into its signed equivalent and store it at the correct 8-bit index in each 'wirelabel' in the array.
So your source data is contiguous? You should use _mm_load_si128 instead of messing around with scalar components of vector types.
Your real problem is packing an array of bool (1 byte per element in the ABI used by g++ on x86) into a bitmap. You should do this with SIMD, not with scalar code to set 1 bit or byte at a time.
pmovmskb (_mm_movemask_epi8) is fantastic for extracting one bit per byte of input. You just need to arrange to get the bit you want into the high bit.
The obvious choice would be a shift, but vector shift instructions compete for the same execution port as pmovmskb on Haswell (port 0). (http://agner.org/optimize/). Instead, adding 0x7F will produce 0x80 (high bit set) for an input of 1, but 0x7F (high bit clear) for an input of 0. (And a bool in the x86-64 System V ABI must be stored in memory as an integer 0 or 1, not simply 0 vs. any non-zero value).
Why not pcmpeqb against _mm_set1_epi8(1)? Skylake runs pcmpeqb on ports 0/1, but paddb on all 3 vector ALU ports (0/1/5). It's very common to use pmovmskb on the result of pcmpeqb/w/d/q, though.
#include <immintrin.h>
#include <stdint.h>

// n is the number of uint16_t dst elements
// We access n*16 bool elements from src.
void pack_bools(uint16_t *dst, const bool *src, size_t n)
{
    // you can later access dst with __m128i loads/stores
    __m128i carry_to_highbit = _mm_set1_epi8(0x7F);
    for (size_t i = 0; i < n; i += 1) {
        __m128i boolvec = _mm_loadu_si128((__m128i*)&src[i*16]);
        __m128i highbits = _mm_add_epi8(boolvec, carry_to_highbit);
        dst[i] = _mm_movemask_epi8(highbits);
    }
}
Because we want to use scalar stores when writing this bitmap, we want dst to be in uint16_t for strict-aliasing reasons. With AVX2, you'd want uint32_t. (Or if you did combine = tmp1 << 16 | tmp to combine two pmovmskb results. But probably don't do that.)
This compiles into an asm loop like this (with gcc7.3 -O3, on the Godbolt compiler explorer):
.L3:
    movdqu   xmm0, XMMWORD PTR [rsi]
    add      rsi, 16
    add      rdi, 2
    paddb    xmm0, xmm1
    pmovmskb eax, xmm0
    mov      WORD PTR [rdi-2], ax
    cmp      rdx, rsi
    jne      .L3
So it's not wonderful (7 fused-domain uops -> front-end bottleneck at 16 bools per ~1.75 clock cycles). Clang unrolls by 2, and should manage 16 bools per 1.5 cycles.
Using a shift (pslld xmm0, 7) would only run at one iteration per 2 cycles on Haswell, bottlenecked on port 0.
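For completeness, a hedged sketch of the AVX2 variant mentioned above (uint32_t chunks, same 0x7F carry trick; the function name is illustrative):
#include <immintrin.h>
#include <stdint.h>

// n is the number of uint32_t dst elements; we access n*32 bool elements from src.
void pack_bools_avx2(uint32_t *dst, const bool *src, size_t n)
{
    __m256i carry_to_highbit = _mm256_set1_epi8(0x7F);
    for (size_t i = 0; i < n; i += 1) {
        __m256i boolvec = _mm256_loadu_si256((const __m256i*)&src[i*32]);
        __m256i highbits = _mm256_add_epi8(boolvec, carry_to_highbit);
        dst[i] = _mm256_movemask_epi8(highbits); // one bitmap bit per input byte
    }
}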
Create an anonymous union containing a __m128i member and an array of the other type whose members you want to set. Type-punning through a union is legal in C, and supported as an extension in g++, clang++ and MSVC. If you want to set individual bits, you can declare the other member as a struct of bitfields. The order of bitfields is implementation-defined, but you're using an Intel intrinsic anyway, so it'll be little-endian.
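A minimal sketch of that suggestion (the union and member names are illustrative; this relies on the union type-punning extension discussed above):
#include <emmintrin.h>
#include <stdint.h>

union wl_bytes {
    __m128i vec;
    int8_t i8[16];       // one byte per vector element
};

__m128i set_byte3(__m128i v)
{
    union wl_bytes u;
    u.vec = v;
    u.i8[3] = 1;         // write just element 3
    return u.vec;
}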

About returning more than one value in C/C++/Assembly

I have read some questions about returning more than one value such as What is the reason behind having only one return value in C++ and Java?, Returning multiple values from a C++ function and Why do most programming languages only support returning a single value from a function?.
I agree with most of the arguments used to show that more than one return value is not strictly necessary, and I understand why such a feature hasn't been implemented, but I still can't understand why we can't use multiple caller-saved registers such as ECX and EDX to return such values.
Wouldn't it be faster to use the registers instead of creating a Class/Struct to store those values or passing arguments by reference/pointers, both of which use memory to store them? If it is possible to do such thing, does any C/C++ compiler use this feature to speed up the code?
Edit:
Ideal code would look like this:
(int, int) getTwoValues(void) { return 1, 2; }

int main(int argc, char** argv)
{
    // a and b are actually returned in registers
    // so future operations with a and b are faster
    (int a, int b) = getTwoValues();
    // do something with a and b
    return 0;
}
Yes, this is sometimes done. If you read the Wikipedia page on x86 calling conventions under cdecl:
There are some variations in the interpretation of cdecl, particularly in how to return values. As a result, x86 programs compiled for different operating system platforms and/or by different compilers can be incompatible, even if they both use the "cdecl" convention and do not call out to the underlying environment. Some compilers return simple data structures with a length of 2 registers or less in the register pair EAX:EDX, and larger structures and class objects requiring special treatment by the exception handler (e.g., a defined constructor, destructor, or assignment) are returned in memory. To pass "in memory", the caller allocates memory and passes a pointer to it as a hidden first parameter; the callee populates the memory and returns the pointer, popping the hidden pointer when returning.
(emphasis mine)
Ultimately, it comes down to calling convention. It's possible for your compiler to optimize your code to use whatever registers it wants, but when your code interacts with other code (like the operating system), it needs to follow the standard calling conventions, which typically use one register for return values.
Returning on the stack isn't necessarily slow, because once the values are available in L1 cache (which the top of the stack usually is), accessing them will be very fast.
However, in most computer architectures there are at least 2 registers for returning values that are twice (or more) as wide as the word size (edx:eax in x86, rdx:rax in x86_64, $v0 and $v1 in MIPS (Why MIPS assembler has more that one register for return value?), R0:R3 in ARM [1], X0:X7 in ARM64...). The architectures that don't are mostly microcontrollers with only one accumulator or a very limited number of registers.
[1] "If the type of value returned is too large to fit in r0 to r3, or whose size cannot be determined statically at compile time, then the caller must allocate space for that value at run time, and pass a pointer to that space in r0."
These registers can also be used for directly returning small structs that fit in 2 registers or fewer (or more, depending on architecture and ABI).
For example, with the following code:
struct Point
{
    int x, y;
};

struct shortPoint
{
    short x, y;
};

struct Point3D
{
    int x, y, z;
};

Point P1()
{
    Point p;
    p.x = 1;
    p.y = 2;
    return p;
}

Point P2()
{
    Point p;
    p.x = 1;
    p.y = 0;
    return p;
}

shortPoint P3()
{
    shortPoint p;
    p.x = 1;
    p.y = 0;
    return p;
}

Point3D P4()
{
    Point3D p;
    p.x = 1;
    p.y = 2;
    p.z = 3;
    return p;
}
Clang emits the following instructions for x86_64, as you can see on Godbolt:
P1():                           # @P1()
    movabs  rax, 8589934593
    ret
P2():                           # @P2()
    mov     eax, 1
    ret
P3():                           # @P3()
    mov     eax, 1
    ret
P4():                           # @P4()
    movabs  rax, 8589934593
    mov     edx, 3
    ret
For ARM64:
P1():
    mov     x0, 1
    orr     x0, x0, 8589934592
    ret
P2():
    mov     x0, 1
    ret
P3():
    mov     w0, 1
    ret
P4():
    mov     x1, 1
    mov     x0, 0
    sub     sp, sp, #16
    bfi     x0, x1, 0, 32
    mov     x1, 2
    bfi     x0, x1, 32, 32
    add     sp, sp, 16
    mov     x1, 3
    ret
As you can see, no stores to the stack are involved. You can switch to other compilers to see that the values are mostly returned in registers.
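On the caller side, C++17 structured bindings give the multi-value syntax the question wishes for, and the values still arrive in registers (a sketch reusing P1 from above):
Point p = P1();      // whole struct returned in RAX (x86-64) / X0 (ARM64)
auto [a, b] = P1();  // C++17 structured bindings unpack the same register return
// a == 1, b == 2; at -O2 no memory round-trip is normally needed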
Return data is put on the stack. Returning a struct by copy is literally the same thing as returning multiple values, in that all its data members are put on the stack. If you want multiple return values, that is the simplest way; I know that's exactly how Lua handles it, it just wraps them in a struct. Why was it never implemented? Probably because you can already do it with a struct, so why implement a different method? As for C++, it does support multiple return values in the form of a special class (tuples), which is really the same way Java handles them as well. So in the end it's all the same: either you copy the data raw (a non-pointer/non-reference struct/object) or you copy a pointer to a collection that stores multiple values.

How to write int64=int32*int32 in a standard/portable and efficient way?

Related:
Is this treatment of int64_t a GCC AND Clang bug?
The only solution I can think of is to explicitly convert one of the operands to int64, forcing the product to also be at least int64.
But done that way, it's up to the compiler whether it actually does int64*int32 or int64*int64, or, ideally, optimizes it back to a widening int32*int32.
As discussed in the related question, assigning the result of int32*int32 to an int64 doesn't change the fact that the int32*int32 multiplication itself can already overflow, which is UB.
Any thoughts?
You've already indicated how to do this in a standard, portable, and efficient way:
int64_t mul(int32_t x, int32_t y) {
    return (int64_t)x * y;
    // or static_cast<int64_t>(x) * y if you prefer not to use C-style casts
    // or static_cast<int64_t>(x) * static_cast<int64_t>(y) if you don't want
    // the integral conversion to remain implicit
}
Your question seems to be about a hypothetical architecture that has assembly instructions corresponding to the function signatures
int64_t intrinsic_mul(int32_t x, int32_t y);
int64_t intrinsic_mul(int64_t x, int64_t y);
int64_t intrinsic_mul(int64_t x, int32_t y); // and maybe this too
and, on this hypothetical architecture, the first of these has relevant advantages, and furthermore, your compiler fails to use this instruction when compiling the function above, and on top of all that, it fails to provide access to the above intrinsic.
I expect such a scenario to be really rare, but if you truly find yourself in such a situation, most compilers also allow you to write inline assembly, so you can write a function that invokes this special instruction directly, and still provides enough metadata so the optimizer can make somewhat efficient use of it (e.g. using symbolic input and output registers so the optimizer can use whichever registers it wants, rather than having the register choice hardcoded).
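As a hedged, x86-specific illustration of that inline-asm escape hatch (GNU extended asm; the one-operand imul already produces a 64-bit product in EDX:EAX from 32-bit inputs, and the symbolic operands let the optimizer choose where y lives):
int64_t mul_widen_asm(int32_t x, int32_t y)
{
    int32_t lo, hi;
    asm("imull %[y]"             // EDX:EAX = EAX * y (signed widening multiply)
        : "=a"(lo), "=d"(hi)
        : "0"(x), [y] "rm"(y)
        : "cc");
    return ((int64_t)hi << 32) | (uint32_t)lo;
}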
Built-in arithmetic expressions only exist for homogeneous operand types. Any expression involving mixed types implies integral conversions, and the arithmetic operation itself is only ever defined for, and applied to, homogeneous types.
Choose either int32_t or int64_t.
As you probably understand correctly, for both choices of type the arithmetic operations (at least +, - and *) are susceptible to UB from overflow, but there can be no overflow when operating on two int64_t values that are both representable as int32_t. So, for example, the following works:
int64_t multiply(int32_t a, int32_t b)
{
    // guaranteed not to overflow, and the result value is equal
    // to the mathematical result of the operation
    return static_cast<int64_t>(a) * static_cast<int64_t>(b);
}
As an example, here is how GCC translates this to x86 and x86_64 on Linux (note the different calling conventions):
multiply(int, int):
    // x86 (32-bit, "-m32 -march=i386")    x86-64 ("-m64 -march=x86-64")
    // args are on the stack               args are in EDI, ESI
    // return in EDX:EAX                   return in RAX
    mov  eax, DWORD PTR [esp+8]            movsx rax, edi
                                           movsx rsi, esi
    imul DWORD PTR [esp+4]                 imul  rax, rsi
    ret                                    ret

Get member of __m128 by index?

I've got some code, originally given to me by someone working with MSVC, and I'm trying to get it to work on Clang. Here's the function that I'm having trouble with:
float vectorGetByIndex( __m128 V, unsigned int i )
{
    assert( i <= 3 );
    return V.m128_f32[i];
}
The error I get is as follows:
Member reference base type '__m128' is not a structure or union.
I've looked around and found that Clang (and maybe GCC) has a problem with treating __m128 as a struct or union. However, I haven't managed to find a straight answer on how to get these values back. I've tried using the subscript operator and couldn't, and I've glanced through the huge list of SSE intrinsic functions without finding an appropriate one.
A union is probably the most portable way to do this:
union U {
    __m128 v;    // SSE 4 x float vector
    float a[4];  // scalar array of 4 floats
};

float vectorGetByIndex(__m128 V, unsigned int i)
{
    U u;
    assert(i <= 3);
    u.v = V;
    return u.a[i];
}
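An alternative sketch that sidesteps union type-punning in ISO C++ entirely: store the vector to a local array and index that (the function name is illustrative; compilers generally produce comparable code):
float vectorGetByIndexViaStore(__m128 V, unsigned int i)
{
    alignas(16) float a[4];
    _mm_store_ps(a, V);   // spill all four lanes
    return a[i];          // a runtime index is fine here
}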
As a modification to hirschhornsalz's solution, if i is a compile-time constant, you could avoid the union path entirely by using a shuffle:
template<unsigned i>
float vectorGetByIndex( __m128 V)
{
    // shuffle V so that the element that you want is moved to the least-
    // significant element of the vector (V[0])
    V = _mm_shuffle_ps(V, V, _MM_SHUFFLE(i, i, i, i));
    // return the value in V[0]
    return _mm_cvtss_f32(V);
}
A scalar float is just the bottom element of an XMM register, and the upper elements are allowed to be non-zero; _mm_cvtss_f32 is free and will compile to zero instructions. This will inline as just a shufps (or nothing for i==0).
Compilers are smart enough to optimize away the shuffle for i==0 (except for long-obsolete ICC13), so there's no need for an if (i == 0) special case. https://godbolt.org/z/K154Pe. clang's shuffle optimizer will compile vectorGetByIndex<2> into movhlps xmm0, xmm0, which is 1 byte shorter than shufps and produces the same low element. You could do this manually with switch/case for other compilers, since i is a compile-time constant, but 1 byte of code size in the few places you use this while manually vectorizing is pretty trivial.
Note that SSE4.1 _mm_extract_epi32(V, i); is not a useful shuffle here: extractps r/m32, xmm, imm can only extract the FP bit-pattern to an integer register or memory (https://www.felixcloutier.com/x86/extractps). (And the intrinsic returns it as an int, so it would actually compile to extractps + cvtsi2ss to do int->float conversion on the FP bit-pattern, unless you type-pun it in your C++ code. But then you'd expect it to compile to extractps eax, xmm0, i / movd xmm0, eax which is terrible vs. shufps.)
The only case where extractps would be useful is if the compiler wanted to store this result straight to memory, and fold the store into the extract instruction. (For i!=0, otherwise it would use movss). To leave the result in an XMM register as a scalar float, shufps is good.
(SSE4.1 insertps would be usable but unnecessary: it makes it possible to zero other elements while taking an arbitrary source element.)
Use
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    union {
        __m128 v;
        float a[4];
    } converter;
    converter.v = V;
    return converter.a[i];
}
which will work regardless of the available instruction set.
Note: Even if SSE4.1 is available and i is a compile-time constant, you can't use pextrd etc. this way, because these instructions extract a 32-bit integer, not a float:
// broken code starts here
template<unsigned i>
float vectorGetByIndex( __m128 V) {
    return _mm_extract_epi32(V, i);
}
// broken code ends here
I'm not deleting it because it is a useful reminder of how not to do things.
The way I use is
union vec { __m128 sse; float f[4]; };

float accessmember(__m128 v, int index)
{
    vec u;
    u.sse = v;          // reinterpret the vector through the union
    return u.f[index];
}
Seems to work out pretty well for me.
Late to this party, but I found that this works for me in MSVC, where z is a variable of type __m128.
#define _mm_extract_f32(v, i) _mm_cvtss_f32(_mm_shuffle_ps(v, v, i))
__m128 z = _mm_setr_ps(1.0, 2.0, 3.0, 4.0);
float f = _mm_extract_f32(z, 2);
Or even simpler (MSVC only, since it accesses the m128_f32 member directly):
__m128 z;
float f = z.m128_f32[2]; // get the 3rd float value in the vector